NIH’s Big Data to Knowledge (BD2K) initiative

This is long-awaited news from NIH: the first round of BD2K awardees has been announced (about $32M in total).

These NIH multi-institute awards constitute an initial investment of nearly $32 million in fiscal year 2014 by NIH’s Big Data to Knowledge (BD2K) initiative, which is projected to have a total investment of nearly $656 million through 2020, pending available funds.

I browsed through the awards; they are divided into four categories. The largest grant, $3M for FY 2014, goes to BioCADDIE, the Biomedical and Healthcare Data Discovery and Indexing Engine Center (by the way, for those who have never played golf like the rich, a caddie is a person who carries a golfer’s clubs and provides other assistance during a match). The other grants in that category are much smaller. One that caught my eye is about immunosequencing: Computational Tools for the Analysis of High-Throughput Immunoglobulin Sequencing.

Eleven grants are for the Centers of Excellence for Big Data Computing. Among those is the Center of Excellence for Mobile Sensor Data to Knowledge. My favorite is the Center for Big Data in Translational Genomics, awarded to UCSC with PI David Haussler, developer of the UCSC Genome Browser, one of the most popular tools in genomics.

This one, A Framework for Integrating Multiple Data Sources for Modeling and Forecasting of Infectious Diseases, sounds very timely considering the current Ebola outbreak, but it has a ridiculously small budget of $100K (FY 2014), which would basically fund the training of just one PI (this is a K01 grant).

Another group of grants concerns education. Two grants, for Big Data educational efforts and for the development of bioinformatics MOOCs, went to UCSD. Among the other awardees for MOOCs are Johns Hopkins University, for developing courses in neuroimaging and genomics; Harvard University, for a modular online education program that brings together concepts from Statistics, Computer Science and Software Engineering; and initiatives from the Oregon Health and Science University, UCLA, and others. MOOCs are becoming big business, and soon there will be courses in various flavors.

Ebola: surprisingly Little immune data

http://www.nbcnews.com – A Transmission Electron Micrograph (TEM) of the Ebola virus RNA

Ebola is clearly an under-studied disease, as no vaccine or treatment exists. We at IEDB, in turn, were also surprised to see how little data on Ebola epitopes had been reported to date. It took us just a couple of days to report on the available data and a bit longer to get some attention (well, I am not good at publicity: I am writing this a month after we published the first report, while Ebola is not going away, and we are working on a larger publication).

To summarize, the EBOV genome encodes seven proteins: envelope glycoprotein (GP), nucleoprotein (NP), polymerase cofactor (VP35), transcription activator (VP30), matrix protein (VP40), secondary matrix protein (VP24), and RNA polymerase (L). There are 44 T cell and 15 B cell epitopes reported for GP (the surface protein and the most important target for antibodies) and 10 T cell and 20 B cell epitopes for NP, with far fewer epitopes reported for the other proteins. Unexpectedly, human and non-human primate hosts were absent from the T and B cell assays; murine hosts predominated, and a smaller number of rabbit studies were reported, only for antibodies.

What was also surprising to us is that, of the monoclonal antibodies (mAbs) described to date, only one (mAb KZ52) was derived from a human survivor of Ebola disease, from the 1995 Kikwit outbreak. Only nine mAbs were shown to be protective in mice (just recently, a new study using macaques was published in Nature) or in in vitro virus neutralization assays. Six of these mAbs make up the three cocktails, ZMab, ZMapp, and MB-003, that have been evaluated as treatment options for Ebola disease, and ZMapp presumably contributed to the successful treatment of two patients in the US. There were reported efforts to increase ZMapp production after the already mentioned publication in Nature (as the FDA may approve an Ebola therapeutic for human use if it was successfully tested in macaques).

It should be expected that the current attention to Ebola will result in the discovery of new protective human antibodies and immunodominant epitopes from survivors of the current outbreak, which can be used as therapeutics.

[When I was about to click submit, my husband asked me what this post of mine had to do with BigBioData. So I capitalized “Little” in the title. Sometimes we wish to have Big Data, but we don’t realize that acquiring them might cost lives, as it did with collecting and sequencing Ebola strains in the current outbreak.]

 

Is it time for sequencing entire antibody repertoires?

Continuing on the theme of immunosequencing, this post is about Atreca, Inc., founded in 2010 in San Carlos, California, and funded by the Bill & Melinda Gates Foundation. I first heard about this company a year ago from its co-founder, Prof. Robinson of Stanford. As he explained in his talk at Scripps, the company uses a novel high-throughput technology, called Immune Repertoire Capture™, to isolate B cells (more precisely, plasmablasts, or plasma B cells producing antibodies), to barcode and sequence their cDNA, and finally to perform bioinformatics analysis (building a tree from a comparison of antibody chain sequences). As a result, in two weeks the whole antibody repertoire of an individual is decoded and, most importantly, the pairing between the light and heavy chains of each antibody is established, because the technology barcodes each cell individually. That gives the whole makeup of an individual’s antibody repertoire, frozen in time, and makes it possible to see the rare clonal families of antibodies, immunodominant antibodies (recognizing specific antigens), and also memory B cells, which would look like single branches on the dendrogram of all antibodies (or B cells, which is the same here, as each B cell produces one type of antibody).
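To make the role of per-cell barcoding concrete, here is a minimal sketch of how heavy and light chain reads could be paired once every read carries its cell's barcode. This is not Atreca's pipeline: the function name, the read layout (barcode, chain type, sequence), and the toy sequences are all assumptions made for illustration.

```python
from collections import defaultdict

def pair_chains(reads):
    """Group (barcode, chain_type, sequence) reads into per-cell heavy/light pairs.

    Hypothetical input layout: chain_type is either "heavy" or "light".
    """
    cells = defaultdict(dict)
    for barcode, chain_type, sequence in reads:
        cells[barcode][chain_type] = sequence
    # Keep only the cells for which both chains were recovered.
    return {
        barcode: (chains["heavy"], chains["light"])
        for barcode, chains in cells.items()
        if "heavy" in chains and "light" in chains
    }

# Toy reads with made-up barcodes and made-up CDR3-like strings.
reads = [
    ("cell01", "heavy", "CARDYW"),
    ("cell02", "light", "CQQYNS"),
    ("cell01", "light", "CQQSYS"),
    ("cell03", "heavy", "CARGGW"),  # no light chain read, so it is dropped
    ("cell02", "heavy", "CAKDRW"),
]
pairs = pair_chains(reads)
for barcode, (heavy, light) in sorted(pairs.items()):
    print(barcode, heavy, light)
```

Without the barcode, the heavy and light chain reads from a bulk sample could not be matched back to the cell that produced them, which is why the pairing is the key result here.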

Applied to human disease, Immune Repertoire Capture™ is an engine for the discovery and development of antibody-based therapeutics, vaccines, diagnostics, and research reagents in therapy areas including cancer, infectious disease, and autoimmune disease.

How could it work? One example: take a cancer patient who is a long-term non-progressor and look at which unique antibodies she produces that progressors do not. Those antibodies can be studied as candidate antibody therapeutics against that form of cancer.

How is the company doing today? Unfortunately, not much information can be found in the news for the last year, except for three rounds of financing, two of which are debt financings, and a job ad on BioSpace for a Research Associate versed in PCR and NGS. The company website is dated 2012. A LinkedIn search for Atreca returns 20 profiles. The concept looks right and timely. Or is the world, overwhelmed by genomics data, not ready to deal with yet another deluge of big data?

ImmunoSequencing holds great promise for immunotherapy

Adaptive Biotechnologies is doing something that was unimaginable 4-5 years ago: adding yet another layer of big biological data on top of genomics, proteomics, microbiomics, and metabolomics data.

“Adaptive” refers to the adaptive immune system, whose major players are B and T cells, the major types of lymphocytes. The human body has about 2 trillion lymphocytes, constituting 20–40% of white blood cells and weighing about the same as the brain or liver (!). Peripheral blood contains only about 2% of the body’s lymphocytes; the rest move within the tissues and the lymphatic system. All vertebrate animals (a subphylum of Chordata), including amphibians, reptiles, mammals, and birds, have an adaptive immune system, while other living organisms do not.

Our body generates 10^8 – 10^12 B and T cells every day, in the bone marrow and thymus, respectively, each carrying a unique B or T cell receptor (BCR or TCR) made by selecting and splicing together V, D and J genes from the individual’s genome. At the junctions between the V-D and D-J segments, a varying number of nucleotides are deleted and a special enzyme inserts random nucleotides, creating a unique TCR or BCR. B and T cells circulate throughout the lymphatic system for a month or two, recognizing pathogens and, as a result, becoming memory B and T cells, so that the next time the body encounters the same pathogen it can be killed immediately. Every individual has a unique repertoire, or “makeup”, of B and T cells circulating in her body, which also reflects the individual’s current immune status and thus will differ when the person becomes sick. It is not hard to imagine the medical applications: for example, B cells produce antibodies, and thus the B cell repertoires of patients who fought HIV or cancer are of interest for finding new antibody therapeutics against HIV and cancer.
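The V-D-J splicing with random deletions and insertions described above can be illustrated with a toy simulation. This is only a sketch under loud assumptions: the gene segments below are short made-up placeholder strings (not real germline genes), and the deletion and insertion lengths are arbitrary.

```python
import random

# Placeholder segments, invented for illustration; not real germline sequences.
V_GENES = {"V1": "TGTGCCAGCAGC", "V2": "TGTGCCAGCAGT"}
D_GENES = {"D1": "GGGACAGGG", "D2": "GGGACTAGC"}
J_GENES = {"J1": "AACACTGAAGCT", "J2": "TATGGCTACACC"}

def trim(seq, n, from_end):
    """Delete n nucleotides from the 3' end (from_end=True) or the 5' end."""
    n = min(n, len(seq))
    if n == 0:
        return seq
    return seq[:-n] if from_end else seq[n:]

def random_insert(rng, max_len=6):
    """Random non-templated nucleotides added at a junction."""
    return "".join(rng.choice("ACGT") for _ in range(rng.randint(0, max_len)))

def recombine(rng, max_del=4):
    """Pick one V, D and J segment, trim the junction ends, insert random bases."""
    v_name = rng.choice(sorted(V_GENES))
    d_name = rng.choice(sorted(D_GENES))
    j_name = rng.choice(sorted(J_GENES))
    v = trim(V_GENES[v_name], rng.randint(0, max_del), from_end=True)
    d = trim(D_GENES[d_name], rng.randint(0, max_del), from_end=False)
    d = trim(d, rng.randint(0, max_del), from_end=True)
    j = trim(J_GENES[j_name], rng.randint(0, max_del), from_end=False)
    junction = v + random_insert(rng) + d + random_insert(rng) + j
    return (v_name, d_name, j_name), junction

rng = random.Random(0)
receptors = [recombine(rng) for _ in range(1000)]
unique_junctions = {junction for _, junction in receptors}
gene_combinations = len(V_GENES) * len(D_GENES) * len(J_GENES)
print(len(unique_junctions), "unique junctions from", gene_combinations, "gene combinations")
```

Even with only eight possible V-D-J combinations, the junctional deletions and insertions produce far more distinct sequences, which is the point of the mechanism.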

Adaptive Biotechnologies developed a technology, based on massively parallel sequencing of T cell receptors, that can identify 10-15 million unique TCRs in one individual. “The immunoSEQ assay utilizes a multiplex PCR strategy to amplify the CDR3 region of the T cell receptor, spanning the variable region formed by the junction of the V, D and J segments and their associated non-templated insertions. In most cases, the identity of each segment is also captured. The resulting 60 base pair nucleotide sequence may be used as an identifier or “tag” for a particular clone across different samples.” (from the company website)
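The idea of using the sequence as a “tag” for a clone across samples can be sketched as follows. This is a toy illustration, not the immunoSEQ Analyzer: the function names are hypothetical, and short labels such as TAG_A stand in for the 60-nucleotide sequences.

```python
from collections import Counter

def clone_frequencies(tags):
    """Fraction of reads carrying each clone tag within one sample."""
    counts = Counter(tags)
    total = sum(counts.values())
    return {tag: n / total for tag, n in counts.items()}

def shared_clones(sample_a, sample_b):
    """Clones seen in both samples, with their frequency in each."""
    freq_a = clone_frequencies(sample_a)
    freq_b = clone_frequencies(sample_b)
    return {tag: (freq_a[tag], freq_b[tag]) for tag in freq_a.keys() & freq_b.keys()}

# Made-up reads from the same person at two time points.
before = ["TAG_A"] * 6 + ["TAG_B"] * 3 + ["TAG_C"] * 1
after = ["TAG_A"] * 1 + ["TAG_B"] * 8 + ["TAG_D"] * 1

tracked = shared_clones(before, after)
for tag, (f_before, f_after) in sorted(tracked.items()):
    print(f"{tag}: {f_before:.0%} before, {f_after:.0%} after")
```

In this toy data the TAG_B clone expands between the two samples while TAG_A shrinks, which is the kind of change a tag makes it possible to follow.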

The company’s current focus is on a drug therapy targeting the specific TCR combinations associated with prevalent autoimmune diseases, in which T cells mistakenly attack the individual’s own proteins. Thus, in collaboration with the Benaroya Research Institute at Virginia Mason Hospital in Seattle, Adaptive is currently screening blood samples from patients with either Type 1 Diabetes or Multiple Sclerosis for public TCR sequences. Over the next 12 months, Adaptive plans to develop general diagnostics to use as clinical measurements, with an initial focus in oncology. The company already offers ClonoSEQ, a set of CLIA-standardized assays developed for highly sensitive (10^6) detection of Minimal Residual Disease in leukemia and lymphoma patients.

The company provides a full service: researchers can ship DNA and cDNA samples to Adaptive’s headquarters in Seattle, Washington, and then interpret and manipulate the resulting data on their own computers using a suite of cloud-based software tools called the immunoSEQ Analyzer.

 

Human Longevity, Inc. plans to outrun BGI

It has already been three months since Human Longevity, Inc. was officially announced. The company now has a beautifully designed website and announcements about new executive hires. The caliber of the executives suggests that the company plans to grow huge and grow fast. For example, the CIO comes from AstraZeneca, where he was Vice President, R&D IT, responsible for the global IT organization’s services, analytics and infrastructure supporting drug discovery and development, leading a global team of approximately 300 and accountable for an R&D IT budget of more than $120 million. The company is also building a computing and informatics program and facility in Singapore.

BioIT World was one of the first to write about the company’s launch (for more news check the company’s website):

In a move that would be shocking from almost anyone else, Venter declared that his brand-new company’s sheer sequencing power will be leapfrogging the world’s best-established genomic research centers, such as the Broad Institute.

The company has acquired 20 of Illumina’s latest HiSeq X Ten machines ($1 million apiece), which would allow sequencing the full genomes of 40,000 people a year. For comparison, BGI had already sequenced 57,000 individuals by the end of 2013. HLI doesn’t even need to compare itself with BGI, as it plans to rapidly scale to 100,000 human genomes a year (note that Illumina is among its investors).
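A quick back-of-the-envelope check of these numbers, as a sketch: it uses only the figures quoted above and assumes, as a simplification, that throughput scales linearly with the number of machines.

```python
import math

machines = 20
price_per_machine_usd = 1_000_000
genomes_per_year = 40_000
target_genomes_per_year = 100_000
bgi_genomes_by_end_2013 = 57_000

# Throughput implied per machine, and the hardware bill for the current fleet.
per_machine = genomes_per_year / machines
hardware_cost_usd = machines * price_per_machine_usd

# Assumption: linear scaling, so the target needs proportionally more machines.
machines_needed = math.ceil(target_genomes_per_year / per_machine)

# Years at the current rate to match BGI's cumulative total.
years_to_match_bgi = bgi_genomes_by_end_2013 / genomes_per_year

print(f"{per_machine:.0f} genomes per machine per year")
print(f"${hardware_cost_usd:,} spent on sequencers")
print(f"{machines_needed} machines needed for the target")
print(f"{years_to_match_bgi:.2f} years to match BGI's total")
```

So at the announced capacity HLI would pass BGI's cumulative count in under a year and a half, and the 100,000-genome target implies 2.5 times the current fleet.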

Human Longevity will also be characterizing at least some participants’ microbiomes, and, in partnership with Metabolon of North Carolina, their metabolomes, or the constantly changing array of small molecules present in the body. On top of that, said Venter, “we will be importing clinical records of every individual we’re sequencing,” in order to bring on board crucial phenotypic data.

The goal is to integrate this mass of data for new discoveries that can wed individuals’ own genetic variants, the composition of their bacteria, the molecules in their blood, and most importantly, their medical histories. Venter stressed that his aim is to enable predictive and preventative medicine for healthy aging, discovering early warning signs for susceptibility to chronic illnesses like cancer, Alzheimer’s, and heart disease, as well as new interventions tailored to each individual’s distinct profile. “We think this will have a huge impact on changing the cost of medicine,” added Venter.

A longer-term goal is to translate some of this information into stem cell therapies, an application that ties Human Longevity to Venter’s existing company, Synthetic Genomics.

But this year’s goal is sequencing the genomes of cancer patients in collaboration with the UCSD Moores Cancer Center.

What exactly is the company going to do with all these data? Do research and publish papers? Yes, and some of the principal scientists hired by the company got appointments at the Craig Venter Institute. Sell data and the results of analysis? Yes. “Venter and his colleagues also held out the possibility of other commercial products and properties emerging from the company’s basic research.” The company is also actively hiring, and not only computational professionals but clinical and wet lab scientists as well. Here is some more about the company’s mission from its job ads:

HLI will develop the most comprehensive gene-phenotype database in the world, with phenotype information deriving from molecular, physiologic, clinical, microbiome and longitudinal data assays. This database will be mined for biological meaningful patterns that can lead to better diagnostics, therapeutic targets and next-generation cell-replacement therapies.

 

 

23andMe doesn’t want people to wait around for 10 years for personalized medicine

Eric Topol interviews Anne E. Wojcicki, co-founder and CEO of 23andMe.

Ms. Wojcicki: [We have] 650,000 [genotyped individuals] now. We are by far the largest. It is phenomenal. When you look at some of our papers, and we say that we had 40,000 people with asthma, and 150,000 controls, our numbers are genuinely huge. My inspiration was my father. He’s a particle physicist, and they collect really big data. He used to laugh at clinical trials. He would say, “Three hundred people — what is this?” So my goal was always to get huge numbers to really understand how things work. The price point has dramatically dropped, and that has really spurred the volume.

… everything I see the Obama administration doing is pushing individuals to take more control of their health.

You can already see what the Beijing Genomics Institute is doing. It is the largest in the world. They have massive interest in getting everybody genotyped or sequenced. Saudi Arabia announced plans [to genotype] 100,000 individuals. The United Kingdom is doing 100,000; Scotland has a big program. The rest of the world is moving forward aggressively with this, but we are somewhat stuck. It’s going to happen, and overwhelmingly it is going to improve healthcare. So how do we do that?

Ms. Wojcicki: The way I run the company is to think about if I were sick with a disease; what would I want to happen? If you have a child with sarcoma, you don’t care whether Pfizer or Glaxo or Hopkins or Harvard gets the data. You just want someone to do something with the data. Rather than saying that we are going to monetize and do all of these things, I point the finger at all of the pharma companies and groups who are just sitting on frozen piles of data because they don’t want to do anything with it yet. I want everybody to start to use the data to do something good. Otherwise, for this child with sarcoma, what’s going to happen?

Dr. Topol: When you have a million or even 10 million people, and you can find these rare variants that a lot of other people can’t find, that’s an exciting opportunity. We are going to watch this and follow along with you.

Ms. Wojcicki: That is definitely the direction in which we are going.

See this post of mine about BGI.

Is Beijing Genomics Institute going to establish Chinese dominance in the genomics market?

I am well past due to post on the New Yorker’s article about B.G.I. The article is available by subscription only, so here I am citing the most interesting parts.

B.G.I., formerly called Beijing Genomics Institute, the world’s largest genetic-research center. With a hundred and seventy-eight machines to sequence the precise order of the billions of chemicals within a molecule of DNA, B.G.I. produces at least a quarter of the world’s genomic data—more than Harvard University, the National Institutes of Health, or any other scientific institution.

the company has already processed the genomes of fifty-seven thousand  (57000) people. B.G.I. also has sequenced many strains of rice, the cucumber, the chickpea, the giant panda, the Arabian camel, the yak, a chicken, and forty types of silkworm.

The company was founded on 9/9/1999 at 9:19 am in Beijing, China. It now has 4,000 employees with an average age of 26, is located in Shenzhen near the infamous Foxconn factory, and operates on a $1.58-billion loan from the China Development Bank. It runs multiple nonprofit and commercial projects, such as sequencing the DNA of 10,000 people from families with autism in the US and of a thousand obese and healthy people in Denmark. BGI’s plans include the Million Human Genome Project, the Million Plant and Animal Genomes Project, the Million Microecosystem Genomes Project, and the controversial Cognitive Genomics project, as well as millet (a very drought-tolerant crop) and cassava projects, both holding big promise for feeding China and Africa.

BGI is the biggest customer of Illumina, which has sold BGI 130 sequencers for half a million dollars each (my guess is that those would be the HiSeq 2000 and HiSeq 2500; the latest and most powerful, the HiSeq X Ten, released in 2014, costs about a million). When BGI bought Illumina’s main competitor, Complete Genomics, in 2013, Jay Flatley, Illumina’s CEO, said: “It is one thing to sell Coke and another to sell the formula for Coke. … when they bought Complete Genomics … they were allowed to … buy the formula.”

The article concludes by discussing the Cognitive Genomics project, whose goals are to select intelligent high-IQ embryos, to find a cure for Alzheimer’s, and to map the brain: “At some point … people will look back and wonder what all the fuss was about” [Chris Chang, a visiting scholar at BGI].

 

To GitHub or not to GitHub?

I have wanted to write about open source for a while. Then I saw several bioinformatics job postings listing experience with open source software among the desired skills, something I had never seen before. Then I bumped into an article in the print version of Computerworld, “Saying Yes to open source,” which noted the following: Netflix is built on open source, since its service is so low priced; Kaiser Permanente has been using GitHub since 2011.

Well, what is open source? Wikipedia says: “open source is a development model that promotes a) universal access via free license to a product’s design or blueprint, and b) universal redistribution of that design or blueprint, including subsequent improvements to it by anyone” and adds that Wikipedia is an example of applying open source outside the realm of computer software. Inside it, the best example of open source software would be Red Hat Enterprise Linux, while Unix, which I for example use on my MacBook, is not open source. RStudio is an open source IDE for the R programming language. Eclipse, one of my favorite IDEs for whatever language, is also open source, allowing anyone to edit existing plug-ins and to write new ones. A lot of commonly used bioinformatics software is open source: the Jmol and RasMol viewers, BioPython, BioJava, Bioconductor (and other R packages in CRAN), Galaxy (a web-based workflow system for the analysis of genomics data and the integration of data from the UCSC Genome Browser and user data), and many others.

On the other side of this free open source paradise are code hosting web repositories. And while many of them have been set up in the last 15 years, only the oldest, SourceForge (which hosts more than 430,000 projects and has more than 3.7 million registered users, according to Wikipedia), and the youngest, GitHub (which hosts over 11.7 million repositories, making it the largest code host in the world), are the two everyone can name on the spot.

GitHub seems to be the one to go with, for now. I first encountered GitHub last summer when my summer high school students refused to use SVN and sent me a link to GitHub. Now they have something to show to the world :). We haven’t embraced GitHub at work yet, though we are considering using it to show off code (it is not free for private development). According to some posts, though, it is worthwhile to pay and transition to GitHub for private development as well.

The platform of tomorrow is the Web

While some still argue about whether the Chromebook is a legitimate laptop, Auberge Resorts (a chain of seven luxury hotels) is migrating to Chromebooks running Google’s cloud-based services, reports Computerworld. It is a bold move, considering that Chrome OS-based laptops accounted for just 1% of the world PC market in 2013. The main reason? The price: a Chromebook can be bought for as little as $200, a great deal for those who are mostly MS Office users. And Microsoft is going to offer its free Office Online apps (Word, Excel, PowerPoint and OneNote) in Google’s Chrome Web Store.

Companies to watch: DNAnexus

This is an excerpt from the job ad at DNAnexus:

At DNAnexus we are solving the most challenging computer science problems you’re ever likely to see. 
In the last few years, there has been a dramatic development in the world of genomics that has created a huge new opportunity. The price to sequence the full human genome (all of your DNA, not just a sample of it) has fallen to the point were it will soon be affordable for a patient to have multiple samples of their whole genome sequenced to help treat their disease. Want to know what specific gene mutation caused a patient’s cancer? We are building the platform to answer that kind of question. One of the many challenges is the huge amount of data. Think you’ve seen big-data problems? Think again – with each genome comprising 100 GB and months of CPU time to crunch the information, DNA is the next big-data problem, requiring exabytes of storage and parallel workloads distributed across 100,000 servers. We are tackling this by combining web technologies, big-data analytics, and scalable systems on cloud computing infrastructure. 

We are a well-funded start-up backed by Google Ventures, TPG Biotech, and First Round capital. Our founders, Andreas Sundquist, Arend Sidow, Serafim Batzoglou are world-renowned genomics and bioinformatics experts from Stanford University.

We are looking for smart motivated people. Ideal candidates will likely know several of the following technologies: C, C++, Boost, Ruby, Rails, HAML, HTML, CSS, JavaScript, ECMAScript, V8, jQuery, Flash, Flex/ActionScript, Node.js, Python, Perl, PHP, Amazon Web Services (AWS), SQL, MySQL, PostgreSQL, MongoDB, Solr, Sphinx, Hadoop, MapReduce, ZooKeeper, Hive, Pig, Oozie, HDFS, ZFS, MCache. 
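To put the storage figures quoted in the ad into perspective, here is a small sketch of the arithmetic. It assumes decimal units (1 GB = 10^9 bytes, 1 EB = 10^18 bytes) and takes the ad's 100 GB per genome at face value; the one-million-genome cohort is a hypothetical size chosen only for illustration.

```python
BYTES_PER_GB = 10**9
BYTES_PER_PB = 10**15
BYTES_PER_EB = 10**18

genome_bytes = 100 * BYTES_PER_GB  # figure quoted in the ad

# How many 100 GB genomes fit into one exabyte.
genomes_per_exabyte = BYTES_PER_EB // genome_bytes

# Hypothetical cohort size, not taken from the ad.
cohort_size = 1_000_000
cohort_petabytes = cohort_size * genome_bytes / BYTES_PER_PB

print(f"{genomes_per_exabyte:,} genomes per exabyte")
print(f"{cohort_petabytes:.0f} PB for {cohort_size:,} genomes")
```

In other words, exabyte-scale storage only becomes necessary at around ten million genomes of this size, or fewer if each patient has multiple samples sequenced.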

And some more about the company from the company’s website:

No proposal is considered too large, as demonstrated through the CHARGE project, a collaboration between Baylor College of Medicine’s Human Genome Sequencing Center (HGSC), DNAnexus, and Amazon Web Services that enabled the largest genomic analysis project to have ever taken place in the cloud. We worked with HGSC to deploy its Mercury variant-calling pipeline on the DNAnexus cloud platform, processing 3,751 whole genomes and 10,940 exomes using 3.3 million core-hours with a peak of 20,800 cores and making Mercury available to 300 researchers participating in the global CHARGE consortium.
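The CHARGE figures allow a rough sanity check, sketched below. The averages are crude: they lump whole genomes and exomes together, and the wall-clock estimate assumes (unrealistically) that the peak core count was sustained for the whole run, so it is only a lower bound.

```python
core_hours = 3_300_000
peak_cores = 20_800
whole_genomes = 3_751
exomes = 10_940

samples = whole_genomes + exomes

# Crude average; exomes are much cheaper to process than whole genomes.
core_hours_per_sample = core_hours / samples

# Lower bound on wall-clock time, assuming the peak was held throughout.
min_wall_clock_hours = core_hours / peak_cores
min_wall_clock_days = min_wall_clock_hours / 24

print(f"{samples:,} samples")
print(f"{core_hours_per_sample:.0f} core-hours per sample on average")
print(f"at least {min_wall_clock_days:.1f} days of wall-clock time")
```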

Here is the service model and pricing DNAnexus provides, effective May 2014:

[Image: DNAnexus service model and pricing, May 2014]