NIH’s Big Data to Knowledge (BD2K) initiative


This is long-awaited news from NIH: the first round of BD2K awardees has been announced (a sum of $32M).

These NIH multi-institute awards constitute an initial investment of nearly $32 million in fiscal year 2014 by NIH’s Big Data to Knowledge (BD2K) initiative, which is projected to have a total investment of nearly $656 million through 2020, pending available funds.

I browsed through the awards; they are divided into four categories. The largest grant, $3M for FY 2014, is bioCADDIE, the Biomedical and Healthcare Data Discovery and Indexing Engine center (by the way, for those who have never played golf, a caddie is a person who carries a golfer’s clubs and provides other assistance during a match). The other grants in that category are much smaller. One that caught my eye is about immunosequencing: Computational Tools for the Analysis of High-Throughput Immunoglobulin Sequencing.

Eleven grants are for Centers of Excellence for Big Data Computing. Among them is the Center of Excellence for Mobile Sensor Data to Knowledge. My favorite is the Center for Big Data in Translational Genomics, awarded to UCSC’s PI David Haussler, developer of the UCSC Genome Browser, one of the most popular tools in genomics.

This one, A Framework for Integrating Multiple Data Sources for Modeling and Forecasting of Infectious Diseases, sounds very timely considering the current Ebola outbreak, but it has a ridiculously small budget of $100K (FY 2014), which, as a K01 grant, basically funds just one investigator’s training.

Another group of grants concerns education. Two, for Big Data educational efforts and the development of bioinformatics MOOC courses, went to UCSD. Among the other MOOC awardees are Johns Hopkins University, for developing courses in neuroimaging and genomics; Harvard University, for a modular online education program that brings together concepts from statistics, computer science, and software engineering; and initiatives from Oregon Health and Science University, UCLA, and others. MOOCs are becoming a Big business, and soon there will be courses in various flavors.

Ebola: surprisingly Little immune data

A Transmission Electron Micrograph (TEM) of the Ebola virus RNA

Ebola is clearly an under-studied disease: no vaccine or treatment exists. We at the IEDB, in turn, were also surprised to see how little data on Ebola epitopes has been reported to date. It took us just a couple of days to report on the available data and a bit longer to get attention (well, I am not good at publicity: I am writing this a month after we published the first report, Ebola is not going away, and we are working on a larger publication).

To summarize, the EBOV genome encodes seven proteins: envelope glycoprotein (GP), nucleoprotein (NP), polymerase cofactor (VP35), transcription activator (VP30), matrix protein (VP40), secondary matrix protein (VP24), and RNA polymerase (L). To date, 44 T cell and 15 B cell epitopes have been reported for GP (the surface protein and the most important target for antibodies) and 10 T cell and 20 B cell epitopes for NP, with far fewer epitopes reported for the other proteins. Unexpectedly, human and non-human primate hosts were absent from the T and B cell assays; murine hosts predominated, and a smaller number of rabbit studies was reported only for antibodies.

What was also surprising to us is that of the monoclonal antibodies (mAbs) described to date, only one, mAb KZ52, was derived from a human survivor of Ebola disease, from the 1995 Kikwit outbreak. Only nine mAbs have been shown to be protective in mice (just recently, a new study using macaques was published in Nature) or in in vitro virus neutralization assays. Six of these mAbs make up the three cocktails, ZMab, ZMapp, and MB-003, that have been evaluated as treatment options for Ebola disease, and ZMapp presumably contributed to the successful treatment of two patients in the US. Efforts to scale up ZMapp production have been reported after the already mentioned Nature publication (the FDA can clear an Ebola therapeutic for human use if it has been successfully tested in macaques).

It is to be expected that the current attention to Ebola will result in the discovery, from survivors of the current outbreak, of new protective human antibodies and immunodominant epitopes that can be used as therapeutics.

[When I was about to click submit, my husband asked me what this post had to do with BigBioData. So I capitalized “Little” in the title — sometimes we wish to have Big Data, but we don’t realize that acquiring them might cost lives, as it did with collecting and sequencing Ebola strains in the current outbreak.]


Human Longevity, Inc. plans to overtake BGI

It has already been three months since Human Longevity, Inc. was officially announced. The company now has a beautifully designed website and announcements about new executive hires. The caliber of the executives suggests that the company plans to grow huge and grow fast: its CIO comes from AstraZeneca, where he was Vice President of R&D IT, responsible for the global IT organization’s services, analytics, and infrastructure supporting drug discovery and development, leading a global team of approximately 300 and accountable for an R&D IT budget of more than $120 million. The company is also building a computing and informatics program and facility in Singapore.

BioIT World was one of the first to write about the company’s launch (for more news check the company’s website):

In a move that would be shocking from almost anyone else, Venter declared that his brand-new company’s sheer sequencing power will be leapfrogging the world’s best-established genomic research centers, such as the Broad Institute.

The company has acquired 20 of Illumina’s latest HiSeq X Ten machines ($1 million apiece), which would allow it to sequence the full genomes of 40,000 people a year. For comparison, BGI by the end of 2013 had already sequenced 57,000 individuals. HLI doesn’t even need to compare itself with BGI, as it plans to rapidly scale to 100,000 human genomes a year (unsurprising, considering that Illumina is among its investors).

Human Longevity will also be characterizing at least some participants’ microbiomes, and, in partnership with Metabolon of North Carolina, their metabolomes, or the constantly changing array of small molecules present in the body. On top of that, said Venter, “we will be importing clinical records of every individual we’re sequencing,” in order to bring on board crucial phenotypic data.

The goal is to integrate this mass of data for new discoveries that can wed individuals’ own genetic variants, the composition of their bacteria, the molecules in their blood, and most importantly, their medical histories. Venter stressed that his aim is to enable predictive and preventative medicine for healthy aging, discovering early warning signs for susceptibility to chronic illnesses like cancer, Alzheimer’s, and heart disease, as well as new interventions tailored to each individual’s distinct profile. “We think this will have a huge impact on changing the cost of medicine,” added Venter.

A longer-term goal is to translate some of this information into stem cell therapies, an application that ties Human Longevity to Venter’s existing company, Synthetic Genomics.

But this year’s goal is to sequence the genomes of cancer patients in collaboration with the UCSD Moores Cancer Center.

What exactly is the company going to do with all these data? Do research and publish papers? Yes, and some of the principal scientists hired by the company have received appointments at the Craig Venter Institute. Sell data and the results of analysis? Yes. “Venter and his colleagues also held out the possibility of other commercial products and properties emerging from the company’s basic research.” The company is also actively hiring, and not only computational professionals but clinical and wet lab scientists as well. Here is more about the company’s mission from its job ads:

HLI will develop the most comprehensive gene-phenotype database in the world, with phenotype information deriving from molecular, physiologic, clinical, microbiome and longitudinal data assays. This database will be mined for biologically meaningful patterns that can lead to better diagnostics, therapeutic targets and next-generation cell-replacement therapies.



Will the Beijing Genomics Institute establish Chinese dominance in the genomics market?


I am well past due to post on the New Yorker’s article about B.G.I. The article is available by subscription only, so here I cite its most interesting parts.

B.G.I., formerly called Beijing Genomics Institute, is the world’s largest genetic-research center. With a hundred and seventy-eight machines to sequence the precise order of the billions of chemicals within a molecule of DNA, B.G.I. produces at least a quarter of the world’s genomic data—more than Harvard University, the National Institutes of Health, or any other scientific institution.

the company has already processed the genomes of fifty-seven thousand people. B.G.I. also has sequenced many strains of rice, the cucumber, the chickpea, the giant panda, the Arabian camel, the yak, a chicken, and forty types of silkworm.

The company was founded on 9/9/1999 at 9:19 am in Beijing, China. It now has 4,000 employees with an average age of 26, is located in Shenzhen near the infamous Foxconn factory, and operates on a $1.58-billion loan from the China Development Bank, which funds multiple nonprofit and commercial projects, such as sequencing the DNA of 10,000 people from families with autism in the US and of a thousand obese and healthy people in Denmark. BGI’s plans include the Million Human Genome Project, the Million Plant and Animal Genomes Project, the Million Microecosystem Genomes Project, and the controversial Cognitive Genomics project, as well as millet (a very drought-tolerant crop) and cassava projects, both holding big promise for feeding China and Africa.

BGI is Illumina’s biggest customer: Illumina has sold BGI 130 sequencers for half a million dollars each (my guess is that these are HiSeq 2000s and HiSeq 2500s; the latest and most powerful model, the HiSeq X Ten, released in 2014, costs about a million). When BGI bought Illumina’s main competitor, Complete Genomics, in 2013, Jay Flatley, Illumina’s CEO, said: “It is one thing to sell Coke and another to sell the formula for Coke. … when they bought Complete Genomics … they were allowed to … buy the formula.”

The article concludes by discussing the Cognitive Genomics project, whose goals are to select high-IQ embryos, to find a cure for Alzheimer’s, and to map the brain: “At some point … people will look back and wonder what all the fuss was about” [Chris Chang, a visiting scholar at BGI].


Companies to watch: DNAnexus


This is an excerpt from the job ad at DNAnexus:

At DNAnexus we are solving the most challenging computer science problems you’re ever likely to see. 
In the last few years, there has been a dramatic development in the world of genomics that has created a huge new opportunity. The price to sequence the full human genome (all of your DNA, not just a sample of it) has fallen to the point where it will soon be affordable for a patient to have multiple samples of their whole genome sequenced to help treat their disease. Want to know what specific gene mutation caused a patient’s cancer? We are building the platform to answer that kind of question. One of the many challenges is the huge amount of data. Think you’ve seen big-data problems? Think again – with each genome comprising 100 GB and months of CPU time to crunch the information, DNA is the next big-data problem, requiring exabytes of storage and parallel workloads distributed across 100,000 servers. We are tackling this by combining web technologies, big-data analytics, and scalable systems on cloud computing infrastructure.

We are a well-funded start-up backed by Google Ventures, TPG Biotech, and First Round Capital. Our founders, Andreas Sundquist, Arend Sidow, and Serafim Batzoglou, are world-renowned genomics and bioinformatics experts from Stanford University.

We are looking for smart motivated people. Ideal candidates will likely know several of the following technologies: C, C++, Boost, Ruby, Rails, HAML, HTML, CSS, JavaScript, ECMAScript, V8, jQuery, Flash, Flex/ActionScript, Node.js, Python, Perl, PHP, Amazon Web Services (AWS), SQL, MySQL, PostgreSQL, MongoDB, Solr, Sphinx, Hadoop, MapReduce, ZooKeeper, Hive, Pig, Oozie, HDFS, ZFS, MCache. 
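The storage claims in the ad invite a quick back-of-envelope check. Here is a minimal sketch: the 100 GB/genome figure is the ad’s own, while the cohort sizes are my illustrative assumptions, not DNAnexus numbers:

```python
# Back-of-envelope: raw storage for whole-genome cohorts,
# using the ad's figure of ~100 GB per genome.
GB_PER_GENOME = 100

def cohort_storage_pb(n_genomes, gb_per_genome=GB_PER_GENOME):
    """Return raw storage in petabytes (1 PB = 1e6 GB, decimal units)."""
    return n_genomes * gb_per_genome / 1e6

# HLI's stated target of 100,000 genomes per year:
print(cohort_storage_pb(100_000))    # 10.0 (PB per year)
# A million genomes, the scale of BGI's Million Human Genome Project:
print(cohort_storage_pb(1_000_000))  # 100.0 (PB)
```

Even a million raw genomes is only about a tenth of an exabyte, so the ad’s “exabytes” presumably also counts intermediate analysis files, replication, and future growth.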

And some more about the company from the company’s website:

No proposal is considered too large, as demonstrated through the CHARGE project, a collaboration between Baylor College of Medicine’s Human Genome Sequencing Center (HGSC), DNAnexus, and Amazon Web Services that enabled the largest genomic analysis project to have ever taken place in the cloud. We worked with HGSC to deploy its Mercury variant-calling pipeline on the DNAnexus cloud platform, processing 3,751 whole genomes and 10,940 exomes using 3.3 million core-hours with a peak of 20,800 cores and making Mercury available to 300 researchers participating in the global CHARGE consortium.
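The quoted CHARGE numbers also let us estimate how long the analysis would have taken and what each sample cost in compute; this is my own rough arithmetic on the figures above, treating genomes and exomes together:

```python
# Figures quoted for the CHARGE analysis on DNAnexus.
core_hours = 3_300_000
peak_cores = 20_800
genomes, exomes = 3_751, 10_940

# If the pipeline ran flat-out at peak capacity, the wall-clock time:
wall_days_at_peak = core_hours / peak_cores / 24

# Average compute per sample (a crude average over genomes and exomes,
# which in reality differ greatly in size):
avg_core_hours_per_sample = core_hours / (genomes + exomes)

print(round(wall_days_at_peak, 1))       # ~6.6 days at peak
print(round(avg_core_hours_per_sample))  # ~225 core-hours per sample
```

In practice the run surely spent far longer than a week below peak, but the sketch shows the scale: roughly a core-week and a half of compute per sample.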

Here is the service model and pricing DNAnexus provides, effective May 2014:



Why Google Flu Trends (GFT) was trapped in Big Data

Science March 2014

Science published a curious article, “Big data. The parable of Google Flu: traps in big data analysis” (the pdf is available here).

Here is a bit of history. GFT was launched in 2008, and in 2009 Nature published an article (not publicly available), Detecting influenza epidemics using search engine query data, that described the method behind GFT, which is based on monitoring queries to online search engines and, in essence, finding the best matches among 50 million search terms to fit 1,152 data points. Everything went well (there is more history in the current Science article) until 2013, When Google got flu wrong (Nature, Feb 2013): GFT gravely overestimated the national Christmas flu peak.

The current article points to two issues that contributed to GFT’s mistakes: big data hubris and algorithm dynamics.

Firstly, “The odds of finding search terms that match the propensity of the flu but are structurally unrelated, and so do not predict the future, were quite high. … the big data were overfitting the small number of cases—a standard concern in data analysis.” At the same time, the algorithm overlooked considerable information that could have been extracted by traditional statistical methods. The second issue concerns the Google search algorithm and the data Google collects, as both change constantly and significantly:
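The “overfitting the small number of cases” point is easy to demonstrate in miniature. The sketch below is my own toy example, not the GFT method (far fewer candidates and data points): it searches thousands of pure-noise series for the one best correlated with a short target series, and the winner looks impressive in-sample but useless out-of-sample:

```python
import random

random.seed(0)  # deterministic toy example

def pearson(xs, ys):
    """Plain Pearson correlation coefficient."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

N_FIT, N_HOLD = 50, 50   # tiny stand-ins for GFT's 1,152 data points
N_CANDIDATES = 5_000     # tiny stand-in for 50 million search terms

target_fit = [random.gauss(0, 1) for _ in range(N_FIT)]
target_hold = [random.gauss(0, 1) for _ in range(N_HOLD)]

# Generate pure-noise "search term" series and keep the one that
# best matches the target on the fitting window alone.
best_r, best_series = 0.0, None
for _ in range(N_CANDIDATES):
    series = [random.gauss(0, 1) for _ in range(N_FIT + N_HOLD)]
    r = pearson(series[:N_FIT], target_fit)
    if abs(r) > abs(best_r):
        best_r, best_series = r, series

# The selected series fails on held-out data it was never fit to.
hold_r = pearson(best_series[N_FIT:], target_hold)
print(f"in-sample r = {best_r:+.2f}, out-of-sample r = {hold_r:+.2f}")
```

With only 5,000 candidates the best in-sample |r| typically lands well above 0.5 by chance alone, while the held-out correlation hovers near zero; scale the search to 50 million terms and spurious in-sample fits become all but guaranteed.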

“The most common explanation for GFT’s error is a media-stoked panic last flu season. Although this may have been a factor, it cannot explain why GFT has been missing high by wide margins for more than 2 years. The 2009 version of GFT has weathered other media panics related to the flu, including the 2005–2006 influenza A/H5N1 (“bird flu”) outbreak and the 2009 A/H1N1 (“swine flu”) pandemic. A more likely culprit is changes made by Google’s search algorithm itself.”

The authors reasonably conclude:

“Big data offer enormous possibilities for understanding human interactions at a societal scale, … However, traditional “small data” often offer information that is not contained (or containable) in big data, and the very factors that have enabled big data are enabling more traditional data collection. … Instead of focusing on a “big data revolution,” perhaps it is time we were focused on an “all data revolution,” where we recognize that the critical change in the world has been innovative analytics, using data from all traditional and new sources, and providing a deeper, clearer understanding of our world.”



Injectable material for targeted drug delivery

Nanotechnology "Pill Bot"

Texas A&M University reports (with a video) on the development of a carrier system that can deliver medicines and biosensors to targeted areas of the body. It is made of a hydrogel (a biocompatible material that doesn’t trigger an immune response) embedded with porous microparticles made from clusters of calcium carbonate nanoparticles that form around a specific material to trap it inside. Once a drug or biosensor is trapped within such a microsphere, multiple layers of polymers are wrapped around the particles, allowing precise, customizable control over how the microcapsule releases its contents when it interacts with its surrounding environment.