NIH’s Big Data to Knowledge (BD2K) initiative


This is long-awaited news from NIH: the first round of BD2K awardees has been announced (about $32M in total).

These NIH multi-institute awards constitute an initial investment of nearly $32 million in fiscal year 2014 by NIH’s Big Data to Knowledge (BD2K) initiative, which is projected to have a total investment of nearly $656 million through 2020, pending available funds.

I browsed through the awards; they are divided into four categories. The largest grant, $3M for FY 2014, went to BioCADDIE, the Biomedical and Healthcare Data Discovery and Indexing Engine Center (by the way, for those who have never played golf the way the rich do, a caddie is the person who carries a golfer’s clubs and provides other assistance during a match). The other grants in that category are much smaller. One that caught my eye is about immunosequencing: Computational Tools for the Analysis of High-Throughput Immunoglobulin Sequencing.

Eleven grants went to the Centers of Excellence for Big Data Computing. Among those is the Center of Excellence for Mobile Sensor Data to Knowledge. My favorite is the Center for Big Data in Translational Genomics, awarded to PI David Haussler of UCSC, developer of the UCSC Genome Browser, one of the most popular tools in genomics.

This one, A Framework for Integrating Multiple Data Sources for Modeling and Forecasting of Infectious Diseases, sounds very timely considering the current Ebola outbreak, but it has a ridiculously small budget of $100K (FY 2014), which would basically fund the training of just one PI (this is a K01 grant).

Another group of grants concerns education. Two grants, one for Big Data educational efforts and one for the development of bioinformatics MOOCs, went to UCSD. Other MOOC awardees include Johns Hopkins University, for developing courses in neuroimaging and genomics; Harvard University, for a modular online education program that brings together concepts from statistics, computer science, and software engineering; and initiatives from the Oregon Health and Science University, UCLA, and others. MOOCs are becoming big business, and soon there will be courses in every flavor.


I got a Stat Learning Certificate from Stanford!


That was the best MOOC I have taken so far: fun and engaging. Statistics and R from Trevor and Rob at Stanford, a dream come true!

Binge studying “Statistical Learning with R” from Stanford


After finishing “House of Cards” I am binge studying the Stanford online course “Statistical Learning”. It is the best class on the subject I have ever seen: comprehensible explanations, no rush, well-placed jokes, no excessive material, a free downloadable book that closely follows the material of the class, and lovely R sessions run by Trevor Hastie himself.

I have learned new tricks in R. One example is the handy matplot() function, whose output is shown above. On a single plot, it shows the value of each variable (column) for each row (x-axis) of the data frame. The plot reveals strong autocorrelation between consecutive rows, or about 10-20 repeats of every data point, so the effective sample size is much smaller than the number of rows. The standard bootstrap resamples rows as if they were independent, so it would seriously underestimate the standard error (s.e.) of the regression coefficients if we model y (shown in black on the plot) as a linear function of x1 (in red) and x2 (in blue): y = b0 + b1*x1 + b2*x2. For b1, the standard bootstrap gave an s.e. of 0.028, while the block bootstrap (with blocks of 100 rows) gave 0.196.
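To make the comparison concrete, here is a minimal sketch in plain Python (not the R code from the course). Everything in it is an assumption made for illustration: the data are simulated, with 50 independent points each repeated 20 times to mimic the repeats seen on the plot, and the helper names ols and bootstrap_se are hypothetical. It will not reproduce the 0.028 and 0.196 figures above, which came from the course data set.

```python
import random


def ols(rows):
    """Least-squares fit of y = b0 + b1*x1 + b2*x2; rows are (y, x1, x2)."""
    # Accumulate the augmented normal equations [X'X | X'y].
    a = [[0.0] * 4 for _ in range(3)]
    for y, x1, x2 in rows:
        x = (1.0, x1, x2)
        for i in range(3):
            for j in range(3):
                a[i][j] += x[i] * x[j]
            a[i][3] += x[i] * y
    # Gauss-Jordan elimination with partial pivoting.
    for c in range(3):
        p = max(range(c, 3), key=lambda r: abs(a[r][c]))
        a[c], a[p] = a[p], a[c]
        for r in range(3):
            if r != c:
                f = a[r][c] / a[c][c]
                for k in range(c, 4):
                    a[r][k] -= f * a[c][k]
    return [a[i][3] / a[i][i] for i in range(3)]


def bootstrap_se(rows, block, n_boot, rng):
    """Bootstrap s.e. of b1, resampling contiguous blocks of `block` rows.

    block=1 is the standard bootstrap (rows resampled independently).
    """
    n_blocks = len(rows) // block
    blocks = [rows[i * block:(i + 1) * block] for i in range(n_blocks)]
    estimates = []
    for _ in range(n_boot):
        sample = [row for _ in range(n_blocks) for row in rng.choice(blocks)]
        estimates.append(ols(sample)[1])
    mean = sum(estimates) / n_boot
    return (sum((e - mean) ** 2 for e in estimates) / (n_boot - 1)) ** 0.5


# Simulated data (an assumption): 50 independent points, each repeated
# 20 times in a row, giving 1000 strongly dependent rows.
rng = random.Random(0)
rows = []
for _ in range(50):
    x1, x2 = rng.gauss(0, 1), rng.gauss(0, 1)
    y = 1.0 + 0.5 * x1 - 0.3 * x2 + rng.gauss(0, 1)
    rows.extend([(y, x1, x2)] * 20)

se_std = bootstrap_se(rows, block=1, n_boot=200, rng=rng)
se_block = bootstrap_se(rows, block=100, n_boot=200, rng=rng)
print("standard bootstrap s.e. of b1:", round(se_std, 3))
print("block bootstrap s.e. of b1:   ", round(se_block, 3))
```

Resampling whole blocks keeps neighboring rows together, so the dependence between them carries over into every bootstrap sample instead of being treated as extra independent information.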

While I was satisfying my intellectual curiosity and was busy launching a new project at work (implementation of a new analytical tool to be launched in the summer), a lot has happened in the world of Big Data that I am eager to understand and report on.

Stanford receives $3M for Big Data in biomedicine

The Li Ka Shing Foundation gave a grant to the Stanford University School of Medicine to boost the Big Data in Biomedicine initiative, a collaboration with the University of Oxford in England. The money will go toward recruiting new faculty, establishing new educational programs, and holding a major conference on big data at Stanford in May 2014.

“In the world of medicine, we have a tsunami of data crashing over us, including electronic patient records, DNA sequencing, biological data on disease mechanisms, clinical trials, medical imaging and pharmaceutical records. We can put all these large data sets to work to identify innovative approaches to treatment and to improving access to care,” said Lloyd Minor, MD, dean of the School of Medicine.

The University of Oxford faculty members are leaders of one of the largest patient databanks in the world, the UK Biobank, which has biomedical information on some 500,000 individuals. (On mining patient databanks to develop new medicines, see this post.)

The Stanford arm of the effort will be directed by Euan Ashley, MD, PhD, whose research focuses on developing methods for interpreting genome-sequencing data to improve diagnosis of genetic disease and to develop targeted therapies for patients.

Statistical Learning online class from Stanford just started

The course has officially started. It is free for everyone and is available online. It is in R. A new book by the authors, An Introduction to Statistical Learning, with Applications in R (James, Witten, Hastie, Tibshirani – Springer 2013), is provided as a free download on the course website.

Everyone who completes all the problems by March 21 can get a Statement of Accomplishment. I am thinking about getting one. I am not new to the subject, having taken a somewhat similar graduate course at UCSD taught by Charles Elkan, but I have found that setting the goal of earning a certificate makes it easier to keep up with the course.

Hastie and Tibshirani start the introduction (in the actual class, not in the video above) by joking about how they were statisticians, then became machine learning guys, and now proudly call themselves Data Scientists. Watch this! I am looking forward to learning from these gurus. I am sure it will be fun!