Why Google Flu Trends (GFT) was trapped in Big Data

Science March 2014


Science published a curious article “Big data. The parable of Google Flu: traps in big data analysis.” (pdf is available here).

Here is a bit of history. GFT was launched in 2008, and in 2009 Nature published an article (not publicly available), “Detecting influenza epidemics using search engine query data,” that described the method behind GFT. The method monitors queries to online search engines and, in essence, finds the best matches among 50 million search terms to fit 1152 data points. Everything went well (more history in the current Science article) until 2013, when, as Nature reported in “When Google got flu wrong” (Feb 2013), GFT gravely overestimated the Christmas national peak of flu.

The current article points to two issues that contributed to GFT’s mistakes: big data hubris and algorithm dynamics.

Firstly, “The odds of finding search terms that match the propensity of the flu but are structurally unrelated, and so do not predict the future, were quite high. … the big data were overfitting the small number of cases—a standard concern in data analysis.” At the same time, the algorithm overlooked considerable information that could be extracted by traditional statistical methods. The second issue concerns the Google search algorithm and the data Google collects, both of which change constantly and significantly:

“The most common explanation for GFT’s error is a media-stoked panic last flu season. Although this may have been a factor, it cannot explain why GFT has been missing high by wide margins for more than 2 years. The 2009 version of GFT has weathered other media panics related to the flu, including the 2005–2006 influenza A/H5N1 (“bird flu”) outbreak and the 2009 A/H1N1 (“swine flu”) pandemic. A more likely culprit is changes made by Google’s search algorithm itself.”
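The first issue, picking the best-matching series out of a huge pool of candidates to fit a small number of data points, can be illustrated with a toy simulation. This is only a sketch and not the GFT method: the function names, the pure-noise "search terms," and the tiny sizes (a few thousand candidates, a few dozen points instead of 50 million and 1152) are all assumptions made for illustration. It picks the random series that correlates best with a random target on the first half of the data, then checks how that same series does on the second half.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def best_spurious_fit(n_points, n_candidates, seed=0):
    """Select the noise series that best matches a noise target in-sample.

    Returns (in_sample_r, out_of_sample_r) for the selected series.
    Target and candidates are independent by construction, so any
    strong in-sample match is spurious.
    """
    rng = random.Random(seed)
    half = n_points // 2
    target = [rng.gauss(0, 1) for _ in range(n_points)]  # stand-in for flu data
    best_train, best_test = 0.0, 0.0
    for _ in range(n_candidates):
        # Hypothetical "search term": pure noise, unrelated to the target.
        cand = [rng.gauss(0, 1) for _ in range(n_points)]
        r_train = pearson(cand[:half], target[:half])
        if abs(r_train) > abs(best_train):
            best_train = r_train
            best_test = pearson(cand[half:], target[half:])
    return best_train, best_test


if __name__ == "__main__":
    r_in, r_out = best_spurious_fit(n_points=40, n_candidates=2000)
    print(f"best in-sample correlation:  {r_in:+.2f}")
    print(f"same series, out-of-sample:  {r_out:+.2f}")
```

With enough candidates, the selected series usually looks strongly correlated on the data used to choose it and much weaker on the held-out half, which is the "structurally unrelated, and so do not predict the future" pattern the authors describe.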

The authors reasonably conclude:

“Big data offer enormous possibilities for understanding human interactions at a societal scale, … However, traditional “small data” often offer information that is not contained (or containable) in big data, and the very factors that have enabled big data are enabling more traditional data collection. … Instead of focusing on a “big data revolution,” perhaps it is time we were focused on an “all data revolution,” where we recognize that the critical change in the world has been innovative analytics, using data from all traditional and new sources, and providing a deeper, clearer understanding of our world.”



Russia’s leading search engine Yandex upgrades its Big Data division


The Moscow Times reports on changes in Yandex’s management related to Big Data.

“If Yandex decides to seriously pursue big data, there is a good chance that Russia will enter the international big-data market.”

So far, according to the report, Yandex’s projects have included predicting which mobile communications customers were most likely to change providers, services for an unnamed bank interested in reducing the number of ATMs that reject bank cards, and services for the Russian oil company Rosneft and the Norwegian oil company Statoil.

Btw, TechCrunch reported earlier on the addition of Facebook and Big Data search capabilities to Yandex.

Above is a screenshot of the Russian version of Yandex. I can tell you that the last time I visited Russia, my friends did not understand what I meant by searching the Internet; they always spelled out for me the Russian names (in Latin letters!) of the websites where I should go for this and that. And truly, when I ignored their advice to remember the exact names, I could not find anything they mentioned in any search engine when I searched in Russia.