Science published a curious article “Big data. The parable of Google Flu: traps in big data analysis.” (pdf is available here).
Here is a bit of history. Google Flu Trends (GFT) was launched in 2008, and in 2009 Nature published an article (not publicly available), “Detecting influenza epidemics using search engine query data”, that described the method behind GFT. The method is based on monitoring queries to online search engines and, in essence, finds the best matches among 50 million search terms to fit 1152 data points. Everything went well (more history in the current Science article) until 2013, when, as Nature reported in “When Google got flu wrong” (Feb 2013), GFT gravely overestimated the national Christmas peak of flu.
The current article points to two issues that contributed to GFT’s mistakes: big data hubris and algorithm dynamics.
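To get a feel for why picking the best matches from a huge pool of candidates against few data points is risky, here is a minimal sketch in Python. It is not the GFT method: the target series and all candidate "search terms" are pure random noise, and the function name, sample sizes and seed are assumptions chosen only for illustration. The best candidate is selected on the first half of the data and then checked on the held-out second half.

```python
import math
import random


def pearson(x, y):
    """Plain Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def best_spurious_match(n_points=40, n_candidates=5000, seed=0):
    """Pick the noise series that best 'explains' a noise target.

    Hypothetical toy setup: every series is independent Gaussian noise,
    so any correlation found is spurious by construction.
    """
    rng = random.Random(seed)
    half = n_points // 2
    target = [rng.gauss(0, 1) for _ in range(n_points)]

    best_r, best_series = None, None
    for _ in range(n_candidates):
        candidate = [rng.gauss(0, 1) for _ in range(n_points)]
        # Selection uses only the first half of the data (the "training" period).
        r = pearson(candidate[:half], target[:half])
        if best_r is None or r > best_r:
            best_r, best_series = r, candidate

    # The selected series is then evaluated on data it was not chosen on.
    holdout_r = pearson(best_series[half:], target[half:])
    return best_r, holdout_r


in_sample_r, holdout_r = best_spurious_match()
print(f"best in-sample correlation: {in_sample_r:.2f}")
print(f"same series on held-out data: {holdout_r:.2f}")
```

With thousands of candidates and only 20 points to fit, the winning series should look strongly correlated in-sample, while its held-out correlation is just another draw of noise. The real setting (50 million terms, 1152 points) differs in scale and in that real search terms are not independent noise, so this shows only the selection effect.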
Firstly, “The odds of finding search terms that match the propensity of the flu but are structurally unrelated, and so do not predict the future, were quite high. … the big data were overfitting the small number of cases—a standard concern in data analysis.” At the same time, the algorithm overlooked considerable information that could be extracted by traditional statistical methods. The second issue concerns the Google search algorithm itself and the data Google collects, both of which change constantly and significantly:
“The most common explanation for GFT’s error is a media-stoked panic last flu season. Although this may have been a factor, it cannot explain why GFT has been missing high by wide margins for more than 2 years. The 2009 version of GFT has weathered other media panics related to the flu, including the 2005–2006 influenza A/H5N1 (“bird flu”) outbreak and the 2009 A/H1N1 (“swine flu”) pandemic. A more likely culprit is changes made by Google’s search algorithm itself.”
The authors reasonably conclude:
“Big data offer enormous possibilities for understanding human interactions at a societal scale, … However, traditional “small data” often offer information that is not contained (or containable) in big data, and the very factors that have enabled big data are enabling more traditional data collection. … Instead of focusing on a “big data revolution,” perhaps it is time we were focused on an “all data revolution,” where we recognize that the critical change in the world has been innovative analytics, using data from all traditional and new sources, and providing a deeper, clearer understanding of our world.”