Google Flu Trends

Google Flu Trends was a web service operated by Google. It provided estimates of influenza activity for more than 25 countries. By aggregating Google search queries, it attempted to make accurate predictions about flu activity. This project was first launched in 2008 by Google.org to help predict outbreaks of flu.

Google Flu Trends is now no longer publishing current estimates. Historical estimates are still available for download, and current data are offered for declared research purposes.

The idea behind Google Flu Trends (GFT) is that, by monitoring millions of users’ health tracking behaviors online, the large number of Google search queries gathered can be analyzed to reveal if there is the presence of flu-like illness in a population. Google Flu Trends compared these findings to a historic baseline level of influenza activity for its corresponding region and then reports the activity level as either minimal, low, moderate, high, or intense. These estimates have been generally consistent with conventional surveillance data collected by health agencies, both nationally and regionally.

Roni Zeiger helped develop Google Flu Trends.

Google Flu Trends was described as using the following method to gather information about flu trends.

First, a time series is computed for about 50 million common queries entered weekly within the United States from 2003 to 2008. A query's time series is computed separately for each state and normalized into a fraction by dividing the number of each query by the number of all queries in that state. By identifying the IP address associated with each search, the state in which this query was entered can be determined.

A linear model is used to compute the log-odds of Influenza-like illness (ILI) physician visit and the log-odds of ILI-related search query:

P is the percentage of ILI physician visit and Q is the ILI-related query fraction computed in previous steps. β₀ is the intercept and β₁ is the coefficient, while ε is the error term.

Each of the 50 million queries is tested as Q to see if the result computed from a single query could match the actual history ILI data obtained from the U.S. Centers for Disease Control and Prevention (CDC). This process produces a list of top queries which gives the most accurate predictions of CDC ILI data when using the linear model. Then the top 45 queries are chosen because, when aggregated together, these queries fit the history data the most accurately. Using the sum of top 45 ILI-related queries, the linear model is fitted to the weekly ILI data between 2003 and 2007 so that the coefficient can be gained. Finally, the trained model is used to predict flu outbreak across all regions in the United States.

...
Wikipedia