Global Pulse is looking for signals that could be used as proxy indicators for collective changes in human behaviors in response to the early impacts of crises. In addition to our hypothesis about mobile services as sensors, another potentially valuable type of data is that generated by information seeking behaviors - specifically, by what people search for online.
An individual search is a simple action that might reflect, among other things, a need, an intention, a sentiment or just curiosity. Although the statistical representativeness of this universe of searches is yet to be formalized (and depends on factors such as Internet penetration, computer literacy and search engine usage), the aggregation of all these searches gives a big picture that has already been successfully used in the real world.
Google has developed simple models that aggregate searches for a set of related terms around critical health issues: Flu Trends and Dengue Trends. Despite search being subjective and context-dependent, these tools have potential as early warning mechanisms as their outputs have been correlated with official statistics of disease cases. Based on this idea, Google recently introduced Google Correlate, a tool enabling anyone to identify historical search terms that most closely correlate in occurrence with time-series data or statistics uploaded by the user. We’ve been keen to learn whether online search can reveal crisis impacts beyond the public health sector.
In this context, we have begun experimenting with the tool to assess the potential value of information seeking behavior as a fast, inexpensive approximation of other statistical indexes, or as a way to add color (context and meaning) to classical indicators. The procedure consists of uploading monthly indicators from the UN and elsewhere into the Google Correlate tool and finding interesting terms whose search volume correlates with these indicators. The tool is currently limited to US-based searches, and the results to date have yet to reveal anything truly game-changing, but what we have found is intriguing...
Experimenting with Correlate: Love Poems & Consumer Confidence
Using the monthly statistics for unemployment in the US, we find that searches for the term “collect unemployment” have a very high correlation with them; this idea was previously explored by H. Choi and H. Varian. Interestingly, there is also a very high correlation between the US national unemployment rate and searches for “unemployment” plus specific locations such as Colorado or Washington.
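The matching step behind experiments like this one can be sketched in a few lines of Python. This is only an illustration, assuming a plain Pearson correlation as the similarity measure; it is not Correlate's actual implementation. The term names and all numbers are made up, and `pearson` and `rank_terms` are hypothetical helpers.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mx, my = sum(x) / len(x), sum(y) / len(y)
    dx = [a - mx for a in x]
    dy = [b - my for b in y]
    denom = sqrt(sum(a * a for a in dx) * sum(b * b for b in dy))
    # A constant series has no defined correlation; report 0.0 in that case.
    return sum(a * b for a, b in zip(dx, dy)) / denom if denom else 0.0

def rank_terms(indicator, search_volumes):
    """Return (term, r) pairs sorted from most to least correlated."""
    scored = [(term, pearson(indicator, series))
              for term, series in search_volumes.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

# Made-up monthly values, for illustration only.
indicator = [5.0, 5.2, 5.9, 6.8, 7.5, 8.1]
search_volumes = {
    "term a": [10, 11, 13, 16, 18, 20],  # rises with the indicator
    "term b": [20, 18, 19, 17, 18, 16],  # drifts the other way
}

ranking = rank_terms(indicator, search_volumes)
for term, r in ranking:
    print(f"{term}: r = {r:.3f}")
```

The terms at the top of such a ranking are only candidates: a human still has to decide whether a match is meaningful or a coincidence.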
When looking at statistics on the number of US passengers flying outside the US, the most closely correlated search term is “beach towel,” with a yearly periodic signature characterized by two distinct phases: many people fly internationally and go to the beach during the summer, and a smaller number do so during the winter holidays. The annual periodicity of this signal highlights the importance of subtracting periodic behaviors in order to understand the underlying trends. It is also important to keep in mind that the observed correlation does not imply any causal relation; it means only that the shape of the curve representing the volume of searches is similar to the shape of the curve representing the number of passengers. There is no relation between them in absolute numbers. In mathematical terms, a high correlation coefficient means that the two series nearly coincide once each has been shifted and rescaled to a common mean and spread.
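Both points can be illustrated with a small Python sketch, again assuming Pearson correlation as the similarity measure. The first part shows that correlation ignores units and offsets. The second removes a yearly pattern by subtracting each calendar month's average, which is one simple approach among several. All numbers are invented, and `remove_seasonality` is a hypothetical helper.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    dx = [a - mx for a in x]
    dy = [b - my for b in y]
    denom = sqrt(sum(a * a for a in dx) * sum(b * b for b in dy))
    return sum(a * b for a, b in zip(dx, dy)) / denom if denom else 0.0

def remove_seasonality(values, period=12):
    """Subtract from each point the mean of all points at the same position in the cycle."""
    if len(values) < period:
        raise ValueError("need at least one full period of data")
    phase_means = []
    for phase in range(period):
        same_phase = values[phase::period]
        phase_means.append(sum(same_phase) / len(same_phase))
    return [v - phase_means[i % period] for i, v in enumerate(values)]

# Invented monthly series: a slow upward drift plus a repeating yearly pattern.
yearly_pattern = [0, 0, 1, 2, 4, 7, 9, 8, 4, 2, 1, 3]
searches = [0.1 * i + yearly_pattern[i % 12] for i in range(36)]

# Same shape in completely different units (e.g. passengers instead of queries).
passengers = [1000 + 250 * v for v in searches]
print(round(pearson(searches, passengers), 6))

# With the yearly pattern removed, only the year-to-year movement remains.
deseasonalized = remove_seasonality(searches)
print([round(v, 2) for v in deseasonalized[::12]])
```

Subtracting monthly averages assumes the seasonal pattern is stable from year to year; a pattern that grows or shifts over time would need a more careful decomposition.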
In another experiment, we uploaded the monthly US consumer price index (CPI), which reflects changes in the prices paid by urban consumers for a representative basket of goods and services. While there is no term that perfectly matches the CPI, searches for the banks “Banco de America” and “chase com” align consistently with it.
The business confidence index (BCI) is produced from a survey that measures the level of optimism of private sector managers and executives about the performance of the economy and how they feel about their own organizations’ prospects. Remarkably, one of the search terms most highly correlated with it is “buying vs leasing,” suggesting that optimism about the economy makes companies (or consumers?) think about making longer-term investments and purchases. In this case, the signal is modulated by a high frequency monthly signal, as the question of whether to buy or lease arises mostly at the end of each month, when planning decisions are taken.
The consumer confidence index (CCI) conveys consumers’ optimism on the state of the US economy, as expressed through their saving and spending decisions. Lauder's Lipstick Theory suggests that when consumers are feeling especially threatened, sales of small beauty products such as lipstick spike, as these are products that consumers feel comfortable buying even if they turn away from bigger retail indulgences. We asked if there could be an equivalent for web searches, and found a high correlation between the CCI and searches for “love poems and quotes,” except during the week around Valentine’s Day (14th of February), when searches for love poems spike far above the CCI curve! This raises the question not only of how to compensate for seasonality (high and low frequency modulations) but also of how to deal with special events that produce high search volumes without being of special interest (note: signal spikes and other anomalies can also be interesting!).
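One simple way to handle such special events is to flag points that sit far from the typical level and then decide, case by case, whether to exclude them before correlating. The sketch below uses the median and the median absolute deviation (MAD) as a robust yardstick. The threshold of 5 is an arbitrary assumption, the weekly numbers are made up, and `flag_spikes` and `drop_points` are hypothetical helpers.

```python
from statistics import median

def flag_spikes(values, threshold=5.0):
    """Return indices whose distance from the median exceeds threshold * MAD."""
    med = median(values)
    mad = median([abs(v - med) for v in values])
    if mad == 0:
        return []  # no spread to measure against; flag nothing
    return [i for i, v in enumerate(values) if abs(v - med) / mad > threshold]

def drop_points(values, indices):
    """Return a copy of values without the given indices."""
    skip = set(indices)
    return [v for i, v in enumerate(values) if i not in skip]

# Made-up weekly search volume with one Valentine's-style spike at index 5.
weekly_searches = [50, 52, 51, 49, 53, 120, 52, 50, 48, 51]

spikes = flag_spikes(weekly_searches)
cleaned = drop_points(weekly_searches, spikes)
print(spikes)   # [5]
print(cleaned)
```

When a second series is compared against this one, the same indices have to be dropped from both so that the remaining points stay aligned. Keeping the flagged indices around, rather than silently deleting them, also preserves the anomalies for separate study.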
An extended version of the Correlate tool that allows matching frequency filtered versions of the function that aggregates search queries would be extremely useful, as it would allow the analysis of phenomena at different temporal scales.
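As a rough illustration of what frequency filtering could look like, the sketch below splits a series into a slow component (a centered moving average) and a fast component (whatever is left over). This is one basic filter among many and is not a feature of Correlate; `moving_average` and `split_scales` are hypothetical helpers, and the data is invented.

```python
def moving_average(values, window):
    """Centered moving average; the window shrinks near the edges."""
    half = window // 2
    smoothed = []
    for i in range(len(values)):
        lo = max(0, i - half)
        hi = min(len(values), i + half + 1)
        smoothed.append(sum(values[lo:hi]) / (hi - lo))
    return smoothed

def split_scales(values, window):
    """Split a series into a slow part (moving average) and a fast part (residual)."""
    slow = moving_average(values, window)
    fast = [v - s for v, s in zip(values, slow)]
    return slow, fast

# Invented series: a steady rise with an alternating wiggle on top.
series = [i + (1 if i % 2 == 0 else -1) for i in range(12)]

slow, fast = split_scales(series, window=4)
print([round(v, 2) for v in slow])
print([round(v, 2) for v in fast])
```

Each component could then be correlated separately against an indicator, so that a term matching only the slow trend is not rejected because of its month-end wiggle, and vice versa.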
Other interesting improvements could be the possibility of restricting the correlation to a certain period of time or finding piecewise functions. This would make it possible to explore coping mechanisms such as substitution: when individuals are under economic stress, they might replace a good, service or behavior with another that better suits their current situation, e.g. they might switch to cheaper food. It has been established that fast food sales grow during tough economic times. Is there an analogy in the world of information seeking behavior?
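Restricting the correlation to a period of time can be approximated by computing it over a sliding window. The sketch below assumes Pearson correlation and uses invented data in which two series move together at first and then diverge, the kind of change a substitution effect might produce. `rolling_correlation` is a hypothetical helper.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    dx = [a - mx for a in x]
    dy = [b - my for b in y]
    denom = sqrt(sum(a * a for a in dx) * sum(b * b for b in dy))
    return sum(a * b for a, b in zip(dx, dy)) / denom if denom else 0.0

def rolling_correlation(x, y, window):
    """Correlation of x and y over each run of `window` consecutive points."""
    if len(x) != len(y):
        raise ValueError("series must have the same length")
    return [pearson(x[i:i + window], y[i:i + window])
            for i in range(len(x) - window + 1)]

# Invented data: the series rise together, then the second one reverses.
indicator = [1, 2, 3, 4, 5, 6, 7, 8]
searches = [2, 4, 6, 8, 8, 6, 4, 2]

windows = rolling_correlation(indicator, searches, window=4)
print([round(r, 2) for r in windows])
```

A single correlation over the whole period would blur the two regimes together; the windowed version shows where the relationship holds and where it breaks.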
Several examples have shown us the potential for understanding collective behaviors by reverse engineering information seeking behaviors and building models able to predict (or at least track), in real time, indicators that would otherwise be based on surveys with higher latency. This is the concept of predicting the present, or "nowcasting."
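In its simplest form, nowcasting could mean fitting a straight line between past search volumes and past values of an indicator, then applying that line to the latest search volume before the official figure is published. The sketch below uses ordinary least squares with a single predictor. It is a toy under strong assumptions (a stable linear relationship, no seasonality), the numbers are invented, and `fit_line` and `nowcast` are hypothetical helpers.

```python
def fit_line(x, y):
    """Ordinary least squares fit of y = intercept + slope * x."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxx = sum((a - mx) ** 2 for a in x)
    if sxx == 0:
        raise ValueError("x must not be constant")
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    slope = sxy / sxx
    intercept = my - slope * mx
    return intercept, slope

def nowcast(intercept, slope, latest_search_volume):
    """Estimate the indicator from the most recent search volume."""
    return intercept + slope * latest_search_volume

# Invented history: search volume and the indicator published for the same months.
past_searches = [10, 20, 30, 40]
past_indicator = [3, 5, 7, 9]

intercept, slope = fit_line(past_searches, past_indicator)
estimate = nowcast(intercept, slope, latest_search_volume=50)
print(round(estimate, 2))
```

A real model would need to be validated on held-out months and rechecked regularly, since the relationship between what people search for and what an indicator measures can drift over time.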
How long it will take before Internet access is available to everyone remains an open question. In the meantime, we’re working to discover approaches that will be useful both now and when we get there.
Miguel Luengo-Oroz is Global Pulse's chief data scientist.
This post is part of a series on the Global Pulse Data & Analysis team’s experiments and research investigations on how new data can contribute to crisis impact monitoring and international development. The preliminary ideas represented here are synthesized from internal brainstorming sessions, ongoing research projects and discussions with experts within and outside the UN. We hope this blog can be the beginning of a wider conversation, and welcome your comments and substantial feedback in this ideation process.