Mining the Web for Digital Signals: Lessons from Public Health Research


A number of recently published studies and articles suggest that mining the Web can inform public health in ways that are relevant to Global Pulse’s approach to social impact monitoring in the digital age. Here is a recap of some of the most relevant studies and articles:

Syndromic Surveillance and More

One potential application is syndromic surveillance as part of an emerging field known as “infodemiology”. According to the US Centers for Disease Control and Prevention (CDC), mining vast quantities of health-related online data can help detect disease outbreaks “before confirmed diagnoses or laboratory confirmation”. Because people tend to search online for health or symptom information when they are sick, the resulting Google and Yahoo! queries have been found to be strongly correlated with, and can thus serve as indicators of, the outbreak or prevalence of the flu, dengue, Lyme disease, and other infectious diseases in close to real time.

Figure: A screenshot of Google’s new Dengue Trends, which tracks relevant Google search terms in real time.

Twitter may serve a similar purpose. In a recent research project, two computer scientists at Johns Hopkins University used a sophisticated proprietary algorithm to analyze over 1.6 million health-related tweets (out of over 2 billion) posted in the US between May 2009 and October 2010. They found a 0.958 correlation between the flu rate they modeled from their data and the official flu rate. Another team, at the University of Iowa, used tweets to study the H1N1 outbreak. Other types of ailments can also be analyzed via Twitter data, including dental pain as reported in Scientific Daily. According to another study summarized by Time Magazine, updates, photos and posts on Facebook (a comparatively closed network) can also help identify drinking problems among college students, a major public health concern.
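To make the idea of a correlation between a modeled and an official flu rate concrete, here is a minimal Python sketch of the Pearson correlation coefficient. It is not the researchers' proprietary algorithm, which the studies cited here do not describe in detail. The function name `pearson` and the weekly values are hypothetical and invented purely for illustration.

```python
from math import sqrt


def pearson(xs, ys):
    """Return the Pearson correlation coefficient of two equal-length series."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two series of equal length with at least 2 points")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of co-deviations and sums of squared deviations from the means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / sqrt(var_x * var_y)


# Made-up weekly values, NOT data from any study mentioned in this post.
tweet_based_rate = [1.2, 1.9, 3.1, 4.8, 4.1, 2.6]
official_rate = [1.0, 2.1, 3.0, 4.5, 4.4, 2.4]

print(round(pearson(tweet_based_rate, official_rate), 3))
```

A value close to 1 means the two series rise and fall together. It says nothing by itself about whether the Web-based signal leads or lags the official one.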

Syndromic surveillance can even be done at a finer level of granularity. Location information that Twitter users freely provide can be used to study the geographic spread of a disease or virus. Other data streams can also be brought in to provide geographical information: the Healthmap project, for example, compiles disparate data from online news, eyewitness reports and expert-curated discussions, as well as validated official reports, to “achieve a unified and comprehensive view of the current global state of infectious diseases” that can be visualized on a map. In the words of one of its creators as quoted in a recent article, “[i]t’s really taking the local-weather forecast idea and making it applicable to disease.” This may help cities and communities prepare for potential outbreaks as they do for storms.

Another application for which Twitter is especially suited is providing richer information on health-related behaviors, perceptions, concerns, and beliefs. Important sociological insights were gleaned from the Johns Hopkins study. These included detailed information about the use and misuse of medication (such as self-prescribed antibiotics to treat the flu, which are ineffective against a viral infection, and misuse of over-the-counter drugs) and about exercise routines among various socioeconomic groups, all of which is relevant for both research and policy.

Challenges with Perception Data

There are also challenges with using such data. First, a large share is based on perceptions of illnesses, which can be wrong and thus misleading. A study of the performance of Google Flu Trends found that while the tool did a good job at predicting nonspecific flu-like respiratory illnesses, it did not predict actual flu very well. The reason is the presence of infections causing symptoms that resemble those of influenza combined with the fact that influenza is not always associated with influenza-like symptoms. In the words of a researcher on the team, “influenza-like illness is neither sensitive nor specific for influenza virus activity—it’s a good proxy, it’s a very useful public-health surveillance system, but it is not as accurate as actual nationwide specimens positive for influenza virus.”

Similar caveats related to the tension between sensitivity and specificity (i.e. not missing a signal vs. not picking up a false signal) have been mentioned in other studies: Michael J. Paul, a co-author of the Johns Hopkins study, cited in Scientific American, stressed that this data “was not accurate enough to replace traditional methods” such as sentinel surveillance. Another challenge is limited penetration, and the associated selection bias, in the use of the Web-based tools and services that emit such digital signals; this bias might be even greater in developing countries. But Twitter usage can be expected to grow in developing regions in the years ahead; for example, Indonesia is already the top tweeting country in the world.
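The tension between sensitivity and specificity can be illustrated with a short Python sketch. It assumes we have, for each week, a yes/no alert from a Web-based signal and a yes/no value from confirmed laboratory data. The function name `sensitivity_specificity` and the weekly flags are hypothetical and made up for illustration.

```python
def sensitivity_specificity(signal, truth):
    """Compare a yes/no signal with confirmed yes/no outcomes.

    sensitivity = share of real outbreaks the signal caught
    specificity = share of quiet periods where the signal stayed quiet
    """
    if len(signal) != len(truth):
        raise ValueError("signal and truth must have the same length")
    pairs = list(zip(signal, truth))
    true_pos = sum(1 for s, t in pairs if s and t)
    false_neg = sum(1 for s, t in pairs if not s and t)
    true_neg = sum(1 for s, t in pairs if not s and not t)
    false_pos = sum(1 for s, t in pairs if s and not t)
    if true_pos + false_neg == 0 or true_neg + false_pos == 0:
        raise ValueError("need at least one positive and one negative case")
    sensitivity = true_pos / (true_pos + false_neg)
    specificity = true_neg / (true_neg + false_pos)
    return sensitivity, specificity


# Made-up weekly flags, for illustration only.
web_alert = [True, True, False, True, False, False]
lab_confirmed = [True, False, False, True, True, False]

sens, spec = sensitivity_specificity(web_alert, lab_confirmed)
print(f"sensitivity={sens:.2f} specificity={spec:.2f}")
```

Lowering the bar for raising an alert tends to increase sensitivity at the cost of specificity, and vice versa, which is why such signals are better suited to triggering investigation than to replacing confirmed counts.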

The Take-Away

What lessons can be gleaned from these examples and experiences for Global Pulse’s work?

One is that there is evidence that Web-based data can help detect anomalies and trigger a process of investigation and verification. This is best done when analysts act as “sophisticated users” who shy away from simplistic conclusions. It is also worth stressing that the nature and detection of socioeconomic ‘anomalies’ may differ from, and indeed be less clear-cut than, those in the realm of public health. This suggests the need to develop specific methodologies to characterize and detect excessive variations from socioeconomic trends in context.
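As one very simple illustration of what detecting “excessive variations” from a trend could look like, the Python sketch below flags points that deviate strongly from a trailing baseline. This is a generic rolling z-score, not a methodology used or endorsed by Global Pulse; the function name `flag_anomalies`, the window size, the threshold and the series are all assumptions chosen for illustration.

```python
from statistics import mean, stdev


def flag_anomalies(series, window=4, threshold=3.0):
    """Return indices whose value lies more than `threshold` standard
    deviations away from the mean of the preceding `window` values."""
    if window < 2:
        raise ValueError("window must be at least 2 to compute a deviation")
    flagged = []
    for i in range(window, len(series)):
        baseline = series[i - window:i]
        mu = mean(baseline)
        sigma = stdev(baseline)
        if sigma == 0:
            # A perfectly flat baseline gives no scale to judge deviations by.
            continue
        if abs(series[i] - mu) / sigma > threshold:
            flagged.append(i)
    return flagged


# Made-up weekly counts of some keyword, for illustration only.
weekly_mentions = [10, 11, 10, 11, 10, 11, 30, 11]
print(flag_anomalies(weekly_mentions))
```

A flagged index is only a prompt for investigation and verification. Seasonal patterns, news coverage or changes in who uses a service can all produce spikes that are not anomalies in the underlying socioeconomic situation.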

Second and relatedly, bringing together several data streams—including more traditional data such as official reports—adds complexity and richness to the picture and can reveal sociological insights, notably through the analysis and visualization of unstructured data. More data is not always better data, but more data types usually provide better information. As summarized by Michael J. Paul in the case of the flu, “we can use [this data] to see if the flu rate is going up suddenly and we should investigate this. There is a lot of potential to learn so much about people that they don’t share with their doctors”.

This points to a third lesson: different Web-based tools and systems are used for different purposes, and their differing relevance and implications for research and policy have been insufficiently studied so far. These findings also point to a larger and (by the standards of the Web) older conceptual debate about the “public good”-ness and externalities of the Web in general and social media in particular. The underlying tension between privacy concerns and the social benefits of using digital data will need to feature more prominently on the agendas of researchers, policymakers and lawmakers in the coming years to ensure privacy-preserving data mining for the greater good.

By Emmanuel Letouzé
