In late 2011, Global Pulse completed a series of collaborative research projects exploring the utility of various new digital data sources for answering traditional development-related questions (“proof of concept” projects).
To do these projects, Global Pulse collaborated with five separate partners – some from academia, and some from the private sector:
- With Crimson Hexagon, we analyzed public Tweets to understand how people in the US and Indonesia perceive the impacts of stress – particularly around food, fuel and financial crises.
- With SAS, we analyzed blogs, forums and online news, and correlated those online conversations with official unemployment rates.
- With l’Institut des Systèmes Complexes de Paris Ile-de-France (ISC-PIF), we clustered and categorized seven years’ worth of French-language online news articles to track the changing dominant trends in coverage of the food crisis in Africa.
- With JANA, we tested the feasibility of deploying rapid mobile-phone-based surveys about well-being in multiple countries.
- We collaborated with PriceStats to monitor online bread prices in six Latin American countries in order to gain real-time insight into price dynamics.
But what is a “collaborative research” project exactly?
Check out this Q&A with Miguel Luengo-Oroz, who led these projects as part of Global Pulse’s Data Science & Research team, to find out more about the process:
1. How did you pick these projects and these partners?
We wanted to explore what digital data can tell us about the real-life situations of vulnerable populations – the mission of UN Global Pulse. Our idea was to examine a broad spectrum of possibilities for Big Data for Development. We were inspired by great case studies and research, such as Google Flu Trends, or the study by Sweden’s Karolinska Institute and Columbia University on using mobile phone data to track migration after the earthquake in Haiti.
Our two main approaches were to visit some of the best quantitative social science labs in order to understand what had been done in the field thus far, and to attend conferences and workshops on cross-cutting topics such as Smart Cities and Data Mining. With all these ideas, we worked with our UN colleagues through several brainstorming sessions to gain a sense of which development-related topics would be the most interesting and relevant to explore through new data sources and analysis techniques.
Then, we approached our new partners and began to establish broad objectives (note that in research it is sometimes complicated to set very precise goals when you do not know exactly where you will end up). While we ultimately showcased five projects, we started off with many more that, for different reasons, did not evolve as far. All these experiments and “semi”-experiments allowed us to create a network of trusted partners with whom we have collaborated and will continue to collaborate in the future.
2. How did you go about designing the research questions?
Once we had established the main topics for exploration, and the data we wanted to dig into, we contacted experts in the field in order to refine the questions and hypotheses. For instance, in the news mining & categorization project with the ISC-PIF, we asked our UN colleagues for help in defining a taxonomy of topics related to food security, which was later used to perform network analysis and data mining on French news databases (a toy illustration of taxonomy-based tagging is sketched below).
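To make the idea concrete, here is a minimal, hypothetical sketch of how a topic taxonomy can drive automatic tagging of articles. The taxonomy names and keywords below are invented for illustration; the actual ISC-PIF work used a much richer taxonomy and network-analysis methods over a large French-language corpus.

```python
# Hypothetical, simplified taxonomy of food-security topics.
# Simple keyword matching stands in for the far richer text-mining
# methods used in the actual ISC-PIF project.
FOOD_SECURITY_TAXONOMY = {
    "prices": ["food price", "price spike", "inflation"],
    "production": ["harvest", "drought", "crop yield"],
    "aid": ["food aid", "humanitarian", "relief"],
}

def categorize(article_text: str) -> list[str]:
    """Return the taxonomy topics whose keywords appear in the article."""
    text = article_text.lower()
    return [
        topic
        for topic, keywords in FOOD_SECURITY_TAXONOMY.items()
        if any(keyword in text for keyword in keywords)
    ]

print(categorize("Drought cut crop yields, pushing food prices up."))
# -> ['prices', 'production']
```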
One of our missions at Global Pulse is to be the nexus between UN experts on the one hand and quantitative scientists, data mining experts and digital-savvy professionals on the other. In designing our research questions, this required many iterations and meetings with both parties in order to translate priorities and goals into the different discipline “languages.” Not only did we have to develop common goals amongst both parties, but we also had to go through several iterations within our own small team at Global Pulse, through lengthy discussions and fights (aka “learning experiences”!) that inevitably result from mixing the points of view of experts from backgrounds as diverse as development, economics or data science. Much as in other disciplines that combine traditional scientific branches – for instance bioinformatics, which merges biology and computer science – we needed to bring two different worlds together, both in-house and with our research partners. For instance, in the SAS project, statistical methodologies and social media expertise had to be combined with an understanding of the socio-economics of unemployment.
3. What data sources did you use, and why?
We tested a wide range of data types in order both to learn what we could expect from each data source and to understand what types of technical tools we would need to manage and analyze the data. We considered data with a range of properties: in some cases we looked at public or open data, such as blogs, but we also looked at “closed” data sources, such as answers to mobile phone surveys. In some cases the data was obtained directly through access to a Twitter data feed, while in other circumstances it was necessary to build a fairly complex data scraping methodology (e.g., collecting bread prices from many websites for the PriceStats project). Sometimes we looked at unstructured text data such as Tweets or blogs, and sometimes we analyzed “meta-structured” text data such as news articles.
The “data question” is not only about which data source to use, but also about how to collect the data; whether you can access it in real time; how to organize it for your purposes; how to clean it; how to process it; and how to analyze and visualize it. Sometimes the last three steps (processing, analysis and visualization) blur together, and a workflow that keeps them separate can be more efficient (see the sketch below). In summary, the data challenge is not only about the data source, but also about how to make sense of the data and make the data “yours” to understand.
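As a concrete (and purely illustrative) example of what keeping those stages separate looks like, here is a minimal Python sketch. Every function body is an invented placeholder, not Global Pulse’s actual tooling; in practice, collection would hit an API feed or a scraper, and visualization would use a real charting library.

```python
# Minimal sketch of a pipeline with clearly separated stages:
# collect -> clean -> process -> analyze -> visualize.
from collections import Counter

def collect() -> list[str]:
    # Placeholder: in practice, a data feed (e.g. Twitter) or web scraping.
    return ["Bread prices up again!!", "bread PRICES up", ""]

def clean(raw: list[str]) -> list[str]:
    # Normalize case, strip trailing noise, drop empty records.
    return [doc.lower().strip("!") for doc in raw if doc.strip()]

def process(docs: list[str]) -> Counter:
    # Turn text into something countable: here, simple word frequencies.
    return Counter(word for doc in docs for word in doc.split())

def analyze(counts: Counter) -> list[tuple[str, int]]:
    # Analysis step kept separate from processing and visualization.
    return counts.most_common(3)

def visualize(top_terms: list[tuple[str, int]]) -> None:
    # Placeholder visualization: a crude text bar chart.
    for term, n in top_terms:
        print(f"{term:<10} {'#' * n}")

visualize(analyze(process(clean(collect()))))
```

Because each stage has a single, explicit input and output, any one of them can be swapped out (say, replacing the toy word counts with a sentiment model) without touching the rest of the workflow.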
4. What was the process of working together with your research partners like, from beginning to end?
Each project had a slightly different process, but I will use Global Pulse’s collaboration with Crimson Hexagon as an example. After setting up the main research question, we assigned two multidisciplinary groups to work on the project. One was a management team at Global Pulse, composed of a data scientist and a development expert, with the support of many experts who were consulted on specific questions during the project. The other team, at Crimson Hexagon, was composed of social media experts, data analysts and a specialist in Indonesian culture. Each week we had a phone meeting where both groups reviewed the status of the project and set goals for the following week. We also met in person every other month in order to set milestones and make important decisions.
One interesting aspect of this continuous communication flow was that, during the research, we were able to pivot and change our project focus. Initially, we were interested in how people perceive the future – what they perceived their risks to be, and what they were worried or excited about. However, as we moved forward, we discovered that we could get a much more refined analysis from the Twitter data if we based the study on the particular sectors that the public identified as key issues of concern, such as food, fuel, housing, and financing/debt (the balance between specificity and volume is key).
The last part of the project was to co-write a methods paper (which also took many iterations) and launch some new analysis. A final dimension that should not be forgotten is the legal one. Working on a digital-data-based, collaborative research project inside a large international organization like the UN required effort and patience from all the project partners and the UN legal department to develop new legal concepts and frameworks that are still in the process of consolidation.
5. Did you correlate the “new data sources” against some kind of official statistical information on a given issue?
There is a basic hypothesis at the heart of Global Pulse: that real-life events leave digital signatures, and thus it is possible to infer what is happening, as it happens, from this data.
Depending on the project, the bridge between real life and data can be established differently. For example, a simple bridge is when traditionally useful information is acquired from an alternative source that can be faster or cheaper than before (e.g., mobile phone surveys, or checking prices online). A different type of reality mirror is a perception-rich data source, like Indonesian Tweets, that provides a picture of reality offered by a certain demographic, about certain issues (such as Tweets reporting energy blackouts, or the rising price of rice). This type of data can reveal, qualitatively, the worries and concerns of a certain segment of the population. (However, in the case of our research project examining Indonesian Tweets, we did not conduct a deep correlation between the perceptions and official statistics.)
Finally, in our project with SAS, we adopted another approach (one similar to the Google Flu Trends methodology): we compared the quantification of the qualitative information offered by social media with unemployment figures. To this end, we selected online job-related conversations from blogs, forums and news in the US and Ireland and computed a quantitative “mood score” based on the tone of the conversations (for instance, levels of happiness, depression or anxiety were assigned to each post). We also quantified the number of unemployment-related documents that dealt with other topics such as housing and transportation. Finally, the mood scores and the volume of documents related to coping mechanisms were correlated with the official unemployment rate in those countries, in order to discover leading indicators that forecast rises and falls in the unemployment rate (a toy version of this lagged-correlation step is sketched below).
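For readers curious about the mechanics, here is a minimal, hypothetical sketch of testing whether a monthly mood-score series leads official unemployment. This is not the SAS methodology itself, and the numbers are invented; it only shows the basic idea of shifting one series against the other and measuring correlation at each lag.

```python
# Toy lagged-correlation check: does a mood-score series lead
# the official unemployment rate? All numbers are invented.
import numpy as np

mood = np.array([0.2, 0.1, -0.1, -0.3, -0.4, -0.2, 0.0, 0.1])      # avg tone per month
unemployment = np.array([5.0, 5.1, 5.2, 5.6, 6.1, 6.3, 6.2, 6.0])  # official rate (%)

def lagged_corr(signal: np.ndarray, target: np.ndarray, lag: int) -> float:
    """Pearson correlation of signal at month t with target at month t + lag."""
    if lag == 0:
        return float(np.corrcoef(signal, target)[0, 1])
    return float(np.corrcoef(signal[:-lag], target[lag:])[0, 1])

for lag in range(4):
    print(f"lag {lag} months: r = {lagged_corr(mood, unemployment, lag):+.2f}")

# A strongly negative correlation at some lag k > 0 would suggest that
# drops in online mood precede rises in unemployment by about k months,
# i.e. the mood score behaves as a leading indicator.
```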
While this type of data is not a substitute for traditional statistical methodologies, it is complementary and can add some “color” that helps us better understand coping mechanisms and changes in the well-being of populations.