Some reflections on the analytical challenges in big data

Guest post by Sriganesh Lokanathan
Jun 19, 2014

This article is part of a special Global Pulse Guest Blogger Series: “Data Mining for Development: Methodological Innovations & Challenges.” Sriganesh Lokanathan leads LIRNEasia’s research on Big Data for Development. He is responsible for developing partnerships as well as the big data analytics needed to harness mobile network big data to answer broad questions related to human development. This blog post was originally published on the LIRNEasia website.

The work that we have been doing with mobile network big data over the last year has been challenging on many fronts. I’ve spent some time reflecting on the analytical challenges faced in the Big Data paradigm and on the common fallacies that I sometimes find in the broader discussion of the subject. Below are some of my preliminary thoughts, which I am working up into a paper. This is still a work in progress, and comments are welcome.
“Garbage in, Garbage out”, or GIGO for short, is a computer science concept that refers to the fact that the veracity of the output of any logical process depends on the veracity of the input data. In the Big Data paradigm it is easy to overlook that concept, since “messiness” is to be expected when dealing with vast volumes of often unstructured data that can come from a multitude of sources. As Mayer-Schönberger & Cukier (2013) note, “What we lose in accuracy at the micro level we gain in insight at the macro level.” This common conception can often be misleading. Data quality and provenance do matter, and both are important in establishing the generalizability of Big Data findings.

Data provenance and data cleaning

Understanding data provenance involves tracing the pathways taken by the data from the originating source through all the processes that may have mutated, replicated or commingled the data that feeds into the Big Data analyses. This is no simple feat and, as is to be expected given the varied sources of data that are utilized, it is not always feasible to the extent that the scientific community would desire. But at the very least it is important to understand some aspects of the origin of the data. For example, some mobile network operators choose to include the complete route of a call that has been forwarded. This means there may be multiple records in the call detail records (CDRs) for the same call. If that is not taken into consideration, the subsequent social network analysis could contain errors (overstating or understating tie strength, for example). Whilst it may not be possible to establish data provenance as envisaged by scientists, it is important to understand the underlying processes that may have created the data.
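The forwarded-call problem can be made concrete with a small sketch. It assumes a hypothetical CDR layout in which every leg of a forwarded call shares one `call_id` and the legs arrive in order; real operator schemas differ, and the field names here are illustrative only.

```python
from collections import Counter

# Hypothetical CDR rows: (call_id, caller, callee). A shared call_id for
# every leg of a forwarded call is an assumption, not an operator standard.
cdrs = [
    ("c1", "A", "B"),
    ("c2", "A", "C"),  # leg 1: A dials C ...
    ("c2", "C", "B"),  # leg 2: ... and C's phone forwards the call to B
    ("c3", "A", "B"),
]

def naive_tie_strength(rows):
    """Count every record as a separate call between its two parties."""
    return Counter(frozenset((caller, callee)) for _, caller, callee in rows)

def deduplicated_tie_strength(rows):
    """Collapse all legs of a call into one tie: first caller to last callee.

    Assumes the legs of each call appear in the order they occurred.
    """
    legs = {}
    for call_id, caller, callee in rows:
        legs.setdefault(call_id, []).append((caller, callee))
    ties = Counter()
    for call_legs in legs.values():
        origin = call_legs[0][0]        # caller on the first leg
        destination = call_legs[-1][1]  # callee on the last leg
        ties[frozenset((origin, destination))] += 1
    return ties

naive = naive_tie_strength(cdrs)
clean = deduplicated_tie_strength(cdrs)
```

In this toy data the naive count understates the A–B tie and creates A–C and C–B ties from a single forwarded call. Whether a forwarded call should count towards the A–B tie at all is itself an analytical choice that depends on the research question.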
Data cleaning remains an important part of the process to ensure data quality. The first step is to verify that the quantitative and qualitative (i.e. categorical) variables have been recorded as expected. The second involves removing outliers, which in the Big Data paradigm often means the use of decision tree algorithms. But data cleaning is itself a subjective process (e.g. deciding which variables to consider) rather than the truly agnostic one that would be desired, and is thus open to philosophical debate (Bollier, 2010).
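The first cleaning step, checking that variables were recorded as expected, might look like the following minimal sketch. The field names and the set of allowed categories are hypothetical, and the outlier-removal step is deliberately left out.

```python
# Hypothetical schema: field names and allowed values are illustrative only.
ALLOWED_CALL_TYPES = {"voice", "sms", "data"}

def validate_record(record):
    """Return a list of problems found in one CDR-like record."""
    problems = []
    duration = record.get("duration_s")
    is_number = isinstance(duration, (int, float)) and not isinstance(duration, bool)
    if not is_number or duration < 0:
        problems.append("duration_s must be a non-negative number")
    if record.get("call_type") not in ALLOWED_CALL_TYPES:
        problems.append("call_type is not a recognised category")
    return problems

records = [
    {"duration_s": 42, "call_type": "voice"},
    {"duration_s": -5, "call_type": "voice"},
    {"duration_s": 10, "call_type": "fax"},
]
report = [validate_record(r) for r in records]
```

Even here the subjectivity noted above shows up: someone has to decide which fields to check and which values count as valid.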

Is the data representative?

Related to the question of data provenance is the issue of understanding the underlying population whose behavior has been captured. Large data sizes may make the sampling rate irrelevant, but they do not necessarily make the data representative. Not everybody uses Twitter, Facebook or even Google search. For example, ITU estimates suggest that Internet usage is still limited to only 40 per cent of the world population. In other words, more than four billion people globally are not yet using the Internet, and 90 per cent of them are from the developing world. Of the world’s three billion Internet users, two-thirds are from developing countries. At the other end of the spectrum, even though mobile cellular penetration is close to 100 per cent, this does not mean that every person in the world is using a mobile phone. Representativeness is therefore highly relevant when considering how telecommunication data may be used for monitoring and development. Whilst the promise of leveraging data from mobile network operators for monitoring and development hinges on its large coverage, which nears the actual population, it is still not the whole population. The extent of coverage of the poor and the level of gender representation amongst telecom users are both valid questions. Whilst registration information might provide answers, the reality is that demographic information on telecom subscribers is not always accurate. With pre-paid subscriptions being the norm in the majority of the developing world, the demographic information contained in mobile operator records is practically useless, even with mandated registration.
The issue of sampling bias is best illustrated by the case of Street Bump, a mobile app developed by Boston City Hall. Street Bump uses a phone’s accelerometer to detect potholes and notify City Hall whilst app users drive around Boston. The app, however, introduces a selection bias, since its data is skewed towards the demographics of the app users, who often hail from affluent areas with greater smartphone ownership (Harford, 2014). Hence the “Big” in Big Data does not automatically mean that issues such as measurement bias and methodology, internal and external data validity, and inter-dependencies among data can be ignored. These are foundational issues not just for “small data” but also for “Big Data” (Boyd & Crawford, 2012).
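A toy calculation shows how this kind of selection bias can invert a ranking. All numbers below are invented for illustration and none come from Street Bump; the sketch also assumes that every pothole is passed by the same number of drivers, which is unlikely to hold in practice.

```python
# Invented figures: (true number of potholes, share of drivers running the app).
areas = {
    "affluent": (50, 0.40),
    "low_income": (100, 0.10),
}

def expected_reports(true_potholes, app_share, passes_per_pothole=10):
    """Expected reports if each pothole is passed equally often in every area."""
    return true_potholes * passes_per_pothole * app_share

# Raw report counts, as the city would see them.
reports = {name: expected_reports(p, s) for name, (p, s) in areas.items()}

# Dividing by the app share undoes the bias, but only if that share is known.
adjusted = {name: reports[name] / areas[name][1] for name in areas}
```

The raw counts rank the affluent area as the worse one even though it has half as many potholes. The adjustment restores the true ordering, but it relies on knowing the app share per area, which is exactly the kind of demographic information that is often missing.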

Behavioral change

Digitized online behavior can also be subject to self-censorship and the creation of multiple personas, further muddying the waters. Thus studying the data exhaust of people may not always give us insights into real-world dynamics. This may be less of an issue with transaction-generated data (TGD), where in essence the data artifact is itself a byproduct of another activity. Telecom network Big Data, which mostly falls under this category, may be less susceptible to self-censorship and persona development. But it does not exclude the possibility either. It is not inconceivable that users may avoid using their mobiles, or even turn them off, in areas where they do not wish to leave a digital footprint. In a way, Big Data analyses of behavioral data are subject to a form of the Heisenberg Uncertainty Principle: as soon as the basic process of an analysis is known, there may be concerted efforts to exhibit different behavior and/or actions to change the outcomes (Bollier, 2010). For example, Google’s famous PageRank algorithm has spawned an entire industry of organizations that claim to enhance the page ranks of websites. Search Engine Optimization (SEO) is now an established practice when developing websites.
Changes in behavior could also partly account for the declining veracity of Google Flu Trends (GFT). Researchers found that influenza-like-illness rates as exhibited by Google searches did not necessarily correlate with actual influenza virus infections (Ortiz et al., 2011). Recent research has shown that since 2009, when GFT failed to catch the non-seasonal influenza outbreak of that year, infrequent updates have not improved the results. In fact, Google Flu Trends has persistently overestimated flu prevalence since 2009 (Lazer, Kennedy, King, & Vespignani, 2014). Google Flu Trends does not and cannot know what factors contributed to the strong correlations found in its initial work. The point is that the underlying real-world actions of the population that turned to Google for its health queries, and which contributed to the original correlations discovered by GFT, may in fact have changed over time, diminishing the robustness of the original algorithm. For example, the hoopla surrounding GFT could even have created rebound effects, with more and more people turning to Google for their broader health questions and thereby introducing additional search terms (due to different cultural norms and/or ground conditions), which can collectively introduce biases that GFT has not been able to account for. Such problems could have been caught and resolved had the GFT method been more transparent.

Ground context

Hence knowing and understanding real-world context remains important when considering Big Data analyses for monitoring. Nathan Eagle, a pioneer in utilizing cell phone records for development, stresses the importance of weeding out false assumptions by surveying even a few people a priori. For example, in one instance, upon discovering low mobility after a flood using CDR data from Rwanda, he theorized that this was due to an outbreak of cholera. However, a quick ground survey revealed the true cause of the low mobility to be washed-out roads (David, 2013).
Knowledge of ground conditions and context is also relevant to the generalizability of Big Data analyses of telecom data. For example, prior research had established a power-law relationship between the frequency of airtime recharges and the average recharge amount. It was further found that the poor tended to top up more frequently but in smaller amounts than those higher up the socio-economic ladder (UN Global Pulse, 2012). When researchers working with Sri Lankan mobile datasets attempted to use these findings to segment their analyses by socio-economic group, they were not able to do so. A quick survey of local context, based on interviews with operators, provided the reason. Nearly two-thirds of prepaid customers generally chose to reload using scratch cards. Higher-denomination scratch cards were not as easily available as lower-denomination ones. Hence anyone wanting to reload a higher amount often bought several lower-denomination cards. After recharging with one card, the rest were kept aside for when the need arose. Without this local context, researchers would have falsely assumed that the differential airtime purchasing patterns of different socio-economic groups were not prevalent amongst the Sri Lankan population.

Causation versus correlation

It is easy to confuse correlation with causation in the Big Data paradigm, leading to the discovery of misleading patterns. As Google’s Chief Economist Hal Varian notes, “there are often more police in precincts with high crime, but that does not imply that increasing the number of police in a precinct would increase crime” (Varian, 2013b). Big Data draws a lot of its techniques from machine learning, which is primarily about correlation and prediction. Big Data is by its very nature observational and can only measure correlation, not causality. Evangelists of Big Data have predicted the end of theory and hypothesis testing, with correlation trumping causality as the most relevant method du jour (Anderson, 2008; Mayer-Schönberger & Cukier, 2013). But such a prediction may be premature. Noted behavioral economist Sendhil Mullainathan argues that inductive science (i.e. algorithmically mining Big Data sources) will not drown out traditional deductive science (i.e. hypothesis testing) even in a Big Data paradigm. Amongst the three Vs in the traditional Big Data definition, Volume and Variety produce countervailing forces. More Volume makes Big Data induction techniques easier and more effective, whilst more Variety makes them harder and less effective. It is this variety problem that will ensure the need for explaining behavior (i.e. deductive science) rather than just predicting behavior (Mullainathan, 2013).
Causal modeling is possible in a Big Data paradigm by conducting experiments. Google, for example, is famously known for running about 1,000 experiments at any given time (Varian, 2013a). Telecom network operators themselves use such techniques when rolling out new services or, for that matter, when figuring out pricing. The question then becomes how third-party researchers would be able to leverage operator systems to conduct such experiments. There is no clear, simple answer, since these are proprietary systems; it is probably something to be considered at a later stage, once data access issues have been worked out.
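Why an experiment helps can be seen in a small simulation. It is a sketch under invented assumptions: a hypothetical new tariff that truly raises usage by 0.5 units, and heavy users who use more regardless and who also opt into the tariff more often. None of the numbers reflect real operator data.

```python
import random
from statistics import mean

random.seed(0)
TRUE_EFFECT = 0.5  # assumed causal lift in usage from the hypothetical tariff

def usage(heavy_user, on_tariff):
    """Simulated usage: heavy users use more with or without the tariff."""
    return 2.0 * heavy_user + TRUE_EFFECT * on_tariff + random.gauss(0, 1)

def difference_in_means(rows):
    """Mean outcome of treated rows minus mean outcome of untreated rows."""
    treated = [y for t, y in rows if t]
    control = [y for t, y in rows if not t]
    return mean(treated) - mean(control)

observational, experimental = [], []
for _ in range(20000):
    heavy = random.random() < 0.5
    # Observational data: heavy users self-select into the tariff far more often.
    chose = random.random() < (0.8 if heavy else 0.2)
    observational.append((chose, usage(heavy, chose)))
    # Experiment: a coin flip decides, independent of the type of user.
    assigned = random.random() < 0.5
    experimental.append((assigned, usage(heavy, assigned)))

naive_estimate = difference_in_means(observational)
experimental_estimate = difference_in_means(experimental)
```

The observational comparison mixes the effect of the tariff with the effect of being a heavy user, so it lands well above 0.5. Random assignment breaks that link, and the experimental estimate falls close to the assumed true effect.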

The role of traditional “small data” in verification

The documented failures of GFT also point to the importance of traditional statistics as corroborating evidence. For example, the true value of GFT is realized only when it is paired with “small data,” in this case the statistics collected by the Centers for Disease Control and Prevention (CDC). In fact, as Lazer et al. (2014) themselves note, “Greater value can be obtained by combining GFT with other near–real time health data.” When data from mobile network operators is used for syndromic surveillance, as in the case of malaria in Kenya (Wesolowski et al., 2012), it is most useful for spurring timely investigation rather than replacing existing measures of disease activity. Even when engaging with the broader question of how telecommunication network data could be used for monitoring, surveys and supplemental datasets will remain important to sharpen the analyses and especially to verify the underlying assumptions. For example, Blumenstock & Eagle (2012) ran a basic household survey against a randomized set of phone numbers prior to data anonymization to build a training dataset. This allowed them, for example, to understand variations in mobility, social networks and consumption between men and women, and between different socio-economic groups, which would not have been possible using just the call records. Similarly, Frias-Martinez & Virseda (2012) needed census data to bootstrap their algorithms and provide the training data needed to reverse-engineer approximate survey maps. Hence official statistics will continue to be important for building Big Data models and for periodic benchmarking, so that the models can be fine-tuned to reflect ground realities.
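Periodic benchmarking of this kind can be sketched in a few lines. The weekly figures are made up and are not real GFT or CDC values, and the linear recalibration is just one simple choice rather than the method used by any of the studies cited.

```python
from statistics import mean

def benchmark(model_estimates, official_values):
    """Return (mean signed error, mean absolute error) against official data."""
    errors = [m - o for m, o in zip(model_estimates, official_values)]
    return mean(errors), mean(abs(e) for e in errors)

def recalibrate(model_estimates, official_values):
    """Fit official = intercept + slope * model by ordinary least squares."""
    mx, my = mean(model_estimates), mean(official_values)
    sxx = sum((x - mx) ** 2 for x in model_estimates)
    sxy = sum((x - mx) * (y - my)
              for x, y in zip(model_estimates, official_values))
    slope = sxy / sxx
    return my - slope * mx, slope

# Made-up weekly illness rates in per cent; not real GFT or CDC figures.
model = [2.0, 4.0, 6.0, 8.0]
official = [1.0, 2.0, 3.0, 4.0]

bias, mae = benchmark(model, official)
intercept, slope = recalibrate(model, official)
```

A positive mean signed error that persists over time is the signature of systematic overestimation. The fitted intercept and slope can then be used to rescale new model output until the next benchmark.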

Transparency and replicability

The issues with GFT also illustrate the transparency and replicability problems of extant Big Data research. These are aggravated by the fact that the original private data is often not available to everybody. This underscores the importance of opening up these private data sources (in a manner that addresses potential privacy concerns) so that the research can benefit from proper peer review that can hone and improve the analyses. Instead, consumers of such research must take the analysis and the results on faith. In the case of GFT, for example, the researchers did not even publish the original 45 search terms used to make the correlation in their original Nature paper (Ginsberg et al., 2009), rendering replication impossible. Indeed, transparency of methods can also ensure that they are updated more effectively when ground realities change, something that could have prevented the current problems with Google Flu Trends.


Anderson, C. (2008). The end of theory: the data deluge makes the scientific method obsolete. Wired Magazine, 16(7).
Blumenstock, J. E., & Eagle, N. (2012). Divided We Call: Disparities in Access and Use of Mobile Phones in Rwanda. Information Technologies & International Development, 8(2), 1–16.
Bollier, D. (2010). The Promise and Peril of Big Data. Washington DC.
Boyd, D., & Crawford, K. (2012). Critical questions for big data. Information, Communication & Society, 15(5), 662–679. doi:10.1080/1369118X.2012.678878
David, T. (2013). Big Data from Cheap Phones. Technology Review, 116(3), 50–54.
Frias-Martinez, V., & Virseda, J. (2012). On the relationship between socio-economic factors and cell phone usage. In Proceedings of the Fifth International Conference on Information and Communication Technologies and Development – ICTD ’12 (p. 76). New York, New York, USA: ACM Press. doi:10.1145/2160673.2160684
Ginsberg, J., Mohebbi, M. H., Patel, R. S., Brammer, L., Smolinski, M. S., & Brilliant, L. (2009). Detecting influenza epidemics using search engine query data. Nature, 457(7232), 1012–4. doi:10.1038/nature07634
Harford, T. (2014, March 28). Big data: are we making a big mistake? Financial Times, pp. 7–11. 
Lazer, D., Kennedy, R., King, G., & Vespignani, A. (2014). Big data. The parable of Google Flu: traps in big data analysis. Science (New York, N.Y.), 343(6176), 1203–5. doi:10.1126/science.1248506
Mayer-Schönberger, V., & Cukier, K. (2013). Big data: A revolution that will transform how we live, work, and think. Houghton Mifflin Harcourt.
Mullainathan, S. (2013). What Big Data Means For Social Science. HeadCon ’13 Part I. 
Ortiz, J. R., Zhou, H., Shay, D. K., Neuzil, K. M., Fowlkes, A. L., & Goss, C. H. (2011). Monitoring influenza activity in the United States: a comparison of traditional surveillance systems with Google Flu Trends. PloS One, 6(4), e18687. doi:10.1371/journal.pone.0018687
UN Global Pulse. (2012). Taking the Global Pulse: Using New Data to Understand Emerging Vulnerability in Real-Time.
Varian, H. R. (2013a). Beyond Big Data. In NABE Annual Meeting. San Francisco, CA.
Varian, H. R. (2013b). Big Data: New Tricks for Econometrics.
Wesolowski, A., Eagle, N., Tatem, A. J., Smith, D. L., Noor, A. M., Snow, R. W., & Buckee, C. O. (2012). Quantifying the impact of human mobility on malaria. Science (New York, N.Y.), 338(6104), 267–70. doi:10.1126/science.1223467