This is the first in a special Global Pulse Guest Blogger Series: “Data Mining for Development: Methodological Innovations & Challenges.”
Jukka-Pekka Onnela is an Assistant Professor in the Department of Biostatistics, Harvard University School of Public Health. His work focuses on network theory analysis, with particular emphasis on social networks.
In a recent weekly meeting with Lisa, a Harvard undergraduate whose senior thesis I supervise, our discussion turned to the available sources of digital data on human behavior. In order to study the growth of an organization, and to subsequently model that process mathematically, which is her thesis project, I suggested we think about possible data sources as broadly as possible. For a moment, we decided to pretend that there were no limits on the data that might be collected. Of course, there might be ethical and practical considerations that could stop us from pursuing such data. Perhaps such data did not even exist. But this quixotic thought experiment helped us think about what aspect of the problem we would really like to tackle. And if that specific aspect could not be addressed—and here is the value of the exercise—we wanted to think about the next most important aspect of the problem, or the next best question, that we might be able to answer.
Unveiling the structure of social networks has traditionally been constrained by the practical challenge of mapping social interactions among numerous individuals. Whether we call it big data, massive-passive data, or something else, enormous quantities of data are now collected on the behavior and interactions of people. Many of these data sets chronicle interactions between people in myriad constellations, and hence the data naturally lend themselves to a network description. In that description, network nodes correspond to the agents or elements in the system, and a tie connects any two interacting agents. These data are especially interesting because of their ability to inform us about the kinds of social structures that people form. Obvious examples include the social networking service Facebook and the microblogging service Twitter.
In addition to the above examples, there are many systems that comprise a social network aspect that may not be explicit, but can nevertheless be inferred. For example, email logs can be used to quantify the structure of communication flows within an organization; medical claims connecting patients and physicians can be used to infer informal patient-sharing networks of physicians to better understand the coordination of medical care; and cell phone call records, collected for billing purposes, can be leveraged to study communication patterns of large groups of people. To be sure, each of these data sets has its limitations. Nonetheless, if these data are combined with context-specific understanding of human behavior, computational capacity, and appropriate mathematical and statistical tools, each data set can provide powerful insights about the different ways people interact with one another.
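To make the idea of an inferred network concrete, here is a minimal sketch of how patient-physician claim records might be turned into a physician-to-physician network. The record format, the function name `patient_sharing_network`, and the toy data are all hypothetical; real claims data would need far more careful handling.

```python
from collections import defaultdict
from itertools import combinations


def patient_sharing_network(claims):
    """Turn (patient, physician) pairs into ties between physicians.

    Returns a dict mapping a sorted pair of physicians to the number
    of patients they have in common.
    """
    physicians_by_patient = defaultdict(set)
    for patient, physician in claims:
        physicians_by_patient[patient].add(physician)

    shared = defaultdict(int)
    for physicians in physicians_by_patient.values():
        # Every pair of physicians seen by the same patient shares that patient.
        for a, b in combinations(sorted(physicians), 2):
            shared[(a, b)] += 1
    return dict(shared)


# Hypothetical toy records, for illustration only.
claims = [
    ("p1", "dr_a"), ("p1", "dr_b"),
    ("p2", "dr_a"), ("p2", "dr_b"), ("p2", "dr_c"),
    ("p3", "dr_c"),
]
ties = patient_sharing_network(claims)
```

The tie weight here is simply a count of shared patients; whether a single shared patient should count as a tie at all is exactly the kind of context-specific judgment mentioned above.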
The use of any data on human subjects always involves serious ethical considerations. However, the goal of most studies leveraging large-scale data is not to learn about the individual. One of the great promises of these types of approaches lies in their potential to connect individual behavior to collective behavior, or as is sometimes said, to relate the microscopic properties of a system to its macroscopic properties. One of the distinguishing features of these types of complex, interacting systems has to do with the concept of emergence. In a nutshell, behavior that is simple at one scale can lead to complex and unanticipated outcomes at another scale.
One illustrative example of collective behavior is the shoaling and schooling of fish, where the former refers to somewhat independent, interactive aggregation of fish, and the latter to a tightly organized group of fish swimming in synchrony. Schools of fish appear to have a collective existence of their own, which makes it meaningful to study a school of fish as its own distinct entity. This implies that studying the behavior of an individual sardine requires different kinds of data than studying schools of them. Similarly, the large-scale data on human behavior are best leveraged for studying questions about collective, rather than individual, behavior.
There are many reasons why we would like to understand the structure of social networks. In general, there is an interesting symbiosis between network structure and the types of behaviors the network can sustain. Different network structures give rise to different kinds of dynamics, and a structure that is appropriate for one purpose might be undesirable for another. For example, in any network where the number of ties connected to a node, a quantity often referred to as node degree, has a fat-tailed distribution (i.e., the majority of nodes have few connections, while a small minority have a huge number of connections), the system is structurally robust to random failures, that is, to the random removal of nodes. However, this sort of network is very vulnerable to targeted attacks, in which the high-degree nodes, or hubs, are removed. A small-world network, where everyone is connected to everyone else through just a small number of steps, propagates information quickly, but the same is also true of disinformation and germs. Therefore, understanding the structural properties of social networks can improve our understanding of, for example, health phenomena and the design of healthcare interventions.
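The contrast between random failures and targeted attacks can be explored with a small simulation. The sketch below is only an illustration under stated assumptions: it grows a fat-tailed network with a simple preferential attachment rule (my choice for this example, not something prescribed above), removes 10% of the nodes either at random or in order of decreasing degree, and measures the size of the largest connected piece that remains. All function names and parameter values are hypothetical.

```python
import random


def preferential_attachment_graph(n, m, rng):
    """Grow a network where each new node ties itself to m existing nodes,
    chosen with probability proportional to their current degree."""
    adj = {v: set() for v in range(n)}
    ends = []  # each node appears once per tie end, so sampling is degree-weighted
    for i in range(m + 1):  # start from a small fully connected core
        for j in range(i + 1, m + 1):
            adj[i].add(j)
            adj[j].add(i)
            ends += [i, j]
    for new in range(m + 1, n):
        targets = set()
        while len(targets) < m:
            targets.add(rng.choice(ends))
        for t in targets:
            adj[new].add(t)
            adj[t].add(new)
            ends += [new, t]
    return adj


def largest_component_size(adj, removed):
    """Size of the largest connected component once `removed` nodes are gone."""
    seen = set(removed)
    best = 0
    for start in adj:
        if start in seen:
            continue
        seen.add(start)
        stack, size = [start], 0
        while stack:
            node = stack.pop()
            size += 1
            for nb in adj[node]:
                if nb not in seen:
                    seen.add(nb)
                    stack.append(nb)
        best = max(best, size)
    return best


rng = random.Random(0)
n, k = 2000, 200  # remove 10% of the nodes
adj = preferential_attachment_graph(n, 2, rng)

random_removed = set(rng.sample(range(n), k))
hubs_removed = set(sorted(adj, key=lambda v: len(adj[v]), reverse=True)[:k])

size_random = largest_component_size(adj, random_removed)
size_targeted = largest_component_size(adj, hubs_removed)
print("largest component after random removal:  ", size_random)
print("largest component after targeted removal:", size_targeted)
```

Comparing the two printed numbers gives a feel for how much more damage the loss of hubs does than the loss of the same number of randomly chosen nodes.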
We can also use networks to understand how social groups are formed. It is well known that cognitive and temporal constraints limit the number of people with whom we can maintain stable social relationships. This number, suggested by the British anthropologist Robin Dunbar, stems from limits in our neocortical processing capacity, and it is usually taken to be around 150. While this limit bounds the number of connections an individual can have, it says nothing about how that individual's circle of friends is connected with the rest of the network.
We found in a recent study that cohesive groups, defined as sets of tightly connected nodes, can emerge from very simple microscopic behaviors (1). In the mathematical model used to study this phenomenon, we endowed each individual, represented by a node, with two mechanisms. First, we allowed for the formation of a tie, with a given probability, between any two randomly chosen neighbors (friends) of a node. Second, we allowed each node to form a tie to any other node in the network (regardless of the network distance between the two nodes). The purpose of the model was to provide a conceptual framework for studying the emergence of groups; it was certainly not intended to be a detailed model of human social interaction. Although they may seem abstract at first, the two mechanisms have their counterparts in sociology. The first mechanism means that you get to know a friend of a friend (known as “triadic closure”). The second mechanism refers to the formation of a tie between two individuals who have similar interests or engage in similar activities, such as getting to know someone in your class or the person sitting next to you on a flight (known as “focal closure”).
While the relative frequencies of the two mechanisms can vary, the outcome is typically a somewhat messy network structure with little resemblance to human social networks. But there is a catch. If we introduce a small amount of memory to the system, meaning that having interacted with someone in the past makes it slightly more likely for you to interact with the same person in the future, a different picture emerges. Suddenly, we begin to see the nucleation of people into groups. As we increase the imprint that past interactions leave in the memories of our synthetic actors, the groups become tighter and more exclusive to outsiders. Interestingly, there is considerable variation in the sizes of these groups, even though each node is following exactly the same rules. What this means is that even when social resources are distributed uniformly through these processes, which are constructed to be democratic, significant inequalities emerge at the group level.
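For readers who like to tinker, the two mechanisms and the memory effect can be caricatured in a few lines of code. What follows is a deliberately simplified toy, not the model of reference (1), which contains further details omitted here. The function names, the parameters `p_triadic`, `p_focal`, and `memory`, and the values used are all assumptions made for illustration.

```python
import random


def weighted_pick(strengths, rng, exclude=None):
    """Pick a friend with probability proportional to tie strength."""
    items = [(k, v) for k, v in strengths.items() if k != exclude]
    if not items:
        return None
    keys, vals = zip(*items)
    return rng.choices(keys, weights=vals, k=1)[0]


def simulate(n, steps, p_triadic, p_focal, memory, rng):
    """Return w, where w[u][v] is the strength of the tie between u and v."""
    w = [dict() for _ in range(n)]

    def strengthen(u, v, amount):
        new = w[u].get(v, 0.0) + amount
        w[u][v] = new
        w[v][u] = new

    for _ in range(steps):
        i = rng.randrange(n)
        # Focal closure: a tie to anyone in the network, however distant.
        # Nodes without friends always try this (an assumption of this toy).
        if not w[i] or rng.random() < p_focal:
            j = rng.randrange(n)
            if j != i and j not in w[i]:
                strengthen(i, j, 1.0)
        # Triadic closure: two friends of i meet through i.
        j = weighted_pick(w[i], rng)
        k = weighted_pick(w[i], rng, exclude=j)
        if j is None or k is None:
            continue
        # Memory: ties that were just used become more likely to be used again.
        strengthen(i, j, memory)
        strengthen(i, k, memory)
        if k in w[j]:
            strengthen(j, k, memory)
        elif rng.random() < p_triadic:
            strengthen(j, k, 1.0)
    return w


ties = simulate(n=200, steps=20000, p_triadic=0.05, p_focal=0.01,
                memory=1.0, rng=random.Random(1))
print("number of ties:", sum(len(t) for t in ties) // 2)
```

Setting `memory=0.0` switches the memory effect off, so that friends are picked uniformly; rerunning with larger values is one way to explore how past interactions shape the resulting structure.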
This phenomenon is not unlike the one encountered in the work of the economist Thomas Schelling, who studied unorganized spatial segregation resulting from individual choices (2). The model deals with a population of two types of members, where the distinction is recognizable, such as black and red checkers on a board. In this case, unlike in checkers, we want to fill most of the board with pieces, such that most squares on the board are occupied, and we initially shuffle the pieces around the board. Each piece has its neighborhood, which is defined as the four squares adjacent to its current square (left, right, top, bottom). The rules of the game are simple. A red piece will move to a new location (or more accurately, exchange locations with a black piece) if it is surrounded by, say, three or more black pieces; a corresponding rule applies to the black pieces.
Different board (or lattice) geometries allow for different kinds of neighborhoods, and hence allow for experimenting with different strengths of preference, but the outcome is almost always the same. If one continues to play this game until no piece wants to move, which is when the system is said to have achieved a stable equilibrium, the black pieces will be on one side of the board and the red pieces on the other. No piece has a strong preference, yet the outcome is complete segregation. In the words of Schelling, “small and seemingly meaningless decisions and actions by individuals often lead to significant unintended consequences for a large group.” The phenomenon can be easily reproduced in a computer simulation, but even more easily, and certainly more tangibly, by moving eggs around in a carton (3).
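A computer version of the game can be very short. The sketch below is one possible reading of the rules described above, with a few assumptions of my own: the board is completely filled, it wraps around at the edges so that every square has four neighbors, and a move is a swap between a red piece and a black piece that are each surrounded by three or more pieces of the other color. The game stops when no such swap is possible. Function names and the board size are hypothetical.

```python
import random


def neighbors(r, c, size):
    """The four adjacent squares; the board wraps around (an assumption)."""
    return [((r - 1) % size, c), ((r + 1) % size, c),
            (r, (c - 1) % size), (r, (c + 1) % size)]


def is_unhappy(board, r, c, threshold=3):
    """A piece wants to move if `threshold` or more neighbors differ from it."""
    size = len(board)
    unlike = sum(board[nr][nc] != board[r][c]
                 for nr, nc in neighbors(r, c, size))
    return unlike >= threshold


def like_fraction(board):
    """Share of neighboring pairs that have the same color."""
    size = len(board)
    like = total = 0
    for r in range(size):
        for c in range(size):
            for nr, nc in neighbors(r, c, size):
                total += 1
                like += board[nr][nc] == board[r][c]
    return like / total


def run_schelling(size, rng, max_swaps=100_000):
    """Shuffle an even-sized board, then swap unhappy pieces until stuck."""
    cells = ["R", "B"] * (size * size // 2)
    rng.shuffle(cells)
    board = [cells[r * size:(r + 1) * size] for r in range(size)]
    before = like_fraction(board)
    for _ in range(max_swaps):
        unhappy = {"R": [], "B": []}
        for r in range(size):
            for c in range(size):
                if is_unhappy(board, r, c):
                    unhappy[board[r][c]].append((r, c))
        if not unhappy["R"] or not unhappy["B"]:
            break  # no red-black pair is left that both want to move
        r1, c1 = rng.choice(unhappy["R"])
        r2, c2 = rng.choice(unhappy["B"])
        board[r1][c1], board[r2][c2] = "B", "R"
    return board, before, like_fraction(board)


board, before, after = run_schelling(20, random.Random(0))
print(f"same-color neighbor share: {before:.2f} -> {after:.2f}")
for row in board:
    print("".join(row))
```

Printing the board makes the clustering visible at a glance, and changing `threshold` in `is_unhappy` is a simple way to experiment with different strengths of preference.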
Spatial segregation is obviously an important problem, but what is perhaps less obvious is that networks can be used to address a closely related problem, that of social segregation. Blondel and collaborators identified network groups in a graph consisting of 2.6 million mobile phone users in Belgium connected by the calls they made in a 6-month period (4). In order to guarantee anonymity, surrogate keys, as opposed to actual phone numbers, were used to represent individuals. The language of service (most commonly French or Flemish) of each individual was known. The authors computed the percentage of people speaking the dominant language in each group, finding that the network was clearly segregated. In the large groups, more than 85% had chosen the same language of service and, hence, likely spoke the same first language.
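The summary statistic used here, the share of each group that has the group's dominant language, is simple to compute once the groups are known. The following minimal sketch assumes the groups have already been identified by some other means; the function name, the data layout, and the toy records are hypothetical and are not taken from the study.

```python
from collections import Counter, defaultdict


def dominant_language_share(group_of, language_of):
    """For each group, return (dominant language, share of members using it)."""
    by_group = defaultdict(Counter)
    for person, group in group_of.items():
        by_group[group][language_of[person]] += 1

    result = {}
    for group, counts in by_group.items():
        language, count = counts.most_common(1)[0]
        result[group] = (language, count / sum(counts.values()))
    return result


# Hypothetical toy data: anonymized keys, not real records.
group_of = {"k1": "g1", "k2": "g1", "k3": "g1", "k4": "g1",
            "k5": "g2", "k6": "g2"}
language_of = {"k1": "French", "k2": "French", "k3": "French",
               "k4": "Flemish", "k5": "Flemish", "k6": "Flemish"}

shares = dominant_language_share(group_of, language_of)
```

In this toy example three of the four members of group `g1` use French, so its dominant-language share is 0.75, while group `g2` is entirely Flemish.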
The above results give an interesting indication of the linguistic fragmentation of a country, and they received a remarkable degree of media coverage at the time of publication. However, one needs to be careful when making inferences about individual preferences or biases from observed macroscopic outcomes. Nevertheless, the results do highlight some of the potential of the new massive data sets on human behavior—when coupled with appropriate conceptual and analytical frameworks—to improve our understanding of how people interact with one another.
Several studies in the last few years have highlighted the potential these approaches have for the study of human behavior but, most likely, this is just the beginning. It is to be expected that many societal events are associated with changes in population-level behavioral patterns, ranging from communication patterns to movement patterns of people. The potential future applications are many, such as predicting disease outbreaks and epidemics; learning about the migration patterns of people following natural disasters, like floods and earthquakes; anticipating political crises, mass violence, and riots; detecting changes in commodity prices, availability of resources, and stability of the economy; and gauging the efficiency and efficacy of government responses to such adverse circumstances.
Sources of available data are increasingly becoming so diverse, and the quality of those data so rich, that we are, in many ways, beginning to be able to ask and answer some of our biggest questions about the large-scale structure of human interactions and the accompanying modes of collective behavior. At the end of our meeting, Lisa also seemed convinced that there were certainly enough rich data for her to carry out the thesis project. In passing, I mentioned that, as an undergraduate, I had only used email for a few years; web pages were primitive and slow; there was no sign of Facebook or Twitter; and I had yet to purchase my first cellular phone. “Wow! I find it hard to imagine a world like that,” she exclaimed. A question of perspective, I suppose. I would have found it hard to imagine, as an undergraduate, that only 15 years later, we would be entering an unprecedented period in history in terms of our ability to learn about human behavior. It probably would have seemed unbelievable—and, in a way, it still does. Nevertheless, we’re already there.
Figure 1. Emergence of social groups, or communities, from microscopic interactions. As we increase the effect of memory of past interactions (from the top left panel to the bottom right panel), the groups become tighter and more exclusive. See reference (1) for details.
Figure 2. Division of 2.6 million people into social groups. The color of a circle corresponds to the majority language of the group. Predominantly French-speaking groups are almost exclusively connected to other predominantly French-speaking groups, and the same applies to Flemish-speaking groups. There is only one group that mixes French and Flemish speakers. See reference (4) for details.
(1) “Emergence of communities in weighted networks” by J. M. Kumpula, J.-P. Onnela, J. Saramäki, K. Kaski, and J. Kertész, Physical Review Letters, 99, 228701, 2007.
(2) “Models of segregation” by T. Schelling, American Economic Review, 59(2), 488, 1969.
(3) Video using eggs to demonstrate the model: http://www.youtube.com/watch?v=JjfihtGefxk
(4) “Fast unfolding of communities in large networks” by V. D. Blondel, J.-L. Guillaume, R. Lambiotte, and E. Lefebvre, Journal of Statistical Mechanics: Theory and Experiment, 2008(10), P10008, 2008.