Global Pulse is currently working with USAID on a feasibility study to understand what social data can tell us about loans, and people’s access to finance in Kenya. To start, we’re looking more generally at Kenya’s digital landscape to create a map of where relevant digital data might lie. This includes doing some base-level research on social media use in Kenya, and an overview of potentially relevant mobile services.
For the feasibility study, we have access to all Tweets originating from Kenya over a span of about two years, which we will analyze for relevant keywords using some of the most advanced social media analytics technology on the market today. But in order to responsibly harness big data for global development decision-making, programme planning or impact monitoring, we must understand whose voices are being represented in this data. This blog post focuses on the challenges of establishing the demographic breakdown of Twitter use, although many similar problems exist for other social media data.
Why having social media demographic data matters for “Big Data for Development”
Over the past 5 years, Kenya has emerged as a tech leader in Africa. The rise of mobile phone use in Kenya is a well-told story, and the mobile money service mPesa has changed the way many entrepreneurs globally think about providing services. The oft-quoted anecdotes – farmers communicating directly with market agents, the rapid uptake of mobile loans – are certainly impressive.
All of these services create data. However, despite the unprecedented amount of digital data that is “out there,” details on who is represented in this data and who is not remain murky. People generally associate digital activity with capturing detailed user information and metrics, so it may come as a surprise to some that it is difficult to confidently establish the demographics of social media users. Many popular channels, such as Twitter or Facebook, do not require people to state their gender, age or location, so figures may be skewed by who is willing to volunteer this information. In places where mobile phone contracts are common, it may be more straightforward to establish demographics, but pay-as-you-go phone services do not require any sign-up information.
This gap means that even after we have access to an archive of social media data relating to loans in Kenya, contextualizing it will be difficult.
Certainly there is no need for this information on the individual level, but if big data is to be used for designing policy or informing global development programs, we need to be able to parse trends according to demographic information. Do urban entrepreneurs face different constraints than their rural counterparts? Are dairy farmers better served than energy entrepreneurs? Are loan instruments less accessible in Western Kenya than in areas closer to Nairobi? Without better demographic information, our ability to use social data to help answer these questions will be limited.
Understanding the digital landscape in Kenya: use of existing demographic sources
One source of ‘ground truth data’ about the digital landscape in a given country is statistics on overall ICT use. Each year, the ITU publishes data on ICT penetration rates around the world. In Kenya, the Communications Commission of Kenya publishes detailed quarterly reports on the ICT sector. These statistics form an important piece of the picture, showing growth in mobile communications. However, they don’t disaggregate the data by geography, gender, income level, etc. In order to understand who is represented in big data, and who is not, it is important to have much more detailed data.
A variety of researchers have also undertaken impressive studies to understand demographic trends in mobile phone usage.
For example, Financial Sector Deepening Kenya surveyed 646 communities across Kenya. The survey found that there was some level of mobile ownership in every income bracket, although, as might be expected, there was a higher concentration of ownership at higher income levels. Among those who had never used a phone, the vast majority (81%) were women. Except for those under 17 and those over 60, the survey found comparable rates of phone ownership across age brackets. Those with little education or who were effectively illiterate were disproportionately unlikely to own mobile phones; however, once other demographic data was added in, education level and reading ability proved to have no predictive power.
This snapshot helps give shape to mobile trends in Kenya, but the surveys were done in 2009. In mobile years this is hopelessly out of date. While the explosive growth in mobile subscriptions in Kenya has been leveling off over the past year (the market may be nearing saturation), new frontiers of growth, including smartphones and tablets, are only beginning.
Behind the big numbers: who’s online in Kenya? Exploratory methods for identifying demographics for social data
Ad-hoc research to meet particular needs has been done by social media companies and by groups like iHub Research. From research done by Portland Communications, for example, we know that in 2011 Kenya was the second most prolific country in terms of Tweets in sub-Saharan Africa. We also know that many users rely on Twitter as a primary source of information for news, and that across Africa 60% of Twitter users are between 21 and 29 years old. We don’t know, however, how many people in Kenya are on Twitter, much less any reliable demographic breakdown.
Regarding Twitter profile information, it’s worth noting that only 2.02% of users register with a specified geography (as it requires opting in to Twitter’s location services). Information on gender is similarly sparse.
There are some data mining techniques that can be employed to address these problems. For example, some “Twitter data resellers” have the ability to extract more metadata from each Twitter post, including some of the self-declared information from profiles. In other instances, location-related tweets can be contextualized using algorithms. For example, if I tweet frequently about London, it can be assumed, based on the size of that city, that I mean London, England. However, if I also mention Ontario a number of times, it can be inferred that I mean the City of London, Ontario, Canada instead. An algorithm can quickly detect that pattern by scanning archives of tweets.
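To make the London example concrete, here is a minimal sketch of that kind of disambiguation. It is not the method used by any particular reseller or by our study; the tiny gazetteer, its rough population figures, the context words, and the function name resolve_place are all assumptions made up for illustration. The idea is simply to prefer the candidate whose context words appear most often in a user's tweets, and to fall back on the larger city when there is no other evidence.

```python
import re
from collections import Counter

# Hypothetical mini-gazetteer: candidates for an ambiguous place name, with
# rough, illustrative populations and a few context words for each.
GAZETTEER = {
    "london": [
        {"place": "London, England", "population": 8_000_000,
         "context": {"england", "uk", "thames"}},
        {"place": "London, Ontario, Canada", "population": 400_000,
         "context": {"ontario", "canada"}},
    ],
}


def tokenize(text):
    """Lowercase a tweet and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def resolve_place(name, tweets, gazetteer=GAZETTEER):
    """Guess which real-world place a user means by an ambiguous name.

    Candidates are ranked first by how often their context words occur
    across the user's tweets, then by population as a tie-breaker.
    Returns None if the name is not in the gazetteer.
    """
    candidates = gazetteer.get(name.lower())
    if not candidates:
        return None
    counts = Counter(tok for tweet in tweets for tok in tokenize(tweet))

    def score(candidate):
        context_hits = sum(counts[word] for word in candidate["context"])
        return (context_hits, candidate["population"])

    return max(candidates, key=score)["place"]


only_london = ["Rainy day in London", "London traffic again"]
with_ontario = ["Back in London", "Ontario winters are cold",
                "Driving across Ontario to London"]
print(resolve_place("London", only_london))
print(resolve_place("London", with_ontario))
```

With no other clues the sketch picks the larger city; repeated mentions of Ontario tip it toward the Canadian one. A real system would need a far larger gazetteer and would have to cope with misspellings, slang and place names in several languages.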
Identifying gender from Twitter profiles is not straightforward either, since the service does not require users to specify it when creating a profile. One technique to work around this challenge is to compile a database of names, by gender, and train machines to identify women versus men by matching the names in the database against the self-reported names of Twitter users.
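The simplest version of this name-matching idea can be sketched as a plain dictionary lookup, with no machine learning involved. The four-entry name table and the function names infer_gender and coverage below are hypothetical and purely illustrative; a real database would hold many thousands of names and would still misclassify people, so results like these are only meaningful in aggregate.

```python
import re

# Hypothetical, tiny name table for illustration only.
NAME_GENDER = {
    "mary": "female",
    "wanjiru": "female",
    "john": "male",
    "otieno": "male",
}


def infer_gender(display_name, name_gender=NAME_GENDER):
    """Return 'female', 'male' or 'unknown' for a self-reported profile name.

    Every word in the name is looked up in the table; a gender is returned
    only when all matching words agree with each other.
    """
    tokens = re.findall(r"[a-z]+", display_name.lower())
    votes = {name_gender[tok] for tok in tokens if tok in name_gender}
    return votes.pop() if len(votes) == 1 else "unknown"


def coverage(display_names, name_gender=NAME_GENDER):
    """Share of profiles for which the lookup produced any gender guess."""
    if not display_names:
        return 0.0
    guessed = sum(infer_gender(n, name_gender) != "unknown" for n in display_names)
    return guessed / len(display_names)


profiles = ["Mary W.", "John Otieno", "KenyaNewsDaily", "Mary John"]
for name in profiles:
    print(name, "->", infer_gender(name))
print("coverage:", coverage(profiles))
```

Reporting coverage alongside the guesses matters: organisational accounts, nicknames and names missing from the table all come back as unknown, and if those unknowns are not spread evenly across groups, the resulting gender breakdown will be skewed.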
For some countries, the language of the tweet is a key clue. In Kenya, most tweets are likely to be in a mix of English and Swahili. While there might be some Kenya-specific language clues, both of these languages are spoken in more than one country.
Other researchers have begun looking at how to infer demographic information from the data that is available, for example inferring profession or income level based on the location of the nearest cell phone tower.
But at this stage, most of these approaches are still exploratory.
Share your ideas on establishing demographics for social data
As the field of big data analytics progresses, we expect to see more and better ways of working with this data, similar to the way that our econometric brethren have advanced the frontiers of empirical economics research. We’d love to hear about measurement techniques that have been tried by others. What have you tried? What worked, sort of worked, or didn’t work at all?
Eva Kaplan is a consultant with Global Pulse and USAID and the principal researcher for the joint feasibility study on “Digital Signals and Access to Finance in Kenya.” Details about the project can be found in our research repository.
Image credit: Lina Limo by Hodag under Creative Commons