How do I become a Data Scientist?
Jun 5, 2014
I often get asked by people who are coming to the end of their Masters degree or PhD, or who are looking to change career direction: "How can I become a Data Scientist?" This question has been answered in some form by others already (the latest from the folks at Insight is great and DataTau has some riffs on the same idea seemingly every week). However, I find the question and the different answers interesting, and I find myself repeating the same information again and again, so I decided to turn it into a blog post.
A few caveats a priori:
- First and foremost, this is a personal account. If things pan out differently for you, that's just great; we are all different. This advice will best suit Physicists. As a demographic, we have our particular ways of thinking that might not suit everyone.
- The full set of material that could reasonably be in the scope of data science jobs is huge (more on that later). I hope the following will help you in your search for the foundational material; that is the shortcut I can offer you. However, you should still be prepared to put in a lot of hours working through this material in your own time.
Understand the Big Picture
You should understand what Big Data means on an intellectual level. There are two extremes of thought to consider: theoretical, model-driven approaches and empirical, data-driven approaches. Generative approaches such as Cellular Automata and Agent Based Models (ABMs) are examples of the former, and Machine Learning and Data Mining of the latter. While there is no shortage of cheerleaders for empiricism these days, you should still get to know the work done by Stephen Wolfram, Josh Epstein and Doyne Farmer, whose discussion of the use of ABMs in finance actually argues against the use of data.
Often this tension comes down to the question "does observation of a correlation negate the need for understanding the underlying causative mechanism?" The 'right' answer is neither yes nor no. The example of computational linguistics tells us that an empirical, statistical approach combined with vast amounts of data to learn from can be much more successful than a hugely complex model-based approach. But this debate is a very fiery one, with very smart people such as Noam Chomsky and Peter Norvig (Director of Research at Google) taking opposite views (see links below).
Part of being a good Data Scientist involves knowing when to be healthily skeptical of each result which pops out of your model. In some cases all that matters is that there is a correlation, other times it matters exactly what the correlation is with; that Pearson coefficient of 0.9 between rainfall in Tasmania and comment activity on Reddit is probably a fluke. The possible criticisms of Big Data along these lines are common and you should be prepared to answer them.
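To see how easily such a fluke can arise, here is a minimal sketch using only the Python standard library. It compares one random series against many other unrelated random series and keeps the strongest correlation it finds. The function names (`pearson`, `random_walk`, `best_fluke`), the seed and the series sizes are all illustrative assumptions; the "target" walk is just a stand-in for something like rainfall figures.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


def random_walk(rng, length):
    """A series of cumulative Gaussian steps; it has no real 'cause'."""
    level = 0.0
    walk = []
    for _ in range(length):
        level += rng.gauss(0.0, 1.0)
        walk.append(level)
    return walk


def best_fluke(n_candidates=1000, length=50, seed=42):
    """Strongest correlation between one walk and many independent walks."""
    rng = random.Random(seed)
    target = random_walk(rng, length)  # hypothetical stand-in for rainfall
    best = 0.0
    for _ in range(n_candidates):
        candidate = random_walk(rng, length)  # unrelated by construction
        r = pearson(target, candidate)
        if abs(r) > abs(best):
            best = r
    return best


if __name__ == "__main__":
    print("Strongest correlation found by chance:", round(best_fluke(), 3))
```

Every candidate series is independent of the target by construction, so any strong coefficient the search turns up is purely an artefact of having looked at enough candidates. The same thing can happen whenever you trawl a large dataset for "interesting" relationships.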
- Chomsky on where AI went wrong
- On Chomsky and the two cultures of statistical learning (a riposte by Peter Norvig to the above)
- The End of Theory
- The Unreasonable Effectiveness of Data
Not all Data Scientist Roles are equal
A Data Scientist working at the World Bank Open Data division, one working in the Facebook Data Science Team and another working on Computational Advertising at Yahoo! effectively all have different jobs. Some Data Scientists will work very hard to incrementally improve a very well defined metric (click-through rate, successful movie recommendations), others will have the freedom to explore a very rich multi-dimensional dataset, and some will be in between. The point is that, even within industries, the role is not very well defined and depends a lot on the company or organisation. So do your homework: read everything you can about what the company does, including reports, academic papers, blog posts and Twitter accounts.
Learn Python or R
Forget FORTRAN, C, STATA, SPSS and Mathematica. For exploratory data analysis, machine learning, reusable libraries and sharing of code, you will need to learn Python or R. But don't worry! The ROI on these languages is huge.
Python offers a very simple syntax and very broad capabilities: you can program and deploy web applications, covering everything from data collection and exploratory analysis all the way to database interaction and web hosting. However, the basic data structures are slow, the official documentation is lacking and the use of whitespace can be downright weird. The main libraries for data science are:
- Matplotlib for plotting
- NumPy for numerical and matrix operations
- SciPy for fitting, optimisation and other common operations
- Scikit-learn for machine learning, fitting and classification operations
- Pandas for time series capabilities (Greg Reda has a good intro)
- IPython is a fantastic, rich interactive environment for Python
R, on the other hand, has an esoteric syntax (common gotchas here and basic tutorials here) and data structures that seem more complex at first. However, the community is very lively and, with practice, R can deliver extremely succinct code. Good packages to look at are:
The relative merits of each are discussed in more detail here. As a biased Pythonista, I would also point out that Python is now the language used in MIT's introductory computer science courses.
Very often, people at the end of a long pipeline of analysis simply stop as soon as they have a plot or figure that meets the minimum requirements, i.e. that the points go in the right direction and the axes are labeled (optional). Although this is understandable, since it can often take months to arrive at this point, it is a mistake. Presenting information in a straightforward and informative way is a science in itself. Nathan Yau's blog and books are essential reading and the godfather of visualisation, Edward Tufte, should also be high on your list.
But in general, take some time with your figures. They might make sense to you, but can the same be said for someone who hasn't been thinking hard about it for as long as you have? Finally, don't be afraid to simply adjust the default settings of your plots; soften colours (no-one enjoys stark, pure reds and blues) and make text labels larger.
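As one small illustration of softening colours, the sketch below uses only Python's standard `colorsys` module to desaturate and slightly lighten a stark colour. The helper name `soften` and its default factors are illustrative assumptions, not settings taken from any plotting library.

```python
import colorsys


def soften(hex_colour, saturation=0.6, lightness_shift=0.1):
    """Return a less saturated, slightly lighter version of a '#rrggbb' colour.

    `saturation` scales the HLS saturation; `lightness_shift` is added to
    the HLS lightness. Both defaults are arbitrary starting points.
    """
    digits = hex_colour.lstrip("#")
    # Convert the three hex pairs into floats in the range 0..1.
    r, g, b = (int(digits[i:i + 2], 16) / 255 for i in (0, 2, 4))
    h, l, s = colorsys.rgb_to_hls(r, g, b)
    s = s * saturation
    l = min(1.0, l + lightness_shift)
    r, g, b = colorsys.hls_to_rgb(h, l, s)
    return "#{:02x}{:02x}{:02x}".format(
        round(r * 255), round(g * 255), round(b * 255)
    )


if __name__ == "__main__":
    for name, colour in [("red", "#ff0000"), ("blue", "#0000ff")]:
        print(name, colour, "->", soften(colour))
```

The returned hex strings can be handed to most plotting libraries; in Matplotlib, for instance, they can be passed as the `color` argument of a plotting call.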
I Studied X, Can I Become a Data Scientist?
Very few people emerge from formal education programs qualified to be Data Scientists (although there are now some programs looking to turn out fully-functioning Data Scientists). The Insight white paper deals with this dearth of qualified data scientists very well.
Traditionally, computer scientists and electrical engineers were the hoarders of knowledge about natural language processing, dimensionality reduction and other seemingly esoteric things. More recently though, these skills have become important to academic biologists, geneticists, astrophysicists, financiers and applied mathematicians, as well as others. All of these fields have a strong data science component. In general, the harder the science, the better. Aspects which are best learned through a long formal education include a rigorous scientific method (hypothesising, checking, testing, reflecting, concluding) and an ability to translate complex ideas into actionable models and code. That said, data science is often not for the purist or the researcher frightened of the occasional hack or sub-optimal result. Consider the problem of geolocating an arbitrary string (i.e. going from 'West London' to 51.5 N, 0.1 W): even Google Maps, with all of its training examples, cannot get this right every time. Sometimes solutions can only be so good, and not perfect. If you are used to the physical sciences where the gravitational constant has been measured with an accuracy of 10−1, a little mental adjustment is needed. Finally, an inquisitive nature is essential. The rush of finding something interesting in a data set is more important than what you did to get there.
That said, I believe that all disciplines are capable of making the transition and your diverse background can bring a much needed new perspective to a team. Data science invariably deals with the actions of people, so data-minded social scientists are being put to good use at places such as Facebook. People who are hiring do understand that to some extent they must 'grow' their data scientists, but despite that, you will invariably need to meet them halfway and fill in some gaps yourself. The best way to show you can make the transition to applied data science is to show that you have already made it. Which brings me to my next point...
Start a Pet Project
There are so many lessons to learn in data analysis and after a point one cannot learn these from a book or a blog post. Therefore, just start analysing something, anything, to get some practice. Learn the common operations in your own time and start to build an intuition for what to look for in different kinds of data. I am strongly of the opinion that there is value in 'reinventing the wheel' and coding up some common operations yourself, even if they exist already. You will learn a lot for yourself and come to appreciate the work that goes into making a robust and optimised boilerplate library. Here are some of my favourite free, open places to get data
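As an example of the kind of wheel worth reinventing once, here is a minimal sketch of a straight-line least-squares fit written from scratch in plain Python. The function name `fit_line` and the toy data are illustrative assumptions; a library routine will be faster and more robust, which is rather the point of the exercise.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length sequences of at least 2 points")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of squared deviations of x, and of the x-y cross deviations.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    if sxx == 0:
        raise ValueError("all x values are identical; slope is undefined")
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


if __name__ == "__main__":
    slope, intercept = fit_line([0, 1, 2, 3], [1, 3, 5, 7])
    print("slope:", slope, "intercept:", intercept)
```

Once your own version works, compare its answers with a library routine such as NumPy's `polyfit` on the same data. Any disagreement is usually a lesson in edge cases or numerical stability.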
Learn to Communicate
Before I left academia, if someone had asked me what should be added to a data science curriculum, I would have answered that a particular class of classification or optimisation algorithms was essential, or this or that suite of command line tools. After just a short time spent in 'The Real World', I realised that once you have a decent grounding, any new topic or technique can be picked up very easily in an arbitrarily short time. What is more important, and less easily learned, is how to communicate the value and insight in common data science techniques to non-experts.
In 'The Real World', people are busy, have limited budgets and their own targets to reach. Unlike in academia, they don't have time to indulge in exploring something 'because it looks interesting'. Likewise, you cannot rely on people to read your carefully written report, especially if they feel that they don't understand it. So you must be able to explain, on the spot and to a business major, what it means that you used the first principal component of a set of metrics as the covariate rather than a simple univariate linear regression. What is worse is that the business major will not take your word for it. On the contrary, the business major will have a good education, will ask you tough questions and will generally be wary of you telling her/him to work in a different way.
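For readers who want to unpack the jargon in that example, the sketch below shows one way to compute the first principal component of a small table of metrics in plain Python, using power iteration on the covariance matrix. The function name, the seed and the toy data are illustrative assumptions. In practice you would normally standardise the metrics first and use a library implementation.

```python
import math
import random


def first_principal_component(rows, iterations=200, seed=0):
    """Return (direction, scores) for the first principal component.

    `rows` is a list of observations, each a sequence of metric values.
    The direction is found by power iteration on the sample covariance
    matrix; its overall sign is arbitrary.
    """
    n = len(rows)
    d = len(rows[0])
    means = [sum(row[j] for row in rows) / n for j in range(d)]
    centred = [[row[j] - means[j] for j in range(d)] for row in rows]
    cov = [
        [sum(r[i] * r[j] for r in centred) / (n - 1) for j in range(d)]
        for i in range(d)
    ]
    # A random start avoids beginning exactly orthogonal to the answer.
    rng = random.Random(seed)
    v = [rng.uniform(-1.0, 1.0) for _ in range(d)]
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            raise ValueError("the metrics have no variance to explain")
        v = [x / norm for x in w]
    # One score per observation: its position along the main direction.
    scores = [sum(r[j] * v[j] for j in range(d)) for r in centred]
    return v, scores


if __name__ == "__main__":
    # Three hypothetical metrics that all move together.
    data = [(x, 2 * x, -x) for x in range(10)]
    direction, scores = first_principal_component(data)
    print("direction:", [round(c, 3) for c in direction])
```

The plain-language version for the business major might be: several metrics that mostly rise and fall together have been boiled down into one summary number per observation (the score), and that single number is what goes into the model.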
Get Acquainted with End-to-End Development
- Startup Engineering (Coursera)
- Web Development (Udacity)
- Why Becoming a Data Scientist is Not as Easy as You Think (in response to this optimistic article)
Data Science for Social Good
It has been most interesting to see Data Science impact different sectors at different times. Advertising, logistics and marketing caught the bug early on, Google Translate changed linguistics forever, and quantitative finance has steadily become more empirical in recent years, not to mention law, bibliometrics and election campaigning. But one of the recent additions to the fold is the development sector.
Here at United Nations Global Pulse, we consider real-time data sources such as social media and anonymised mobile phone records to augment the more traditional data sources used by development practitioners such as costly surveys and censuses. We have labs in Kampala and Jakarta and we often advertise for interns as well as collaborations with industry and academia.
Further challenges that may arise in this context include potentially low data volumes in places that have low internet penetration, as well as questions that are being answered with data for the first time (or at least with high-frequency data). Needless to say, an interest in current affairs, the intersection of technology with society and global development is a must, in addition to the above hard technical skills.
Other organisations that look at similar problems include DataKind, who also deal with domestic charitable problems through the lens of Big Data. DataKind chapters are currently expanding into cities globally and have a lively community in New York. Finally, Rayid Ghani (the former Chief Scientist for the Obama campaign) is running the Data Science for Social Good program in Chicago.
Image credit: Raffi Krikorian, 2010.