Editor’s note: This is the first of a two-part interview.
The term “data scientist” has not yet jumped the shark. That’s according to Michael Driscoll, the chief technology officer and co-founder of Metamarkets Group Inc., a San Francisco-based startup company that delivers predictive analytics services to digital, social and mobile media companies.
While Driscoll has embraced the term to describe an emerging role in the field of analytics and business intelligence, others are not quite ready to do so, and the title is a hotly debated one.
Driscoll likens data scientists to civil engineers.
“Civil engineers are part physicist and part construction worker,” he said. Similarly, he added, the data scientist has to be able to find a balance between the theoretical and the practical within the data landscape.
SearchBusinessAnalytics.com recently sat down with Driscoll to talk about data scientists and how they’re using predictive analytics to shed light on the future.
What is data science?
Michael Driscoll: Data science is a neologism, and thus, like all neologisms, it’s an evolving term and title. Effectively, data scientists are those who combine the theoretical expertise of mathematicians and statisticians with the hard-nosed engineering chops of software developers. In the last decade, there’s been this renaissance in the field of machine learning, which exists at the intersection of statistics, applied mathematics and computer science. But for all of this theoretical work to be used, it ultimately needs to be coded. So data scientists are hybrids who can combine these two -- the theoretical and the practical.
When you talk about the practical piece of data science, what are you referring to?
Driscoll: I generally frame the three skills of data science as, first, "data munging," which involves the ability to slice and dice, transform, extract and work with data in a facile, fluid way. The second skill is data modeling, which basically means taking a set of data, developing a hypothesis about a pattern in the data and testing that hypothesis with statistical tools. The third skill is data visualization. Once you have transformed data into a usable form -- the first skill -- and you have developed a model about how some features of the data may relate to some set of observations, some outcomes of the data -- the second skill -- you then need to convey that insight in a way decision makers understand. That requires the ability to tell a story or build a narrative visually, and that’s where data visualization comes in.
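The three skills Driscoll describes can be sketched in miniature. The sales records below are invented for illustration; the point is only the munge-model-visualize sequence, not the specific numbers.

```python
import statistics

# 1. Data munging: raw log lines -> clean (hour, sales) pairs,
#    dropping records that don't parse.
raw = ["09:00,120", "10:00,180", "11:00,95", "12:00,210", "bad_row"]
clean = []
for line in raw:
    parts = line.split(",")
    if len(parts) == 2 and parts[1].isdigit():
        clean.append((parts[0], int(parts[1])))

# 2. Data modeling: a simple hypothesis -- which hours show
#    above-average sales?
mean_sales = statistics.mean(s for _, s in clean)
above_average_hours = [h for h, s in clean if s > mean_sales]

# 3. Data visualization: a crude text bar chart to tell the story.
for hour, sales in clean:
    print(f"{hour} {'#' * (sales // 20)}")
```

Real pipelines swap in tools like pandas and a plotting library, but the division of labor is the same.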
Why is building a narrative so important?
Driscoll: In this age of massive amounts of information and massive outputs of information, we need to have ways of consuming information at a commensurately high rate. Data visualization is one of those ways. In fact, it’s probably the most important way we can consume information at a very high rate.
How do predictive analytics and data science fit together?
Driscoll: Data is what data does. The goal of all of this data science ultimately is to predict the behavior of consumers, of systems. Effectively, just having data surface insights isn’t enough. You want to be able to make predictions about what’s going to happen next. According to Karl Popper, the entire goal of science is to make predictions that can be falsified. And making predictions is really the end goal of all of the work that [data scientists] do. It’s looking forward, not looking backward. One might say that business intelligence and this world of reporting are all about the past; predictive analytics is about the future.
And yet, some say predictive analytics requires looking back in order to predict the future.
Driscoll: Absolutely. The goal of predictive analytics is to study the past but ultimately to generate predictions about the future. I’ll give you an example. Facebook was trying to understand what types of user behavior on the Facebook system would lead to higher engagement with the platform -- that is, a higher likelihood that [users] would stay active three months after signing up. So they looked historically, at the past [activities] of all of their users. And they looked at gender, how many friends they had, what colleges they were at. They looked at all of these different observed user features and then, for three months afterward, they studied which of those observed features corresponded most with a high level of engagement later on. What they found was that the feature correlating most strongly with active Facebook use three months later was the number of friends a user had. That was a predictive analytic insight. As a result, once people signed up on Facebook, [the company] worked hard to suggest that as many people as possible join your network.
Predictive analytics is essentially about connecting observed events with outcomes; that’s probably the simplest way to put it. There are lots of ways to slice it, but ultimately, you're building a mathematical model of a system. To test whether that mathematical model is correct, you make predictions and then you observe whether future events actually confirm or refute your hypothesis about the system.
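That predict-then-check loop can be made concrete. Here is a deliberately naive model, fit to invented past data, then scored against "future" events it has never seen -- the numbers and the friends-threshold rule are illustrative assumptions, not a real method:

```python
# Hypothetical history: (friend count at sign-up, retained three months later?)
past = [(2, 0), (15, 1), (30, 1), (1, 0), (25, 1), (4, 0), (12, 1)]

# A deliberately simple "mathematical model" of the system, chosen by
# eyeballing the past data: predict retention when friends >= 10.
def predict(friends):
    return 1 if friends >= 10 else 0

# Popper-style test: confront the model with events it has never seen.
future = [(3, 0), (20, 1), (28, 1), (8, 1)]
hits = sum(predict(f) == outcome for f, outcome in future)
accuracy = hits / len(future)  # the one miss shows where the model fails
```

The model gets three of the four future cases right; the miss is exactly the kind of refuting observation that tells you where the model needs revision.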
But do you really need a data scientist to build your models?
Driscoll: Here’s an example of a predictive model: You want to look at features of credit card purchase behavior and whether or not each purchase was fraudulent. Let’s say your two features are the time of day and the country of the purchase. In some cases, simply visualizing the number of fraudulent credit card transactions by country will jump out at you. Any purchases made in Estonia while the credit card holder is in America are fraudulent purchases. You don’t really need a statistical model to tell you that. It’s simply plotting the data. The truth is that when differences become small, you need to rely on statistics to tell you whether the trends you observe are significant. The obvious things are easy. It really comes down to the much more nuanced, smaller differences that require statistics to tease out the difference between something that’s noise and something that’s signal.
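Both halves of Driscoll's fraud example can be sketched with invented transaction counts: the obvious case needs only a tally, while the subtle case calls for a significance test. A standard two-proportion z-test is one way to do the latter; every figure below is made up for illustration.

```python
import math

# Hypothetical labeled transactions: (country, hour, is_fraud).
txns = [("US", 14, 0)] * 480 + [("US", 2, 1)] * 20 + [("EE", 3, 1)] * 5

# The obvious case: tallying fraud by country jumps out on its own --
# every Estonian purchase here is fraudulent. No model needed.
by_country = {}
for country, _, fraud in txns:
    n, f = by_country.get(country, (0, 0))
    by_country[country] = (n + 1, f + fraud)

# The subtle case: is a small gap between two time-of-day fraud rates
# signal or noise? A two-proportion z-test helps tease that out.
def z_test(f1, n1, f2, n2):
    p1, p2 = f1 / n1, f2 / n2
    pooled = (f1 + f2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    return (p1 - p2) / se

# E.g. 6% fraud overnight vs. 3.5% in the afternoon (invented counts).
z = z_test(18, 300, 7, 200)
# z comes out well below the ~1.96 cutoff for 95% confidence, so this
# small difference could easily be noise despite looking real.
```

The Estonia tally answers itself; the time-of-day gap is exactly the nuanced case where, as Driscoll says, statistics must separate noise from signal.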