Editor’s note: This is the second of a two-part interview. Read part one here.
Data scientists are beginning to make a name for themselves in analytics and business intelligence. As data grows in volume, velocity and variety, they’re certain to play an increasingly important role in separating what Michael Driscoll, chief technology officer and co-founder of the analytics start-up Metamarkets, calls the noise from the signals.
SearchBusinessAnalytics.com recently sat down with Driscoll to talk about how data has changed in recent years and how the tools to analyze that data have also changed.
How have the kinds of data businesses are tapping into changed in recent years?
Michael Driscoll: There are a few trends underneath this. The first is the rise of sensor technology. That would be cell phones, navigation devices or point-of-sale instruments on cash registers. Increasingly, we have these sensors in our cars and in our homes, tracking actions and events and consumer choices and purchases. That’s one thing that’s causing this massive increase in volume and velocity of data. Before, we had a lot of these devices that were chirping, but no one was listening. It’s part of this trend -- the exponential decrease in the cost of bandwidth, storage and compute power has made it worthwhile to keep data that previously would have been too expensive to keep.
The biggest class of data that’s emerging as the most interesting class of data is transaction data, transaction streams. Previously, systems were designed to roll those events up into more of a summary form, but now, increasingly, it’s possible for people to do analysis at the lowest grain of data, which is at the transaction level. Transactions are everything from when you go to the credit card machine at the supermarket and swipe your card, to when you go through an E-ZPass lane on the highway, to when you make a phone call. All of these transactions have many attributes attached to them and typically, as they’re occurring or after they occur, the data from those transactions is being pulsed into servers around the world. Collectively, these transactions represent the pulse of the planet. That to me is the most interesting type of structured data out there.
Why do you find transaction data the most interesting?
Driscoll: Transactions represent facts, and when you’re building models it’s much easier to build models over factual actions than it is over sentimental speech. By analogy to my own experience, if we were building a model for customer retention when I was at this North American telco two years ago, we could have pulled the logs from all of the customer phone calls and attempted to do an analysis of the transcripts of customers who said they were leaving this provider. We could have done that and performed some sentiment analysis. People may have claimed, and in fact people often claim, that it’s about the signal quality on their cell phones and that they were getting a lot of dropped calls. Therefore, they were upset, and that’s why they were going to cancel their contract. When we actually looked at the facts of the data, we found there wasn’t a high correlation between signal quality, the number of dropped calls and whether customers canceled their contracts. What was much more important was whether or not a friend, someone they spoke with frequently, had canceled a contract the month before. That’s the difference. Structured data can tell stories that are very hard to tease out of unstructured data.
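The comparison Driscoll describes can be sketched in a few lines of Python. This is a minimal illustration, not the telco's actual model: the eight customer records, the field names and the `churn_rate` helper are all invented for the example, and the numbers are toy values chosen only to show the shape of the analysis.

```python
def churn_rate(customers, flag):
    """Return the share of churned customers, split by a True/False attribute."""
    rates = {}
    for value in (True, False):
        group = [c for c in customers if c[flag] == value]
        # None signals an empty group rather than dividing by zero.
        rates[value] = sum(c["churned"] for c in group) / len(group) if group else None
    return rates


# Hypothetical toy records: every field name and value here is made up.
customers = [
    {"many_dropped_calls": True,  "contact_churned_last_month": True,  "churned": True},
    {"many_dropped_calls": True,  "contact_churned_last_month": False, "churned": False},
    {"many_dropped_calls": False, "contact_churned_last_month": True,  "churned": True},
    {"many_dropped_calls": False, "contact_churned_last_month": False, "churned": False},
    {"many_dropped_calls": True,  "contact_churned_last_month": False, "churned": False},
    {"many_dropped_calls": False, "contact_churned_last_month": True,  "churned": True},
    {"many_dropped_calls": True,  "contact_churned_last_month": False, "churned": True},
    {"many_dropped_calls": False, "contact_churned_last_month": False, "churned": False},
]

by_dropped_calls = churn_rate(customers, "many_dropped_calls")
by_contact_churn = churn_rate(customers, "contact_churned_last_month")
```

In this toy set, the churn rate is the same whether or not a customer had many dropped calls, while it differs sharply depending on whether a frequent contact canceled the month before. A real analysis would use far more records and a proper statistical test.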
How are these new data sources changing the way models are built?
Driscoll: Until recently, a lot of statistical modeling done over real-world data was typically performed over very small data sets. Or, I should say, a lot of statistical modeling was done over summarized data sets. With the rise and the availability of fine-grained transaction data on the scale of billions of events per day, it’s changed the way businesses build models about their customers. It’s made those models more complex, more powerful and more challenging. Ultimately, in terms of the time granularity of the models, it’s changed the scope of modeling, from talking about how customers behave over long periods of time -- whether that be quarters or months -- to how customers behave over the span of just minutes.
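The shift in time granularity can be illustrated with a short Python sketch. The timestamps below are hypothetical, and `bucket_counts` is a helper invented for this example; the point is only that the same event stream looks very different when rolled up by month than when counted by minute.

```python
from collections import Counter
from datetime import datetime


def bucket_counts(timestamps, fmt):
    """Count events per time bucket, where fmt is a strftime pattern for the bucket."""
    return Counter(ts.strftime(fmt) for ts in timestamps)


# Hypothetical transaction timestamps, made up for illustration.
events = [
    datetime(2011, 3, 1, 9, 0, 5),
    datetime(2011, 3, 1, 9, 0, 40),
    datetime(2011, 3, 1, 9, 1, 10),
    datetime(2011, 3, 15, 17, 30, 0),
    datetime(2011, 4, 2, 8, 15, 0),
]

by_month = bucket_counts(events, "%Y-%m")             # summarized view
by_minute = bucket_counts(events, "%Y-%m-%d %H:%M")   # fine-grained view
```

The monthly view collapses the burst of three events on the morning of March 1 into a single count for March, while the per-minute view preserves it.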
And the tools? What do those look like for a data scientist?
Driscoll: When you move from modeling relatively small, high-level summarized data to modeling over large-scale transactional logs, it is no longer possible to build models exogenously from the system that holds the data. So, one consequence has been that data scientists have increasingly had to find ways of moving the analytics to the data rather than moving the data to the analytics. That’s because data is heavy, and analytical algorithms are light. So, there’s been a real push in the last couple of years to move analytics into the database. As far as tools go, there’s more of a requirement now than in the past for a data scientist to be able to write code that can run inside of a database or write code that can scale.
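A small Python sketch using the standard-library `sqlite3` module shows the difference between the two approaches. The table name, columns and sample rows are assumptions made for the example, and an in-memory SQLite database stands in for whatever system actually holds the data.

```python
import sqlite3

# In-memory database as a stand-in for a production data store.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE transactions (customer TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO transactions VALUES (?, ?)",
    [("a", 10.0), ("a", 5.0), ("b", 2.5), ("b", 7.5), ("b", 1.0)],  # made-up rows
)


def totals_moving_data(conn):
    """Move the data to the analytics: fetch every row, then sum in Python."""
    totals = {}
    for customer, amount in conn.execute("SELECT customer, amount FROM transactions"):
        totals[customer] = totals.get(customer, 0.0) + amount
    return totals


def totals_in_database(conn):
    """Move the analytics to the data: the database sums, only results come back."""
    rows = conn.execute(
        "SELECT customer, SUM(amount) FROM transactions GROUP BY customer"
    )
    return dict(rows)
```

Both functions return the same totals, but the first transfers every transaction out of the database, while the second transfers one row per customer. With five rows the difference is invisible; with billions of events it decides whether the analysis is feasible.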
There’s so much talk about Hadoop these days. How does that fit in?
Driscoll: Hadoop is a platform for large-scale data processing, and ultimately, if you want to build models over large-scale data, you’ve got to find a way to do your modeling inside the Hadoop platform. And there’s an emerging set of tools that allows folks to do that. One is called Mahout; it’s an open source machine learning toolkit. That’s probably the one that’s got the most traction.
What do you mean by “large-scale data?”
Driscoll: Small data is data that can fit in RAM [random-access memory], in-memory, on your desktop. Medium data is data that can fit on a single machine. So, small data is from 0 to 10 gigs; medium data is from 100 gigs to a terabyte and can fit on a single hard drive. Big data is data that cannot fit on a single machine; it must be distributed over many machines. Ultimately, if you want to do big data analytics, you’ve got to find a way to write distributed algorithms that also run in parallel over many machines. That’s effectively what Hadoop is -- a platform for doing distributed computing.
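The general shape of such a distributed algorithm can be sketched in plain Python. This is not Hadoop code and uses none of its APIs: the chunks, record fields and function names are invented for the example, and a thread pool on one machine stands in for many machines working in parallel.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from functools import reduce


def map_chunk(chunk):
    """Map step: count transaction types within one chunk, independently of the others."""
    return Counter(record["type"] for record in chunk)


def merge(left, right):
    """Reduce step: combine two partial counts into one."""
    return left + right


def count_by_type(chunks):
    # Threads stand in for separate machines; each handles one chunk.
    with ThreadPoolExecutor() as pool:
        partials = list(pool.map(map_chunk, chunks))
    return reduce(merge, partials, Counter())


# Hypothetical data, pre-split into chunks as if stored on three machines.
chunks = [
    [{"type": "card_swipe"}, {"type": "phone_call"}],
    [{"type": "card_swipe"}, {"type": "toll"}],
    [{"type": "phone_call"}, {"type": "card_swipe"}],
]

totals = count_by_type(chunks)
```

The key property is that each chunk is processed without reference to any other, so no single worker ever needs to hold the whole data set; only the small partial counts are brought together at the end.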
We’ve talked about open source tools with Hadoop and Mahout. Why are data scientists drawn to them?
Driscoll: The most popular tool for data science -- both open source and commercial -- these days is R, which is an environment for statistical computing and data visualization. There are a few reasons why open source has such a draw for data scientists. One is that R has a large community of users in both academia and industry. Many of those users have created libraries that allow someone to use a new clustering algorithm or to find a better way of doing a logistic regression or a faster method for identifying statistical anomalies. All of these libraries created by users of the tool are shared freely. Right now R has thousands of these libraries that are made available through a website called CRAN -- the Comprehensive R Archive Network.
I think the draw is that data science, like regular science, advances most quickly when it’s done in the open. Because this field is changing so quickly, the open source community is one that is able to disseminate new ideas and new approaches, so new techniques can flow quickly between practitioners. Conversely, if you look at tools such as Matlab or SAS, the time it takes for a new algorithm to be discovered and implemented in a commercial piece of software can be months. Commercial software, by its very nature, is going to move much more slowly in adoption than open source.