Facebook Inc. and LinkedIn Corp. are primarily known for their social networking prowess, but they're also credited with helping to formalize an emerging job category: the data scientist.
The data scientist position, which is also gaining ground at companies such as Google, Foursquare, Groupon and even financial services firm Zions Bancorporation, typically involves a combination of analytical, statistical, machine-learning, coding and quantitative skills, underscored by the harder-to-teach characteristics of curiosity and a willingness to explore large amounts of data.
“I used to call them data mining engineers or sometimes data miners or computational statisticians, but scientists seems apropos these days,” said Usama Fayyad, former chief data officer at Yahoo Inc. and current chairman and chief technology officer at ChoozOn Corp. “The complexity of data in volume, velocity and variety has increased enough to justify that.”
While the exact definition may vary from one organization to the next, the intimate relationship between data scientists, advanced analytics and “big data” doesn’t.
“Now that companies can reliably store massive quantities of data, the next step is what to do with the data you have,” said Monica Rogati, a senior data scientist at LinkedIn who has a doctorate in computer science from Carnegie Mellon University. “That’s where data scientists come in.”
The "massive quantities of data" Rogati is referring to means terabytes and beyond, though she wouldn't put a number on the amount LinkedIn works with. But as Fayyad alluded to, volume is only one aspect of the multifaceted big data definition, as businesses are also facing the need to store and analyze a growing variety of data produced at an ever-quickening pace.
Tools for tapping into big data
To take advantage of the bigger and deeper pool of information, data scientists are tasked with wading through the noise to find the nuggets. In addition, organizations increasingly are interested in shining a brighter light on future behavior by using predictive analytics tools in an effort to uncover emerging trends and patterns. And with the larger data sets and new data sources that characterize big data, companies can potentially create better predictive models, which in turn can more accurately inform decisions on issues such as stocking store shelves and pinpointing fraudulent behavior.
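To make the shelf-stocking example concrete, here is a minimal sketch of the kind of predictive decision involved. The sales figures, trailing-average forecast and safety factor are all illustrative assumptions, not anything described by the companies in this article:

```python
# Hypothetical example: forecast next period's demand as a trailing
# average of recent sales, then order enough stock to cover the
# forecast plus a safety margin.

def forecast_demand(sales_history, window=4):
    """Trailing-average forecast over the last `window` periods."""
    recent = sales_history[-window:]
    return sum(recent) / len(recent)

def restock_quantity(sales_history, on_hand, safety_factor=1.2):
    """Units to order so stock covers forecast demand plus a margin."""
    target = forecast_demand(sales_history) * safety_factor
    return max(0, round(target - on_hand))

weekly_sales = [120, 135, 128, 140, 150]  # hypothetical unit sales
print(restock_quantity(weekly_sales, on_hand=60))
```

Real predictive models are far richer than a trailing average, of course; the point is only that a model turns stored history into a forward-looking decision.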
Generating some predictive insights from simple data sets or easy-to-understand information might require only basic data plotting skills, said Michael Driscoll, co-founder and chief technology officer at Metamarkets Group Inc., a San Francisco-based startup that provides real-time reporting and predictive analytics services to companies in the online media market. But as data becomes larger and more granular, Driscoll added, skilled data scientists can come in handy.
“It really comes down to [understanding] the much more nuanced, smaller differences that require statistics to tease out the difference between something that’s noise and something that’s signal,” he said.
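One standard statistical tool for the noise-versus-signal question Driscoll raises is a two-sample t statistic. The sketch below uses Welch's t with made-up click-rate data and an informal "large |t| suggests signal" reading; none of it comes from Metamarkets, and it stands in for whatever methods a data scientist would actually apply:

```python
# Hedged sketch: Welch's t statistic as one way to "tease out" whether
# an observed difference between two groups is signal or noise.
# The data below are hypothetical.
import math
from statistics import mean, variance

def welch_t(sample_a, sample_b):
    """Welch's t statistic for the difference of two sample means."""
    se = math.sqrt(variance(sample_a) / len(sample_a) +
                   variance(sample_b) / len(sample_b))
    return (mean(sample_b) - mean(sample_a)) / se

group_a = [0.12, 0.15, 0.11, 0.14, 0.13]  # hypothetical click rates
group_b = [0.18, 0.21, 0.17, 0.20, 0.19]

t = welch_t(group_a, group_b)
print(f"t = {t:.2f}")  # a large |t| points to signal rather than noise
```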
Rogati agrees that predictive analytics is a necessary tool for data scientists; other key components of the data-scientist tool belt, she said, include data mining, text mining and data visualization technologies. Referring to the latter, Rogati said, “We’re trying to make data come to life, and visualization is a big component that helps us do that.”
LinkedIn has also embraced open source software as part of its analytics strategy, having deployed Apache Hadoop, which enables distributed processing of large data sets. “We’re using everything we can get our hands on,” Rogati said. “We are on the bleeding edge, and sometimes that means we get hurt, but we’re giving them all a try.”
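For readers unfamiliar with Hadoop's programming model, the following toy word count shows the map, shuffle and reduce phases in a single process. It is only an illustration of the model; Hadoop's value is in distributing these phases across a cluster of machines:

```python
# Toy illustration of the MapReduce model that Hadoop implements:
# map emits (key, value) pairs, shuffle groups them by key, and
# reduce aggregates each group.
from collections import defaultdict

def map_phase(lines):
    for line in lines:
        for word in line.split():
            yield word.lower(), 1

def shuffle(pairs):
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    return {key: sum(values) for key, values in groups.items()}

lines = ["big data big tools", "data scientists tame big data"]
counts = reduce_phase(shuffle(map_phase(lines)))
print(counts)
```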
In addition, the company has opened up code of its own to other organizations: Project Voldemort, a database that LinkedIn says it uses for “certain high-scalability storage problems,” and Azkaban, described as a “simple batch scheduler for constructing and running Hadoop jobs or other offline processes.”
Making code openly available is a way of giving back to the open source community, Rogati said, and being active participants in that community also creates opportunities to spot technical talent that could help LinkedIn in the future.
DIY, or call in the data scientists?
These days, analytics tools aimed at a broader audience are stamped by their vendors with phrases like “user-driven,” “self-service” and “user-friendly,” providing the business intelligence (BI) and analytics industry with a kind of do-it-yourself gusto.
While Fayyad applauds vendors for providing better tools for business users, he cautions organizations as well. “It is possible now that someone who doesn’t know data mining can run algorithms,” he said. “Sure, they can run it, but garbage in, garbage out.”
In other words, the basic problem plaguing businesses that lean heavily on Excel spreadsheets carries over into more advanced analytics scenarios as well: If the data is faulty, the conclusions based on that data will be, too. That can be especially problematic as big-data analytics and real-time BI become a reality for more and more businesses.
Illustrating the potential value of data scientists in such cases, Fayyad said that when a new analytical quandary is placed in front of his team, the workers spend several days to several weeks studying the data before starting the analytics process.
“We look for problems in the data: any errors, any systemic issues, misspellings, stuff that doesn’t make sense,” he said. “You can’t automate this. It requires an understanding; it requires someone to analyze what’s going on.”
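The checks Fayyad describes resist full automation, but parts of the screening can be scripted. Below is an illustrative sketch that flags missing fields, out-of-range values and likely misspellings; the records, valid range and region vocabulary are hypothetical, not drawn from any company mentioned here:

```python
# Illustrative pre-analysis data screening: flag missing values,
# out-of-range numbers, and probable misspellings against a known
# vocabulary. All data and thresholds are hypothetical.
import difflib

KNOWN_REGIONS = {"north", "south", "east", "west"}

def screen_record(record):
    """Return a list of data-quality issues found in one record."""
    issues = []
    revenue = record.get("revenue")
    if revenue is None:
        issues.append("missing revenue")
    elif not 0 <= revenue <= 1_000_000:
        issues.append(f"revenue out of range: {revenue}")
    region = record.get("region", "")
    if region not in KNOWN_REGIONS:
        guess = difflib.get_close_matches(region, KNOWN_REGIONS, n=1)
        hint = f" (did you mean {guess[0]!r}?)" if guess else ""
        issues.append(f"unknown region {region!r}{hint}")
    return issues

records = [
    {"revenue": 5000, "region": "north"},
    {"revenue": -20, "region": "soth"},       # bad value + misspelling
    {"revenue": None, "region": "atlantis"},  # missing + unknown
]
for i, rec in enumerate(records):
    for issue in screen_record(rec):
        print(f"record {i}: {issue}")
```

Scripts like this surface the obvious problems; judging whether the data "makes sense," as Fayyad puts it, still takes a person who understands it.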
Fayyad and Rogati, while both noting their bias, said they think that every business engaged in advanced analytics could benefit by having people with data-scientist skills.
"Everybody wants a black box to replace their most technical people," Rogati said. "[Companies] want to have a silver bullet: tools that take data, give actionable insight and tell you what to do. … But in reality, there are lots of pitfalls, gotchas or interesting ways of looking at data that may not be apparent unless you're manually looking at it" -- and unless you know what you're looking at.