Structuring a big data strategy
Forging a data science team process is more important than hiring the hard-to-fill position of a data scientist, according to a data science team leader. That makes sense, because the job title of a data scientist, which is still being defined, has expanded to the point where it has become something no one person can do.
By various accounts, the data scientist job description has already come to include R programmer, Scala developer, Hadoop jockey, data quality expert, domain specialist, algorithmic modeler and more. What is needed instead, according to Dan Mallinger, who serves as data science team lead at Mountain View, Calif.-based consultancy Think Big Analytics, is a focus on the overall data science process, rather than an expectation that one person will fill the entire role.
"Data scientist may be the worst defined job role in recent memory," Mallinger told an audience at the recent BigData TechCon event in Boston, Mass. Poorly defined it may be, but that does not hold back the going rate of data scientist salaries. The cost of the data scientist may be hidden in big data efforts.
"Things like Hadoop are cheap from a computer standpoint, but they are really expensive from a human resources point of view," he said. "We need a breadth of people and a breadth of skills" for new styles of high-volume analytics.
In his practice, Mallinger has seen a pattern that can guide the move from a lone data scientist rock star to a holistic, team-based process.
He said one of the best data science teams he had seen was "a cross-functional group" that included a business analyst and a data quality engineer, as well as product managers who tied the analytical effort back to business objectives. "They have done more than I have seen most teams do," he said.
What shared interest did they have? "They all had an interest in R," Mallinger said, referring to the popular statistical programming language. In any case, an evolution in team skills is apparently underway as variety, volume and other factors change the face of data analytics.
"Big data introduces big data business cases. They are fundamentally different," Mallinger said. These cases comprise many jobs that, in his words, "people didn't think about before."
The variety of big data was also emphasized by BigData TechCon participant Adam Laiacano, a data scientist and engineer at Tumblr, the social blogging site headquartered in New York City. He described big data as "data that was never generated before … that may have value."
He called this data "exhaust," indicating it is a byproduct of operations. It is, for example, unstructured and semi-structured data generated by users' Web activity. Laiacano likened the job of the data scientist to that of an engine turbocharger, which uses a motor's exhaust to boost the amount of air entering the cylinders, increasing horsepower. The data science process now is about working with that "collateral" data.
Laiacano said data science professionals should work to make sure big data is used, not just gathered. He said that when no one is using the data, it is a telltale sign that a big data project is on the wrong track. The first user could be a data analytics team member.
"You are user number one. If you don't use it yourself, that itself is a tip-off," he said. He advised that, for a big data project, people should first find data to study that is useful to their own work, and then see whether it is useful to business users in the organization, too.