Data scientists typically like to leave background technology to the engineers. File systems and resource management tools aren't all that interesting when your primary concern is improving the predictive accuracy of a model or uncovering a previously unknown correlation in a data set.
But with the growing number of big data tools available and the corresponding leaps in computing power, data scientists increasingly find themselves having to think about their data pipelines if they want to get the best performance from their models.
"With more compute power, it's nice to be able to run many regressions," said Brendan Herger, a data scientist at banking and financial services firm Capital One. "It does help rapid development because you have more resources available to you pretty easily."
McLean, Va.-based Capital One supports a broad variety of tools, but Herger said most of the analytics work is done through the Hadoop Distributed File System and its companion YARN resource manager. On top of the Hadoop platform, he does a lot of modeling using machine learning software from H2O.ai. Other data scientists and analysts use different front-end data science tools like GraphLab, Apache Zeppelin and Tableau. According to Herger, a strong, flexible back-end system can support rapid access to large sets of data, regardless of the front-end tool.
No need for sampling data
Herger said having this kind of computational power on the back end has allowed him to analyze full data sets, eliminating the need for sampling. He described the question of whether or not to sample as "almost a religious question," but said analyzing full data sets has a few big benefits. Chiefly, it preserves full populations in the data, along with all the potential signals that might exist there; those signals can be lost or muted when populations are split into samples.
"It's interesting to be able to not sample if you choose," Herger said. "[Computational power] is leading to a bit of a shift where it's more common to just run analyses on the whole data set."
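The tradeoff Herger describes is easy to see in miniature: a rare signal present in a full population can all but disappear from a random sample. The data below is synthetic and purely illustrative, not drawn from Capital One's work.

```python
import random

random.seed(7)

# Synthetic population: 100,000 records, 20 of which carry a rare "signal".
population = [{"id": i, "rare_signal": False} for i in range(100_000)]
for i in random.sample(range(100_000), 20):
    population[i]["rare_signal"] = True

# Analyzing the full population always preserves all 20 rare cases.
full_count = sum(r["rare_signal"] for r in population)

# A 1% random sample expects only ~0.2 rare cases, so most samples miss them.
sample = random.sample(population, 1_000)
sample_count = sum(r["rare_signal"] for r in sample)

print(f"rare cases in full population: {full_count}")
print(f"rare cases in 1% sample:       {sample_count}")
```

With enough back-end horsepower, the sampling step, and the risk of losing those 20 cases, simply goes away.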
For Daqing Zhao, director of advanced analytics at Macys.com Inc. in San Francisco, the main benefit to his team of having a powerful data architecture is speed. "We want very rapid prototyping," he said in a presentation at the TDWI Accelerate conference in Boston this month.
Zhao's team is responsible for optimizing the Macys.com website, which functions as the online arm of retailer Macy's Inc. The optimization effort includes everything from A/B testing of design changes to building product recommendation engines that deliver personalized recommendations for each customer. The main tools for big data his team works with are built around Hadoop and Spark systems that support a range of analysis tools from commercial vendors like SAS Institute and IBM, as well as open source tools like H2O, R and Mahout.
Data sandbox aids data analysis
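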
Zhao also had the data engineering team at Macys.com build his group a large data sandbox in the company's data warehouse. That allows his data scientists to transform or join data in search of meaningful correlations without altering any information at the system-of-record level.
Of all these tools, Zhao said H2O is particularly useful for predictive modeling. He first became aware of it at a recent meetup organized to demo the software, where he said it performed a logistic regression on a data set of 100 million rows in 11 seconds. Just as important, it integrates with the company's back-end data infrastructure, which makes it attractive all around.
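H2O's value is running that technique at 100-million-row scale; for intuition about what it is actually computing, here is a from-scratch sketch of logistic regression fit by gradient descent on a tiny synthetic data set. None of this is H2O's API.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(xs, ys, lr=0.5, epochs=2000):
    """Fit a one-feature logistic model w*x + b by batch gradient descent."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        grad_w = sum((sigmoid(w * x + b) - y) * x for x, y in zip(xs, ys)) / n
        grad_b = sum(sigmoid(w * x + b) - y for x, y in zip(xs, ys)) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

# Tiny separable data set: the label flips between x=2 and x=3.
xs = [0, 1, 2, 3, 4, 5]
ys = [0, 0, 0, 1, 1, 1]
w, b = fit_logistic(xs, ys)
preds = [int(sigmoid(w * x + b) > 0.5) for x in xs]
print(preds)
```

The math is identical at any scale; what tools like H2O add is distributing those gradient computations across a cluster.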
Zhao said he's a big fan of the open source tools available to data scientists today. In addition to being powerful tools for big data, the more popular ones have large communities behind them, which makes it easy to find answers to problems. Integrating open source tools with data infrastructure was traditionally a sticking point because there was no tech support to call when a problem arose. But the growing popularity of such tools has blunted that problem.
"Because of the popularity of open source, you can probably Google or find an answer in a forum," Zhao said. "It's not like if you have a problem with open source, you're completely abandoned."
Being free from data management
And when a data scientist puts in a little work on the back-end systems, it can pay off by reducing the time he or she has to spend on data management going forward.
That was the case for Colin Borys, a data scientist at Riot Games Inc., maker of the popular League of Legends video game. In a presentation at Spark Summit 2016, held in June in San Francisco, Borys described how his team monitors network traffic from players to see if anyone is experiencing lag time and whether traffic can be rerouted to improve connectivity. The data science team also developed a recommendation engine to suggest different options to players.
Traditionally, most of the work was done based on ad hoc queries run against Hive tables, but Borys said that approach was inefficient and didn't scale very well. Riot Games then brought in Spark partly because it lets the data scientists query Hadoop data in SQL, a language they already knew. The Los Angeles company went with a cloud-based Spark platform from Databricks so that nobody would have to spend time managing the clusters.
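Spark itself needs a running cluster, but the shift Borys describes, analysts asking ad hoc questions of structured data in SQL they already know, can be mimicked with Python's built-in sqlite3 module. This is purely a stand-in: the table, columns and query below are hypothetical, and Riot's actual stack is Spark over Hadoop data.

```python
import sqlite3

# In-memory stand-in for the kind of table Spark SQL exposes over Hadoop data.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE latency (player TEXT, region TEXT, ping_ms REAL)")
conn.executemany(
    "INSERT INTO latency VALUES (?, ?, ?)",
    [("a", "NA", 40.0), ("b", "NA", 220.0), ("c", "EU", 35.0), ("d", "EU", 50.0)],
)

# The same flavor of ad hoc question once run against Hive: where is lag worst?
rows = conn.execute(
    "SELECT region, AVG(ping_ms) FROM latency GROUP BY region ORDER BY 2 DESC"
).fetchall()
print(rows)
```

The appeal of the Databricks arrangement is that queries like this run against cluster-scale data while nobody on the team has to operate the cluster.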
Prior to bringing in Spark, Borys said, data scientists would spend a majority of their time preparing data. Now they spend it doing actual data analysis.
"We wanted to free up analysts," he said. "Playing with data is a lot better using Spark."