Machine learning projects face data prep, model building hurdles

IT and analytics managers discuss the biggest challenges of machine learning applications, with data preparation and development of algorithm-driven analytical models sharing top billing.

Machine learning has been part of the advanced analytics picture for decades, but the emergence of big data platforms and better tools for creating automated analytical algorithms is pushing it to the forefront. As a result, growing numbers of IT and analytics teams face the challenges of making machine learning projects work.

In many organizations, machine learning initiatives require big investments in IT infrastructure, often involving the deployment of Hadoop clusters, the Spark processing engine and other big data technologies. New data management and analytics processes are often also needed to get data sets ready for analysis and to develop the algorithms that will be run against them. In many cases, that means adding new skills through outside hiring or retraining of existing employees.

So-called deep learning applications, an emerging further step along the artificial intelligence curve, add to the machine learning challenges for organizations looking to run even more complex analytics jobs -- for example, interpreting images in order to classify them based on their content. In particular, deep learning makes model development harder for data scientists and statisticians building predictive models powered by automated algorithms.

To get some real-world insight into the hurdles that can trip up machine learning projects, we asked experienced attendees at the Hadoop Summit 2016 conference in San Jose, Calif., about the biggest challenges they've encountered. Their answers touched on the complexity of both upfront data preparation work and using libraries of machine learning algorithms as part of the model development process. Here's what they had to say, presented in verbatim form.

Chester Chen, senior manager of data science and engineering at wearable camera maker GoPro Inc.: "The biggest challenge is really preparing the data. All this data is coming in in different forms -- getting the proper data in the right data pipelines is a pretty daunting task."

Peter Crossley, CTO at web, mobile and internet of things analytics services provider Webtrends Inc.: "Getting data that's sanitized or managed in some form. You have to have a normalized data set -- you can dump all your data in a data lake, but then you have this marsh of data that can be hard to analyze."

Murali Kaundinya, innovation engineering director at pharmaceuticals maker Merck & Co.: "The [data analysts] don't want to deal with all the machine learning libraries. The big challenge is to present them with a platform they can use without becoming a machine learning expert."

Bryan Lari, director of institutional analytics at The University of Texas MD Anderson Cancer Center: "You're starting with imprecise data -- so, it's getting to a high enough level of precision in the data that you're confident you're getting accurate results."

Sumeet Singh, senior director of cloud and big data platforms at Yahoo Inc.: "We have to make it a whole lot simpler. [Data scientists] could easily spend a month or two just to evaluate a particular library before doing anything with it. That's an impediment."
