IT and analytics managers struggling with all the data flooding into their organizations may find it hard to ignore the increased marketing push machine learning tools are getting from technology vendors. And for good reason: Running automated algorithms designed to learn on their own as they churn through large data sets can accelerate data mining and predictive analytics applications -- and give users information they might not get otherwise. But companies looking to take advantage of machine learning often face a substantial learning curve.
For starters, a lot of big data infrastructure technologies -- Hadoop, the Spark processing engine and related open source software in particular -- typically underlie machine learning efforts. In many cases, that means building a suitable data processing and management architecture from scratch. In addition, analytics teams frequently have to change their ways -- for example, by adding new technical skills and adopting different analytical methodologies and approaches than they've used in the past.
That all came into play at commercial property and casualty insurer Zurich North America after it began a large-scale machine learning initiative two years ago. The Schaumburg, Ill., subsidiary of Zurich Insurance Group turned to machine learning software to improve on traditional approaches to assessing business risks faced by its customers and analyzing its policy pricing while staying on top of a tidal wave of data. "Massive amounts of data are a problem we see, so we went to massive machine learning to handle it," said Conor Jensen, analytics program director at Zurich North America.
But Jensen said the company spent the first 10 months of the initiative building a Hadoop-based distributed processing architecture to support the new analytics applications. That included evaluating potential vendors, running proof-of-concept projects and deploying two interconnected data lakes on separate Hadoop clusters as well as a commercial version of the H2O open source machine learning platform from software vendor H2O.ai.
Built around the Hortonworks distribution of Hadoop, the setup includes both a production and a research data lake. The first pulls in data feeds from all of Zurich's internal systems and funnels processed data sets into the second, which the insurer's predictive analytics team uses to test and run risk analysis models. The Hadoop cluster powering the analytics applications can crunch through the models much more quickly than conventional standalone systems did previously, Jensen said. "We're typically running things in 10% of the time or less if everything is set up properly."
It took some doing, though, to put the analytics team in position to create effective models for machine learning in the new environment. Jensen said Zurich had to accumulate the kind of big data analytics expertise commonly found in the large Web companies that pioneered both the development and use of Hadoop. "We're not a PayPal or a Google -- we're on the other end of the curve," he noted.
Jensen's group did have a decade's worth of experience using languages such as R and Python to create generalized linear models, a basic statistical-analysis building block that also can be used in machine learning applications. But he said that learning to develop programs that run in Hadoop was a challenge. Another big hurdle, according to Jensen, was converting internal mindsets from the actuarial approach to analytics that's typical in the insurance industry -- where "hard" statistics rule -- to the "more experimental" machine-learning process. Zurich also had to hire new employees with experience in machine learning and retrain existing workers on the new techniques.
Room to run
More companies are likely to face those kinds of challenges in the future. Machine learning tools are still a niche technology -- fewer than one-fifth of the 344 respondents to a TDWI survey conducted in mid-2015 said their organizations were doing machine learning. It ranked 18th in adoption on a list of 27 business intelligence and analytics technologies, according to a report on the survey that TDWI published last October. But an additional 36% of the respondents said they expected their organizations to start using machine learning platforms within three years (see "Edging Into the Enterprise").
Despite its relatively low adoption rate, machine learning isn't really a new field. Experimental uses of machine learning methods go back to the early days of artificial intelligence. It has been a key facet of data mining and predictive analytics processes in leading-edge organizations for many years. A sweet spot for machine learning is online recommendation engines, highly visible to users of Amazon, Netflix and other websites. Other common uses include fraud detection, sales forecasting, predictive equipment maintenance, programmatic online advertising and price optimization.
But the emergence of big data platforms such as Hadoop and Spark is making machine learning feasible and affordable for more organizations. And IT vendors of various stripes are looking to get in on the action.
Providers of machine learning tools range from established analytics vendors like IBM and SAS Institute to specialized startups such as H2O.ai, Alpine Data Labs and Skytree. Cloud-based offerings are also available from the likes of Amazon Web Services, Google and Microsoft. In addition, users can take advantage of open source machine learning technologies such as Apache Mahout and Google's TensorFlow. At the back end, Hortonworks and rival Hadoop distributors Cloudera and MapR Technologies are all touting their ability to support machine learning applications, as is Databricks, the driving force behind Spark, which includes a library of machine learning algorithms.
Unlocking machine learning
Peter Crossley, director of product architecture and technology at Webtrends Inc., sees machine learning as a natural extension of 20-plus years of analytics work at the Portland, Ore., company, which collects and analyzes user activity data from websites, mobile devices and the Internet of Things to support the online marketing programs of its corporate clients. But Webtrends has been able to substantially step up its advanced analytics efforts since putting a Hortonworks-based Hadoop cluster into full production at the start of 2015 and adding Spark eight months later.
Overall, Crossley said, Webtrends collects data on 13 billion online events per day, amounting to 500 terabytes of new information each quarter. Thanks to the new architecture, the process of analyzing the data keeps moving closer to real time -- for example, machine learning models are now used to immediately score website visitors so personalized webpage views and online offers can be served up to them. The advent of technologies like Hadoop and Spark "has unlocked machine learning," Crossley said. "Now you can take what was a batch process and make it what we call 'embarrassingly parallel.' "
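The "embarrassingly parallel" label applies when each record can be scored without reference to any other, so the work splits cleanly across workers. The Python sketch below illustrates that idea only; it is not Webtrends' implementation. The `score_visitor` function, its weights and the visitor fields are hypothetical, and a local thread pool stands in for a cluster engine such as Spark.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical model weights; a real model would be trained, not hard-coded.
WEIGHTS = {"pages_viewed": 3, "minutes_on_site": 1, "returning": 15}


def score_visitor(visitor):
    """Score one visitor using only that visitor's own fields."""
    return sum(WEIGHTS[name] * visitor[name] for name in WEIGHTS)


def chunk(records, size):
    """Split the records into independent partitions."""
    return [records[i:i + size] for i in range(0, len(records), size)]


def score_partition(partition):
    """Score a partition; no state is shared with other partitions."""
    return [score_visitor(v) for v in partition]


def score_all(records, partition_size=2, workers=4):
    """Fan partitions out to workers and stitch results back in order."""
    partitions = chunk(records, partition_size)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = pool.map(score_partition, partitions)
    return [score for part in results for score in part]


# Hypothetical visitor records, invented for illustration.
visitors = [
    {"pages_viewed": 3, "minutes_on_site": 5, "returning": 1},
    {"pages_viewed": 1, "minutes_on_site": 1, "returning": 0},
    {"pages_viewed": 8, "minutes_on_site": 12, "returning": 1},
    {"pages_viewed": 2, "minutes_on_site": 4, "returning": 0},
    {"pages_viewed": 5, "minutes_on_site": 9, "returning": 1},
]
scores = score_all(visitors)
```

Because no partition depends on another, the parallel result matches what a single serial loop would produce, and adding workers changes only how quickly the scores arrive.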
The big data analytics journey at Webtrends has also included plenty of open source infrastructure building. In addition to Hadoop and the base Apache Spark software, the technologies being used by the company include the Kafka message queuing system and Samza and Storm stream processing frameworks.
And it's not a static environment. Like many other big data users, Crossley's team is open to swapping in new technologies when necessary. Webtrends, for example, is now using the combination of Spark and Samza to do some of what Storm initially handled in capturing and processing streams of data. A big data and machine learning architecture calls for flexibility as new business and analytics needs arise, Crossley said, adding that loose architectural coupling helps ensure flexibility. Kafka, he noted, has proved useful at Webtrends in moving data through the architecture without hard-wiring it to the processing operations.
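The loose coupling Crossley describes can be pictured with a toy model. In the Python sketch below, an in-memory append-only log plays the role of a message queue topic, and each consumer tracks its own read position. The `Topic` and `Consumer` classes and the click events are hypothetical stand-ins written for this illustration; they are not Kafka's API, and a real deployment would add persistence, partitioning and networking.

```python
class Topic:
    """An append-only log, standing in for a message-queue topic."""

    def __init__(self):
        self._log = []

    def publish(self, message):
        self._log.append(message)

    def read_from(self, offset):
        """Return messages at or after the given offset."""
        return self._log[offset:]


class Consumer:
    """Reads from a topic at its own pace, tracking its own offset."""

    def __init__(self, topic, handler):
        self.topic = topic
        self.handler = handler
        self.offset = 0
        self.results = []

    def poll(self):
        """Process every message published since the last poll."""
        for message in self.topic.read_from(self.offset):
            self.results.append(self.handler(message))
            self.offset += 1


clicks = Topic()

# The producer knows only the topic, not who will process the events.
for page in ["/home", "/pricing", "/home"]:
    clicks.publish({"page": page})

# Two independent processors; either could be swapped out without
# touching the producer or the other consumer.
page_logger = Consumer(clicks, lambda event: event["page"])
pricing_flagger = Consumer(clicks, lambda event: event["page"] == "/pricing")

page_logger.poll()
pricing_flagger.poll()
```

Replacing one processing framework with another, as Webtrends did with Storm, amounts in this picture to attaching a different consumer to the same topic while the producing side stays unchanged.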
Beware the cutting edge
Andrew Musselman, a Mahout project management committee member who works as chief data scientist in the global data science practice at consulting company Accenture, similarly advises prospective machine learning users to expect rapid and continuing changes in the available infrastructure technologies. "We're in a 'tool-making' period now that will take a while to settle down," Musselman explained, speaking in his Mahout role. "Tools have been written and adopted and then sometimes thrown away."
In addition, many of the open source tools surrounding Hadoop are evolving quickly, so users need to keep pace with a regular stream of new releases. Mahout is a case in point: The Apache Software Foundation released five versions of the machine learning technology during 2015.
As with other types of advanced analytics applications, users of machine learning tools also can encounter obstacles in getting their predictive models to produce accurate results. But those obstacles can be even more challenging in machine learning efforts because of the size of the data sets typically involved and the accompanying development and processing complexity.
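One common way such accuracy problems surface is when a model is checked only against the data it was trained on. The Python sketch below is a deliberately artificial illustration, not drawn from any company in this article: the labels are random coin flips, and the hypothetical `Memorizer` "model" simply stores its training rows. It scores perfectly on the rows it has seen and close to chance on rows held back for testing.

```python
import random

random.seed(0)  # fixed seed so the run is repeatable


def make_rows(n):
    """Rows of (features, label) where the label is pure coin-flip noise."""
    return [((random.random(), random.random()), random.randint(0, 1))
            for _ in range(n)]


class Memorizer:
    """A deliberately bad model: it memorizes training rows verbatim."""

    def fit(self, rows):
        self.table = dict(rows)
        labels = [label for _, label in rows]
        # Fall back to the most common training label for unseen inputs.
        self.default = max(set(labels), key=labels.count)
        return self

    def predict(self, features):
        return self.table.get(features, self.default)


def accuracy(model, rows):
    """Fraction of rows whose label the model predicts correctly."""
    hits = sum(model.predict(features) == label for features, label in rows)
    return hits / len(rows)


rows = make_rows(1000)
train, holdout = rows[:800], rows[800:]  # hold back 200 rows for testing

model = Memorizer().fit(train)
train_accuracy = accuracy(model, train)
holdout_accuracy = accuracy(model, holdout)
print(f"training accuracy: {train_accuracy:.2f}")
print(f"holdout accuracy:  {holdout_accuracy:.2f}")
```

The gap between the two numbers is the warning sign. Real models fail in subtler ways, but evaluating on data kept out of training is a basic guard against results that only look good.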
"It can be difficult to go from thought to model to 'training' [the model] and then to operationalize the whole thing," said Zoiner Tejada, CEO and architect at development and consulting services provider Solliance. And analytics teams aren't always rewarded for those efforts, cautioned Tejada, who also is CTO at analytics services startup Algebraix Data. "Machine learning is a powerful tool, but you can cut yourself," he said. "You may have a model that looks like it works, but later you find its predictions are bogus."
"The promise of machine learning can be overstated," agreed Zurich North America's Jensen. Even once you get through the initial learning curve on how to make it work, the process isn't easy, he warned. For one thing, he said, each of the variables built into a model requires upfront data preparation and manipulation -- work that can be harder to do than implementing the final algorithm itself.
Another pitfall lies beyond building an infrastructure, processing data and developing machine learning models, in what might be described as the last yard, where the business side makes decisions. "Having the best algorithms doesn't mean anything," Jensen said, "if you don't train end users on how to effectively use the [results]."