RelayHealth, a unit of McKesson Corp. that runs claims processing applications for healthcare providers, has been a Hadoop user since 2012. Its Hadoop cluster holds 150 TB of claims-related data from hospitals and health systems, which is used to meet regulatory compliance requirements, track the progress of claims, and do analytics aimed at improving the claims and billing process. But now the Alpharetta, Ga., company wants to do more with the Hadoop data for its customers -- and more quickly.
Its new goal is to provide more real-time analytics to clients so they can get immediate insight into their business operations, enabling them to make fast adjustments in an effort to improve efficiency and resolve problems before they take a big financial hit. And to make that happen, RelayHealth is expanding its Cloudera-based cluster and turning to Spark Streaming, the stream processing component of Apache Spark.
The expanded analytics initiative is being driven partly by demand from customers looking to upgrade from after-the-fact reports to real-time alerts about operational issues, according to Raheem Daya, director of product development and manager of the Hadoop platform at RelayHealth. "The expectation is that if there's information they need, they need to have it immediately available," Daya said. "They want actionable intelligence quickly, and that's the model we're moving toward."
For example, he pointed to accounts receivable. Ideally, medical claims are processed promptly by insurers so providers can send final bills to patients and get payments back in a timely manner. But Daya said claims frequently get held up at the approval stage because of eligibility questions and other issues. Predictive models can look for claims that are likely to be flagged; armed with that information, workers in a provider's billing department can take action to try to expedite the approval process.
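The flagging workflow Daya describes can be sketched in a few lines of Python. This is an illustrative simulation only, not RelayHealth's code: the feature names, model weights, and threshold below are all hypothetical.

```python
# Illustrative sketch of flagging claims likely to stall at approval.
# Hypothetical: the features, weights, and threshold are invented for this example.
import math

# A simple logistic model whose weights would be learned from historical claims.
WEIGHTS = {"eligibility_mismatch": 2.1, "missing_auth_code": 1.4, "resubmission": 0.8}
BIAS = -2.5
THRESHOLD = 0.5  # flag claims whose predicted hold probability exceeds this

def hold_probability(claim):
    """Score one claim: the probability it will be held up at approval."""
    z = BIAS + sum(WEIGHTS[f] for f in claim["flags"] if f in WEIGHTS)
    return 1.0 / (1.0 + math.exp(-z))

def flag_for_followup(claims):
    """Return claim IDs the billing department should try to expedite."""
    return [c["id"] for c in claims if hold_probability(c) > THRESHOLD]

claims = [
    {"id": "C-1001", "flags": ["eligibility_mismatch", "missing_auth_code"]},
    {"id": "C-1002", "flags": []},
]
print(flag_for_followup(claims))  # only the risky claim is flagged: ['C-1001']
```

The point of the sketch is the hand-off, not the model: scoring produces a short work list that billing staff can act on before a claim stalls.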
A partial first step toward better analytics
RelayHealth deployed Spark, an open source data processing engine that can work with Hadoop or on a standalone basis, in late 2013. But Daya's team initially used the software's primary batch-processing functionality to pull in transaction data from an HBase database tied to the Hadoop cluster for analysis through machine learning algorithms. That helped, he said -- but it didn't provide the real-time information the company and its customers were looking for. In batch mode, scoring a predictive model for accuracy against incoming data can take two to three hours, Daya noted. And then the scoring process typically needs to be repeated, often multiple times, as the model is refined.
To try to accelerate things, the company is implementing the Spark Streaming module, with an expected go-live date this month. In tests, the stream processing technology has been able to score models in seconds, Daya said. He added that if Spark Streaming is set to pull in data every five to 10 minutes, "you're potentially going from waiting an entire day to get a result to waiting minutes."
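The shift from long batch runs to micro-batches can be sketched with a small stdlib simulation. Spark Streaming itself slices an incoming stream into short fixed intervals and processes each slice as it closes; the grouping logic, interval, and records below are illustrative stand-ins, not RelayHealth's pipeline.

```python
# Illustrative micro-batching: group a transaction stream into fixed time
# intervals and score each small batch as it closes, rather than waiting for
# one large daily batch. Hypothetical stand-in for Spark Streaming's behavior.
from itertools import groupby

BATCH_SECONDS = 300  # a 5-minute interval, per the 5- to 10-minute range cited

def micro_batches(records):
    """Yield (batch_start_time, records) groups keyed by a fixed interval."""
    keyed = sorted(records, key=lambda r: r["ts"] // BATCH_SECONDS)
    for window, group in groupby(keyed, key=lambda r: r["ts"] // BATCH_SECONDS):
        yield window * BATCH_SECONDS, list(group)

def score_batch(batch):
    """Placeholder for model scoring; here it just reports the batch size."""
    return len(batch)

stream = [{"ts": 10}, {"ts": 120}, {"ts": 310}, {"ts": 615}]
for start, batch in micro_batches(stream):
    print(start, score_batch(batch))  # three small batches instead of one big one
```

Each batch is scored seconds after its interval closes, which is the latency change Daya describes: results arrive minutes after the data does instead of hours or a day later.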
One reason for waiting to put the data streaming software to use was the need to expand the Hadoop cluster, partly to handle the increased data volumes that Spark Streaming will generate. RelayHealth pulls an average of 28 million transactions per hour into the cluster, and Daya said more processing power had become necessary to get data both into and out of the system. An increase from 10 to 45 compute nodes is due to be completed shortly before Spark Streaming is turned on.
Another prerequisite step was upgrading to a new version of Cloudera's Hadoop distribution that became available in December with support for Spark Release 1.2.0. Korin Reid, a data scientist at RelayHealth, said the machine learning library built into earlier releases of Spark was "very primitive," making it hard to build good algorithms. Reid added that she has started using the library in Spark 1.2.0 to expand the set of algorithms the company plans to employ for analyzing the claims data.
Be prepared for what's coming
Daya said creating an overall IT architecture that can take advantage of stream processing technology is a must for a successful deployment. In addition to expanding its cluster, RelayHealth is adding the Apache Kafka message queuing technology to take data from HBase and feed it into Spark. Upstream business systems also need to be able to handle the real-time analytics information coming their way from Spark, including automated updates and actions, he said.
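The role Kafka plays in that architecture -- decoupling the system producing data from the system consuming it -- can be shown with a minimal producer/consumer sketch. Here `queue.Queue` stands in for a Kafka topic, and the record shape is hypothetical; a real deployment would use Kafka clients and Spark Streaming's Kafka receiver.

```python
# Minimal sketch of message-queue decoupling: a producer pushes change records
# onto a topic while an independent consumer drains them for stream processing.
# queue.Queue stands in for Kafka; the record contents are hypothetical.
import queue
import threading

topic = queue.Queue()  # stand-in for a Kafka topic
SENTINEL = None        # end-of-stream marker, for this sketch only

def producer(records):
    """Feed claim-transaction records into the topic (Kafka's role)."""
    for record in records:
        topic.put(record)
    topic.put(SENTINEL)

def consumer(out):
    """Drain records for downstream processing (Spark Streaming's role)."""
    while True:
        record = topic.get()
        if record is SENTINEL:
            break
        out.append(record)

received = []
records = [{"claim": "C-1"}, {"claim": "C-2"}]
t = threading.Thread(target=consumer, args=(received,))
t.start()
producer(records)
t.join()
print(received)
```

Because the queue buffers records, the producer and consumer run at their own pace -- the property that lets HBase keep writing while Spark processes at its own micro-batch rhythm.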
And business processes at the operational end likely will have to be modified to some degree, which can create people issues for project teams to contend with. "Really, it's a universal shift from doing things the way we've always done them to being more data-driven," Daya said. "It requires buy-in from senior management that this is a direction you want to go in."
William McKnight, president of McKnight Consulting Group, agreed that data streaming and real-time analytics applications call for "new ways of thinking" in many organizations. "There are a lot of mind things that go along with it," he said, adding that business managers and workers may need to be convinced of the wisdom of changing internal processes to get the full benefit of the new capabilities.