Better Together: Hadoop and Your Enterprise Data Warehouse
You’re getting started with a big data analytics project on Hadoop and are impressed by the cost savings on storage compared with your data warehouse. You’ve read that TrueCar, a company that collects vast volumes of car price data for its online car-buying business, has cut its monthly data storage cost from $19/GB to $0.23/GB.1 So you’re wondering, should you consider moving all your business intelligence efforts to Hadoop?
No. A data warehouse and Hadoop are each suited to different tasks, and Hadoop should not be viewed as a replacement for your enterprise data warehouse. What’s more, Hadoop adopters report some challenges: some say that Hadoop takes too much effort or is too slow for real-time analytics.
Hadoop’s MapReduce engine is optimized for batch processing because it distributes simple calculations across many nodes. However, this method is not well suited to ad hoc, interactive, real-time data discovery or to advanced analytics. Advanced analytics requires the database nodes to communicate with one another to enable interactive processing, a capability lacking in MapReduce. Hadoop also requires a set of skills that differs from those needed for a data warehouse and that may be hard to find in the labor market.2
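To make the batch-oriented nature of MapReduce concrete, here is a minimal sketch in plain Python. It is an illustration of the programming pattern only, not Hadoop's actual API: the function names (`map_phase`, `shuffle`, `reduce_phase`, `run_job`) and the simple log-line format are assumptions made for this example. Note that each call to `map_phase` sees only its own record and never talks to another mapper, which is what makes the work easy to distribute and also what makes interactive, node-to-node analytics awkward.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_phase(line: str) -> Iterator[Tuple[str, int]]:
    """Emit (status_code, 1) for one log line.

    Assumes a hypothetical format such as "GET /index.html 200",
    where the last field is the HTTP status code.
    """
    fields = line.split()
    if fields:  # skip blank lines
        yield fields[-1], 1


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Group all emitted values by key, as the framework would between phases."""
    groups: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key: str, values: List[int]) -> Tuple[str, int]:
    """Combine all values for one key into a single total."""
    return key, sum(values)


def run_job(lines: Iterable[str]) -> Dict[str, int]:
    """Run map, shuffle and reduce in sequence on a single machine."""
    pairs = (pair for line in lines for pair in map_phase(line))
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


if __name__ == "__main__":
    sample_log = ["GET /a 200", "GET /b 404", "GET /c 200"]
    print(run_job(sample_log))
```

The whole job runs start to finish before any answer is available, which suits scheduled batch work far better than a question-and-follow-up style of data discovery.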
Since you will keep your data warehouse and deploy Hadoop alongside it, the best approach is to use the two in complementary fashion. Your enterprise data warehouse should contain structured and curated data, while Hadoop should serve as a sandbox for experimenting with new types of data like Web logs, text, email and machine data.3 When combined with traditional data types found in the enterprise data warehouse, these new data types can offer users new insights. Hadoop can also be used as a staging area where data is cleansed and structured before it populates the enterprise data warehouse. This allows the enterprise data warehouse to focus on the data that is highly valued by business users.
Once you adopt this hybrid approach, you may discover some important strategic insights. For example, if you’re in the retail business, you may combine consumer sentiment data from free-text product reviews and call center notes with structured data such as pricing and SKU numbers. The results could give you new knowledge of customer preferences, leading you to scrap products that are falling flat and add new wares for which customers are clamoring.
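The retail scenario above can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not a production sentiment model: the keyword lists, the function names (`sentiment`, `average_sentiment_by_sku`, `join_with_catalog`) and the sample SKUs and prices are all hypothetical. It scores free-text reviews with a naive keyword count, averages the scores per SKU, and joins the result to structured pricing data of the kind a warehouse would hold.

```python
from typing import Dict, Iterable, List, Optional, Tuple

# Hypothetical keyword lists; a real project would use a trained model.
POSITIVE = {"love", "great", "excellent"}
NEGATIVE = {"broken", "poor", "disappointed"}


def sentiment(text: str) -> int:
    """Naive score: distinct positive words minus distinct negative words."""
    words = {w.strip(".,!?").lower() for w in text.split()}
    return len(words & POSITIVE) - len(words & NEGATIVE)


def average_sentiment_by_sku(reviews: Iterable[Tuple[str, str]]) -> Dict[str, float]:
    """Average the review scores for each SKU (the unstructured side)."""
    totals: Dict[str, Tuple[int, int]] = {}
    for sku, text in reviews:
        score_sum, count = totals.get(sku, (0, 0))
        totals[sku] = (score_sum + sentiment(text), count + 1)
    return {sku: s / c for sku, (s, c) in totals.items()}


def join_with_catalog(
    catalog: Dict[str, float], sentiment_by_sku: Dict[str, float]
) -> List[Dict[str, Optional[object]]]:
    """Attach average sentiment to each catalog row (the structured side).

    SKUs with no reviews get avg_sentiment = None.
    """
    return [
        {"sku": sku, "price": price, "avg_sentiment": sentiment_by_sku.get(sku)}
        for sku, price in catalog.items()
    ]


if __name__ == "__main__":
    reviews = [
        ("A1", "Love it, great value!"),
        ("A1", "Excellent."),
        ("B2", "Arrived broken. Very disappointed."),
    ]
    catalog = {"A1": 19.99, "B2": 5.49, "C3": 7.00}  # hypothetical prices
    for row in join_with_catalog(catalog, average_sentiment_by_sku(reviews)):
        print(row)
```

In this sketch, a product with a consistently negative average would be a candidate for review, while a well-liked one might justify more shelf space; the decision itself still rests with the business user.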
Big data expert Tom Davenport chronicles several game-changing discoveries in his book, “Big Data @ Work,” as he explains to John Farrelly.4 He found companies that augmented “small data” projects with Hadoop big data initiatives to achieve dramatic results. In one case, Monsanto, which already had plenty of information in the form of structured data about its seeds and plant hybrids, added big data information about climate and soil conditions and made the resulting intelligence available to farmers. Farmers obtained guidance as to what to plant, when to plant it, how much water to use, how many seeds to sow, the best time for herbicide and pesticide applications and when to harvest. Crop yields increased by 10% to 15%.
The shortcomings of Hadoop for real-time analytics can be overcome to a significant degree by the use of in-memory analytics or in-database analytics. This is exactly what SAS High-Performance Analytics (HPA) technology accomplishes. SAS HPA allows complex data exploration, model development and model deployment steps to be processed in-memory or distributed in parallel across a dedicated set of nodes.
Because data can be quickly pulled into memory, requests to run new scenarios or new analytical computations can be handled much faster. This enables business users to make real-time decisions and to create more accurate models.
Also, because data is stored locally in Hadoop, it can be processed without having to move the data to a separate analytic platform.
In addition to flexibility with regard to data types, Hadoop is not constrained by other common database limitations, such as the number of columns in a single table. Advanced analytics software uses an analytics-based table that can consist of tens of thousands or even hundreds of thousands of columns. The number of variables can have a significant impact on the accuracy of the results, so Hadoop supports advanced analytics particularly well: the data in it can be both wide and deep.
Why is this important? Think about an anti-money-laundering application, which a business analyst at a bank or financial services firm may use to spot patterns of illegal activity. By analyzing transaction patterns in real time, illegal activity may be discovered and stopped before large losses are incurred.
Both data warehouses and Hadoop will continue to evolve, and Hadoop may eventually become a replacement for the enterprise data warehouse. Data warehouses, for their part, may improve and offer better storage economics, lower latency, higher scalability and support for diverse data structures. But for now, there is a need for both Hadoop and an enterprise data warehouse. When used together, each can enrich and derive value from the data contained in the other, giving you a strategic edge you could not get in any other way.
1 “Tom Davenport on Hadoop, Big Data, and the Internet of Things,” SAS, October 15, 2014.
2 “The Current State of Hadoop in the Enterprise,” International Institute for Analytics and SAS Institute, p. 5.
3 Ibid., footnote #2, p. 6.
4 Ibid., footnote #1, SAS.