For decades, data analysts of all types have used self-service analytical tools to access and manipulate data, identify trends and anomalies, and present business intelligence insights. Although the types of tools have changed over the years, the results are almost always the same: a spreadmart, or data shadow system, built with unique rules, metrics and definitions.
Most large companies have tens of thousands of spreadmarts, each built to answer important -- but localized -- questions at a point in time. While invaluable to individual business units, spreadmarts drive CEOs and CFOs crazy. When they ask a simple question -- such as, "How many customers do we have?" -- they get conflicting answers from spreadmart-toting data analysts and business-unit heads citing inconsistent data. Like Helen of Troy, whose face "launched a thousand ships," the spreadmart phenomenon has caused thousands of IT managers and corporate executives to launch data warehousing initiatives to restore data consistency and enterprise order.
That hasn't stopped people from squirreling away data in a variety of spreadmart tools, from Microsoft Excel and Access to self-service BI software and, at the high end, SAS and SPSS for statistical analysis and data mining. But there's a new technology on the block that can help organizations ameliorate the more deleterious side effects of spreadmarts: Hadoop clusters.
The open source software is free, the hardware needed to run it is cheap, and analysts don't have to know SQL or data modeling techniques to use it. They can dump data into Hadoop and then use a high-level query or scripting layer, such as Hive or Pig, or a Hadoop-compatible BI or data integration tool to access, manipulate and analyze the data. And although there are many reasons to implement Hadoop, a primary one is to foster self-service data analysis without IT intervention. As a result, Hadoop is fast becoming the spreadmart platform of choice for sophisticated analysts and department heads.
Governance-free zone in Hadoop clusters
Until now, there has been minimal talk about how to ensure data governance inside Hadoop environments. The terms data quality, data consistency, conformed dimensions and metadata management have yet to enter the Hadoop lexicon. That's partly because Hadoop is so new and most companies are still evaluating its ability to support production applications. It's also because its primary users -- business analysts -- have never been overly concerned with enterprise data governance and consistency and don't require high levels of data quality to generate estimates and analyze trends.
So, if Hadoop is a free-for-all self-service system where analysts and business users can dump and access data willy-nilly without governance, what's to keep the highly hyped Hadoop data lake from becoming a bunch of data puddles? In other words, will Hadoop further proliferate spreadmarts or help consolidate them?
The answer is both.
Companies can indeed use Hadoop as a low-cost repository for all of their data -- i.e., a data lake. As such, a Hadoop system provides one-stop shopping for every analyst and business unit in an organization. Rather than hunt for data in multiple applications and systems, analysts can get everything they need by tapping into the data lake. That makes it even easier to create spreadmarts.
But instead of proliferating tens of thousands of wholly ungoverned spreadmarts on various PCs and file servers, Hadoop provides an opportunity to consolidate data analytics work in a single place: a giant analytics sandbox that offers greater economies of scale and sizable cost savings. And it gives IT and business managers visibility into what analysts are doing. One way to think about spreadmarts is to view them as instantiations of business requirements. Hidden spreadmarts make it difficult for IT managers to see what's important to the business and support those requirements in data warehouses and enterprise reports. By centralizing analytics activities in a data lake, Hadoop makes it easier for IT departments to partner with business users and proactively meet their needs.
A new host for analytical ecosystems
However, Hadoop is much more than a place to keep a collection of spreadmarts. It's a scalable, flexible data processing platform that can meet most enterprise data analysis requirements. It's like the Swiss Army knife of data processing: a generic tool that can do almost anything, although nothing optimally (at least at the moment).
Unlike a data warehouse, which holds only a subset, Hadoop can store all enterprise data. And with the advent of the YARN resource manager, released in fall 2013 as part of Hadoop 2, it can support a variety of data and analytical processing applications, ranging from real-time SQL querying systems to graph processing, in-memory computing and streaming analysis engines. Although Hadoop 2 needs time to mature, the future is clear: Companies can store their data in Hadoop clusters and process it there, too.
This is revolutionary. Astute IT and data warehousing managers will quickly recognize the implications. With Hadoop 2 systems, their future analytics architecture revolves around Hadoop, not a relational database. By extension, their existing analytics systems become specialty databases that eventually disappear as Hadoop matures and subsumes their functionality.
At least, that's the vision. A lot of development and experimentation needs to happen before most organizations transform their current analytical ecosystem into a data lake fueled by Hadoop 2. Moreover, existing analytical systems have a long shelf life: Even after their costs have been fully depreciated, embedded skill sets and corporate inertia make it difficult for companies to jettison them. And Hadoop may never live up to its promise, or another technology may take its place as the analytics heir apparent.
But things happen fast in the Hadoop world. Today, Hadoop is quickly becoming the de facto enterprise data repository and preferred spreadmart platform (or analytics sandbox). Soon it could be the predominant platform for building analytics applications and the centerpiece of most analytical ecosystems.
About the author:
Wayne Eckerson is principal consultant at Eckerson Group, a consulting firm that helps business leaders use data and technology to drive better insights and actions. His team provides information and advice on business intelligence, analytics, performance management, data governance, data warehousing and big data. Email him at firstname.lastname@example.org.