Datameer demonstrates a Hadoop-based data mart.
Stefan Groschupf, CEO of @Datameer (from the German "Meer": a sea of data), has a great way with one-liners. At the #BBBT last Friday, he suggested that doing interactive SQL on Hadoop HDFS was reminiscent of a relational database using tape drives for storage. The point is that HDFS is a sequential file access system optimized for the large batch reads and writes typical of MapReduce. A good point that's often overlooked in the elephant hype.
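Groschupf's tape-drive analogy can be made concrete with a toy sketch. This is illustrative Python only — not Datameer or HDFS code — simulating an append-only, sequentially read file: an interactive point query must scan the whole file every time, while a single MapReduce-style batch pass answers every key at once.

```python
# Illustrative sketch (not HDFS/Datameer code): sequential, append-only
# storage punishes per-key interactive queries but suits batch aggregation.
import io

# Simulated append-only file of "user_id,amount" records.
records = "\n".join(f"user{i % 100},{i}" for i in range(10_000))

def point_lookup(data, user):
    """Interactive-SQL-style query: scans every line -- O(n) per question."""
    total = 0
    for line in io.StringIO(data):
        uid, amount = line.strip().split(",")
        if uid == user:
            total += int(amount)
    return total

def batch_aggregate(data):
    """MapReduce-style job: one sequential pass answers ALL keys at once."""
    totals = {}
    for line in io.StringIO(data):
        uid, amount = line.strip().split(",")
        totals[uid] = totals.get(uid, 0) + int(amount)
    return totals

# The full scan is amortized over every key in the batch case,
# but paid again for each interactive query -- the "tape drive" effect.
assert point_lookup(records, "user7") == batch_aggregate(records)["user7"]
```

The cost asymmetry, not the file format, is the point: sequential storage makes one big pass cheap and many small probes expensive, which is why interactive SQL layered directly on HDFS felt so awkward.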
Another great sound bite was that Datameer could be seen as the Business Objects of the Hadoop world. And it’s that thought that leads me to the actual topic of this post: data marts.
The time-to-value debate between data warehouse and data mart is one of the oldest and most divisive in business intelligence, so it's hardly surprising that it should reemerge in the world of Hadoop. After all, Hadoop is increasingly used to integrate data from a wide variety of sources for analysis. Such integration always raises the question: do it in advance to ensure data quality, or do it as part of the analysis to reduce time to value? As seen in the image above, Datameer is clearly at the latter end of the spectrum. It's a data mart.
And in the big data world, it's certainly not the only data-mart-style offering. A growing number of products in the Hadoop ecosystem tout typical data mart values: time to value, ease of use, a focus on analysis and visualization, self-service, and so on. What's different about Datameer is that it has been around for nearly five years and has an impressive customer base.
At an architectural level, we should consider how the quality vs. timeliness, mart vs. warehouse trade-off applies in the world of big data, including the emerging Internet of Things (IoT), discussed at length in my Business unIntelligence book. Are the characteristics of this world sufficiently different from those of traditional BI that we can reevaluate the balance between these two approaches? The answer boils down to the level of consistency and integrity demanded by the business uses of the data involved. Simple analytic uses of big data such as sentiment analysis, churn prediction, etc. are seldom mission-critical, so the demands on quality and integrity are lower. However, more care is required when such data is combined with business-critical transactional or reference data. This latter data is well-managed (or, at least, it should be) and combining it with poorly curated big data leads inevitably to a result set of lower quality and integrity. Understanding the limitations of such data is vital.
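The degradation is easy to demonstrate with a toy join. This is a hypothetical sketch — the customer table, event feed, and field names are invented for illustration — showing how a naive, cleanse-free join of curated reference data with an uncurated event stream inherits the weaker side's quality.

```python
# Hypothetical sketch: combining well-managed reference data with poorly
# curated big data yields a result set of the LOWER quality and integrity.
customers = {  # curated reference data: consistent keys, complete fields
    "C001": {"name": "Acme", "segment": "enterprise"},
    "C002": {"name": "Bolt", "segment": "smb"},
}

# Uncurated event feed: inconsistent key formats, missing values, orphan keys.
events = [
    {"cust": "C001", "sentiment": 0.8},
    {"cust": "c002", "sentiment": None},  # key case mismatch + missing value
    {"cust": "C999", "sentiment": 0.3},   # key with no reference match
]

joined = []
for ev in events:
    ref = customers.get(ev["cust"])       # naive join, no cleansing step
    joined.append({**ev, "segment": ref["segment"] if ref else None})

# Only 1 of 3 rows is fully usable: the join silently drops to the
# integrity level of the uncurated feed, not the curated table.
usable = [r for r in joined if r["segment"] and r["sentiment"] is not None]
assert len(usable) == 1
```

Nothing errors out here, which is the trap: the mismatches surface only as quietly degraded analysis downstream, exactly the risk when time-to-value analysis skips the advance integration step.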
This is particularly important in the case of the growing—and, in my view, unfortunate—popularity of the Data Lake or Data Reservoir concept. In this approach, previously cleansed and integrated business data from operational systems is copied into Hadoop, an environment notorious for poor data management and governance. The opportunities to introduce all sorts of integration or quality errors multiply enormously. In such cases, the data mart approach may amount to nothing more than a fast track to disaster.