In building out its Internet of Things, is HDS acquiring a data refinery, a data lake or a data swamp?
This week’s announcement of Hitachi Data Systems’ (HDS, @HDScorp) intention to acquire @Pentaho poses some interesting strategic and architectural questions about big data that are far more important than the announcement’s bland declaration about it being “the largest private big data acquisition transaction to date”. We also need to look beyond the traditional acquisition concerns about integrating product lines, as the companies’ products come from very different spaces. No, the real questions circle around the Internet of Things, the data it produces, and how to manage and use that data.
As HDS and Pentaho engaged as partners and flirted with the prospect of marriage, we may assume that for HDS, aligning with Hitachi’s confusingly named Social Innovation Business was key. Coming from BI, you might imagine that Social Innovation refers to social media and other human-sourced information. In fact, it is Hitachi’s Internet of Things (IoT) play. Hitachi, as a manufacturer of everything from nuclear power plants to power tools, from materials and components to home appliances, as well as being involved in logistics and financial services, is clearly positioned at the coalface of IoT. With data as the major product, the role of HDS storage hardware and storage management software is obvious. What HDS lacked was the software and skills to extract value from the data. Enter Pentaho.
Pentaho comes very much from the BI and, more recently, big data space. Empowering business users to access and use data for decision making has been its business for over 10 years. Based on open source, Pentaho has focused on two areas. First, it provides BI, analysis and dashboard tools for end-users. Second, it offers data access and integration tools across a variety of databases and big data stores. Both aspects are certainly of interest to HDS. Greg Knieriemen (@Knieriemen), Hitachi Data Systems Technology Evangelist, agrees, and adds big data and cloud embedding for good measure. The BI and analytics aspect is straightforward: Pentaho offers a good set of functionality and it’s open source. A good match for HDS’s needs and vision; job done. The fun begins with data integration.
Dan Woods (@danwoodsearly) lauds the acquisition and links it to his interesting concept of a “Data Supply Chain… that accepts data from a wide variety of sources, both internal and external, processes that data in various nodes of the supply chain, passing data where it is needed, transforming it as it flows, storing key signals and events in central repositories, triggering action immediately when possible, and adding data to a queue for deeper analysis.” The approach is often called a “data refinery” by Pentaho and others. Like big data, the term has a range of meanings. In simple terms, it is an evolution of the ETL concept to include big data sources and a wider range of targets. Mike Ferguson (@mikeferguson1) provides perhaps the most inclusive vision in a recent white paper (registration required). However broadly or narrowly we define data refinery, HDS is getting a comprehensive set of tooling from Pentaho in this space.
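The data supply chain described above can be sketched in a few lines of code. The following Python is a minimal, purely illustrative sketch, not Pentaho's implementation: records flow in from a source, a transformation node normalizes them, key signals land in a central store, urgent events trigger immediate action, and the rest go to a queue for deeper analysis. All names, thresholds, and field layouts here are assumptions made up for the example.

```python
# Illustrative sketch of a "data supply chain" / data refinery.
# The sensor record format and the alert threshold are hypothetical.

def refine(record):
    """Transformation node: normalize field names and convert units."""
    return {
        "sensor": record["id"],
        "temp_c": (record["temp_f"] - 32.0) * 5.0 / 9.0,
    }

class Refinery:
    def __init__(self, alert_threshold_c):
        self.store = []    # central repository of key signals
        self.queue = []    # queue for deeper, deferred analysis
        self.alerts = []   # events that triggered immediate action
        self.alert_threshold_c = alert_threshold_c

    def ingest(self, record):
        refined = refine(record)          # transform as it flows
        self.store.append(refined)        # store the key signal centrally
        if refined["temp_c"] > self.alert_threshold_c:
            self.alerts.append(refined)   # trigger action immediately
        else:
            self.queue.append(refined)    # add to queue for deeper analysis

r = Refinery(alert_threshold_c=80.0)
r.ingest({"id": "pump-1", "temp_f": 212.0})  # 100 °C: over threshold
r.ingest({"id": "pump-2", "temp_f": 68.0})   # 20 °C: routed to the queue
print(len(r.alerts), len(r.queue))  # prints: 1 1
```

The point of the sketch is the routing decision at ingest time: the same refined record is always stored, but only out-of-band events cause immediate action, which is what distinguishes a refinery-style pipeline from plain batch ETL.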
However, along with Pentaho’s data integration tooling, HDS is also getting the Data Lake concept, through Pentaho’s cofounder and CTO, James Dixon, who could be called the father of the Data Lake, having introduced the term in 2010. This could be more problematic, given the debates that rage between supporters and detractors of the concept. I fall rather strongly in the latter camp, so I should, in fairness, provide context for my concerns by reviewing some earlier discussions. This deserves more space than I have here, so please stay tuned for part 2 of this blog!