With Cisco’s Data Virtualization Day, where I’m a panelist along with Rick van der Lans, coming up fast on 1 Oct in New York, it’s a good moment to revisit the topic.
Data virtualization has come of age. For a few, it remains unheard of, even when I mention some of its noms de plume, most often data federation. For others, now a decreasing minority found especially in the data warehouse field, it remains akin to devil worship! I’m a big supporter (of virtualization, that is!), but a new challenge is emerging – the Data Lake. To understand why, let’s quickly review the history, pros and cons of data virtualization.
As the original proponent of data warehousing in 1988, I was certainly not impressed by data virtualization when it was first talked about in 1991, as Enterprise Data Access, as the EDA/SQL product from Information Builders, and as a component of IBM’s Information Warehouse Architecture. (I suspect there are few enough of us left in the industry who can talk about that era from firsthand experience!) Back then, and through the 1990s, I believed that virtualization was a technology with very limited potential. Data consistency and quality were still the major drivers of the BI industry, and real-time integration of data from multiple sources was still a very high-risk endeavor.
By the turn of the millennium, however, I was having a change of heart! IBM was preparing to introduce Information Integrator, a product aimed at the market then known as Enterprise Information Integration (EII). The three principal use cases – real-time data access, combining data marts, and combining data and content – were gaining traction. And they continue to be the principal use cases for data virtualization from a data warehouse point of view today. My change of heart seemed obvious to me then and now: the use cases were real and growing in importance, they could not be easily satisfied in the traditional data warehouse architecture, and the data quality of the systems to be accessed was gradually improving. Still, I was probably the first data warehousing expert to accept a role for data virtualization; it was not a very popular stance back then!
Within the past five years, data virtualization has become more mainstream. There is a broader acceptance that data exists and will continue to exist on multiple platforms, and therefore a need to access and join that data in situ.
Long may that recognition continue! For there is now another wave of platform consolidation being proposed. It’s called the Data Lake, and it’s probably one of the most insidious concepts yet proposed – well, that’s my view. The Data Lake is a new attempt to consolidate data. In that sense, it echoes data warehouse thinking. The significant difference, however, is that there is no thought of reconciling the meanings, rationalizing the relationships, or considering the timings of the data pouring in. “Just store it,” is the refrain; we’ll figure it out later. To my mind, this is as dangerous as allowing all manner of unmonitored and likely polluted water sources to flow into a real lake and then declaring it fit to drink. Not a good idea. In my view, as explained in “Business unIntelligence”, a combination of data warehouse and virtualization is what’s needed.
I’ll return to the Data Lake in more detail soon, and I’ll also be speaking about it at Strata New York, on 16 October.
Do join me at one or the other of these events!
Image: zhudifeng / 123RF Stock Photo