Doing big data governance can save you from drowning in the Data Lake.
So, how do you design and build a reservoir? Simplistically, you actually design and build a dam, clear the area behind it of everything of value—people, animals and things—and wait for the water to fill the existing valley, drowning everything in its wake.
Of course, I really want to talk about Data Lakes and Data Reservoirs, what the concept might mean, and its implications for data management and governance. Data Lakes, sometimes called Data Reservoirs, are all the rage at the moment. They seem to provide ideal vacationing spots for marketing folks from big data vendors. We’re treated to pictures of sailboats and waterskiing against pristine mountain backdrops. But beyond exhortations to move all our data to a Hadoop-based platform and save truckloads of money by decommissioning decades of investment in relational systems, I’ve so far found little in the way of thoughtful architecture or design.
The metaphor of a lake offers, of course, the opportunity to talk about water and data flowing in freely and users able to dip in with ease for whatever cupful they need. Playful images of recreational use suggest the freedom and fun that business users would have if only they didn’t have to worry about where the data comes from or how it’s structured. Like the crystal-clear water in the lake, it is suggested, all data is the same pure substance waiting to be consumed.
Deeper thinking, even at the level of the lake metaphor, reminds us that there’s more to it. Lake water must undergo significant treatment and cleansing before it’s considered fit to drink. Many lakes are filled with effluent from the rivers that feed them. Even the pleasure seekers on the lake understand that there may be dangerous shallows or hidden rocks.
The rush to discredit the data warehouse, with its structures and rules, its loading and cleansing processes, its governance and management, has led its detractors to throw out the baby with the lake water. It is important to remember that not all data is created equal. It varies along many axes: value, importance, cleanliness, reliance, security, aggregation, and more. Each characteristic demands thought before an individual data element is put to use in the business. At even the most basic level, its meaning must be defined before it’s used. This simple fact is at the foundation of data warehousing, but it often seems forgotten in the rush to the lakeshore.
Big data governance has to start from the simplest act of all: naming. Much big data arrives nameless or cryptically named. Names, relationships, and boundaries of use must all be established before the data is put to business use. It should not be forgotten that in the world of traditional data, data modelers labored long and hard to do this work before the data was allowed into the warehouse. Now, data scientists must do it for themselves, on the fly, with every data set that arrives.
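To make the idea of naming before use a little more concrete, here is a minimal sketch in Python. It is only an illustration under stated assumptions: the `FieldDefinition` record, the `catalog` dictionary, and the helper functions are all hypothetical names invented for this example, not part of any real governance tool. It covers names and boundaries of use only; relationships between fields are left out for brevity.

```python
from dataclasses import dataclass


@dataclass
class FieldDefinition:
    """Business meaning attached to one raw column (hypothetical structure)."""
    raw_name: str             # cryptic name as the data arrived, e.g. "c17"
    business_name: str        # agreed, human-readable name
    definition: str           # what the value actually means
    allowed_uses: tuple = ()  # boundaries of use, e.g. ("marketing",)


def ungoverned_fields(raw_columns, catalog):
    """Return the raw columns that lack a complete definition in the catalog."""
    missing = []
    for col in raw_columns:
        entry = catalog.get(col)
        if entry is None or not entry.business_name or not entry.definition:
            missing.append(col)
    return missing


def check_use(raw_columns, catalog, purpose):
    """Raise if any column is undefined or not approved for this purpose."""
    missing = ungoverned_fields(raw_columns, catalog)
    if missing:
        raise ValueError(f"undefined fields: {missing}")
    blocked = [c for c in raw_columns if purpose not in catalog[c].allowed_uses]
    if blocked:
        raise PermissionError(f"fields not approved for {purpose!r}: {blocked}")
    return True
```

A data scientist might call `check_use(columns, catalog, "marketing")` before starting an analysis. The point of the sketch is simply that the check fails loudly until someone has done the naming work.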
New tools are beginning to emerge, of course, that emphasize data governance and simplify and automate the process. What these tools do is re-create meaning and structure in the data. They differentiate between data that is suitable for one purpose and data that is totally inappropriate for another. And once you start that process, your data is no longer undifferentiated lake water; it has been purified and processed, drawn from the lake and bottled for a specific use.
I’ll be discussing “Drowning not Waving in the Data Lake” in more detail at Strata New York, on 16 October, as well as moderating a panel discussion “Hadoop Responsibly with Big Data Governance” with Sunil Soares, author of several books on data governance, Joe DosSantos of EMC Consulting, and Jay Zaidi, Director of Enterprise Data Management at Fannie Mae, sponsored by Waterline Data Science. Do join me at both of these sessions!