Data management is back on the agenda, finally with a big data flavor.
I’d like to think that my blog post of 10 July, “So, how do you eat a Hadoop Elephant?”, drove Teradata to the acquisitions announced today of Hadapt and, of more interest here, Revelytix. Of course, I do know that the timing is coincidental. However, the move does support my contention that it will be the traditional data warehouse companies that ultimately drive real data management into the big data environment. And hopefully kill the data lake moniker in the process!
To recap, my point two weeks ago was: “The challenge was then [in the early days of data warehousing], and remains now, how to unlock the value embedded in information that was not designed, built or integrated for that purpose. In fact, today’s problem is even bigger. The data in question in business intelligence was at least owned and designed by someone in the business; big data comes from external sources of often unknown provenance, limited explicit definitions and rapidly changing structures. [This demands] defining processes and methodologies for governance, and automating and operationalizing the myriad steps as far as possible. It is precisely this long and seemingly tedious process that is largely missing today from the Hadoop world.”
Revelytix is (or was) a Boston-based startup focusing on the problems of data scientists in preparing data for analytic use in Hadoop. The Revelytix process begins with structuring the incoming soft (or loosely structured) data into a largely tabular format. This is unsurprising to anyone who understands how business analysts have always worked. These tables are then explored iteratively using a variety of statistical and other techniques before being transformed and cleansed into the final structures and value sets needed for the required analytic task. The process and the tasks will be very familiar to anybody involved in ETL or data cleansing in data warehousing. The output—along with more structured data—is, of course, metadata, consisting of table and column names, data types and ranges, etc., as well as the lineage of the transformations applied. In short, the Revelytix tools produce basic technical-level metadata in the Hadoop environment, the initial component of any data management or governance approach.
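For readers who prefer to see the idea in code, here is a minimal Python sketch of what "basic technical-level metadata" might look like: column names, observed data types, value ranges and the lineage of transformations applied. It is purely illustrative and is not based on the Revelytix tools; the names `Table`, `ColumnProfile`, `apply` and `profile`, and the sample records, are all my own assumptions.

```python
from dataclasses import dataclass, field
from typing import Any, Callable


@dataclass
class ColumnProfile:
    """Technical metadata for one column (hypothetical structure)."""
    name: str
    types: set[str]
    minimum: Any
    maximum: Any
    nulls: int


@dataclass
class Table:
    """Loosely structured records held as rows, plus the steps applied so far."""
    rows: list[dict[str, Any]]
    lineage: list[str] = field(default_factory=list)

    def apply(self, step: str, fn: Callable[[dict[str, Any]], dict[str, Any]]) -> "Table":
        # Transform a copy of every row and record the step name as lineage.
        return Table([fn(dict(r)) for r in self.rows], self.lineage + [step])

    def profile(self) -> list[ColumnProfile]:
        # Derive column names, observed types, ranges and null counts.
        names = sorted({k for r in self.rows for k in r})
        profiles = []
        for name in names:
            values = [r.get(name) for r in self.rows]
            present = [v for v in values if v is not None]
            types = {type(v).__name__ for v in present}
            # Only report a range when the values are safely comparable.
            numeric = bool(types) and types <= {"int", "float"}
            comparable = numeric or types == {"str"}
            profiles.append(ColumnProfile(
                name=name,
                types=types,
                minimum=min(present) if comparable else None,
                maximum=max(present) if comparable else None,
                nulls=len(values) - len(present),
            ))
        return profiles


def cast_amount(row: dict[str, Any]) -> dict[str, Any]:
    # Cleansing step: trim whitespace and convert the amount to a number.
    if row["amount"] is not None:
        row["amount"] = float(row["amount"].strip())
    return row


def upper_country(row: dict[str, Any]) -> dict[str, Any]:
    # Cleansing step: normalize country codes to upper case.
    row["country"] = row["country"].upper()
    return row


# Made-up sample data standing in for the incoming "soft" data.
raw = Table([
    {"id": "1", "amount": " 12.50 ", "country": "us"},
    {"id": "2", "amount": "7", "country": "DE"},
    {"id": "3", "amount": None, "country": "de"},
])

clean = (raw.apply("cast amount to float", cast_amount)
            .apply("uppercase country", upper_country))

for column in clean.profile():
    print(column)
print(clean.lineage)
```

Note that everything captured here is technical: it tells you what the columns look like and how they got that way, but nothing about what they mean to the business.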
In my book, “Business unIntelligence”, I proposed, for a variety of reasons, that we should start thinking about context-setting information (or CSI, for short), rather than metadata. A key driver was to remind ourselves that this is actually information that extends far beyond the limited technical metadata we usually think of as coming from ETL. And if I might be so bold as to advise Teradata on what to focus on with their new baby, I would suggest that they place emphasis on the business-related portion of the CSI being created in the world of the data scientists. It is there that the business meaning for external data emerges. And it is there that it must be captured and managed for proper data governance.