When it comes to externally-sourced data, data scientists are left to pick up the pieces. New tools can help, but let’s also address the deeper issues.
Trifacta presented at the Boulder BI Brain Trust (#bbbt) last Friday, 13 March, to a generally positive reaction from the members. In a sentence, @Trifacta offers a visual data preparation and cleansing tool for (typically) externally-sourced data to ease the burden on data scientists, as well as other power data users, who today can spend 80% of their time getting data ready for analysis. In this, the tool does a good job. The demo showed an array of intuitively invoked methods for splitting data out of fields, assessing the cleanliness of data within a set, correcting data errors, and so on. As the user interacts with the data, Trifacta suggests possible cleansing approaches, based on both common algorithms and what the user has previously done when cleaning such data. The user’s choices are recorded as transformation scripts that preserve the lineage of what has been done and that can be reused. Users start with a sample of data to explore and prove their cleansing needs, with the scaled-up transformations running on Hadoop within a monitoring and feedback loop.
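To make the idea of recorded, reusable transformation scripts concrete, here is a minimal sketch in Python. It is not Trifacta's script format or API, which the presentation summary does not detail. The `Recipe` class, the step functions and the sample rows are all hypothetical, invented purely to illustrate the pattern of proving cleansing steps on a sample, keeping their lineage, and replaying them unchanged on the larger data set.

```python
from dataclasses import dataclass, field
from typing import Callable

# Hypothetical row type: one record as a mapping of field name to raw text.
Row = dict[str, str]


@dataclass
class Recipe:
    """An ordered, replayable list of named cleansing steps (illustrative only)."""

    steps: list[tuple[str, Callable[[Row], Row]]] = field(default_factory=list)

    def record(self, description: str, fn: Callable[[Row], Row]) -> "Recipe":
        # Store each choice with a human-readable description so the
        # lineage of what was done to the data is preserved.
        self.steps.append((description, fn))
        return self

    def apply(self, rows: list[Row]) -> list[Row]:
        # Replay every recorded step, in order, on every row.
        cleaned = []
        for row in rows:
            for _, fn in self.steps:
                row = fn(row)
            cleaned.append(row)
        return cleaned

    def lineage(self) -> list[str]:
        return [description for description, _ in self.steps]


def split_name(row: Row) -> Row:
    # Split a combined "name" field into separate first and last names.
    first, _, last = row["name"].partition(" ")
    return {**row, "first": first, "last": last}


def normalise_city(row: Row) -> Row:
    # Correct inconsistent whitespace and capitalisation.
    return {**row, "city": row["city"].strip().title()}


recipe = (
    Recipe()
    .record("split name into first/last", split_name)
    .record("normalise city", normalise_city)
)

# Prove the steps on a small sample first...
sample = [{"name": "Ada Lovelace", "city": " london "}]
cleaned_sample = recipe.apply(sample)

# ...then reuse the same recorded steps on the full data set.
full = sample + [{"name": "Alan Turing", "city": "MANCHESTER"}]
cleaned_full = recipe.apply(full)
```

In a real deployment the replay step would run on a cluster rather than in a Python loop, but the design point is the same: the recipe, not the cleaned output, is the reusable artefact.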
This is clearly a useful tool for the data scientist and power user that tackles a persistent bottleneck in the journey from data to insight. It also prompts discussion on the process that should exist around the ingestion and use of external data.
There is a persistent desire to reduce the percentage (to zero if possible!) of time spent by data scientists in preparing and cleansing data. Yet, if we accept that such practitioners are indeed scientists, we should recognize that in “real” science, most of the effort goes into experiment design, construction and data gathering/preparation; the statistical validity and longer-term success of scientific work depend on this upfront work. Should it be different with data scientists? I believe not. The science resides in the work of experimentation and preparation. Of course, easing the effort involved and automating reuse is always valid, so Trifacta is a useful tool. But we should not be fooled into thinking that the oft-quoted 80% can or should be reduced to even 50% in real data science cases. And among power users, their exploration of data is also, to some degree, scientific research. Preparation and discovery are iterative and interdependent processes.
What is often further missed in the hype around analytics is that after science comes engineering: how to put into production the process and insights derived by the data scientists. While there is real value in the “ah-ha” moment when the unexpected but profitable correlation (or even better, in a scientific view, causation) is found, the longer-term value can only be wrought by eliminating the data scientists and explorers, and automating the findings within the ongoing processes of the business. This requires reverting to all the old-fashioned procedures and processes of data governance and management, with the added challenge that the incoming data is, almost by definition, dirty, unreliable, changeable, and a list of other undesirable adjectives. The knowledge of preparation and cleansing built by the data scientists is key here, so Trifacta’s inclusion of lineage tracking is an important step towards this move to production.