With data quality a concern for many organizations, data wrangling software is a means of helping them access the right information.
Organizations are collecting reams of data and storing it in data warehouses, and while some of that data is structured, much of it isn't. When an organization wants to make use of that abundance of information and turn it into actionable insights, data wrangling software helps with the process.
Trifacta, founded in 2012 and based in San Francisco, is one of the vendors of data wrangling software.
In the eight years since the company formed, data storage has shifted significantly from on premises to the cloud, and analytics operations themselves have moved to the cloud along with it.
In response, Trifacta recently surveyed 646 data professionals about the challenges of cloud migration and on Jan. 23 issued a report: "Obstacles to AI and Analytics Adoption in the Cloud."
Among the key findings, 66% of respondents said all or most of their analytics, augmented intelligence and machine learning initiatives were running in the cloud, and 60% of executive-level respondents said their company uses data analysis to drive decisions. But, critically, 75% reported they aren't confident in the quality of their data, and 46% said data inaccuracy is halting AI projects.
Joe Hellerstein, a professor of computer science at the University of California, Berkeley, is one of the founders of Trifacta and its CSO. He recently answered questions on topics related to analytics in the cloud.
In Part I of a two-part Q&A, he discusses data wrangling software and the migration of data storage and analytics operations to the cloud -- a trend he's witnessed in recent years. In Part II, Hellerstein goes into more detail about Trifacta's report and how data wrangling software can help alleviate the anxiety surrounding cloud migration.
First, what exactly is data wrangling?
Joe Hellerstein: From our perspective, data wrangling is all the work you do with your data prior to the analysis. It includes things like assessing data quality and data context, so discovering what's in your data and looking at the data quality, and then transforming the data into the format you want. Then it's also the operationalization of the outputs of that process -- being able to take the work that you do in order to be able to assess, understand and transform data and being able to make that a daily, hourly, every minute process that can run as data streams in. It's that whole lifecycle that we call data wrangling.
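The three stages Hellerstein describes (assessing quality, transforming, and operationalizing) can be illustrated with a minimal Python sketch. It is not Trifacta's product or API; the records, field names and function names below are hypothetical and chosen only to show the shape of the lifecycle using the standard library.

```python
from datetime import datetime

# Hypothetical raw records; field names and values are illustrative only.
RAW = [
    {"customer": " Acme ", "amount": "1,200.50", "date": "2020-01-23"},
    {"customer": "Globex", "amount": "", "date": "01/24/2020"},
    {"customer": "", "amount": "300", "date": "2020-01-25"},
]

def profile(records):
    """Assess quality: count blank values per field."""
    missing = {}
    for rec in records:
        for field, value in rec.items():
            if not value.strip():
                missing[field] = missing.get(field, 0) + 1
    return missing

def parse_date(text):
    """Try each assumed date format; return an ISO date string or None."""
    for fmt in ("%Y-%m-%d", "%m/%d/%Y"):
        try:
            return datetime.strptime(text, fmt).date().isoformat()
        except ValueError:
            continue
    return None

def transform(rec):
    """Transform one record into the target format."""
    amount = rec["amount"].replace(",", "").strip()
    return {
        "customer": rec["customer"].strip() or None,
        "amount": float(amount) if amount else None,
        "date": parse_date(rec["date"].strip()),
    }

def wrangle(stream):
    """Operationalize: apply the same transform to each record as it arrives."""
    for rec in stream:
        yield transform(rec)

print(profile(RAW))
for row in wrangle(RAW):
    print(row)
```

Because `wrangle` is a generator, the same logic could be pointed at a file, a queue or any other iterable source, which is one simple way to turn exploratory cleanup into a repeatable process.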
Is Trifacta data wrangling software cloud-based or does it also work with on-premises customers?
Hellerstein: Trifacta is largely cloud-based right now, but we do have a number of on-premises customers as well.
Has that been an evolution? When Trifacta first introduced its data wrangling software, were there more on-premises customers and you've since seen a migration to the cloud?
Hellerstein: Absolutely, we've seen a migration. When we started the company in 2012, our explicit plan was to explore whether to go out first with a big data-centric solution or a cloud solution, and in 2012 universally the answer we got back from enterprises was that they were not going to be in the cloud for this kind of data and that we should align with the big data movement, which is where the activity was at that time. I would say that was true for us until about two years ago, so two to three years ago we started investing, with the help of Google, in cloud solutions. Over the last year, for example, over half of our net new business is coming in as cloud business, and we expect that percentage to grow in the coming year.
As for the cloud, what keeps an organization on an on-premises platform rather than migrating to the cloud -- what's holding that transformation back?
Hellerstein: Some of it is regulatory, especially with some of the largest businesses. Some of the financial and drug and healthcare businesses have regulatory requirements that make it difficult to move to the cloud. It's not that they're not doing it, but they're doing it more slowly. Obviously, having a legacy installed base that's running the business successfully is part of that crossing-the-chasm problem, and an investment has to be made to make the transition, and that transitional investment can be disruptive.
And the cost of that transition can be quite high today -- migrating certain processes into a cloud infrastructure is more expensive than other processes, which is why you see some things moving into the cloud first like net new web-based and app-based applications because there is no cost of transition. Some of it is that tech-debt legacy cost. There are examples as well of customers I know who believe they can do it cheaper than paying the cloud vendors -- that's a minority opinion, but it is out there, and there are certainly companies acting on that opinion today.
Why would a company want to migrate to the cloud if it has a successful on-premises analytics operation?
Hellerstein: I like to use an analogy to when I first started out at the university here. There was a time in the department here when we ran our own modem bank so that people could dial up from home and work on the computers, and at some point we realized that AOL was going to do this better and cheaper and faster. The next thing that happened was some of my colleagues felt it would be really cool to build a data center right here in the department that would be similar to the kinds of data centers that we saw elsewhere -- it'd be in a container off a truck, it would have cooling wired into it and we could have all this compute on site. And again, over time we realized it's not better, cheaper or faster than the cloud.
What you get from the cloud is you get the benefits of concentrated professionalism and scale. It's just the case that Amazon has been working on this for a long time at an enormous scale -- same with Microsoft, same with Google -- and there are very few organizations in the world that can run a more efficient data center than those guys. In fact, there's none. So the only question then is whether your organization's IT operation is cheaper than what the cloud vendors will sell you, because you're almost certainly not being more efficient at the bottom line. The question is really the markup.
Editor's note: This Q&A has been edited for clarity and conciseness.