As big data continues its ascent up Stamford, Conn.-based IT research firm Gartner Inc.'s well-known hype cycle, a new crop of big data service providers is emerging. Perhaps none is more surprising than MetaScale, which came onto the scene in April and is a wholly owned subsidiary of Sears Holdings Corp.
Based in Hoffman Estates, Ill., MetaScale is a managed big data service provider that operates on a cloud-based model. That means MetaScale can provide as much or as little support as is needed for its clients, specifically for those who know they're ready to take advantage of big data analytics but don't have the infrastructure or skills to do so.
SearchBusinessAnalytics.com news editor Nicole Laskowski talked with Phil Shelley, CEO and founder of MetaScale and CTO of Sears, about big data challenges and market trends. Read a partial transcript of the interview below, and listen to the full podcast to hear more of what Shelley has to say about big data.
What are some of the challenges businesses are facing with big data? Can you break your response into two parts: the management and then the analytics of big data?
Phil Shelley: First of all, on the management side, we're at the threshold now where we can think about managing data in a whole new way. Anyone that's been in IT long enough has known that the Holy Grail was to have all of your data in a single point of truth -- in a single place where the system was large enough to answer any question that comes along. Of course, that never really happened, and we ended up with ETL for copying data, lots of copies of the data, different systems for different purposes and different points of truth. So, data management was always a real headache. That is really changing. … You can now have a data model where you really do have a single point of truth. With all of the transactional detail in your company in one place, with all of the history in one place, [you can] now manage and model and have a data architecture for the enterprise that really makes sense -- and really increases data use. Reuse of data is really important and now possible with these techniques.
Once you have that data in one place, then using it [has] all sorts of new possibilities, because now with Hadoop, we have the ability to keep huge amounts of history. [We can] not only keep it, but actually analyze it without moving it. When you're talking petabytes of data, which we are, you absolutely cannot afford to move it to analyze it. The old way was always to move the data to an analytics platform with ETL. That doesn't work in this modern way of thinking. So, having a platform where you can store the data and then analyze it without moving it is just a tremendous improvement over what we've had in the past.
So, you're bringing the tools to the data rather than moving the data to the tools?
Shelley: There are tools now emerging that are allowing you to do that -- that put a graphical front end and an analytical front end on top of these big data repositories. So you run the query where the data is; you run the analysis where the data is. You don't copy it; you extract only the small snippets of data that you really want, which are the result sets. That is a pretty dramatic, new way of thinking that takes a while for people to get used to.
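To make the contrast Shelley draws concrete, here is a minimal Python sketch of the two approaches. It is an illustration only, not MetaScale's or Hadoop's actual implementation: the in-memory `PARTITIONS` list, the store names and both function names are hypothetical stand-ins. One function copies every row to a central place before aggregating; the other aggregates inside each partition and ships back only the partial result sets.

```python
from collections import defaultdict

# Hypothetical stand-in for data spread across three nodes. The layout
# and the (store, sale_amount) fields are assumptions for illustration.
PARTITIONS = [
    [("store_1", 120.0), ("store_2", 80.0), ("store_1", 30.0)],
    [("store_2", 20.0), ("store_3", 50.0)],
    [("store_1", 50.0), ("store_3", 25.0), ("store_3", 25.0)],
]


def move_then_analyze(partitions):
    """ETL style: copy every row to one place, then aggregate there."""
    copied = [row for part in partitions for row in part]
    totals = defaultdict(float)
    for store, amount in copied:
        totals[store] += amount
    # Every raw row had to travel to the analytics platform.
    return dict(totals), len(copied)


def analyze_in_place(partitions):
    """Aggregate inside each partition; move only the partial results."""
    totals = defaultdict(float)
    rows_moved = 0
    for part in partitions:
        # This inner loop represents work done where the data lives.
        partial = defaultdict(float)
        for store, amount in part:
            partial[store] += amount
        # Only one subtotal per store leaves the partition.
        rows_moved += len(partial)
        for store, subtotal in partial.items():
            totals[store] += subtotal
    return dict(totals), rows_moved


if __name__ == "__main__":
    print(move_then_analyze(PARTITIONS))
    print(analyze_in_place(PARTITIONS))
```

Both functions return the same totals, but the second moves fewer rows. With a toy data set the difference is small; the gap grows with the number of raw rows per partition, which is the point Shelley makes about petabyte-scale data.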
I've been hearing a lot about "the logical data warehouse," "the hybrid data ecosystem" -- the concept where you put data where it best belongs. That's kind of what you're saying as well, right?
Shelley: I am, except a somewhat purer version of that. Some people have said you put data where it belongs, and you have lots of systems with lots of pieces of data because it would be the best place to put it. I am not an advocate of that because of the cost and time of ETL. But I am an advocate of having an ecosystem of tools that makes sense. Hadoop certainly is not the tool if you need high-speed SQL analytics. So you do need an enterprise data warehouse or a high-speed SQL environment to coexist with Hadoop; there's no question. What data goes into which one and when and how it's refreshed needs to be considered carefully so you don't end up with too much data in one and not enough in another. [If that happens] you end up with ETL problems and moving data. Very careful thoughts about enterprise data architecture [are important]. Rationalizing systems down to Hadoop with something else is absolutely necessary. But then, I'm not a big believer in having lots and lots of other [operational data stores] and logical data marts around because it just adds complexity. And as data gets bigger, you can't afford to do that, and you don't need to do it anymore.
This was first published in September 2012