There's no shortage of excitement around open source technology, but what role is it really going to play in the day-to-day work of analytics professionals? In part two of SearchBusinessAnalytics' interview with Cory Isaacson, author of the new book Understanding Big Data Scalability, we learn why he says open source will be big. Today, open source tech is being driven by large companies with lots of resources and experience in the area, which is helping produce cutting-edge tools.
We're seeing a trend in the industry where people are looking for a common big data platform -- I'm thinking Spark or YARN. What's your thought on that approach, and do you think we'll see the industry coalesce behind any one of these big data platforms?
Cory Isaacson: It's fascinating for me to watch, but I don't think it'll be any time soon. I think Spark is the clear leader when it comes to batch analytics. What YARN is trying to do is say, 'OK, let's take the standard HDFS [Hadoop Distributed File System] infrastructure, let's take all the different data engines that exist there and try to have one flavor that meets all these needs, and let's try to expand the capabilities of existing flavors like HBase.'
I know that's what everyone would like to have. There are all these companies that would like to be the next Oracle, but I think it's very unrealistic that we'll see anything like that in the next several years. There are literally hundreds of database engines on the market, and they're all good at something. So, it's a very fragmented market. I think what you're going to see is a polyglot approach for a long time to come.
I wanted to get your take on some of the open source technology out there. A lot of people are really excited about things like R, Hadoop and Spark, as we mentioned. From your perspective, how do you see the role of open source technology in commercial applications?
Isaacson: I think it's only going to become more important because there's a fascinating thing happening in the open source community. I think it started in the database area with MySQL, but that was more of a commercial play. But it's evolved since then. The way it evolved is the giants in the industry who have the most experience, like Yahoo, introduced Hadoop, and now Twitter has introduced different engines and cluster technologies.
There's the Parquet storage format that they added to the Hadoop ecosystem, which was based on work that Google did on Dremel. Facebook has introduced things like PrestoDB. There's just a fascinating array, and the biggest thing about this is that these tools are truly freely licensed from companies that have incredible depth of knowledge. They're really going to drive it now, and I think the open source stack is going to be pushed higher and higher. Even commercial vendors will incorporate it. So, it's definitely going to work itself into the enterprise.
Getting back to your book, what's the 'big data end game' you talk about?
Isaacson: The advantage of the big data cluster is that it's distributed, but that's also a disadvantage. If you have parallel operations, those are slower than single-node operations. So, you want to get to the point where you have the highest probability of operations running on a single node and the lowest probability of running distributed operations. If you have that, you have something that's guaranteed to scale.
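The idea above can be sketched in code. The following is a minimal, hypothetical illustration (not from Isaacson's book): queries that carry a shard key are routed to exactly one node, while queries without one must scatter to every node and gather the results. The shard count, hashing scheme, and function names here are all assumptions made for the example.

```python
import hashlib
from typing import Optional

# Hypothetical sketch: routing queries in a sharded cluster.
# The goal is to maximize single-shard (fast) operations and
# minimize scatter-gather (distributed, slow) operations.

NUM_SHARDS = 4

def shard_for(key: str) -> int:
    """Map a shard key to one shard via a stable hash."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % NUM_SHARDS

def shards_to_query(shard_key: Optional[str] = None) -> list[int]:
    """Return the shards a query must touch.

    With a shard key, the query runs on a single shard (the fast path
    you want most operations to take). Without one, it must fan out to
    every shard and merge the results -- the distributed path to avoid.
    """
    if shard_key is not None:
        return [shard_for(shard_key)]   # single-shard operation
    return list(range(NUM_SHARDS))      # scatter-gather operation
```

A keyed lookup (say, `shards_to_query("user-42")`) touches exactly one shard, while an unkeyed scan (`shards_to_query()`) touches all four; schema and query design that keeps most traffic on the first path is what makes the cluster scale.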