The “big data” problem is turning into a big battleground for vendors, and the skirmish at hand is over who can provide the functionality to store and analyze the complex, voluminous types of data that businesses hope to exploit.
In the last few years, vendors have rolled out big data platforms and appliances to enhance warehousing and analytics programs for businesses. Some have integrated the popular Apache Hadoop, an open source data storage and processing system for massive data sets, into their offerings. Hadoop may be an answer for big data’s volume and velocity, two of three characteristics typically used to define the term, but the final characteristic, variety, is a tougher nut to crack, especially when it comes to analysis.
“Sometimes when people talk about big data, they’re talking about truly large volumes of data, and that’s where things like Hadoop can help you,” said Yvonne Genovese, a business applications analyst at Stamford, Conn.-based Gartner Inc. “In most cases, when clients are looking at Web and social data, they’re not necessarily as concerned about the volume as they are about the type of data that’s coming across.”
That’s one of the reasons Hewlett-Packard’s soon-to-be-released big data platform is raising a few eyebrows. The HP Next Generation Information Platform uses technology from two companies HP bought earlier this year: Vertica Systems, acquired in March, and Autonomy, acquired in October. Combining Vertica’s analytical database for structured data with Autonomy’s tools for unstructured data in the same environment could solidify a place for HP on the big data battlefield and raise the bar on vendor offerings.
Data variety and the database
For years, vendors built and sold databases for an environment that neatly packages data into columns and rows.
“Traditionally we’ve been able to manage very structured data -- data that fits well in the database, that can be converted to an Excel spreadsheet or a BI [business intelligence] tool,” Genovese said.
These days, the onslaught of unstructured data -- text, voice recordings, videos, images -- has complicated things, she said. While it’s possible to store unstructured data in a relational database or pull that data out for analysis, doing so can be tedious and inefficient, especially for businesses seeking the holy grail of real-time analytics.
A relational database stores unstructured data in its entirety but lacks the processing power to look across different pieces of content, said David Menninger, a former Vertica employee and research director specializing in analytics and BI for San Ramon, Calif.-based Ventana Research Inc. Alternatively, unstructured content can be transformed into structured data so that it fits into a relational database’s columns and rows, where the content is summarized or aggregated, but that means potentially missing out on certain details.
As Menninger explains it: “It’s kind of like having an abstract of a white paper. If you don’t have the white paper itself, you can’t see if it says something that’s not in the abstract.”
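Menninger’s abstract-versus-white-paper point can be illustrated with a small sketch. It is not drawn from any vendor’s product; the `ReviewRow` columns, the `to_row` helper and the sample reviews are all hypothetical. Each piece of text is reduced to a fixed set of columns, and a question nobody planned for can only be answered by going back to the raw text.

```python
from dataclasses import dataclass


@dataclass
class ReviewRow:
    """Hypothetical structured summary of one free-text review."""
    review_id: int
    word_count: int
    mentions_price: bool


def to_row(review_id, text):
    """Reduce raw text to the few columns chosen in advance."""
    words = text.lower().split()
    return ReviewRow(review_id, len(words), "price" in words)


# Made-up sample data, for illustration only.
raw_reviews = {
    1: "The price is fair but the battery died after two days",
    2: "Great screen and fast shipping",
}

rows = [to_row(rid, text) for rid, text in raw_reviews.items()]

# A question the columns anticipated can be answered from the rows alone.
price_ids = [r.review_id for r in rows if r.mentions_price]

# A question nobody anticipated ("who mentioned the battery?") cannot:
# the rows carry no such column, so the raw text is still needed.
battery_ids = [
    rid for rid, text in raw_reviews.items()
    if "battery" in text.lower().split()
]
```

The rows play the role of the abstract: compact and easy to query, but silent on anything that was not turned into a column.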
That’s one of the reasons Hadoop has become so popular. Hadoop cuts the time it takes to load structured and unstructured data, which it can store in a raw or structured format, in part because of its Hadoop Distributed File System (HDFS), according to James Kobielus, a senior analyst for Forrester Research Inc. in Cambridge, Mass. HDFS breaks up large data sets and spreads the pieces across different servers.
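The idea of breaking a data set into pieces and spreading them across servers can be sketched in a few lines. This is a toy model under stated assumptions, not HDFS code: the function names, the tiny block size, the server names and the round-robin placement are all invented for illustration. Real HDFS uses far larger blocks and its own placement policy.

```python
def split_into_blocks(data, block_size):
    """Cut a byte string into fixed-size blocks; the last may be shorter."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def place_blocks(num_blocks, servers, replicas=2):
    """Assign each block to `replicas` distinct servers, round-robin.

    Round-robin is an assumption made for this sketch only.
    """
    if replicas > len(servers):
        raise ValueError("need at least as many servers as replicas")
    placement = {}
    for block_id in range(num_blocks):
        placement[block_id] = [
            servers[(block_id + r) % len(servers)] for r in range(replicas)
        ]
    return placement


# Ten bytes stand in for a large file; 4-byte blocks keep the output small.
data = b"0123456789"
blocks = split_into_blocks(data, block_size=4)
placement = place_blocks(len(blocks), ["node-a", "node-b", "node-c"])

# Reading the file back means fetching the blocks in order and joining them.
restored = b"".join(blocks)
```

Because every block lives on more than one server in this sketch, losing a single server would not lose any block, and different servers can work on different blocks at the same time.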
Data variety and analytics
While big data appliances and platforms have the technology to manage the volume, the velocity and even the storage needs for the variety of data businesses may be facing, performing analytics on different data types is another story.
“What you need optimally for full support of the vast variety of new data types is a broad range of connectors, text analytics, natural language processing embedded into appliances,” Kobielus said. “And you need a rich metadata layer to manage both -- so you can manage the metadata related to sentiment and all that stuff coming from Twitter and so forth.”
Vendors have provided connectors to unstructured content and even content extract, transform and load (ETL) tools, “but that’s a bolt-on to their architectures,” Kobielus said.
“There are not a whole lot of competitors yet who, from the very start, are bringing really strong text analytics into an appliance platform plus a structured data form … for scale out,” he said.
That’s what makes HP’s new information platform unique. With the Autonomy software, HP has placed a priority on analyzing unstructured content, a decidedly different approach from that of other vendors’ products.
“Most vendors are focusing on volume and velocity, and, to a lesser extent, they’re also focusing on variety, but it’s not their core focus,” Menninger said. “Variety is central to the premise of Autonomy.”
Autonomy indexes unstructured data as it arrives while keeping content available for any additional processing; a key strength is its natural language processing, enabling users to search and retrieve information using a search-based interface, he said.
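The general pattern described here, indexing text as it arrives while keeping the original content on hand, can be shown with a generic inverted index. This is a minimal sketch and makes no claim about how Autonomy’s software works internally; the `TinyIndex` class, its methods and the sample documents are hypothetical, and plain keyword matching stands in for real natural language processing.

```python
from collections import defaultdict

PUNCTUATION = ".,!?"


class TinyIndex:
    """Toy inverted index that also keeps every original document."""

    def __init__(self):
        self.docs = {}                    # doc_id -> untouched original text
        self.postings = defaultdict(set)  # term -> ids of docs containing it

    def add(self, doc_id, text):
        """Index a document on arrival; the raw text is kept as-is."""
        self.docs[doc_id] = text
        for word in text.lower().split():
            self.postings[word.strip(PUNCTUATION)].add(doc_id)

    def search(self, query):
        """Return sorted ids of documents containing every query term."""
        terms = [w.strip(PUNCTUATION) for w in query.lower().split()]
        if not terms:
            return []
        hits = set.intersection(*(self.postings.get(t, set()) for t in terms))
        return sorted(hits)


# Made-up sample content, for illustration only.
index = TinyIndex()
index.add(1, "Support call about a late delivery.")
index.add(2, "Tweet praising fast delivery!")

both = index.search("delivery")
late_only = index.search("late delivery")
none_found = index.search("refund")
```

Since `docs` still holds the full text, any later processing can run against the original content rather than against the index alone.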
Even so, Menninger, for one, is a bit skeptical about HP’s announcement.
He notes that Hadoop was not mentioned in the announcement, which he calls “a mistake” considering the current momentum behind it. (Vertica offers a column-oriented analytical database, which Menninger said can handle big data’s volume and velocity.) He said HP is playing catch-up in the data warehousing market, where he still sees significant activity. And, while Menninger doesn’t question the support needed for unstructured data, he questions whether consumers will buy into HP’s approach.
“[HP] is coming at this problem from, primarily, the unstructured side of the equation. That’s where I think the market is going to have a little bit of difficulty digesting what they’re offering,” he said. “I still see structured data as driving most of the activity. Unstructured data is rising in importance, and the evolution will be around how do you add unstructured to structured data.”
But Kobielus welcomed the product to the market, saying the release could “address an underserved segment in the big data appliance space that’s evolving.”
“To bring together two hemispheres -- the data mining and the text mining -- into an integrated architecture is really great,” he said. “They’re at the forefront and slightly ahead of the market.”
Kobielus believes customers are interested in taking a closer look at this kind of architecture, and that vendors will head down a similar path in the next few years. Doing so may mean bringing a bigger Hadoop footprint -- for things such as HDFS -- into their core products, he said, noting that the prediction is somewhat speculative for now.