With terabytes, even petabytes, fast becoming a comfortable benchmark for corporate data stores, many companies are fixated on the volume and variety of "big data."
Traditional relational data warehouse and BI initiatives exploit well-defined practices and mature product offerings to aggregate a fraction of available data in a highly structured and ordered fashion. Their intent is to allow business users to slice and dice data and create reports that deliver answers to generally known questions around what is happening and why it's happening. For example, "What are sales for a particular region during a specified time period?"
The notion of applying analytics to big data shakes up these well-established conventions, opening the door for companies to uncover patterns in information, pose questions they may not have otherwise considered and, ultimately, establish strategies designed to deliver a competitive edge.
"Over the last 20 to 25 years, companies have been focused on leveraging maybe up to 5% of the information available to them," said Brian Hopkins, a principal analyst at Forrester Research Inc. in Cambridge, Mass. "Everything we didn't know what to do with hit the floor and fell through the cracks. In order to compete well, companies are looking to dip into the rest of the 95% of the data swimming around them that can make them better than anyone else."
Traditional RDBMS struggles to keep up
Unlike the aggregated, mostly structured transactional data that makes up traditional relational database management system (RDBMS) efforts, big data stores are flooded with data coming from a variety of sources, including the constant stream of chatter on social media venues like Facebook and Twitter, daily Web log activity, Global Positioning System location data and machine-generated data produced by barcode readers, radio frequency identification scans and sensors.
Stamford, Conn.-based Gartner Inc. estimates that worldwide information volume is growing at an annual clip of 59%, and while there are hurdles around dealing with the volume and variety of data, there are equally big challenges related to velocity, or how fast the data can be processed to deliver any benefit to the business.
This notion of extreme data management has put a strain on traditional data warehouse and BI systems, which are not well suited, either economically or in terms of performance, to handle the massive volume and velocity requirements of so-called big data applications.
"There is a paradigm shift in terms of analytics as to how you use this finer-grained data to come up with more accurate or new types of analyses," said David Menninger, vice president and research director for Ventana Research Inc., a San Ramon, Calif.-based consultancy specializing in data management strategies.
Big data analytics, bigger data processing
Key to the paradigm shift is a host of new technologies designed to address the volume, variety and velocity challenges around big data analytics. At the heart of the new movement is massively parallel processing (MPP) database technology, which automatically splits database workloads across multiple commodity servers and runs them in parallel to garner significant performance boosts when working across extremely large data sets.
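To make the split-and-combine idea concrete, here is a minimal Python sketch. It is not how any particular MPP product works: threads stand in for commodity servers, the round-robin partitioning is an assumption chosen for simplicity, and the names partition, local_sum and parallel_sales_by_region are hypothetical.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def partition(rows, n_workers):
    """Deal rows out round-robin so each simulated node gets a disjoint shard."""
    shards = [[] for _ in range(n_workers)]
    for i, row in enumerate(rows):
        shards[i % n_workers].append(row)
    return shards


def local_sum(shard):
    """Work done independently on one node: total sales per region in its shard."""
    totals = Counter()
    for region, amount in shard:
        totals[region] += amount
    return totals


def parallel_sales_by_region(rows, n_workers=4):
    """Run the per-shard aggregation in parallel, then merge the partial totals."""
    shards = partition(rows, n_workers)
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        partials = list(pool.map(local_sum, shards))
    combined = Counter()
    for partial in partials:
        combined.update(partial)  # Counter.update adds counts key by key
    return dict(combined)


if __name__ == "__main__":
    sales = [("east", 10), ("west", 5), ("east", 7), ("north", 3), ("west", 1)]
    print(parallel_sales_by_region(sales, n_workers=2))
```

Each shard is aggregated without reference to the others, and only the small partial totals are brought together at the end. That is the property that lets this style of query spread across many machines.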
Building on this core architecture are columnar databases, which store data in columns as opposed to rows, shrinking the amount of storage space required while greatly accelerating how quickly a user can ask questions of data and get results. In-database analytics, built specifically for large-scale analytics and BI workloads, is yet another technology that companies are evaluating to efficiently and economically provide fast query response and deliver access to larger amounts of detailed data. The same goes for data warehouse appliances, which are integrated, packaged sets of server, storage and database technology tuned to the needs of large-scale data management applications.
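The difference between row and column storage can be illustrated with a small Python sketch. The sample table and the helper names to_columns and run_length_encode are assumptions made for illustration, and run-length encoding is just one simple compression scheme that benefits from a column layout; real columnar products use their own formats.

```python
def to_columns(rows):
    """Pivot a list of row dicts into one list of values per column."""
    columns = {}
    for row in rows:
        for name, value in row.items():
            columns.setdefault(name, []).append(value)
    return columns


def run_length_encode(values):
    """Collapse consecutive repeats into [value, count] pairs."""
    encoded = []
    for value in values:
        if encoded and encoded[-1][0] == value:
            encoded[-1][1] += 1
        else:
            encoded.append([value, 1])
    return encoded


# Hypothetical sales table, stored row by row and sorted by region.
rows = [
    {"region": "east", "product": "widget", "amount": 10},
    {"region": "east", "product": "gadget", "amount": 7},
    {"region": "west", "product": "widget", "amount": 5},
    {"region": "west", "product": "gadget", "amount": 3},
    {"region": "west", "product": "widget", "amount": 1},
]

columns = to_columns(rows)

# A query such as "total sales" reads only the amount column,
# not every field of every row.
total_sales = sum(columns["amount"])

# Repeated values sit next to each other in a column, so they compress well.
compressed_region = run_length_encode(columns["region"])
```

In the row layout, answering the same question means walking through every record and skipping past the fields that are not needed. The column layout reads one contiguous list and ignores the rest.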
While these capabilities tend to play off core RDBMS technology, there's a relative newcomer garnering a significant amount of attention when it comes to big data analytics. Hadoop, with roots in the open source community, is one of the most widely heralded new platforms for managing big data, particularly the flood of unstructured data like text, social media feeds and video. Along with its core distributed file system, Hadoop ushers in new technologies, including the MapReduce framework for processing large data sets on computer clusters, as well as related open source projects such as the Cassandra scalable multi-master database and the Hive data warehouse.
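The MapReduce pattern itself is simple enough to sketch in a few lines of plain Python. The example below is not the Hadoop API and runs on a single machine; the function names map_phase, shuffle, reduce_phase and word_count are hypothetical and show only the shape of the computation, using the familiar word-count exercise.

```python
from collections import defaultdict


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one document."""
    for word in document.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle: group all emitted values by key so each key is reduced once."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: collapse the values collected for one key into a single result."""
    return key, sum(values)


def word_count(documents):
    """Chain the three phases over a collection of documents."""
    pairs = (pair for doc in documents for pair in map_phase(doc))
    grouped = shuffle(pairs)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    print(word_count(["big data", "Big analytics"]))
```

On a cluster, the map calls would run on the machines that hold each piece of the data and the reduce calls would be spread across machines by key. The framework's job is to handle that distribution, along with the shuffle in between.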
Hadoop attracts attention for big data uses
According to a new Ventana Research benchmark report, titled "Hadoop and Information Management," more than half of the 163 survey respondents (54%) are currently using or evaluating Hadoop as part of big data initiatives, driven by their desire to analyze data at a greater level of detail and perform analytics that couldn't be done previously on large volumes of data. However, despite this early wave of enthusiasm, the reality is that Hadoop is still relatively immature, and there are only a handful of packaged analytics tools available today designed to shield users from the inherent complexities of the open source technology.
While new technologies present a challenge to data warehouse professionals, the real issue for firms looking to make a go at big data analytics is getting past the emphasis on sheer volume. Those that don't risk missing out on the broader opportunity for the business.
"The volume aspect has been around for a long time, but it's really more about extreme data," said Yvonne Genovese, a vice president and distinguished analyst at Gartner. The ability to turn data into information and information into action is where IT and data warehouse executives can really make their mark, she said.
ABOUT THE AUTHOR
Beth Stackpole is a freelance writer who has been covering the intersection of technology and business for 25-plus years for a variety of trade and business publications and websites.