The traditional analytics framework is broken, according to Simon Zhang, director of business analytics at Mountain View, Calif.-based LinkedIn Corp., and featured speaker at a recent Big Analytics 2012 event in Boston.
Zhang backed his position with two reasons: First, the components of the framework are not well-organized and exclude important pieces; second, its structure -- usually depicted as a pyramid -- tends to break the analytics process into segments, which can fracture the business.
"LinkedIn came up with a new framework," he said. "It's not a pyramid; it's a diamond."
The design emphasizes more and easier access to data as well as greater team unity. Both are seen as important characteristics by Zhang and his colleagues since they aim to find meaningful patterns in their data. But LinkedIn, like many companies that sell a data product, is well beyond the average business, which is still wondering how to take advantage of the widely discussed yet elusive concepts of "big data" and "data science."
In reality, big data emerged years ago and has been building momentum since, but the topic exploded onto the scene in 2011. Last year, Gartner Inc., the IT research group based in Stamford, Conn., included big data in its annual "Hype Cycle for Emerging Technologies" report for the first time. And analysts began surveying the big data terrain, focusing much of their attention on an open source technology called Hadoop.
A year later, the concept of big data can still be a difficult one to grasp. Just pinning down a definition can turn into a daunting task. But while vendors and analysts figure out how far to stretch that label, understanding where big data came from can shed some light on how the industry arrived at this point -- one that just may push businesses to break with tradition.
Big data's roots
The term "big data" has been around for decades. A Quora posting provides an example of its usage dating back to 1987. Almost 10 years later, in 1996, John Mashey, then chief scientist at Silicon Graphics Inc., gave a talk called "Big data and the next wave of 'infrastress.'"
"[Infrastress is] stress on the infrastructure of computing," he said in a 1999 interview with Government Computer News. "It's what happens when technologies move at different speeds and put stress on the parts that aren't moving so fast."
In his presentation, Mashey explained that CPUs, memory and disk space were advancing faster than other aspects of computing, such as bandwidth and file systems. This disparity can create bottlenecks and instability, and it can force businesses to find workarounds, he said.
What is 'big data'?
Big data is a general term used to describe the voluminous amount of unstructured, semi-structured and textual data being created on a daily basis. Although big data doesn't refer to any specific quantity, the term is often used when speaking about multiple terabytes to petabytes of data that organizations can analyze in search of potentially valuable business insights and trends.
At the time, Mashey typically described big data as the growth in data volume, pointing to a relatively new data source known as the Internet, and discussed its impact on storage systems. A few years later, Doug Laney, an analyst with META Group Inc. (which was later acquired by Gartner), added to this description.
"It wasn't just about growing volumes," Laney said. "Information management was challenged in a variety of ways."
In his February 2001 commentary, Laney described the complexity of the data landscape as three-dimensional. The volume of data was on the rise, he observed, but so were the velocity and the variety of data -- big data's three V's.
His reference to variety was a way of describing structured data residing in multiple sources. The challenge here, he said, didn't have to do with the amount of data in each source, but instead with how to integrate all of the data together.
Big data's variety has evolved since then to reflect multiple data structures, which have blossomed at an unprecedented rate. In addition to the typical structured data businesses are familiar with, text, images, video, audio files and Web logs have emerged.
Although the initial description has changed, Laney's original observation remains: Data integration is still difficult.
Breaking the v-v-v-ault
Like Mashey, Laney found the Internet -- particularly the groundswell of e-commerce -- to be a significant factor in the changing data environment.
"The lower cost of e-channels enables an enterprise to offer its goods or services to more individuals or trading partners and up to 10 [times] the quantity of data about an individual transaction may be collected -- thereby increasing the overall volume of data to be managed," Laney wrote in 2001.
E-commerce created a kind of new reality for marketers and retailers, according to Peter Fader, co-director of the Wharton customer analytics initiative at the University of Pennsylvania and professor of marketing.
"We were able to all of a sudden see and track all kinds of behavior that before was invisible," he said. "And we had the technology to create databases around it."
The Internet fundamentally changed customer relationship management (CRM) systems, Fader said, and there are parallels between the CRM movement and big data. Businesses regarded the additional information as key to gaining deeper knowledge of their customers. And, thanks to Moore's Law, compute power and storage became cheaper and more accessible, enabling businesses to keep rather than discard data.
"We're just naturally hoarders," he said. "And when you find assets that might be of value -- whether it's arrowheads, real estate, or in this case, data -- we want to grab it all."
The Internet wasn't the only relatively new, promising source of data; sensors, which also owe a nod to Moore's Law, were coming into their own in 2001. Together, the two had a major impact on the rate, or velocity, at which data was being produced, Laney said.
"The frequency of the data that spewed off of devices from point-of-sale systems to RFID scanners to mobile devices was increasing," he said. "We realized that current systems didn't have the capacity to load and process that data within a given window."
Ultimately, businesses are still facing what Mashey described back in 1996 as infrastress: Some technologies are growing more quickly than others. And businesses may need new technologies if they want to take advantage of the data they've collected and new data sources.
In fact, these days, analysts tend to believe that the three-V definition of big data falls short. Gartner, for example, recently released a revised definition of big data -- one that builds on the expanding variety, velocity and volume characteristics it originally came up with.
Specifically, Gartner states that big data will require "new forms of information processing for enhanced insight discovery, decision-making and process automation."
"To claim that it's just about data growth misses the point," Laney said during a recent Webinar presentation. "Its usage to help the business perform or transform is just as -- or more -- important to the definition."
Nicole Laskowski is news editor for SearchBusinessAnalytics.com. Follow her on Twitter @TT_Nicole.