Several years ago, Cabela’s Inc., a sporting and outdoor gear retailer, ventured into “big data” territory by starting to pull clickstream data from its website into a relational database.
“This kind of data helps us get into the mind of the customer,” said Dean Wynkoop, manager of data management and insights at Sidney, Neb.-based Cabela’s. “We start seeing the relationships that exist.”
Just taking in and storing the clickstream data poses challenges for Cabela’s, but what’s really prompting the retailer to edge into big data technology is the difficulty of efficiently loading, extracting and querying that data for analytical purposes. And Cabela’s is not alone: The additional complexity introduced by large data sets and new, often unstructured, data sources is changing how a growing number of businesses manage their data and the analytics process.
The term big data itself can be confusing. Depending on who’s doing the talking, the meaning can shift from ballooning data volumes of all sorts to a focus strictly on unstructured information, such as Web server logs and statistics or text from social media sites. However, analysts and IT vendors typically define the phrase as data that is marked by its volume, variety and velocity; some also add a fourth dimension, such as complexity or variability, to the list.
Variety makes for a volatile big-data mix
Yvonne Genovese, a business applications analyst at Stamford, Conn.-based Gartner Inc., said she is finding the type of data, or the variety component, to be the most pressing big data analytics concern for her clients. Structured data fits neatly into a database and can be easily extracted into an Excel spreadsheet or a business intelligence (BI) tool for analysis, Genovese said. But performing analytics on unstructured data is more complex.
“Social data doesn’t fit well into a database. It doesn’t transfer easily into something like Excel,” she said. “[Businesses] want to analyze that, but the traditional BI tools on the market today don’t do text. Excel doesn’t do text.”
A third category, semi-structured data, also poses new challenges for companies, according to Genovese. “It’s a hybrid use case,” she said. “It can be both structured data and content, meaning that [businesses] want to see how many ‘likes’ they had for a particular product on Facebook … but they also want to see what the feedback is.”
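The hybrid case Genovese describes can be pictured with a small Python sketch. It is illustrative only: the post records, the field names (`product`, `likes`, `comment`) and the `summarize` helper are assumptions made for this example, not the format of any real social media feed. The like counts are structured values that can simply be totaled, while the comments are free text that has to be set aside for separate analysis.

```python
import json

# Hypothetical social posts; the field names are illustrative, not a real API.
RAW_POSTS = """
[
  {"product": "tent", "likes": 12, "comment": "Easy to pitch, but the zipper sticks."},
  {"product": "tent", "likes": 5, "comment": "Kept us dry all weekend."},
  {"product": "reel", "likes": 8, "comment": "Smooth drag."}
]
"""

def summarize(raw):
    """Total the likes and collect the comment text for each product."""
    summary = {}
    for post in json.loads(raw):
        entry = summary.setdefault(post["product"], {"likes": 0, "comments": []})
        # Structured part: a number that can be summed like any database column.
        entry["likes"] += post["likes"]
        # Content part: free text, kept as-is for later text analysis.
        entry["comments"].append(post["comment"])
    return summary

summary = summarize(RAW_POSTS)
```

In this sketch the like totals could be loaded into a spreadsheet or BI tool directly, while the collected comments would still need some form of text analysis before they say anything about customer feedback.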
Cabela’s is not yet mining text as part of its analytics program. And while the clickstream data it’s capturing contains columns of structured information as well as things like URLs, Wynkoop said the retailer is still primarily focused on addressing the first “V” in the big data definition: volume.
To create a targeted email marketing campaign, for example, Cabela’s needs to load the most recent clickstream data into its database, convert the information into a usable format for its statisticians and then feed their analytical findings into the campaign management process. With the volume of information involved, it can take days to complete that process using the retailer’s existing technology.
And that can have business consequences. “The longer you wait to send a targeted email, the more irrelevant it will be,” Wynkoop said. “If you can come into more real-time feeds with efficient methods of extracting the information, then you can cut the cycle time down.”
Big data analytics may require a bigger toolbox
As a result, he added, “we are working on proof of concepts to augment our relational database with a big data architecture.” That includes considering the addition of new tools to the company’s analytics toolkit. For example, Wynkoop said that because the retailer’s statistical analysts are experienced in SQL programming, they’re interested in taking a look at software that uses the open source MapReduce programming framework to break query jobs into pieces, process those pieces in parallel and bring the results back together through SQL. That could give Cabela’s more functionality and the kind of real-time analytics capability he is looking for, Wynkoop noted.
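The split, process and recombine pattern behind MapReduce can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not the software Cabela’s is evaluating: the clickstream rows, the category names and the helper functions are hypothetical, and worker threads on one machine stand in for the separate nodes a real cluster would use. The result is what a SQL analyst would get from a `GROUP BY` count on the category column.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from functools import reduce

# Hypothetical clickstream rows: (visitor_id, product_category).
CLICKS = [
    ("v1", "fishing"), ("v2", "camping"), ("v1", "fishing"),
    ("v3", "hunting"), ("v2", "fishing"), ("v3", "camping"),
]

def split(rows, n_chunks):
    """Break the job into n_chunks pieces of roughly equal size."""
    return [rows[i::n_chunks] for i in range(n_chunks)]

def map_chunk(rows):
    """Map step: count page views per category within one piece."""
    return Counter(category for _, category in rows)

def reduce_counts(left, right):
    """Reduce step: merge two partial counts into one."""
    return left + right

def views_per_category(rows, n_chunks=3):
    """Count views per category by mapping over pieces, then reducing."""
    chunks = split(rows, n_chunks)
    # Threads stand in for cluster nodes in this single-machine sketch.
    with ThreadPoolExecutor(max_workers=n_chunks) as pool:
        partials = list(pool.map(map_chunk, chunks))
    return reduce(reduce_counts, partials, Counter())

totals = views_per_category(CLICKS)
```

Because each piece is counted independently, adding more workers shortens the map step without changing the answer, which is the property that makes the approach attractive for cutting cycle times on large data volumes.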
For Gartner, the analytics process contains four steps, each building off the one before it: descriptive, or what happened; diagnostic, or why did it happen; predictive, or what is likely to happen; and prescriptive, or what should I do.
That breakdown stays the same with big data analytics, Genovese said. But she added that the characteristics of big data installations will influence both the kinds of analytical questions that businesses ask and the tools they’ll need to find the answers. For example, to analyze text-based data, organizations will require tools that can provide capabilities such as natural language processing, text classification and ontology analysis.
In the future, though, companies might not have to tack those platforms onto their existing analytics systems. Genovese said vendors are recognizing the need for hybrid analytics environments and responding with new products or acquisitions signaling their interest in providing that kind of technology to users. Such offerings, she added, could “change the information space dramatically” for supporting big data analytics initiatives.