As businesses try to derive useful insights from pools of big data, many are finding that balancing data size against analytical modeling needs can be a tricky process. But doing so is crucial to the success of big data analytics projects.
At one end of the spectrum is Facebook. Speaking at the 2014 Big Data Innovation Summit, held in Boston in September by The Innovation Enterprise Ltd., Facebook data scientist Mario Vinasco said the size of big data repositories at the social networking company can hinder analytics efforts. Facebook isn't lacking in big data volume: It collects thousands of data points on millions of users. Analyzing all of that at once is impossible, Vinasco said.
So, for a recent project that sought to determine the increase in social interactions elicited by a new feature, he extracted a sample of data on just 100,000 users. Some big data proponents argue that extremely large data sets eliminate the need for that kind of basic statistical modeling practice by enabling analysts to generate true population statistics rather than estimates. But Vinasco said Facebook has so much data spread across so many different kinds of databases that what would be a simple query in a relational database isn't feasible. As a result, he added, limiting the scope of an analysis through the use of data sampling techniques can actually be helpful.
"In the world of these big, big data sets, you need to work backward toward something specific," Vinasco said.
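The sampling approach Vinasco describes can be sketched in a few lines. This is a hypothetical illustration, not Facebook's actual pipeline; the population size, seed and helper function are invented for the example:

```python
import random

def sample_users(user_ids, k=100_000, seed=42):
    """Draw a simple random sample of k users from the full population.

    A fixed seed keeps the sample reproducible across reruns, so an
    analysis can be repeated on exactly the same subset.
    """
    rng = random.Random(seed)
    return rng.sample(user_ids, k)

# Hypothetical population of 5 million user IDs
population = list(range(5_000_000))
sample = sample_users(population)

# Metrics computed on the sample (e.g., average interactions per user)
# serve as estimates of the corresponding population values.
```

The point of working backward from a specific question is that a well-drawn random sample of this size supports the same statistical inferences as the full data set, at a fraction of the query cost.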
At the other end of the spectrum are the many businesses that lack the data needed to answer key business questions. For those organizations, acquiring new data types and building out an analytics infrastructure is often a prerequisite to developing effective analytical models.
For example, when Consumers United Inc., an online insurance agency that operates under the name Goji, first launched its website in 2007, it wasn't a big data user. Sean Parenti, manager of strategy and analytics at the Boston-based company, said it received customer leads from a third-party service and managed the data in spreadsheets. The analytics team then ran simple algorithms on the data to determine the likely cost per acquisition of each lead.
"This kind of strategy was very cumbersome, and the number of man-hours to keep it afloat was terrifying," Parenti said.
Now Goji gathers a greater amount of Web data on customer leads and runs it all through an internally built analytics platform to calculate not just the cost of acquiring customers but also their expected lifetime value to the company.
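The two metrics Goji computes can be illustrated with a deliberately simplified sketch. The formulas, dollar figures and parameter names below are hypothetical placeholders, not the company's actual model:

```python
def cost_per_acquisition(marketing_spend, customers_acquired):
    """Average cost of winning one customer from a campaign."""
    return marketing_spend / customers_acquired

def lifetime_value(avg_annual_premium, margin_rate, expected_years):
    """Toy expected lifetime value: annual margin times expected tenure."""
    return avg_annual_premium * margin_rate * expected_years

cpa = cost_per_acquisition(50_000, 400)   # $125 to acquire each customer
ltv = lifetime_value(1_200, 0.15, 6)      # $1,080 expected lifetime margin
profitable = ltv > cpa                    # acquire only while LTV exceeds CPA
```

Comparing the two numbers is what makes the richer data useful: a lead can look expensive on a cost-per-acquisition basis yet still be worth pursuing if its expected lifetime value is high enough.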
Even organizations with huge stores of big data may not have enough to answer the questions they want to address. Tim Brooks, a software engineer at the Staples Innovation Lab, a San Mateo, Calif.-based unit of Staples Inc. that is building e-commerce and customer analytics systems for the office supply retailer, said he has a tremendous amount of data at his disposal. But during a project to model how customers would respond to price changes, Brooks found that he didn't have enough.
Numerous factors can affect a customer's willingness to buy products at certain prices, and Brooks wanted to model all of them. The problem, he said, is that the more factors a predictive model considers, the more data it needs. Brooks had historical sales data and was looking at factors such as customer demographics and income. But there were holes in the data: for example, certain customer segments had no data at all on some days of the week. The shortfall was eventually resolved by collecting more data on customers' Web browsing activities, Brooks said.
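The "more factors, more data" problem Brooks ran into can be made concrete with a back-of-the-envelope calculation. The function and its numbers are an illustrative assumption, not the Staples model: it treats each factor as having a handful of discrete levels and requires a minimum number of observations per combination:

```python
def min_observations(num_factors, levels_per_factor, obs_per_cell=30):
    """Rough data requirement for a model that cross-tabulates factors:
    every combination of factor levels (a "cell") needs enough
    observations to estimate customer response reliably."""
    cells = levels_per_factor ** num_factors
    return cells * obs_per_cell

min_observations(2, 5)   # 2 factors, 5 levels each -> 750 observations
min_observations(5, 5)   # 5 factors -> 93,750 observations
```

Because the requirement grows exponentially with the number of factors, gaps like missing customer segments on certain weekdays appear quickly, which is why Brooks had to widen collection to Web browsing data rather than rely on historical sales alone.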
Brooks' co-worker Courosh Mehanian, a senior data scientist at the Innovation Lab, said that by incorporating additional data types that truly describe customers and their intentions, it's possible to deliver valuable results to both Staples and its customers: Staples wins through increased sales, and customers win through a more personalized experience.
"We have lots and lots of data, [on] millions of monthly visits to the sites and people making purchases. What we can come up with turns out to be very useful and provides value for the customers and for us," Mehanian said.