Big data environments make large amounts of information available for analysis by data scientists and other analytics professionals. But in many cases, experienced data analysts and consultants say, the key to developing effective analytical models for big data analytics applications is counterintuitive: Think small.
Having pools of big data to dive into doesn't change the basic tenets of analytical modeling for predictive analytics and data mining, according to practitioners such as Michael Berry, analytics director at travel website operator TripAdvisor LLC's TripAdvisor for Business division. In a keynote speech at the 2013 Predictive Analytics World conference in Boston, Berry said patterns and relationships hidden in sets of big data typically can be found by looking at representative samples of the available information, without having to comb through it all.
"I don't tend to end up using very much data [in analytical models]," Berry said. "Patterns reveal themselves pretty quickly. And when you have enough data to spot a pattern, the results don't change if you add more data." Often, he added, he gets better answers on analytical queries "if I look at less data in a shorter time than I do if I spend more time and look at more data."
"Sampling is a powerful thing," agreed Karl Rexer, president of consulting services company Rexer Analytics in Winchester, Mass. In developing an analytical model to predict potential customer churn, an analytics team at a large company might have access to millions of records on hundreds of thousands of customers. "But," Rexer said, "do you need to use all that data? A lot of times, the answer is no."
Small sample size generates big results
Tony Rathburn, a senior consultant and training director at The Modeling Agency LLC in Pittsburgh, usually starts with only about 5,000 data records when he builds predictive models for clients, even if much more information is there for the taking. To identify the customer behavior or other parameters that modelers are looking for, most predictive analytics applications only need to be "hand-grenade close," Rathburn said. A well-chosen selection of data can get you there, he added -- and throwing more data at analytical models without proper attention to sampling might make them less accurate by adding "noise" to the equation.
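The effect Rathburn describes can be illustrated with a minimal sketch. It is not his method or any client's data: the population below is synthetic, the 12% churn rate is an assumption chosen for illustration, and `simple_random_sample` and `churn_rate` are hypothetical helper names. The sketch draws 5,000 records at random from 200,000 and compares the churn rate estimated from the sample with the rate across the full set.

```python
import random


def simple_random_sample(records, sample_size, seed=0):
    """Draw sample_size records uniformly at random, without replacement."""
    rng = random.Random(seed)
    if sample_size >= len(records):
        return list(records)
    return rng.sample(records, sample_size)


def churn_rate(records):
    """Fraction of records flagged as churned."""
    return sum(1 for r in records if r["churned"]) / len(records)


# Synthetic population (an assumption for illustration only):
# 200,000 hypothetical customers, about 12% of whom churn.
rng = random.Random(42)
population = [
    {"customer_id": i, "churned": rng.random() < 0.12}
    for i in range(200_000)
]

sample = simple_random_sample(population, 5_000, seed=1)
full_rate = churn_rate(population)
sample_rate = churn_rate(sample)

print(f"all {len(population):,} records: churn rate {full_rate:.3f}")
print(f"sample of {len(sample):,} records: churn rate {sample_rate:.3f}")
```

For a simple random sample, the standard error of an estimated proportion shrinks with the square root of the sample size, so using 100 times as many records narrows the error by only about a factor of 10. That holds only when the sample is representative, which is why the practitioners quoted here stress careful sampling.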
Storage technology vendor NetApp Inc. automatically collects performance-monitoring data from its products at customer sites. About a petabyte of that data is stored in a Hadoop cluster, and sensors on the devices send in as much as 1 TB of new data each week, said Shiv Patil, a senior data warehouse architect and business analyst at the Sunnyvale, Calif., company's AutoSupport operation. Patil and his colleagues use the data to try to predict equipment failures before they happen so they can prevent outages and minimize disruptions to customers.
But the AutoSupport analytics team builds its predictive models around sample data sets, not the entire data vault. To find the patterns it's looking for, "we don't need to analyze all the data," Patil said. Creating valid samples takes some effort and experimentation -- but once they're in place, he said, going beyond them is "just adding more data" unnecessarily.
Not every big data analytics application lends itself to sampling. Uplift modeling, for example, is a form of predictive analytics aimed at pinpointing potential customers who could be persuaded to buy a product, so that marketing efforts can be targeted at them instead of at people who have already made up their minds one way or the other. Carmichael Lynch, a Minneapolis-based advertising agency, is using an automated analytics service developed by online ad-buying platform vendor Rocket Fuel Inc. to drive an uplift modeling program for client Subaru of America; the service analyzes millions of car dealer transactions and other data records.
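The idea behind uplift modeling can be shown with a deliberately simple sketch. It is not Rocket Fuel's model, whose internals the article does not describe. It assumes that each record notes a customer segment, whether the person saw the marketing message and whether they bought. Uplift is taken as the difference between the conversion rates of the two groups within each segment. The function name, segment labels and records are all hypothetical.

```python
from collections import defaultdict


def uplift_by_segment(records):
    """Return {segment: treated conversion rate - control conversion rate}.

    Each record is a (segment, treated, converted) tuple.
    """
    # Per segment and group, keep [number of people, number of conversions].
    counts = defaultdict(lambda: {"treated": [0, 0], "control": [0, 0]})
    for segment, treated, converted in records:
        group = "treated" if treated else "control"
        counts[segment][group][0] += 1
        counts[segment][group][1] += int(converted)

    uplift = {}
    for segment, groups in counts.items():
        treated_n, treated_conv = groups["treated"]
        control_n, control_conv = groups["control"]
        if treated_n == 0 or control_n == 0:
            continue  # cannot compare without both groups present
        uplift[segment] = treated_conv / treated_n - control_conv / control_n
    return uplift


# Toy records, invented for illustration.
records = (
    [("persuadable", True, True)] * 3
    + [("persuadable", True, False)] * 1
    + [("persuadable", False, True)] * 1
    + [("persuadable", False, False)] * 3
    + [("already_decided", True, True)] * 4
    + [("already_decided", False, True)] * 4
)

scores = uplift_by_segment(records)
for segment, score in sorted(scores.items()):
    print(f"{segment}: uplift {score:+.2f}")
```

In this toy data, the "already_decided" segment buys at the same rate with or without the message, so its uplift is zero and marketing spend there changes nothing. The "persuadable" segment is where the message makes a difference.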
Fill it up with data variables
Rocket Fuel's analytical model scores possible Subaru customers on the basis of about 300,000 different variables that it examines daily -- from ZIP codes and Web browsing activities to factors such as demographic data, gender, ethnicity and local weather patterns. "I don't know if I agree that there's ever a saturation point where you have enough [data]," said Peter Amstutz, an analytics strategist at Carmichael Lynch, after a presentation about the Subaru program at Predictive Analytics World. "Maybe there's another variable out there that could be predictive."
Rathburn, despite his advice to take a "small data" approach to planning and building analytical models, said it's useful to have a full allotment of big data to choose from. "It's akin to a library," he said. "You don't read all the books, but you need access to different books at different times."
And having access to a collection of big data can expand the range of analytical modeling that's feasible to do, even when sampling techniques are being used, said Dean Abbott, president of consultancy Abbott Analytics in San Diego. For example, population data can be sliced down into smaller geographic regions for modeling because there are more records to help smooth out the data and still leave enough to create statistically valid samples. "You can build more complex models, which means you can build more precise models reliably," Abbott said.
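Abbott's point about slicing data by region can be sketched as a stratified sample, under stated assumptions: the records, region names, `per_stratum` size and `min_records` threshold are hypothetical, and the threshold is an arbitrary stand-in for whatever a modeler would consider enough records for a valid sample. Regions with too few records are skipped, which is the constraint a larger data pool relaxes.

```python
import random
from collections import defaultdict


def stratified_sample(records, key, per_stratum, min_records, seed=0):
    """Sample up to per_stratum records from each stratum.

    Strata with fewer than min_records records are skipped, since a
    sample drawn from them would be too thin to model reliably.
    """
    rng = random.Random(seed)
    strata = defaultdict(list)
    for record in records:
        strata[key(record)].append(record)

    samples = {}
    for name, rows in strata.items():
        if len(rows) < min_records:
            continue
        samples[name] = rng.sample(rows, min(per_stratum, len(rows)))
    return samples


# Hypothetical population records spread unevenly across three regions.
records = (
    [{"region": "north", "id": i} for i in range(1000)]
    + [{"region": "south", "id": i} for i in range(600)]
    + [{"region": "island", "id": i} for i in range(20)]
)

regional_samples = stratified_sample(
    records, key=lambda r: r["region"], per_stratum=200, min_records=100
)
for region, rows in sorted(regional_samples.items()):
    print(f"{region}: {len(rows)} sampled records")
```

With a bigger pool of records, a sparse region such as the hypothetical "island" would clear the threshold too, so the same sampling approach could support models at a finer geographic grain.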