One of the keys to success in big data analytics projects is building strong ties between data analysts and business units. But there are also technical and skills issues that can boost -- or waylay -- efforts to create effective analytical models for running predictive analytics and data mining applications against sets of big data.
A fundamental question is how much data to incorporate into predictive models. The last few years have seen an explosion in the availability of big data technologies, such as Hadoop and NoSQL databases, offering relatively inexpensive data storage. Companies are now collecting information from more sources and hanging on to scraps of data that in the past they would have considered superfluous. The promise of being able to analyze all that data has increased its perceived value as a corporate asset. The more data, the better -- seemingly.
But analytics teams need to weigh the benefits of using the full assortment of data at their disposal. That might be necessary for some applications -- for example, fraud detection, which depends on identifying outliers in a data set that point toward fraudulent activity, or uplift modeling efforts that aim to segment potential customers so marketing programs can be targeted at people who might be positively influenced by them. In other cases, predictive modeling in big data environments can be done effectively -- and more quickly -- with smaller data sets through the use of data sampling techniques.
Tess Nesbitt, director of analytics at DataSong, a marketing analytics services company in San Francisco, said statistical theorems show that, after a certain point, feeding more data into an analytical model doesn't provide more accurate results. She also said sampling -- analyzing representative portions of the available information -- can shorten model development time, enabling models to be deployed more quickly.
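One common way to see the diminishing returns Nesbitt describes is the standard error of an estimate, which shrinks only with the square root of the sample size. The sketch below is an illustration, not DataSong's method: the population, the 5% conversion rate and the function names are all hypothetical, and the formula ignores the finite population correction.

```python
import math
import random


def standard_error(rate, n):
    """Approximate standard error of a proportion estimated from n sampled records."""
    return math.sqrt(rate * (1.0 - rate) / n)


def sample_rate(population, n, seed=0):
    """Estimate the positive rate from a simple random sample drawn without replacement."""
    rng = random.Random(seed)
    sample = rng.sample(population, n)
    return sum(sample) / n


# Hypothetical population: 200,000 records, 1 = converted, 0 = did not convert.
rng = random.Random(42)
population = [1 if rng.random() < 0.05 else 0 for _ in range(200_000)]
true_rate = sum(population) / len(population)

# Each tenfold increase in sample size cuts the standard error by only about 3.16x.
for n in (1_000, 10_000, 100_000):
    estimate = sample_rate(population, n)
    print(n, round(estimate, 4), round(standard_error(true_rate, n), 5))
```

Under these assumptions, a hundredfold increase in records buys only a tenfold reduction in standard error, which is why a well-drawn sample can often stand in for the full data set during model development.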
Predictive models benefit from surplus data
Still, there's an argument to be made for retaining all the data an organization can collect. DataSong helps businesses optimize their online ad campaigns by doing predictive analytics on what sites would be best to advertise on and what types of ads to run on different sites; for sales attribution purposes, it also analyzes customer clickstream data to determine which ads induce people to buy products. To fuel its analytics applications, the company ingests massive amounts of Web data into a Hadoop cluster.
Much of that data doesn't necessarily get fed directly into model development, but it's available for use if needed -- and even if it isn't, Nesbitt said having all the information can be useful. For example, a large data set gives modelers a greater number of records held out of the development process to use in testing a model and tweaking it for improved accuracy. "The more data you have for testing and validating your models, it's only a good thing," she said.
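The holdout idea can be sketched in a few lines: a sample of records goes to model development, and everything left over is reserved for testing and validation. This is a minimal illustration with hypothetical record IDs, sizes and function name, not a description of DataSong's pipeline.

```python
import random


def split_development_holdout(records, development_size, seed=0):
    """Shuffle a copy of the records and split it into development and holdout sets."""
    rng = random.Random(seed)
    shuffled = list(records)
    rng.shuffle(shuffled)
    return shuffled[:development_size], shuffled[development_size:]


# Hypothetical data set: 100,000 record IDs, of which 10,000 are used to build the model.
records = list(range(100_000))
development, holdout = split_development_holdout(records, 10_000)

# The larger the full data set, the more records remain for testing and validation.
print(len(development), len(holdout))
```

Because the two sets never overlap, accuracy measured on the holdout records reflects how the model behaves on data it has not seen.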
Data quality is another issue that needs to be taken into account in building models for big data analytics applications, said Michael Berry, analytics director at travel website operator TripAdvisor LLC's TripAdvisor for Business division in Newton, Mass. "There's a hope that because data is big now, you don't have to worry about it being accurate," Berry said during a session at the 2013 Predictive Analytics World conference in Boston. "You just press the button, and you'll learn something. But that may not stand up to reality."
Staffing also gets a spot on the list of predictive modeling and big data analytics challenges. Skilled data scientists are in short supply, particularly ones with a combination of big data and predictive analytics experience. That can make it difficult to find qualified data analysts and modelers to lead big data analytics projects.
Analytics skills shortage requires hiring flexibility
Mark Pitts, vice president of enterprise informatics, data and analytics at Highmark Inc., said it's uncommon for data analysts to come out of college with all the skills that the Pittsburgh-based medical insurer and healthcare services provider wants them to have. Pitts looks for people who understand the technical aspects of managing data, have quantitative analysis skills and know how to use predictive analytics software; it also helps if they understand business concepts. But the full package is hard to find. "All of those things are very rare in combination," he said. "You need that right personality and aptitude, and we can build the rest."
Along those lines, a computer engineer on Pitts' staff had a master's degree in business administration but little background in statistical analysis. Highmark paid for the engineer to go back to school to get a master's degree in statistics as well. Pitts said he chose the worker for continuing education support not only because the engineer had some of the necessary qualifications but also because he had a personality trait that Pitts is particularly interested in: curiosity.
At DataSong, Nesbitt typically looks for someone with a Ph.D. in statistics and experience using the R programming language; the company builds its predictive models with R-based software from Revolution Analytics. "To work on our team, where we're building models all the time and we're knee-deep in data, you have to have technical skills," she said.
Ultimately, though, those skills must be put to use to pull business value out of an organization's big data vaults. "The key to remain focused on is that this isn't really a technical problem -- it's a business problem," said Tony Rathburn, a senior consultant and training director at The Modeling Agency, an analytics consultancy in Pittsburgh. "That's the real issue for the analyst: setting up the problem in a way that actually provides value to a business unit. That point hasn't changed, regardless of the amount of data."
Executive editor Craig Stedman also contributed to this story.