RichRelevance Inc. faces one of the prototypical big data challenges: lots of data, and not a lot of time to analyze it. For example, the marketing analytics services provider runs an online recommendation engine for Target, Sears, Neiman Marcus, Kohl's and other retailers. Its predictive models, running on a Hadoop cluster, must be able to deliver product recommendations to shoppers in 40 to 60 milliseconds -- not a simple task for a company that has two petabytes of customer and product data in its systems, a total that grows as retailers update and expand their online product catalogs. "We go through a lot of data," said Marc Hayem, vice president in charge of RichRelevance's service-oriented architecture platform.
It would be easy to drown in all that data. Hayem said that managing it smartly is critical, both to ensure that the recommendations the San Francisco company generates are relevant to shoppers and to avoid spending too much time -- and processing resources -- analyzing unimportant data. The approach it adopted involves whittling down the data being analyzed to the essential elements needed to quickly produce recommendations for shoppers.
The full breadth of the historical data that RichRelevance stores on customers of its clients is used to define customer profiles, which help enable the recommendation engine to match up shoppers and products. But when the analytical algorithms in the predictive models are deciding in real time what specific products to recommend, they look at data on just four factors: the recent browsing history of shoppers, their demographic data, the products available on a retailer's website and special promotions currently being offered by the retailer. "With those four elements, we can decide what to do," Hayem said, adding that data on things such as past purchases, how much customers typically spend and other retailers where they also shop isn't important at that point in the process.
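To make the idea concrete, the narrowing step might look something like the following minimal Python sketch. It is not RichRelevance's code: the field names and sample values are hypothetical, chosen only to mirror the four factors Hayem lists and the data he says is set aside at that point.

```python
# The four real-time inputs named in the article. The key names are
# hypothetical; the article does not describe RichRelevance's schema.
REALTIME_FIELDS = ("recent_browsing", "demographics", "catalog", "promotions")


def realtime_context(full_record):
    """Return only the four inputs consulted at recommendation time."""
    return {key: full_record[key] for key in REALTIME_FIELDS if key in full_record}


# Invented sample record combining everything stored about a shopper and retailer.
full_record = {
    "recent_browsing": ["sku-1", "sku-7"],
    "demographics": {"age_band": "25-34"},
    "catalog": ["sku-1", "sku-7", "sku-9"],
    "promotions": ["sku-9"],
    "past_purchases": ["sku-3"],      # set aside at this step
    "typical_spend": 80.0,            # set aside at this step
    "other_retailers": ["example"],   # set aside at this step
}

context = realtime_context(full_record)
print(sorted(context))
```

The point of the sketch is simply that the scoring step receives a much smaller input than what is stored, which is one way to keep response times short.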
In the age of big data, knowing what information is needed in analytics applications, and what isn't, has never been more important -- or in many cases, more difficult. The sinking cost of data storage and the rise of the Hadoop data lake concept are making it more feasible for organizations to stash huge amounts of structured, unstructured and semi-structured data collected from both internal systems and external sources. But wrong decisions about what to use, what to hold onto for the future and what to jettison can have both immediate and long-term consequences.
Even though a particular data set may seem unimportant now, it could have uses down the line. On the other hand, cluttering up Hadoop systems, data warehouses and other repositories with useless data could impose unnecessary costs and make it hard to find the true gems of information. And not thinking carefully, and intelligently, about the data that needs to be analyzed for particular applications could make it harder to get real business benefits from big data analytics programs.
Know when to say 'when'
In a survey conducted by Capgemini Consulting last November, only 35% of the 226 analytics, IT and business professionals who responded described their big data initiatives as successful or very successful. One of the big reasons, according to a report on the survey, is that most organizations "are far from being able to use [big] data effectively." For example, only 35% of the respondents said their organizations had strong processes for capturing, curating, validating and retaining data, while 79% said they had yet to fully integrate all of their data sources. In addition, the top big data implementation challenges they cited included data silos, a lack of coordination between different groups and ineffective data governance.
Like RichRelevance, The Lucky Group Inc. tries to get value from its analytics efforts by keeping things in context. The Santa Monica, Calif., company publishes Lucky, a magazine that focuses on shopping, and operates several membership-based retail websites tied to the publication. Lucky Group tracks a variety of things. It collects internal data on monthly revenue, product sales and what pages visitors are looking at on its sites, which include JewelMint.com and StyleMint.com. The company also gathers customer data, including what products people buy and how much they spend. It uses Pentaho's data integration and analytics tools to pull the information into a MySQL database and then analyze it.
But when analyzing current sales performance or projecting future demand, Lucky Group's executives and other end users typically don't need all the data that's on hand. The mix of products it sells changes constantly, and customer tastes often change as well. As a result, fresh data is the most valuable, said Jay Khavani, the company's senior manager of business intelligence and data warehousing. "What was relevant in 2010 is not necessarily relevant right now," he noted. "We wouldn't analyze all our data."
Instead of simply dumping data into a central repository for business users and analysts to explore, Lucky Group partitions the information, primarily by year. In addition to producing more relevant results, Khavani said that approach saves time and resources by enabling analyses to be run more quickly than they otherwise might be. But, he added, users can still get what they need in order to make more-informed business decisions -- for example, what products are performing well and how customer preferences have evolved in recent months.
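As a rough illustration of why partitioning by year speeds things up, the following Python sketch groups records by year and then runs an analysis against only the partitions requested. It is an assumption-laden toy, not Lucky Group's Pentaho or MySQL setup: the records, product names and function names are all invented.

```python
from collections import defaultdict
from datetime import date

# Hypothetical sales records: (order_date, product, revenue). Invented data.
SALES = [
    (date(2010, 5, 1), "necklace", 30.0),
    (date(2013, 11, 20), "bracelet", 45.0),
    (date(2014, 2, 3), "bracelet", 50.0),
    (date(2014, 3, 9), "earrings", 25.0),
]


def partition_by_year(records):
    """Group records into one bucket per calendar year."""
    partitions = defaultdict(list)
    for order_date, product, revenue in records:
        partitions[order_date.year].append((order_date, product, revenue))
    return dict(partitions)


def revenue_by_product(partitions, years):
    """Sum revenue per product, reading only the requested year partitions."""
    totals = defaultdict(float)
    for year in years:
        for _, product, revenue in partitions.get(year, []):
            totals[product] += revenue
    return dict(totals)


partitions = partition_by_year(SALES)
recent = revenue_by_product(partitions, years=[2014])
print(recent)
```

Because the analysis touches only the 2014 bucket, the older records are never scanned, which is the time and resource saving Khavani describes. Older partitions remain available if a question about long-term trends does come up.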
Right people, meet right data
Even if you narrow down the types of data you want to look at, though, predictive analytics and data mining applications might not benefit from using the full amount that's left. Speaking at software vendor SAS Institute's 2014 Premier Business Leadership Series conference in Las Vegas last October, Harvard Business School professor Clayton Christensen said he's skeptical about the value of running predictive models against larger and broader data sets. "Big data for big data's sake just gives us more data, and that's not the insight I think we need," he said.
The key to effective predictive modeling is finding the right data to accurately and quickly answer the questions being asked, Christensen added. To make that feasible, he said, organizations should make sure they have skilled data scientists or other experienced analytics professionals who can meticulously aggregate the required data and then build well-designed analytical models to pull out the desired findings in an objective way.
But data scientists can't do it on their own, said Sarah Biller, president of Capital Market Exchange in Boston. The analytics services company provides investment portfolio managers with projections of how corporate bonds will perform and other information on the bond market, based on ongoing analysis of social media posts and business news stories from a list of what it considers to be expert market watchers. It combines the analytical results with more traditional data, like past performance of particular bonds and the market in general, to produce the projections.
To make sense of such diverse data, Biller said Capital Market Exchange has invested in a team of people with specialized data management and analytics skills. The process starts with a data architect who structures the data for analysis. Then a couple of data scientists develop and run the algorithms that analyze the data, using a homegrown system and the R programming language. Next up is a group of business analysts and data visualization specialists who interpret the results and prepare the findings to be presented to the company's clients in a Web-based dashboard.
Manage complexity, not just volume
As Biller's experience shows, there’s more to big data than just volume. The wide variety of data types that many organizations are trying to incorporate into big data analytics applications also makes the job difficult for program managers. In addition to the technical challenges of bringing all that data together and figuring out what to analyze when, organizational issues can complicate the process.
Eugene Kolker, chief data officer at Seattle Children's Hospital, said during a panel discussion hosted by IBM last October that his principal job duty involves managing the complexity created by the need to analyze many different types of data. Like other healthcare providers, Seattle Children’s relies on a multitude of systems in various departments, including electronic health records, laboratory information systems and scheduling applications. Kolker said the systems generate data in different formats, making it a challenge to combine all the information for analysis.
He added that the technical aspects of reconciling the different data types can be sticky, but they’re the least of his big data challenges. The bigger issue is data owners who are overly protective of the information in their systems. Kolker said that to make effective analytics possible, he works closely with departmental managers and tries to build a good working relationship with them. "The people angle isn't just important," he said. "It's a major deal."
It's that kind of focus on getting at business value that can make big data analytics initiatives manageable -- and successful. The bottom line is that collecting data isn't the important part -- it's what you do with the data that really counts.