BOSTON -- No one questions that good, clean data is the cornerstone of good, clean business intelligence, but managing that data -- especially as it grows in variety, volume and velocity -- poses challenges for analytics these days.
The three V’s, which IBM and the analyst firm Gartner Inc. use to define “big data,” are testing the limits of traditional data warehouses and of necessary functions such as extract, transform and load (ETL), according to Usama Fayyad, CEO of the data strategy, technology and consulting firm Open Insights and former chief data officer and executive vice president of research and strategic data solutions at Yahoo.
Fayyad, considered a thought leader on data mining and a featured speaker at Enzee Universe, the Netezza user conference held here last week, came armed with tips and tricks and a case for why the future of analytics is rooted in big data.
Why analytics is the future of big data
The bulk of Fayyad’s presentation came from personal experiences documenting why analytics is the future of big data.
In the early 1990s, Fayyad was working for NASA’s Jet Propulsion Laboratory. Astronomers from the Palomar Observatory were managing 3 terabytes of data and, through photos and data calculations, were trying to differentiate faraway stars from galaxies. With 40 different variables extracted per image, astronomers struggled to make accurate predictions.
“This is a data set of billions of objects that look similar,” Fayyad said.
Fayyad and his team used decision tree algorithms as a data mining technique to explore which of the 40 variables were essential for classification, and they landed on the magic number eight, a set of variables that had eluded astronomers for more than 30 years.
“This became a famous work in astronomy,” he said. The results speak for themselves: The model created by Fayyad and his team classified images with 94% accuracy.
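As a rough illustration of how a decision tree can point to the variables that matter, the sketch below ranks variables by information gain, the kind of criterion a tree can use when choosing a split. It is not the method Fayyad's team used; the variable names, the synthetic data and the helper functions are all assumptions made for the example.

```python
import math
import random
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())


def best_split_gain(values, labels):
    """Largest information gain from any single threshold on one variable."""
    base = entropy(labels)
    pairs = sorted(zip(values, labels))
    best = 0.0
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no threshold fits between equal values
        left = [label for _, label in pairs[:i]]
        right = [label for _, label in pairs[i:]]
        weighted = (len(left) * entropy(left) + len(right) * entropy(right)) / len(pairs)
        best = max(best, base - weighted)
    return best


def rank_variables(rows, labels, names):
    """Sort variable names by how well one split on each separates the classes."""
    gains = {
        name: best_split_gain([row[i] for row in rows], labels)
        for i, name in enumerate(names)
    }
    return sorted(gains.items(), key=lambda item: item[1], reverse=True)


# Synthetic stand-in data (hypothetical): "brightness" carries signal, "noise" does not.
rng = random.Random(0)
names = ["brightness", "noise"]
rows, labels = [], []
for _ in range(200):
    is_star = rng.random() < 0.5
    brightness = rng.gauss(2.0 if is_star else -2.0, 1.0)
    rows.append((brightness, rng.gauss(0.0, 1.0)))
    labels.append("star" if is_star else "galaxy")

ranking = rank_variables(rows, labels, names)
for name, gain in ranking:
    print(f"{name}: {gain:.3f}")
```

A full decision tree repeats this kind of scoring at every node; variables that never earn a split are candidates for dropping, which is one way 40 variables could shrink to a handful.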
But not all analytics techniques are about data extraction to help classify the universe. Sometimes the little -- and even obvious -- things are just as important.
Fayyad and his team at DMX Group, a data mining and data strategy company he co-founded in 2003 that was acquired by Yahoo in 2004, were hired by DaimlerChrysler to help with sales forecasting for micromarkets (dealership zones). There were a few hiccups, such as having to build data marts because data couldn’t be removed from the warehouse, but the bigger problem was that the company wasn’t taking any action on what it was seeing. One of the biggest impacts the team made was a simple tweak to the way the reports looked.
Fayyad said he worked backward and discovered the one person who put the final spreadsheets together.
“I managed, along with the team, to convince her to turn some numbers red and others green,” he said. Green indicated that incentives were needed; red indicated they weren’t. “The minute those reports started showing up, even though they were seeing the same data as before, the action became very obvious.”
Data management tips and more
Along with big data come questions of data storage and computation. Fayyad suggested keeping an open mind while offering a few tips on avoiding potential pitfalls.
For example, big data can mean investing in more processors, which can be costly, he said. This, in turn, may lead organizations to turn to public cloud storage as a seemingly cheaper alternative, but, he said, regardless of the myth out there, bandwidth needed for moving data to the cloud doesn’t come cheap and can create additional problems in maintenance.
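To get a feel for why bandwidth matters, a back-of-the-envelope calculation helps. The sketch below estimates how long a data set takes to cross a network link; the link speeds are illustrative assumptions, not figures from Fayyad's talk, and the 3-terabyte size is borrowed from the Palomar example purely for scale.

```python
def transfer_hours(data_bytes, link_megabits_per_second, efficiency=1.0):
    """Hours needed to move data_bytes over a link at the given usable rate.

    efficiency is the fraction of the nominal link rate actually achieved
    (an assumed knob; real-world values depend on the network).
    """
    bits = data_bytes * 8
    usable_bits_per_second = link_megabits_per_second * 1_000_000 * efficiency
    return bits / usable_bits_per_second / 3600


THREE_TERABYTES = 3 * 10**12  # decimal terabytes, for illustration only

for mbps in (100, 1000):
    print(f"{mbps} Mbps: {transfer_hours(THREE_TERABYTES, mbps):.1f} hours")
```

Under these assumptions, 3 terabytes ties up a fully used 100 Mbps link for nearly three days, before counting any retries or charges for the transfer itself.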
Fayyad also warned attendees of “expensive ad hoc sandbox computation,” much of which is better performed on platforms other than Hadoop, he said, speaking specifically about his experience at Yahoo. Hadoop is an open source project, modeled on Google’s MapReduce framework, for analyzing large data sets. Yahoo is the biggest contributor to the open source project.
“Once we figured out what we needed from the data,” he said, “people kept insisting on keeping the data on Hadoop rather than in specialized stores. The grid is a good way to explore new computations, but it’s not always the right long-term solution for storage and other computations.”
Data analytics tips and tactics
Fayyad described a kind of vicious cycle for the ever-growing data warehouse these days. While businesses believe data and analytics are important, the business needs aren’t always being met, making it difficult to justify additional storage investments.
He suggested working toward data reduction by extracting summaries from data, mapping the data into segments and computing dashboards quickly and accurately.
“If you do that, you’ll justify a lot of the things you’ll need to create the infrastructure you need to support it,” Fayyad said.
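The data reduction Fayyad describes, collapsing raw records into per-segment summaries that a dashboard can read, can be sketched in a few lines. The record layout, field names and figures below are hypothetical and chosen only to show the shape of the idea.

```python
from collections import defaultdict
from statistics import mean


def summarize_by_segment(records, segment_key, value_key):
    """Collapse raw records into one summary row per segment."""
    groups = defaultdict(list)
    for record in records:
        groups[record[segment_key]].append(record[value_key])
    return {
        segment: {
            "count": len(values),
            "total": sum(values),
            "mean": mean(values),
        }
        for segment, values in groups.items()
    }


# Hypothetical raw sales records; zones and figures are made up.
sales = [
    {"zone": "north", "units": 12},
    {"zone": "north", "units": 8},
    {"zone": "south", "units": 5},
    {"zone": "south", "units": 7},
    {"zone": "south", "units": 9},
]

summary = summarize_by_segment(sales, "zone", "units")
for zone, stats in sorted(summary.items()):
    print(zone, stats)
```

The summary holds one row per segment no matter how many raw records feed it, so a dashboard can be computed from the small table rather than from the warehouse itself.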
The inability to meet business needs quickly can also cause groups to splinter off and take on tasks themselves. This is fueled by the growing scalability of technologies that enable interaction, starting with the Web.
“Businesses have crossed into this zone without realizing it,” he said.
Fayyad briefly touched on data mining, a now-popular discipline within analytics that uses algorithms to find statistical patterns, create predictive models and discover hidden relationships within data.
“Data mining is gaining attention,” he said. “A lot of the interesting queries are hard to stage in SQL.”
He also recommended that anyone pursuing predictive analytics consider everything around it, from visualization to modeling.