Launching a successful “big data” analytics initiative requires large amounts of system planning, ample business-user involvement and a well-defined deployment strategy, according to IT professionals who have experience dealing with large and complex data installations used to support analytics processes.
Just ask Michael Brown, chief technology officer at comScore Inc., a Reston, Va.-based company that tracks and measures the behavior of Internet users and uses the data it collects to identify online market trends for its corporate customers.
ComScore, which runs more than 1,000 servers and analyzes about 8 TB of raw data on a daily basis, recently implemented a new version of the Hadoop open source distributed computing framework in an effort to get a better handle on the ever-increasing amounts of incoming information and reduce the amount of time it takes to generate and deliver market intelligence to clients.
The company will reach the milestone of having a petabyte of data under management before the end of the year, said Brown, who added that planning for exponential growth is one of the keys to success in any big data analytics project.
“Think big and plan for data volumes that are going to be 10 times what you’re dealing with today, because if anything, the rate of data growth is rapidly increasing,” Brown said. “Whatever you think is big today will be eclipsed in the future.”
It’s also important to remember that “thinking big” means more than just planning for increases in server capacity. Brown said companies launching big data analytics initiatives also have to decide if their existing data centers are big enough and their software licenses are flexible enough to support the expected future data growth.
“Make sure that you run the numbers and determine what will happen if the data gets really big, so that you’re doing the best thing for your company from a cost perspective,” he said.
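One way to "run the numbers" is a back-of-the-envelope projection like the sketch below. It starts from the roughly 8 TB of raw data per day cited for comScore and applies Brown's 10-times planning factor. The retention period and the cost per terabyte are hypothetical placeholders chosen for illustration, not figures from comScore.

```python
def projected_storage_tb(daily_tb: float, growth_factor: float, retention_days: int) -> float:
    """Raw storage needed if daily volume is multiplied by growth_factor
    and every day's data is kept for retention_days."""
    return daily_tb * growth_factor * retention_days


def projected_cost(storage_tb: float, cost_per_tb: float) -> float:
    """Total storage cost at a flat price per terabyte."""
    return storage_tb * cost_per_tb


DAILY_TB = 8.0         # raw data per day, as cited in the article
GROWTH_FACTOR = 10.0   # Brown's "10 times" planning rule
RETENTION_DAYS = 90    # hypothetical retention window
COST_PER_TB = 100.0    # hypothetical flat cost per TB, in dollars

today_tb = projected_storage_tb(DAILY_TB, 1.0, RETENTION_DAYS)
future_tb = projected_storage_tb(DAILY_TB, GROWTH_FACTOR, RETENTION_DAYS)

print(f"Today:  {today_tb:,.0f} TB, ${projected_cost(today_tb, COST_PER_TB):,.0f}")
print(f"At 10x: {future_tb:,.0f} TB, ${projected_cost(future_tb, COST_PER_TB):,.0f}")
```

The same two functions can be rerun with real license tiers or data center limits in place of the flat per-terabyte price to see where the costs jump.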
Giving users a voice on big data analytics plans
And don’t forget to seek input from the business users who will be making important decisions based on the results produced by big data analytics systems, said Will Duckworth, comScore’s vice president of core processing. Failing to do so, he warned, will almost certainly lead to problems as users realize that their data analysis and reporting requirements aren’t being met.
“Involve business users early in the process, as early as you can,” Duckworth said. “Business users are going to come up with queries that they want to run [or] analyses that they’re going to want to see that you’re not going to consider, and you may need to make changes to your architecture that you didn’t anticipate.”
The process of developing systems for analyzing big data can seem quite different from building a traditional data warehouse architecture. For example, big data analytics initiatives might require a Hadoop distribution, NoSQL database technology and a high-performance server cluster to provide fast analytics performance, especially if they involve unstructured data.
But many of the strategies and tactics associated with enterprise data warehousing -- such as taking steps to ensure high levels of data quality and system uptime -- also apply to big data analytics initiatives, according to Michael Brandt, manager of business intelligence (BI) at New York-based LinkShare Corp.
LinkShare offers a variety of marketing services to online advertisers and publishers, focusing on areas such as search engine marketing, lead generation and affiliate marketing. It also provides near-real-time data synchronization and analytics capabilities so clients can quickly see how well online ads are performing and whether they need to be swapped out for more effective marketing messages.
All told, LinkShare fields about 185,000 report requests a day, primarily from external users -- and the number of requests continues to grow, Brandt said. Responding to them all requires the company to capture, store and effectively manage huge amounts of data about Internet clickstreams and online consumer behavior. He noted that the company’s database tables have all increased in size by at least 60% and as much as 80% over the past 18 months.
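To put that growth on a yearly footing, the 18-month figures can be converted to an equivalent annual rate. The sketch below assumes steady compound growth over the period, which is a simplification and not something Brandt stated.

```python
def annualized_growth(total_growth: float, months: float) -> float:
    """Convert total growth over `months` into the equivalent compound
    yearly rate, assuming growth was steady across the period."""
    return (1.0 + total_growth) ** (12.0 / months) - 1.0


low = annualized_growth(0.60, 18)   # tables that grew 60% in 18 months
high = annualized_growth(0.80, 18)  # tables that grew 80% in 18 months

print(f"Equivalent yearly growth: {low:.1%} to {high:.1%}")
```

Under that assumption, the tables are growing by roughly 37% to 48% a year.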
LinkShare recently launched a new data warehouse appliance deployment project in an effort to improve system bandwidth and performance, and Brandt said the company is looking into the possibility of adding a Hadoop installation during its next planned upgrade cycle in three to four years.
Big data analytics team: Pyramid builders?
Brandt, who spoke at a forum on big data analytics held recently in Orlando, Fla., by The Data Warehousing Institute, detailed a “pyramid of success” strategy that his team employed during its data warehouse upgrade and recommended that any organization engaging in a big data initiative follow the same kind of approach.
The first level of the pyramid outlined by Brandt is focused on making sure that data warehousing and analytics systems are up and running properly, with clearly defined service-level agreements and uptime requirements. Interrupted data-processing workloads caused by an unstable system can result in redundant data, missing information and other data quality problems, he said.
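A common safeguard against the redundant data Brandt describes is to make each load safe to rerun. The sketch below is a generic illustration using Python's built-in sqlite3 module, not a description of LinkShare's system; the `clicks` table, its columns and the sample rows are all hypothetical.

```python
import sqlite3


def load_batch(conn: sqlite3.Connection, rows: list) -> None:
    """Load (event_id, payload) rows so that a rerun adds no duplicates."""
    # The connection context manager commits on success and rolls back
    # if an exception interrupts the load, so no partial batch is kept.
    with conn:
        conn.executemany(
            # Rows whose event_id already exists are skipped, not duplicated.
            "INSERT OR IGNORE INTO clicks (event_id, payload) VALUES (?, ?)",
            rows,
        )


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE clicks (event_id TEXT PRIMARY KEY, payload TEXT)")

batch = [("e1", "ad-42"), ("e2", "ad-7"), ("e3", "ad-42")]
load_batch(conn, batch[:2])  # simulate a job that stopped partway through
load_batch(conn, batch)      # rerun the full batch after the interruption

row_count = conn.execute("SELECT COUNT(*) FROM clicks").fetchone()[0]
print(row_count)
```

Because the primary key rejects repeats, rerunning the whole batch after a failure leaves one row per event rather than doubling up the rows that had already landed.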
The second level involves making sure that data quality levels are where they need to be. Just getting information to internal users or external clients isn’t good enough, Brandt said. If data quality is “spotty” or worse, users likely will end up making bad decisions, he cautioned. “I’d rather not give them any data at all than give them bad data.”
The third level of Brandt’s pyramid centers on making sure that information gets to business users and clients not just on time but at the right time. Delivering data to users as quickly as possible isn’t always necessary, Brandt said. For example, users in finance might not need to see reports about daily financial activity until the next morning, whereas Web advertisers might want reports on the performance of their banner ads right away, so they can quickly replace underperforming ones.
Speed finally comes into focus at the fourth level of the pyramid. Brandt said organizations should ask themselves a variety of performance-related questions, such as how to load large amounts of data into a table quickly and how to accelerate the data transformation process.
By the time they reach the fifth and sixth levels, companies should have achieved the minimum requirements for effective big data analytics. As a result, they can focus on making sure that the system is well-tuned, as easy to use as possible and even a little fun.
In LinkShare’s case, Brandt said “fun” will come in the form of upgrades designed to make the company’s analytics system more interactive and social for end users. “We think this will engage users a little more from a reporting standpoint,” he said, while declining to disclose specific details.