Everyone who surfs the Web knows the name Yahoo. The company delivers content and services, many of them free, to more than 680 million people worldwide, and it depends on revenue from selling advertising space. Earlier this month, David Mariani, Yahoo’s now former vice president of user data and analytics, told Gartner BI Summit attendees about the search engine giant’s new analytics and data management strategy, which was designed to make ad space even more valuable to advertisers.
Mariani, who recently joined Klout, said Yahoo’s ads are sold in two different forms: Search-based advertisements are seen on query pages; display or banner advertisements are seen on the homepage and deeper inside the site. The company's new business intelligence (BI) strategy is geared toward optimizing "performance display" banner advertisements, designed to drive a visitor to make a purchase, provide an email address or click through to a specific site. To do this, Yahoo has to determine who’s visiting the site, and then target the most appropriate advertisement to that visitor. Yahoo’s latest BI project requires loading, storing and querying vast amounts of data in real time to make an advertisement campaign as relevant as possible.
The problem: Yahoo created an ad exchange, a network for selling banner ads that operates like the New York Stock Exchange, Mariani explained. The sellers are publishers, who determine whether they have space for an ad placement, and the buyers are advertisers looking for the best bang for their buck. Mariani said advertisers are drawn to search-based ads because they can be easily aligned with the interests of visitors to the site. For example, someone searching for information about vacations might see an advertisement for a travel company.
But that kind of information about personal preferences tends to be more elusive for banner advertisements. To make banner ad space a worthwhile investment for advertisers, Yahoo needs to determine a visitor’s intent.
“We have to work in real time amongst many different dimensions,” Mariani said, including time of day, location, age and gender.
To accomplish this, Yahoo, which receives 3.5 billion ad impressions daily, needs to analyze streams of data to determine which advertisements will most interest an individual visitor. Each ad impression and any actions taken during that impression are recorded. The company determines visitor demographics and runs algorithms to further optimize banner advertisement campaigns. That equates to almost a half-trillion rows of data per quarter that need to be loaded into the system, sorted and then made accessible to end users through queries that can be delivered in under 10 seconds.
“Our system needs to find the exact nugget that’s going to make a difference in that campaign and generate revenue,” Mariani said.
The implementation: Yahoo built a system using “off the shelf parts” rather than purchasing a business intelligence (BI) appliance or specialized technologies.
The organization decided to implement Hadoop, an open source framework for distributed storage and processing of large data sets, and says the software has been instrumental in helping it run extract, transform and load (ETL) operations for data aggregation.
Hadoop and the aggregation engine wait for events to come in from the ad server, amounting to about 1.2 terabytes of raw data per day. Oracle 11g Real Application Clusters (RAC), which provides data archiving and staging functions and allows for scalability, loads files as soon as it gets them and puts them into partitions.
The data is then loaded into a "cube" and compressed from 1.2 terabytes per day into 135 gigabytes. SQL Server Analysis Services 2008 R2 multidimensional online analytical processing (MOLAP) technology operates as Yahoo’s cube engine and loads partitions from Oracle, creating a 16-terabyte cube per quarter. Data loaded into the system can be queried 8 to 12 hours later.
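The volumes quoted above can be sanity-checked with some quick arithmetic. The sketch below uses only the figures reported in the talk; the decimal units (1 terabyte = 1,000 gigabytes) and the 91-day quarter are assumptions made for illustration.

```python
# Back-of-the-envelope arithmetic using the figures quoted in the article.
# Assumptions (not from the source): 1 TB = 1,000 GB and a 91-day quarter.
RAW_GB_PER_DAY = 1200      # 1.2 terabytes of raw ad-server events per day
CUBE_GB_PER_DAY = 135      # the same day's data after loading into the cube
DAYS_PER_QUARTER = 91      # assumed quarter length

compression_ratio = RAW_GB_PER_DAY / CUBE_GB_PER_DAY
raw_tb_per_quarter = RAW_GB_PER_DAY * DAYS_PER_QUARTER / 1000
cube_tb_per_quarter = CUBE_GB_PER_DAY * DAYS_PER_QUARTER / 1000

print(f"compression ratio: about {compression_ratio:.1f} to 1")
print(f"raw data per quarter: about {raw_tb_per_quarter:.1f} TB")
print(f"compressed data per quarter: about {cube_tb_per_quarter:.1f} TB")
```

Under these assumptions the daily compressed loads add up to roughly 12 terabytes per quarter, somewhat below the 16-terabyte cube Mariani cited. The talk did not explain the difference.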
“Each stage in the building of this warehouse is not connected to the other,” Mariani said. “Instead it’s highly parallel, and there’s no data sorting here. It’s all processed in the order that it comes in.”
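Processing events in arrival order, with no sort step, is how a hash-based rollup works: each event updates the running totals for its combination of dimensions. The talk did not describe Yahoo’s code, so the following is only a minimal sketch; the event fields and the `roll_up` function are hypothetical.

```python
from collections import defaultdict

# Hypothetical event records; the field names are illustrative, not Yahoo's schema.
events = [
    {"hour": 9, "region": "US", "gender": "F", "clicked": True},
    {"hour": 9, "region": "US", "gender": "F", "clicked": False},
    {"hour": 14, "region": "UK", "gender": "M", "clicked": True},
    {"hour": 9, "region": "US", "gender": "F", "clicked": True},
]

def roll_up(stream, dimensions):
    """Aggregate events in a single pass, in arrival order, without sorting."""
    totals = defaultdict(lambda: {"impressions": 0, "clicks": 0})
    for event in stream:
        # The tuple of dimension values identifies one cell of the rollup.
        key = tuple(event[d] for d in dimensions)
        cell = totals[key]
        cell["impressions"] += 1
        cell["clicks"] += int(event["clicked"])
    return dict(totals)

cells = roll_up(events, ("hour", "region", "gender"))
for key, cell in cells.items():
    print(key, cell)
```

Because each event touches only its own cell, separate batches can be rolled up independently and their totals added together afterward, which is one way a pipeline like this could run in parallel.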
Once the cube is built, the system publishes a snapshot of it to a bank of BI query servers fronted by a load balancer. In other words, Yahoo has completely separated the loading process from the querying process.
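The separation of loading from querying can be illustrated with a small sketch. It is not Yahoo’s implementation: the `QueryServer` and `LoadBalancer` classes, the `publish_snapshot` function and the round-robin policy are all assumptions chosen to show the idea of handing read-only snapshots to a pool of query servers.

```python
import itertools
from types import MappingProxyType

class QueryServer:
    """Hypothetical query server that reads from its last published snapshot."""
    def __init__(self, name):
        self.name = name
        self._snapshot = MappingProxyType({})

    def publish(self, snapshot):
        # Replacing one reference means readers see the old snapshot or the
        # new one, never a partially loaded cube.
        self._snapshot = snapshot

    def query(self, key):
        return self._snapshot.get(key)

class LoadBalancer:
    """Hypothetical round-robin dispatcher in front of the query servers."""
    def __init__(self, servers):
        self._cycle = itertools.cycle(list(servers))

    def query(self, key):
        server = next(self._cycle)
        return server.name, server.query(key)

def publish_snapshot(cube, servers):
    """Copy the finished cube into a read-only view and hand it to every server."""
    snapshot = MappingProxyType(dict(cube))
    for server in servers:
        server.publish(snapshot)

servers = [QueryServer("q1"), QueryServer("q2")]
balancer = LoadBalancer(servers)

cube = {("US", "F"): 3}
publish_snapshot(cube, servers)
cube[("US", "F")] = 99  # the loader keeps working; published snapshots are unaffected

first = balancer.query(("US", "F"))
second = balancer.query(("US", "F"))
print(first, second)
```

In this sketch the query servers never read from the structure the loader is still writing to, so a long-running load cannot slow down or corrupt queries.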
Yahoo next implemented two types of query interfaces. The first, an ad hoc query interface of the kind sometimes seen as a BI nightmare, was added to help optimize campaigns. Those queries need to be fast, and Yahoo’s system can query the half-trillion rows of data created every quarter in six seconds. Mariani refers to this as a kind of self-service environment that enables end users to create visualizations quickly. The second query interface was developed using what the organization calls a targeting, analytics and optimization (TAO) web application, a customized search function that queries based on specific parameters. That query runs in less than two seconds, feeding information back to the end user so that the ad campaign can be adjusted where needed.
“Our users are monitoring ad campaigns in real time and are making changes in real time,” Mariani said.
The outcome: This project was delivered in 12 months. The new system can currently handle about 100,000 queries per week. Mariani said now that it’s in place, it’s generating “tens of millions of dollars” and making ads more relevant.
Yahoo has been able to measure its success by comparing campaigns that use TAO against those that do not. According to Mariani, campaigns managed by TAO have pulled in about twice as much revenue.
“Advertisers like it, publishers like it, so it’s good for the Yahoo ad exchange,” Mariani said.
Additionally, advertisers were willing to spend 15% more for a campaign managed by TAO than for one that wasn’t, which means more revenue for Yahoo.
By managing and quickly accessing data, Yahoo has also been able to provide a better snapshot of what customer segments look like. Dashboard reports can chart statistics such as conversion and click-through rates.
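For readers unfamiliar with the two metrics, here is a minimal sketch of how they are commonly computed. The segment numbers are made up, and the definition of conversion rate as conversions per click is an assumption; some teams measure it per impression instead.

```python
def click_through_rate(clicks, impressions):
    """Clicks divided by impressions; 0.0 when there were no impressions."""
    return clicks / impressions if impressions else 0.0

def conversion_rate(conversions, clicks):
    """Conversions divided by clicks (assumed definition); 0.0 when there were no clicks."""
    return conversions / clicks if clicks else 0.0

# Made-up numbers for one customer segment, for illustration only.
segment = {"impressions": 200_000, "clicks": 500, "conversions": 25}

ctr = click_through_rate(segment["clicks"], segment["impressions"])
cvr = conversion_rate(segment["conversions"], segment["clicks"])
print(f"click-through rate: {ctr:.2%}, conversion rate: {cvr:.1%}")
```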
And, Mariani promised, “You haven’t seen nothing yet.”
Yahoo is now building a system that can handle more daily ad impressions and customer segments, which could amount to as much as 10 times the data it currently works with. Mariani said this has already been achieved inside Yahoo laboratories.