Guide to managing a data quality assurance program
Like Facebook and Google+, Tagged.com provides a place for people to connect, share thoughts and interests, even flirt. But behind the bright and colorful user interface, algorithms are prompting information stores to churn out recommendations for users on who might make a good partner, conversationalist or gamer. Yes, on the face of it, Tagged is a social networking site. But, underneath, Tagged's success rides on the quality of its data.
The same could be said for any business that uses data to find ways to become more efficient, save on costs or get in front of disasters before they happen. Quality data can be a competitive advantage, but unresolved data quality problems can be a hindrance.
For Tagged, data quality is a priority, said Johann Schleier-Smith, chief technology officer and co-founder of the San Francisco, Calif.-based company. It requires continuous work, especially when glitches materialize.
A red flag
Like other social networking sites, Tagged analyzes clickstream data and Web and mobile application logs, all of which tend to accumulate rapidly. To put it in context, a single application within the Tagged environment produces 50 billion log entries every month, according to Schleier-Smith.
The scale of Tagged's system is large "whether measured by data rate, number of machines from which we aggregate data, or number of product teams introducing changes as they work on new features," he said.
But for the business to gain as much value as possible from its data, capturing log files and clicks is only one part of the equation. The company also needs to make sure all of that data, which originates on a server farm of more than a thousand machines, is moved into the data warehouse for analysis. And that's where Tagged started to notice a problem.
Dashboard reports that keep track of business metrics, like the number of people visiting the site or the number of recommended matches between users, started to look off, Schleier-Smith said. Data, it was later determined, was getting lost in transit.
Cause and effect
Early incarnations of Tagged's data analytics platform ran into equipment failures, software bugs and kinks from newly built features. In some cases, events like these introduced new types of information and even masked standard ones; in other cases, data never made it to the data warehouse. Such mishaps threatened the quality of Tagged's data analysis by skewing dashboard reports and, in the worst-case scenario, introducing faulty data into the development process.
Tagged decided to install components that would help provide more control over data intake and processing. "Our data team spent several months focused on end-to-end quality of logging flows," Schleier-Smith said.
Team members built in automatic controls, which monitor the data and sound an alert if a measure falls outside approved benchmarks, such as the time elapsed between log events or the number of events loaded in the last few hours.
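As a rough illustration of the kind of automatic control described above, the following Python sketch checks one log flow against two benchmarks: the longest silence between events and the number of events seen in a recent window. It is not Tagged's implementation; the names `Benchmarks` and `check_log_flow` and every threshold value are assumptions made for the example.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Benchmarks:
    # Hypothetical thresholds; the article does not give Tagged's real values.
    max_gap: timedelta = timedelta(minutes=10)  # longest tolerated silence
    min_events: int = 1000                      # fewest events per window


def check_log_flow(event_times, now, window=timedelta(hours=3),
                   benchmarks=Benchmarks()):
    """Return alert messages for one log flow; an empty list means healthy."""
    alerts = []
    # Only events inside the monitoring window count toward the checks.
    recent = sorted(t for t in event_times if now - window <= t <= now)

    # Volume check: too few events loaded in the last few hours.
    if len(recent) < benchmarks.min_events:
        alerts.append(
            f"volume: {len(recent)} events in window, "
            f"expected at least {benchmarks.min_events}"
        )

    # Gap check: measure silence between consecutive events, including
    # the stretch from the window start and the stretch up to "now".
    checkpoints = [now - window] + recent + [now]
    largest_gap = max(
        later - earlier
        for earlier, later in zip(checkpoints, checkpoints[1:])
    )
    if largest_gap > benchmarks.max_gap:
        alerts.append(
            f"gap: {largest_gap} without events, "
            f"limit is {benchmarks.max_gap}"
        )
    return alerts
```

A scheduler could call `check_log_flow` for each flow every few minutes and forward any returned messages to whatever alerting channel the team uses.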
Additional fail-safes were also added. As log files pour into the Tagged environment, they're pushed downstream into the data warehouse to "aggregate data from many individual servers into one system that serves as a central point for analysis," said Schleier-Smith. That still happens, but now the system also creates a temporary copy of each file at its source as a backup, both in case a problem arises during the move and to cover routine maintenance. Plus, hardware failures now trigger an automatic recovery feature.
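The backup-at-source idea can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not a description of Tagged's system: a local file copy stands in for the network transfer to the warehouse, and the function names `ship_log` and `sha256_of`, the directory layout and the retry count are all hypothetical.

```python
import hashlib
import shutil
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Checksum used to confirm the warehouse copy matches the source."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def ship_log(log_file: Path, backup_dir: Path, warehouse_dir: Path,
             retries: int = 3) -> Path:
    """Copy a log to the warehouse, keeping a backup until it is verified."""
    backup_dir.mkdir(parents=True, exist_ok=True)
    warehouse_dir.mkdir(parents=True, exist_ok=True)

    # Temporary copy kept at the source in case the move goes wrong.
    backup = backup_dir / log_file.name
    shutil.copy2(log_file, backup)
    expected = sha256_of(backup)

    destination = warehouse_dir / log_file.name
    for _ in range(retries):
        try:
            # Stand-in for the real transfer to the data warehouse.
            shutil.copy2(backup, destination)
        except OSError:
            continue  # transient failure: try again from the backup
        if sha256_of(destination) == expected:
            backup.unlink()  # safe to discard only after verification
            return destination

    raise RuntimeError(
        f"could not deliver {log_file.name}; backup kept at {backup}"
    )
```

The point of the design is that the backup is deleted only after the delivered file has been verified, so a failed or interrupted move can always be replayed from the source.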
The team also installed a dashboard alert so that when problems with log flows arise, employees who rely on that data are kept in the loop.
"Sweat the small stuff to catch the big problems," Schleier-Smith said. "Even little discrepancies, such as numbers that should add up but are off by just a little, are often indicators of serious issues."
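One way to act on that advice is a reconciliation check that compares how many events each server says it emitted with how many actually arrived in the warehouse. The sketch below is an assumption-laden example rather than anything the article attributes to Tagged; the function name `reconcile`, the server names and the counts are invented for illustration.

```python
def reconcile(emitted: dict[str, int], loaded: dict[str, int],
              tolerance: int = 0) -> dict[str, int]:
    """Map each mismatched server to its shortfall (emitted minus loaded).

    A positive value means events were lost in transit; a negative value
    means the warehouse holds more events than the server reported.
    """
    discrepancies = {}
    # Cover servers that appear on only one side of the comparison too.
    for server in sorted(emitted.keys() | loaded.keys()):
        difference = emitted.get(server, 0) - loaded.get(server, 0)
        if abs(difference) > tolerance:
            discrepancies[server] = difference
    return discrepancies


# Hypothetical counts: web-2 is off by only three events, yet it is flagged.
emitted_counts = {"web-1": 1000, "web-2": 500}
loaded_counts = {"web-1": 1000, "web-2": 497}
print(reconcile(emitted_counts, loaded_counts))
```

Keeping the default tolerance at zero reflects the quote: even a tiny mismatch is reported so that someone can decide whether it points to a larger problem.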
The road ahead
Initially, the project helped standardize how data entered and moved across Tagged's environment, but it eventually helped to transform the way log files are processed. "Data lost in transit is now a rarity at Tagged," said Schleier-Smith, "and we have moved on to tackling more subtle issues of data quality."
Today, as Web, mobile and clickstream logs flow into Tagged, the system is programmed to add structure that flags specific data it expects to use for analysis.
Schleier-Smith gives an example of messages exchanged between two users. Analysts could figure out if those two users are friends by determining when the original connection was made or searching through their lists of friends. But determining whether the two users had once been friends and no longer are would be a more time-consuming, complicated process.
For a social networking site that recommends potential matches to users, both pieces of information are important, Schleier-Smith said. Tagged developers programmed a way for those data points to be highlighted.
"If we want to study messaging patterns among friends, it's much more practical to add a field to the message log," he said.
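To make the idea concrete, here is a minimal Python sketch of a message log entry that records the friendship status at the moment the message is sent. The field names (`are_friends`, `sender_id` and so on), the JSON format and the helper functions are assumptions for illustration; the article does not describe Tagged's actual log schema.

```python
import json
from datetime import datetime, timezone


def build_message_event(sender_id, recipient_id, sender_friend_ids,
                        sent_at=None):
    """Build one message log entry with friendship captured at send time."""
    sent_at = sent_at or datetime.now(timezone.utc)
    return {
        "event": "message_sent",
        "sender_id": sender_id,
        "recipient_id": recipient_id,
        "sent_at": sent_at.isoformat(),
        # The extra field: recorded once, when the answer is cheap to know,
        # instead of being reconstructed later from friendship history.
        "are_friends": recipient_id in sender_friend_ids,
    }


def to_log_line(event) -> str:
    """Serialize an event as one JSON line for the log pipeline."""
    return json.dumps(event, sort_keys=True)


# Hypothetical usage: user 1 messages user 2, who is on user 1's friend list.
print(to_log_line(build_message_event(1, 2, {2, 3})))
```

With the field in place, an analyst studying messaging among friends can filter on `are_friends` directly, even if the two users later end the friendship.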
In other words, Schleier-Smith said, data quality endeavors are not just about ensuring the data is correct; they are also about figuring out how to make the data more useful and powerful to the business.
"Careful design up front is rewarding," Schleier-Smith said. "Thoughtfully structured data leads immediately to better insights, whereas logging treated as an afterthought often leads to deficiencies that are difficult to correct once high-volume data is flowing."
Besides, he said, automating redundant tasks and giving employees high-quality data make them more productive and clear the way for data scientists and analysts to do what they're good at: solving problems.