Data sandboxes help analysts dig deep into corporate info

Giving analytics professionals control of small amounts of space in data warehouses lets them experiment with data sets in a managed environment.

You need to think big when you think about eBay Inc.'s auction and shopping website; for example, picture 100 million site users, 300 million active items, 50,000 product categories and an average of $2,100 worth of goods sold every second. The same applies if you think of eBay as a data management and business analytics company: It generates 50 terabytes of data a day and supports efforts to analyze that data by 7,500 business users and analysts.

Data sandboxes, on the other hand, sound pretty small. But they're a key component of eBay's efforts to keep its data analysis processes from getting bogged down.

"We can become swamped if people are asking for different views of the data -- different reports or dashboards," said Chris Rogaski, eBay's senior director of analytic application technology, during a presentation at the Gartner Business Intelligence Summit in Los Angeles in April. "We needed to get ahead of that … so that our business analysts and product managers can make data-informed decisions."

The San Jose, Calif., company has taken several steps to help it stay ahead of user demand. Its data analytics platform is composed of a Teradata-based enterprise data warehouse (EDW) that stores structured transaction data; a separate "deep storage" Teradata database, called Singularity by eBay, that holds semi-structured data such as analyses of the behavior of site users; and a Hadoop system for unstructured data, including the raw user behavior data, other forms of machine-generated info and text. Together, the three pillars provide about 90 petabytes of storage space, Rogaski said.

In addition, eBay is liberally handing out virtual data marts inside the EDW to employees who want to explore, manipulate and even add to specific data sets on their own. The data marts are part of the company's Analytics as a Service, or A3S, program for users involved in analyzing data. Using a tool created by eBay's IT department, business users and data analysts can apply for, and are usually granted, 100 GB of space -- giving them what are known in business intelligence (BI) circles as data sandboxes to play around in.

Also referred to as analytics sandboxes, the user-controlled spaces are walled-off areas that keep experimentation with data separate from a data warehouse's production database environment. At eBay, users have access to the data in the EDW and can copy information that they want to analyze into their data marts. And with the help of a second eBay-developed tool, they can upload additional data to work with. "If people have a new data source we don't know about, we can't be in the way of that data becoming a part of their analysis," Rogaski said.
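To make the walled-off idea concrete, here is a minimal sketch that uses Python's built-in sqlite3 module as a stand-in for a warehouse. It is not eBay's tooling: the table names, the `sandbox` schema name and the `make_sandbox_copy` helper are all hypothetical, chosen only to show a user copying a slice of production data into a separate space and changing it there without touching the original.

```python
import sqlite3


def make_sandbox_copy(conn, source_table, sandbox_table, where_clause="1=1"):
    """Copy selected rows from a production table into the attached sandbox schema.

    Hypothetical helper. Identifiers are interpolated directly, so this sketch
    assumes trusted input; a real tool would validate them.
    """
    conn.execute(
        f"CREATE TABLE sandbox.{sandbox_table} AS "
        f"SELECT * FROM main.{source_table} WHERE {where_clause}"
    )


# "main" plays the role of the production warehouse; "sandbox" is a separate,
# attached in-memory database standing in for the user's data mart.
conn = sqlite3.connect(":memory:")
conn.execute("ATTACH DATABASE ':memory:' AS sandbox")

conn.execute("CREATE TABLE main.sales (item_id INTEGER, category TEXT, price REAL)")
conn.executemany(
    "INSERT INTO main.sales VALUES (?, ?, ?)",
    [(1, "books", 12.5), (2, "toys", 30.0), (3, "books", 8.0)],
)

# The analyst copies only the rows of interest, then experiments on the copy.
make_sandbox_copy(conn, "sales", "book_sales", "category = 'books'")
conn.execute("UPDATE sandbox.book_sales SET price = price * 2")

production_total = conn.execute("SELECT SUM(price) FROM main.sales").fetchone()[0]
sandbox_total = conn.execute("SELECT SUM(price) FROM sandbox.book_sales").fetchone()[0]
```

The update changes only the sandbox copy; the production table keeps its original values, which is the separation the sandbox approach is meant to provide.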

Family feud frustrates analytics efforts

The long-standing feud between the IT department and the business in many organizations is well documented. It can be chalked up partly to differing priorities: While business users have pressing business problems to resolve, IT teams are tasked with governing the use of data and maintaining data quality standards. For analytics professionals looking to dig deep into the most current data, the divide can be a source of frustration.

Often, "analysts need data that's not yet in the data warehouse," said Wayne Eckerson, a BI consultant and research director for TechTarget Inc.'s business applications and architecture media group. "It's not there because it hasn't been sourced or it's not yet loaded."

In other cases, he said, data analysts may view the BI and analytics tools deployed by their companies as inflexible compared to Excel -- leading them to go their own way by surreptitiously setting up Excel-based spreadmarts outside of IT's purview. But stretching Excel across the enterprise for data analysis uses is hardly ideal, Eckerson added: "Everyone knows analysts deliver valuable information, but organizations cannot run on spreadsheets."

That's where data sandboxes come into play, according to Eckerson. He said sandboxes can help bring spreadmarts and other so-called data shadow systems out of the dark corners of an organization by ensuring that analytics users have access to the data they need and can exert some level of control over the information.

For BI and IT managers, a well-managed data sandbox offers a safe place for users to experiment with corporate data inside a company's data management infrastructure. It's an environment "that is not storing the primary copy of the data but is storing [information] in a format suitable for analysis," said Gordon Linoff, founder and principal of consultancy Data Miners Inc. in New York and co-author of Data Mining Techniques: For Marketing, Sales and Customer Relationship Management.

Data sandboxes can be constructed in data warehouses and analytical databases or outside of them as standalone data marts (see "Hadoop systems offer a home for sandboxes," below). In eBay's case, hosting sandboxes as virtual data marts inside the EDW keeps data movement down and reduces the need for users to make copies of data and store them in other systems, Rogaski said.

Best when analyzed by this date

Rogaski acknowledged that a "minimal" amount of data duplication occurs as users populate their sandboxes. "But it happens, and that's a cost of the way we're doing business," he said. To decrease the instances of duplication, eBay uses an expiration date system, with analysts typically setting an end date for their use of a data set. Once the limit is reached, his team confers with the analysts before expunging their data from the system -- a process that eBay refers to as "garbage collection."
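The expiration process described above could be sketched roughly as follows. This is an illustration only, not eBay's implementation: the `SandboxDataset` record, its field names and the two functions are hypothetical, and the "confer with the analyst" step is reduced to a simple approval flag.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class SandboxDataset:
    # Hypothetical record for one data set held in a user's sandbox.
    name: str
    owner: str
    expires_on: date                       # end date chosen by the analyst
    owner_approved_deletion: bool = False  # set after the team confers with the owner


def find_expired(datasets, today):
    """Return the data sets whose end date has been reached."""
    return [d for d in datasets if d.expires_on <= today]


def collect_garbage(datasets, today):
    """Split data sets into (kept, removed).

    An expired data set is removed only if its owner has agreed;
    otherwise it is kept until that conversation has happened.
    """
    kept, removed = [], []
    for d in datasets:
        if d.expires_on <= today and d.owner_approved_deletion:
            removed.append(d)
        else:
            kept.append(d)
    return kept, removed
```

In this sketch, a data set past its end date stays in place until the owner signs off, mirroring the step in which the team checks with analysts before anything is expunged.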

Because sandboxes by their very nature involve playing with data, Linoff believes that having the right skills is an important part of a successful deployment. Data scientists and other users may need to manipulate data and analyze what they're looking at on the fly. "This is about learning new things," he said. "And you need the skill set to make use of it."

That may be a good rule of thumb for many businesses but not for all. Rogaski said one of eBay's goals is to make its BI and analytics data accessible to "a wide swath of people." Even a business user "who really just wants to be told what they need to know" can apply for a virtual data mart, he added.

Managing usage was one of the big challenges that Eckerson cited for organizations looking to set up data sandboxes. For example, he said that before users distribute any reports containing unique views of the data they're working with to other people, the manipulated information should be checked by the corporate BI team to make sure the metrics are correct and no errors have crept into the data.

"You can give users access [to data], but you also have to give them some guidelines," Eckerson said. "They don't like restrictions, but if they're going to use corporate resources, they have to agree to certain things."


Hadoop systems offer a home for sandboxes

With petabytes of data storage space available across three high-powered analytics platforms, eBay has ample headroom before the virtual data marts it sets up for data analysts and other users start to affect the performance of its enterprise data warehouse system. But for many other companies, performance issues could be a valid concern with data sandboxes, and a reason to put them outside of an EDW.

One alternative location is a standalone data mart. A Hadoop system is another. "Most people don't use the term sandbox when implementing Hadoop," said Wayne Eckerson, research director for TechTarget Inc.'s business applications and architecture media group. "But in many ways, companies want to do data mining and exploration there."

The open source distributed computing technology is free-standing but can be connected to data warehouses to exchange data. Hadoop clusters should be able to provide space for data scientists and other skilled analytics users who might otherwise take up valuable computing resources in an EDW system, Eckerson said. He cautioned, though, that people using Hadoop sandboxes will have to be adept at working with the MapReduce programming framework and familiar with related technologies such as Hive and Pig.

Yet another possible sandbox host, said Gordon Linoff, founder and principal of New York-based consultancy Data Miners Inc., is "a separate system running SAS or SPSS -- analytics tools that are not database-oriented and are designed more for statisticians."
