The buy vs. build debate in data warehousing has taken a wickedly humorous turn with the choice being rewritten as buy vs. suffer. To wit: Either buy packaged systems, even if some modest amount of integration is required, or suffer the pains of reinventing the wheel.
But for high-end, challenging applications integrating business intelligence (BI) data with other information such as Internet clickstream and social networking data, the pendulum is swinging back in the direction of build. The drivers? The usual suspects: growing data volumes (accompanied by high numbers of concurrent users and a high velocity of update activity), technology innovations and business opportunities. These variables are challenging old data warehouses to adapt to new environments and acquire new tactics, techniques and tricks.
Let’s look at some of the numbers. The exponential growth of data is staggering. A sense of the continuing explosion is provided by an estimate in a McKinsey Global Institute report published in May 2011 that the 800 exabytes of data now generated in a given year would fill a stack of DVDs stretching from here to the moon and back. And if Facebook’s 900 million active users formed a country, it would have the third largest population on the planet.
The fundamental question of BI and data warehousing remains the same: Who is buying or using our product or service, and when and where are they doing so? That can be asked across a variety of dimensions, such as geography, delivery channel, promotions and method of interaction. But paradoxically, if things are going to stay the same, significant changes are needed.
Going forward, the challenge is to build bridges to and leverage new types of data—for example, XML, clickstream data, unstructured and semi-structured sources such as social networking sites, and other forms of “big data.” Then organizations need to encapsulate the functionality of the traditional data warehouse behind a rational, usable and maintainable interface supporting the new data types.
For example, one new trick is that the classic customer dimension is morphing into a subset of the community space. The community dimension drives conversations about products and services, commercial buzz, economic exchange and top-line revenue. “Going viral” is no longer a rare exception; it is a way of life for the virtual consumer and customer avatar.
The technology innovations affecting data warehousing include advances in virtualization and cloud computing, plus two closely related developments: Hadoop, the open source distributed file system and programming framework, and NoSQL database technology.
Functional programming lives on in Hadoop and in the various distributions of the open source platform released by IT vendors: the MapReduce model takes its name from the map and reduce operations of functional languages. In brief, Hadoop is a distributed environment that increases system reliability by storing data across clusters of commodity systems, and it processes that data in parallel with its MapReduce function.
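To make the MapReduce idea concrete, here is a minimal single-machine sketch in Python. It is not Hadoop code and uses no Hadoop API; the function names (`map_phase`, `shuffle`, `reduce_phase`, `run_job`) are hypothetical and chosen only for illustration. It shows the three steps the model relies on: a map step that emits key-value pairs, a grouping step that collects values by key, and a reduce step that folds each group into one result. On a real cluster each step would be spread across many machines.

```python
from collections import defaultdict
from functools import reduce


def map_phase(lines):
    """Map step: emit a (word, 1) pair for every word in every line."""
    for line in lines:
        for word in line.lower().split():
            yield word, 1


def shuffle(pairs):
    """Grouping step: collect all values that share the same key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Reduce step: fold each key's list of values into a single total."""
    return {key: reduce(lambda a, b: a + b, values)
            for key, values in grouped.items()}


def run_job(lines):
    """Run map, shuffle and reduce in sequence on one machine."""
    return reduce_phase(shuffle(map_phase(lines)))


if __name__ == "__main__":
    counts = run_job(["big data big clusters", "data warehouses"])
    print(counts)
```

Because the map step treats each input line independently and the reduce step treats each key independently, both steps can in principle be split across a cluster, which is the property the model depends on.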
Hadoop has a sweet spot in processing oriented to extract, transform and load operations, and it has put wind in the sails of NoSQL databases. NoSQL technology makes a virtue of necessity and seeks an alternative to the relational model, especially for handling truly heroic volumes of data. But a word of caution: Hadoop currently is limited in concurrency. Nonetheless, the NoSQL movement has been successful enough that it became necessary to build a SQL front end to Hadoop called Hive.
Web Companies Are Driving Innovations
Many of those innovations were created by new technology leaders such as Amazon, Google and Yahoo. Meanwhile, the “terabyte club” that once boasted a hundred members with data warehouses of that size now has too many to count; a petabyte is the new terabyte. In addition, there is the requirement to monetize the explosion of social media such as Facebook, the page referrals paradigm of Google and event-driven encounters between would-be consumers and Web-based products and services. The path to revenue generation lies through aggregating data into meaningful categories that describe the behavior of human beings in encountering and interacting with products and services. But wait: That is precisely what data warehouses and BI systems do.
For example, yet another new trick comes from Facebook, which has developed a complex, custom measure of user engagement based on counts of impressions and clicks. This information then becomes the target of data mining and analysis operations. The technology that supports the effort includes Hadoop, MapReduce and Hive; the aggregations of impressions and clicks run against a total data volume in the range of 1.7 petabytes.
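As a rough illustration of what aggregating impressions and clicks can look like, the Python sketch below counts both event types per segment and derives a simple click-through ratio. This is an assumption made for the example only: Facebook's actual engagement measure is custom and is not described here, and the function name `engagement_by_segment`, the event labels and the segment names are all hypothetical.

```python
from collections import Counter


def engagement_by_segment(events):
    """Return clicks divided by impressions for each segment.

    `events` is an iterable of (segment, kind) pairs, where kind is
    "impression" or "click". The ratio is a stand-in metric for this
    sketch, not the measure any particular company uses.
    """
    impressions = Counter()
    clicks = Counter()
    for segment, kind in events:
        if kind == "impression":
            impressions[segment] += 1
        elif kind == "click":
            clicks[segment] += 1
    # Segments with no impressions are skipped to avoid dividing by zero.
    return {segment: clicks[segment] / shown
            for segment, shown in impressions.items() if shown > 0}


if __name__ == "__main__":
    sample = [
        ("sports", "impression"), ("sports", "impression"),
        ("sports", "click"), ("travel", "impression"),
    ]
    print(engagement_by_segment(sample))
```

At petabyte scale the same counting would be expressed as a distributed job or as a grouped SQL-style query rather than an in-memory loop, but the shape of the aggregation is the same.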
Today, Facebook’s revenue model is mainly realized as an advertising platform. But the company also plans to allow fee-based access to its “social graph,” or profile information database, to marketers and advertisers who potentially can use the information to fine-tune ads and related content for potential customers. In effect, the profile database is an nth-generation data warehouse. It will never be a single data store. It’s more like a “data storm”—a virtual, distributed, dynamically changing image of variables, attributes, aggregations and hypotheses for ongoing analysis, mining and inquiry.
Old and new data warehouses alike are finding themselves in a race between helping business users to work smarter and leaving them in a data fog. Customer dimensions are morphing into communities and vice versa. Transactions are devolving into events. And the path through a page-ranked network is a nearly incomprehensible dimension that rears its head like a new computing grand challenge.
Data warehousing and BI will continue to contribute to finding meaning and creating business opportunities amid the changing market dynamics—but they themselves will never be the same again.