As described in the first part of this series, there are many good reasons why data entry and data storage are dispersed. Still, that data has to be integrated, and the reasons for doing so are just as varied. For example:
- Customer care can be improved when sales data is integrated with complaints data and data from social media -- all three being different data sources.
- Transport planning can be more efficient when internal packing and delivery data is integrated with weather and traffic data -- both being external data sources.
- Internal sales data becomes more valuable to an organization when it's integrated with, as an example, demographic data. The combination may explain why certain customers buy certain products. The sales data may be stored in an ERP system running on local servers, whereas the demographic data is available from an external website whose physical location is completely unknown.
- To develop the right key performance indicators (KPIs), sales data must be integrated with manufacturing data.
Another reason for integrating data is purely to be able to make sense of the data. For example, sensor data coming out of high-tech machines can be highly cryptic and coded. The explanations of these codes may be stored in a database residing on another system. So to make sense of the sensor data, the information must be integrated.
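To make the sensor example concrete, here is a minimal sketch in Python. The machine names, codes and descriptions are invented for illustration; in practice the readings and the code explanations would come from two different systems rather than from in-memory structures.

```python
# Hypothetical sensor readings, as they might arrive from the machines.
sensor_readings = [
    {"machine": "M1", "code": "E17", "value": 0.82},
    {"machine": "M2", "code": "T03", "value": 74.5},
    {"machine": "M1", "code": "X99", "value": 1.0},
]

# Hypothetical lookup table explaining the codes; assumed to live
# in a database on another system.
code_descriptions = {
    "E17": "Spindle vibration (g)",
    "T03": "Coolant temperature (deg C)",
}


def decode(readings, descriptions):
    """Attach a human-readable description to each coded reading."""
    decoded = []
    for reading in readings:
        description = descriptions.get(reading["code"], "unknown code")
        decoded.append({**reading, "description": description})
    return decoded


decoded_readings = decode(sensor_readings, code_descriptions)
for row in decoded_readings:
    print(row["machine"], row["code"], row["description"], row["value"])
```

Without the second source, the reading with code E17 is just a number; only the integrated result tells an analyst what was measured.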
Clearly, integrating data from different sources is important for every organization. And now that data is increasingly distributed, data integration becomes an even bigger technological challenge.
For the last 20 years, the most popular place to integrate data has been the data warehouse. In most data warehouse systems, data from multiple sources is physically moved to and consolidated in one big database (at one site). Here, the data is integrated, standardized and cleansed, and made available for reporting and analytics.
This centralization and consolidation of data makes a lot of sense from the perspective of the need to integrate data. And if there isn't too much data, it's technically feasible. But can we keep doing this? Can we keep moving and copying data, especially in this era of big data? It looks as if the answer is going to be no, and for some organizations it's already a no. Here are four problems with the data warehousing approach:
- The ever-growing amount of data. There's a reason why big data is the biggest trend in the IT industry. The word "big" says it all. Big data is about managing, storing and analyzing massive amounts of data. And sometimes big data can be too big to move. For some organizations, the amount of data generated each day is more than can be moved across the network (depending on the network characteristics). In such a situation, when data is moved to a central site for integration purposes, the network cables will start to look like snakes swallowing pigs.
- The growing importance of operational intelligence. End users want to work with zero-latency data that is 100% or close to 100% up to date. If data is first transmitted in large batches over the network and stored redundantly, there will always be a delay. When users demand operational intelligence, it's better to request data straight from the source.
- Privacy. More and more international legislation addresses storing data on individuals, such as customers, patients and website visitors. These rules are becoming tighter and tighter -- and rightfully so. This implies that when an organization needs access to demographic data on individual customers, it can't just copy and store that data in its own systems for integration purposes. The data must be used where it's stored.
- The sheer cost of storing data. Consolidating big data in traditional SQL database servers is becoming too expensive.
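The "too big to move" point comes down to simple arithmetic. The sketch below estimates how long it takes to ship one day's worth of data over a network link. The daily volume, the link speed and the 70% efficiency factor are assumptions chosen for illustration, not figures from any particular organization.

```python
def transfer_hours(data_terabytes: float, link_gbps: float,
                   efficiency: float = 0.7) -> float:
    """Hours needed to move data_terabytes over a link of link_gbps.

    efficiency is an assumed fraction of the nominal bandwidth
    that is actually usable for the transfer.
    """
    bits = data_terabytes * 1e12 * 8                    # decimal TB -> bits
    usable_bits_per_second = link_gbps * 1e9 * efficiency
    return bits / usable_bits_per_second / 3600         # seconds -> hours


# Assumed scenario: 20 TB generated per day, 1 Gbit/s link.
daily_volume_tb = 20
link_gbps = 1

hours = transfer_hours(daily_volume_tb, link_gbps)
print(f"{hours:.1f} hours to move one day of data")
print("fits in a day" if hours <= 24 else "cannot keep up")
```

Under these assumptions, moving a single day of data takes more than two and a half days, so the central site falls further behind every day.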
Until now, centralization may have been the right approach for data integration, but as more data is entered and stored in a distributed fashion, it may not be the right solution in the near future. In the 1980s, distributed database technology moved data to the user, and for integration purposes data was moved to the point of query processing. It's now time to move the query processing to the location where the data is collected. This minimizes network traffic and the duplication of stored data, and it lowers the risk that data will become inconsistent -- or just plain incorrect -- or out of date. If the mountain will not come to Mahomet, Mahomet must go to the mountain.
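The difference between moving the data to the query and moving the query to the data can be sketched with two in-memory SQLite databases standing in for two remote sites. The table, the product names and the amounts are hypothetical, and counting rows is only a rough proxy for network traffic.

```python
import sqlite3


def make_site(rows):
    """Create an in-memory database standing in for one remote site."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE sales (product TEXT, amount REAL)")
    conn.executemany("INSERT INTO sales VALUES (?, ?)", rows)
    return conn


sites = [
    make_site([("bike", 100.0), ("bike", 150.0), ("helmet", 30.0)]),
    make_site([("bike", 200.0), ("helmet", 25.0), ("helmet", 35.0)]),
]


def move_data_to_query(sites):
    """Centralized approach: copy every row, then aggregate centrally."""
    rows_moved, totals = 0, {}
    for site in sites:
        for product, amount in site.execute("SELECT product, amount FROM sales"):
            rows_moved += 1
            totals[product] = totals.get(product, 0.0) + amount
    return totals, rows_moved


def move_query_to_data(sites):
    """Each site aggregates locally; only the partial sums travel."""
    rows_moved, totals = 0, {}
    query = "SELECT product, SUM(amount) FROM sales GROUP BY product"
    for site in sites:
        for product, subtotal in site.execute(query):
            rows_moved += 1
            totals[product] = totals.get(product, 0.0) + subtotal
    return totals, rows_moved


central_totals, central_rows = move_data_to_query(sites)
pushed_totals, pushed_rows = move_query_to_data(sites)
print(central_totals == pushed_totals, central_rows, pushed_rows)
```

Both approaches produce the same totals, but the second ships one row per product per site instead of every sales record. With a handful of rows the saving is trivial; with millions of rows per site it is the difference the paragraph above describes.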
About the author:
Rick van der Lans is an independent consultant, speaker and author, specializing in data warehousing, business intelligence, database technology and data virtualization. He is managing director and founder of R20/Consultancy; email him at firstname.lastname@example.org.