How can all of the distributed data in an organization be integrated without first copying it to a centralized data store, such as a data warehouse? Data virtualization technology offers a potential solution. In a nutshell, data virtualization makes a heterogeneous set of data sources look like one logical database to business users and applications. These data sources don't have to be stored locally; they can be anywhere.
Data virtualization software is designed and optimized to integrate data live, on the fly. There's no need to physically store all the data to be integrated centrally. Only when data from several different sources is requested by users is it integrated -- not before that. In other words, data virtualization supports integration on demand.
Because data virtualization servers retrieve data from other systems, they must understand networks. They must know how to efficiently transmit data over the network to the server where the integration takes place. For example, to minimize network traffic, mature data virtualization servers deploy so-called push-down techniques. If a user asks for a small portion of a table, only that portion of the data is extracted by the data virtualization software from the data source, not the entire table. The query is "pushed down" to the data source instead of requesting the entire table.
Push-down allows a data virtualization server to move the processing to the data instead of moving the data to the processing. In the latter case, all the data is transmitted to the integration server that subsequently executes the request. Especially if big data sets are used, this approach would be slow because of the amount of network traffic involved. A preferred approach is to ship the query to the data source and transmit only relevant data back to the data virtualization server.
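The difference between the two approaches can be illustrated with a minimal sketch. It assumes an in-memory SQLite database as a stand-in for a remote data source; the `orders` table, its columns and the function names are hypothetical and exist only for illustration. Real data virtualization servers translate requests into each source's own query language, but the principle is the same: the filter runs where the data lives.

```python
import sqlite3

def make_source(n_rows=1000):
    """Create an in-memory SQLite database standing in for a remote data source."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE orders (id INTEGER PRIMARY KEY, region TEXT, amount REAL)"
    )
    regions = ["EU", "US", "APAC", "LATAM"]  # hypothetical sample values
    conn.executemany(
        "INSERT INTO orders (id, region, amount) VALUES (?, ?, ?)",
        [(i, regions[i % 4], float(i)) for i in range(n_rows)],
    )
    conn.commit()
    return conn

def fetch_without_pushdown(source, region):
    """Move the data to the processing: ship the whole table, then filter."""
    all_rows = source.execute(
        "SELECT id, region, amount FROM orders ORDER BY id"
    ).fetchall()
    result = [row for row in all_rows if row[1] == region]
    return result, len(all_rows)  # every row crossed the "network"

def fetch_with_pushdown(source, region):
    """Move the processing to the data: the source applies the filter itself."""
    rows = source.execute(
        "SELECT id, region, amount FROM orders WHERE region = ? ORDER BY id",
        (region,),
    ).fetchall()
    return rows, len(rows)  # only matching rows crossed the "network"

source = make_source()
naive_result, naive_transferred = fetch_without_pushdown(source, "EU")
pushed_result, pushed_transferred = fetch_with_pushdown(source, "EU")

print("rows transferred without push-down:", naive_transferred)
print("rows transferred with push-down:   ", pushed_transferred)
```

Both functions return the same answer; only the number of rows that have to travel from the source to the integration server differs.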
Moving processing to the data is a powerful feature to optimize network traffic, but it's not sufficient on its own for the distributed data world of tomorrow. Imagine that a data virtualization system runs on one server and all the requests for data are first moved to that central server, then queries are sent to all the data sources, answers are transmitted back and all the data is integrated and returned to the users.
This centralized processing of requests can be highly inefficient. It would be like a parcel service operating worldwide that first shipped all parcels to Denver, and from there to the destination address. If a specific parcel has to be shipped from New York to San Francisco, that isn't a bad solution. However, a delivery from New York to Boston is going to take an unnecessarily long time because of the detour to Denver. Or what about a parcel that must be shipped from Berlin to London? That parcel is going to have to make a very long journey before it arrives in London.
Besides being inefficient, relying on a single data virtualization server is not recommended because it lowers availability. If the server crashes, no one can reach the data anymore. It would be like the parcel service when the airport in Denver is closed because of bad weather.
To address the new data integration workload, it's important that data virtualization servers support a highly distributed architecture. Each node in the network where queries originate and data sources reside should run an instance of the data virtualization software to process these requests. Each node that receives user requests should know where the requested data resides and must push the request to the relevant data virtualization server. Multiple data virtualization servers then work together to execute the request. Conversely, when no remote data is requested, no data is shipped across the network.
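A rough sketch of that idea follows. It is a toy model under stated assumptions, not a description of how any particular product works: the `Node` class, its `catalog` of table locations and the sample tables are all hypothetical. Each node answers requests for its own data locally and forwards requests for remote data, together with the filter, to the node that owns it.

```python
class Node:
    """Hypothetical data virtualization node that owns some tables locally."""

    def __init__(self, name, local_tables):
        self.name = name
        self.local_tables = local_tables  # table name -> list of row dicts
        self.catalog = {}                 # table name -> Node that owns it
        self.rows_shipped = 0             # rows received from other nodes

    def query(self, table, predicate):
        # Local data: process the request here, nothing is shipped.
        if table in self.local_tables:
            return [row for row in self.local_tables[table] if predicate(row)]
        # Remote data: push the request (including the filter) to the owner,
        # so that only the matching rows travel back to this node.
        owner = self.catalog[table]
        rows = owner.query(table, predicate)
        self.rows_shipped += len(rows)
        return rows

berlin = Node("berlin", {
    "customers": [{"id": 1, "city": "Berlin"}, {"id": 2, "city": "Potsdam"}],
})
london = Node("london", {
    "orders": [
        {"id": 10, "customer_id": 1, "total": 50},
        {"id": 11, "customer_id": 2, "total": 75},
        {"id": 12, "customer_id": 1, "total": 20},
    ],
})
# Each node knows where the tables it does not own reside.
berlin.catalog["orders"] = london
london.catalog["customers"] = berlin

local_rows = berlin.query("customers", lambda r: r["city"] == "Berlin")
shipped_after_local = berlin.rows_shipped

remote_rows = berlin.query("orders", lambda r: r["customer_id"] == 1)
shipped_after_remote = berlin.rows_shipped

print(local_rows, shipped_after_local)
print(remote_rows, shipped_after_remote)
```

In this model a request that touches only local data never leaves the node, and there is no central server through which every request must pass.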
Network knowledge needed
This is only possible if data virtualization servers are knowledgeable about network aspects, such as the fastest network route, the cheapest one and how to transmit data efficiently. Just as they must know how to optimize database access, they must also know how to optimize network traffic. This requires a close marriage of network and data virtualization technology. Note that this requirement to distribute data virtualization processing across multiple nodes isn't very different from the data processing architectures of NoSQL database systems.
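To make the distinction between the fastest and the cheapest route concrete, here is a small sketch that runs Dijkstra's shortest-path algorithm over a toy network. The node names, the `latency_ms` and `cost` figures and the `best_route` helper are invented for illustration; how a real data virtualization server models its network is not specified here.

```python
import heapq

def best_route(links, start, goal, metric):
    """Dijkstra's algorithm: find the route that minimizes the given metric."""
    queue = [(0, start, [start])]  # (total so far, current node, path taken)
    visited = set()
    while queue:
        total, node, path = heapq.heappop(queue)
        if node == goal:
            return total, path
        if node in visited:
            continue
        visited.add(node)
        for neighbor, attrs in links.get(node, {}).items():
            if neighbor not in visited:
                heapq.heappush(
                    queue, (total + attrs[metric], neighbor, path + [neighbor])
                )
    return None  # goal is unreachable from start

# Hypothetical network: the direct link is fast but expensive,
# the detour via Frankfurt is slower but cheaper.
links = {
    "berlin": {
        "london": {"latency_ms": 30, "cost": 9},
        "frankfurt": {"latency_ms": 10, "cost": 2},
    },
    "frankfurt": {
        "london": {"latency_ms": 25, "cost": 3},
    },
    "london": {},
}

fastest = best_route(links, "berlin", "london", "latency_ms")
cheapest = best_route(links, "berlin", "london", "cost")

print("fastest route: ", fastest)
print("cheapest route:", cheapest)
```

The same network yields different answers depending on the metric, which is why a network-aware server needs to know which one matters for a given request.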
Data and data entry are increasingly distributed across networks, and that trend will only intensify over time. The time when all data was stored together is gone for good. Sun Microsystems' tagline once was "The network is the computer." In this era, in which data is entered and stored everywhere, users who access the data can be everywhere and big data systems are being broadly deployed, an analogous statement can be made:
The network is the database.
And if the network is indeed the database, copying all of the data to one centralized node for integration purposes is expensive and technically almost infeasible, and it may clash with regulations. Thanks to its integration-on-demand capabilities, data virtualization technology offers a more suitable approach to integrating widely dispersed data. But that requires data virtualization servers to have a highly decentralized architecture and to be extremely network-aware.
About the author:
Rick van der Lans is an independent consultant, speaker and author, specializing in data warehousing, business intelligence, database technology and data virtualization. He is managing director and founder of R20/Consultancy; email him at firstname.lastname@example.org.