Thirty years ago, John Gage of Sun Microsystems coined the phrase "The network is the computer." He was right then, and he's even more right today. Nowadays, application processing is highly distributed over countless machines connected by a network. The boundaries between computers have completely blurred. We run applications that seamlessly invoke application logic on other machines.
But it's not only application processing that is scattered across many computers; the same can be said for data. More and more digitized data is entered, collected and stored in a distributed fashion. It's stored in cloud applications, in outsourced ERP systems, on remote websites and so on. In addition, external data is available from government, social media and news websites, and the number of valuable open data sources is staggering. The network is not only the computer anymore; the network has become the database as well.
This dispersion of data is a fact. Still, data has to be integrated to become valuable for an organization. For a long time, the traditional solution has been to copy the data to a centralized site, such as a data warehouse. However, data volumes are increasing (and not only because of the popularity of big data systems). The consequence is that, more and more often, data is simply too big to move: for performance, latency or financial reasons, it has to stay where it's entered. For integration, instead of moving the data to the query processing (as in data warehouse systems), query processing must be moved to the data sources.
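The difference between moving the data to the query and moving the query to the data can be made concrete with a small simulation. The sketch below is only an illustration under stated assumptions: `RemoteSource`, `fetch_all`, `run_query` and the sample orders are hypothetical names and made-up data, not the interface of any real product. It counts how many rows cross the simulated network under each approach.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Row = Dict[str, int]
Predicate = Callable[[Row], bool]


@dataclass
class RemoteSource:
    """Hypothetical stand-in for a remote data source (not a real API)."""
    name: str
    rows: List[Row]
    rows_shipped: int = 0  # rows that crossed the simulated network

    def fetch_all(self) -> List[Row]:
        # Data-to-query: every row travels to the central site.
        self.rows_shipped += len(self.rows)
        return list(self.rows)

    def run_query(self, predicate: Predicate) -> List[Row]:
        # Query-to-data: filter locally, ship only the matches.
        matches = [row for row in self.rows if predicate(row)]
        self.rows_shipped += len(matches)
        return matches


def make_sources() -> List[RemoteSource]:
    # Made-up sample data: two sources with 1,000 orders each.
    return [
        RemoteSource(name, [{"order_id": i, "amount": i} for i in range(1000)])
        for name in ("erp", "webshop")
    ]


def is_large_order(row: Row) -> bool:
    return row["amount"] >= 990


# Approach 1: copy everything to a central site, then filter there.
central_sources = make_sources()
copied = [row for src in central_sources for row in src.fetch_all()]
central_result = [row for row in copied if is_large_order(row)]
central_shipped = sum(src.rows_shipped for src in central_sources)

# Approach 2: send the query to each source, collect only the results.
pushdown_sources = make_sources()
pushdown_result = [
    row for src in pushdown_sources for row in src.run_query(is_large_order)
]
pushdown_shipped = sum(src.rows_shipped for src in pushdown_sources)

print(f"same answer: {central_result == pushdown_result}")
print(f"rows shipped, centralized: {central_shipped}")
print(f"rows shipped, pushed down: {pushdown_shipped}")
```

Both approaches return the same rows; what changes is how much data has to travel to produce them. Real systems also have to deal with joins across sources, security and source capabilities, which this sketch deliberately ignores.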
This three-part article explains the problem of centralized consolidation of data and describes how data virtualization can help turn the network into a database by using on-demand integration. It also explains the importance of distributed data virtualization for operating efficiently in today's highly networked environment.
A short history lesson
Once upon a time, all of an enterprise's digitized data was stored on a small number of disks managed by a few machines, all standing in the same computer room. Specialists in white coats monitored these machines and were responsible for making backups of the valuable data. It's very likely that all of the end users were in the same building as well, accessing the data through monochrome monitors. The "network" that was used to move data between the machines was referred to as the sneakernet.
Then the time came when users started to roam the planet, and machines residing in different buildings were connected with real networks. Compared to today, these first generations of networks were just plain slow. For example, in the 1970s, Bob Metcalfe (a co-inventor of Ethernet) built a high-speed network interface between MIT systems and ARPANET, the forerunner of the Internet. The interface supported a dazzling-at-the-time network bandwidth of 100 Kbps. Compare that with today's 100 Gigabit Ethernet technology, which offers a million times more bandwidth. In an optimized network environment, one terabyte of data can now be transferred in about 80 seconds. At 100 Kbps, the same transfer would have taken roughly 2.5 years.
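The figures above are easy to verify with a back-of-the-envelope calculation. The short sketch below assumes decimal units (one terabyte is 10^12 bytes, one Kbps is 1,000 bits per second) and ignores protocol overhead; the constant and function names are chosen here for illustration only.

```python
# Rough check of the transfer times quoted in the text.
# Assumptions: decimal units, a fully used link, no protocol overhead.

SECONDS_PER_YEAR = 365.25 * 24 * 3600


def transfer_seconds(num_bytes: int, bits_per_second: int) -> float:
    """Time needed to push num_bytes through a link of the given speed."""
    return num_bytes * 8 / bits_per_second


ONE_TERABYTE = 10**12                # bytes
ARPANET_1970S_BPS = 100 * 10**3      # 100 Kbps
ETHERNET_100G_BPS = 100 * 10**9      # 100 Gbps

speedup = ETHERNET_100G_BPS / ARPANET_1970S_BPS
modern_seconds = transfer_seconds(ONE_TERABYTE, ETHERNET_100G_BPS)
legacy_years = transfer_seconds(ONE_TERABYTE, ARPANET_1970S_BPS) / SECONDS_PER_YEAR

print(f"bandwidth ratio: {speedup:,.0f}x")
print(f"1 TB at 100 Gbps: {modern_seconds:.0f} seconds")
print(f"1 TB at 100 Kbps: {legacy_years:.1f} years")
```

In practice, sustained throughput is lower than the nominal link speed, so these numbers are best read as lower bounds on transfer time.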
Because users were working at remote sites, accessing data involved transmitting it back and forth over the network, and that was slow. Database vendors tried to solve this problem by developing distributed database servers in the 1980s. By applying replication and partitioning techniques, data was moved closer to the users to minimize network delays. With replication, data is copied to the nodes on the network where users are requesting data. With partitioning, a data set is split into parts, and each part is stored on the node closest to the users who need it most. To keep replicas up to date, distributed database servers support complex and innovative replication mechanisms.
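The two placement techniques mentioned here, replication and partitioning, can be illustrated with a toy example. This is a minimal sketch, not how any particular distributed database server implements them: the node names, the `replicate` and `partition` functions and the customer rows are all made up for illustration, and the hard part (keeping replicas synchronized after updates) is left out entirely.

```python
from collections import defaultdict
from typing import Dict, List

Row = Dict[str, object]

# Made-up sample data; "region" decides which node is closest to the users.
customers: List[Row] = [
    {"id": 1, "region": "EU", "name": "Customer A"},
    {"id": 2, "region": "US", "name": "Customer B"},
    {"id": 3, "region": "EU", "name": "Customer C"},
]
node_names = ["EU", "US"]


def replicate(rows: List[Row], nodes: List[str]) -> Dict[str, List[Row]]:
    """Replication: every node receives its own full copy of the data."""
    return {node: [dict(row) for row in rows] for node in nodes}


def partition(rows: List[Row], key: str) -> Dict[str, List[Row]]:
    """Partitioning: each row is stored only on the node its key points to."""
    placement: Dict[str, List[Row]] = defaultdict(list)
    for row in rows:
        placement[str(row[key])].append(row)
    return dict(placement)


replicated = replicate(customers, node_names)
partitioned = partition(customers, "region")

for node in node_names:
    print(
        f"{node}: {len(replicated[node])} rows when replicated, "
        f"{len(partitioned[node])} rows when partitioned"
    )
```

Replication lets every node answer any query locally at the cost of storing and synchronizing multiple copies; partitioning stores each row once, but a query that spans regions has to touch more than one node.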
Data, data, everywhere
Nowadays, new data is no longer found only in the computer room. Data is entered, collected and stored everywhere. Examples include:
Distributed data collection. Websites running in the cloud collect millions of activity log records indicating visitor behavior. Factories operating worldwide run high-tech machines generating massive amounts of sensor data. Mobile devices collect data on application usage and track geographical locations.
Cloud applications. Business applications such as Salesforce.com and NetSuite store enterprise data in the cloud.
Open data. Thousands and thousands of open data sources have become available for public access. They contain weather data, demographic data, energy consumption data, hospital performance data, public transport data -- and the list goes on. Almost all of these open data sources are stored somewhere in the cloud.
Outsourced servers. The fact that so many organizations run their ERP applications and databases in the cloud has also led to a distribution of enterprise data. In fact, some organizations really have no clue anymore where their data is stored physically.
Personal data. Data created by individual users or small groups of users is stored far and wide. It's available on their own machines, on mobile devices and in services such as Dropbox or Google Drive.
But it's not only that data is stored in a distributed fashion; data entry is distributed as well. Employees, customers and suppliers all enter data via the Internet, using their own machines at home, their mobile devices and so on. Data entry has never been more dispersed.
To summarize this history lesson, in the beginning data and users were centralized. Next, data stayed centralized and users became distributed. Now data and users are both highly distributed.
Continue on to part two of this series.
About the author:
Rick van der Lans is an independent consultant, speaker and author, specializing in data warehousing, business intelligence, database technology and data virtualization. He is managing director and founder of R20/Consultancy; email him at firstname.lastname@example.org.