Big data is all the rage these days because of a combination of forces, including the continued growth of data volumes, the increased velocity of data creation and updates from a variety of internal and external sources, and the availability of easily installed tools for building scalable analytics platforms using commodity hardware.
Just as the expanded capacity of the interstate highway system drove a boom in automobile use, improvements in computational power and speed for business intelligence (BI) and analytics applications enable broader dissemination of actionable knowledge in organizations. Couple that with demands from business users for faster access to information to speed up decision making, and the pressure to provide right-time intelligence capabilities grows rapidly. But in many organizations, a bottleneck in the technology infrastructure causes unwanted delays in the delivery of information. What can be done to break that bottleneck?
Batch extract, transform and load (ETL) processes might have been satisfactory when your data warehouse was refreshed every month. Now pervasive and right-time analytics seems to be within reach, but the batch-oriented approach is insufficient to meet today's -- let alone tomorrow's -- data integration and delivery needs.
Greater storage capacities and more powerful computers allow more and more data to be generated, published and ultimately captured and stored for analysis. Flowing all that data into analytical environments enables more reports and BI information to be streamed to the operational environment for business units to act on. Data scientists want to combine these data streams with the masses of data collected and archived over the years to support deep analytics applications.
The lingering barrier to success is data provisioning: the timely delivery of massive data sets from source systems to BI and analytics platforms. It is the only part of the technology infrastructure that has not scaled to meet the new demands. The ability to provide integrated analytics capabilities to the growing community of business users is being throttled by the inability to provide rapid access to consistent and up-to-date data. Unless the challenge of data latency is addressed, data provisioning will continue to be the biggest bottleneck to increased productivity and accurate business decision making.
Data latency's business ramifications
Business processes relying on big data and other BI applications are negatively affected by that bottleneck. Consider these examples:
Delayed access to data archives. Big data platforms are increasingly being used as interim data archival systems. These "warm archives" must be loaded with data coming from both internal and external sources, and timely migration of the data is necessary for indexing, searching, matching and then delivering information to business users. Data latency reduces the performance and effectiveness of the systems.
Longer development cycles for analytics applications. The process of developing advanced analytics applications consists of a series of iterative steps, involving the development, testing and scoring of analytical models. Big data analytics applications need to be designed using large data sets, and each repetition of the model-test-score cycle may require that the data sets be tweaked and reloaded onto the development platform. Slow data availability elongates the application development cycle and may result in missed business opportunities.
A lack of BI and analytics scalability. Demand for real-time BI and analytics capabilities from a wider community of users could cause an explosion in the number and types of analyses performed simultaneously in an organization. But that would require simultaneous availability of current and timely data to power the analyses -- something that would be hard to achieve if data delivery were sluggish.
Delayed and questionable decision making. Lags in data delivery both into and out of BI and big data systems cause delays in providing actionable information to business decision makers. At the same time, data latency introduces concerns about data currency and consistency that can contribute to uneasiness about the trustworthiness of analytical results -- and, ultimately, the decisions based on those results.
New thinking needed on data delivery
Breaking the data bottleneck is a critical step in ensuring that big data analytics applications and conventional BI systems alike can operate at peak performance. Doing so requires tools and strategies that address the idiosyncrasies that allow the bottleneck to exist in the first place. For example, any solution to the problem must be able to do the following:
- Eliminate or finesse the root causes of data latency and scale data delivery to the speed of the platform.
- Ensure currency, timeliness and consistency of data streamed from internal and external sources.
- Provide broad accessibility to a variety of data sources, including both structured and unstructured data.
Closing the data accessibility gap requires alternatives to conventional ETL processes. For example, many businesses are reacquainting themselves with data replication technology that has been in use for more than 20 years. High-performance data replication speeds the initial delivery of data, and techniques such as change data capture, which forwards only the records that have changed since the last synchronization, help to ensure currency and coherence with the data in other enterprise systems. Caching techniques such as those used by data federation and data virtualization software not only speed data delivery, but also hide the inevitable structural and semantic variations found in siloed systems behind a single, consistent view.
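To make the change data capture idea concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not a description of any particular replication product: the `Change` and `Replica` classes, the `seq` log position and the `"upsert"`/`"delete"` operation names are all hypothetical. The sketch assumes the source system exposes an ordered change log, and the target keeps track of the last log position it has applied so that each sync moves only the new changes rather than reloading the full data set.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional


@dataclass
class Change:
    """One entry in a hypothetical source change log."""
    seq: int                               # increasing position in the log
    op: str                                # "upsert" or "delete" (assumed names)
    key: str                               # primary key of the affected row
    row: Optional[Dict[str, Any]] = None   # new row contents for an upsert


@dataclass
class Replica:
    """Target copy that is kept current by applying only new changes."""
    rows: Dict[str, Dict[str, Any]] = field(default_factory=dict)
    last_seq: int = 0                      # highest log position already applied

    def apply(self, log: List[Change]) -> int:
        """Apply changes newer than last_seq; return how many were applied."""
        applied = 0
        for change in sorted(log, key=lambda c: c.seq):
            if change.seq <= self.last_seq:
                continue                   # already applied, so replays are harmless
            if change.op == "upsert":
                self.rows[change.key] = dict(change.row or {})
            elif change.op == "delete":
                self.rows.pop(change.key, None)
            else:
                raise ValueError(f"unknown op: {change.op}")
            self.last_seq = change.seq
            applied += 1
        return applied


if __name__ == "__main__":
    log = [
        Change(1, "upsert", "cust-1", {"region": "east"}),
        Change(2, "upsert", "cust-2", {"region": "west"}),
    ]
    replica = Replica()
    replica.apply(log)                     # initial delivery of two rows

    log.append(Change(3, "delete", "cust-1"))
    replica.apply(log)                     # only the one new change is applied
    print(replica.rows)
```

The design choice worth noting is the stored log position: because the replica skips anything at or below `last_seq`, the same log can be delivered repeatedly without corrupting the target, and the work per sync is proportional to the number of changes rather than to the size of the data set.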
Together, these methods can be used to alleviate some, if not most, of the delays associated with data access latency. And breaking the data bottleneck will help remove the final barrier to scalable and elastic high-performance computing for big data analytics.
David Loshin is president of Knowledge Integrity Inc., a consulting, training and development company that focuses on information management, data quality and business intelligence. Loshin also is the author of four books, including The Practitioner's Guide to Data Quality Improvement and Master Data Management.