One of the growing uses for big data platforms is capturing streams of data that are continuously ingested, processed, stored and analyzed. Real-time streaming analytics provides immediate visibility into business activities and feeds operational reporting, which is particularly beneficial for organizations that can act quickly on incoming data as events unfold.
A common example is manufacturers that have equipped various plant-floor machines with embedded sensors that measure aspects of the operating environment and communicate those measurements in streams of data, increasingly via the internet of things (IoT). These data streams are fed to a central system and analyzed to develop predictive maintenance models that can identify impending equipment failures and drive pre-emptive replacements of at-risk parts, thereby reducing unplanned downtime.
Delivery and logistics services companies are another case in point. Many capture myriad streams of operational metrics on their trucks from in-vehicle computers and sensors, including data on fuel use, speed and acceleration, air intake and tire pressure. Combined with GPS-based location data that's also collected in streams and transmitted via IoT connections, the metrics can be used to assess ways to reduce fuel consumption and improve driving habits as well as to speed delivery times.
Electrical utilities have deployed monitoring devices across their power grids to transmit data about energy consumption, network stresses and potential risks of equipment failure. The data streams can be analyzed to look for ways to balance delivery of electricity, identify selected "hot" devices that could be powered down during times of high electrical usage and predict the types of parts that repair crews might need before leaving the garage.
Not a smooth flow in data streams
All of those examples share some common characteristics. In each case, multiple sources produce data and stream it independently, and streams from numerous devices are sometimes combined into a single logical stream. That requires a logical "broker" at the local level to meld the data and then transmit it to a centralized location.
However, that still leaves a variety of incoming data streams, and users may want to independently ingest each one into a real-time analytics system -- even to the point of differentiating between the actual data sources within a logical data feed. In addition, data from the different streams is likely to be filtered, processed and stored in asynchronous ways.
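To make the idea of a logical data feed concrete, here is a minimal Python sketch of one way a local broker might meld per-device streams while keeping each reading traceable to its source. The `Reading` type, the `meld` and `split_by_source` functions, and the truck IDs and metric names are all hypothetical, chosen only for illustration; a real deployment would work with a device's actual message format.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Reading:
    source_id: str  # which device produced the reading (hypothetical field)
    metric: str
    value: float


def meld(streams):
    """Combine per-device streams into one logical stream, tagging each value with its source."""
    for source_id, readings in streams.items():
        for metric, value in readings:
            yield Reading(source_id, metric, value)


def split_by_source(logical_stream):
    """Recover the per-device view from a combined feed."""
    grouped = defaultdict(list)
    for reading in logical_stream:
        grouped[reading.source_id].append(reading)
    return dict(grouped)


# Made-up sample data for two trucks.
streams = {
    "truck-17": [("tire_pressure_psi", 101.5), ("speed_mph", 54.0)],
    "truck-42": [("tire_pressure_psi", 98.0)],
}
logical = list(meld(streams))
by_source = split_by_source(logical)
```

Because every reading carries its source ID, a downstream analytics system can treat the feed as a single stream or break it back out by device, whichever a given analysis needs.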
Under these circumstances, it becomes clear that deploying big data platforms to ingest, process and analyze data streams addresses only part of the challenge of enabling real-time streaming analytics. Another big hurdle is organizing the methods by which streaming data is forwarded to an analytics system in a way that preserves the integrity of the operational events generating the data. IT and analytics teams need to ensure that the process of combining data streams and forwarding them maintains the order in which sensor readings, alerts and other data points are created.
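As a rough illustration of order-preserving combination, the Python sketch below merges two streams by event timestamp using the standard library's `heapq.merge`. It assumes that each source's stream is already in time order and that timestamps are comparable across sources, neither of which is guaranteed in practice. The event layout, the pump names and the `merge_in_event_order` function are invented for the example.

```python
import heapq
from operator import itemgetter

# Each event is (timestamp_seconds, source_id, payload).
# Assumption: every per-source list is already sorted by timestamp.
pump_a = [(1.0, "pump-a", "temp=71"), (4.0, "pump-a", "temp=75")]
pump_b = [(2.0, "pump-b", "temp=68"), (3.0, "pump-b", "ALERT overheating")]


def merge_in_event_order(*streams):
    """Lazily interleave sorted streams so events come out in timestamp order."""
    return heapq.merge(*streams, key=itemgetter(0))


merged = list(merge_in_event_order(pump_a, pump_b))
```

The merge is lazy and holds only one pending event per stream in memory, which suits continuous feeds better than collecting everything and sorting it afterward.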
That's even more challenging in a distributed environment, especially when the data streams are interwoven, as in a sequence of events among collaborating processes on different machines. The need to keep things properly coordinated calls for an overarching broker mechanism that can manage the queuing and transmission of data in all of the incoming streams.
Data-stream management features
Such a broker must be able to oversee the organization of streaming data by originating source, combine different streams while preserving the order of events and maintain data consistency across sources. At the same time, it has to transmit the data streams without causing any significant delays in the desired real-time delivery. And it must provide fault tolerance with assurances of recovery in the event of a failure in the data streaming environment.
Apache Kafka has emerged as the most prominent example of a fault-tolerant message broker and queuing system in the big data ecosystem. Kafka, which was created at LinkedIn and released as an open source technology, works with Spark Streaming, Storm, Samza, Flink and other stream processing platforms, as well as HBase, Hadoop's companion database. It acts as a clearinghouse for real-time message streams, providing a combination of scalability and reliability features to help address the need for high performance in streaming analytics applications involving large volumes of data.
Kafka uses a publish-and-subscribe messaging model to transmit data streams from source to target systems. Messages published to Kafka are persisted on disk and replicated across different nodes in the server cluster that the software runs on. Because the data is replicated, multiple subscribers to different data streams can be supported simultaneously; replication also allows the tool to balance workloads across the cluster to maintain performance and data availability in the event of a node failure.
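The publish-and-subscribe pattern with replicated message logs can be sketched in a few lines of Python. The `ToyBroker` class below is a purely illustrative in-memory model, not Kafka's API or its actual design: it assumes that each replica keeps a full copy of every topic's log and that the broker tracks each subscriber's read position. It ignores disk persistence, partitioning and failure detection entirely.

```python
from collections import defaultdict


class ToyBroker:
    """Illustrative in-memory broker; hypothetical, and not Kafka's API."""

    def __init__(self, replica_count=2):
        # Each replica holds its own copy of every topic's append-only log.
        self.replicas = [defaultdict(list) for _ in range(replica_count)]
        # (subscriber, topic) -> index of the next message to read.
        self.offsets = defaultdict(int)

    def publish(self, topic, message):
        # Append the message to the topic's log on every replica.
        for replica in self.replicas:
            replica[topic].append(message)

    def poll(self, subscriber, topic, replica_index=0):
        # Return messages this subscriber has not yet seen, from the chosen replica.
        log = self.replicas[replica_index][topic]
        start = self.offsets[(subscriber, topic)]
        self.offsets[(subscriber, topic)] = len(log)
        return log[start:]


broker = ToyBroker()
broker.publish("grid-load", "substation-3 load=82%")
broker.publish("grid-load", "substation-7 load=64%")

# Two independent subscribers read the same stream, each from a different replica.
dashboard_first = broker.poll("dashboard", "grid-load")
archiver_first = broker.poll("archiver", "grid-load", replica_index=1)

broker.publish("grid-load", "substation-3 load=91%")
dashboard_second = broker.poll("dashboard", "grid-load")  # only the new message
```

Since messages stay in the log after being read, each subscriber advances at its own pace, and any replica can serve a read if another copy becomes unavailable.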
Other open source message broker technologies are also available -- RabbitMQ and ActiveMQ, for example. And as more companies recognize the potential business value waiting to be tapped in data streams, more of them likely will also see the need to deploy Kafka or another messaging system to support real-time streaming analytics applications.