Big data is everywhere these days. Marketing materials are bursting with references to how products have been enhanced to handle big data. Consultants and analysts are busy writing new articles and creating elegant presentations. But the sad reality is that big data remains one of the most ill-defined terms we’ve seen in many a year.
The problem is that data volume is a metric that tells us little about the data characteristics that allow us to understand its sources, its uses in business and the ways we need to handle it in practice. Even the emerging approach of talking about big data in terms of volume, velocity and variety leaves a lot to be desired in terms of clarity about what big data really is.
Business drivers and origins
So, what is the problem? And, more to the point, is there an answer? The problem is that, in a technical sense, the data sets lumped together as big data have little in common beyond their “bigness.” Hence the difficulty in coming up with a single, all-encompassing definition.
However, in a business sense, there is one common theme—predicting the future. Based on statistical analysis of past and present reality, we try to predict and/or influence future events, behaviors and so on. This is the same goal that we’ve seen in data mining since the 1990s. In simple terms, the business driver for big data is a logical extension of data mining. The novelty lies in the fact that with ever larger data volumes and new data sources, we can obtain more statistically accurate results and, hopefully, make more accurate predictions.
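The link between volume and statistical accuracy can be illustrated with a minimal sketch. It assumes independent samples and uses the textbook standard error of a sample mean; the spread value and the sample sizes are hypothetical figures chosen only for illustration.

```python
import math


def standard_error(std_dev: float, n: int) -> float:
    """Standard error of a sample mean: std_dev / sqrt(n)."""
    if n <= 0:
        raise ValueError("n must be positive")
    return std_dev / math.sqrt(n)


# Hypothetical spread of the measured quantity (illustrative only).
STD_DEV = 10.0

# Each hundredfold increase in sample size cuts the standard error tenfold.
for n in (100, 10_000, 1_000_000):
    print(f"n={n:>9,}  standard error={standard_error(STD_DEV, n):.3f}")
```

The gain is real but slow: precision improves only with the square root of the volume, and more data does nothing to correct a biased source. That is why better predictions remain a hope rather than a guarantee.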
Thus, we return to data volumes. The origins of big data as a concept and phrase can be traced back to the scientific community. Researchers in astronomy, physics, biology and other fields have long been at the forefront of collecting vast quantities of data from ever more sophisticated sensors. By the early 2000s, they encountered significant problems in processing and storing these volumes and coined the term big data, probably as a synonym for big headaches. We see here the beginnings of the business driver mentioned above, as science today is founded largely on statistical analysis of collected data. What begins in pure science moves inexorably to engineering and finally emerges in business and, especially, marketing.
Definition and handling
It is that evolution in usage that leads to the conclusion that no single definition of big data is possible; it’s a phrase that takes its meaning from the context of its use. Do not despair, however. This thinking also leads directly to a more useful understanding of four different classes of big data, each with specific characteristics and uses.
The first class is metrics and measures, emanating more or less directly from sensors, monitoring devices and less complex machines. Examples include RFID readers, ZigBee devices and the multitude of sensors in modern airplanes, cars, cameras and, perhaps most interestingly, smartphones. Such data is highly structured and reflects discrete events or characteristics of the physical world.
The second class, also machine-sourced, consists of computer event logs tracking everything from processor usage and database transactions to clickstreams and instant message distribution. While machine-generated, data in both of these classes are proxies for events in the real world. In business terms, those that record the results of human actions are of particular interest. For example, measurements of speed, acceleration and braking forces from an automobile can be used to make inferences about driver behavior and thus insurance risk.
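As a rough illustration of the automobile example, the sketch below counts harsh-braking episodes in a stream of telemetry readings. The `Reading` fields, the threshold value and the sample trip are all hypothetical; a real insurer's model would use far richer inputs and calibrated thresholds.

```python
from dataclasses import dataclass


@dataclass
class Reading:
    """One telemetry sample from a vehicle (hypothetical schema)."""
    timestamp_s: float
    speed_kmh: float
    accel_ms2: float  # longitudinal acceleration; negative means braking


# Assumed cutoff for "harsh" braking, chosen only for illustration.
HARSH_BRAKE_MS2 = -3.0


def harsh_braking_events(readings, threshold=HARSH_BRAKE_MS2):
    """Count distinct episodes in which deceleration reaches the threshold.

    Consecutive harsh readings are treated as a single episode.
    """
    events = 0
    in_event = False
    for r in readings:
        is_harsh = r.accel_ms2 <= threshold
        if is_harsh and not in_event:
            events += 1
        in_event = is_harsh
    return events


# Made-up trip data: two separate harsh-braking episodes.
trip = [
    Reading(0, 50.0, 0.2),
    Reading(1, 48.0, -3.5),
    Reading(2, 40.0, -4.0),
    Reading(3, 38.0, -0.5),
    Reading(4, 30.0, -3.2),
    Reading(5, 29.0, 0.0),
]

print(harsh_braking_events(trip))
```

A count like this is only a proxy for driver behavior. The inference about insurance risk comes from statistical analysis of such measures across many drivers and trips, not from any single reading.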
In classes three and four, we have social media information directly created by humans, divided into the more highly structured textual information and the less structured multimedia audio, image and video categories. Statistical analysis of such information gives direct access to people’s opinions and reactions, allowing new methods of individual marketing and direct response to emerging opportunities or problems. Much of the current hype around big data comes from the insights into customer behavior that Web giants like Google and eBay and mega-retailers such as Walmart can obtain by analyzing data in these classes—especially the textual class, so far. However, in the longer term, machine-generated data, particularly from the metrics and measures class, is likely to be the big game-changer simply because of the number of events recorded and communicated.
But what about my current BI system?
From a business viewpoint, big data significantly shifts the emphasis in business intelligence (BI) from reporting and problem-solving to prediction. The former won’t go away, of course, and high levels of competence and investment in those aspects will continue to be needed just to stay in the game. However, the ability, provided by advanced analytics on large data volumes, to anticipate changes in the market will separate the leaders from the also-rans.
From an IT point of view, the issue divides largely between classes one and two on the one hand and classes three and four on the other. In the first two classes, we deal with data that is structurally similar to that on which traditional BI is based. At the high end, volumes and velocity will continue to demand innovative technological solutions. Lower down the scale, traditional tools and techniques will likely stretch upward to larger slices of the middle ground. However, one thing is clear: The old thinking that all data must be funneled through an enterprise data warehouse cannot survive.
This becomes even clearer when we look at the third and fourth classes. The data found there has different characteristics from traditional BI data. Not only does it have far less structure, but that structure is also fluid and its semantics are poorly suited to the type of prior modeling that is the foundation of traditional data warehousing. This socially sourced data will most likely continue to require a different environment and approach to analysis and management. However, it will need to be linked to classic BI systems via summary-results data imported into the data warehouse environment and metadata that bridges the semantic gap between the two areas.
The reality is that big data is going to stretch BI significantly and that the technology is evolving and merging rapidly to meet this challenge.
About the author:
Barry Devlin is among the foremost authorities in the world on business insight and data warehousing. His current interest is in the wider field of a fully integrated business, covering informational, operational and collaborative environments. He is the founder and principal of 9sight Consulting; email him at firstname.lastname@example.org.