Mark Madsen hates the term unstructured data. He declared his distaste for it during his presentation on emerging technology and big data analytics at The Data Warehousing Institute (TDWI) Executive Summit in Las Vegas last month.
“It’s not unstructured, it’s unmodeled,” he told conference attendees.
Madsen, president of the consulting and research firm Third Nature Inc., didn't linger on this point for too long. He quickly charged forward to talk about data complexity, letting his aside on unstructured data fade into the background. But it raises a question: Why is "unmodeled" a better description than "unstructured" when it comes to data in the form of text, images, audio or video files?
What is unstructured data?
Unstructured data is a generic label for describing any corporate information that is not in a database. Unstructured data can be textual or nontextual. Textual unstructured data is generated in media like email messages, PowerPoint presentations, Word documents, collaboration software and instant messages. Nontextual unstructured data is generated in media like JPEG images, MP3 audio files and Flash video files.
SearchBusinessAnalytics.com recently sat down with Madsen to get to the bottom of that question.
We tend to define “big data” in terms of the three V’s -- volume, velocity, variety. Does this definition do it justice?
Mark Madsen: When we look at volume, variety and velocity, I like Cloudera's take, which is to pick two. Like you do on a Chinese menu, pick two options from the chart. Because Cloudera feels, and I fully agree, that if your problem is just one of these things -- different data types but not a lot of data, or high-speed but small -- it doesn't really matter, because that's an easy, solvable problem. It's when you add data scale that you typically start to have problems.
But the real problem isn't just the three V's, which describe the data; it's also what businesses are trying to do with the data: how much processing is involved, and where it's coming from and going to. To me, it's a complexity argument broken down into some parameters about data.
You presented at the TDWI conference last month and mentioned that unstructured data is a misnomer. Why?
Madsen: From a pedantic, definitional perspective, unstructured would imply that there's no structure in there. With language, there is a structure, but it's not necessarily a formal structure.
If that’s the case, why are so many of us walking around talking about unstructured data?
Madsen: I think the unstructured stuff came out of the idea of taking texty things and extracting information -- references to names, companies and locations -- that you could then record and tally up and pull some implicit meaning out of. Unstructured, to data people, meant it wasn't already squeezed into a table.
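Madsen's description of pulling references out of text and tallying them up can be sketched in a few lines. The snippet below is a deliberately naive, illustrative stand-in for real named-entity extraction -- it just grabs runs of capitalized words from a small, made-up document collection and counts them -- but it shows how "unstructured" text yields tabular structure once entities are extracted:

```python
import re
from collections import Counter

def extract_entities(text):
    """Naive stand-in for named-entity extraction: pull runs of
    capitalized words (likely names, companies, locations)."""
    pattern = r"\b(?:[A-Z][a-z]+\s)+[A-Z][a-z]+\b|\b[A-Z][a-z]+\b"
    return re.findall(pattern, text)

# A tiny, made-up document collection.
docs = [
    "Mark Madsen spoke at the TDWI Summit in Las Vegas.",
    "Cloudera was cited by Mark Madsen during the talk.",
]

# Tally entity mentions across the collection -- the resulting counts
# are tabular, even though the source text is not.
tally = Counter(e for doc in docs for e in extract_entities(doc))
```

After running this, `tally` maps each extracted name to its mention count -- implicit meaning recorded and tallied, just as Madsen describes.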
So, if not unstructured, then what?
Madsen: I don't like the term unstructured; I think it's really unmodeled, either in a database sense or in the sense that it's something like a document corpus for which no structures have been discovered. There's all kinds of academic research in different places looking at inferring structure from a data set to determine how it could best be represented. But trying to get other people to glom on to more precise terminology is a fool's battle.

I usually just say that because it opens the door to a discussion with people about the kind of data they have and what they're trying to do with it. It sort of makes them stop and think for a second. Otherwise, you get clickstream data and blog posts and press releases and document collections and log data all being lumped into this big unstructured bucket when there's more nuance to the actual structure of the information in its content.
Is social media data becoming a bigger part of these analytics projects?

Madsen: I hope so, because I do a lot of projects involving that stuff. What I find interesting is all of the vagaries and nuances in how we use it, Facebook being a great example of the big beast. People have Facebook fan pages, and everybody gravitates to the easy metrics instead of thinking about what they're trying to accomplish and finding measurements that would make sense in light of those goals. With a lot of information, people think about easy data collection first and actual use after, and that leads to problems.
What should people think about when it comes to these kinds of sites?
Madsen: When you look at Facebook, you have a fan page and you know who all of these people are -- you have their identities; maybe you have their profiles; you can look at age, gender and a bunch of other things. There are a lot of ways to do this. You can, for example, perform a search based on keywords. So if you look at common search terms and keywords associated with you and your fan page -- there are some services inside Facebook that will even tell you that based on public timelines -- you can then look at the words people associate with your brand and your company and how they're used, and tie that to demographics.
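The keyword-to-demographics tie-in Madsen describes amounts to a cross-tabulation, which ordinary counting can sketch without any social network API. The records below are entirely hypothetical stand-ins for fan-page mentions, each pairing a keyword with the poster's demographic bucket:

```python
from collections import Counter, defaultdict

# Hypothetical mention records: a keyword someone used about the
# brand, paired with that person's demographic bucket.
mentions = [
    {"keyword": "lipstick", "age_band": "35-50"},
    {"keyword": "lipstick", "age_band": "18-24"},
    {"keyword": "sale",     "age_band": "18-24"},
]

# Cross-tabulate: which words does each demographic associate
# with the brand, and how often?
by_demo = defaultdict(Counter)
for m in mentions:
    by_demo[m["age_band"]][m["keyword"]] += 1
```

Each bucket in `by_demo` now holds the word counts for one demographic, which is exactly the "words people associate with your brand, tied to demographics" view he outlines.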
How can this information potentially hurt businesses?
Madsen: When I did this for a cosmetics company, their core customer demographic was women 35 to 50, generally middle-to-upper-middle-class and cosmopolitan. But the Twitter traffic around the brand came from a young, urban, middle-to-lower-middle-class, African-American distribution. If you take what you read on Twitter -- sentiment, product information, mentions, popularity or any of the things people talk about with Twitter -- you end up with a biased sample. Then you start marketing to that sample and alienating your core demographic.
That's a really obvious one. I was kind of surprised that people who had been around that long had abandoned all of the common sense they'd built up over years of market surveys and market research. But I think that's what happens when something new comes along: All of the common sense gets jettisoned for the new and shiny.
Why is text data so difficult to analyze?
Madsen: You have to build a machine understanding of the information content of the document, and it's just early days for a lot of that science. We kind of know how to process language, but most of it is rudimentary and statistical in nature. Really understanding the meaning inside of things -- how you mean a term -- is harder. There's a whole body of research in sentiment analysis around irony detection, which you and I do automatically, but irony is not something that a computer gets, just like a lot of other things. And then there are the senses of the words and what you do with them.
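The "rudimentary and statistical" processing Madsen mentions can be illustrated with a minimal lexicon-based sentiment scorer. The word lists here are tiny, made-up stand-ins for a real sentiment lexicon, but the sketch shows exactly the failure mode he describes: the counter handles sincere text yet is blind to irony.

```python
# Tiny illustrative word lists -- a real lexicon has thousands of entries.
POSITIVE = {"great", "love", "excellent", "good"}
NEGATIVE = {"terrible", "hate", "awful", "bad"}

def sentiment(text):
    """Score text as (# positive words) - (# negative words)."""
    words = [w.strip(".,!?") for w in text.lower().split()]
    return sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)

# A sincere message scores positive, as expected:
sincere = sentiment("Great product, I love it")                  # +2

# But an ironic complaint also scores positive -- the counter sees
# "great" twice and misses the sarcasm entirely:
ironic = sentiment("Oh great, it broke on day one. Just great")  # +2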
Is this kind of data going to become the norm for businesses?
Madsen: It's building. I think it's just one more set of data; one more piece of the puzzle. We solved the data capture and data warehousing problem for internal transaction processing applications, but we didn't capture all of the interaction data coming through websites and call centers. That's partly because they're not being instrumented, partly because of the complexity and partly because of not knowing exactly what to do with the data. In a way, this is the final piece of the monitoring and automation work that hasn't been done. That's what we're looking at. I would expect it to continue to grow as we figure out how to use and manage these data sources.