BOSTON -- Text analytics has been through a whirlwind of technological and environmental changes over the past year. “Big data,” acquisitions, social media, knowledge enrichment and integration, APIs and cloud services have all jolted the text analytics industry, according to the opening talk at the eighth annual Text Analytics Summit.
“The field of text analytics is evolving at a pretty rapid pace,” said Seth Grimes, consultant and owner of Alta Plana Corp. in Takoma Park, Md., and founding chairman of the summit. “The application of that technology is driven by the explosive growth of social activity and online information.”
While text analytics technology still struggles with unstructured data beyond text and with handling a narrative or an argument as opposed to a string of sentences, it has made strides, Grimes said. He noted that interest in certain aspects of the field is exploding. Tens of thousands of users, for example, signed up for a recent online course at Stanford University on natural language processing, one of the core technologies behind text analytics.
Grimes said the challenge with big data is the inability to get rid of the garbage.
“Clay Shirky, who we could call an analyst and observer, said it’s not about information overload; it’s filter failure,” Grimes said, referring to one of the Internet’s most well-known commentators. “The challenge is to find the information you need and to filter out the noise.”
But finding the relevant material is only the first part of the equation; next, organizations need to be able to combine that data for analysis.
“We’ve been talking about silos for years,” he said. “We’re finally at a point where we have the technological capabilities and the motives to break down the silos.”
Grimes pointed to HPCC Systems, an open source big data technology developed by LexisNexis Risk Solutions, as one way to do that. The platform pulls data from disparate systems, which may include structured data as well as unstructured data such as text, and ties it together for analysis.
HPCC, though, offers only a limited scope for analyzing that unstructured data. While the technology is advanced enough to extract names, places and organizations, other textual elements go unused.
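The kind of extraction Grimes described -- pulling names, places and organizations out of raw text -- can be sketched with a simple gazetteer lookup. This is a toy illustration, not HPCC's actual method; production systems rely on trained statistical models rather than hand-maintained lists.

```python
import re

# Toy gazetteers for illustration only; real entity extractors use
# statistical models, not small hand-built lists like these.
GAZETTEERS = {
    "ORG": {"LexisNexis Risk Solutions", "IBM", "Oracle"},
    "PERSON": {"Seth Grimes", "Clay Shirky"},
    "PLACE": {"Boston", "Takoma Park"},
}

def extract_entities(text):
    """Return sorted (entity, type) pairs found in the text."""
    found = []
    for etype, names in GAZETTEERS.items():
        for name in names:
            # Word-boundary match so "IBM" does not fire inside "IBMer".
            if re.search(r"\b" + re.escape(name) + r"\b", text):
                found.append((name, etype))
    return sorted(found)

sample = "Seth Grimes discussed a platform from LexisNexis Risk Solutions in Boston."
print(extract_entities(sample))
```

As the article notes, this style of extraction captures named entities but says nothing about the narrative or argument surrounding them.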
APIs, platforms and cloud services
Grimes skipped over cloud services in favor of application programming interfaces, or APIs, and platforms, holding up Radian6 and QlikTech as examples. Radian6, he said, does not offer its own text analytics tools but plugs in technology from other businesses to provide the capability.
“End users can find the capabilities they need and ignore the ones they don’t,” he said. “And it’s attractive as a business model because they don’t have to spend time developing the capabilities themselves.”
QlikTech supports information extraction from external sources through the use of APIs. Grimes said this kind of a framework enables customers to use and pay for only what they need, which gives them more flexibility and agility.
Acquisitions and information access
In the last 12 months, a round of acquisitions, which included some of the bigger players, featured technologies with a text analytics bent: Oracle purchased Endeca, Hewlett-Packard acquired Autonomy, IBM gobbled up Vivisimo and Lexmark bought Isys.
Endeca and Autonomy are known for their capabilities to mash up data -- including unstructured data -- from disparate sources. Vivisimo and Isys are known for their rich search technologies.
“That acquisition by IBM, they’re a huge player, and you would think they would have the analytical capability to do this themselves,” Grimes said. “They didn’t.”
Social media magic
While larger vendors are jumping into the social media analytics arena, Grimes said so far, he’s not impressed.
SAP Social Media Analytics, for example, is really just NetBase technology that the company has agreed to resell and support. Oracle recently announced it was purchasing Collective Intellect, a small social media and text mining analytics company. Before the acquisition, Oracle's social media engagement capabilities were largely CRM-focused, with no analytics.
“Large companies lack the agility to respond effectively and in a timely manner to the social media challenge,” he said. “There is lots of room for innovation from small players.”
Knowledge enrichment and integration
Knowledge enrichment and integration requires that data from different sources be transformed and mapped. That can be difficult because data quality issues are often talked about but rarely addressed, Grimes said.
One way to approach knowledge enrichment and integration is through semantics, which Grimes described as a technology to help join different data types and sources using meaningful identifiers.
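The idea of joining different sources through meaningful identifiers can be sketched as follows. The alias table, source data and `ex:` identifiers here are all hypothetical, chosen only to show the mechanism: two sources that name the same entity differently get merged under one canonical identifier.

```python
# Two hypothetical sources that refer to the same company by
# different surface names.
sales_notes = [("Intl. Business Machines", "renewed support contract")]
press_items = [("IBM", "acquired Vivisimo")]

# A small alias table mapping surface forms to one canonical
# identifier -- the "meaningful identifier" that semantics-based
# integration relies on.
CANONICAL = {
    "IBM": "ex:IBM",
    "Intl. Business Machines": "ex:IBM",
}

def integrate(*sources):
    """Group facts from every source under their canonical identifier."""
    merged = {}
    for source in sources:
        for name, fact in source:
            merged.setdefault(CANONICAL[name], []).append(fact)
    return merged

print(integrate(sales_notes, press_items))
```

The payoff is that downstream analysis sees one entity with all its facts, rather than two apparently unrelated records -- the data-quality mapping step Grimes said is often talked about but rarely addressed.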
The compilation of knowledge bases is already happening. For example, Quora, a question-and-answer site, enables a back-and-forth exchange about certain topics and questions. Grimes referred to this as a manual knowledge-based system. Another example, he said, is IBM's supercomputer Watson, which has a huge knowledge base behind it.
The same kind of technology supports the semantic Web. Search for Vincent van Gogh in Google these days, Grimes said, and the results show facts about the artist's life, images of his work and who his contemporaries were.
The semantic Web uses a graph structure to capture ontological information that links back to the data Web, Grimes said.
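The graph structure Grimes described stores ontological facts as subject-predicate-object triples, the basic unit of the semantic Web's RDF model. A minimal sketch, with made-up `ex:` identifiers standing in for real vocabulary URIs:

```python
# Ontological facts as subject-predicate-object triples, the building
# block of semantic Web graphs (RDF). All identifiers are toy examples.
triples = [
    ("ex:VanGogh", "rdf:type", "ex:Painter"),
    ("ex:VanGogh", "ex:painted", "ex:StarryNight"),
    ("ex:Gauguin", "rdf:type", "ex:Painter"),
    ("ex:VanGogh", "ex:contemporaryOf", "ex:Gauguin"),
]

def query(triples, s=None, p=None, o=None):
    """Return triples matching the pattern; None acts as a wildcard."""
    return [t for t in triples
            if (s is None or t[0] == s)
            and (p is None or t[1] == p)
            and (o is None or t[2] == o)]

# Who are Van Gogh's contemporaries? The kind of question behind the
# fact boxes Grimes pointed to in search results.
print(query(triples, s="ex:VanGogh", p="ex:contemporaryOf"))
```

Pattern queries over such a graph are what let a search engine assemble a fact panel -- life facts, works, contemporaries -- from linked data rather than from a single document.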
“It’s a stack of protocols, standards and functions, and it’s still very incompletely realized,” Grimes said, “but it’s having a significant impact on the text analytics world.”
In fact, he said, text analytics is at the point where it will need to move more and more toward standards.