This article originally appeared on the BeyeNETWORK.
My previous article in this series, Using Enterprise 2.0 for Business
Using Business Content in Business Intelligence
Unstructured business content varies considerably in both type and format. Broadly speaking, however, this content falls into one of two main categories – rich media (images, audio and video) and textual information (web pages, documents, electronic forms and reports, for example). From a BI perspective, most of the emphasis at present is on analyzing textual information; rich media is used primarily for reference purposes.
In Part 3 of this series, we saw how search can be used to explore and analyze information (left-hand side of Figure 1). Although search approaches can analyze both unstructured content and structured data, the main focus of this type of processing is on unstructured business content.
Figure 1: Processing Unstructured Business Content
The reverse is true when it comes to capturing, transforming and loading information into a target data store such as a data warehouse. Here the focus is on integrating structured data (right-hand side of Figure 1). In a data warehousing system, the unstructured content is used primarily to extend and supplement the structured information in the warehouse. This unstructured content, however, can often add considerable business value to the business intelligence environment. Web logs, email and support center reports, for example, enable companies to get valuable insight into customers’ attitudes toward product value and quality. Competitors’ website data can be used in the travel and retail industries to build competitive pricing models.
Integrating Business Content
Data integration products provide a wide variety of techniques for processing and integrating structured data. The three main ones are data federation, data consolidation and data propagation. In all three cases, the data must be captured, transformed as required and then delivered to a target application or data store.
The three techniques for integrating structured data can also be used for handling unstructured content. One extra step is required during the transformation process to extract and convert the required business information into a semi-structured (typically XML) or structured format. The transformed results can then be delivered to an application or data store using the same approaches as those used for structured data.
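The extra transformation step can be pictured with a small sketch. The Python below is illustrative only: the element names, the regular expressions and the sample support note are assumptions made for this example, not part of any product mentioned in this article. It pulls two pieces of business information out of free text and wraps them in XML so the result can be delivered like any other record.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical patterns for the business information of interest.
CUSTOMER_ID = re.compile(r"\bcustomer\s+#?(\d+)\b", re.IGNORECASE)
PRODUCT = re.compile(r"\bproduct\s+([A-Z]{2}-\d{3})\b")


def to_semi_structured(doc_id: str, text: str) -> ET.Element:
    """Extract selected fields from free text and wrap them in XML."""
    record = ET.Element("support_report", attrib={"id": doc_id})
    customer = CUSTOMER_ID.search(text)
    if customer:
        ET.SubElement(record, "customer_id").text = customer.group(1)
    for code in PRODUCT.findall(text):
        ET.SubElement(record, "product").text = code
    # Keep the original text so nothing is lost in the conversion.
    ET.SubElement(record, "body").text = text
    return record


# Invented sample input for illustration.
note = "Customer #4711 reports that product AB-123 stops charging after a week."
xml_record = ET.tostring(to_semi_structured("r-001", note), encoding="unicode")
print(xml_record)
```

The customer and product elements are what allow the record to be related to existing structured data; the body element simply carries the source text along.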
As with structured data, adapters are required to capture the unstructured data of interest. Adapters used in search and text analysis approaches can often be used to do this. For web content, screen scraping and clipping techniques offer other options.
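As a rough illustration of screen scraping, the sketch below uses Python's standard html.parser module to pull price values out of a web page. The page markup, the class name "price" and the route codes are invented for this example; a real adapter would need to match the structure of the actual site and respect its terms of use.

```python
from html.parser import HTMLParser


class PriceScraper(HTMLParser):
    """Collect the text of every element whose class attribute is 'price'."""

    def __init__(self):
        super().__init__()
        self._in_price = False
        self.prices = []

    def handle_starttag(self, tag, attrs):
        if dict(attrs).get("class") == "price":
            self._in_price = True

    def handle_endtag(self, tag):
        # Price elements in this sketch hold only text, so any end tag closes them.
        self._in_price = False

    def handle_data(self, data):
        if self._in_price and data.strip():
            self.prices.append(data.strip())


# Invented sample page; no network access is involved.
page = """
<html><body>
  <div class="fare"><span class="route">LHR-JFK</span> <span class="price">412.00</span></div>
  <div class="fare"><span class="route">LHR-SFO</span> <span class="price">538.50</span></div>
</body></html>
"""

scraper = PriceScraper()
scraper.feed(page)
print(scraper.prices)
```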
The main challenge during the transformation of unstructured content into a semi-structured or structured format is to create the metadata that enables the information to be defined in business terms and to be related to existing structured data. Search and text analysis techniques, such as content annotators, can assist in this process. IBM, for example, developed and donated to the open source community an Unstructured Information Management Architecture (UIMA) for doing this (see http://incubator.apache.org/uima/). To quote IBM:
“UIMA is an open, industrial-strength, scaleable and extensible platform for creating, integrating and deploying unstructured information management solutions from combinations of semantic analysis and search components.”
Several vendors have added UIMA annotators to their products for extracting metadata from unstructured content. Vendors such as Business Objects and Informatica are also developing or acquiring technologies to annotate unstructured content and integrate it into business intelligence and data warehousing environments.
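To make the idea of a content annotator more concrete, here is a minimal dictionary-based sketch in plain Python. It does not use the UIMA API; the Annotation class, the lookup table and the warehouse keys are all hypothetical. Each annotation records a business term, the character offsets of the mention, and a key that could be used to join the mention to a row of existing structured data.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Annotation:
    type: str   # business term, e.g. "Product"
    begin: int  # character offset where the mention starts
    end: int    # character offset just past the mention
    key: str    # key of the matching row in the structured data


# Hypothetical lookup: surface form -> (business type, warehouse key).
DICTIONARY = {
    "roadster x": ("Product", "P-100"),
    "acme travel": ("Customer", "C-042"),
}


def annotate(text: str) -> List[Annotation]:
    """Return one annotation per dictionary term occurrence, ordered by offset."""
    lowered = text.lower()  # assumes lowercasing preserves offsets (true for ASCII)
    found = []
    for term, (type_, key) in DICTIONARY.items():
        start = lowered.find(term)
        while start != -1:
            found.append(Annotation(type_, start, start + len(term), key))
            start = lowered.find(term, start + 1)
    return sorted(found, key=lambda a: a.begin)


# Invented sample input for illustration.
email = "Acme Travel says the Roadster X is overpriced."
annotations = annotate(email)
for a in annotations:
    print(a.type, a.key, email[a.begin:a.end])
```

Production annotators rely on far richer linguistic analysis than a lookup table, but the output has the same general shape: typed spans of text carrying metadata expressed in business terms.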
To date, little use has been made of unstructured content in corporate business intelligence systems. This is not only because the technologies for doing so are just coming to market, but also because there is a general lack of awareness and education about using this type of information for BI processing. Given the widely cited estimate that over 80% of information is in an unstructured form, companies that begin to exploit this information will gain a considerable competitive advantage over those that are slow to recognize the value of unstructured business content for optimizing business processes.