
Mining Text in a Retail Enterprise

Much has been written and marketed about the compelling need to analyze unstructured data. Learn what retailers are REALLY doing with text mining to drive business benefits.

This article originally appeared on the BeyeNETWORK.

Even as the marketing hype about unstructured data reaches fever pitch, including the me-too Google integrations by business intelligence (BI) vendors and attempts to converge search and BI interfaces, there has been a lack of compelling applications for analyzing unstructured data. In many ways, text mining has reached the point that data mining reached in the ’90s: it is a concept that sounds neat, new and difficult, but that no one really knows how to apply. In practice, however, a few forward-thinking retailers have carved out some early successes in specific areas. This article will discuss some of those successes and how unstructured data is providing real insight to retailers.

Generally speaking, business intelligence in retail is a quantitative pursuit, based on sales, inventory levels and many other key performance indicators (e.g., in-stock percent and inventory turns) that describe the performance of a retail business. Most business users understand the kinds of products and analytic approaches most commonly used in business intelligence, but teaching them an entirely new discipline is missionary work indeed. The trick in gaining insight from unstructured data is mapping it to known, quantitative methods and tools for analyzing the performance of a business, not offering a search box inside the BI interface.

Relevant Unstructured Data in Retail

Most marketing pitches concerning unstructured data begin with the obligatory, breathless commentary that “80% of all corporate data is unstructured,” and the reader is presumably left to conclude that they must have a problem if they don’t realize that. In practice, it is probably more reasonable to state that there is a fair amount of uncaptured value in (mostly) textual data that resides in some key systems of retailers, and there are now techniques to capture much of that value. The amount of data varies, but it is typically large enough that it cannot be analyzed manually, or even with everyone’s favorite BI tool (Excel).

For the average retailer, the value to be gained is the insight buried in the documented conversations between a company and its customers. These conversations take several forms, loosely broken into the following three areas:

Customer Satisfaction Surveys – Generally speaking, this data constitutes the results when a retailer approaches its customers to ask for specific feedback. Almost all retailers have some sort of customer survey process that can involve both structured questions and open-ended questions with written or spoken responses. The verbatims of those text-based responses are often loaded with valuable details because the customer basically uses them to talk about what they really think, structured questions aside. Often stored in survey systems, these verbatims are too voluminous to manually read, categorize and summarize, so typically an organization will rely on a sampling approach (if anything at all) to analyze them.

Contact Center Transcripts – Unlike a survey, in which a retailer seeks a particular conversation with a specific set of customers, the contact center captures the topics that customers cared enough about to initiate a conversation themselves. Contact center systems used by customer service representatives capture e-mail/web submissions, notes from phone conversations and chat transcripts, but virtually nothing is done with this text-based data. Contact center analysis most often focuses on the “wrong” components of customer satisfaction such as call wait times, hang-ups, resolution times and other traditional call center metrics.

Internet Sources – Left to their own devices, customers, media and the general public often say what they think in many ways that can be “mined” – namely through blogs, online product review sites, news articles and newsgroups. This data lives on the Internet and is accessible to all, yet captured and analyzed by few. The most interesting and challenging aspect of this data is that it is grassroots and organic; it did not come into being because a retailer sent out a survey or set up a contact center. It can be sparse or dense and is truly without structure of nearly any sort, but contains both useful and useless unvarnished opinion.


So, then, what can be done to realize some value from these sources? Basically, the first step is to make structured data out of unstructured data. Just as data from source systems such as POS, planning, financial, inventory and other ERP-related systems is extracted and “transformed” into consumable data for BI applications, so too should unstructured data be similarly polished to become usable.

To do so, the data must be harvested from its source, be it Excel spreadsheets, PDFs, customer service databases or the Internet. If possible, relevant identifying structured data such as product names, dates, and store locations should be gathered at the same time to provide relevance to the unstructured data and make it easier to tie back to the other information stored in the warehouse. Natural language processing can then convert the unstructured text into more formal representations that are easier for computer programs to manipulate, identifying the nouns, verbs, adjectives, special taxonomies (such as technical terms or product characteristics) and typical misspellings of words that may exist in the data.
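
As a rough illustration of this harvesting and normalization step, the Python sketch below keeps a comment's structured identifiers alongside its text, then splits the text into tokens and corrects known misspellings. The `Comment` fields, the `MISSPELLINGS` map and the sample comment are assumptions made for the example; part-of-speech tagging and taxonomy lookup, which a real natural language processing toolkit would provide, are left out.

```python
import re
from dataclasses import dataclass
from datetime import date

# Hypothetical misspelling map; a real one would be built from the retailer's own data.
MISSPELLINGS = {"battry": "battery", "recieve": "receive", "retun": "return"}


@dataclass
class Comment:
    product: str    # structured identifiers harvested together with the text
    store: str
    received: date
    text: str       # the unstructured verbatim


def normalize(text):
    """Lowercase the text, split it into word tokens and fix known misspellings."""
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    return [MISSPELLINGS.get(tok, tok) for tok in tokens]


# Illustrative record only; the values are invented for this sketch.
comment = Comment("Cordless Drill", "Store 12", date(2008, 5, 1),
                  "The battry died, so I had to retun it.")
tokens = normalize(comment.text)
```

Keeping the product, store and date on the same record as the text is what later allows the comment to be tied back to the warehouse.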

The next step involves entity and fact extraction – pulling customer or product names, dates or other identifying information out of the unstructured text in order to find cause and effect within the comments. This can be done by applying a set of rules selected by the business users or development team to determine the correct words to extract. Utilizing a sentiment engine to identify indicators of positive or negative feelings within the unstructured text helps ensure that comments are interpreted correctly rather than misread. A customer stating that they are not happy with a product is trying to convey an entirely different message than someone who states they are satisfied. Yet the “not” of “not happy” is often missed during manual analysis, skewing the results of the investigation to be more positive than they actually should be.
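
The “not happy” problem can be sketched in a few lines of Python. This is a deliberately naive, rule-based stand-in for a sentiment engine, not a description of any particular product: the word lists are hypothetical, and the only negation rule is that a negator immediately before a sentiment word flips its sign.

```python
import re

# Hypothetical rule sets; in practice business users or developers would choose these.
PRODUCT_TERMS = {"drill", "blender", "battery"}
POSITIVE = {"happy", "satisfied", "great", "excellent"}
NEGATIVE = {"broken", "frustrated", "poor", "unhappy"}
NEGATORS = {"not", "never", "no"}


def tokenize(text):
    """Split text into lowercase word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def extract_entities(text):
    """Return the product terms mentioned in the text, in order of appearance."""
    return [tok for tok in tokenize(text) if tok in PRODUCT_TERMS]


def sentiment_score(text):
    """Add +1 or -1 per sentiment word, flipping the sign right after a negator."""
    tokens = tokenize(text)
    score = 0
    for i, tok in enumerate(tokens):
        value = 1 if tok in POSITIVE else -1 if tok in NEGATIVE else 0
        if value and i > 0 and tokens[i - 1] in NEGATORS:
            value = -value
        score += value
    return score
```

With these rules, `sentiment_score("I am not happy with this drill")` comes out negative while the same sentence without “not” comes out positive, and `extract_entities` picks out “drill” in both cases.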

Finally, the text should be categorized or sorted into buckets of common types, such as product issues, repair issues or customer service issues, to make it easier to find new insights. Clustering common responses within each of these categories will allow business analysts to determine where to focus their efforts to improve their business processes, product or service offerings. Does a product get excellent reviews except for a short battery life that causes frustration and returns? That is a concrete problem for the retailer to encourage the manufacturer to solve in order to decrease returns. Without the ability to segment responses to that level, retailers might consider just dropping that product from their assortment altogether, without considering possible ways to remedy the issue and possibly achieve increased sales from the improved version. Is a particular customer segment vocal about comparison shopping and sharing information about the deals they can get from the retailer’s competitors? That information can drive new marketing or promotional efforts targeted to that customer group that could ultimately increase loyalty and sales across all customers as they spread the word about their improved satisfaction.
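
A minimal Python sketch of the bucketing idea follows. It assigns each comment to categories by keyword match and counts which keywords triggered each category, which is a simple stand-in for true clustering rather than a clustering algorithm. The category names echo the examples above; the keyword sets and sample comments are assumptions.

```python
import re
from collections import Counter, defaultdict

# Hypothetical keyword rules for each bucket.
CATEGORIES = {
    "product issue": {"battery", "broke", "defective"},
    "repair issue": {"repair", "warranty"},
    "customer service issue": {"rude", "wait", "refund"},
}


def summarize(comments):
    """Map each category to a Counter of the keywords that triggered it."""
    summary = defaultdict(Counter)
    for text in comments:
        words = set(re.findall(r"[a-z']+", text.lower()))
        matched = False
        for name, keywords in CATEGORIES.items():
            hits = words & keywords
            if hits:
                summary[name].update(hits)
                matched = True
        if not matched:
            summary["uncategorized"]["(no keyword)"] += 1
    return summary


# Invented sample comments for illustration.
sample = [
    "Great drill, but the battery dies fast.",
    "Battery life is short.",
    "Had to wait an hour for a refund.",
]
summary = summarize(sample)
```

Here the product issue bucket shows “battery” as its recurring theme, which is the kind of signal that points to a fixable problem rather than a product to drop.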

Resulting Business Value and Sources of Insight

Once the data has been transformed from unstructured to structured, it is possible to integrate the mined text with a business intelligence tool, allowing analysts to build reports and dashboards that will provide them the information they need when they need it so that they can quickly react to customer sentiment wherever it may be found. They can also create alerts so that new quality issues or other high-priority topics are identified and addressed proactively. For instance, an alert report could be set up so that on the tenth instance of a certain complaint, the topic is marked as high priority and corrective action begins immediately.
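
The tenth-instance alert described above could look something like the following Python sketch. The `ComplaintMonitor` class and its method names are hypothetical; a BI tool would normally provide its own alerting mechanism.

```python
from collections import Counter

ALERT_THRESHOLD = 10  # the "tenth instance" from the example above


class ComplaintMonitor:
    """Counts complaints per topic and flags a topic once it reaches the threshold."""

    def __init__(self, threshold=ALERT_THRESHOLD):
        self.threshold = threshold
        self.counts = Counter()
        self.high_priority = set()

    def record(self, topic):
        """Count one complaint; return True only when the topic first hits the threshold."""
        self.counts[topic] += 1
        if self.counts[topic] == self.threshold:
            self.high_priority.add(topic)
            return True
        return False
```

Returning `True` only on the threshold-crossing complaint means the alert fires once per topic rather than on every later occurrence.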

The data that is uncovered during this process can be loaded into a stand-alone data mart or incorporated into a larger enterprise data warehouse. As a stand-alone data mart, insights can be gathered that are specific to the source of the information. For example, statistical analysis can be performed that identifies how often customers complain about a particular product breaking or the number of calls to customer service required to satisfy the customer. Common answers to open-ended survey questions can be identified that previously went unnoticed by manual analysis and sampling of result sets.

Integrated into the enterprise warehouse, however, more detail can be uncovered as disparate pieces of information are united into a single set of reports. By mapping the associated structured data to elements that already exist in the enterprise warehouse (e.g., matching cited products to SKUs sold in the transaction log or tying a store/city combination back to an actual selling location), new insights can be revealed, such as how the number of products sold at a particular location relates to the number of, and reasons for, customer service calls and to the number of returns. This type of detailed analysis is only possible when all of this data is united in a single environment. Analysts then can drill deeper into the information to uncover hidden relationships within the data and bring them to the attention of the correct department within their organization.
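
To make the integration concrete, the Python sketch below uses an in-memory SQLite database as a stand-in for the enterprise warehouse and joins mined call reasons to sales and returns by SKU and store. The table names, columns and rows are all invented for illustration and do not reflect any particular warehouse schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # stand-in for the enterprise warehouse
conn.executescript("""
    CREATE TABLE sales (sku TEXT, store TEXT, units_sold INTEGER);
    CREATE TABLE returns (sku TEXT, store TEXT, units_returned INTEGER);
    CREATE TABLE mined_calls (sku TEXT, store TEXT, reason TEXT);
""")
# Hypothetical rows: structured warehouse data plus reasons mined from call text.
conn.executemany("INSERT INTO sales VALUES (?, ?, ?)",
                 [("SKU-1", "Store 12", 200), ("SKU-2", "Store 12", 150)])
conn.executemany("INSERT INTO returns VALUES (?, ?, ?)",
                 [("SKU-1", "Store 12", 30)])
conn.executemany("INSERT INTO mined_calls VALUES (?, ?, ?)",
                 [("SKU-1", "Store 12", "short battery life"),
                  ("SKU-1", "Store 12", "short battery life"),
                  ("SKU-2", "Store 12", "late delivery")])

# Relate units sold and returned to the number of calls per reason.
QUERY = """
    SELECT s.sku, s.store, s.units_sold,
           COALESCE(r.units_returned, 0) AS units_returned,
           c.reason, COUNT(*) AS calls
    FROM sales AS s
    JOIN mined_calls AS c ON c.sku = s.sku AND c.store = s.store
    LEFT JOIN returns AS r ON r.sku = s.sku AND r.store = s.store
    GROUP BY s.sku, s.store, s.units_sold, r.units_returned, c.reason
    ORDER BY calls DESC
"""
rows = conn.execute(QUERY).fetchall()
```

Each resulting row places a call reason next to the sales and return figures for the same SKU and store, which is only possible because the mined text carried those identifiers with it.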

Is It Worth the Effort?

Ultimately, unstructured data is only valuable to retailers if they actually make use of the information they collect. Historically, retailers have collected the data but not done much with it. Text mining in a business intelligence environment gives them the opportunity to capture the value intrinsic to that data. By transforming the data into a format that can be easily accessed and analyzed, retailers can obtain benefits across their entire organization as the new knowledge drives them to improve business processes and customer service offerings, tailor their product assortments to what their customers really want to purchase, decrease out-of-stocks during promotions and marketing campaigns, and reduce returns caused by poor product quality.

  • Sara Charen 
    Sara is the Manager of Industry Solutions at Claraview, a division of Teradata, a strategy and technology consultancy that helps leading companies and government agencies use business intelligence to achieve competitive advantage and operational excellence. She is responsible for the development of implementation accelerators for retail analytics, including an enterprise Retail Data Warehouse, as well as focused offerings for Sales and Inventory, Store at a Glance, Assortment Planning and Market Basket Analysis. She may be contacted at [email protected].
  • Dan Ross
    Dan is the Managing Partner of the Retail Practice at Claraview, a strategy and technology consultancy that helps leading companies and government agencies use business intelligence to achieve competitive advantage and operational excellence. Claraview clients realize measurable results: faster time to decision, improved information quality and greater strategic insight. Dan is a frequent contributor to business intelligence literature, writing on topics spanning technical approaches and business impact, and the Claraview Retail Practice serves some of the world's most advanced users of retail data warehouses.

