Taxonomies and search: Master categories?

Does "search" change everything? Well, maybe some business intelligence.

This article originally appeared on the BeyeNETWORK.

It is always interesting when the results of two different conversations lead to a similar conclusion. Over the past few weeks, we have participated in customer-facing interactions essentially exploring the notion of meta-tagging unstructured documents for the purpose of searching. One instance was relatively straightforward: a company is assembling a master repository of its product data, and needs to create a hierarchy into which all the products are assigned. In this environment, there are three types of consumers of the master product data – internal applications, external individual customers, and external corporate customers.

The goal is to enable services supporting parametric searches, either by individuals via a website or through services designed specifically to respond to requests for quotes. With this in mind, there is a desire to evaluate product descriptions and characteristics, seek out semantic similarities, verify that the user-defined hierarchies and taxonomy are complete, and perhaps even automate the assignment of a product into the hierarchies. One of the expected benefits is that as terms and concepts are reviewed, synonyms can be documented within the taxonomy, allowing the parametric searches to return broader yet still refined results. This is done by expanding any search to include matches with synonyms. For example, a search for the term “automobile” may be expanded to include “coupe” and “sedan,” potentially finding hits that the original search term alone would have missed.
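As a rough illustration of synonym expansion, the following minimal Python sketch expands a search term using a taxonomy before matching it against product descriptions. The taxonomy fragment, the product records, and the function names are all hypothetical, and matching on whole words is a simplifying assumption rather than a description of any particular search service.

```python
# Hypothetical taxonomy fragment: each term maps to the related terms
# documented for it during the taxonomy review.
TAXONOMY_SYNONYMS = {
    "automobile": {"coupe", "sedan"},
}

# Hypothetical product descriptions keyed by product id.
PRODUCTS = {
    "P-100": "Two-door coupe with leather interior",
    "P-200": "Four-door sedan, hybrid engine",
    "P-300": "Mountain bicycle with alloy frame",
}


def expand_terms(term, synonyms):
    """Return the search term plus every term the taxonomy relates to it."""
    key = term.lower()
    return {key} | synonyms.get(key, set())


def search(term, products, synonyms):
    """Return ids of products whose description contains any expanded term."""
    terms = expand_terms(term, synonyms)
    hits = []
    for product_id, description in products.items():
        # Simplifying assumption: split on whitespace after dropping commas.
        words = set(description.lower().replace(",", " ").split())
        if words & terms:
            hits.append(product_id)
    return sorted(hits)


# With the taxonomy, "automobile" reaches the coupe and the sedan.
with_synonyms = search("automobile", PRODUCTS, TAXONOMY_SYNONYMS)
# Without it, the literal term matches nothing in these descriptions.
without_synonyms = search("automobile", PRODUCTS, {})
```

In this toy data set the expanded search returns the coupe and sedan records, while the unexpanded search returns nothing, which is the effect the taxonomy review is meant to produce.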

The other instance is a bit more abstract, and perhaps even further removed from master data management and business intelligence. This customer is interested in extracting knowledge from accumulated data assets, ranging from highly structured data sets to largely unstructured documents, to create a report-generation capability reflecting accumulated intelligence around specific types of financial support. Here, the taxonomies are much more dynamic, evolving almost in lockstep with the news. The objective is slightly different as well. The taxonomy is needed to support a knowledge-gathering process from across the World Wide Web, and while the report produced by a specific query is of value, the information product to be created is the collection of queries used in gathering that knowledge.

Looking at the exposed relationships between entities and concepts identified within unstructured documents can narrow the focus when searching. For example, were you to google “David Loshin,” you’d find some number of page hits returned. Some of those will refer to articles I’ve written or web seminars that I have participated in, but there will also be a number of hits for the Seattle dentist Dr. David Loshin as well as Dr. David Loshin, dean of the NSU College of Optometry in southern Florida. But assessing the degree of correspondence between the search term “David Loshin” and other terms identified among the many hits may reveal a distinct correlation with the search phrases “data quality,” “master data management,” and “data governance.” Combining those phrases with the original search term will filter out the dentist and the dean.

On the other hand, enough correlation with the phrase “dentist” will provide another search query, and “optics” yet another. Successive refinement will eventually disambiguate the individuals sharing the same name, perhaps without ever looking at the query results at all. And to support the intelligence process, there is a difference between the query set, which is a dynamic representation of the methods by which information is gathered, and the results of those queries, which are a static collection created at a specific point in time from a specific universe of inputs.
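The refinement described above can be sketched in a few lines of Python. The sketch assumes each hit has already been reduced to a set of concept tags; the sample hits, the tag names, and both function names are hypothetical, and simple counting stands in for whatever correlation measure a real system would use.

```python
from collections import Counter

# Hypothetical hits for one ambiguous name, each reduced to the set of
# concept tags found alongside that name on the page.
HITS = [
    {"data quality", "master data management"},
    {"data quality", "data governance"},
    {"data quality", "master data management", "data governance"},
    {"dentist", "seattle"},
    {"dentist", "seattle"},
    {"optics", "florida"},
]


def cooccurrence_counts(hits):
    """Count how many hits mention each tag alongside the search term."""
    counts = Counter()
    for tags in hits:
        counts.update(tags)
    return counts


def refine(hits, required_tags):
    """Keep only the hits that carry every one of the required tags."""
    return [tags for tags in hits if required_tags <= tags]


counts = cooccurrence_counts(HITS)

# Each strongly co-occurring tag suggests a refined query of its own.
author_hits = refine(HITS, {"data quality"})
dentist_hits = refine(HITS, {"dentist"})
dean_hits = refine(HITS, {"optics"})
```

Each refined query isolates a different person in this sample, and no hit satisfies tags drawn from two different clusters, which is what allows the name to be disambiguated from the tags alone.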

So what is this query set? Each query represents the relationship between two or more concepts, whether they are individuals, businesses, locations, concepts – it almost doesn’t matter. These concepts are manifested as the meta-tags, and their correspondence may ultimately resemble a set of hierarchies or a taxonomy.
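One way to picture the distinction between a query set and its results is the sketch below, which is only an illustration under stated assumptions. The `Query` class, the `run_query_set` function, and the sample documents are hypothetical names invented here; a query is modeled as a set of two or more meta-tags, and running the set produces a snapshot stamped with the time it was taken.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class Query:
    """A query relates two or more concepts, expressed as meta-tags."""
    concepts: frozenset

    def __post_init__(self):
        if len(self.concepts) < 2:
            raise ValueError("a query relates two or more concepts")

    def matches(self, document_tags):
        """True when the document carries every concept in the query."""
        return self.concepts <= document_tags


def run_query_set(query_set, documents, as_of=None):
    """Evaluate each query and return a point-in-time snapshot of results."""
    as_of = as_of or datetime.now(timezone.utc)
    results = {
        query: sorted(
            doc_id for doc_id, tags in documents.items() if query.matches(tags)
        )
        for query in query_set
    }
    return {"as_of": as_of, "results": results}


# Hypothetical tagged documents: the "universe of inputs" for this run.
DOCUMENTS = {
    "doc-1": {"David Loshin", "data quality", "master data management"},
    "doc-2": {"David Loshin", "dentist"},
}

AUTHOR_QUERY = Query(frozenset({"David Loshin", "data quality"}))
DENTIST_QUERY = Query(frozenset({"David Loshin", "dentist"}))
QUERY_SET = [AUTHOR_QUERY, DENTIST_QUERY]

snapshot = run_query_set(QUERY_SET, DOCUMENTS)
```

The query set can be rerun as the documents change, while each snapshot stays fixed to the inputs and time of the run that produced it.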

Even though these two instances have widely different intentions, recall that I started by saying that these two conversations led to a similar conclusion. But the conclusion has less to do with the specific taxonomy process and more with the expectations of the results when considered in the context of business intelligence and its intersection with search. The concept of the report is changing from a structured presentation of the results of a structured query against a structured data set into a more fluid collection of “results” drawn from the intersection of semantic concepts. If this is true, then as the dependence on the dimensional structure of a centralized data warehouse erodes, the reliance on the dimensions of the taxonomies grows. Consequently, the “master data” of this emerging paradigm must incorporate those business concepts organized in the right types of hierarchies to support the business objectives.

David is the President of Knowledge Integrity, Inc., a consulting and development company focusing on customized information management solutions including information quality solutions consulting, information quality training and business rules solutions. Loshin is the author of Master Data Management, Enterprise Knowledge Management: The Data Quality Approach and Business Intelligence: The Savvy Manager's Guide and is a frequent speaker on maximizing the value of information. David can be reached at or at (301) 754-6350.

Editor's note: More David Loshin articles, resources, news and events are available in the David Loshin Expert Channel on the BeyeNETWORK. Be sure to visit today!
