This article originally appeared on the BeyeNETWORK
One interesting feature of the Google search experience is how the Web site reacts to perceived misspellings in search terms. When a character string appears to contain a mistake, the site not only returns the hits that exactly match the provided terms, but also suggests an alternate spelling along with the question “Did you mean…?” For those of us who are less skilled as typists, this suggestion is a welcome one, as it adjusts the precision of the search terms and helps in attaining better results.
The ability not just to determine that a search request probably contains a misspelling, but also to suggest a correctly spelled version, implies that the underlying system can distinguish between valid and invalid spellings and can predict, with some degree of statistical certainty, the word that the end client probably intended. At a more conceptual level, this also suggests that the underlying system can differentiate between character strings with meaning and those that are perceived to be meaningless, which may be the reason that certain kinds of strings (such as uncommon names or certain kinds of product codes) may trigger the suggested search alternative.
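As a rough illustration of how a system might separate valid from invalid spellings and guess the intended word, here is a minimal sketch based on edit distance against a list of known terms. Nothing here describes how Google actually does it: the term list, its frequencies, the `suggest` function and the `max_distance` cutoff are all assumptions made for the example.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance between two strings, using a rolling row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute (or match)
        prev = curr
    return prev[-1]


# Hypothetical known terms with made-up usage counts.
KNOWN_TERMS = {"search": 900, "engine": 700, "semantic": 300, "starch": 40}


def suggest(term, known=KNOWN_TERMS, max_distance=2):
    """Return a 'Did you mean' candidate, or None if no suggestion applies."""
    term = term.lower()
    if term in known:
        return None  # the string is already considered valid
    candidates = [(edit_distance(term, word), -count, word)
                  for word, count in known.items()]
    candidates = [c for c in candidates if c[0] <= max_distance]
    # Prefer the closest term; break ties with the more frequent one.
    return min(candidates)[2] if candidates else None
```

With this term list, `suggest("serach")` proposes `"search"`, while a string that is far from every known term gets no suggestion at all. That second case is exactly what happens to an uncommon name when it happens to sit close to a common word.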
The introduction of the notion of “meaning” to character strings can be viewed at two different levels. The simpler level is one of assignment, and assigns a Boolean value to each string: it is either meaningful or it isn’t. Strings that are not meaningful are probably errors, and their appearance would trigger a suggestion to improve the search. The more complex level seeks to establish not only that the search term has meaning but also what that meaning is; going beyond the Boolean indication would provide a much greater ability to home in on what the searcher was truly seeking. But there seems to be a fine line between those different levels.
Consider that the kinds of terms employed in Web searching are not limited to single character strings, but in fact can be classified into a complexity hierarchy based on context. The simplest kind of search involves a single character string as the search term. A more complex class incorporates an unordered set of search terms, intended to find documents that contain all of the terms in any order. The next class includes ordered sets of strings, in which the searcher is looking for documents with all the terms in an exact order. We might consider one more class, sets of ordered sets of terms, in which the desired documents contain every one of the ordered sets.
To support these kinds of searches, the documents have to be indexed in multiple ways. Clearly, every word that appears in a document has to be linked to that document; this addresses the simplest classes of searches by providing a reverse (inverted) index from each string to the documents that contain it. Given a list of character strings, the results can be collected by finding the documents linked to every one of the search strings. Ordered searches are more difficult: either each collection of ordered strings has to be used as part of the indexing algorithm, or more complex proximity analyses have to be incorporated into the process that integrates documents into the index. Either way, all of the search engines have some process for this, and all are relatively good at it.
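To make the indexing idea concrete, here is a small sketch of a reverse index that also records word positions, so that the same structure can answer both unordered and ordered searches. It is a toy under simplifying assumptions (whitespace tokenization, exact adjacency for ordered terms), and the function names are invented for the example rather than taken from any real search engine.

```python
from collections import defaultdict


def build_index(docs):
    """Map each term to {doc_id: [positions where the term occurs]}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for pos, term in enumerate(text.lower().split()):
            index[term].setdefault(doc_id, []).append(pos)
    return index


def search_all(index, terms):
    """Unordered search: documents that contain every term."""
    postings = [set(index.get(t.lower(), {})) for t in terms]
    return set.intersection(*postings) if postings else set()


def search_phrase(index, terms):
    """Ordered search: documents where the terms appear consecutively."""
    terms = [t.lower() for t in terms]
    hits = set()
    for doc_id in search_all(index, terms):
        starts = index[terms[0]][doc_id]
        # The phrase matches if term k sits k positions after some start.
        if any(all(s + k in index[t][doc_id] for k, t in enumerate(terms))
               for s in starts):
            hits.add(doc_id)
    return hits
```

Given the documents `{1: "data quality improves enterprise search", 2: "enterprise data needs quality rules"}`, an unordered search for `data` and `quality` finds both, while the ordered search finds only the first. Relaxing "consecutively" to "within a few positions" would be one simple form of the proximity analysis mentioned above.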
With regard to assigning a meaning to a term, semantic analyses of unstructured text have been developed for entity extraction: the ability to identify a pattern or sequence of character strings that has already been organized into a known set of taxonomies. These techniques can identify names, addresses, locations, dates, titles, and many other kinds of classified terms. As part of a semantic analysis, a text mining tool can identify a named entity, insert a tag to be associated with it, and, as a function of context, entity proximity, and predefined knowledge hierarchies, classify the entity into its most likely meaning, such as determining whether the string “bush” refers to a plant or to the U.S. president. So there are processes for capturing some level of meaning by context, and even if that meaning is not understood by a computer, there is still an enhanced ability to present meaningful results to a researcher.
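The “bush” example can be sketched in a few lines. The version below stands in for a real text mining tool with something far cruder: it looks at the words near the target string and counts how many appear in a hand-built list of cue words for each sense. The cue lists, the window size and the `classify_sense` name are all assumptions for illustration.

```python
# Hypothetical cue words for each sense; a real tool would rely on
# curated taxonomies and knowledge hierarchies instead.
SENSE_CUES = {
    "plant": {"garden", "leaves", "shrub", "prune", "berries"},
    "person": {"president", "administration", "election", "senator"},
}


def classify_sense(text, target="bush", cues=SENSE_CUES, window=5):
    """Pick the sense whose cue words overlap most with nearby words."""
    tokens = [t.strip(".,;:!?\"'").lower() for t in text.split()]
    if target not in tokens:
        return None
    i = tokens.index(target)  # only the first occurrence is considered
    context = set(tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window])
    scores = {sense: len(context & words) for sense, words in cues.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None  # None: not enough context
```

A sentence about pruning something in the garden comes back as `"plant"`, one that mentions the president and the administration comes back as `"person"`, and a sentence with no cue words at all is left unclassified rather than guessed at.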
However, the question of delineating and assigning meaning to terms makes me wonder whether the capabilities of Internet search engines can be enhanced through a couple of ideas. Consider this: for the most part, every search transaction initiated by a person sitting in front of a Web browser has some meaning, particularly to the person doing the searching. A statistical analysis of search strings should provide a prioritization of which words carry meaning, while applying the text mining capability to the search strings could suggest the assignment of meaning to the terms. The bigger the set of search terms, the more potential meaning it carries. Note also that search phrases are not necessarily completely unstructured, but may often conform to a more rigorous semi-structure, which makes them easier to analyze.
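One very simple reading of “a statistical analysis of search strings” is to count how often each term, and each adjacent pair of terms, shows up across a log of queries. The sketch below does only that; the query log format and the `term_statistics` name are assumptions, and a production analysis would certainly be more elaborate.

```python
from collections import Counter


def term_statistics(queries):
    """Count the queries containing each term, and adjacent term pairs."""
    terms, pairs = Counter(), Counter()
    for query in queries:
        tokens = query.lower().split()
        terms.update(set(tokens))              # each term once per query
        pairs.update(zip(tokens, tokens[1:]))  # recurring two-word phrases
    return terms, pairs
```

Terms that recur across many independent searches are reasonable candidates for “meaningful” strings, and recurring pairs hint at the semi-structure of search phrases. `terms.most_common(10)` would give a first-cut prioritization.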
Here is another train of thought: when the search engine thinks a string is misspelled, it recommends a similar term or set of terms as a potential correction. At that point, the searcher may take one of two actions: take the engine up on its suggestion and redo the search using the corrected string, or ignore the proposed correction. In both cases the search engine should be able to gain some knowledge. The first case establishes that the original search phrase was indeed incorrect, and the engine can add the misspelled version to the set of strings that could be corrected in future searches. The second case implies that the proposed correction was not really a correction, and that the perceived misspelled term might actually have meaning!
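The two cases can be captured with a pair of counters. The sketch below is one hypothetical way to record what searchers do with a suggestion and to read the two kinds of knowledge back out; the class name, the `min_votes` threshold and the simple majority rule are assumptions, not a description of any actual engine.

```python
from collections import Counter


class SuggestionFeedback:
    """Track whether searchers accept or ignore proposed corrections."""

    def __init__(self, min_votes=3):
        self.min_votes = min_votes
        self.accepted = Counter()  # (typed, suggested) -> times accepted
        self.ignored = Counter()   # typed -> times the suggestion was ignored

    def record(self, typed, suggested, accepted):
        if accepted:
            self.accepted[(typed, suggested)] += 1
        else:
            self.ignored[typed] += 1

    def confirmed_misspellings(self):
        """Case one: strings that searchers agreed were wrong."""
        return {typed: suggested
                for (typed, suggested), n in self.accepted.items()
                if n >= self.min_votes and n > self.ignored[typed]}

    def likely_meaningful(self):
        """Case two: strings searchers kept, so they may carry meaning."""
        accepted_by_term = Counter()
        for (typed, _), n in self.accepted.items():
            accepted_by_term[typed] += n
        return {typed for typed, n in self.ignored.items()
                if n >= self.min_votes and n > accepted_by_term[typed]}
```

Requiring a minimum number of votes keeps a single hurried searcher from teaching the engine a wrong correction, or from promoting a genuine typo to a meaningful term.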
The next step is to look for opportunities to use the indexing framework to find instances of the newly meaningful term, and reapply the text mining analysis approach to assess whether there is enough available context to assign some semantic meaning to that term. Of course, the more meaning that can be assigned to search phrases, the more effective the engine can be, not just in finding documents that contain the sought-after phrase, but also in linking together documents that share similar concepts. In turn, this should enhance the engine’s ability to deliver semantically equivalent documents, whether they are indexed collections of words that can be reached via the Web, or are crafted “documents” (read: advertisements) specifically designed to be delivered to the reader. Is this where enterprise search is moving? There are some indications that this is one direction, and in a future article, we will explore how some other algorithmic techniques can be applied in content linkage to improve search capabilities.
David is the President of Knowledge Integrity, Inc., a consulting and development company focusing on customized information management solutions including information quality solutions consulting, information quality training and business rules solutions. Loshin is the author of Master Data Management, Enterprise Knowledge Management – The Data Quality Approach and Business Intelligence – The Savvy Manager's Guide and is a frequent speaker on maximizing the value of information. David can be reached at firstname.lastname@example.org or at (301) 754-6350.