The following is an excerpt from Tapping into unstructured data: Integrating unstructured data and structural analytics into business intelligence, by Bill Inmon and Anthony Nesavich. It is reprinted here with permission from Prentice Hall; Copyright 2008. Read the excerpt below, or download a free .pdf of this chapter: "Integrating unstructured text into a structured environment."
The world of computing has grown from a small, unsophisticated world in the early 1960s to a world today of massive size and sophistication. Nearly every person worldwide—in one way or the other—is affected by or directly uses computation on a daily basis. Nothing less than national productivity from the 1960s to the present has been profoundly and positively affected by the widespread growth of the use of the computer.
The growth of computing can be measured in two ways: growth in structured systems and growth in unstructured systems.
Possibilities of unstructured systems
Structured systems are those for which the activity on the computer is predetermined and structured. Structured systems are designed by, built by, and operated by the IT department. ATM transactions, airline reservations, manufacturing inventory control systems, and point-of-sale systems are all forms of structured systems.
Structured systems are tied closely with the day-to-day operational activities of the corporation. Because of this affinity, structured systems grew quickly. Cost justification and return on investment for structured systems came easily because of the close tie-in with the day-to-day business of the corporation. The growth of the structured environment was fueled by the desire of the business world to be competitive and streamlined.
Unstructured systems are those that have no predetermined form or structure and are full of textual data. Typical unstructured systems include emails, reports, contracts, transcripted telephone conversations, and other communications. When a person does an activity in an unstructured environment, he is free to do or say whatever he wants. The person doing the communication can structure the message in whatever form is desired, using any language. In an unstructured environment, the communication can range from a proposal of marriage to a notification of a layoff to the announcement of the birth of a baby, and everything in between. There simply are no rules for the content of unstructured systems.
The growth of the unstructured environment has been fostered by the need for communication, informal analysis (such as that found on a spreadsheet), and personal analysis (of finances, personal goals, and personal plans). The two environments grew out of different sets of motivations. Figure 4-1 shows the two environments.
Figure 4-1 The two basic forms of data
From the beginning, the worlds of structured systems and unstructured systems have grown separately and yet, at the same time, in parallel with each other. It is no surprise that today each environment is separate from the other in many ways.
In truth, there is little overlap or connection between the two worlds.
Imagine what the world would look like if, indeed, there were overlap (or intersection) between the two environments. Imagine the possibilities if the two worlds could connect in an effective and meaningful way: the new types of systems that could be built, the new opportunities for the usage of computation, and the enhancements to existing systems in ways that are not possible using today's technology. When one accepts the limitations of today's technology and today's environment, there are only so many things that can be done. Imagine what would happen if those limitations suddenly disappeared.
If a bridge is to be built between the two environments, it makes sense to bring the unstructured text to the structured environment. In doing so, the decision support analyst can take advantage of the analytical processing capabilities that exist in the structured environment.
In most organizations, an analytical infrastructure already exists in the structured environment. It consists of things such as a database management system (DBMS), business intelligence (BI) software, hardware, and storage, and organizations have already invested millions in it. The existing analytical infrastructure serves only structured systems, however: data has to be put into a structure and format that is particular and disciplined. Even so, it is less expensive to bring the unstructured data to the existing analytical infrastructure than it is to reconstruct that infrastructure in the unstructured environment. By doing so, the organization can leverage the training and the investment that have already been made.
When the gap between unstructured data and structured data is bridged, an entirely new world of possibilities and opportunity for information systems opens up. Figure 4-2 shows that a bridge between the structured and unstructured environments has many benefits.
The possibilities for new systems blossom when the gap between unstructured data and structured data is crossed. There are enormous and new opportunities that arise when the two types of data are merged.
Figure 4-2 Forming the bridge between the structured and unstructured world
Integrating unstructured textual data
In second generation textual analytics, the key to crossing the bridge between the two worlds is the integration of unstructured text before it is sent to the structured environment. Raw unstructured text cannot simply be placed into the structured world and still be meaningful and useful. Stated differently, unstructured text placed directly into a structured environment creates a mess. There is too much data: terms that have different meanings but are recorded under a single name, alternate spellings, extraneous words, and documents that have no bearing on the business. All these limitations of unstructured text become manifest when unstructured data is moved wholesale into the structured environment.
To be effective, unstructured text must be integrated before it can be moved into the structured environment. By integrating unstructured text, the bridge between structured and unstructured data is created, and the stage is set for textual analytics.
Reading the unstructured textual data
The first step in the integration of unstructured text is the physical reading of the text. To be integrated, raw text must first be read or "ingested."
In some cases, the text first appears in a paper format. In this case, the text on the paper must be read -- scanned -- and converted to an electronic format. This process is typically done with optical character recognition (OCR). There are quite a few challenges to this process of lifting text from a paper foundation:
- Sometimes the paper is old and brittle and is destroyed by the process of trying to read it. In this case, the analyst must not count on reading the paper more than once.
- Sometimes the print font on the paper is not easily recognizable by the scanner. In this case, there are a lot of manual corrections.
- Sometimes the scan process reads and interprets the words incorrectly.
As a rule, converting from paper to an electronic format involves a manual review and correction pass after the electronic scan is done, if for no other purpose than to make sure the scan was successful. In many cases, manual corrections must be made because the scanning and conversion process has made an error or has made assumptions about what was read that are not true.
However it is done, the text needs to be lifted from the paper media and converted into an electronic format.
Then there is the case of voice recordings. Like data found on paper, voice data needs to be lifted from the medium on which it was stored and converted into an electronic format that is intelligible to a program that reads and analyzes text. Voice recordings can be converted to text by means of voice character recognition (VCR). The issues of quality and reliability for VCR are similar to those for OCR.
Choosing a file type
When the text is in an electronic format, the format and structure of the text needs to be taken into account. Some of the typical formats for the reading of electronic text follow:
- .txt compatible
Often the vendor supplies software to read these file types. However, the vendor often does not guarantee a 100% successful reading. For this reason, third-party vendors supply software and software interfaces that are more efficient and more reliable than the vendor-supplied solutions. You do have to pay for third-party solutions, but the third-party vendor also takes on the responsibility of keeping up with new releases of the base software.
Altering the original source
One of the issues faced by the systems programmer is whether to allow the original source text to be altered. In some cases, the software reading a source file wants to add data to or otherwise alter the source text. In other cases, the source text is never altered. It is read, but not altered. By far, the safest policy is never to alter the source text, even at the expense of having redundant copies of data lying around.
Reading unstructured data from voice recordings
In some cases, the text resides not on paper but on tapes. Typical of this usage are telephone conversations that are taped and then transcribed. The tapes must be converted into an electronic format, much like scanning paper, except that the source is audio rather than printed text. Typical software in this case includes VCR. VCR technology has many liabilities: it can be fooled by accents, by people talking too softly, and by other issues. As a rule, a transcription done with 95% accuracy is considered good.
It is an interesting point that humans do not hear and understand 100% of the words that are spoken. Our brains "fill in the blanks" frequently. So it is not unreasonable that VCR does not do a 100% job of accurate transcription.
However it is accomplished, the original source text must be read and entered into the component that will begin the process of textual integration.
After the source text has been read, the next step is to actually integrate the text.
The purpose of textual integration is to prepare the data for textual analytics. It is true that raw text can be subjected to textual analytics. However, the reading, integration, and preconditioning of the raw source text sets the stage for effective textual analytics. Stated differently, textual analytics can be done on raw textual data, but not effectively. The data itself defeats much of the purpose of textual analytics. To be effective, textual analytics must operate on textual data that has been integrated and preconditioned.
The importance of integration
It is not always obvious why raw text needs to be integrated and preconditioned before it is useful and most effective for textual analytics. The following cases make the point of why integration of text is a necessary precursor to effective textual analytics.
A simple search is to be conducted on the name "Osama Bin Laden." Operating on unintegrated data, the search fails to find references when the name "Usama Bin Laden" appears or the name "Osama Ben Laden" appears. If textual integration had been done properly, the search for "Osama Bin Laden" would have turned up all occurrences of all spellings of his name.
Indirect search of alternate terms
Suppose an analyst wants to find all places where there is a mention of a broken bone. If the analyst searches for "broken bone," the analyst finds all the places where there are permutations of the term. However, if data is integrated first, an indirect search for "broken bone" turns up the many terms that also mean "broken bone." Operating on integrated data, an indirect search on broken bone finds "fractured radius," "lacerated tibia," "oblique fractured ulna," and so forth.
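One way to picture this kind of indirect search is as a lookup that expands a general term into the specific terms filed under it before any matching is done. The sketch below is only an illustration under that assumption: the mapping `ALTERNATE_TERMS` and the helper `indirect_search` are hypothetical names, and the terms are the ones used in the example above.

```python
# Hypothetical mapping built during integration: a general term and the
# specific terms that should also be returned when it is searched for.
ALTERNATE_TERMS = {
    "broken bone": ["fractured radius", "lacerated tibia", "oblique fractured ulna"],
}

def indirect_search(query, documents):
    """Return documents that mention the query or any of its alternate terms."""
    query = query.lower()
    terms = [query] + ALTERNATE_TERMS.get(query, [])
    return [doc for doc in documents if any(t in doc.lower() for t in terms)]

notes = [
    "Patient presented with a fractured radius after a fall.",
    "Follow-up visit for seasonal allergies.",
    "X-ray confirms oblique fractured ulna.",
]
hits = indirect_search("broken bone", notes)
```

None of the notes contains the literal phrase "broken bone," yet the first and third are returned because their terms are filed under it.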
Indirect search of related terms
In addition to looking for alternate terms, related terms can also be accessed by the textual analyst. Consider the term "Sarbanes Oxley." If a direct simple search is made on the term "Sarbanes Oxley," the search will turn up the many places where that term is found. Consider what happens when raw textual data is integrated before the search is done. An indirect search can discover the many terms that are related to Sarbanes Oxley. For example, when the raw text is integrated and a search is done on related terms, an indirect search on "Sarbanes Oxley" finds items such as the following:
- Contingency sale
- Revenue recognition
- Promise to deliver
Permutations of words
Another interesting aspect of integrating text is the recognition of the roots of words. When raw unintegrated text is searched for the phrase "moving the needle," if that phrase is used anywhere, the search finds it. When raw text is integrated, permutations of the base word are recognized as well. For example, when a search is made for "moving the needle" on integrated text where the stems of words have been recognized, the results find the following:
- Moves the needle
- Moved the needle
- Move the needle
From these simple examples of analysis of text against raw textual data and integrated textual data, it becomes obvious that if you are going to do effective textual analytics, the data that will be operated on must first be integrated.
The issues of textual integration
The kinds of issues that must be addressed in the integration of unstructured text into the structured environment include the following:
Determining if the unstructured document has any relevance to the business -- If the unstructured document is not relevant to the business conducted, the unstructured document does not belong in the structured environment as a candidate for textual analytics. Figure 4-3 shows that raw unstructured data is fed to the integration component. The integration component then screens the data based on business relevance. For example, an email that said "I love you, darling" would not be deemed to have business relevance and would not be placed in the textual analytical database.
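How relevance is judged is not spelled out here. As a rough sketch, assume the screen is a simple check for overlap with a business vocabulary; the term list `BUSINESS_TERMS` and the function `is_business_relevant` are hypothetical, and a real screen would likely be far richer.

```python
import re

# Hypothetical business vocabulary; a real one would be much larger.
BUSINESS_TERMS = {"contract", "invoice", "shipment", "payment", "order"}

def is_business_relevant(text, min_hits=1):
    """Keep a document only if it mentions enough business terms."""
    words = set(re.findall(r"[a-z]+", text.lower()))
    return len(words & BUSINESS_TERMS) >= min_hits

emails = ["I love you, darling", "Please confirm the payment for order 1182"]
kept = [e for e in emails if is_business_relevant(e)]
```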
Removing stop words from the unstructured environment which are extraneous to the meaning of the text -- Typical stop words are "a," "and," "the," "is," "was," and "which." Stop words are used to lubricate language, but add little or nothing to the subjects that are discussed. Figure 4-4 shows that stop words need to be filtered out so that they don't get in the way of arriving at the heart of the matter when it comes to analyzing unstructured text.
Figure 4-3 Relevant business data needs to be screened from irrelevant business data.
Figure 4-4 Stop words are removed.
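Stop word removal can be sketched in a few lines. The stop list below contains only the example words named in the text; the function name `remove_stop_words` and the whitespace tokenization are assumptions made for illustration.

```python
# The stop words given as examples in the text; real lists are longer.
STOP_WORDS = {"a", "and", "the", "is", "was", "which"}

def remove_stop_words(text):
    """Lowercase the text, split on whitespace, and drop stop words."""
    return [word for word in text.lower().split() if word not in STOP_WORDS]

remaining = remove_stop_words("The contract was signed and the payment is due")
```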
Reducing words to their Greek or Latin stems -- By reducing the words found in unstructured text to a common stem, the commonality of words can be recognized when the words are literally not the same. Figure 4-5 shows that several ordinary words have a common Latin stem. The figure shows that the words "Moving," "Move," "Moved," and "Mover," all have a common stem—"Mov."
Figure 4-5 Words are reduced to a common stem.
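The idea of a common stem can be illustrated with a deliberately crude suffix stripper. This is not the stemming method of any particular product, and production stemmers use far more careful rules; the suffix list and the name `crude_stem` are assumptions chosen so that the example words from the text collapse to "mov."

```python
# A deliberately small suffix list, checked in order.
SUFFIXES = ("ing", "ed", "er", "es", "e", "s")

def crude_stem(word):
    """Strip the first matching suffix, keeping at least three letters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

stems = {crude_stem(w) for w in ["Moving", "Move", "Moved", "Mover"]}
```

Because all four words reduce to the same stem, a search on any one of them can be matched against the others.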
Resolving synonyms -- Where there are synonyms, the reduction of the synonym to a common foundation allows for the possibility of a common vocabulary. It is only through the establishment of a common vocabulary that meaningful searches can be done. There are two basic ways for synonyms to be resolved. One of those ways is through synonym replacement. With synonym replacement, when a synonym is recognized, it is replaced by the more common (or more general) form of the word, as shown in Figure 4-6.
Figure 4-6 Synonym replacement is another activity that can be done to precondition unstructured data.
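Synonym replacement amounts to a dictionary lookup applied to each word. In the sketch below, the synonym table and the function name `replace_synonyms` are hypothetical and stand in for whatever vocabulary an organization maintains.

```python
# Hypothetical synonym table: specific word -> more common form.
SYNONYMS = {"automobile": "car", "sedan": "car", "physician": "doctor"}

def replace_synonyms(words):
    """Replace each recognized synonym with its more common form."""
    return [SYNONYMS.get(word, word) for word in words]

replaced = replace_synonyms(["the", "physician", "drove", "a", "sedan"])
```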
The other way for synonyms to be resolved is through synonym concatenation. In synonym concatenation, synonyms are not deleted. Instead, synonyms are concatenated with their original word, as shown in Figure 4-7. By using synonym concatenation, either the specific word or the synonym can be accessed and analyzed.
Figure 4-7 Synonym concatenation is another option for preconditioning unstructured data as part of the integration process.
The problem with synonym concatenation is that the texture of the original English sentence is destroyed. Usually, this doesn't matter. However, if there is a need to preserve the original texture of the English sentence, synonym concatenation is not useful.
As a rule, synonym concatenation is the best choice for managing synonyms.
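Synonym concatenation can be sketched as keeping each original word and appending its more common form right after it. The table and the name `concatenate_synonyms` are hypothetical; the output also shows why the texture of the original sentence is lost.

```python
# Hypothetical synonym table: specific word -> more common form.
SYNONYMS = {"automobile": "car", "sedan": "car", "physician": "doctor"}

def concatenate_synonyms(words):
    """Keep every original word and append its synonym, if one is known."""
    result = []
    for word in words:
        result.append(word)
        if word in SYNONYMS:
            result.append(SYNONYMS[word])
    return result

combined = concatenate_synonyms(["physician", "drove", "sedan"])
```

Both "physician" and "doctor" are now searchable, at the cost of a word sequence that no longer reads as the original sentence.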
Resolving homographs -- In the case of words that have multiple meanings, the correct and unique term replaces the nonunique common term. This is the second ingredient needed for the establishment of a common vocabulary in raw text. In many regards, homographic resolution is the reverse of synonym resolution. In Figure 4-8, the term "ha" is replaced with different medical terms based on the person who originally wrote the term.
Figure 4-8 Homographs are expanded to a more precise meaning.
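Homograph resolution needs context in addition to the word itself. The sketch below assumes that the context is the role of the person who wrote the text; the role names, the expansions, and the function `resolve_homographs` are illustrative assumptions rather than values taken from the figure.

```python
# Hypothetical table keyed by (term, author role).
HOMOGRAPHS = {
    ("ha", "cardiologist"): "heart attack",
    ("ha", "general practitioner"): "headache",
}

def resolve_homographs(words, author_role):
    """Expand ambiguous terms using the role of the person who wrote them."""
    return [HOMOGRAPHS.get((word, author_role), word) for word in words]

note = ["patient", "reports", "ha"]
as_cardiologist = resolve_homographs(note, "cardiologist")
as_gp = resolve_homographs(note, "general practitioner")
```

When the author's role is unknown, the term is left as written rather than guessed.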
The capability to handle both words and phrases -- It is not sufficient to support textual analytic processing by using just words. Phrases need to be supported as well, as shown in Figure 4-9.
Figure 4-9 Both words and phrases need to be handled.
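One simple way to support phrases alongside words is to scan the word stream for known phrases, longest first, and emit each match as a single unit. The phrase list and the name `tokenize_with_phrases` below are hypothetical.

```python
# Hypothetical list of known phrases, stored as word tuples.
PHRASES = {("revenue", "recognition"), ("promise", "to", "deliver")}
MAX_PHRASE_LEN = max(len(p) for p in PHRASES)

def tokenize_with_phrases(words):
    """Emit known phrases as single tokens; other words pass through."""
    tokens, i = [], 0
    while i < len(words):
        # Try the longest candidate phrase first, down to two words.
        for n in range(min(MAX_PHRASE_LEN, len(words) - i), 1, -1):
            if tuple(words[i:i + n]) in PHRASES:
                tokens.append(" ".join(words[i:i + n]))
                i += n
                break
        else:
            # No phrase starts here, so keep the single word.
            tokens.append(words[i])
            i += 1
    return tokens

tokens = tokenize_with_phrases(
    "the promise to deliver affects revenue recognition".split()
)
```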
Allowing for multiple spellings of the same name or word -- Some names and words can be spelled in many different ways. Common misspellings need to be included as well. In Figure 4-10, some of the many common misspellings of "Osama Bin Laden" are incorporated into the integrated text. In doing so, the textual analyst is sure to find the references even if they are not spelled correctly or spelled as the person initiating the search thinks they should be spelled.
Figure 4-10 Alternate spellings should be handled as well.
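Alternate spellings can be handled with a table that maps each known variant to one canonical form. The sketch below rewrites variants in place, which is only one option (the variants could instead be stored alongside the original); the table and the name `normalize_spellings` are hypothetical, and the variants are the ones mentioned in this chapter.

```python
# Known variants mapped to a single canonical spelling.
SPELLING_VARIANTS = {
    "usama bin laden": "osama bin laden",
    "osama ben laden": "osama bin laden",
}

def normalize_spellings(text):
    """Lowercase the text and rewrite known variants to the canonical form."""
    text = text.lower()
    for variant, canonical in SPELLING_VARIANTS.items():
        text = text.replace(variant, canonical)
    return text

normalized = normalize_spellings("The report mentions Usama Bin Laden twice.")
```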
Negativity exclusion -- In the case of negativity exclusion, where there is a negative, the words that follow the negative expression are removed from any indexing or other reference. In Figure 4-11, cancer is not included in the indexing process because it is preceded by a "not."
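A minimal sketch of negativity exclusion follows. It assumes that the exclusion runs from the negative word to the end of the sentence, which the text does not specify; the negation list and the name `index_terms` are likewise assumptions.

```python
import re

# Hypothetical list of negative expressions.
NEGATIONS = {"not", "no", "never", "without"}

def index_terms(text):
    """Collect words for indexing, skipping everything after a negation
    up to the end of the sentence."""
    terms = []
    for sentence in re.split(r"[.!?;]", text.lower()):
        for word in re.findall(r"[a-z']+", sentence):
            if word in NEGATIONS:
                break  # ignore the rest of this sentence
            terms.append(word)
    return terms

terms = index_terms("The biopsy is not cancer. Patient has diabetes.")
```

Here "cancer" is left out of the index because it follows "not," while "diabetes" in the next sentence is kept.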
Punctuation and case-sensitivity -- Punctuation and case-sensitivity need to be removed as a consideration for searching. In Figure 4-12, the term "asher lev" can be found even though the term is written "Asher Lev" in the unstructured text. Punctuation and case are eliminated as a basis for finding a match between search argument and the text operated on.
Figure 4-11 Negativity exclusion is another aspect of textual integration.
Figure 4-12 Punctuation and case-sensitivity should not be a factor in doing textual analytics.
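Removing punctuation and case as a factor can be sketched by normalizing both the search argument and the text before comparing them. The helper names `normalize` and `matches` are hypothetical, and the sample sentence is made up for illustration.

```python
import string

# Translation table that deletes all ASCII punctuation characters.
_STRIP_PUNCTUATION = str.maketrans("", "", string.punctuation)

def normalize(text):
    """Lowercase, drop punctuation, and collapse runs of whitespace."""
    return " ".join(text.lower().translate(_STRIP_PUNCTUATION).split())

def matches(query, text):
    """True if the normalized query appears in the normalized text."""
    return normalize(query) in normalize(text)

found = matches("asher lev", "The memo was signed: Asher Lev, Director.")
```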
Document consolidation -- On occasion, document consolidation is a useful aspect of textual integration. When document consolidation is done, documents that hold like information are logically consolidated into a single document, as shown in Figure 4-13. Grouping like documents can make the process of textual integration more manageable.
Figure 4-13 Document consolidation is sometimes a good thing to do as part of the textual integration process.
Themes of data -- Another important aspect of textual integration is that of determining basic themes of data. The themes can be discovered in a document or in the text that has been gleaned from multiple documents. In Figure 4-14, data is clustered around water and steel.
Figure 4-14 Creating themes for documents and groups of documents is another aspect of textual integration.
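Theme discovery can be far more sophisticated than anything shown here. As a very rough stand-in, the sketch below assumes that a theme is simply one of the most frequent non-stop words across a group of documents; the stop list, the sample documents, and the name `top_themes` are all hypothetical.

```python
from collections import Counter

# Small illustrative stop list.
STOP_WORDS = {"a", "and", "the", "is", "was", "which", "of", "in", "for", "to"}

def top_themes(documents, n=2):
    """Return the n most frequent non-stop words across the documents."""
    counts = Counter(
        word
        for doc in documents
        for word in doc.lower().split()
        if word not in STOP_WORDS
    )
    return [word for word, _ in counts.most_common(n)]

docs = [
    "water pressure in the steel pipe",
    "steel pipe supplier for water main",
    "water treatment and steel tank",
]
themes = top_themes(docs)
```

With these sample documents, the two most frequent terms are "water" and "steel."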
These basic activities of integrating unstructured text are the minimum subset of processes that need to occur to provide a sound foundation in the preparation of text for textual analytics. Many other related processes can be applied to unstructured text as it is prepared for movement to the structured environment.