Search engines can help find relevant documents, but a new breed of technology goes beyond simple document retrieval. These text-mining tools make it possible to discover new knowledge in the form of trends, anomalies, relationships, and patterns that span multiple documents and large document collections. By extending the way text databases can be explored, text mining can add valuable content analysis and decision support tools to existing intranets.
From Documents to "Nuggets"
Users of modern intranets have many tools at their disposal for finding relevant information. Keyword search, topic browsing, and other techniques can help users quickly pinpoint the most relevant documents. But returning a set of relevant documents is not always enough. Users who want to discover new knowledge and insights buried within text databases need more than conventional document-based retrieval.
Consider a human resources manager with a database containing hundreds of employee survey responses in free-form text. She can use standard document retrieval to find responses from particular employees or even to find responses that mention a particular concept, but document retrieval will not help her understand what the common concerns are among the respondents. What she needs are tools that enhance and complement existing search functionality: tools that can help her discover useful facts about the employee base that would otherwise remain hidden in the document collection.
Finding useful facts or "nuggets" of knowledge in databases of text is the essence of text mining, an analysis process that attempts to uncover hidden patterns in unstructured text data. Text mining is currently being used in knowledge discovery and business intelligence applications ranging from human resource management to market intelligence to research and development. Text mining techniques are also being used to extend conventional information retrieval systems with features that create a more interactive and contextually aware search experience.
The Foundations of Text Mining
Text mining owes its heritage mostly to the data-mining community, but it also shares a technical foundation with information retrieval. In order to understand how text mining works and its potential applications, it's useful to compare it with these two more-established technologies.
| Region | Product | Quarter | Revenue |
| Northeast | CT460 | Q1 | 250 |
| Southeast | CT340 | Q1 | 300 |
| Southwest | CT340 | Q2 | 200 |
| Southeast | CT460 | Q2 | 350 |
| Northeast | CT340 | Q1 | 400 |
| Southeast | CT460 | Q1 | 250 |
| Northeast | CT460 | Q2 | 200 |
| Southwest | CT340 | Q1 | 300 |
Table 1. Example sales database for data mining.
Text Mining vs. Data Mining
Traditional data mining relies on a well-defined structure to define how a set of data can be analyzed. For example, consider the simple sales database in Table 1, where revenue data is captured and organized by region, product, and sales quarter.
The non-numeric attributes of region, product, and quarter serve as the dimensions across which the revenue data can be summarized and plotted in a data-mining application. Because there are multiple dimensions to the raw revenue data, it is possible to summarize and view the data across any one or a combination of these dimensions in order to uncover interesting facts. For example, we can create a graph to analyze sales by region, within quarters, and across all products.
The advantage of a graph like this is that we can learn at a glance that Q1 was a good quarter for the Northeast: its combined revenue of 650 in Table 1 is the highest of any region and quarter. It is then possible to "drill down" into the data to find out what made this quarter so successful.
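The summary and drill-down described above can be sketched in a few lines of Python. This is a minimal illustration using the rows of Table 1, not the method of any particular data-mining product; the function names are hypothetical.

```python
from collections import defaultdict

# Rows copied from Table 1: (region, product, quarter, revenue).
SALES = [
    ("Northeast", "CT460", "Q1", 250),
    ("Southeast", "CT340", "Q1", 300),
    ("Southwest", "CT340", "Q2", 200),
    ("Southeast", "CT460", "Q2", 350),
    ("Northeast", "CT340", "Q1", 400),
    ("Southeast", "CT460", "Q1", 250),
    ("Northeast", "CT460", "Q2", 200),
    ("Southwest", "CT340", "Q1", 300),
]


def revenue_by_region_and_quarter(rows):
    """Sum revenue across all products for each (region, quarter) pair."""
    totals = defaultdict(int)
    for region, _product, quarter, revenue in rows:
        totals[(region, quarter)] += revenue
    return dict(totals)


def drill_down(rows, region, quarter):
    """Return the per-product revenue behind one (region, quarter) total."""
    detail = defaultdict(int)
    for row_region, product, row_quarter, revenue in rows:
        if row_region == region and row_quarter == quarter:
            detail[product] += revenue
    return dict(detail)


totals = revenue_by_region_and_quarter(SALES)
best = max(totals, key=totals.get)   # the strongest region and quarter
detail = drill_down(SALES, *best)    # which products contributed to it
```

Region, product, and quarter act as dimensions here simply because they are the fields used as dictionary keys; summarizing by a different combination only means choosing a different key.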
Text mining tries to apply these same techniques to unstructured text databases. To do so, it relies heavily on technology from the sciences of Computational Linguistics (CL), Natural Language Processing (NLP), and Machine Learning to automatically collect statistics and infer structure and meaning in otherwise unstructured text. The usual approach involves identifying and extracting key features from the text that can be used as the data and dimensions for analysis. This process, called feature extraction, is a crucial step in text mining.
Table 2 shows an example of feature extraction on a help desk e-mail. A scan of the text reveals which product the e-mail is about, which competitors it mentions, and even the emotions of the sender.
| E-mail Text | Extracted Features (Type: Value) |
| "I have been a user of your CT360 product for four years and was very disappointed to learn that you are discontinuing support for automatic feature extraction. I know that I can purchase the same capability from SnorkelSoft Systems, but I would prefer not to change products at this time. Please reconsider removing feature extraction from your product. Thank you." | Product: "CT360"; Competitor: "SnorkelSoft Systems"; Caller Sentiment: "very disappointed" |
Table 2. Example feature extraction on a help desk e-mail.
Product, Competitor, and Caller Sentiment are the feature types of interest in this example. The feature types used in a text-mining application depend on the problem domain but usually include types of nouns and noun phrases like person names, product names, geographic regions, date stamps, or company names. Feature types can also be less concrete concepts like emotion or "sentiment," as in the example above.
Natural language processing, and sometimes machine learning techniques, are used to identify parts of speech and extract features. Simpler techniques involving look-up lists (e.g., list of product names) can also be used with good results if the meaning of the features in the text is unambiguous (e.g., CT360 is always a product name).
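The look-up list approach can be sketched as follows for the e-mail in Table 2. The lists themselves are assumptions made for illustration (the product codes come from the tables in this article, and the sentiment phrases are invented), and the helper names are hypothetical. A real system would likely need NLP techniques wherever a term is ambiguous.

```python
import re

# Hypothetical look-up lists for illustration; longer phrases come first
# so that "very disappointed" wins over "disappointed".
PRODUCTS = ["CT340", "CT360", "CT460", "CT520"]
COMPETITORS = ["SnorkelSoft Systems"]
SENTIMENT_PHRASES = ["very disappointed", "disappointed", "satisfied"]


def find_first(text, candidates):
    """Return the matched text for the first list entry found as a whole
    word or phrase (case-insensitive), or None if no entry occurs."""
    for candidate in candidates:
        pattern = r"\b" + re.escape(candidate) + r"\b"
        match = re.search(pattern, text, flags=re.IGNORECASE)
        if match:
            return match.group(0)
    return None


def extract_features(text):
    """Map each feature type to the value found in the text, or None."""
    return {
        "Product": find_first(text, PRODUCTS),
        "Competitor": find_first(text, COMPETITORS),
        "Caller Sentiment": find_first(text, SENTIMENT_PHRASES),
    }


EMAIL = (
    "I have been a user of your CT360 product for four years and was very "
    "disappointed to learn that you are discontinuing support for automatic "
    "feature extraction. I know that I can purchase the same capability from "
    "SnorkelSoft Systems, but I would prefer not to change products at this time."
)

features = extract_features(EMAIL)
```

The word-boundary match keeps "satisfied" from firing inside "dissatisfied", which is the kind of ambiguity that makes plain substring look-ups unreliable.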
A simple aggregation of the example feature information from above creates a structured table that can be analyzed using traditional data-mining techniques (Table 3).
| Product | Caller Sentiment | Number of Calls |
| CT340 | Satisfied | 43 |
| CT340 | Dissatisfied | 23 |
| CT460 | Satisfied | 54 |
| CT460 | Dissatisfied | 12 |
| CT520 | Satisfied | 10 |
| CT520 | Dissatisfied | 37 |
Table 3. Hypothetical data and dimensions extracted from e-mail messages.
This is a very simple example of text mining, but it already shows the value of a graph revealing customer satisfaction by product.
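As a rough sketch of that aggregation step, the Python below counts one (product, sentiment) record per message and then derives the share of dissatisfied messages for each product. The per-message records are a stand-in generated from the hypothetical counts in Table 3; in practice they would come from feature extraction.

```python
from collections import Counter

# Hypothetical counts from Table 3, keyed by (product, caller sentiment).
TABLE_3 = {
    ("CT340", "Satisfied"): 43,
    ("CT340", "Dissatisfied"): 23,
    ("CT460", "Satisfied"): 54,
    ("CT460", "Dissatisfied"): 12,
    ("CT520", "Satisfied"): 10,
    ("CT520", "Dissatisfied"): 37,
}

# Stand-in for extraction output: one record per e-mail message.
records = [key for key, n in TABLE_3.items() for _ in range(n)]

# Aggregation step: count messages per (product, sentiment) pair.
counts = Counter(records)


def dissatisfied_share(counts):
    """Return the fraction of dissatisfied messages for each product."""
    totals = Counter()
    for (product, _sentiment), n in counts.items():
        totals[product] += n
    return {p: counts[(p, "Dissatisfied")] / totals[p] for p in totals}


shares = dissatisfied_share(counts)
for product, share in sorted(shares.items()):
    # One '#' per five percentage points gives a rough text bar chart.
    print(f"{product} {'#' * round(share * 20)} {share:.0%}")
```

Once the text has been reduced to counts like these, any conventional charting or data-mining tool can take over.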
Text Mining and Information Retrieval
Information retrieval (IR for short) is the technical discipline that includes document search, navigation, categorization, and filtering. Text-mining techniques have been used in information retrieval systems as a tool to help users narrow their queries and to help them explore other contextually related subjects. Dynamic clustering is one of the most common text-mining techniques that have crossed over into information retrieval. Clustering automatically organizes similar documents into thematically related groups. The software company Vivisimo [http://www.vivisimo.com] provides an excellent example of this feature.
Clustering systems group documents according to common features extracted from the documents. Usually these features are general nouns or noun phrases. In applications where the domain and/or content is well understood, more specialized clustering that uses special feature types (person names, company names, etc.) may be more useful.
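To make the idea of grouping by common features concrete, here is a minimal sketch that uses a greedy, single-pass procedure with Jaccard overlap between feature sets. This is a deliberately simple stand-in chosen for illustration, not the algorithm used by Vivisimo or any other product; the documents, features, and threshold are all hypothetical.

```python
def jaccard(a, b):
    """Overlap between two feature sets: size of intersection over union."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def cluster(docs, threshold=0.25):
    """Greedy single-pass clustering.

    docs maps a document id to its set of extracted features. Each
    document joins the existing cluster whose pooled features overlap
    it most, provided the overlap reaches the threshold; otherwise it
    starts a new cluster.
    """
    clusters = []  # each cluster: {"ids": [...], "features": set(...)}
    for doc_id, features in docs.items():
        best, best_score = None, 0.0
        for candidate in clusters:
            score = jaccard(features, candidate["features"])
            if score > best_score:
                best, best_score = candidate, score
        if best is not None and best_score >= threshold:
            best["ids"].append(doc_id)
            best["features"] |= features
        else:
            clusters.append({"ids": [doc_id], "features": set(features)})
    return clusters


# Hypothetical documents, each reduced to its noun-phrase features.
DOCS = {
    "doc1": {"feature extraction", "product support", "help desk"},
    "doc2": {"feature extraction", "help desk", "e-mail"},
    "doc3": {"sales revenue", "quarter", "region"},
    "doc4": {"sales revenue", "region", "product"},
}

groups = [c["ids"] for c in cluster(DOCS)]
```

Swapping the general noun phrases for specialized feature types such as person or company names changes only the sets passed in, not the grouping logic.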
Finally, for documents that have time-date stamps on them, it is possible to enhance standard document retrieval by performing automatic trend analysis on search results. Figure 3 shows an example screen shot of a combined search and trend analysis application on medical content. Clicking on the bars in the bar chart reissues the searcher's query with a date range qualifier. In addition to allowing the searcher to drill down on a particular date range of interest, trend charts like this can also reveal important patterns that may be of interest beyond what might be present in the individual documents.
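A trend chart of this kind boils down to bucketing search results by date and turning a clicked bar back into a date-qualified query. The sketch below assumes yearly buckets and uses invented result titles, dates, and query text; the shape of the refined query is likewise hypothetical and does not reflect any specific search engine's syntax.

```python
from collections import Counter
from datetime import date

# Hypothetical search results: (title, date stamp).
RESULTS = [
    ("Report A", date(2000, 11, 3)),
    ("Report B", date(2001, 2, 14)),
    ("Report C", date(2001, 6, 30)),
    ("Report D", date(2001, 9, 9)),
    ("Report E", date(2002, 1, 21)),
    ("Report F", date(2002, 5, 5)),
]


def trend_by_year(results):
    """Count results per year; each count is the height of one bar."""
    counts = Counter(stamp.year for _title, stamp in results)
    return dict(sorted(counts.items()))


def refine_query(query, year):
    """Build the query reissued when the bar for a given year is clicked."""
    return {"query": query, "start": date(year, 1, 1), "end": date(year, 12, 31)}


def filter_by_range(results, start, end):
    """Keep the titles whose date stamp falls inside the range (inclusive)."""
    return [title for title, stamp in results if start <= stamp <= end]


trend = trend_by_year(RESULTS)
refined = refine_query("example query", 2001)
hits = filter_by_range(RESULTS, refined["start"], refined["end"])
```

A finer bucket size, such as months, would only change the key computed in `trend_by_year` and the range built in `refine_query`.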
Text Mining in Practice
There is growing interest in text mining as a tool to tap "the other 90%" of data that exists in unstructured text data formats. On the software vendor front, the few small software companies that specialize in text analysis and mining have remained relatively prosperous despite the recent economic downturn (see ClearForest [http://www.clearforest.com/], Inxight [http://www.inxight.com/], Temis [http://www.temis-group.com/], and Megaputer [http://www.megaputer.com]). Other more established companies are also getting into the act. SAS [http://www.sas.com], a leading provider of business intelligence software, last year announced a new text-mining component for its Enterprise Miner product. Another major business intelligence software vendor, SPSS [http://www.spss.com], made a similar announcement after acquiring the text analysis software company Lexiquest.
Potential applications for text mining abound in virtually every market sector—any place where there are large untapped databases of textual information. Hewlett Packard is using SAS's text-mining product to automatically cluster free-form text notes created by the company's telesales representatives. These clusters will be used to improve decision making about what products might be appropriate for a particular customer. CONOCO is using software and services from Temis to analyze corporate e-mail and internal surveys in order to get a periodic snapshot of the company's "self-image."
On a more sober note, government intelligence agencies have long been interested in text mining and associated technologies. Growth of the Internet and the events of September 11, 2001, have increased interest in text mining as a tool to track potential terrorist threats by analyzing e-mail messages, online chat rooms, and other sources.
Conclusion
Text-mining technology is entering the mainstream, both in stand-alone business intelligence applications and as a tool for enhancing document retrieval applications. Advances in computer hardware speed and improvements in the effectiveness of text analysis software are making text-mining applications more realistic for commercial use than ever before. In the future, adoption of text-mining ideas will result in applications that blend text mining, data mining, and full-text information retrieval to support integrated information access and knowledge discovery across combinations of structured and unstructured sources.