A picture is worth a thousand, ten thousand -- or is it ten million words?
March 29, 1995
RICHLAND, Wash. -- Scientists at the U.S. Department of Energy's Pacific Northwest Laboratory have pushed forward technological frontiers by developing new software tools that graphically display images based on word similarities and themes in text.
The visualization tools, Galaxies and Themescape™, were developed for the U.S. intelligence community, but could have broader use.
"The tools were designed to help analysts within the intelligence community better evaluate the myriad of information they review," said PNL visualization expert Jim Thomas. "However, the technology also can be used by others who analyze large amounts of information including lawyers reading previous cases, doctors reading patients' histories or regulators reading environmental procedures."
"The trick -- and the tedium -- for the information analyst is to find the 'needle' of critical information in the 'haystack' of millions of words," explained PNL cognitive scientist Jim Wise. "Within minutes, PNL's visualization tools focus the analyst on the few pertinent documents or terms needed rather than the hundreds or thousands accessible."
By applying PNL's visualization tools to large sets of documents, information analysts can quickly see a picture of the content similarities and themes in the documents without having to read unnecessary text.
"PNL's tools don't replace reading," Wise said. "They just assist the user in getting to the right items to read."
HOW IT WORKS
The mouse-driven visualization function in Galaxies computes the word similarities and patterns in documents and then displays the documents on a computer screen as a universe of "docustars." Closely related documents cluster together in a tight group, while unrelated documents are separated by large spaces.
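This release does not say how Galaxies measures word similarity. As a minimal sketch of the general idea, the example below assumes a simple bag-of-words model compared with cosine similarity, which is one common technique and not necessarily the one PNL uses. The function names and the toy documents are hypothetical.

```python
import math
import re
from collections import Counter


def word_counts(text):
    """Lowercase the text and count its words (a simple bag-of-words model)."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine_similarity(a, b):
    """Cosine of the angle between two word-count vectors (0.0 to 1.0)."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical toy documents, not taken from any real corpus.
docs = {
    "a": "breast cancer research funding trends",
    "b": "new funding for breast cancer research",
    "c": "environmental procedures for regulators",
}
vectors = {name: word_counts(text) for name, text in docs.items()}

# Related documents score higher, so a layout would place them closer together.
sim_ab = cosine_similarity(vectors["a"], vectors["b"])
sim_ac = cosine_similarity(vectors["a"], vectors["c"])
```

In this sketch, documents "a" and "b" share most of their words and score well above "a" and "c", which share none. A display could then turn high similarity into short distances between "docustars."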
In Themescape, themes in the documents are layered and appear on the computer screen as a relief map of natural terrain. The mountains in Themescape indicate where themes are concentrated in the underlying documents, and their shapes -- a broad butte or a high pinnacle -- reflect how the thematic information is distributed and related across documents.
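How Themescape computes its terrain is not described here. One way to picture the idea, offered only as an assumption, is to give every document that carries a theme a small Gaussian "bump" at its screen position and add the bumps together. The function name, the `spread` parameter and the coordinates below are all hypothetical.

```python
import math


def terrain_height(x, y, points, spread=1.0):
    """Sum a Gaussian bump for every document position near (x, y)."""
    return sum(
        math.exp(-((x - px) ** 2 + (y - py) ** 2) / (2 * spread ** 2))
        for px, py in points
    )


# Hypothetical 2-D positions of documents that share one theme.
tight_cluster = [(5.0, 5.0), (5.1, 5.0), (5.0, 5.1), (4.9, 5.0)]
loose_cluster = [(2.0, 2.0), (4.0, 2.0), (2.0, 4.0), (4.0, 4.0)]

# Height at the center of each group of documents.
peak_tight = terrain_height(5.0, 5.0, tight_cluster)
peak_loose = terrain_height(3.0, 3.0, loose_cluster)
```

Under these assumptions, the four tightly packed documents pile up into a higher peak than the four scattered ones, which matches the contrast between a high pinnacle and a broad butte.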
Once the visualization tools have displayed the content similarities and themes in the documents, researchers can refine their search by using several built-in support functions including a document characterization or gisting tool, a word search tool, a time analyzer and an annotation tool.
For example, suppose a person (medical researcher, policy maker or involved citizen) was interested in finding out what direction the United States was heading in breast cancer research. First, the person would need access to as many breast cancer-related documents as possible, which could be in the form of news articles, memos, policy documents and research papers. Much of this information is available online and can be downloaded over the Internet to a computer or purchased on CD-ROM.
Drawing from this large, unstructured document base, the person could use PNL's visualization tools to automatically organize the documents into clusters according to their content similarities and into thematic terrains according to the themes in the text.
Clicking on a cluster or star in Galaxies displays the words that occur most often in the document(s). To explore further, a person can select a "docustar" to retrieve the document's header or its complete text. The analyst also could enter queries on terms of particular interest, and the "docustars" containing those terms would light up like "novas" on the screen. Clicking on a spot in Themescape's terrain reveals all the terms that make up that part of the terrain. At this point, the analyst has immediate access to all of the documents containing the combinations of terms that are relevant to the search. During all of these processes, the person can use the annotation function to capture and store thoughts and ideas for future reference.
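The two lookups described above, the most frequent words in a selected cluster and the documents that "light up" for a query, can be illustrated with a short sketch. It assumes plain word counting and exact term matching. The helper names, the stopword list and the sample documents are hypothetical and are not drawn from PNL's software.

```python
import re
from collections import Counter

# A tiny, hypothetical stopword list so common words do not dominate.
STOPWORDS = frozenset({"the", "of", "for", "in", "a", "and"})


def tokenize(text):
    """Split text into lowercase words."""
    return re.findall(r"[a-z]+", text.lower())


def top_words(texts, n=3):
    """Return the n most frequent non-stopword words across the texts."""
    counts = Counter(
        w for t in texts for w in tokenize(t) if w not in STOPWORDS
    )
    return [word for word, _ in counts.most_common(n)]


def matching_documents(docs, query_terms):
    """Return ids of documents that contain every query term."""
    terms = {t.lower() for t in query_terms}
    return sorted(
        doc_id for doc_id, text in docs.items()
        if terms <= set(tokenize(text))
    )


# Hypothetical sample documents.
docs = {
    "d1": "Screening trial results for breast cancer",
    "d2": "Breast cancer screening in rural clinics",
    "d3": "Regulators review environmental procedures",
}

cluster_words = top_words([docs["d1"], docs["d2"]])
lit_up = matching_documents(docs, ["screening", "cancer"])
```

Here `cluster_words` holds the three words that both clustered documents share, and `lit_up` lists the documents a display would highlight for the query.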
LOOKING FOR TRENDS
After finding and annotating the relevant groups of stars representing the different documents, the person could use the timeslicer to make the "docustars" in Galaxies appear as a function of time, at a resolution ranging from years down to the minute, depending on how each document was timestamped when it was produced. By using the timeslicer, the person can gain an understanding of what trends in document patterns have developed over time, which may be important in determining future trends or medical options. By applying PNL's tools, the person can quickly analyze thousands of documents without having to read them, develop an understanding of patterns and themes over time and produce a picture of the trends. From this information, the person may be able to reach an informed opinion as to the future direction of breast cancer research in the United States.
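The timeslicer idea can be sketched as a filter on document timestamps followed by a count per period. This is an illustration only: the function names and the timestamps are hypothetical, and nothing here reflects how PNL's tool is actually built.

```python
from collections import Counter
from datetime import datetime


def time_slice(stamps, start, end):
    """Return ids of documents timestamped in the window [start, end)."""
    return sorted(
        doc_id for doc_id, stamp in stamps.items() if start <= stamp < end
    )


def documents_per_year(stamps):
    """Count documents by year to expose a simple trend over time."""
    counts = Counter(stamp.year for stamp in stamps.values())
    return dict(sorted(counts.items()))


# Hypothetical timestamps, recorded down to the minute.
stamps = {
    "d1": datetime(1992, 3, 14, 9, 30),
    "d2": datetime(1993, 7, 1, 16, 5),
    "d3": datetime(1994, 1, 20, 11, 0),
    "d4": datetime(1994, 11, 2, 8, 45),
}

# Documents that would be visible for a 1993-1994 window.
visible = time_slice(stamps, datetime(1993, 1, 1), datetime(1995, 1, 1))
per_year = documents_per_year(stamps)
```

Narrowing the window to a single day, hour or minute works the same way, provided the documents carry timestamps at that resolution.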
PNL's two visualization tools reside in a larger software package called SPIRE™, or Spatial Paradigm for Information Retrieval and Exploration, and are a breakthrough in the way people interact with large amounts of text-based information. With tools like Galaxies and Themescape, people can use their innate visual "pattern matching" skills and the power of a computer to take themselves from a swamp of documents into a perceived "information space" that informs and enlightens in the most intuitive ways.