New retrieval techniques improve access to agency information
Nowhere is the challenge of data retrieval felt more keenly than in the government, where information is the primary product of many federal agencies.
"There has never been a market more tailor-made for a technology, because of the amount of data the government has," said David Appelbaum, general manager for North America at software developer Autonomy Inc., San Francisco. In fact, government research programs, such as the Defense Advanced Research Projects Agency, funded a lot of early search-software development and deployment in the government, said Rita Joseph, Autonomy's vice president of public-sector operations. From those early days until recently, only highly skilled people searched databases. "Five or 10 years ago, the searchers were trained professionals" who were fluent in computer languages and could tell the computer exactly what they wanted, said Susan Feldman, director of the document and content technologies program for IDC.
But today, workers throughout government need to search for data relevant to their jobs, even though they are not database experts. And many federal agencies provide some access to their records to citizens who connect to the databases using the Internet.
That has increased pressure on developers of search-engine technology. Ineffective searching — whether it happens because nonprofessional searchers don't know the best way to phrase a query or because the computer can't read the natural language it is analyzing — wastes time.
The three primary means of searching for documents and data are the full-text search, the taxonomy search and the metadata search.
The full-text search is rather self-explanatory: The computer reads every word of every document entered into the system and then records the words' locations in an index. To perform a search, the engine takes the user's keywords and looks for matches in the index, then indicates the source files from which they came.
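The indexing and lookup steps described above can be sketched in a few lines of Python. This is a minimal illustration, not any vendor's implementation; the function names and the two sample documents are hypothetical.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(documents):
    """Record, for every word, the documents and word positions where it occurs."""
    index = defaultdict(dict)
    for name, text in documents.items():
        for position, word in enumerate(tokenize(text)):
            index[word].setdefault(name, []).append(position)
    return index


def search(index, query):
    """Return the names of documents that contain every keyword in the query."""
    keywords = tokenize(query)
    if not keywords:
        return []
    matches = set(index.get(keywords[0], {}))
    for word in keywords[1:]:
        matches &= set(index.get(word, {}))
    return sorted(matches)


# Hypothetical sample documents, for illustration only.
DOCUMENTS = {
    "memo.txt": "New thermal windows cut heating costs",
    "it_notice.txt": "Windows operating systems need a patch",
}
INDEX = build_index(DOCUMENTS)
```

Searching this index for "windows" returns both files, because the index stores only the word and where it appears, not what it means.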
The problem is that words have multiple meanings, and a basic keyword full-text search doesn't differentiate between them. Computer terms such as Java and Windows, for example, have other common meanings. A person searching for information on new thermal panes for the home isn't much interested in Microsoft Corp. operating systems.
With some added intelligence, a full-text search can look for a variety of associated words that provide clues to the intended definition of the word sought, filtering out hits for other meanings. "The more information you can give a search engine, the better," Feldman said, because additional search terms provide a clearer picture of the information sought. Some even permit users to copy and paste entire documents on the topic, so the engine can seek similar documents.
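One simple way to picture the use of associated words is to keep only the keyword hits that also share some context terms with the query, and to rank them by how many they share. The sketch below assumes that approach; commercial engines use more sophisticated methods, and the function names and sample text are hypothetical.

```python
import re


def words(text):
    """Return the set of lowercase alphanumeric words in the text."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def rank_by_context(documents, keyword, context_terms):
    """Rank documents containing the keyword by how many context terms they share.

    Documents that contain the keyword but none of the context terms are dropped.
    """
    context = words(" ".join(context_terms))
    scored = []
    for name, text in documents.items():
        doc_words = words(text)
        if keyword.lower() not in doc_words:
            continue
        overlap = len(doc_words & context)
        if overlap > 0:
            scored.append((overlap, name))
    # Highest overlap first; ties broken alphabetically by document name.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [name for _, name in scored]


# Hypothetical sample documents, for illustration only.
SAMPLE = {
    "home_guide": "Double-pane windows with thermal glass for the home",
    "os_review": "Windows operating systems from Microsoft",
}
```

With the context terms "thermal," "glass" and "home," only the home-improvement document survives a search for "windows."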
Another approach is to analyze relationships between documents for a taxonomy search. The indexing process works by classifying documents into a hierarchy. The indexing can take a few days to a few weeks, but once done, provides a quick way to find information.
"We go in, spend a week learning how [clients] do business, learn their processes and their content," said Roger Phillip, vice president of marketing and business development for Semio Corp., San Mateo, Calif., which offers a taxonomy search product. The U.S. Postal Service uses Semio's search technology to find documents, especially those employing terms unique to the service, referred to internally as "postal.speak." "There is a 100-page glossary of postal terms," said John Gregory, USPS marketing specialist. "So we put that whole thing into the taxonomy."
With a metadata search, on the other hand, each file has its own mini-database describing important aspects of the file, such as the author, creation date and keywords relating to its content.
A metadata search can be more selective than others. It can be limited to files created on specific dates, for example. It is also quick, because the search engine doesn't need to read the whole document, only its metadata. Proponents liken this to the card catalog in a library, except the metadata is contained in the file itself so there is no central repository to be protected and maintained.
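The selectivity described above comes from filtering on a handful of fields instead of reading document bodies. Here is a minimal sketch, assuming each file's metadata has already been read into a small record; the field names, function name and sample files are hypothetical.

```python
from dataclasses import dataclass, field
from datetime import date


@dataclass
class FileMetadata:
    """The attributes a file carries about itself (hypothetical field names)."""
    path: str
    author: str
    created: date
    keywords: set = field(default_factory=set)


def metadata_search(records, author=None, created_from=None, created_to=None, keyword=None):
    """Return paths of files whose metadata satisfies every filter that was supplied."""
    results = []
    for record in records:
        if author is not None and record.author != author:
            continue
        if created_from is not None and record.created < created_from:
            continue
        if created_to is not None and record.created > created_to:
            continue
        if keyword is not None and keyword.lower() not in record.keywords:
            continue
        results.append(record.path)
    return results


# Hypothetical sample records, for illustration only.
RECORDS = [
    FileMetadata("budget.xls", "J. Smith", date(2001, 3, 14), {"budget", "fy2002"}),
    FileMetadata("briefing.ppt", "A. Jones", date(2002, 1, 9), {"briefing", "budget"}),
    FileMetadata("memo.doc", "J. Smith", date(2002, 2, 20), {"memo"}),
]
```

A call such as `metadata_search(RECORDS, created_from=date(2002, 1, 1), created_to=date(2002, 12, 31))` narrows the results to files created in 2002 without opening any of them.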
Metadata searches presume that every file has associated attributes, but that is not always the case. Federal agencies may require users to record such information when creating documents, which can be done with Microsoft Word, Excel and PowerPoint files. However, older files may lack the required metadata.
One solution, available from companies such as Hiawatha Island Software Co. (HiSoftware), Concord, N.H., is software that not only performs metadata searches, but also creates metadata tags for existing files. HiSoftware's Hi-Search software works in conjunction with other search tools, such as products from Verity Inc., Sunnyvale, Calif. "We can take your classification engine and provide tags for text and nontext documents," said Rob Yonaitis, HiSoftware president and CEO.
Finally, security is an important aspect of search technology, said Hadley Reynolds, the director of research for the Delphi Group, Boston.
Agencies with users holding different clearance levels must have a search engine that screens the query results before showing them to the user. Even if the documents are secure and cannot be viewed by those lacking the clearance to read them, simply acknowledging the existence of some documents could compromise security, Reynolds said. An advanced search tool will display only those results that a particular user is cleared to read, he said.
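The screening step Reynolds describes amounts to filtering the hit list against the user's clearance before anything is displayed. The sketch below assumes a simple ordered set of levels and invented document titles; real access-control schemes are typically more granular.

```python
from enum import IntEnum


class Clearance(IntEnum):
    """Illustrative ordered clearance levels; higher values can read more."""
    PUBLIC = 0
    CONFIDENTIAL = 1
    SECRET = 2
    TOP_SECRET = 3


def trim_results(hits, user_clearance):
    """Keep only the hits the user is cleared to read.

    Each hit is a (title, required_clearance) pair. Hits above the user's
    level are dropped entirely, so their existence is never revealed.
    """
    return [title for title, required in hits if required <= user_clearance]


# Hypothetical search hits, for illustration only.
HITS = [
    ("Cafeteria menu", Clearance.PUBLIC),
    ("Facility audit", Clearance.SECRET),
    ("Program roster", Clearance.TOP_SECRET),
]
```

The filter runs on the result list itself, not at the moment a user tries to open a document, which is what keeps restricted titles from appearing at all.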
Current data-retrieval tools are far from perfect, and faced with the daunting challenge of merging human languages with computer processing power, vendors are tackling the problem in different ways.
"People are working in every conceivable area of this space right now because the current state of the art is so poor," Reynolds said.
Carney is a freelance writer based in Herndon, Va.