Rescuing missed information

Cutting-edge commercial wares give agencies a whole new outlook on searching for information

About a decade ago, federal agencies started to grapple with the specter of storing and sharing mass quantities of information. A governmentwide search standard seemed like the right solution at the time. It would ensure that important records would remain accessible regardless of which agency or system housed them.

But since then, a steady stream of new information retrieval technologies, popularized by user-friendly Web sites such as Google and Yahoo, has hooked legions of users and attracted the attention of agency executives. Many of those executives now resist the government search standard's burdensome requirements.

The overhaul of the FirstGov Web portal is providing a high-profile example of the potential of new search technologies for government. As a result, experts believe agencies will follow industry and adopt cutting-edge search technologies such as metasearch, clustering and topic maps. Those techniques promise to dig deeper into the government's online knowledge base, in addition to making search results much easier to use.

Several federal Web sites already use metasearch and clustering features, which allow searches to span disparate systems, data types and agencies, a challenge that the governmentwide search standard was supposed to address. Others are experimenting with topic maps to help employees find relevant information that traditional search tools might overlook.

Dave Goebel, president of the Goebel Group, a consulting firm, said federal users have become so comfortable with commercial search engines that they now expect similar performance from government search tools.

"That's forcing the agencies to proactively seek out commercial solutions," he said.

Meanwhile, search vendors continue to expand their products' capabilities in response to demands for better, faster retrieval tools. For example, Google sells enterprise search products, apart from its Web search site, that agencies can use with topic maps, metasearch and clustering tools.

"We are making our search appliances more and more open," said Rajen Sheth, product manager in Google's enterprise group. "We've built [application programming interfaces] to integrate search with a variety of different types of data and applications."

Another priority for vendors is helping users make more sense of search results that can list hundreds and even thousands of hits.

"The ongoing problem is that just about anything you type in [a search form] will lead to an overabundance of information," said Raul Valdes-Perez, co-founder of Vivisimo, which runs the clustering search site Clusty.com, and an adjunct associate professor of computer science at Carnegie Mellon University.

GILS: Interoperable search

The government first attempted to tackle search issues 10 years ago when it created the Government Information Locator Service standard. GILS responds to searches that reference information by title, subject, author, date and location. For GILS to work effectively, federal employees need to index all public government information by assigning those five labels to electronic records.

Based on the International Organization for Standardization's (ISO) 23950 specification for information search and retrieval, GILS would allow a user on one system to search and retrieve information from other GILS-compliant systems. The library community, smitten with the notion of an electronic card catalog of human knowledge, embraced the GILS standard.

Eliot Christian, who created GILS and manages data and information systems at the U.S. Geological Survey, said GILS would eventually allow users to search for government information with the search engine of their choice -- Google, Microsoft's MSN Search or FirstGov, for example. The standard would then direct users to the proper information on their first search attempt. In addition to convenience for current users, GILS would guide researchers to the proper information sources in decades to come.

At an April meeting of the Industry Advisory Council's eGovernment Shared Interest Group, Christian urged the government to be tougher about requiring agencies to specify GILS compatibility for new acquisitions.

The Office of Management and Budget's Circular A-130 and the Paperwork Reduction Act of 1995 mandate GILS-compliant software, but most departments have ignored those requirements because they say manually coding records is too complicated and time-consuming.

Christian said that by adopting GILS, agencies could reduce the costs of managing older systems and implementing new search technologies, which frequently change.

However, government officials find the evolution of commercial products more appealing.

National Institute of Standards and Technology officials recently proposed withdrawing GILS as a mandatory federal standard, saying modern search technology has eclipsed it. In another blow to GILS, General Services Administration officials decided not to require GILS compliance when they awarded a contract to revamp FirstGov's search engine.

Former federal officials say that if they had known robust commercial search tools were on the horizon, they would not have pushed GILS as a governmentwide standard. Dan Chenok, former branch chief of information policy and technology at OMB, said officials wrote the policies when search engines were in their infancy.

"Most agencies have been implementing the goals that underlie GILS and the requirements of the Paperwork Reduction Act with search engine technology that did not exist 10 years ago," said Chenok, now a vice president at SRA International.

Some experts say GILS and the new crop of search technologies can coexist.

Kevin McCook, federal sales director at search tools vendor Verity, said GILS helped streamline federal records management.

"Not everyone has complied, but there has been effective guidance available that has, at least, limited the potential chaos," he said.

McCook said Verity supports GILS as a stabilizing force across government, but he endorses other technologies for deep and complex probes. He said the intelligence world and some scientific organizations need specialized, more advanced techniques that surpass GILS' simplistic tagging conventions.

Metasearch: Expanding frontiers

Most government Web sites still rely on traditional search technology, which suffers from a serious shortcoming when working with different information types.

Traditional search engines use automated software called a crawler, which reads information on static Web pages and builds a central index with links to the original sources. The engines compare search queries against this index and quickly generate lists of links.
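In code, that crawl-and-index model looks roughly like the following Python sketch. The URLs and page text are hypothetical stand-ins, not any agency's actual content.

```python
# Minimal sketch of the crawl-and-index model described above.
# The pages below are hypothetical stand-ins for crawled Web pages.
from collections import defaultdict

pages = {
    "https://example.gov/page1": "federal search standards and records",
    "https://example.gov/page2": "search engines index static records",
}

# "Crawl" each page and build a central inverted index: term -> set of URLs.
index = defaultdict(set)
for url, text in pages.items():
    for term in text.lower().split():
        index[term].add(url)

# A query is answered from the index alone, never by revisiting the pages.
def search(term):
    return sorted(index.get(term.lower(), set()))

print(search("records"))  # both pages contain the term "records"
```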

But online information storage increasingly favors structured database systems instead of static Web pages. Those database systems present their information on Web pages only when users request it. Search engines that use traditional crawlers never see all the information stored in databases, which include valuable government resources such as PubMed and USAJobs.

Metasearch, also known as federated search, can eliminate this blind spot. A single search triggers multiple simultaneous queries of selected databases, the Web and site-specific search engines, such as NASA.gov. The metasearch tool then collects and combines the search results, eliminates redundancies and presents the finished product as one list.
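A minimal Python sketch of that fan-out, merge and de-duplicate pattern might look like the following. The two backend functions are hypothetical stand-ins for separate databases or site search engines, not ToxSeek's or any vendor's actual interfaces.

```python
# Sketch of federated search: query every backend at once, then merge the results.
from concurrent.futures import ThreadPoolExecutor

def search_backend_a(query):
    return [{"url": "https://a.example.gov/doc1", "title": f"{query} overview"}]

def search_backend_b(query):
    # Returns one hit that backend A also knows about, to show de-duplication.
    return [{"url": "https://a.example.gov/doc1", "title": f"{query} overview"},
            {"url": "https://b.example.gov/doc9", "title": f"{query} dataset"}]

def metasearch(query, backends):
    # Send the same query to all backends simultaneously rather than one by one.
    with ThreadPoolExecutor() as pool:
        result_lists = pool.map(lambda fn: fn(query), backends)
    # Combine the lists and drop duplicate URLs before presenting one hit list.
    seen, merged = set(), []
    for results in result_lists:
        for hit in results:
            if hit["url"] not in seen:
                seen.add(hit["url"])
                merged.append(hit)
    return merged

print(metasearch("toxicology", [search_backend_a, search_backend_b]))
```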

According to a recent study by the free Web metasearch service Dogpile.com, only 1.1 percent of the first-page results from the four leading commercial search sites match. Dogpile uses the federated search technique to launch simultaneous queries of those four sites -- Ask Jeeves, Google, MSN Search and Yahoo -- and report results in a consolidated hit list.

Programmers who work with metasearch tools say the federated search technique produces more reliable results than traditional search engines.

"People who assume that Google has everything…really miss relevant items," said Tamas Doszkocs, a computer scientist at the National Library of Medicine (NLM). He has been working for almost a decade on a metasearch engine called ToxSeek, which scours toxicology and environmental health databases at government agencies. The site, accessible during its beta-testing phase, is scheduled to launch later this year.

In addition to metasearch, ToxSeek uses clustering, another new search technique. With clustering, algorithms sort search results into groups based on textual and linguistic similarities.

For example, a ToxSeek user could search for "cancer" and "smoking," and the system would return results categorized by a variety of subheads, including the information's source, topic and type.
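The sketch below shows one crude way to group hits by word overlap. It illustrates the clustering idea only, not ToxSeek's actual algorithm, and the result snippets are invented.

```python
# Rough sketch of clustering search hits by textual similarity,
# using word overlap (Jaccard similarity) as the grouping signal.
def jaccard(a, b):
    a, b = set(a.lower().split()), set(b.lower().split())
    return len(a & b) / len(a | b)

def cluster(snippets, threshold=0.3):
    clusters = []
    for snippet in snippets:
        # Join the first existing cluster this snippet resembles, else start a new one.
        for group in clusters:
            if jaccard(snippet, group[0]) >= threshold:
                group.append(snippet)
                break
        else:
            clusters.append([snippet])
    return clusters

hits = [
    "smoking linked to lung cancer risk",
    "lung cancer risk rises with smoking",
    "state tobacco tax legislation update",
]
for group in cluster(hits):
    print(group)  # the two smoking/cancer hits land in one group
```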

Clustering lets users see results that would otherwise appear near the end of ranked lists, and they can survey the information landscape before digging in.

One of the earliest adopters of clustering in the government is the Homeland Security Digital Library. The library, maintained by the Homeland Security Department and Naval Postgraduate School, deployed a version of ToxSeek more than six months ago.

The search tool, named SeekOnce, short for "Seek Once, Retrieve Many," spans a variety of resources, such as research studies, theses, white papers, legislation, journal articles and commercial databases. It can read plain text documents, PowerPoint presentations, multimedia files, images and spreadsheets. SeekOnce accesses about 50 databases and may eventually extend to as many as 250.

GSA officials gave metasearch and clustering tools a public vote of confidence last month, when they selected a new search engine for the FirstGov Web portal. Contract winner Vivisimo will work with Microsoft to provide metasearch and clustering capabilities to FirstGov users.

The move will expand the reach of the portal's search engine into a greater variety of government-related content and make it easier for users to navigate search results by clustering hits according to subject matter.

Meanwhile, the most established government metasearch tool is Science.gov, an interagency product hosted by the Energy Department's Office of Scientific and Technical Information (OSTI).

Since 2002, a dozen agencies, including the Defense and Agriculture departments and NLM, have contributed to the portal. Science.gov, like ToxSeek, can query selected databases, such as PubMed, MedlinePlus and DefenseLINK, but it does not cluster results.

OSTI Director Walter Warnick said science teachers are using the portal often and giving it positive reviews. The mayor of Oak Ridge, Tenn., home of the Science.gov portal, used it to help his child with homework.

When a big science story breaks, such as a tsunami or hurricane, people can look for context at Science.gov. Warnick added that it is also popular with college students looking for science internships and fellowships.

"We think we have 98 percent of all the federal research and development budget represented in Science.gov," Warnick said.

That desire for comprehensiveness illuminates one of the primary shortcomings of metasearch tools, however. "There are limits on how many subordinate databases you can do at one time," he said. "The more databases you have, the slower the response."

Search tools that use crawlers don't have that problem because their search queries are run only against the central index that the crawler created.

For this reason, OSTI posted a lesson plan on its education site, ScienceLab, which advises students to use commercial engines in combination with Science.gov for a more productive search experience.

"We see metasearch as not a competition with Google but as complementary," Warnick said.

Topic maps: Making connections

Traditional search engines, while increasingly precise and expansive, cannot think like a human being. For example, what happens when two sets of electronic documents use different vocabularies to discuss a related topic? A traditional search engine might miss the link between the two sets because it can only match words, not the meaning of the ideas discussed within them.

The still-emerging area of topic maps can help educate search engines.

Like metasearch, topic map techniques do not replace traditional search tools. They can work in conjunction with them, however, to provide more powerful search navigation. For example, a NASA topic map could be set up so that when a person enters "Pathfinder" into a search form, the topic map guides the user to related items, such as "Mars lander" and "evidence suggesting liquid water was once a stable presence on Mars."
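A bare-bones Python sketch of that topic-and-association idea, using the hypothetical NASA example above, might look like the following. Real topic maps (the ISO standard) carry far richer structure; this only shows how a query can be steered to related topics.

```python
# Minimal sketch of topics linked by typed associations.
# The subjects, relations and objects here are illustrative, not a real NASA map.
associations = [
    ("Pathfinder", "is a", "Mars lander"),
    ("Mars lander", "supports finding", "evidence of liquid water on Mars"),
]

def related_topics(query):
    # Collect every topic that shares an association with the query term.
    related = set()
    for subject, relation, obj in associations:
        if query in (subject, obj):
            related.add(obj if query == subject else subject)
    return sorted(related)

# A search for "Pathfinder" can now be augmented with these navigation links.
print(related_topics("Pathfinder"))   # ['Mars lander']
print(related_topics("Mars lander"))  # ['Pathfinder', 'evidence of liquid water on Mars']
```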

Several federal agencies, including Energy, the Defense Intelligence Agency and the Internal Revenue Service, have started to add topic maps to traditional search technology.

In 2001, when IRS officials wanted to improve customer service on their tax assistance hot line, they developed an internal topic map that would help call center operators find relevant information more easily.

The IRS topic map identified many of the descriptors that callers might use and then programmed links between related terms. For example, the terms "abandonment" and "disposition of property" are different ways of referring to something that has similar tax implications. With topic maps working alongside the search engine, a query for either term would direct a call center operator to the relevant online information.

Michel Biezunski, a consultant at Coolheads Consulting who worked on the IRS project and a co-editor of the ISO standard for topic maps, said the old call center setup forced operators to flip through numerous resources, including manuals and multiple technical Web sites, to answer caller inquiries.

"There was too much information and not enough time because they were on the phone," Biezunski said. Now the topic map guides hot line operators to the most helpful information faster.

Topic map implementation requires more elbow grease than search appliance installation. Unlike traditional search engines, most topic maps require human and artificial intelligence. A computer does not know that "abandonment" and "disposition of property" are related in a tax scenario. A person, typically a government subject-matter expert, has to teach the computer to recognize the relationship.

Many topic map applications include a search engine to help users find a starting place in the knowledge network. In some cases, topic maps are not necessary for simple searches.

"If you can use the usual stuff, then do it," Biezunski said. "Google is fine if what you have to do is relatively shallow. If you are really trying to explore a domain, then Google must be frustrating."

Intelligence agencies, which want to share information but must first translate one another's jargon, are candidates for topic map experimentation, experts say.

George Kondrach, executive vice president at Innodata Isogen, an information management consulting firm, has been helping the Office of Naval Intelligence and DIA with their topic map projects for about a year.

"People like the CIA and DIA and the ONI, they don't even speak different dialects of the same language, they speak different languages," he said. Secrecy is one reason for the differences, but the agencies need to share some information, he added.

"This [topic map] overlay transcends the semantics of each agency," he said.

Some topic map consultants would like the Government Accountability Office to encourage all government agencies to convert their vocabularies into topic maps.

"I'd be real interested in seeing the GAO get excited about something like this. It would make their job so much easier," said Patrick Durusau, a private consultant who is co-editor of the Topic Maps Reference Model and chairman of the U.S. National Technical Advisory Group to the ISO committee that developed the topic map standard.

Integrating departments' topic maps would create greater transparency and accountability in government, he said.

"Topic maps enable individual agencies to retain not only their traditional nomenclature but also their information systems," Durusau said. "A topic map sits as a wrapper around such resources and provides the means to reliably merge data from different agencies into a single coherent view."

It might give the expression "cutting through government red tape" a whole new meaning.

