GPO awaits Web-harvesting technology

Agency seeks technology that can query specific classes of information.

Information technology innovation has caused a setback for the Government Printing Office, the agency responsible for distributing government publications.

GPO had to postpone a technology procurement that is crucial to its new digital distribution system because it anticipates a technical breakthrough in the area, GPO officials and industry experts said.

GPO officials canceled a request for proposals for Web-harvesting tools Oct. 17, saying the technologies submitted by the earlier February deadline were already outdated. The RFP called for solutions that could recover fugitive documents, which are materials that federal agencies have published on the Web but for which no copy or record exists in GPO's database.

"We have decided to refresh the RFP for procuring a Web-harvesting tool," said Mike Wash, GPO's chief technical officer. "New technologies have been developed since then, and as such, we feel that it is most prudent to get experience evaluating the latest technology for this critical need."

GPO's dilemma is only one example of a tug of war in which the pace of computing progress competes with slow government funding. Federal officials are yanked in both directions. The challenge is deciding when to cast a net into the rapids. Officials often have only one shot at acquiring a sophisticated solution that must last far into the future.

For example, managers at the U.S. Census Bureau are buying handheld computers for population headcounts that must still be useful five years from now.

Web harvesting is vital to GPO's plans for a new distribution structure, the Future Digital System. The agency plans to create an architecture that can verify and track all versions of official government documents. Officials say the system's design will ensure authenticity of government information and provide permanent public access to it.

"We only want to do this once," GPO Public Printer Bruce James said. The question is "how do we get it right the first time so we don't have to do it again 20 years from now."

GPO's new RFP, which the agency expected to release last week, calls for semantic logic solutions. James described semantic logic as algorithms capable of querying Web content for specific classes of information. The agency was concerned that the earlier proposals would have generated too many false positives.

James said the cancellation would have no effect on the timeline of the Future Digital System, which GPO expects to be operational by July 2007.

However, Web-harvesting experts from the Internet Archive, Online Computer Library Center (OCLC) and Mitre said the RFP's wording must have caused GPO officials to rethink their requirements, adding that no new Web-harvesting technologies have appeared in the past year that would have made the original RFP out of date.

Michele Kimpton, Web archive director at the Internet Archive, said the original RFP requirements kept her organization from contributing a solution. The archive, a nonprofit entity that harvests historical information in digital format, recently announced that it is working with industry, academia and foreign governments on a massive depository project called the Open Content Alliance.

"My guess is that they read a lot of these proposals and did a little investigation and said there was a mismatch" between their request and current harvesting technology, she said. "For us to use our technology, we couldn't make it fit the exact proposal."

Web harvesting, sometimes called crawling or spidering, is more than searching for and discovering information. Organizations use harvesting techniques to download code, images, documents and any files essential to reproducing a Web site after it no longer resides on the Internet.
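A minimal sketch in Python, which assumes nothing about GPO's actual tools, illustrates the difference: instead of returning a list of search results, it saves a page's HTML along with the images, scripts and stylesheets needed to redisplay the page later. The starting URL and output folder are placeholders.

    # A minimal harvesting sketch, not GPO's tool: save a page's HTML plus the
    # images, scripts and stylesheets it references so the page can be
    # reproduced offline. The starting URL and output folder are placeholders.
    import os
    import urllib.parse
    import urllib.request
    from html.parser import HTMLParser

    class AssetCollector(HTMLParser):
        """Collect URLs of images, scripts and stylesheets found on a page."""
        def __init__(self):
            super().__init__()
            self.assets = []

        def handle_starttag(self, tag, attrs):
            attrs = dict(attrs)
            if tag in ("img", "script") and attrs.get("src"):
                self.assets.append(attrs["src"])
            elif tag == "link" and attrs.get("rel") == "stylesheet" and attrs.get("href"):
                self.assets.append(attrs["href"])

    def harvest(url, out_dir="harvest"):
        os.makedirs(out_dir, exist_ok=True)
        html = urllib.request.urlopen(url).read().decode("utf-8", errors="replace")
        with open(os.path.join(out_dir, "index.html"), "w", encoding="utf-8") as f:
            f.write(html)                       # keep the page itself
        collector = AssetCollector()
        collector.feed(html)
        for asset in collector.assets:          # keep the files the page needs
            asset_url = urllib.parse.urljoin(url, asset)
            name = os.path.basename(urllib.parse.urlparse(asset_url).path) or "asset"
            try:
                urllib.request.urlretrieve(asset_url, os.path.join(out_dir, name))
            except OSError:
                pass                            # skip assets that cannot be fetched

    if __name__ == "__main__":
        harvest("https://www.example.gov/")     # placeholder starting point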

Other federal agencies are also pursuing harvesting. As part of the Library of Congress' National Digital Information Infrastructure and Preservation Program, LOC officials are extracting Web content and its context.

Information about electronic content, such as the servers that store it, will be maintained as part of the library's preservation program. With those additional details, researchers could compare the George W. Bush for President Web site with the Al Gore/Joseph Lieberman campaign Web site as both appeared on Election Day 2000.

Gordon Mohr, technical lead for Web-crawling projects at the Internet Archive, said GPO might be waiting for the marketplace to perfect a technology known as topical harvesting. Topical harvesting, a more intelligent means of crawling, analyzes words and phrases to gather Web materials. Instead of a user defining terms for the crawler, the crawler deduces where to go. For instance, topical harvesting could analyze previous crawls of the Centers for Disease Control and Prevention Web site, then see numerous instances of the words "avian flu" and decide to harvest all government materials on vaccines from all Department of Health and Human Services sites.
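A simplified Python sketch of the idea, with a plain keyword count standing in for the real linguistic analysis, shows how a topical crawler decides on its own which pages to store and which links to follow. The seed URL, topic terms and relevance threshold are assumptions made for illustration.

    # A simplified topical-harvesting sketch: a plain keyword count stands in
    # for the real analysis, and the crawler itself decides which pages are on
    # topic and which links are worth following. Seed, terms and threshold are
    # illustrative assumptions.
    import re
    import urllib.parse
    import urllib.request
    from collections import deque

    TOPIC_TERMS = ("avian flu", "vaccine", "influenza")

    def on_topic(text, threshold=3):
        """Count topic-term occurrences; call the page relevant past a threshold."""
        text = text.lower()
        return sum(text.count(term) for term in TOPIC_TERMS) >= threshold

    def topical_crawl(seed, max_pages=50):
        seen, queue, harvested = {seed}, deque([seed]), []
        while queue and len(harvested) < max_pages:
            url = queue.popleft()
            try:
                page = urllib.request.urlopen(url).read().decode("utf-8", errors="replace")
            except OSError:
                continue
            if not on_topic(page):
                continue                # off-topic pages are neither stored nor expanded
            harvested.append(url)
            for href in re.findall(r'href="([^"#]+)"', page):
                link = urllib.parse.urljoin(url, href)
                host = urllib.parse.urlparse(link).hostname or ""
                if host.endswith(".gov") and link not in seen:
                    seen.add(link)      # follow links only from on-topic .gov pages
                    queue.append(link)
        return harvested

    if __name__ == "__main__":
        print(topical_crawl("https://www.cdc.gov/"))    # hypothetical seed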

"The software determines which sites should and shouldn't be included," Mohr said. "It's an area of continuing and future work."

GPO could use topical harvesting to focus a crawler on fugitive documents.

"It sounds like they, the GPO, have a very specific topic that they are interested in: Documents that have fallen through the cracks," Mohr said. "And that could be thought of as a theme for crawling."

Evolving standards and tools

Other researchers say semantic logic should prove valuable in GPO's initiative.

Eric Hughes, senior principal engineer at Mitre, said some semantic Web standards existed when GPO issued the original RFP, but commercial tools for implementing those standards are still evolving.

"We're just beginning to see Web content use these standards," Hughes said. "The resulting ability to search the Web using semantics does offer promise to decrease the number of false positives."

The Semantic Web has two parts: data representation and logic. Current semantic standards define basic data representation as concepts and relationships, or links between concepts. Semantic logic is the ability to specify frameworks that computers then use to draw conclusions from data representations.

Semantic "logic can range from simple data constraints requiring that the value of the person-age concept be a number between 0 and 140 to more complex things," Hughes said.

For example, semantic logic can define a marriage relationship such that a person can be married to at most one person, who must be married to them, except in cultures that permit multiple spouses.
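A toy Python sketch, using plain tuples in place of a real triple store, captures both pieces Hughes describes: statements serve as the data representation, and two small rules, the 0-to-140 age constraint and a mutual, single-spouse marriage rule, serve as the logic a program can check. The names and data are invented for illustration.

    # A toy sketch of the two Semantic Web pieces described above, assuming
    # plain Python tuples stand in for a real triple store: data representation
    # is a set of (subject, relation, object) statements, and "logic" is a pair
    # of rules a program can check against them. Names and data are invented.
    DATA = [
        ("alice", "person-age", 34),
        ("bob", "person-age", 212),        # violates the age constraint
        ("alice", "married-to", "bob"),
        ("bob", "married-to", "alice"),
    ]

    def check_age(triples):
        """Simple data constraint: person-age must be a number between 0 and 140."""
        return [(s, o) for s, p, o in triples
                if p == "person-age" and not (isinstance(o, (int, float)) and 0 <= o <= 140)]

    def check_marriage(triples):
        """Richer rule: married-to must be mutual, with at most one spouse per person."""
        spouses, problems = {}, []
        for s, p, o in triples:
            if p != "married-to":
                continue
            if s in spouses and spouses[s] != o:
                problems.append((s, "married to more than one person"))
            spouses[s] = o
        for s, o in spouses.items():
            if spouses.get(o) != s:
                problems.append((s, "marriage is not mutual"))
        return problems

    print(check_age(DATA))       # [('bob', 212)]
    print(check_marriage(DATA))  # []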

Semantic logic can help harvesting work better, but the creators of Web data would have to use logic appropriately and then create search tools that take advantage of semantics, he said.

"I'm not aware of any larger-scale demonstrations of the use of richer logic to improve Web search or harvesting, but it's clearly an area where lots of good research is being done," Hughes added.

When GPO officials finally put their bets on the table, they will have to watch the game closely.

Taylor Surface, manager of digital collection services at OCLC, is part of the Library of Congress' digital preservation program. Researchers at the University of Illinois and OCLC are working to develop open-source tools for state libraries to analyze the content of specific Web domains, such as www.whitehouse.gov.

"Whatever they procure is going to need to be adaptable for change in the future," Surface said. "They need to continue to be flexible in this field of emerging technology."