A crisis for Web preservation

Fugitive documents published on the Web are not being preserved

The Federal Depository Library Program has fallen behind in cataloging and preserving access to government documents published only on the Web. As a result, public access to those publications is spotty at best.

"This is not a problem; this is a crisis," said Daniel Greenstein, head of the California Digital Library, which serves the 10 universities in the University of California system. He said information is disappearing from government Web sites at an alarming rate.

At the Government Printing Office, which runs the depository library program, officials are struggling with the problem, known as fugitive documents, said Judith Russell, superintendent of documents. Fugitive documents are electronic publications that remain outside the federal depository collections in 1,300 libraries nationwide.

To capture those publications automatically, GPO officials may turn to Web-harvesting technologies. In May, agency officials published a notice asking vendors to submit information about Web-crawler and data-mining technologies that could assist in locating fugitive government publications.

But GPO officials might find automated Web-harvesting techniques inadequate for capturing much of the information that the government publishes online.

According to Greenstein, Web crawlers are fairly good at capturing documents from the Web surface, but the technology is not as good at capturing information from the Deep Web.

The Deep Web is where databases and dynamic Web pages — pages that gather information in response to users' requests — reside. A recent California Digital Library study found that about 85 percent of the Deep Web is in the .gov domain.

Before the Web existed, GPO officials' jobs were easier. When the depository library program was created, the office printed all government documents and distributed them to depository libraries nationwide, Russell said.

Now the number of government documents published on the Web each year exceeds the number of print publications. Many online publications remain uncataloged and unavailable at depository libraries because federal officials fail to notify GPO that the publications exist.

Greenstein said the agency must redefine what is appropriate for inclusion in an electronic archive of government information. Even copyright issues are clouded in the online publishing world. No one is certain, for example, whether the rights are free and clear when independent contractors supply government information, he said.

GPO officials have made a good start on creating an electronic repository. The agency's electronic archives hold more than 100,000 government documents. The office also offers links to government information on other agencies' Web sites.

But GPO officials have no idea how many fugitive government documents are on the Web. "If we knew where they were to count them, then they wouldn't be fugitive," Russell said.

Experts disagree on whether GPO's plan for capturing fugitive documents puts the agency ahead of the commercial software market. Clarence Morey, an official at Internet Security Systems Inc., said the same Web-filtering technologies that the company uses to block spam and pornography could be applied to finding fugitive government documents.

Other experts, such as Jeff Young, software architect for the Online Computer Library Center, a nonprofit computer library service, said librarians will be interested in how GPO officials solve the depository library problem, especially if they do it with open-source software tools that can be shared with other libraries.

GPO officials have sought help from university libraries. They have made special arrangements, for example, with the University of North Texas Libraries for the university to maintain a collection known as the Cyber Cemetery.

But other institutions, such as the California Digital Library, cannot afford to wait while GPO officials look for ways to solve the fugitive document problem. Officials at the research library have committed their employees and money to creating and maintaining their own electronic collection of government information.

The depository library program "is broken," Greenstein said. "We have to solve the problem because GPO isn't."