Fugitive documents elude preservationists

GPO, Library of Congress turn to Web harvesting.

Government Printing Office officials, who have a significant role in preserving government information, want to capture fugitive publications, which are documents that federal agencies have published on the Web but for which no copy or record exists in GPO's database.

To recover such documents for preservation, GPO officials are interested in new software technologies such as Web harvesting, and they are reviewing proposals from companies that make such software.

Web harvesting is one of three activities that will contribute to what GPO officials say will be the Future Digital System. In addition, GPO officials plan to convert paper-based government information to digital formats and deposit electronic documents in libraries that are part of GPO's Federal Depository Library Program.

Other federal agencies are also interested in harvesting. As part of the Library of Congress' National Digital Information Infrastructure and Preservation Program, LOC officials awarded grants last fall to several academic and other institutions for creating technologies that preserve Web content and its context.

Information about electronic content, such as the server on which information is stored, will be maintained as part of the library's preservation program. The server's location and the date the content was published are also important to preserve, library officials say. With that additional information, people could compare the White House Web site on the last day of the Clinton administration, for example, with the site's appearance on the last day of the first Bush administration.

Web harvesting, sometimes called crawling or spidering, is more than searching for and discovering information. Harvesting techniques are used for downloading code, images, documents and any files essential to reproduce a Web site after it has been taken down.

Search engines perform only the first step in preserving Web sites for future generations. A search engine typically finds a Web site and indexes it without storing it.
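To make the distinction concrete, the following is a minimal sketch, not any agency's actual software, of the step that separates harvesting from indexing: listing every file a page depends on so each can be downloaded and stored. The sample page, the agency.example address and the choice of tags to follow are all assumptions made for illustration, and the sketch stops short of fetching anything over the network.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

# Tag/attribute pairs that point at files a harvester would need to fetch.
# This short list is an assumption; real pages reference files in more ways.
RESOURCE_ATTRS = {"a": "href", "img": "src", "script": "src", "link": "href"}


class ResourceCollector(HTMLParser):
    """Collects the absolute URL of every file a page refers to."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.resources = []

    def handle_starttag(self, tag, attrs):
        wanted = RESOURCE_ATTRS.get(tag)
        for name, value in attrs:
            if name == wanted and value:
                # Resolve relative links against the page's own address.
                url = urljoin(self.base_url, value)
                if url not in self.resources:
                    self.resources.append(url)


def list_resources(html, base_url):
    """Return the files needed to reproduce the page, in document order."""
    collector = ResourceCollector(base_url)
    collector.feed(html)
    return collector.resources


# Hypothetical page used only to exercise the sketch.
SAMPLE_PAGE = """
<html><head><link rel="stylesheet" href="/css/site.css">
<script src="js/menu.js"></script></head>
<body><img src="images/seal.gif">
<a href="reports/annual.pdf">Annual report</a></body></html>
"""

found = list_resources(SAMPLE_PAGE, "http://agency.example/pubs/index.html")
```

A search engine could stop once it has read the page's text. A harvester would go on to download and store each address in the list, then repeat the process for any linked pages it chose to follow.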

Officials at the National Archives and Records Administration also have an interest in Web harvesting as they develop NARA's Electronic Records Archives. Harvested e-documents, however, are only one type of record of government decision-making for which NARA is creating the archives.

Hard to capture

Librarians agree that many challenges face anyone who attempts to pick through government content on the Web looking for materials to preserve. "Dynamic, interactive aspects are hard to capture," said Martha Anderson, project manager for LOC's Office of Strategic Initiatives.

In the past five years, LOC staff members have collected about 26 terabytes of data. Samples from that collection include an exhibit on the Sept. 11, 2001, terrorist attacks and two presidential election exhibits.

Copyright laws complicate matters because they do not provide clear rules about what can and cannot be harvested from the Web. Eight LOC staff members are kept busy seeking permission from Web page creators to harvest their sites, Anderson said. "If we don't get permission, then we don't collect."

Another major hurdle is digging deep enough into agency Web sites to find and filter the content, said Mike Wash, GPO's chief technical officer. "There's the issue of do you gather too much information, or do you not gather enough?" Wash said.

GPO officials will launch an experimental Web harvesting project with the Environmental Protection Agency to answer such questions. Officials must first decide when a document is a publication, said Emma McNamara, director of information access in the EPA's Office of Environmental Information. Because of the EPA's complex regional structure, officials are notorious for removing information without telling librarians.

GPO officials will evaluate the project by how successfully their harvesting techniques find publications. Such documents typically contain the terms "table of contents" and "appendices," which make automated searches easier.
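As a rough illustration of why those terms help, here is a sketch of a filter that flags a page as a likely publication when its text contains either phrase. The two phrases come from the officials' description. The threshold of one match and the function names are assumptions, and nothing here is drawn from GPO's or the EPA's actual software.

```python
import re

# The two marker phrases are the ones cited in the article. Treating a page
# as a likely publication when it contains at least one is an assumption.
MARKER_PHRASES = ("table of contents", "appendices")


def marker_hits(text):
    """Return the marker phrases found in the text, ignoring case and line breaks."""
    normalized = re.sub(r"\s+", " ", text).lower()
    return [phrase for phrase in MARKER_PHRASES if phrase in normalized]


def looks_like_publication(text, min_hits=1):
    """Flag text as a likely publication if enough marker phrases appear."""
    return len(marker_hits(text)) >= min_hits
```

A filter this simple would miss publications that lack both phrases and would flag some pages that are not publications, which is the too much or too little trade-off Wash describes.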

Before embarking on the Web harvesting project, EPA and GPO officials discussed alternatives. One option was to have an intern search the EPA's Web site manually, comparing the holdings in the agency's libraries to GPO's index of publications and providing GPO officials with a daily update on new publications as officials put them on the Web.

But officials abandoned that option in favor of Web harvesting, which they hope will be more efficient.
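Whether done by an intern or by software, the comparison itself reduces to finding titles that appear on the agency's site but not in GPO's index. The sketch below assumes both lists are available as plain titles and that matching on a normalized title is good enough. The report titles are invented for the example, and a real catalog comparison would need more robust matching.

```python
def normalize(title):
    """Collapse whitespace and ignore case so trivial differences still match."""
    return " ".join(title.split()).casefold()


def find_fugitives(web_titles, gpo_index_titles):
    """Return titles found on the Web that have no match in the GPO index."""
    indexed = {normalize(title) for title in gpo_index_titles}
    return [title for title in web_titles if normalize(title) not in indexed]


# Hypothetical titles used only to exercise the sketch.
web_titles = [
    "Regional Water Quality Report",
    "Air Monitoring  Handbook",
    "Wetlands Survey Summary",
]
gpo_index_titles = ["air monitoring handbook", "Regional Water Quality Report"]

fugitives = find_fugitives(web_titles, gpo_index_titles)
```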

E-information experts offer myriad explanations for the problem of fugitive publications, but they agree that Web harvesting is important.

Patrice McDermott, deputy director of the American Library Association's Office of Government Relations, said Web harvesting could offer access to hidden government information. For political or security reasons or because federal officials want to update their agencies' Web sites, officials remove information from sites, McDermott said. "You have no guarantee that they or you will be able to find it again," she said.

Although some unofficial sites, such as the Memory Hole, purport to offer previously unavailable government reports, fugitive publications on such sites might not be authentic, McDermott said. Recent additions to the Memory Hole include "The U.S. Army Radioactive Waste Disposal Program in the United States," a 1966 "Official Use Only" report, and another publication titled "Disease Vector Risk Assessment Profiles from the Navy."

When publications no longer appear on official government Web sites, people cannot be certain the documents are authentic or know which version they are viewing, McDermott said. Agency officials should send final versions of their electronic publications to GPO, but that is not happening, she added.

Others say federal officials are doing their best to comply with the E-Government Act, e-records and other government guidelines. The rules are evolving, said Dan Chenok, former branch chief for information policy and technology at the Office of Management and Budget.

Chenok, now a vice president at SRA International, which sells information technology services to the government, said solving the fugitive document problem is crucial. "Keeping proper records of the government's activities is important for government access, government accountability and historical preservation," he said.

Open-source tools

Through the LOC's digital preservation program, institutions that received grants are making progress on new harvesting techniques. Officials at the Online Computer Library Center, working with several academic, state and government institutions, are developing open-source tools for state libraries to analyze the content of specific Web domains, such as www.nasa.gov.

Taylor Surface, director of the center's Digital Content Management Services, said the new tools will let state library officials harvest only what they need to preserve.

Another institution, the California Digital Library, is using an LOC grant to design harvesting techniques that will preserve a Web-based history of the 2003 California gubernatorial recall election and other political events. Many of the recall election activities occurred online, said Patricia Cruse, director of the California library's digital preservation programs.

All the candidates had Web sites, Cruse said, and in the future, people will want to know their names, how their opinions changed during the campaign and what they looked like. But one challenge, she said, will be informing users that the Web sites they are looking at were created in 2003 and are no longer active.

An uncertain harvest

Harvesting techniques can be used for downloading software code, images, text documents and any additional files essential for reproducing a Web site after the original site is taken off-line.

But such techniques have drawbacks, in part because:

  • Interactive Web sites are difficult to capture.
  • Copyright laws fail to provide clear rules about what can and cannot be harvested from the Web.
  • Web harvesting software often collects too much or too little information.
— Aliya Sternstein

    Opening Pandora's box?

    The library of the land down under may be ahead of the Library of Congress in terms of Web harvesting. Preserving and Accessing Networked Documentary Resources of Australia (Pandora) is a collection of Australian-born digital publications. Pandora (pandora.nla.gov.au/subject/21) was established by the National Library of Australia in 1996. LOC's Mapping the Internet Electronic Resources Virtual Archive (Minerva) (www.loc.gov/minerva), initiated in fiscal 2000, is a comparable project to capture, save and preserve collections from select Web sites for future generations.

    — Aliya Sternstein
