The ultimate archives

The National Archives may have found a way to make e-records available for generations to come

Four centuries from now, if a historian wants to read Al Gore's or George W. Bush's inaugural address from January 2001, he or she should be able to find it in a snap in the online electronic records archive now being developed by the National Archives and Records Administration.

"The goal is to preserve digital information for at least 400 years,"

say researchers from the San Diego Supercomputer Center, who have provided

much of the scientific brainpower behind the project.

As the federal government shifts more of its work from paper to electronic documents, the National Archives must radically rethink long-term preservation of records. Computers and formats rapidly become obsolete, rendering documents created just a few years ago unreadable. The problem is how to make documents readable centuries from now, when computers beyond imagining today are likely to be in use.

"It has been described as the archival equivalent of the first moon

shot," said John Carlin, archivist of the United States.

Carlin and other archives officials are confident they will have a pilot version of the electronic records archive in operation by 2004 or 2005, at an estimated cost of $130 million.

The Migration Problem

Until recently, the Archives' attempt to build such an electronic archive seemed like a technically impossible dream: In theory, obsolescence can be overcome by migrating electronic data to more modern systems. But at the present pace of evolution, software used to manage archival collections changes every three to five years. Combine that rapid rate of obsolescence with the explosive growth in the number of electronic records, and mass migration, in reality, is impractical.

"The time needed to migrate to new technology may exceed the lifetime

of the hardware and software systems that are being used," eight scientists

from the San Diego Supercomputer Center wrote in a technical paper describing

the new electronic archive.

The migration problem is further complicated by archival rules of order. Official records must remain authentic. That means their contents can't change, and in most instances, neither should their appearance. Paper records always look the same, but electronic records can look very different — or become incapable of being viewed at all — if the software needed to display them properly no longer exists.

That's already a problem for documents created a decade or so ago in formats that are no longer used. "Electronic records are only as good as they are authentic," said Reynolds Cahoon, assistant archivist of the United States and head of the effort to create an electronic archive. "If they aren't authentic, everything is for naught."

Records exist in thousands of formats, and the challenge of keeping up with new ones as they come out and old ones as they are discarded quickly becomes insurmountable. So the archivists concluded that the best way to solve the problem was to avoid dealing with formats altogether.

Finding the Right Language

Carlin dramatized the solution in March when, presenting Congress with his 2001 budget request, he announced that two years of work by computer scientists had led to "a major technological breakthrough" in storage technology for electronic records.

Researchers, he said, had developed methods for storing electronic records that promise to preserve them for hundreds of years and keep them readable despite the obsolescence of the software and hardware used to create them.

Three years ago, scientists would have said it couldn't be done, Carlin said.

"But now they have demonstrated it to us and given us confidence that

in three to five years we will be able to deal with the massive volume of

federal records in various formats and from various generations of technology,"

he said.

Working with the San Diego Supercomputer Center, Georgia Tech Research Institute and several other government agencies, the Archives has discovered a method that promises to permit storing records "totally independent of their software and hardware," Cahoon said.

A process called "persistent object preservation" appears capable of stripping the display characteristics of any electronic document — whether text, spreadsheet, photo or map — and storing it in a format that will allow it to be called up by whatever software is being used in the future.

The format of choice is Extensible Markup Language, or XML, a standard language for transmitting data from one computer to another. "Tags" within XML documents tell the receiving computer how to read and format the data.

Here is how the electronic archive would work: An incoming electronic document would be converted into an XML document. This involves identifying the components of the document using XML document type definitions, replacing proprietary or nonstandard formats with XML tags and preserving information about the document's appearance.
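
As a rough illustration of that conversion step, the sketch below wraps a single e-mail message in XML tags and records its display settings alongside the content. The element names and sample data are hypothetical, chosen only to mirror the kind of structure described here; they are not the Archives' actual document type definitions.

    # Hypothetical sketch of the conversion step: content and display
    # information are captured as tagged XML data rather than left in a
    # proprietary e-mail format. Element names are illustrative only.
    import xml.etree.ElementTree as ET

    def to_xml_record(sender, recipient, date, subject, body,
                      font="Courier", size="10pt"):
        record = ET.Element("record", attrib={"type": "email"})
        ET.SubElement(record, "sender").text = sender
        ET.SubElement(record, "recipient").text = recipient
        ET.SubElement(record, "date").text = date
        ET.SubElement(record, "subject").text = subject
        ET.SubElement(record, "body").text = body
        # Appearance is stored as ordinary data, so no particular e-mail
        # program is needed later to reproduce how the record looked.
        display = ET.SubElement(record, "display")
        ET.SubElement(display, "font").text = font
        ET.SubElement(display, "size").text = size
        return ET.tostring(record, encoding="unicode")

    print(to_xml_record("clerk@agency.gov", "records@example.gov",
                        "2001-01-20", "Inaugural schedule", "Draft attached."))

Because the result is plain tagged text, any future system that can parse XML can recover both the message and the information about how it originally appeared.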

XML tags will also make it easier for search engines to locate documents after they are stored. For example, e-mail messages in XML could be searched by the names of senders and receivers, while omitting names mentioned in the message's text. Document type definitions will also make it possible to link related documents in groups or collections of records, a key requirement in archiving.
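
The sketch below suggests how such a field-level search might work, assuming records shaped like the hypothetical <record> elements above: only the <sender> tag is examined, so a name that merely appears in a message body is ignored.

    # Hypothetical field-level search over tagged records: matching is
    # restricted to the <sender> element, so names mentioned only in
    # message bodies do not produce false hits.
    import xml.etree.ElementTree as ET

    def search_by_sender(xml_records, name):
        matches = []
        for xml_text in xml_records:
            record = ET.fromstring(xml_text)
            sender = record.findtext("sender", default="")
            if name.lower() in sender.lower():
                matches.append(record)
        return matches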

Once converted to XML and tagged, documents would be stored in a "container," which in turn is stored in a "repository." For now, the container is a 100-gigabyte tape cartridge, but that is likely to change as new storage technology is developed. The physical repository is a robotic storage warehouse — or multiple warehouses scattered nationwide and linked electronically.

Presiding over the repository is a computerized "storage resource broker," which functions as middleware between the repository and applications used to store and retrieve records. The storage broker retrieves records and uses document type definitions to reassemble collections of records, wherever they are in the archive.
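
The toy sketch below illustrates that middleware role in the abstract: a catalog maps each record to whichever repository and container holds it, so an application can request a whole collection without knowing where its pieces physically sit. It is a schematic assumption for illustration, not the interface of the actual SDSC Storage Resource Broker.

    # Toy illustration of the middleware role, not the SDSC Storage
    # Resource Broker's real interface. Repositories are modeled as
    # nested dictionaries: repository -> container -> record_id -> record.
    class StorageBroker:
        def __init__(self, repositories):
            self.repositories = repositories
            self.catalog = {}   # record_id -> (repository, container)

        def register(self, record_id, repository, container):
            self.catalog[record_id] = (repository, container)

        def fetch(self, record_id):
            repository, container = self.catalog[record_id]
            return self.repositories[repository][container][record_id]

        def fetch_collection(self, record_ids):
            # Reassemble a collection even when its records are spread
            # across different repositories or containers.
            return [self.fetch(rid) for rid in record_ids]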

Still a Theory

So far, a test version of the electronic archive has passed a number of hurdles, including one that involved taking in a million e-mail messages, converting them to XML documents, tagging them, storing them and calling them back up. The process took less than two days, Archives officials say.

"We can prototype the concept and make it work," Cahoon said. "But we

are nowhere near ready to assemble" an archive as large or complex as the

national electronic archive will have to be.

Even when the electronic archive is up and running, work on it won't be finished, he noted. "You can't just build this once; it's never done. Parts will become obsolete, so you have to constantly evolve. It's designed so any piece of the system can be exchanged for new components" and still remain compatible with the XML-based applications of the other components.

But the burden of constant upgrading is also a benefit. As computing power increases, its price declines. The archivists are counting on that trend to make it economically possible to keep up with the swift-rising volume of records that must be stored, Cahoon said.

The Department of Veterans Affairs is one of the agencies that could benefit early from the electronic archive project. On a daily basis, the VA needs access to veterans' records to process claims and determine eligibility for benefits. "We spend a good amount of time trying to track down records," said VA spokesman Steve Westerfeld. Determining eligibility often takes months. "We're in favor of anything that allows easier access and enables us to get hold of records quicker and serve veterans better."

"The challenge we face as records move more and more to electronic is

how access is going to be provided," Carlin said. The electronic archive

is "on the cutting edge of research and technology. Nothing comparable has

ever been done."
