Archive-friendly PDF in the works

Slimmed-down PDF could become a standard

Two of the largest bankruptcy filings in U.S. history — Enron Corp. and Global Crossing — produced a record number of PDF documents, which federal courts must figure out how to archive and preserve.

The archival challenges those bankruptcies created explain why Stephen Levenson, judiciary records officer for the Administrative Office of the U.S. Courts, is spending much of his time these days working with colleagues on a new international standard for archiving PDF documents.

The open-standard PDF, created by Adobe Systems Inc., has become a widely used format for distributing documents on the Internet because it preserves their original look and makes copying and editing them difficult. Now a modified version called PDF-Archive (PDF-A), which Levenson is committed to, will most likely become an international standard early next year. Companies are expected to immediately offer archiving aids based on the new standard.

Sitting around the table at PDF-A Committee meetings are representatives from companies such as Eastman Kodak Co., Global Graphics Software Ltd., IBM Corp., PDF Sages Inc. and Xerox Corp., said Melonie Warfel, director of worldwide standards at Adobe. But equally involved are representatives from federal agencies such as the Internal Revenue Service, the Library of Congress, and the National Archives and Records Administration.

The PDF-A standard will be a slimmed-down version of PDF, Levenson said. It will be useful for formatting document files that contain multiple pages of text, raster images or vector graphics. However, it will not be suitable for archiving music and video files, he said.

Among federal archivists and records managers, PDF-A is viewed as one of two leading data format candidates for preserving future access to electronic records and documents. The other is Extensible Markup Language. The proposed PDF-A standard restricts what can be stored in an archived file, prohibiting, for example, proprietary encryption schemes and embedded files such as executable scripts. "We don't want embedded files that can do mischief inside our records collection," Levenson said.

PDF-A is based on PDF 1.4, a version of the published and freely available PDF specification that is only slightly outdated. Adobe is at PDF Version 1.6 in its development of the specification. "We'll catch up if we need to," Levenson said. "But in this business of archival preservation, we don't need to go too fast."

Unlike an ordinary PDF file, a PDF-A file will contain type fonts to ensure that electronic documents will look the same in the future as they did when they were created, said Charles Dollar, an electronic records consultant who is chairman of the Standards Board of the Association for Information and Image Management, a nonprofit trade group.

"Typically, the type fonts exist independently of the PDF document," Dollar said. But with PDF-A, they will be embedded in the document. "That's going to increase the storage requirements," he said. But it is a price that must be paid to ensure that type fonts are available when they are needed for reading scientific notation, for instance.

As with any new standard, there is always a risk that too few companies will use it to create new software, but Dollar said he doubts that will be the case with PDF-A.
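
The restrictions described above can be read as a simple checklist. The following Python sketch is illustrative only: DocumentProfile and archival_violations are hypothetical names, not part of the proposed standard or of any Adobe software, and the code does not parse real PDF files. It assumes someone has already summarized a file's features, and it applies only the rules reported in this article: text, raster images and vector graphics only; no encryption; no executable scripts; every font embedded.

```python
from dataclasses import dataclass, field

# Content kinds the article says PDF-A is meant to hold (labels are assumptions).
ALLOWED_CONTENT = {"text", "raster_image", "vector_graphic"}


@dataclass
class DocumentProfile:
    """Hypothetical summary of a file's features; not a real PDF parser."""
    content_kinds: set = field(default_factory=set)
    encrypted: bool = False
    has_executable_scripts: bool = False
    fonts_used: set = field(default_factory=set)
    fonts_embedded: set = field(default_factory=set)


def archival_violations(doc: DocumentProfile) -> list:
    """Return the reasons a profile breaks the rules described in the article."""
    problems = []
    # Music, video and other media fall outside the archival format.
    for kind in sorted(doc.content_kinds - ALLOWED_CONTENT):
        problems.append(f"unsupported content: {kind}")
    if doc.encrypted:
        problems.append("encryption is not permitted")
    if doc.has_executable_scripts:
        problems.append("executable scripts are not permitted")
    # Every font the document uses must travel inside the file.
    for font in sorted(doc.fonts_used - doc.fonts_embedded):
        problems.append(f"font not embedded: {font}")
    return problems


if __name__ == "__main__":
    filing = DocumentProfile(
        content_kinds={"text", "video"},
        encrypted=True,
        fonts_used={"Times"},
        fonts_embedded=set(),
    )
    for problem in archival_violations(filing):
        print(problem)
```

An empty list means the profile passes this simplified checklist. It does not mean a real file would conform to the finished standard, which may contain requirements not mentioned here.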

***

What's new with PDF-A

The PDF standard is popular throughout the federal government for electronic documents, but it is not suitable for archiving permanent records. For that purpose, officials at many federal agencies expect to use a new electronic document format called PDF-Archive (PDF-A). Here's how the two compare:

PDF

Nonarchival format.

Text, raster images, vector graphics, music, video, etc.

International Organization for Standardization (ISO) standard.

Encryption and executable scripts permitted.

No type fonts included.

PDF-A

Archival format.

Text, raster images and vector graphics only.

Future ISO standard.

Encryption and executable scripts not permitted.

Type fonts included.