NARA sharpens digital preservation plans

The National Archives released a framework for preserving government information that takes into account the proliferating variety of electronic formats – some now defunct – so that records can be saved in their original form.

When the National Archives and Records Administration first received electronic files in 1970, most of the material was in the form of structured datasets produced by mainframe computers -- primarily ASCII text and Extended Binary Coded Decimal Interchange Code (EBCDIC), an 8-bit character encoding used by IBM and other data processing systems.
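To illustrate why the encoding of a legacy file matters, the short sketch below encodes the same text as ASCII and as EBCDIC and compares the bytes. It assumes Python's built-in `cp037` codec, which is only one of several EBCDIC code pages; the article does not say which code pages NARA actually received.

```python
# Minimal sketch: the same text stored as ASCII and as EBCDIC.
# Assumption: "cp037" (EBCDIC US/Canada) stands in for EBCDIC generally;
# real mainframe data may use a different code page.
text = "HELLO"

ascii_bytes = text.encode("ascii")
ebcdic_bytes = text.encode("cp037")

print(ascii_bytes.hex())   # 48454c4c4f
print(ebcdic_bytes.hex())  # c8c5d3d3d6

# Decoding with the right codec recovers the original text.
round_trip = ebcdic_bytes.decode("cp037")

# Decoding with the wrong codec yields unrelated characters, which is
# why an archive needs to know how a legacy file was encoded.
misread = ebcdic_bytes.decode("latin-1")
```

Here `round_trip` matches the original string while `misread` does not, even though both were produced from the same bytes.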

Fast forward 50 years, and the proliferation and diversity of electronic file formats -- nascent and obsolete -- is hard to get a handle on. Nevertheless, NARA must still take on agency records in a variety of electronic formats -- even ones that are out of use or can only run on out-of-support operating systems.

After months of development and comment, NARA released its Digital Preservation Framework on June 30. The revised framework, which incorporates comments from agencies, experts and stakeholders in the records management field, identifies 16 electronic record category types and offers a set of best practices for managing risk to prevent the loss or diminution of government’s digital work.

The record types range from computerized architectural plans to email to images and video as well as software and code, spreadsheets, GIS data, calendars, databases, word processing documents and more.

But the master list of record types doesn't really get at how complicated some of these record accession challenges are for both agencies and NARA.

Leslie Johnston, NARA's director of digital preservation, leads the effort to manage how agencies and NARA save digital information for the future and make sure it's still available in something approximating its original format. When push comes to shove, though, the archival content of a federal record is more important than the museum experience of viewing it in its original form.

There's another layer of complexity: the length of time agencies retain information before it is considered archival and ripe for accession by NARA.

In an October 2019 interview at NARA's College Park, Md., facility, Johnston told FCW about the development of the framework while it was still in the midst of its comment period.

"When I explain to people what I do, what I always have to say is that my job is to think about the worst possible thing that can happen and try to keep it from happening," she said. "My job is about identifying risk and risk mitigation to the highest level possible. There's no such thing as 100% risk avoidance."

While most records schedules range from seven to 15 years, there are exceptions. Census records, because they contain personally identifiable information, are released in full 72 years after a decennial population count. An architectural design file for a General Services Administration-owned building might be considered an active record for as long as the building is standing.

"We are increasingly receiving a lot of less common formats, but also legacy formats from agencies because of the way the record scheduling works," she said. "It can literally be hundreds of years that they're holding onto a file before they send it to us. We have a real proliferation of file formats we have to manage, process, preserve and then make available."

Obsolete software, orphan specs

Already, there are playback challenges for digital records held by NARA.

"Take, say, an early WordPerfect file from the 1990s," Johnston said. "You might be able to open it in its original format using current WordPerfect -- it still exists -- or Microsoft Word. But you might not fully capture all of the content or the look and feel in that migration."

Databases, spreadsheets, image files, sound files and video all pose their own challenges, based on the software used in their creation and the ability of NARA to supply compatible systems for storage and future use.

"What we have is really a constant risk decision‑making process, what sort of transformation, or accessibility, or playback can you enable that provides as much fidelity as possible of the original record content? Sometimes what you are giving up on is the exact look and feel to get as much of the content as possible," Johnston explained. "We can't be in the business of recreating the original look and feel of every platform that a federal record existed in. That covers not just social media or email, it's all records. We can't recreate the experience of how you worked with something in AutoCAD. We can't work with [virtual reality and augmented reality] and provide a fully‑blown experience for that at this point."

There are also legal and financial issues for NARA and across the information management community when it comes to preserving archival content in its original form.

"Something like software preservation, even if we wanted to use it to process things, that is actually one of the biggest policy issues that makes organizations that do digital preservation the most wary," Johnston said. "They don't always know what they can legally acquire, what they can legally use and what they can legally make available under current copyright law and under whatever licenses the original software manufacturers issued that software."

Social media

One area where the balance between content and original experience is going to play out is social media. Already, the first generation of social networks, such as MySpace and Friendster, and of user-generated content platforms is obsolete. What happens if and when today's popular social media platforms are abandoned in favor of newer rivals?

"This is where it comes down to the concept of preserving the content over the original user experience," Johnston said. "Our transfer guidance about that sort of social media is really about the format they can get it to us in. If an agency has social media that they want to transfer to us, what we would prefer that they do is actually export it from the original platform as JSON, XML, some sort of structured markup. That preserves all of the record content and, as much as it can, the context, who, dates, what, links, image, image links, links to sites, links to news stories, links to press releases -- and we would like that to come to us."
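As a rough illustration of what such a structured export might look like, the sketch below serializes a single post to JSON with its context attached. The `SocialMediaPost` class and every field name in it are hypothetical, chosen only to mirror the elements Johnston lists (who, dates, what, links, image links); they do not represent NARA's transfer guidance or any platform's real export schema.

```python
import json
from dataclasses import dataclass, field, asdict
from typing import List


@dataclass
class SocialMediaPost:
    # Hypothetical field names for illustration only.
    author: str                      # who posted
    posted_at: str                   # when, as an ISO 8601 timestamp
    text: str                        # what was said
    links: List[str] = field(default_factory=list)
    image_links: List[str] = field(default_factory=list)


def export_posts(posts: List[SocialMediaPost]) -> str:
    """Serialize posts to JSON so content and context travel together."""
    return json.dumps([asdict(p) for p in posts], indent=2, ensure_ascii=False)


def load_posts(payload: str) -> List[SocialMediaPost]:
    """Rebuild post objects from a JSON export."""
    return [SocialMediaPost(**item) for item in json.loads(payload)]


# Placeholder data, not a real record.
example = SocialMediaPost(
    author="Example Agency",
    posted_at="2020-06-30T14:00:00+00:00",
    text="Placeholder post text.",
    links=["https://example.gov/press-release"],
    image_links=["https://example.gov/images/photo.jpg"],
)

exported = export_posts([example])
restored = load_posts(exported)
```

The export is plain text that can be read without the original platform, which is the trade-off described above: the content and its context survive, while the platform's look and feel does not.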

One key goal for the Digital Preservation Framework is to have the guidelines expressed in a machine-readable format that can operate across agency systems. Implementing automation and rules-based archiving is critical because of an important impending deadline: By the end of 2022, NARA will stop accepting paper records.

"We expect agencies, after January 1, 2023, to send us digital. It could be born‑digital or digitized, but we expect digital," Johnston said.

The public comment period on the Digital Preservation Framework is open through Nov. 1, 2020.