The Future of Past Email is PDF

By Chris Prom

Archives around the world are filled with handwritten letters and typed memos. But what about correspondence of a later vintage? How should governments, universities, businesses, and archives ensure that future generations can access and render email?

In 2018, this problem led the archives and records community to assess its options. With support from the Andrew W. Mellon Foundation and the Digital Preservation Coalition, a working group authored a comprehensive report, The Future of Email Archives, which examined the many ways that messages can be captured, preserved, and rendered.

The working group noted that some archives and libraries choose to preserve and represent email within platforms that use email-specific formats such as MBOX, EML or PST.  Others maintain or emulate old email environments. A few store messages in XML formats. These approaches require a relatively high level of technical development or support.

Archives, libraries, and other memory institutions have experimented with these approaches, but have not widely implemented them as production services. As a result, many organizations are simply storing format-specific email archives as unprocessed holdings. For this group, email to PDF offers a relatively straightforward migration pathway, with demonstrated downstream benefits.

Why PDF?

One may ask: Why should PDF be considered as a potential target format for archival-quality, preservation-enabled emails? That’s a good question. Answers can be grouped under two headings:

PDF addresses gaps and risks inherent to current email formats and migration pathways

  • PDF includes rich data structures that could fully accommodate the diversity of email content and metadata. Completely self-contained, PDF (and especially PDF/A) is designed to capture text and graphical content for archival purposes. It includes extensive provisions supporting renderings (e.g. of email content), arbitrary files (e.g. email attachments), source data (e.g. the original message in Internet Message Format, IMF), metadata (e.g. header fields), and data to verify authenticity (e.g. digital signatures), all in machine-readable form with full capture of semantics and provenance information.
  • Email-to-PDF provides a migration pathway for readily disseminating aggregated email messages independent of email applications. It would preserve many of the essential attributes of the message, including header metadata, in an easily distributable format that can be opened on any device that includes a basic PDF reader.
  • A standardized application of PDF technology can serve as a stable and structured means of bundling extractable email source data, universally usable archival-quality renderings including attachments, and provenance metadata.
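To make the components named above concrete, here is a minimal sketch, using only the Python standard library, of how the pieces an email-to-PDF bundle would need to carry (header metadata, a renderable body, attachments, and a checksum of the source data) can be pulled out of a single message. It is an illustration under stated assumptions, not the working group's method: the function names, the sample message, and the choice of SHA-256 are all hypothetical, and the step of actually writing these pieces into a PDF is deliberately left out.

```python
import hashlib
from email import message_from_bytes, policy
from email.message import EmailMessage


def build_sample_message() -> bytes:
    """Build a small illustrative message (hypothetical content)."""
    msg = EmailMessage()
    msg["From"] = "alice@example.org"
    msg["To"] = "bob@example.org"
    msg["Subject"] = "Quarterly report"
    msg.set_content("Report attached.\n")
    msg.add_attachment(
        b"col1,col2\n1,2\n",
        maintype="application",
        subtype="octet-stream",
        filename="report.csv",
    )
    return msg.as_bytes()


def inventory_message(raw: bytes) -> dict:
    """List the parts of a message that an archival bundle would carry."""
    msg = message_from_bytes(raw, policy=policy.default)
    body = msg.get_body(preferencelist=("plain", "html"))
    attachments = []
    for part in msg.iter_attachments():
        payload = part.get_payload(decode=True) or b""
        attachments.append(
            {
                "filename": part.get_filename(),
                "content_type": part.get_content_type(),
                "size": len(payload),
            }
        )
    return {
        # Header fields, kept in their original order.
        "headers": [(name, str(value)) for name, value in msg.items()],
        # Text from which a human-readable rendering could be produced.
        "body_text": body.get_content() if body is not None else "",
        # Files that would be embedded alongside the rendering.
        "attachments": attachments,
        # Checksum of the untouched source data (assumed algorithm).
        "source_sha256": hashlib.sha256(raw).hexdigest(),
    }


if __name__ == "__main__":
    inventory = inventory_message(build_sample_message())
    for name, value in inventory["headers"]:
        print(f"{name}: {value}")
    print(inventory["attachments"])
```

Each key in the returned dictionary corresponds to one of the PDF provisions listed above: header metadata, content to render, arbitrary embedded files, and a value that can later help verify that the source data is unchanged.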

Email to PDF migration leverages existing standards and a diverse vendor community

  • There are many use cases for preserving, searching and reusing email from commodity services.
  • PDF makes it possible to integrate interoperable email preservation tools into existing, widely used software such as email servers and clients.
  • Email as PDF could be ingested, stored, preserved and disseminated from established, widely implemented repository systems that are already in use in government, academic, public, and corporate archives and libraries.
  • Because the PDF format is so extensible and widely implemented, a common understanding of best practice for archiving email with PDF would facilitate development of email-specific viewers to provide browsing and searching functions similar to those that exist within email client applications.

In short, the "email archiving in PDF" concept seeks to build on widely implemented standards and technologies.  It would allow individuals and institutions a pathway to migrate email into the most widely used format for the distribution of text documents.

PDF possibilities

PDF is, of course, a marketplace leader in universal document presentation.  But there is a catch.

While PDF is integrated into many email systems, current outputs typically amount to little more than a digital printout. Attachments, metadata, context, and sometimes even searchable text are missing. Simply "printing to PDF" fails to meet the specific needs of institutions archiving volumes of complex email messages, at least as currently implemented.

How can institutions ensure authenticity, completeness, privacy, security and other needs, especially when working with thousands or millions of messages, when most header metadata and attachments are lost in the conversion?
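One rough way to see the scale of this loss is to compare a source message against the text recovered from its converted output. The sketch below assumes that the text of the PDF has already been extracted by some other tool and is passed in as a plain string; the function name, the list of "core" headers, and the sample data are hypothetical choices for illustration, and a substring match is only a crude proxy for completeness.

```python
from email import message_from_bytes, policy
from email.message import EmailMessage

# Header fields treated as essential here (an assumed, illustrative list).
CORE_HEADERS = ("From", "To", "Cc", "Date", "Subject", "Message-ID")


def missing_from_rendering(raw: bytes, rendered_text: str) -> dict:
    """Report header values and attachment names absent from a rendering."""
    msg = message_from_bytes(raw, policy=policy.default)

    missing_headers = []
    for name in CORE_HEADERS:
        value = msg[name]
        # Only check headers that the source message actually has.
        if value is not None and str(value).strip() not in rendered_text:
            missing_headers.append(name)

    missing_attachments = []
    for part in msg.iter_attachments():
        filename = part.get_filename()
        # An unnamed attachment can never be matched by name.
        if filename is None or filename not in rendered_text:
            missing_attachments.append(filename or "(unnamed)")

    return {"headers": missing_headers, "attachments": missing_attachments}


def build_sample_message() -> bytes:
    """Build a small illustrative message (hypothetical content)."""
    msg = EmailMessage()
    msg["From"] = "alice@example.org"
    msg["To"] = "bob@example.org"
    msg["Subject"] = "Budget"
    msg.set_content("See figures.\n")
    msg.add_attachment(
        b"\x00\x01",
        maintype="application",
        subtype="octet-stream",
        filename="figures.xlsx",
    )
    return msg.as_bytes()


if __name__ == "__main__":
    # Text as it might appear in a bare "digital printout" of the message.
    printout = "Subject: Budget\nSee figures.\n"
    print(missing_from_rendering(build_sample_message(), printout))
```

Run over thousands of messages, a report of this kind would give an institution a first, approximate measure of what a given conversion pathway drops.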

In 2019, the Mellon Foundation funded additional work to begin developing a solution. We assembled a small group of experts, some in email archiving and others in PDF. Members included representatives of the Library of Congress, the National Archives and Records Administration, university and state archival institutions, and several PDF technical experts.

The group identified and documented the essential characteristics and technical requirements for converting email into PDF. The work will soon be published as a set of fundamental requirements for archiving email. The recommendations set out an approach to considering ISO 32000 Portable Document Format (PDF) technology as a model for capturing email for long-term archival purposes using open, ISO-standardized technologies.

What’s next

Following the publication of Requirements for Archiving Email using PDF, the working group developing these recommendations will seek additional funding to extend the exploration into a superset specification for PDF, oriented towards the specific needs of email archiving. At present, we are exploring many options, and we would very much welcome thoughts, suggestions, and feedback. I’m only an email message away: prom@illinois.edu!

Chris Prom is Associate Dean for Digital Strategies and Professor at the University of Illinois Library and a Fellow of the Society of American Archivists.

Originally published at https://www.pdfa.org/pdf-days-europe-2020-keynote-the-future-of-past-email-is-pdf-2/