Introduction
The long-term maintenance of digital assets for use and re-use poses a number of challenges, including the limitations of storage technologies and the choice of future-proof file formats. In the context of the latter challenge, digital archives, for example, must be able to handle a number of different media formats such as audio or video recordings or textual documents. One variety of digital asset is the page-oriented, text-centric document, as generated, for example, by office productivity software. The native format in which such documents were originally created is often not suitable for long-term archival (Anderson, 2005). Dryden (2008) stresses the need for digital file formats designed for long-term archival, stating ‘it is not an exaggeration to say that long-term preservation of digital objects is the biggest challenge facing not just the archival profession but society as a whole.’
A common choice (Library of Congress, 2019) is, therefore, to convert those documents to PDF, which has properties attractive for archival, such as being ‘read-only’ and being able to reproduce the original document across different devices (even web browsers can display PDF files, see Mozilla Labs, 2020).
In the context of long-term archival, how can it be guaranteed that PDF files can be read in a future without today’s computer systems? Here, ‘reading’ is not limited to the extraction of text and images, but also includes the visual appearance, logical structure, and metadata of a document. Various ISO standards (ISO, 2005, 2011, 2012a) specify subsets of ‘normal’ PDF variants under the name ‘PDF/A’ in order to address those requirements; that is, it should be possible to read a standard-conformant PDF/A file solely by implementing the ISO standards.
The importance of transitioning from PDF to PDF/A can be illustrated by the following analogy:
Pressure from the preservation community provided the catalyst for many publishers to change over from acidic to acid-neutral paper in the production of published works. Introducing more stable materials at the beginning of the information production process represents a significant victory for preservation interests which in the long run will reduce the need for salvage efforts. (Hedstrom, 1998)
Whereas there is broad agreement that PDF/A standards are the preferred choice when archiving PDF files (Bundesarchiv, 2010; LAC, 2015; Riksarkivet, 2009; Rog, 2007; Swiss Federal Archives, 2020), adopting PDF/A standards in a PDF workflow poses multiple challenges. A central aspect here is how to determine whether a given PDF file actually conforms to a PDF/A standard, usually at least to the most basic specification, PDF/A-1b. Public sector organizations in particular, such as universities, which have a legal obligation to archive important documents (SFS, 1993, 2012), are motivated to adopt PDF/A both to save costs (less physical storage is required) and as part of general ‘modernization’.
This study investigates the following research questions specifically related to the long-term archival of PDF/A files by public sector organizations:
RQ 1: What characterizes PDF files provided by public sector organizations?
RQ 2: How successful are public sector organizations at providing PDF/A-1b-conformant files?
RQ 3: How and why does the outcome of assessments of PDF/A-1b conformance for files differ between conformance checking tools?