In this guide you'll learn how to archive a wide variety of document types in SharePoint by converting them to the PDF/A format. This guide applies to both SharePoint Online and On-Premises deployments.
What does PDF/A aim to achieve?
PDF/A aims to produce files with static content that can be visually reproduced precisely, both today and in many years' time. Files that are subject to long-term archiving should work regardless of the device or operating system used. The future usability of PDF/A files must also be guaranteed in a manufacturer-independent manner, and this includes Adobe. PDF/A is a ‘complete’ format: PDF files that comply with the PDF/A standard are complete in themselves and use no external references or non-PDF data. The PDF/A-1 standard is based on PDF specification 1.4, which means that it works within the technical scope of the functions available in Acrobat 5.
A range of rules must be observed when generating PDF/A files in order to meet the goals named above. For example, when generating PDF/A, it is important to embed all fonts and clearly specify all colors. Forms, comments, and notes are only permitted to a limited extent. Compression is allowed as a general rule, but LZW and JPEG2000 are excluded. Transparent objects and layers (Optional Content Groups) are not permitted. PDF/A uses rules for metadata that are based on XMP (Extensible Metadata Platform). Finally, a PDF/A file must identify itself as such.
Please note that you need the OCR and PDF/A Archiving Add-On license in addition to a valid PDF Converter for SharePoint On-Premises or PDF Converter Services license to use this functionality.
Muhimbi’s interpretation of the PDF/A standard
There are as many PDF/A validators as there are interpretations of the specification. The validator we use internally for testing purposes is the one that comes with Adobe Acrobat Pro X.
In the screenshot below you can see the validation results of a PowerPoint file that was converted to PDF by the Muhimbi PDF Converter and subsequently post processed for output as PDF/A1b. As you can see it validates perfectly well. The same is true for conversion to PDF/A2b.
PDF/A File generated by the Muhimbi PDF Converter
As you can see in the screenshot below, the same document saved by PowerPoint itself in PDF/A format does not validate successfully.
PDF/A File generated by PowerPoint
The validator checks many rules. Some are important, such as font embedding. Others are perhaps less so, such as whether the Modification Date stored in the PDF exactly matches the one stored in the XMP metadata.
Before we start building the workflow, ensure all prerequisites are in place. It is also assumed that the reader has some knowledge of building Workflows using Nintex Workflow.
Make sure the PDF Converter for SharePoint On-Premises version 4.1 (or newer) is installed.
Naturally, Nintex Workflow will need to be installed as well.
Make sure the Muhimbi.PDFConverter.Nintex.WebApp SharePoint Feature is activated using SharePoint Central Administration on the relevant Web Application.
You will need to have the appropriate privileges to create workflows.
If you intend to carry out PDF/A post processing, please configure Ghostscript as per the section Configuring the Muhimbi PDF Converter to use PDF/A.
Configuring the Muhimbi PDF Converter to use PDF/A
The Muhimbi PDF Converter relies on 3rd party software to carry out the PDF/A post processing step. Fortunately this software is free to download and install for both individuals and organizations. The actual use of the software is less than trivial, but our software takes care of all the complexities.
Installation of this software is optional and only required if you intend to carry out any PDF/A post processing. The steps are as follows:
Download the latest GPL Release from the Ghostscript website. Depending on your hardware and operating system, you will need to download either the 32 or 64 bit version. Muhimbi has tested the PDF/A1b post processing with versions 9.04 and later. PDF/A2b requires Ghostscript 9.06 or later.
Install Ghostscript in a location of your choice on every server that runs the Muhimbi Conversion Service. If you accept the default location, or the default location on a different drive, then the Muhimbi PDF Converter will automatically detect Ghostscript’s file path.
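To make the version requirements and the default install location from the two steps above concrete, here is a minimal Python sketch. It is not how the Muhimbi PDF Converter detects Ghostscript internally. The folder pattern is inferred from the example path given for the Ghostscript.Path setting, and the function names are made up for illustration.

```python
import re
from pathlib import Path

# Minimum Ghostscript versions stated in this guide for each PDF/A profile.
MIN_GS_VERSION = {"PDF_A1B": (9, 4), "PDF_A2B": (9, 6)}


def parse_version(text):
    """Turn a string such as '9.04', 'gs9.06' or '10.02.1' into a tuple of ints."""
    return tuple(int(part) for part in re.findall(r"\d+", text))


def supports_profile(gs_version, profile):
    """Return True if the Ghostscript version meets the guide's minimum for profile."""
    return parse_version(gs_version) >= MIN_GS_VERSION[profile]


def find_ghostscript(roots, exe_name="gswin64c.exe"):
    """Search <root>/Program Files/gs/gs<version>/bin/<exe_name> under each root.

    The layout is an assumption based on the example path in this guide.
    Returns the executable with the highest version, or None if none is found.
    """
    candidates = []
    for root in roots:
        for exe in Path(root).glob(f"Program Files/gs/gs*/bin/{exe_name}"):
            # The version is taken from the 'gs<version>' folder name.
            version = parse_version(exe.parent.parent.name)
            candidates.append((version, exe))
    if not candidates:
        return None
    return max(candidates, key=lambda item: item[0])[1]
```

For example, find_ghostscript(["C:/", "E:/"]) searches both drives, and supports_profile("9.04", "PDF_A2B") returns False because PDF/A2b requires Ghostscript 9.06 or later.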
Once Ghostscript has been installed you are ready to go, provided you access our Web Services based interface from your own software and the PDFA.PostProcessing configuration value discussed below is set appropriately (details for use with SharePoint can be found further below). Just set the web service’s ConversionSettings.PDFProfile property to PDF_A1B and any converter that doesn’t natively support PDF/A1b output will automatically send the generated PDF file to the post processor. However, depending on your exact needs you may want to update a number of settings in the Muhimbi Conversion Service’s config file.
Open ‘Muhimbi.DocumentConverter.Service.exe.config’ in your favorite text editor. The file is stored in the same directory where the Muhimbi Conversion Service has been installed. You can find a shortcut to this folder in the ‘Muhimbi Document Converter’ group in your start menu.
Change the following settings if needed:
Ghostscript.Path: Leave the path empty to auto detect the path. When manually specifying the path include the executable as well, e.g. "E:\Program Files\gs\gs9.04\bin\gswin64c.exe"
PDFA.PostProcessing: Not all converters are able to provide native PDF/A1b output. Use this setting to post-process any generated PDF file. Valid values are 'All' (Post Process files generated by all converters, including the ones that are supposed to already support PDF/A1b), 'WhenNeeded' (Post process files for only those converters that do not support native PDF/A1b output) or 'None' (Do not post process files generated by any converters. This is the default option). Please note that these values will only be used if the output format is set to PDF_A1B, either in the web service call or via the global 'ConversionSettings.ForcePDFProfile' config value.
PDFA.RasterizeTransparentContent: Define how transparent content is dealt with during conversion to PDF/A. The default setting (False) removes all transparency. If you wish to retain transparent objects then set this value to True, which will result in pages being rasterized resulting in considerably larger and slower PDF files.
ConversionSettings.ForcePDFProfile: Override the web service’s ConversionSettings.PDFProfile value during conversion. Leave empty to use the setting specified in the web service call. Accepted values are members of the Muhimbi.DocumentConverter.WebService.Data.PDFProfile enumeration or an empty string. For example: 'PDF_1_5' (Use PDF Version 1.5) or 'PDF_A1B' (Use the PDF/A standard for long term archiving).
Restart the Muhimbi Conversion Service from the Windows Services Management Console.
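The way these settings interact can be summarised in a short Python sketch. It only models the rules described above and is not the product's implementation. The sample XML assumes the values are stored as standard .NET appSettings entries, which should be checked against the actual config file. The function names are hypothetical, and treating PDF_A2B the same as PDF_A1B is an assumption, since the setting description names only PDF_A1B.

```python
import xml.etree.ElementTree as ET

# Hypothetical excerpt: assumes a standard .NET <appSettings> layout.
SAMPLE_CONFIG = """
<configuration>
  <appSettings>
    <add key="Ghostscript.Path" value="" />
    <add key="PDFA.PostProcessing" value="WhenNeeded" />
    <add key="PDFA.RasterizeTransparentContent" value="False" />
    <add key="ConversionSettings.ForcePDFProfile" value="PDF_A1B" />
  </appSettings>
</configuration>
"""

# The text names PDF_A1B; PDF_A2B is included here as an assumption.
PDFA_PROFILES = {"PDF_A1B", "PDF_A2B"}


def read_settings(config_xml):
    """Collect every <add key=... value=...> entry into a dictionary."""
    root = ET.fromstring(config_xml)
    return {add.get("key"): add.get("value", "") for add in root.iter("add")}


def effective_profile(settings, requested_profile):
    """ForcePDFProfile overrides the web service value unless it is empty."""
    forced = settings.get("ConversionSettings.ForcePDFProfile", "")
    return forced or requested_profile


def needs_post_processing(settings, requested_profile, converter_is_native_pdfa):
    """Apply the All / WhenNeeded / None rules for PDFA.PostProcessing."""
    if effective_profile(settings, requested_profile) not in PDFA_PROFILES:
        return False  # The setting is ignored for non PDF/A output.
    mode = settings.get("PDFA.PostProcessing", "None")
    if mode == "All":
        return True
    if mode == "WhenNeeded":
        return not converter_is_native_pdfa
    return False  # 'None' is the default: never post process.
```

With the sample values, a request for PDF_1_5 is forced to PDF_A1B, and the file is post processed only when the converter that produced it lacks native PDF/A output.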
If you don’t directly interact with Muhimbi’s Web Services interface, but rather use the SharePoint Front End functionality that comes with the PDF Converter for SharePoint, e.g. workflow activities or manual PDF Conversion, then you MUST set the ForcePDFProfile configuration value to PDF_A1B or PDF_A2B. Please note that this is a global switch that forces all functionality provided by our product to output in PDF/A format. This may have side effects when applying PDF Security. When using SharePoint Workflows you can also specify the PDF Profile on a conversion by conversion basis, see this post.
Sample code to convert PDF Documents to PDF/A using our web services interface can be found here. If you don’t wish to use Ghostscript, or if your organization already uses a different PDF to PDF/A converter that you wish to integrate with, then please contact us.
Issues and Limitations
At this moment in time we are aware of the following issues:
64 bit only: 32 bit versions of Ghostscript older than 9.04 contain a bug that may interfere with PDF/A output. With such a version you need to install the 64 bit version of Ghostscript, which means the entire solution only runs successfully on a 64 bit machine running a 64 bit Operating System. This bug has been fixed in version 9.04 and above.
Merged Documents: When documents are merged using the Muhimbi PDF Converter, and subsequently post processed for PDF/A output, Acrobat’s validator shows the ‘CIDSystemInfo and CMap dict not compatible’ validation error. It doesn’t happen for all document types and we are confident this message does not have any significant side effects.
There are other side effects inherent to the PDF/A1b standard for which we currently do not have a workaround. For example, as transparency is not supported, any documents that use (semi) transparent objects may not look the same as the source document. Similarly, because fonts MUST be embedded in PDF/A compliant documents, the resulting PDF file may be larger than expected, although in many cases you will find it to be smaller than the source file.
Convert to PDF/A format using Nintex Workflow
The following example explains how you can use our ‘Convert document’ action along with overrides to convert to PDF/A format. This is a single step workflow, configured in a Document Library to start when any document is created or updated. From a high level, the workflow will look like the below:
Add a single action to the workflow: the Convert document action, listed under the Muhimbi PDF section in Nintex. Fill in its settings as per the image below:
You may want to leave the Destination Path empty, which will write the PDF file to the same location as the source file. Hover the mouse over the small information icons for more information about the Destination Path or any other field.
You will also need to add the following override under the Content section:
This override will make sure that all converted files conform to the PDF/A2b standard. The ‘PDFProfile’ element supports the following values:
- PDF_A1B: Use the PDF/A1b standard for long term archiving.
- PDF_A2B: Use the PDF/A2b standard for long term archiving.
- PDF_A3B: Use the PDF/A3b standard for long term archiving.
- PDF_1_1: PDF 1.1 output (Compatible with Acrobat 2.0 (1994) and later).
- PDF_1_2: PDF 1.2 output (Compatible with Acrobat 3.0 (1996) and later).
- PDF_1_3: PDF 1.3 output (Compatible with Acrobat 4.0 (2000) and later).
- PDF_1_4: PDF 1.4 output (Compatible with Acrobat 5.0 (2001) and later).
- PDF_1_5: PDF 1.5 output (Compatible with Acrobat 6.0 (2003) and later).
- PDF_1_6: PDF 1.6 output (Compatible with Acrobat 7.0 (2005) and later).
- PDF_1_7: PDF 1.7 output (Compatible with Acrobat 8.0 (2006) and later).
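The list above can be captured in a small lookup table, which helps catch a mistyped profile name before it is pasted into an override. The following Python sketch is illustrative only: the helper names are hypothetical, and it generates just the PDFProfile element. Any enclosing elements the override requires are not reproduced here and should be taken from the override shown in this guide.

```python
import xml.etree.ElementTree as ET

# Profile values and descriptions as listed in this guide.
PDF_PROFILES = {
    "PDF_A1B": "PDF/A1b standard for long term archiving",
    "PDF_A2B": "PDF/A2b standard for long term archiving",
    "PDF_A3B": "PDF/A3b standard for long term archiving",
    "PDF_1_1": "PDF 1.1 output",
    "PDF_1_2": "PDF 1.2 output",
    "PDF_1_3": "PDF 1.3 output",
    "PDF_1_4": "PDF 1.4 output",
    "PDF_1_5": "PDF 1.5 output",
    "PDF_1_6": "PDF 1.6 output",
    "PDF_1_7": "PDF 1.7 output",
}


def is_archival_profile(profile):
    """True for the PDF/A profiles, False for plain PDF versions."""
    return profile in PDF_PROFILES and profile.startswith("PDF_A")


def pdf_profile_element(profile):
    """Return the PDFProfile element as text, rejecting unknown values."""
    if profile not in PDF_PROFILES:
        raise ValueError(f"Unknown PDF profile: {profile!r}")
    element = ET.Element("PDFProfile")
    element.text = profile
    return ET.tostring(element, encoding="unicode")
```

Calling pdf_profile_element("PDF_A2B") yields the element for the PDF/A2b profile used in this example, while a value that is not in the list raises a ValueError.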
Click the Save button, then click the Publish button to publish the workflow.
Upload a document to the source location and the workflow will convert it to PDF/A format.