In this guide you’ll learn how to OCR a PDF document in SharePoint using a Designer Workflow. This enables you to process scanned or bitmap based content and generate fully searchable PDFs. We’ve also added the ability to recognise text on (part of) a page and return the actual text (not a bitmap) to the workflow for further processing. A common use for this functionality is to extract a particular area of text from documents that share a common template or layout.
Most organizations deal with scanned (or other bitmap based) content on a regular basis. Faxes are received in a digital inbox; invoices and legal documents are scanned and filed away in a file system, an MS SharePoint library or another Document Management System. The problem is that this is ‘dead information’: the content is stored as one big image, which search crawlers cannot index, so it does not show up in search results.
This is where OCR (Optical Character Recognition) comes in. OCR analyzes image based content (e.g. a scanned PDF or an image embedded in an MS Word file), applies image recognition logic to it and then embeds the recognised text in a PDF. The scanned content still looks the same, but you can now copy text from the document and search crawlers can index that text.
Scanned Document with OCRed text selected
It is possible to carry out OCR using our standard Convert Document workflow activity, but that requires knowledge of our XML syntax, which - although powerful - is less than user friendly. To make life easier we have created a separate Workflow Activity named Convert to OCRed PDF. This is what it looks like.
The workflow activity added is consistent with our other Workflow Activities (e.g. Converting / Watermarking / Merging / Securing) and largely self-describing.
this document: The source document to convert and OCR. For most workflows selecting Current Item will suffice, but some scenarios may require the lookup of a different item.
this file: The name and location to write the generated file to. Leave this field empty to use the same location and name as the source file. Please note that if your source file is already in PDF format then leaving this field empty will overwrite it. For details about how to specify paths to different libraries / site collections see this blog post.
include / exclude metadata: Control if the source file’s SharePoint metadata is copied to the destination file.
OCR language: The language the source document is written in. It defaults to English, but version 7.2 and higher support Arabic, Danish, German, English, Dutch, Finnish, French, Hebrew, Hungarian, Italian, Norwegian, Portuguese, Spanish and Swedish.
OCR Performance: Specify the performance / accuracy of the OCR engine. It is recommended to leave this on the default Slow but accurate setting.
Whitelist / Blacklist: Control which characters are recognised. For example, limit recognition to numbers by whitelisting 1234567890. This prevents, for example, a 0 (zero) from being recognised as the letter o or O.
Pagination: In some specific cases a single image spans multiple pages. Enable pagination for those cases.
Regions: By default the entire page is OCRed. To limit OCR to certain parts of a page, e.g. a header and/or footer, you can specify one or more regions using our XML syntax. Have a look at this blog post, but only use the part that starts with (and includes) <Regions>…</Regions>.
List ID: The ID of the list the processed file was written to. This can be used later in the workflow to perform additional tasks on the file such as a check-in or out.
Item ID: The ID of the processed file. Can be used together with the List ID to perform additional tasks on the file later in the workflow.
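To illustrate why the Whitelist / Blacklist setting helps, here is a minimal Python sketch of the general idea. It is not our OCR engine's implementation: it assumes a hypothetical engine that produces a list of candidate characters with confidence scores for each glyph, and the pick_character function and the scores are made up for illustration only.

```python
from typing import List, Optional, Tuple

# Hypothetical data shape: (candidate character, confidence between 0 and 1).
Candidate = Tuple[str, float]


def pick_character(
    candidates: List[Candidate],
    whitelist: Optional[str] = None,
    blacklist: str = "",
) -> Optional[str]:
    """Return the most confident candidate that passes both filters.

    whitelist=None means "no whitelist", so every character is allowed
    unless it appears in the blacklist.
    """
    allowed = [
        (char, score)
        for char, score in candidates
        if (whitelist is None or char in whitelist) and char not in blacklist
    ]
    if not allowed:
        return None  # nothing acceptable was recognised for this glyph
    return max(allowed, key=lambda candidate: candidate[1])[0]


# An ambiguous glyph: the letter O scores marginally higher than the digit 0.
glyph = [("O", 0.48), ("0", 0.46), ("o", 0.06)]

unrestricted = pick_character(glyph)
digits_only = pick_character(glyph, whitelist="1234567890")
```

Without any restriction the sketch picks the letter O, whereas whitelisting 1234567890 forces the digit 0. Blacklisting the letters o and O has the same effect for this particular glyph, which is why a whitelist is usually the simpler choice for fields that only ever contain numbers.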
Creating simple workflows in SharePoint Designer is relatively easy, but if the concept of MS SharePoint Designer workflows is new to you, have a look at our simple Getting Started Knowledge Base article.
Note: The OCR and PDF/A Archiving Add-On license is needed in order to use OCR in your production environment.