In this guide, you’ll learn how to extract text from a PDF in SharePoint. A common use case for this functionality is to extract a particular area of text from all documents that use a common template or layout. For example, if a reference number can always be found in the top-right corner of a scanned document, then that text can be extracted and stored in a SharePoint column from which it can be included in searches or used in additional workflow steps.
This guide outlines how to do this using an MS SharePoint Designer workflow.
Once PDF Converter for SharePoint is installed, a number of new workflow activities are added automatically to MS SharePoint Designer, including the Extract text using OCR activity. This activity takes the following parameters:
this document — The source document to OCR and extract text from. For most workflows, selecting the current item will suffice, but some scenarios may require the lookup of a different item.
OCR language — The language the source document is written in. It defaults to English, but versions 7.2 and higher also support Arabic, Danish, Dutch, Finnish, French, German, Hebrew, Hungarian, Italian, Norwegian, Portuguese, Spanish, and Swedish.
OCR performance — Specify the performance and accuracy of the OCR engine. The recommendation is to leave this on the default setting, Slow but accurate.
Whitelist / Blacklist — Control which characters are recognized. For example, limit recognition to numbers by allowing 1234567890. This prevents, for example, a 0 (zero) from being recognized as the letter o or O.
Pagination — In some specific cases, a single image spans multiple pages. Enable pagination for those cases.
Region — Specify the x, y, width, and height coordinates of the region to retrieve text from. The unit of measure (UOM) is 1/72nd of an inch. When extracting text from non-PDF files, such as TIFF or PNG images, note that the image is first converted to PDF internally. This may add margins around the image, but it guarantees that a single unified UOM is used across all file formats. If you aren’t sure how the internal conversion affects the dimensions of your image or scan, use our software to convert the file to PDF and open it in a PDF reader before specifying the coordinates.
Page — By default, text is extracted from all pages and concatenated. To extract the text from a specific page, specify the page number in this field.
Result — The recognized text will be stored in this variable (of type
Note: The OCR and PDF/A Archiving add-on license is needed to use OCR in your production environment.
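As a rough illustration of the Region units described above, the following Python sketch converts measurements in inches, millimeters, or scan pixels into 1/72-inch units. The function names and the example values are hypothetical and aren't part of the product. The sketch also ignores any margins that the internal conversion to PDF may add, so treat the results as a starting point to verify in a PDF reader.

```python
POINTS_PER_INCH = 72   # the Region unit of measure is 1/72nd of an inch
MM_PER_INCH = 25.4


def inches_to_points(inches: float) -> float:
    """Convert inches to 1/72-inch units."""
    return inches * POINTS_PER_INCH


def mm_to_points(mm: float) -> float:
    """Convert millimeters to 1/72-inch units."""
    return mm / MM_PER_INCH * POINTS_PER_INCH


def pixels_to_points(pixels: float, dpi: float) -> float:
    """Convert pixels of a scan with a known resolution (DPI) to 1/72-inch units."""
    return pixels / dpi * POINTS_PER_INCH


def region_from_pixels(x: float, y: float, width: float, height: float, dpi: float) -> tuple:
    """Return (x, y, width, height) in 1/72-inch units, rounded to two decimals."""
    return tuple(round(pixels_to_points(v, dpi), 2) for v in (x, y, width, height))


if __name__ == "__main__":
    # Hypothetical region measured on a 300 DPI scan.
    print(region_from_pixels(1800, 150, 600, 120, dpi=300))
```

For example, a box measured at 1800, 150, 600, and 120 pixels on a 300 DPI scan corresponds to 432, 36, 144, and 28.8 in Region units.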
Example of an MS SharePoint Designer Workflow to Extract Text from a PDF
In this example, an MS SharePoint Designer workflow retrieves all the PDF files created during the current day, extracts specific text from each PDF, and writes it to a list column. In an ideal setup, you’ll schedule this workflow to run outside of office hours to batch-process all newly created PDF files and extract text from them.
The legacy MS SharePoint 2007 / 2010 workflow engine is fully supported, as is the optional Workflow Manager that comes with MS SharePoint 2013 and later versions. For more details, refer to this post.
Before you start, make sure PDF Converter for SharePoint On-Premises is installed, and that you have access to a site collection with the appropriate rights to create workflows.
To retrieve PDF files from a document library, extract specific text, and write that text to a list column, first configure the document library to store PDF files, and then configure the workflow by following the steps in the next sections.
Creating and Configuring the Document Library
You can create and configure the document library by performing the following steps:
Create a document library named Proposal Documents. (Alternatively, you can use any name of your choice, but this naming is used for the purposes of this guide.)
Once created, navigate to Settings > Document Library Settings > Versioning Settings and enable Requires content approval for submitted items.
In the document library, create two folders, Confidential Proposals and Approved PDF Files.
Add a separate column called OCR of text type.
Creating and Configuring an MS SharePoint Designer Workflow
Create and configure the MS SharePoint Designer workflow by performing the following steps:
Start MS SharePoint Designer and open the site collection that contains the Proposal Documents document library.
Click Add Item and select List Workflow.
Fill in the following fields:
- Specify the name for the new workflow: Extract Text from PDF Format
- Specify the list to associate with the new workflow: Proposal Documents
- Choose the workflow platform for the new workflow: SharePoint 2010 Workflow
- Click Create.
You’re now ready to create the workflow. From the conditions menu, select the If current item field equals value condition.
Click the first value (field) and select Created from the dropdown.
Click the next value (equals) and select is less than or equal to from the dropdown.
Click the next value (value) and select the three dots (...) next to function (fx). Select Current Date from the popup.
With the conditions in place, you can now add the actions.
From the actions menu, select Extract text using OCR. It may be hidden behind the All Actions option.
The following action is inserted:
Insert a new action named Update List Item. This action writes the text extracted from the PDF to the SharePoint list column. Configure it as follows:
- Click this list.
- Select Current Item from the list, click Add, and select OCR from Set this field. Next to To this value, click fx and specify Workflow Variables and Parameters as the source. Set the field to the name of the variable the text has been stored in. Click OK.
- Click OK again to return to the MS SharePoint Designer workflow.
Insert a new action named Log to History List and enter Text Copied.
Click Publish to deploy and activate the workflow.
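The logic of the workflow built above can be summarized in a short Python sketch. This only illustrates the control flow and isn't how SharePoint Designer or PDF Converter for SharePoint executes it: `extract_text_using_ocr` is a hypothetical stand-in for the Extract text using OCR activity, list items are modeled as plain dictionaries, and the sample values are made up.

```python
from datetime import datetime
from typing import Callable, List


def run_workflow(item: dict,
                 now: datetime,
                 extract_text_using_ocr: Callable[[dict], str],
                 history: List[str]) -> None:
    """Mirror the condition and the three actions configured in the workflow."""
    # Condition: Created is less than or equal to Current Date.
    if item["Created"] <= now:
        # Action 1: Extract text using OCR; the result lands in a workflow variable.
        text = extract_text_using_ocr(item)
        # Action 2: Update List Item; copy the variable into the OCR column.
        item["OCR"] = text
        # Action 3: Log to History List.
        history.append("Text Copied")


if __name__ == "__main__":
    # Hypothetical item and a stub in place of the real OCR activity.
    item = {"Name": "proposal.pdf", "Created": datetime(2024, 1, 1, 9, 0)}
    history: List[str] = []
    run_workflow(item, datetime(2024, 1, 1, 18, 0), lambda _: "REF-0001", history)
    print(item["OCR"], history)
```

Modeling the OCR step as a function argument keeps the sketch runnable without the product installed. In the real workflow, that step is the activity inserted from the actions menu.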
Testing the Workflow
You can test the workflow created by performing the following steps:
Upload a PDF document to the Confidential Proposals folder. The document must contain text in the region defined by the coordinates in the workflow definition.
From the context menu, manually start the workflow.
After a few seconds, the workflow status should change to Completed. Refresh the list, and you’ll see that the OCR column contains the text extracted from the PDF.
If an error occurs during the execution of the workflow, you can perform the following to troubleshoot:
Check the messages on the workflow status screen.
Check for errors in the Windows Event log.
Check for errors in the SharePoint trace log.
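As a starting point for the last check, the sketch below filters the lines of a SharePoint trace log for entries logged at higher severity levels. It assumes the usual tab-separated layout with a header row that includes Level and Message columns. The set of level names is an assumption to adjust for your environment, and the function name is hypothetical.

```python
import csv
from typing import Iterable, Iterator

# Assumption: level names treated as worth a closer look; adjust as needed.
SEVERE_LEVELS = frozenset({"Critical", "Error", "Exception", "Unexpected", "High"})


def severe_entries(lines: Iterable[str],
                   levels: frozenset = SEVERE_LEVELS) -> Iterator[dict]:
    """Yield log entries (as dicts keyed by column name) whose Level is in `levels`."""
    reader = csv.reader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    # The first row is expected to hold the column names, padded with spaces.
    header = [name.strip() for name in next(reader, [])]
    for row in reader:
        entry = {name: value.strip() for name, value in zip(header, row)}
        if entry.get("Level") in levels:
            yield entry


if __name__ == "__main__":
    # Made-up sample lines for illustration only.
    sample = [
        "Timestamp\tProcess\tTID\tArea\tCategory\tEventID\tLevel\tMessage\tCorrelation",
        "01/01/2024 10:00:00.00\tw3wp.exe\t0x0001\tExample Area\tWorkflow\tabcd\tUnexpected\tSample failure\t",
        "01/01/2024 10:00:01.00\tw3wp.exe\t0x0001\tExample Area\tWorkflow\tabce\tMedium\tSample info\t",
    ]
    for entry in severe_entries(sample):
        print(entry["Timestamp"], entry["Level"], entry["Message"])
```

The function accepts any iterable of lines, so an open log file can be passed in directly. Narrowing the results to the time at which the workflow ran usually makes the relevant entries easier to spot.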