In this guide you’ll learn how to extract text from a PDF using a Nintex Workflow. A common use case is extracting a particular area of text from documents that share a template or layout. For example, if a reference number always appears in the top right corner of a scanned document, that text can be extracted and stored in a SharePoint column, where it can be included in searches or used in further workflow steps.
This article describes how to achieve this using Nintex Workflow.
Once the Muhimbi PDF Converter for SharePoint On-Premises is installed and the Nintex Workflow Integration has been activated, a number of new activities are added automatically to the list, including the new Extract text using OCR activity. It is compatible with Nintex Workflow 2007, 2010 and 2013.
The fields supported by this Workflow Activity are as follows:
Language: This is the language the source document is written in. It defaults to English. From PDF Converter for SharePoint On-Premises version 7.2 onwards, the supported languages are Arabic, Danish, German, English, Dutch, Finnish, French, Hebrew, Hungarian, Italian, Norwegian, Portuguese, Spanish and Swedish.
Performance: Specify the performance or accuracy of the OCR engine. It is recommended to leave this on the default Slow but accurate setting.
Whitelist / Blacklist: You can control which characters are recognised. For example, you can limit recognition to numbers by whitelisting 1234567890. This prevents, for example, a 0 (zero) from being recognised as the letter o or O.
Pagination: In some specific cases, a single image spans multiple pages. Enable pagination for those cases.
Region: Specify the x, y, width and height coordinates of the region to retrieve text from. The unit of measure (UOM) is 1/72nd of an inch. When extracting text from non-PDF files, e.g. a TIFF or PNG, take into account that the image is first converted to PDF internally. This may add margins around the image, but it guarantees that a single, unified UOM is used across all file formats. If you are not sure how the internal conversion affects the dimensions of your image or scan, use our software to convert the file to PDF and open it in a PDF reader before specifying the coordinates.
Page Number: By default, text is extracted from all pages and concatenated. To extract the text from a specific page, specify the page number in this field.
Output Text: The recognised text will be stored in this variable (type String).
Source List ID & List Item: The item that triggered the workflow is processed by default. You can optionally specify the ID of a different List and List Item using workflow variables. Use the String data type for the List ID workflow variable. For the Item ID, use the Item ID data type (in SharePoint 2007) or Integer (in SharePoint 2010 / 2013).
Error Handling: Some of Nintex’s own Workflow Activities allow errors to be captured and evaluated by subsequent actions, and all of Muhimbi’s Workflow Activities allow the same. By default, this facility is disabled, meaning that any error terminates the workflow.
Note: An OCR and PDF/A Archiving Add-On license is needed to use OCR in your production environment.
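Because the Region field is expressed in units of 1/72nd of an inch, it can help to convert measurements taken in inches or millimetres before entering them. The following is a minimal Python sketch of that arithmetic only. The function names and the example measurements are hypothetical, and this article does not state which corner of the page x and y are measured from, so verify the result against a converted PDF as described above.

```python
POINTS_PER_INCH = 72.0  # the Region unit of measure is 1/72nd of an inch
MM_PER_INCH = 25.4


def inches_to_points(inches: float) -> float:
    """Convert a length in inches to 1/72-inch units."""
    return inches * POINTS_PER_INCH


def mm_to_points(mm: float) -> float:
    """Convert a length in millimetres to 1/72-inch units."""
    return mm / MM_PER_INCH * POINTS_PER_INCH


def region_in_points(x_in: float, y_in: float, width_in: float, height_in: float):
    """Return (x, y, width, height) in 1/72-inch units from values in inches."""
    return tuple(
        round(inches_to_points(value), 2)
        for value in (x_in, y_in, width_in, height_in)
    )


# Illustrative values only: a 2 x 1 inch box, offset 6 inches and 0.5 inches.
region = region_in_points(6.0, 0.5, 2.0, 1.0)
print(region)  # (432.0, 36.0, 144.0, 72.0)
```

The four resulting numbers are the values that would be typed into the x, y, width and height fields.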
Example of Nintex Workflow to Extract text from PDF
In this example, a Nintex workflow retrieves all the PDF files created during the current day, extracts specific text from each PDF, and writes it to a List Column. In an ideal setup, you would schedule this workflow to run outside office hours to batch-process all newly created PDF files and extract text from them.
The finished workflow
Before we start building the workflow, ensure all prerequisites are in place. It is also assumed that the reader has some knowledge of building Workflows using Nintex Workflow.
Make sure the PDF Converter for SharePoint On-Premises version 7.1 (or newer) is installed in line with chapter two of the Administration Guide.
Naturally, Nintex Workflow will need to be installed as well.
Make sure the Muhimbi.PDFConverter.Nintex.WebApp SharePoint Feature is activated using SharePoint Central Administration on the relevant Web Application.
You will need to have the appropriate privileges to create workflows.
Creating a new workflow
To get started, create a new workflow and choose the blank template. Ensure the workflow doesn’t start automatically, and add the following variables and data types:
Ensure that the appropriate data types are assigned to the variables. They are listed under the ‘Type’ column beside the variable name. The names are largely self-explanatory, but some additional information is provided below:
- Source Item ID: By default, the PDF file that triggered the workflow is converted to text and written to the List column. However, as we are iterating over multiple items, we need to specify the ID of the item to convert in this variable. In SharePoint 2010 and later, select Integer as the Type, not List Item ID.
- Source List ID: The PDF Converter assumes the item that is being converted to text is located in the same list the workflow is attached to. However, if this is not the case, then the list ID (a GUID) will need to be specified as well. In this example, everything is located in the same list, so this variable is not used.
- Source Files: As we are potentially converting multiple PDF files to text, we need to define a variable of type Collection to hold the list of files we will be iterating over.
- Generated PDF Item ID: Once a file has been converted, you may want to carry out additional actions on this new file, for example checking it in. Once converted, the ID of the PDF is automatically stored in this variable. In SharePoint 2010 and later, select Integer as the Type, not List Item ID. In this example, this variable is not used.
- Generated PDF List ID: As the PDF Converter allows files to be written to different document libraries, and even completely different Site Collections, you may want to know the ID of the destination list. In this example, this variable is not used.
- Extracted Text: Once the text has been extracted from the PDF file, it is stored in this variable to be used later. In our case, we will write this text to the List Column.
Adding the workflow actions
You are now ready to add the actions to the workflow. You can start by adding a Query List action, allowing you to retrieve all files modified today and store the results in the Source Files collection.
You can fill out the settings for this action as per the screenshot listed above. You may want to add an additional filter rule to check that Content Type is not equal to Folder or Document Set.
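The logic of this filter can be sketched in Python as follows. This is an illustration of the selection rule only, not Nintex or Muhimbi code: the `ListItem` class, its field names and the sample data are hypothetical stand-ins, and in the real workflow the filter is configured in the Query List action dialog.

```python
from dataclasses import dataclass
from datetime import date, datetime


# Hypothetical stand-in for a SharePoint list item; not a real Nintex type.
@dataclass
class ListItem:
    item_id: int
    name: str
    content_type: str
    modified: datetime


EXCLUDED_CONTENT_TYPES = {"Folder", "Document Set"}


def select_source_files(items, today):
    """Return the IDs of items modified on `today`, skipping folders and document sets."""
    return [
        item.item_id
        for item in items
        if item.modified.date() == today
        and item.content_type not in EXCLUDED_CONTENT_TYPES
    ]


# Sample data, invented for the illustration.
items = [
    ListItem(1, "invoice-a.pdf", "Document", datetime(2024, 5, 2, 9, 15)),
    ListItem(2, "Archive", "Folder", datetime(2024, 5, 2, 10, 0)),
    ListItem(3, "invoice-b.pdf", "Document", datetime(2024, 5, 1, 16, 30)),
    ListItem(4, "invoice-c.pdf", "Document", datetime(2024, 5, 2, 17, 45)),
]

source_files = select_source_files(items, date(2024, 5, 2))
print(source_files)  # [1, 4]
```

The resulting list of IDs plays the role of the Source Files collection.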
You can continue by adding the For Each action to the workflow. Specify the collection’s name to iterate over and the variable’s name to store the Item’s ID in.
The next set of actions you add will need to be added inside the For Each action to ensure they are executed separately for each file in the list.
After this, add the Extract text using OCR action inside the For Each action. You can fill in its settings as per the image below:
OCR and PDF/A Archiving Add-On integrates with all Nintex Workflow versions.
For more information about any of the fields in this screen, hover the mouse over the small information icons.
You can continue by adding the Update Item action to the workflow. Specify the List Column in which to store the extracted text.
The workflow is now done. If you want, you can add some logging information using the Log in the History List action.
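Taken together, the actions inside the For Each loop behave roughly like the Python sketch below. It is a conceptual model under stated assumptions, not the Muhimbi or Nintex API: `extract_text` and `update_item` are hypothetical callables standing in for the Extract text using OCR and Update Item actions, and the sketch assumes error capture has been enabled so that one failing file does not terminate the whole run.

```python
from typing import Callable, Dict, Iterable


def process_source_files(
    source_files: Iterable[int],
    extract_text: Callable[[int], str],
    update_item: Callable[[int, str], None],
) -> Dict[int, str]:
    """Extract text for each item ID and store it; return captured errors by item ID."""
    errors: Dict[int, str] = {}
    for item_id in source_files:
        try:
            text = extract_text(item_id)  # stands in for "Extract text using OCR"
        except Exception as exc:
            # Assumes error capture is enabled; by default an error ends the workflow.
            errors[item_id] = str(exc)
            continue
        update_item(item_id, text)  # stands in for "Update Item"
    return errors


# Hypothetical stubs so the sketch runs without SharePoint.
fake_ocr_results = {1: "REF-001", 4: "REF-004"}
column_values: Dict[int, str] = {}


def fake_extract_text(item_id: int) -> str:
    if item_id not in fake_ocr_results:
        raise ValueError(f"no text for item {item_id}")
    return fake_ocr_results[item_id]


def fake_update_item(item_id: int, text: str) -> None:
    column_values[item_id] = text


errors = process_source_files([1, 4, 7], fake_extract_text, fake_update_item)
print(column_values)  # {1: 'REF-001', 4: 'REF-004'}
print(errors)         # {7: 'no text for item 7'}
```

Passing the two actions in as callables keeps the loop itself independent of any particular product, which is why the stubs can be swapped for the in-memory versions used here.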
Running the workflow
You can finalize the workflow by saving and publishing it, after which the workflow is ready to be executed.
You can either run the workflow manually or schedule it to run at a specific time of your choice.