Extract Text from PDF using Power Automate

In this guide you’ll learn how to OCR and extract text from a scanned or faxed PDF document using Power Automate.

In this example we use the Muhimbi ‘Extract Text using OCR’ action to extract text from an image-based PDF (attached to a list item) and write the extracted text to an MS SharePoint List column.

Note: Extracting text will only work with image-based content (mainly scans and faxes). It is not possible to extract text from PDFs that contain ‘real text’, such as PDFs generated from MS Word documents.

Prerequisites

Before we start building the workflow, ensure all prerequisites are in place. It is also assumed that the reader has some knowledge of building Workflows using Power Automate.

  • An Office 365 subscription with a SharePoint Online license.
  • A full or trial subscription to Muhimbi PDF Converter for Power Automate.
    Note: The free subscription does not support OCR.
  • Appropriate privileges to create Flows.
  • Working knowledge of MS SharePoint Online and Power Automate (formerly Microsoft Flow).

Setting up MS SharePoint Online Environment

  1. Create an MS SharePoint Online List and add the following columns:

create sharepoint list

Field            Data Type                       Details
Extracted text   Multiple lines of text          We will use this to store the text extracted from the PDF document.
To process       Yes/No (default value ‘Yes’)    This will be used to prevent recursive Flows.

Using Power Automate to Extract Text from PDF Documents and Forms

On a high level, the workflow will look as follows:

power automate flow

  1. Use the ‘When an item is created or modified’ SharePoint trigger. In the trigger, specify the path to the SharePoint Online List to monitor for new or modified items.

when item is created

  1. Initialize the variables as shown in the screenshot below. The later steps use two of them, ‘OCR-Text’ and ‘Temporary’, to build up the extracted text:

variables

  1. Add a ‘Condition’ action to guard against the recursive event (a continuous loop).

condition

The two conditions are combined with the AND operator, so the Flow continues only when both are true. This is what prevents an endless loop.

  • ‘Has attachment’ (output of the ‘When an item is created or modified’ trigger) is equal to ‘True’.
  • ‘To Process’ (output of the ‘When an item is created or modified’ trigger) is equal to ‘True’.
  • ‘To Process’ is a column of type ‘Yes/No’ with a default value of ‘Yes (true)’. The Flow OCRs the document only if both values evaluate to true; otherwise it terminates. The ‘Update item’ action later sets ‘To Process’ to ‘No (false)’.
  • Because the Flow updates a column in the same item, ‘Update item’ always invokes the trigger (‘When an item is created or modified’) again. By then ‘To Process’ is ‘No’, so the second run of the Flow terminates.
  1. If both conditions evaluate to ‘true (Yes)’, OCR the document; otherwise (‘false (No)’) terminate the Flow.

ocr the document
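
The loop guard described above can be sketched in a few lines of Python. This is only an illustration of the logic, not something Power Automate runs: the `ListItem` class and the `should_process` and `run_flow` functions are hypothetical stand-ins for the list item, the ‘Condition’ action and one run of the Flow.

```python
from dataclasses import dataclass


@dataclass
class ListItem:
    """Hypothetical stand-in for the fields the trigger exposes."""
    has_attachments: bool
    to_process: bool
    extracted_text: str = ""


def should_process(item: ListItem) -> bool:
    # Mirrors the Condition: 'Has attachment' AND 'To Process' must both be true.
    return item.has_attachments and item.to_process


def run_flow(item: ListItem) -> str:
    """Simulate one trigger run and report which branch was taken."""
    if not should_process(item):
        return "terminated"
    item.extracted_text = "<text returned by OCR>"  # placeholder for the OCR steps
    item.to_process = False  # what 'Update item' does; the update re-fires the trigger
    return "processed"


item = ListItem(has_attachments=True, to_process=True)
first_run = run_flow(item)   # new item: both conditions hold
second_run = run_flow(item)  # run caused by 'Update item': 'To Process' is now False
print(first_run, second_run)
```

The second run falls straight through to the terminate branch, which is why resetting ‘To Process’ is enough to stop the recursion.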

  1. Add the ‘Get attachments’ MS SharePoint action and specify the path to the MS SharePoint Online List.
  • ID: Select the ‘ID’ that is an output of the ‘When an item is created or modified’ trigger. Make sure you select the upper-case ‘ID’.

select id

when item is modified

  1. As a List item can have multiple attachments, add an ‘Apply to each’ loop and set it to the ‘Body’ field, an output of the MS SharePoint Online ‘Get attachments’ action.

apply to each

  1. Add the MS SharePoint Online ‘Get attachment content’ action and specify the path to the MS SharePoint Online List.
  • ID: Select the ‘ID’ that is an output of the ‘When an item is created or modified’ trigger. Make sure you select the upper-case ‘ID’.
  • File Identifier: Select ‘Id’, an output of the ‘Get attachments’ action.

get attachment content

  1. Add the ‘Extract text using OCR’ action. This is where the text is extracted from the image. In this example we keep it simple and extract all text.

    Note: It is possible to specify a range of coordinates to extract the text from.

extract text using ocr

  • Source file name: Use the ‘Display Name’, an output of the ‘Get attachments’ action.
  • Source File Content: The content of the file to process. Use the ‘Attachment Content’, the output of the ‘Get attachment content’ action.
  • Language: The language the source document is written in. It defaults to English; other supported languages include Arabic, Danish, German, Dutch, Finnish, French, Hebrew, Hungarian, Italian, Norwegian, Portuguese, Spanish and Swedish.
  • Region: Specify the x, y, width and height of the region to retrieve text from. The unit of measure (UOM) is 1/72nd of an inch, and this single unit is used across all file formats. Non-PDF files, e.g. a TIFF or PNG, are first converted to PDF internally, which may add margins around the image. To see how this internal conversion affects the dimensions of your image or scan, convert the file to PDF and open it in a PDF reader to get the details.
  • Page number: By default, text is extracted from all pages and concatenated. To extract the text from a specific page, specify the page number in this field.
  • Performance: Specify the performance / accuracy of the OCR engine. It is recommended to leave this on the default ‘Slow but accurate’ setting.
  • Whitelist / Blacklist: Control which characters are recognised. For example, limit recognition to numbers by whitelisting 1234567890. This prevents, for example, a 0 (zero) from being recognised as the letter o or O.
  • Use Pagination: In some specific cases, a single image spans multiple pages. Enable pagination for those cases.
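
Because the ‘Region’ values are expressed in units of 1/72nd of an inch, it can help to convert from more familiar measurements first. The Python sketch below shows the arithmetic only. The helper names and the dictionary layout are assumptions made for this illustration and are not part of the Muhimbi action, and the sketch does not assume which page corner the x and y offsets are measured from.

```python
POINTS_PER_INCH = 72   # the action's unit of measure is 1/72nd of an inch
MM_PER_INCH = 25.4


def inches_to_points(inches: float) -> float:
    return inches * POINTS_PER_INCH


def mm_to_points(mm: float) -> float:
    return mm / MM_PER_INCH * POINTS_PER_INCH


def pixels_to_points(pixels: float, dpi: float) -> float:
    # Only an approximation for scans: the internal conversion to PDF
    # may add margins around the image.
    return pixels / dpi * POINTS_PER_INCH


def region(x_in: float, y_in: float, width_in: float, height_in: float) -> dict:
    """Hypothetical helper: convert a region given in inches to 1/72-inch units."""
    return {
        "x": inches_to_points(x_in),
        "y": inches_to_points(y_in),
        "width": inches_to_points(width_in),
        "height": inches_to_points(height_in),
    }


# A 2 x 1 inch box, offset 1 inch horizontally and 0.5 inch vertically.
box = region(1, 0.5, 2, 1)
print(box)

# 600 pixels of a 300 dpi scan correspond to 2 inches.
print(pixels_to_points(600, dpi=300))
```

For image files, treat the pixel conversion as a starting point and verify the numbers against the converted PDF, as suggested above.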
  1. Add the ‘Set Variable’ action. Set the ‘Temporary’ variable to the current value of ‘OCR-Text’ followed by ‘Out text’, the output of the ‘Extract text using OCR’ action. Together with the next step, this concatenates the text of all the list item’s attachments into a single variable.

set variable action

  1. Add a second ‘Set Variable’ action that copies the value of ‘Temporary’ back into ‘OCR-Text’.

ocr text
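
The two ‘Set Variable’ steps inside the ‘Apply to each’ loop amount to a simple accumulation. The sketch below mimics them in Python. The `concatenate_ocr_text` and `fake_ocr` functions are hypothetical, and `fake_ocr` merely stands in for the ‘Extract text using OCR’ action. The reason for the intermediate variable is an assumption here: presumably a ‘Set variable’ action cannot reference the variable it is setting.

```python
from typing import Callable, Iterable


def concatenate_ocr_text(
    attachments: Iterable[bytes],
    extract_text: Callable[[bytes], str],
) -> str:
    ocr_text = ""    # the 'OCR-Text' variable
    temporary = ""   # the 'Temporary' variable
    for content in attachments:            # 'Apply to each' over the attachments
        out_text = extract_text(content)   # 'Extract text using OCR' -> 'Out text'
        temporary = ocr_text + out_text    # first 'Set Variable'
        ocr_text = temporary               # second 'Set Variable'
    return ocr_text


def fake_ocr(content: bytes) -> str:
    # Stand-in for the real OCR call, used only to make the sketch runnable.
    return content.decode("ascii").upper()


result = concatenate_ocr_text([b"page one. ", b"page two."], fake_ocr)
print(result)
```

After the loop, `ocr_text` holds the text of every attachment in order, which is the value written to the list item in the next step.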

  1. Add the MS SharePoint Online ‘Update item’ action, and specify the path to the MS SharePoint Online List for the item to be updated.

    Note: This Action is outside the ‘Apply to Each’ Loop.

  • ID: Select the ‘ID’ that is an output of the ‘When an item is created or modified’ trigger. Make sure you select the upper-case ‘ID’.
  • Title: Select ‘Title’, an output of the ‘When an item is created or modified’ trigger.
  • Extracted Text: Select the ‘OCR-Text’ variable, which now holds the concatenated text.
  • To Process: Set it to ‘No’. This is what prevents a continuous loop.

update item

  1. With everything in place, create a list item with one or more PDF attachments. After a few seconds, the text extracted from the PDF files appears in the ‘Extracted text’ column of the MS SharePoint List.


© Muhimbi Ltd. 2008 - 2023