Extract Data from PDF using Power Automate

In this guide you’ll learn how to OCR and extract data from specific coordinates in an image or PDF document using Power Automate.

Using Power Automate to Extract Text Using OCR

This example walks through extracting text from an image-embedded PDF file and writing the extracted text to a custom column, created for this purpose, in an MS SharePoint library. At a high level, the Flow consists of a SharePoint trigger followed by three actions: ‘Get file content’, ‘Extract text using OCR’, and ‘Update file properties’.

The steps to create the Flow are as follows:

  1. Create a new Flow and use the SharePoint Online trigger ‘When a file is created (properties only)’. Enter the URL of the site collection as the Site Address, then select the relevant Library Name and Folder from the dropdown menus.

  2. Insert MS SharePoint's ‘Get file content’ action. Set Site Address to the value for your own site, and set File Identifier to the output value of the ‘When a file is created (properties only)’ trigger.

  3. Insert Muhimbi's ‘Extract text using OCR’ action and fill it out as follows:

     - Source file name: Name of the source file, including the extension.
     - Source file content: Content of the file to OCR. Select ‘Body’, which is the output value of the ‘Get file content’ action.
     - Language: Language of the text in the file to OCR. In our case, we select ‘English’.
     - X Coordinate: X coordinate of the region to OCR, in points (1/72 of an inch). In our case, we enter ‘150’.
     - Y Coordinate: Y coordinate of the region to OCR, in points (1/72 of an inch). In our case, we enter ‘368’.
     - Width: Width of the region to OCR, in points (1/72 of an inch). In our case, we enter ‘92’.
     - Height: Height of the region to OCR, in points (1/72 of an inch). In our case, we enter ‘80’.
     - Page number: Page number to OCR. Leave this blank to OCR all pages, or when the source file is an image.
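The coordinate fields above are given in points (1/72 of an inch), while a region on a scanned page is often measured in pixels. The following is a minimal Python sketch of that conversion, plus a sanity check that a region fits on the page. The names `pixels_to_points`, `Region`, and `fits_on_page` are illustrative helpers, not part of Power Automate or the Muhimbi action, and the US Letter page size (612 x 792 points) is an assumption about the document being processed.

```python
from dataclasses import dataclass

POINTS_PER_INCH = 72  # 1 point = 1/72 of an inch


def pixels_to_points(pixels: float, dpi: float) -> float:
    """Convert a length measured in pixels on a scan of the given DPI to points."""
    if dpi <= 0:
        raise ValueError("dpi must be positive")
    return pixels * POINTS_PER_INCH / dpi


@dataclass(frozen=True)
class Region:
    """A rectangular OCR region, all values in points (hypothetical helper type)."""
    x: float
    y: float
    width: float
    height: float

    def fits_on_page(self, page_width: float, page_height: float) -> bool:
        """Return True if the region lies entirely within a page of the given size."""
        return (
            self.x >= 0
            and self.y >= 0
            and self.width > 0
            and self.height > 0
            and self.x + self.width <= page_width
            and self.y + self.height <= page_height
        )


# The region used in this walkthrough.
example = Region(x=150, y=368, width=92, height=80)

# Assumption: the page is US Letter, 8.5 x 11 inches = 612 x 792 points.
print(example.fits_on_page(612, 792))

# A position measured at 625 px on a 300 DPI scan corresponds to 150 points.
print(pixels_to_points(625, dpi=300))
```

Running a check like this before entering values in the action can catch a region that was measured in the wrong unit, which would otherwise tend to show up only as empty or unexpected OCR output.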

  4. Insert an MS SharePoint ‘Update file properties’ action and fill it out as follows. This writes the OCR’ed text back to a column in the library for the item specified by the item ID.

     - Site Address: Select the site address that contains the MS SharePoint library to be updated with the OCR’ed content.
     - Library Name: Select the MS SharePoint library to be updated with the OCR’ed content.
     - Id: The unique identifier of the item to be updated. In our case, select ‘Id’, which is the output of the ‘When a file is created (properties only)’ trigger.
     - Item: The library column to be updated. In our case, the column is named ‘convertedtext’ and we set it to ‘out text’, which is the output of the ‘Extract text using OCR’ action. Substitute the column name used in your own library.

  5. Publish the workflow and upload an image-embedded PDF file to the specified document library. After a few seconds, the Flow will trigger and the OCR’ed content will be written to the ‘convertedtext’ column in the MS SharePoint library.
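To make the data flow between the steps above easier to follow, here is a rough Python sketch that models the Flow as a chain of function calls. It does not call SharePoint or Muhimbi. `OcrRequest`, `run_flow`, and the three `fake_*` functions are hypothetical stand-ins for the trigger output and the three actions, and the in-memory `library` dictionary stands in for the document library.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional


@dataclass
class OcrRequest:
    """Mirrors the fields of the 'Extract text using OCR' action (illustrative names)."""
    source_file_name: str
    source_file_content: bytes
    language: str = "English"
    x: Optional[float] = None
    y: Optional[float] = None
    width: Optional[float] = None
    height: Optional[float] = None
    page_number: Optional[int] = None  # None models a blank field: all pages, or an image


def run_flow(
    item_id: int,
    file_name: str,
    get_file_content: Callable[[int], bytes],
    extract_text_ocr: Callable[[OcrRequest], str],
    update_file_properties: Callable[[int, Dict[str, str]], None],
    column: str = "convertedtext",
) -> str:
    """Chain the three actions in the order the Flow runs them."""
    content = get_file_content(item_id)                      # 'Get file content'
    request = OcrRequest(file_name, content, x=150, y=368, width=92, height=80)
    text = extract_text_ocr(request)                         # 'Extract text using OCR'
    update_file_properties(item_id, {column: text})          # 'Update file properties'
    return text


# In-memory stand-ins so the sketch runs without any external service.
library = {7: {"content": b"placeholder PDF bytes", "properties": {}}}


def fake_get_file_content(item_id: int) -> bytes:
    return library[item_id]["content"]


def fake_extract_text_ocr(request: OcrRequest) -> str:
    return f"text from {request.source_file_name}"


def fake_update_file_properties(item_id: int, properties: Dict[str, str]) -> None:
    library[item_id]["properties"].update(properties)


run_flow(7, "invoice.pdf", fake_get_file_content, fake_extract_text_ocr,
         fake_update_file_properties)
print(library[7]["properties"])  # {'convertedtext': 'text from invoice.pdf'}
```

The point of the sketch is the wiring: the item ID from the trigger is used both to fetch the file and to update it, and the OCR output is the only value written to the custom column.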

© Muhimbi Ltd. 2008 - 2023