In this guide you’ll learn how to OCR and extract data from specific coordinates in an image or PDF document using Power Automate.
Using Power Automate to Extract Text Using OCR
This example walks through extracting text from a PDF file that contains an embedded image and writing the extracted text to a custom column, created for this purpose, in an MS SharePoint library. At a high level, the Flow consists of a SharePoint trigger followed by three actions: ‘Get file content’, ‘Extract text using OCR’, and ‘Update file properties’.
The steps to create the Flow are as follows:
- Create a new Flow and use the SharePoint Online trigger ‘When a file is created (properties only)’. Fill out the URL for the site collection and select the relevant SharePoint Site Address, Library Name, and Folder from the dropdown menu.
- Insert MS SharePoint's ‘Get file content’ action. Set Site Address to the value for your own site, and set File Identifier to the output value of the ‘When a file is created (properties only)’ trigger.
- Insert Muhimbi's ‘Extract text using OCR’ action and fill it out as follows:
  - Source file name: Name of the source file, including its extension.
  - Source file content: Content of the file to OCR. Select ‘Body’, which is the output value of the ‘Get file content’ action.
  - Language: Language of the text to be OCR’ed. In our case, we select ‘English’.
  - X Coordinate: X coordinate of the region to OCR, in points (1/72 of an inch). In our case, we enter ‘150’.
  - Y Coordinate: Y coordinate of the region to OCR, in points (1/72 of an inch). In our case, we enter ‘368’.
  - Width: Width of the region to OCR, in points (1/72 of an inch). In our case, we enter ‘92’.
  - Height: Height of the region to OCR, in points (1/72 of an inch). In our case, we enter ‘80’.
  - Page number: Page number to be OCR’ed. Leave this blank to OCR all pages, or when the source is an image.
- Insert an MS SharePoint ‘Update file properties’ action and fill it out as follows. This writes the OCR’ed text back to a column in the library for the item specified by the item Id.
  - Site Address: Select the site that hosts the MS SharePoint library to be updated with the OCR’ed content.
  - Library Name: Select the MS SharePoint library to be updated with the OCR’ed content.
  - Id: The unique identifier of the item to be updated. In our case, select ‘Id’, which is an output of the ‘When a file is created (properties only)’ trigger.
  - Item: The library column that receives the data. In our case, the column is named ‘convertedtext’, and we set it to ‘out text’, which is the output of the ‘Extract text using OCR’ action. Use whatever column name applies in your scenario.
- Publish the workflow and upload an image-embedded PDF file to the specified document library. After a few seconds, the Flow triggers and the OCR’ed content is written to the ‘convertedtext’ column in our MS SharePoint library.
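The X Coordinate, Y Coordinate, Width, and Height fields of the ‘Extract text using OCR’ action are expressed in points (1/72 of an inch), whereas a region measured on a scanned image is usually in pixels. The Python sketch below shows one way to do that unit conversion before typing the values into the action. It is not part of Muhimbi's product or of Power Automate: the names `OcrRegion`, `pixels_to_points`, `region_from_pixels`, and `fits_on_page` are hypothetical, and the 144 DPI scan resolution and the 612 x 792 point (US Letter) page size are assumptions chosen for illustration. The guide does not state which page corner the coordinates are measured from, so the helper only converts units and checks that the rectangle stays within the page.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OcrRegion:
    """Rectangle in points (1/72 inch). Hypothetical helper type for illustration."""
    x: float
    y: float
    width: float
    height: float


def pixels_to_points(pixels: float, dpi: float) -> float:
    """Convert a pixel measurement to points: one inch is 72 points or `dpi` pixels."""
    if dpi <= 0:
        raise ValueError("dpi must be positive")
    return pixels * 72.0 / dpi


def region_from_pixels(x_px: float, y_px: float,
                       width_px: float, height_px: float,
                       dpi: float) -> OcrRegion:
    """Build a region in points from a rectangle measured in pixels."""
    return OcrRegion(
        x=pixels_to_points(x_px, dpi),
        y=pixels_to_points(y_px, dpi),
        width=pixels_to_points(width_px, dpi),
        height=pixels_to_points(height_px, dpi),
    )


def fits_on_page(region: OcrRegion, page_width_pt: float, page_height_pt: float) -> bool:
    """Return True if the rectangle lies entirely within a page of the given size."""
    return (
        region.x >= 0
        and region.y >= 0
        and region.width > 0
        and region.height > 0
        and region.x + region.width <= page_width_pt
        and region.y + region.height <= page_height_pt
    )


if __name__ == "__main__":
    # Assumed example: a rectangle measured in pixels on a 144 DPI scan.
    region = region_from_pixels(300, 736, 184, 160, dpi=144)
    print(region)
    # Assumed page size: US Letter, 612 x 792 points.
    print(fits_on_page(region, 612, 792))
```

With these assumed inputs, the pixel rectangle converts to the same values used in this guide (150, 368, 92, 80), which can then be entered into the action's fields.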