How to OCR Images & Scanned PDFs using C#

As of version 7.1, Muhimbi’s range of PDF Conversion products supports Optical Character Recognition (OCR). Like all other functionality provided by our products, the new OCR facility is available through our friendly Web Services interface as well as our SharePoint Designer and Nintex Workflow Actions.

In this post we’ll provide a simple .NET sample that invokes our Web Services interface to make an image-based PDF fully searchable. The code is nearly identical to the code that converts and watermarks a simple MS-Word file, with the following exceptions:

The code looks for PDF source files (an image-based PDF is included in the downloadable sample code).
The conversionSettings.OCRSettings property is populated with relevant OCR settings such as the language.
The client.ProcessChanges() method is invoked rather than client.Convert().
All references to watermarks have been removed as they are not part of this sample.

You can apply the same changes to the PHP and Ruby samples to achieve the same result in those languages. A separate Java-based OCR sample is available here.

Sample Code

Listed below is sample code to carry out OCR processing. You can either copy the code from this blog post, download the Visual Studio Project or open the project from the Sample Code folder in the Windows Start Menu.

The sample code expects the path of the source PDF file on the command line. If the path is omitted then the first PDF file found in the current directory will be used.
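The file selection rule described above can be sketched as follows. This is an illustrative helper, not code taken from the downloadable sample: the class name `SourceFileLocator` and the method name `Resolve` are hypothetical, and the alphabetical ordering is an assumption added to make "first PDF file found" deterministic.

```csharp
using System;
using System.IO;
using System.Linq;

// Hypothetical helper: picks the PDF the sample should process.
public static class SourceFileLocator
{
    // Returns the path passed on the command line, or the first PDF
    // (alphabetically, an assumption) in the current directory when no
    // argument was supplied. Returns null when no PDF can be found.
    public static string Resolve(string[] args)
    {
        if (args != null && args.Length > 0 && !string.IsNullOrWhiteSpace(args[0]))
            return args[0];

        return Directory.GetFiles(Directory.GetCurrentDirectory())
            .Where(f => string.Equals(Path.GetExtension(f), ".pdf",
                                      StringComparison.OrdinalIgnoreCase))
            .OrderBy(f => f, StringComparer.OrdinalIgnoreCase)
            .FirstOrDefault();
    }
}
```

Calling `SourceFileLocator.Resolve(args)` at the top of `Main` and exiting with a message when it returns null keeps the rest of the program free of path handling.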

1. Download and install version 7.1 of the Muhimbi PDF Converter Services or PDF Converter for SharePoint.
2. Create a new Visual Studio C# Console application named OCR_PDF.
3. Add a Service Reference to the following URL and specify ConversionService as the namespace. If you are developing on a remote system (a system that doesn’t run the Muhimbi Conversion Service) then please see this Knowledge Base Article.
   http://localhost:41734/Muhimbi.DocumentConverter.WebService/?wsdl
4. Paste the following code into Program.cs.
5. Make sure the application’s output folder (the directory it runs from) contains an image-based PDF (e.g. a scan).
6. Compile and execute the application. The processed PDF file will automatically be opened in your system’s PDF reader. Try using your PDF reader’s search facility to find and highlight the OCRed text.
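A minimal sketch of what Program.cs might contain, based on the differences listed earlier. It is not the original sample and only compiles inside a project that has the ConversionService Service Reference added. The names `conversionSettings.OCRSettings`, `ProcessChanges` and the `ConversionService` namespace come from this post. Everything else is an assumption about the generated proxy and may differ in your version: `DocumentConverterServiceClient`, `OpenOptions`, `OutputFormat`, `OCRLanguage`, the `Language` property, the endpoint address (the WSDL URL without `?wsdl`), the binding settings and the `_ocr.pdf` output name.

```csharp
using System;
using System.Diagnostics;
using System.IO;
using System.ServiceModel;
using OCR_PDF.ConversionService; // proxy generated by "Add Service Reference"

namespace OCR_PDF
{
    class Program
    {
        // Assumed endpoint: the WSDL address without the "?wsdl" suffix.
        const string ServiceUrl =
            "http://localhost:41734/Muhimbi.DocumentConverter.WebService/";

        static void Main(string[] args)
        {
            // Path from the command line, else the first PDF in the current directory.
            string[] pdfs = Directory.GetFiles(Directory.GetCurrentDirectory(), "*.pdf");
            string sourcePath = args.Length > 0 ? args[0]
                              : pdfs.Length > 0 ? pdfs[0] : null;
            if (sourcePath == null)
            {
                Console.WriteLine("Please specify a PDF file to process.");
                return;
            }

            DocumentConverterServiceClient client = null; // assumed proxy class name
            try
            {
                byte[] source = File.ReadAllBytes(sourcePath);
                client = OpenService(ServiceUrl);

                // Describe the incoming file (assumed type and property names).
                OpenOptions openOptions = new OpenOptions();
                openOptions.OriginalFileName = Path.GetFileName(sourcePath);
                openOptions.FileExtension = Path.GetExtension(sourcePath);

                ConversionSettings conversionSettings = new ConversionSettings();
                conversionSettings.Format = OutputFormat.PDF; // assumed enum

                // Populate the OCR settings, such as the language.
                OCRSettings ocr = new OCRSettings();
                ocr.Language = OCRLanguage.English.ToString(); // assumed enum
                conversionSettings.OCRSettings = ocr;

                // ProcessChanges() is used here instead of Convert().
                byte[] result = client.ProcessChanges(source, openOptions, conversionSettings);

                // Hypothetical output name: "<source>_ocr.pdf" next to the source file.
                string outputPath = Path.Combine(
                    Path.GetDirectoryName(Path.GetFullPath(sourcePath)),
                    Path.GetFileNameWithoutExtension(sourcePath) + "_ocr.pdf");
                File.WriteAllBytes(outputPath, result);

                // Open the result in the system's default PDF reader.
                Process.Start(new ProcessStartInfo(outputPath) { UseShellExecute = true });
            }
            catch (Exception ex)
            {
                Console.WriteLine(ex.ToString());
            }
            finally
            {
                CloseService(client);
            }
        }

        // Builds the client in code so large files and slow OCR runs are not
        // cut off by the default WCF limits (sizes and timeout are assumptions).
        static DocumentConverterServiceClient OpenService(string address)
        {
            const int maxSize = 50 * 1024 * 1024;
            BasicHttpBinding binding = new BasicHttpBinding();
            binding.SendTimeout = TimeSpan.FromMinutes(30);
            binding.MaxReceivedMessageSize = maxSize;
            binding.MaxBufferSize = maxSize; // must match in buffered mode
            binding.ReaderQuotas.MaxArrayLength = maxSize;
            binding.ReaderQuotas.MaxStringContentLength = maxSize;

            var client = new DocumentConverterServiceClient(
                binding, new EndpointAddress(new Uri(address)));
            client.Open();
            return client;
        }

        // A faulted WCF channel must be aborted rather than closed.
        static void CloseService(DocumentConverterServiceClient client)
        {
            if (client == null) return;
            if (client.State == CommunicationState.Faulted) client.Abort();
            else client.Close();
        }
    }
}
```

If the generated proxy exposes different type or property names, the Object Browser view of the ConversionService reference shows the exact ones to substitute.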

As all this functionality is exposed via a Web Services interface, it works equally well from Java, PHP, Ruby and other web services enabled environments. Please note that you need the OCR & PDF/A Archiving add-on license in addition to a valid PDF Converter for SharePoint or PDF Converter Services License in order to use this functionality.

This code is merely an example of what is possible. Feel free to adapt it to your own needs.

Any questions or remarks? Leave a message in the comments below or contact us.

Labels: Articles, OCR, pdf, PDF Converter Professional, PDF Converter Services

Author: Clavin Fernandes
