About OCR
Written by PDF2XL Support
Updated over a week ago

OCR stands for Optical Character Recognition: the process by which PDF2XL, on the Business and Enterprise plans, converts images (from scanned documents or image files) into text that can then be converted into Excel.

PDF2XL mostly runs the OCR process page by page: OCR is performed once for each page when you first visit it, and once more when the document is converted.


OCR Validation

Automatically recognizing text in images is very difficult, especially in a low-quality scan, so occasional OCR mistakes are to be expected. That's why PDF2XL includes a process called OCR Validation, in which you are asked to validate words that the OCR engine is uncertain about.

OCR Validation is not mandatory, but it's highly recommended to run it before converting the document so that the output is accurate.

It's important to note that sometimes the OCR process itself will have to run again - for example, if you change something in the layout - and this will cancel any validation you might have done before. Therefore, it's recommended to fully prepare your layout, including all the details such as column formats, and only then use the OCR Validation.
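
The reason for this ordering can be illustrated with a small model. This is only a sketch under stated assumptions, not how PDF2XL is implemented: the names PageState, Document, visit, validate and change_layout are hypothetical, and it assumes a layout change resets a single page, which the text above does not specify.

```python
from dataclasses import dataclass, field
from typing import List, Set


@dataclass
class PageState:
    """Hypothetical per-page record; not PDF2XL's actual data model."""
    ocr_done: bool = False
    validated_words: Set[str] = field(default_factory=set)


class Document:
    def __init__(self, page_count: int) -> None:
        self.pages: List[PageState] = [PageState() for _ in range(page_count)]

    def visit(self, index: int) -> None:
        # OCR runs the first time a page is visited.
        self.pages[index].ocr_done = True

    def validate(self, index: int, word: str) -> None:
        # Record a word the user has confirmed or corrected.
        self.pages[index].validated_words.add(word)

    def change_layout(self, index: int) -> None:
        # Assumption: a layout change forces OCR to run again on this page,
        # which discards any validation already done there.
        page = self.pages[index]
        page.ocr_done = False
        page.validated_words.clear()


doc = Document(page_count=2)
doc.visit(0)
doc.validate(0, "Invoice")
doc.change_layout(0)                  # validation done so far is lost
print(doc.pages[0].validated_words)   # set()
```

Validating only after the layout is final avoids repeating work that a later layout change would throw away.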

Tip: Changing column formats from Automatic to a specific option may help the OCR engine read the text correctly. For example, it may be hard to tell the difference between '0' and 'O' or '1' and 'l', but if it's a numeric column then this won't be a problem.
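
To see why a column format helps, consider the simplified sketch below. It is an illustration only: the LOOKALIKES table and the read_cell function are hypothetical, and a real OCR engine weighs character probabilities rather than applying a fixed substitution table.

```python
# Hypothetical look-alike table, used only for illustration.
LOOKALIKES = str.maketrans({"O": "0", "o": "0", "l": "1", "I": "1"})


def read_cell(raw: str, column_format: str = "automatic") -> str:
    """Return the cell text, resolving look-alikes in numeric columns."""
    if column_format == "numeric":
        # A numeric column cannot contain letters, so each ambiguous
        # glyph can only be the digit it resembles.
        return raw.translate(LOOKALIKES)
    # With the automatic format there is no hint, so the text is kept as read.
    return raw


print(read_cell("1O5l", "numeric"))    # 1051
print(read_cell("1O5l", "automatic"))  # 1O5l
```

The same raw reading gives a clean number once the column is known to be numeric, while the automatic format leaves the ambiguity in place.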


Supported Languages

The Default OCR engine supports the following languages:

  • Arabic*

  • Bulgarian

  • Catalan; Valencian

  • Croatian

  • Czech

  • Danish

  • Dutch

  • English

  • Estonian

  • Finnish

  • French

  • German

  • Hungarian

  • Indonesian

  • Italian

  • Latvian

  • Lithuanian

  • Norwegian

  • Polish

  • Portuguese

  • Romanian

  • Russian

  • Slovak

  • Slovenian

  • Spanish

  • Turkish

*Arabic cannot always be converted correctly, because the result depends on the quality of the document and on the specific fonts used. Learn more about this here.

The Advanced OCR* engine supports the following languages:

  • Dutch/Flemish

  • English

  • French

  • German

  • Italian

  • Portuguese

  • Spanish/Castilian

*Advanced OCR is only available in PDF2XL Enterprise v8 and later.


The OCR Engine

  • Default OCR: Nicomsoft OCR library v7.1

  • Advanced OCR: Tesseract 4.1.1
