
About OCR

What is OCR?

Written by Team PDF2XL

OCR stands for Optical Character Recognition - the process by which PDF2XL, on the Business and Enterprise plans, converts images (from scanned documents or image files) into *text that can then be converted into Excel.

PDF2XL mostly runs the OCR process page by page: OCR is performed once for each page when you visit it for the first time, and once again while the document is converted. This can take a while if you are converting a very large document.
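To get a feel for why large documents take a while, the page-by-page behavior described above can be sketched as simple arithmetic. This is only an illustration: the function name is hypothetical, and it assumes every page is converted and that each run is strictly per page, which the article only says is "mostly" the case.

```python
def estimate_ocr_runs(total_pages: int, pages_visited: int) -> int:
    """Rough count of per-page OCR runs for one document.

    Assumption: one run for each page visited before conversion,
    plus one run for every page during conversion.
    """
    if not 0 <= pages_visited <= total_pages:
        raise ValueError("pages_visited must be between 0 and total_pages")
    return pages_visited + total_pages


# A 200-page scan where 5 pages were previewed before converting.
print(estimate_ocr_runs(total_pages=200, pages_visited=5))
```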

*The application recognizes text that appears in images; it does not turn pictures themselves (such as photos, logos, or drawings) into text!


OCR Validation

Automatically recognizing text in images is very difficult - especially in a low-quality scan - so it's not unexpected for the result to contain inaccuracies. That's why PDF2XL includes a process called OCR Validation, in which you are asked to confirm or correct some words that the OCR engine is uncertain about.
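The idea of asking about uncertain words can be pictured with a minimal sketch. It does not show how PDF2XL works internally: the `RecognizedWord` type, the 0-to-1 confidence scale, and the 0.80 threshold are all assumptions made for illustration.

```python
from typing import List, NamedTuple


class RecognizedWord(NamedTuple):
    text: str
    confidence: float  # hypothetical scale: 0.0 (no idea) to 1.0 (certain)


def words_needing_validation(
    words: List[RecognizedWord], threshold: float = 0.80
) -> List[RecognizedWord]:
    """Return the words whose confidence falls below the (assumed) threshold."""
    return [w for w in words if w.confidence < threshold]


words = [
    RecognizedWord("Invoice", 0.98),
    RecognizedWord("T0tal", 0.55),      # low confidence: '0' or 'o'?
    RecognizedWord("1,250.00", 0.91),
]
flagged = words_needing_validation(words)
print([w.text for w in flagged])
```

Only the flagged words would be shown to the user; the rest are accepted as recognized.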

The OCR Validation process is not mandatory, but it's highly recommended to do it before converting the document so that you have a more accurate output. It's not always possible to get a 100% accurate conversion even after validation.

It's important to note that sometimes the OCR process itself will have to run again - for example, if you change something in the layout - and this will cancel any validation you might have done before. Therefore, it's recommended to fully prepare your layout, including all the details such as column formats, and only then use the OCR Validation.

Tip: Changing column formats from Automatic to a specific option may help the OCR engine read the text correctly.

For example, it may be hard to tell the difference between '0' and 'O' or '1' and 'l', but if a numeric column format is selected, the OCR engine will treat the characters in that column as digits.
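A rough sketch of why a numeric format helps: once a column is known to hold numbers, look-alike letters can safely be read as digits. The function name and the mapping below are illustrative assumptions, not the OCR engine's actual logic.

```python
# Look-alike letters mapped to the digits they are commonly confused with.
# Assumption: this small table is for illustration only.
LOOKALIKES = str.maketrans({"O": "0", "o": "0", "l": "1", "I": "1"})


def normalize_numeric_cell(raw: str) -> str:
    """Reinterpret ambiguous letters as digits for a cell in a numeric column."""
    return raw.translate(LOOKALIKES)


print(normalize_numeric_cell("1O5.l0"))
print(normalize_numeric_cell("l2O"))
```

With the column format left on Automatic, there is no such hint, so "l2O" could just as well be kept as text.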


Supported Languages

The Default OCR engine supports the following languages:

*Arabic

Bulgarian

Catalan; Valencian

Croatian

Czech

Danish

Dutch

English

Estonian

Finnish

French

German

Hungarian

Indonesian

Italian

Latvian

Lithuanian

Norwegian

Polish

Portuguese

Romanian

Russian

Slovak

Slovenian

Spanish

Turkish

*Arabic text cannot always be converted correctly, as the result depends on the quality of the document and on the specific fonts used. Learn more about this here.


The *Advanced OCR engine supports the following languages:

Afrikaans

Albanian

Amharic

Arabic

Assamese

Azerbaijani

Basque

Belarusian

Bengali

Bosnian

Bulgarian

Burmese

Catalan; Valencian

Cebuano

Central Khmer

Cherokee

Chinese

Croatian

Czech

Danish

Dutch; Flemish

Dzongkha

English

Esperanto

Estonian

Finnish

Frankish

French

Galician

Georgian

German

Greek

Gujarati

Haitian; Haitian Creole

Hindi

Hungarian

Icelandic

Indonesian

Inuktitut

Irish

Italian

Japanese

Kannada

Kazakh

Kirghiz; Kyrgyz

Korean

Kurdish

Lao

Latin

Latvian

Lithuanian

Macedonian

Malay

Malayalam

Maltese

Marathi

Nepali

Norwegian

Oriya

Panjabi; Punjabi

Persian

Polish

Portuguese

Pushto; Pashto

Romanian; Moldavian; Moldovan

Russian

Sanskrit

Serbian

Sinhala; Sinhalese

Slovak

Slovenian

Spanish; Castilian

Swahili

Swedish

Syriac

Tagalog

Tajik

Tamil

Telugu

Thai

Tibetan

Tigrinya

Turkish

Uighur; Uyghur

Ukrainian

Urdu

Uzbek

Vietnamese

Welsh

Yiddish

*Advanced OCR is only available in PDF2XL Enterprise v8 and later.


The OCR Engine

  • Default OCR: Nicomsoft OCR library v7.1

  • Advanced OCR: Tesseract 5.2
