OCR stands for Optical Character Recognition - the process by which PDF2XL OCR and Enterprise convert images (from scanned documents or image files) into text that can then be converted into Excel.

PDF2XL OCR/Enterprise runs the OCR process mostly page by page: OCR is performed once for each page when you first visit it, and once again when the document is converted.

OCR Validation

Automatically recognizing text in images is difficult, especially in a low-quality scan, so occasional OCR mistakes are to be expected. That's why PDF2XL OCR/Enterprise includes a process called OCR Validation, in which you are asked to validate the words that the OCR engine is uncertain about. Images of the suspect words are displayed one after another, each with a text box beneath it containing what the OCR engine thinks is written there. You can either accept that result or, if it's wrong, correct it.
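One way to picture how words end up in OCR Validation is a confidence threshold. The following is a minimal sketch in Python, not PDF2XL's actual implementation: the `Word` type, the `words_to_validate` and `apply_corrections` functions, and the 0.80 threshold are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class Word:
    text: str          # what the OCR engine thinks is written
    confidence: float  # hypothetical score in [0, 1]; higher means more certain


def words_to_validate(words, threshold=0.80):
    """Return the words whose confidence falls below the (assumed) threshold."""
    return [w for w in words if w.confidence < threshold]


def apply_corrections(words, corrections):
    """Return the final text of every word.

    `corrections` maps a word's index in `words` to the text the user typed;
    words without an entry keep the OCR result (the user accepted it).
    """
    return [corrections.get(i, w.text) for i, w in enumerate(words)]


page = [Word("Invoice", 0.98), Word("T0tal", 0.55), Word("1,250.00", 0.91)]
suspects = words_to_validate(page)             # only the uncertain word is shown
fixed = apply_corrections(page, {1: "Total"})  # the user corrects word 1
```

In this sketch only the low-confidence word is presented for review, which mirrors the idea that you check a handful of suspect words rather than the whole page.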

The OCR Validation process is not mandatory, but it is highly recommended that you complete it before converting the document so that the output is accurate.

It's important to note that the OCR process sometimes has to run again, for example if you change something in the layout, and this discards any validation you have already done. It's therefore recommended to fully prepare your layout, including details such as column formats, and only then run OCR Validation.
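The reason a layout change cancels earlier validation can be illustrated with a small cache that is tied to the layout it was built for. This is a hedged sketch only: the `ValidationStore` class, the `layout_fingerprint` function, and the dictionary shape used for the layout are assumptions made for the example, and nothing here describes how PDF2XL stores its data internally.

```python
import hashlib
import json


def layout_fingerprint(layout):
    """Hash a layout description so that any change yields a new fingerprint."""
    canonical = json.dumps(layout, sort_keys=True)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


class ValidationStore:
    """Keeps user corrections only while the layout they were made for is unchanged."""

    def __init__(self):
        self._fingerprint = None
        self._corrections = {}

    def record(self, layout, word_id, text):
        self._sync(layout)
        self._corrections[word_id] = text

    def corrections_for(self, layout):
        self._sync(layout)
        return dict(self._corrections)

    def _sync(self, layout):
        fingerprint = layout_fingerprint(layout)
        if fingerprint != self._fingerprint:
            # The layout changed, so OCR would run again and the
            # corrections made against the old result no longer apply.
            self._fingerprint = fingerprint
            self._corrections = {}


layout = {"columns": [{"name": "Amount", "format": "Automatic"}]}
store = ValidationStore()
store.record(layout, 1, "Total")
before = store.corrections_for(layout)     # correction is still available

layout["columns"][0]["format"] = "Numeric"  # a layout change
after = store.corrections_for(layout)       # earlier validation is gone
```

The sketch shows why the order of work matters: corrections recorded before the format change are lost, while corrections recorded after the layout is final would be kept.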

Tip: Changing a column's format from Automatic to a specific option may help the OCR engine read the text correctly. For example, it can be hard to tell '0' from 'O' or '1' from 'l', but in a numeric column this ambiguity is not a problem.
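To see why a specific column format removes this kind of ambiguity, consider the sketch below. It is an illustration under stated assumptions, not the product's method: the `read_cell` function, the `LOOKALIKES` table, and the format name "Numeric" are hypothetical, and the example resolves look-alike characters as a post-processing step simply because that is the easiest way to show the idea.

```python
# Hypothetical table of letters that are easily confused with digits.
LOOKALIKES = str.maketrans({"O": "0", "o": "0", "l": "1", "I": "1"})


def read_cell(raw, column_format="Automatic"):
    """Return the cell text, resolving look-alike characters in numeric columns.

    With the (assumed) "Numeric" format, a letter cannot be valid, so each
    look-alike letter is read as the digit it resembles. With "Automatic",
    nothing is known about the column, so the raw OCR text is kept.
    """
    if column_format == "Numeric":
        return raw.translate(LOOKALIKES)
    return raw


as_numeric = read_cell("1O5l", "Numeric")  # letters are read as digits
as_automatic = read_cell("1O5l")           # raw OCR text is kept
```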

Supported Languages

The OCR engine supports the following languages:

  • Afrikaans

  • Amharic

  • **Arabic

  • Assamese

  • Azerbaijani

  • Azerbaijani - Cyrillic

  • Belarusian

  • Bengali

  • Tibetan

  • Bosnian

  • Bulgarian

  • Catalan; Valencian

  • Cebuano

  • Czech

  • Chinese - Simplified

  • Chinese - Traditional

  • Cherokee

  • Welsh

  • Danish

  • German

  • Dzongkha

  • Greek, Modern (1453-)

  • English

  • English, Middle (1100-1500)

  • Esperanto

  • Estonian

  • Basque

  • Persian

  • Finnish

  • French

  • Frankish

  • French, Middle (ca. 1400-1600)

  • Irish

  • Galician

  • Greek, Ancient (-1453)

  • Gujarati

  • Haitian; Haitian Creole

  • Hebrew

  • Hindi

  • Croatian

  • Hungarian

  • Inuktitut

  • Indonesian

  • Icelandic

  • Italian

  • Italian - Old

  • Javanese

  • Japanese

  • Kannada

  • Georgian

  • Georgian - Old

  • Kazakh

  • Central Khmer

  • Kirghiz; Kyrgyz

  • Korean

  • Kurdish

  • Lao

  • Latin

  • Latvian

  • Lithuanian

  • Malayalam

  • Marathi

  • Macedonian

  • Maltese

  • Malay

  • Burmese

  • Nepali

  • Dutch; Flemish

  • Norwegian

  • Oriya

  • Panjabi; Punjabi

  • Polish

  • Portuguese

  • Pushto; Pashto

  • Romanian; Moldavian; Moldovan

  • Russian

  • Sanskrit

  • Sinhala; Sinhalese

  • Slovak

  • Slovenian

  • Spanish; Castilian

  • Spanish; Castilian - Old

  • Albanian

  • Serbian

  • Serbian - Latin

  • Swahili

  • Swedish

  • Syriac

  • Tamil

  • Telugu

  • Tajik

  • Tagalog

  • Thai

  • Tigrinya

  • Turkish

  • Uighur; Uyghur

  • Ukrainian

  • Urdu

  • Uzbek

  • Uzbek - Cyrillic

  • Vietnamese

  • Yiddish

**Arabic recognition may be limited. Make sure the Despeckle option is set to "Do Not Apply" in Options > OCR > OCR Tweaking.

The OCR Engine

  • PDF2XL OCR uses Google's Tesseract 4.0

  • PDF2XL version 6.5 uses iDRS version 5.2.2
