OCR stands for Optical Character Recognition: the process that the Business and Enterprise plans use to convert images (from scanned documents or image files) into text that can then be converted into Excel.
PDF2XL mostly runs the OCR process page by page: OCR is performed once for each page when you visit it for the first time, and once again when the document is converted.
OCR Validation
Automatically recognizing text in images is difficult, especially in a low-quality scan, so the OCR engine will occasionally make mistakes. That's why PDF2XL includes a process called OCR Validation, in which you are asked to validate words that the OCR engine is uncertain about.
The OCR Validation process is not mandatory, but it's highly recommended that you run it before converting the document so that the output is accurate.
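To illustrate the idea of validating only uncertain words, here is a minimal Python sketch. It is not PDF2XL's actual implementation: the RecognizedWord class, the 0.0 to 1.0 confidence scale, and the 0.80 threshold are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class RecognizedWord:
    text: str
    confidence: float  # hypothetical scale: 0.0 (no confidence) to 1.0 (certain)


def words_needing_validation(words, threshold=0.80):
    """Return the words whose confidence falls below the (assumed) threshold."""
    return [w for w in words if w.confidence < threshold]


# Made-up recognition results for one page.
page = [
    RecognizedWord("Invoice", 0.98),
    RecognizedWord("T0tal", 0.55),
    RecognizedWord("1O4.50", 0.62),
    RecognizedWord("Date", 0.95),
]

flagged = words_needing_validation(page)
print([w.text for w in flagged])
```

Only the two low-confidence words would be shown to the user for review; the rest are accepted as recognized.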
Note that the OCR process itself sometimes has to run again (for example, if you change something in the layout), and this cancels any validation you have already done. Therefore, it's recommended to fully prepare your layout, including all the details such as column formats, and only then use the OCR Validation.
Tip: Changing column formats from Automatic to a specific option may help the OCR engine read the text correctly. For example, it may be hard to tell the difference between '0' and 'O' or '1' and 'l', but if it's a numeric column then this won't be a problem.
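The effect described in the tip can be sketched in a few lines of Python. This is an illustration only, not how the PDF2XL OCR engine works internally: the LOOKALIKES table, the read_cell function, and the "automatic" and "numeric" format names are assumptions made for the example.

```python
# Hypothetical table of letters that are easily confused with digits.
LOOKALIKES = str.maketrans({"O": "0", "o": "0", "l": "1", "I": "1"})


def read_cell(raw_text, column_format="automatic"):
    """Resolve look-alike characters only when the column is known to be numeric."""
    if column_format == "numeric":
        # In a numeric column, a letter that resembles a digit must be that digit.
        return raw_text.translate(LOOKALIKES)
    # With no format hint, the ambiguous characters are left as recognized.
    return raw_text


print(read_cell("1O4.5l"))             # ambiguous characters stay as they are
print(read_cell("1O4.5l", "numeric"))  # ambiguity resolved in favor of digits
```

Knowing the column type removes the ambiguity before any guessing is needed, which is why setting the format first can reduce the number of words you have to validate.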
Supported Languages
The Default OCR engine supports the following languages:
Arabic*
Bulgarian
Catalan; Valencian
Croatian
Czech
Danish
Dutch
English
Estonian
Finnish
French
German
Hungarian
Indonesian
Italian
Latvian
Lithuanian
Norwegian
Polish
Portuguese
Romanian
Russian
Slovak
Slovenian
Spanish
Turkish
*Arabic cannot always be converted correctly, as the result depends on the quality of the document and on the specific fonts used. Learn more about this here.
The *Advanced OCR engine supports the following languages:
Afrikaans
Albanian
Amharic
Arabic
Assamese
Azerbaijani
Basque
Belarusian
Bengali
Bosnian
Bulgarian
Burmese
Catalan; Valencian
Cebuano
Central Khmer
Cherokee
Chinese
Croatian
Czech
Danish
Dutch; Flemish
Dzongkha
English
Esperanto
Estonian
Finnish
Frankish
French
Galician
Georgian
German
Greek
Gujarati
Haitian; Haitian Creole
Hindi
Hungarian
Icelandic
Indonesian
Inuktitut
Irish
Italian
Japanese
Kannada
Kazakh
Kirghiz; Kyrgyz
Korean
Kurdish
Lao
Latin
Latvian
Lithuanian
Macedonian
Malay
Malayalam
Maltese
Marathi
Nepali
Norwegian
Oriya
Panjabi; Punjabi
Persian
Polish
Portuguese
Pushto; Pashto
Romanian; Moldavian; Moldovan
Russian
Sanskrit
Serbian
Sinhala; Sinhalese
Slovak
Slovenian
Spanish; Castilian
Swahili
Swedish
Syriac
Tagalog
Tajik
Tamil
Telugu
Thai
Tibetan
Tigrinya
Turkish
Uighur; Uyghur
Ukrainian
Urdu
Uzbek
Vietnamese
Welsh
Yiddish
*Advanced OCR is only available in PDF2XL Enterprise v8+.
The OCR Engine
Default OCR: Nicomsoft OCR library v7.1
Advanced OCR: Tesseract 4.1.1