OCR stands for Optical Character Recognition: the process by which the Business and Enterprise plans turn images (scanned documents or image files) into text* that can then be converted to Excel.
PDF2XL mostly runs the OCR process page by page: OCR is performed once for each page when you visit it for the first time, and once again while the document is converted. This can take a while if you are converting a very large document.
*The application does not transcribe pictures into text!
OCR Validation
The process of automatically recognizing text in images is very difficult - especially in a low quality scan - so it's not unexpected for the result to have inaccuracies. That's why PDF2XL includes a process called OCR Validation, where the user is asked to validate some words that the OCR engine is uncertain about.
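The idea behind picking words for validation can be sketched as follows. This is only an illustration, not PDF2XL's actual implementation: the `OcrWord` type, the 0-to-1 confidence scale, and the 0.80 threshold are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class OcrWord:
    text: str
    confidence: float  # assumed scale: 0.0 (no confidence) to 1.0 (certain)


def words_to_validate(words, threshold=0.80):
    """Return the words whose confidence is below the (hypothetical) threshold."""
    return [w for w in words if w.confidence < threshold]


# A made-up page: one word was read with low confidence.
page = [OcrWord("Invoice", 0.98), OcrWord("T0tal", 0.42), OcrWord("Amount", 0.91)]
print([w.text for w in words_to_validate(page)])  # ['T0tal']
```

Only the low-confidence word is offered to the user; the words the engine is sure about pass through untouched.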
The OCR Validation process is not mandatory, but it's highly recommended to do it before converting the document so that you have a more accurate output. It's not always possible to get a 100% accurate conversion even after validation.
It's important to note that sometimes the OCR process itself will have to run again - for example, if you change something in the layout - and this will cancel any validation you might have done before. Therefore, it's recommended to fully prepare your layout, including all the details such as column formats, and only then use the OCR Validation.
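Why the order matters can be shown with a small model. The class and method names below are hypothetical and exist only for this sketch; the one behavior taken from the text above is that a layout change reruns OCR and discards earlier validation.

```python
class ConversionSession:
    """Toy model of a document session (all names here are hypothetical)."""

    def __init__(self):
        self.column_formats = {}
        self.validated = {}  # recognized text -> text confirmed by the user

    def set_column_format(self, column, fmt):
        self.column_formats[column] = fmt
        # A layout change makes OCR run again, so earlier validation is lost.
        self.validated.clear()

    def validate(self, recognized, confirmed):
        self.validated[recognized] = confirmed


session = ConversionSession()
session.validate("1O5", "105")                     # validated too early
session.set_column_format("Amount", "numeric")     # layout change afterwards
print(session.validated)  # {}
```

Reversing the two calls (format first, validation second) keeps the correction, which is the workflow recommended above.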
Tip: Changing column formats from Automatic to a specific option may help the OCR engine read the text correctly.
For example, it may be hard to tell the difference between '0' and 'O' or between '1' and 'l', but if a numeric column format is selected, the OCR engine will treat the characters as digits.
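A minimal sketch of how a column format can resolve look-alike characters. It does not reflect how the OCR engine works internally: the `LOOKALIKES` mapping, the `read_cell` function, and the format names are assumptions chosen for illustration.

```python
# Hypothetical mapping of letters that are easily confused with digits.
LOOKALIKES = str.maketrans({"O": "0", "o": "0", "l": "1", "I": "1"})


def read_cell(raw: str, column_format: str = "automatic") -> str:
    """Return the cell text, reading look-alike letters as digits in numeric columns."""
    if column_format == "numeric":
        return raw.translate(LOOKALIKES)
    return raw  # no format hint: keep the characters as recognized


print(read_cell("1O5l", "numeric"))  # 1051
print(read_cell("1O5l"))             # 1O5l
```

With the format left on Automatic there is no hint about which reading is correct, so the ambiguous characters are returned as recognized.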
Supported Languages
The Default OCR engine supports the following languages:
*Arabic | Bulgarian | Catalan; Valencian |
Croatian | Czech | Danish |
Dutch | English | Estonian |
Finnish | French | German |
Hungarian | Indonesian | Italian |
Latvian | Lithuanian | Norwegian |
Polish | Portuguese | Romanian |
Russian | Slovak | Slovenian |
Spanish | Turkish |
*Arabic text cannot always be converted correctly, because the result depends on the quality of the document and on the specific fonts used. Learn more about this here.
The *Advanced OCR engine supports the following languages:
Afrikaans | Albanian | Amharic |
Arabic | Assamese | Azerbaijani |
Basque | Belarusian | Bengali |
Bosnian | Bulgarian | Burmese |
Catalan; Valencian | Cebuano | Central Khmer |
Cherokee | Chinese | Croatian |
Czech | Danish | Dutch; Flemish |
Dzongkha | English | Esperanto |
Estonian | Finnish | Frankish |
French | Galician | Georgian |
German | Greek | Gujarati |
Haitian; Haitian Creole | Hindi | Hungarian |
Icelandic | Indonesian | Inuktitut |
Irish | Italian | Japanese |
Kannada | Kazakh | Kirghiz; Kyrgyz |
Korean | Kurdish | Lao |
Latin | Latvian | Lithuanian |
Macedonian | Malay | Malayalam |
Maltese | Marathi | Nepali |
Norwegian | Oriya | Panjabi; Punjabi |
Persian | Polish | Portuguese |
Pushto; Pashto | Romanian; Moldavian; Moldovan | Russian |
Sanskrit | Serbian | Sinhala; Sinhalese |
Slovak | Slovenian | Spanish; Castilian |
Swahili | Swedish | Syriac |
Tagalog | Tajik | Tamil |
Telugu | Thai | Tibetan |
Tigrinya | Turkish | Uighur; Uyghur |
Ukrainian | Urdu | Uzbek |
Vietnamese | Welsh | Yiddish |
*Advanced OCR is only available in PDF2XL Enterprise v8 and later.
The OCR Engine
Default OCR: Nicomsoft OCR library v7.1
Advanced OCR: Tesseract 5.2
