Our Enterprise v7+ edition has two OCR engines.
The PixelPerfect OCR is a basic, automated OCR tool that only needs to be turned on using the Start button on the OCR ribbon.
The xRay OCR is a more advanced OCR engine that includes OCR Tweaking settings to help increase accuracy on poorly scanned PDF files. It must be enabled in the OCR settings, after turning on the OCR from the OCR ribbon as described above.
Under the OCR Tweaking section, you'll see the following features:
- Threshold: You can either allow PDF2XL to select an automatic monochrome threshold, or set it manually. Change this setting if the scanned page is either very light or very dark.
- Despeckle: By setting this option you can make the OCR process ignore small dots and imperfections in the scanned image. If the scanned document has a lot of 'noise', this option can help enormously. To use it, check the despeckle box, and select the maximum size of the dot to remove. Moving the bar to the right will make PDF2XL remove larger and larger 'dots', up to removing quite sizable chunks.
- Remove lines: If this option is set, PDF2XL will try to remove vertical and horizontal lines before processing the image. This is mostly useful when trying to process an image scanned from old computer print-out papers that have pre-printed lines on them.
- Force DPI: This overrides the dots per inch (DPI) used for the page to try to give the OCR a clearer image. The higher you move the slider, the clearer the page looks to the OCR.
There is no set level for any of these settings, as all scanned PDF files are different, so you need to play around with the sliders until you have the best possible result.
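To make the Threshold and Despeckle settings more concrete, here is a minimal Python sketch of the general techniques those names usually refer to. It is not PDF2XL's actual implementation: the functions `binarize` and `despeckle`, the sample `page` grid, and the 0 to 255 grayscale convention are all assumptions made for illustration.

```python
from collections import deque

def binarize(gray, threshold):
    """Turn grayscale pixels (0 = black, 255 = white) into 1 (ink) or 0 (paper)."""
    return [[1 if px < threshold else 0 for px in row] for row in gray]

def despeckle(mono, max_size):
    """Erase connected groups of ink that contain max_size pixels or fewer."""
    height, width = len(mono), len(mono[0])
    cleaned = [row[:] for row in mono]
    seen = [[False] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            if mono[y][x] == 0 or seen[y][x]:
                continue
            # Flood-fill to collect one 4-connected blob of ink pixels.
            blob, queue = [], deque([(y, x)])
            seen[y][x] = True
            while queue:
                cy, cx = queue.popleft()
                blob.append((cy, cx))
                for ny, nx in ((cy - 1, cx), (cy + 1, cx), (cy, cx - 1), (cy, cx + 1)):
                    if (0 <= ny < height and 0 <= nx < width
                            and mono[ny][nx] == 1 and not seen[ny][nx]):
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            # Small blobs are treated as noise and removed.
            if len(blob) <= max_size:
                for by, bx in blob:
                    cleaned[by][bx] = 0
    return cleaned

# Hypothetical 5x6 scan: a dark 2x3 stroke, one dark speck, one faint mark.
page = [
    [255, 255, 255, 255, 255, 255],
    [255,  40,  40, 255, 255,  90],
    [255,  40,  40, 255, 255, 255],
    [255,  40,  40, 255, 255, 255],
    [255, 255, 255, 255, 200, 255],
]

mono = binarize(page, threshold=128)   # the faint mark (200) is dropped
cleaned = despeckle(mono, max_size=1)  # the single-pixel speck is removed
```

In this sketch, raising `threshold` turns lighter marks into ink, which can help with a very light scan, while raising `max_size` removes larger and larger blobs. With `max_size=6` the 2x3 stroke itself disappears, which is why a high Despeckle setting can start erasing real characters.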
If you've set the OCR Tweaking to the best possible result and you still have some errors, there are a few more things you can try.
- You can try changing the column/field format. For example, selecting Text forces the OCR to read an ambiguous character as the letter 'O' rather than the digit '0'.
- You can run the Validation. This steps through the words that the OCR had difficulty recognizing and lets you correct them before converting your file. Note that it looks for all instances of poor character recognition, so it can be a lengthy process if you have a very large, poorly scanned file.
- If you have the PDF2XL Enterprise edition, you can use the Replace List to correct repeating instances of wrongly recognized words. For example, if your document has the word "auto" several times, and the OCR recognizes it as "aut0" every time, the Replace List will make correcting them all at once quick and easy.
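The idea behind a replace list can be sketched in a few lines of Python. This is only an illustration of the concept, not how PDF2XL implements the Replace List: the function `apply_replace_list`, the `replace_list` dictionary, and the sample text are hypothetical.

```python
import re

def apply_replace_list(text, replacements):
    """Replace every whole-word occurrence of each misread word."""
    for wrong, right in replacements.items():
        # Match the misread word only when it is not part of a longer word.
        pattern = r"(?<!\w)" + re.escape(wrong) + r"(?!\w)"
        # A function replacement inserts `right` literally, with no escape handling.
        text = re.sub(pattern, lambda _match, r=right: r, text)
    return text

replace_list = {"aut0": "auto"}  # misread word -> corrected word
ocr_text = "aut0 parts, aut0 repair, invoice 100"
fixed = apply_replace_list(ocr_text, replace_list)
print(fixed)  # auto parts, auto repair, invoice 100
```

Matching whole words only keeps genuine values such as "100" untouched while every "aut0" is corrected in a single pass.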