A Compelling Study Of OCR: Accuracy Matters

November 13, 2020

A major problem many businesses face today is the inability to leverage data trapped inside scanned documents and images. When a business relies on such data, manually re-keying it becomes the biggest bottleneck and slows down every process that depends on it.

In such scenarios, we need data-entry automation that helps extract information from scanned documents and automate document-based business processes.

But the problem is two-fold. The challenge is not just to extract information from scanned documents but also to extract it accurately. This becomes even more challenging when the data inside these scanned documents and images is tabular and graphical in nature and needs to be structured correctly.

Quick Overview of the Extraction Process

A. OCR: Optical Character Recognition (OCR) is needed to extract data accurately from scanned documents. A combination of pattern recognition and image processing techniques is used to convert batches of scanned files to Excel sheets or another document format.

B. Information extraction (IE): Information extraction is the process of extracting domain-specific information from textual data sources. Bank statements, legal acts, corporate reports, medical records, and government documents are examples of free-flowing textual sources from which information extraction can produce structured information.
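As a rough illustration of the IE step, the sketch below pulls a few fields out of OCR text with regular expressions. The sample text, the field names, and the patterns are all assumptions made for this example. Real documents need patterns tuned to each template, and production systems often use more robust techniques than regular expressions.

```python
import re

# Hypothetical OCR output for a receipt; real engine output will vary.
OCR_TEXT = """ACME STORE
Date: 13/11/2020
Invoice No: INV-00421
Total: 1,249.50
"""

# Illustrative patterns only (assumed field labels and formats).
FIELD_PATTERNS = {
    "date": re.compile(r"Date:\s*(\d{2}/\d{2}/\d{4})"),
    "invoice_no": re.compile(r"Invoice No:\s*([A-Z]+-\d+)"),
    "total": re.compile(r"Total:\s*([\d,]+\.\d{2})"),
}


def extract_fields(text):
    """Return a dict of field name -> matched string, or None if not found."""
    fields = {}
    for name, pattern in FIELD_PATTERNS.items():
        match = pattern.search(text)
        fields[name] = match.group(1) if match else None
    return fields


record = extract_fields(OCR_TEXT)
print(record)
```

A field that comes back as None signals that the pattern did not match, which is a useful cue to route the document to manual review rather than store a blank value silently.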

Gathering structured data from texts, Information Extraction enables -

Factors Affecting OCR

Receipts are often printed on thermal paper by a basic printer. A receipt that gets wrinkled or folded produces a lower-quality scanned image. Try to use the highest-quality document possible. If an original is of low quality (for example, the ink is too light, the paper is not flat and white, or the text otherwise does not contrast strongly with the background), an OCR engine will have a difficult time distinguishing the text from the noise surrounding it.

Sometimes the image is not clear: the OCR engine detects that there is data but cannot read it accurately.
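One common way to handle this case is to look at the confidence an engine reports for each word and send low-confidence words to a person for review. The sketch below assumes a hypothetical result format of (text, confidence) pairs on a 0 to 100 scale and an arbitrary threshold of 60. Actual engines expose confidence in their own formats, so treat this as an outline rather than any specific library's API.

```python
# Hypothetical word-level results: (text, confidence on a 0-100 scale).
# The format and the values are assumptions made for illustration.
words = [("Total", 96), ("1,249.50", 91), ("Th@nk", 42), ("you", 88)]


def split_by_confidence(results, threshold=60):
    """Separate confidently read words from those needing manual review."""
    accepted = [text for text, conf in results if conf >= threshold]
    flagged = [text for text, conf in results if conf < threshold]
    return accepted, flagged


accepted, flagged = split_by_confidence(words)
print("accepted:", accepted)
print("needs review:", flagged)
```

The threshold is a trade-off: a higher value sends more words to manual review, while a lower value lets more misread text through.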

Similarly, the quality of the document image acquired for OCR influences the quality of the OCR output. Accuracy drops if the original document is:

  1. Wrinkled, torn, or otherwise damaged; faded or otherwise aged; discoloured or smudged (or the text is otherwise obscured or distorted)
  2. Printed with low-contrast or coloured ink
  3. Rendered with nonstandard fonts or in human handwriting
  4. Printed on types of paper that reduce crispness and contrast between the background and foreground in the resulting scan
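A simple pre-check can flag low-contrast scans before they reach the OCR engine. The sketch below computes Michelson contrast, (max - min) / (max + min), over a grayscale image held as nested lists of 0 to 255 intensities. The tiny sample images and the 0.5 cutoff are assumptions chosen for illustration, and a real pipeline would read pixels through an imaging library and calibrate the cutoff on its own documents.

```python
def michelson_contrast(pixels):
    """Michelson contrast of a grayscale image given as rows of 0-255 values."""
    values = [v for row in pixels for v in row]
    lo, hi = min(values), max(values)
    if hi + lo == 0:
        return 0.0  # an all-black image has no measurable contrast
    return (hi - lo) / (hi + lo)


# Hypothetical 2x4 images: dark ink on white paper vs. light ink on grey paper.
crisp = [[255, 255, 0, 255], [255, 0, 0, 255]]
faded = [[200, 200, 150, 200], [200, 150, 150, 200]]

MIN_CONTRAST = 0.5  # assumed cutoff, to be tuned per document type

for name, image in (("crisp", crisp), ("faded", faded)):
    score = michelson_contrast(image)
    verdict = "ok" if score >= MIN_CONTRAST else "rescan recommended"
    print(f"{name}: contrast={score:.2f} -> {verdict}")
```

Because this measure depends only on the darkest and brightest pixel, a single speck of noise can inflate it. A measure based on percentiles or on the standard deviation of intensities is more robust on real scans.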

DOs:

DON’Ts:

Factors Affecting Information Localisation & Extraction

  1. Skew & Orientation
    • There are a variety of circumstances in which it is useful to determine the text skew and orientation:
    • Improves text recognition. Many systems will fail if presented with text oriented sideways or upside-down. Performance of recognition systems also degrades if the skew is more than a few degrees.
    • Simplifies interpretation of page layout. It is easier to identify text lines and text columns if the image skew is known or the image is de-skewed.
    • Improves baseline determination. The text line baselines can be found more robustly if the skew angle is accurately known.
    • Improves visual appearance. Images can be displayed or printed after rotation to remove skew. Multi-page document images can also be oriented consistently, regardless of scan orientation or the design of duplex scanners.
  2. OCR Errors
    • Many errors generated by OCR systems can be traced back to low-quality scanning or deteriorated printed materials. Sometimes the original physical paper is no longer available for rescanning.
    • In some documents, content visible in the document image is partially or completely lost in the OCR process. Useful context information is occasionally lost as well.
    • Some documents contain handwritten content that the OCR engine does not recognise. Recognising handwritten content therefore requires an entirely different approach.
  3. Document Structure
    • Extracting data from templated documents such as bills, receipts, and insurance forms is extremely common and critical in a diverse range of business workflows.
    • Very large or very small fonts in the document can affect data extraction.
    • In some documents, text is placed at irregular locations, which can affect OCR processing.
    • The format of the text is a very important factor. If the document is poorly formatted, extraction may return the wrong information or leave the text field blank.
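To make the skew point from the list above concrete, the sketch below estimates a skew angle by fitting a least-squares line through points sampled along one text baseline. The baseline coordinates are hypothetical, and the step that finds them (for example, locating the bottom edges of characters) is assumed to happen elsewhere. Production systems typically use other methods such as projection profiles or Hough transforms, so this is only one simple way to illustrate the idea.

```python
import math


def estimate_skew_degrees(points):
    """Fit a least-squares line to (x, y) baseline points; return its angle in degrees."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    if sxx == 0:
        raise ValueError("points must span more than one x position")
    # atan2 of the slope components gives the angle of the fitted line.
    return math.degrees(math.atan2(sxy, sxx))


# Hypothetical baseline samples: y increases by 1 pixel for every 20 pixels of x.
baseline = [(0, 100.0), (20, 101.0), (40, 102.0), (60, 103.0)]

angle = estimate_skew_degrees(baseline)
print(f"estimated skew: {angle:.2f} degrees")
```

For the sample points the estimate is roughly 2.86 degrees. In image coordinates y usually grows downward, so the sign convention should be checked before rotating the page by the negative of this angle to de-skew it.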

DOs:

DON’Ts:

OCR is a critical component of a workflow management and automation system. It improves productivity and bridges the gap between humans and machines by enabling access to crucial data trapped in important documents.
