The document is read by your scanner. The scanner acts as the "eye" of your computer and sends it the image in a digitized form. At this point, the scanned image is no more than a meaningless cloud of intense points ("pixels") on a lighter background.
Intelligent binarization routines convert color and greyscale images into black-and-white images.
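The text does not say which binarization method the software uses. As an illustration only, the sketch below applies Otsu's method, a well-known global thresholding technique, to a greyscale image held as a list of rows of 0-255 values. The function names `otsu_threshold` and `binarize` are made up for this example.

```python
def otsu_threshold(gray):
    """Return the grey level (0-255) that maximizes between-class variance."""
    # Build a histogram of grey levels.
    hist = [0] * 256
    for row in gray:
        for value in row:
            hist[value] += 1

    total = sum(hist)
    sum_all = sum(level * count for level, count in enumerate(hist))

    best_t, best_var = 0, -1.0
    w_dark = 0    # number of pixels at or below the candidate threshold
    sum_dark = 0  # sum of grey levels of those pixels
    for t in range(256):
        w_dark += hist[t]
        if w_dark == 0:
            continue
        w_light = total - w_dark
        if w_light == 0:
            break
        sum_dark += t * hist[t]
        mean_dark = sum_dark / w_dark
        mean_light = (sum_all - sum_dark) / w_light
        # Between-class variance, up to a constant factor.
        var = w_dark * w_light * (mean_dark - mean_light) ** 2
        if var > best_var:
            best_var, best_t = var, t
    return best_t


def binarize(gray):
    """Map dark pixels to 1 (ink) and light pixels to 0 (background)."""
    t = otsu_threshold(gray)
    return [[1 if value <= t else 0 for value in row] for row in gray]


if __name__ == "__main__":
    sample = [[20, 30, 200],
              [210, 25, 220]]
    for row in binarize(sample):
        print(row)
```

A single global threshold is the simplest choice. Unevenly lit or colored pages usually need a locally adaptive threshold, which is presumably what "intelligent" refers to here.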
The OCR software extracts text information from the black-and-white pixels: it recognizes the shapes and assigns characters.
Line segmentation consists of slicing a page of text into its individual lines. This step also analyses line skew, interline spacing and drop letters, and separates touching lines.
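One common way to slice a page into lines is a horizontal projection: rows that contain no ink are treated as the gaps between lines. The sketch below assumes a black-and-white image given as rows of 0/1 values (1 for ink) and a page that is already deskewed with no touching lines. The function name `segment_lines` is hypothetical, and this is not necessarily how the product does it.

```python
def segment_lines(binary):
    """Return (top, bottom) row ranges of text lines, bottom exclusive."""
    lines = []
    start = None  # first row of the line currently being scanned, if any
    for y, row in enumerate(binary):
        has_ink = any(row)
        if has_ink and start is None:
            start = y                  # a new text line begins
        elif not has_ink and start is not None:
            lines.append((start, y))   # a blank row closes the line
            start = None
    if start is not None:
        lines.append((start, len(binary)))  # line runs to the page bottom
    return lines


if __name__ == "__main__":
    page = [
        [0, 0, 0],
        [1, 1, 0],
        [0, 1, 0],
        [0, 0, 0],
        [1, 0, 1],
        [0, 0, 0],
    ]
    print(segment_lines(page))
```

Skewed scans and touching lines leave no fully blank rows, which is why the skew and spacing analysis mentioned above is needed in practice.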
Word segmentation isolates one word from another.
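A simple way to picture word segmentation: within one text line, find the columns that contain ink, then treat any blank gap wider than some threshold as a space between words. The sketch below assumes the line is a list of rows of 0/1 values (1 for ink). The names `column_runs` and `segment_words` and the default `min_gap` of 3 pixels are assumptions made for this example.

```python
def column_runs(line):
    """Return (start, end) column ranges that contain ink, end exclusive."""
    width = len(line[0])
    inked = [any(row[x] for row in line) for x in range(width)]
    runs = []
    start = None
    for x, on in enumerate(inked):
        if on and start is None:
            start = x
        elif not on and start is not None:
            runs.append((start, x))
            start = None
    if start is not None:
        runs.append((start, width))
    return runs


def segment_words(line, min_gap=3):
    """Merge ink runs separated by fewer than min_gap blank columns."""
    words = []
    for start, end in column_runs(line):
        if words and start - words[-1][1] < min_gap:
            # Narrow gap: same word, so extend the previous range.
            words[-1] = (words[-1][0], end)
        else:
            words.append((start, end))
    return words


if __name__ == "__main__":
    line = [[1, 0, 1, 0, 0, 0, 1, 1]]
    print(segment_words(line))
```

A fixed pixel threshold is the crude part of this sketch. A real system would more likely derive the gap size from the font size or from the distribution of gaps on the line.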
The character segmentation separates the various letters of a word. If the characters have the same width (fixed pitch), this step is easy. The problem gets more interesting when the width of the letters depends on their shape (proportional pitch), when kerning occurs and when dot matrix fonts are used.
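To see why the fixed-pitch case is easy: once the character width (the pitch) is known, a word can be cut into equal cells with plain arithmetic. The sketch below assumes the word's column range and the pitch in pixels are already known. The function name `split_fixed_pitch` is hypothetical.

```python
def split_fixed_pitch(start, end, pitch):
    """Cut the column range [start, end) into cells of width `pitch`."""
    if pitch <= 0:
        raise ValueError("pitch must be positive")
    cells = []
    x = start
    while x < end:
        # The last cell is clipped if the range is not a multiple of pitch.
        cells.append((x, min(x + pitch, end)))
        x += pitch
    return cells


if __name__ == "__main__":
    # A three-letter word starting at column 10 with 8-pixel-wide characters.
    print(split_fixed_pitch(10, 34, 8))
```

With proportional fonts there is no such constant, and with kerning two neighbouring letters can overlap horizontally, so neither equal cells nor blank columns give reliable cuts.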
Character recognition extracts characteristic features from each isolated shape and assigns a symbol to it. The three most important stages are the autolearning phase, topological analysis and the optional interactive phase. During recognition, linguistic knowledge is used to validate correct solutions and flag suspicious ones.
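The text does not describe how the linguistic check works. As a deliberately simple illustration, the sketch below validates recognized words against a word list and flags the ones it does not find. The function name `flag_suspicious` and the tiny lexicon are invented for this example; real linguistic validation is considerably richer than a lookup.

```python
def flag_suspicious(words, lexicon):
    """Pair each recognized word with True if it is found in the lexicon."""
    known = {entry.lower() for entry in lexicon}
    results = []
    for word in words:
        # Ignore case and surrounding punctuation when looking the word up.
        key = word.lower().strip(".,;:!?")
        results.append((word, key in known))
    return results


if __name__ == "__main__":
    recognized = ["The", "qu1ck", "fox."]
    lexicon = ["the", "quick", "fox"]
    for word, ok in flag_suspicious(recognized, lexicon):
        print(word, "ok" if ok else "suspicious")
```

Here "qu1ck" is flagged because the digit 1 was read in place of the letter i, a typical confusion that a word-level check can catch even when the shape match looked plausible.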
Extra steps are undertaken for business cards as the recognized data gets assigned to specific database fields (Readiris Corporate).