Fondren Library

OCR (Text Scanning)

Guide to scanning text with TextBridge and OmniPage Pro

Contents:

  1. Introduction
  2. Recognition Accuracy
  3. Scanning Text with TextBridge
  4. Scanning Text with OmniPage Pro
  5. A Final Word



Introduction

Scanning technology allows you to scan a printed image, a picture or text, into a computer file. By using Optical Character Recognition software, you may transform a paper document into an electronic file for use in a word processing program or on the web. OCR also allows you to incorporate printed images and text into electronic documents.

Currently, the Digital Media Center scanning facilities consist of PC and Macintosh workstations equipped with color scanners and the following software:

  • OmniPage Pro 10.0, for scanning and recognizing text. This software is particularly good for multilingual documents, columns, and some non-Western European languages.
  • TextBridge, for scanning and recognizing text. It's a good choice for English text in a clear copy and modern typeface. It also scans modern texts in French, Spanish, and German well.



Recognition Accuracy

Optical Character Recognition (OCR) converts scanned images into text. It works well on most 20th-century and 19th-century typefaces. With earlier printed material, or with poor reproductions of any typeface, the OCR software begins to encounter time-consuming obstacles. Broken letters, ligatures, digraphs, uneven inking, and antiquated letters may not be recognized by the software, and each unrecognized character adds time to the proofing and correction stage of your project.

Even though 95% accuracy seems quite good and 99% accuracy looks excellent, remember that this is a measure of accuracy per character. So 95% accuracy actually means that 1 in every 20 characters is misrecognized. Given an average word length of 5 characters, that is a mistake every 4 words. Even 99% accuracy means that 1 character in 100, or about one word in 20, is incorrect.
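
The arithmetic above can be checked with a short calculation. The sketch below is illustrative only: it assumes that character errors occur independently and that every word has exactly five characters, so a word is correct only when all five of its characters are. The function names are hypothetical and not part of any OCR package.

```python
def word_error_rate(char_accuracy: float, word_length: int = 5) -> float:
    """Probability that a word contains at least one misread character.

    Assumes each character is recognized independently of the others.
    """
    return 1.0 - char_accuracy ** word_length


def words_per_error(char_accuracy: float, word_length: int = 5) -> float:
    """Average number of words read for each word that contains an error."""
    return 1.0 / word_error_rate(char_accuracy, word_length)


if __name__ == "__main__":
    for accuracy in (0.95, 0.99):
        print(f"{accuracy:.0%} per character: about one faulty word in "
              f"{words_per_error(accuracy):.1f}")
```

Under these assumptions, 95% accuracy gives roughly one faulty word in every four or five, and 99% gives roughly one in twenty, in line with the estimates above.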

Anything that disrupts the integrity of a letter's shape is a potential cause of error, although the software has some ability to compensate. Breaks in letters (and sometimes ornate italics) can cause what you will come to recognize as distinctive OCR errors: for instance, a d read as cl, a 1 or ! read as l, an m read as in, or an e read as c.
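
Because these misreadings are so regular, they can be searched for during proofreading. The following is a minimal sketch, not a feature of TextBridge or OmniPage: it reverses each of the confusions listed above, one at a time, and keeps the results that appear in a word list you supply. The names CONFUSIONS, candidates, and suggest are hypothetical.

```python
# (what the OCR produced, what was probably printed), taken from the
# examples above: d read as cl, 1 or ! read as l, m read as in, e read as c.
CONFUSIONS = [("cl", "d"), ("l", "1"), ("l", "!"), ("in", "m"), ("c", "e")]


def candidates(word: str) -> set[str]:
    """Return every string obtained by undoing one confusion at one position."""
    results = set()
    for seen, intended in CONFUSIONS:
        start = word.find(seen)
        while start != -1:
            results.add(word[:start] + intended + word[start + len(seen):])
            start = word.find(seen, start + 1)
    return results


def suggest(word: str, vocabulary: set[str]) -> list[str]:
    """Return the candidate corrections that are known words, sorted."""
    return sorted(candidates(word) & set(vocabulary))


if __name__ == "__main__":
    vocabulary = {"dog", "time", "the"}
    for misread in ("clog", "tiine", "thc"):
        print(misread, "->", suggest(misread, vocabulary))
```

A sketch like this only proposes corrections; a word such as "clog" is itself valid English, so the final decision still belongs to the proofreader.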

You can enhance accuracy by using clear text or photocopies. A new book will scan best. Books in relatively good condition will always scan better than photocopies. A little experimenting at first can result in a lower error rate (and therefore less to correct in proofreading). Your results should be good with most modern typefaces, but even with clean text of a decent type size there will be occasional errors; the error rate increases as the text's size and clarity decrease. Altering the brightness and resolution can improve results, but little can be done with a badly faded photocopy or a 17th- or 18th-century typeface.




Scanning Text with TextBridge

TextBridge is an OCR application that works well with modern typefaces, with documents in a single Western European language, or with documents in a simple format. You will have the option to save the resulting text in a variety of popular word-processing formats. At the scanning station, click on the Apple Menu and select TextBridge to launch the software.

To select OCR settings:

  1. You can use the Tool Bar or the Menus to control the OCR process.
  2. The Input Layout, Output Layout, Original Quality, and Page Orientation should be set to Automatic. This means, essentially, that TextBridge will interpret the document automatically from the item you've scanned. In most cases, Automatic will be the best option.
  3. If you have a text that is not 8.5 x 11, includes columns or tables, is faint or hard to read, contains multiple fonts, or is not in English, you should adjust the settings to recognize your document.
  4. First, select the File menu, Process from Scanner.
  5. Then, go to the Scanner menu and select a Page Size: Letter, Legal, A4 (look on the scanner bed for an idea of these sizes), or the full size of the scanner bed.
  6. From this menu, you can also set the Brightness. Although TextBridge will select a brightness setting automatically, you will usually get better results by adjusting it yourself. A few moments of testing here may save you hours of editing time later. TextBridge tends to run dark, which is good for recognizing lighter characters but will also pick up other marks on the page.
  7. You may also adjust resolution from the Scanner menu. Adjusting to a higher resolution (400 dpi) is a good idea for complicated or damaged text, but will significantly slow scan time. In most cases, 300 dpi or lower is fine and it's faster.
  8. If you are using the Sheet Feeder, you should also make that selection from the Scanner menu.
  9. Next, proceed to the Recognize menu. From this menu, you can select Input Layout (number of columns, text and pictures), Output Layout (number of columns, text and pictures), Original Quality (normal, dot matrix, fax), and Page Orientation (portrait, landscape).
  10. Most importantly, you may choose a Recognition Language from this menu: French, Italian, Spanish, or German. Other languages using a Latin-based character system also scan well enough, although diacritical marks will have to be added in the editing process.
  11. If you have been scanning other documents with similar fonts, in the same language, or using a similar vocabulary, and have created a Training File, you can also select that file from this menu.
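
To see why the higher resolution mentioned in step 7 slows scanning, it helps to count the pixels the scanner has to capture. This is a rough sketch that assumes a letter-size page (8.5 x 11 inches); pixel count is only a proxy for scan and recognition time, and the function name is hypothetical.

```python
def pixels_per_page(dpi: int, width_in: float = 8.5, height_in: float = 11.0) -> int:
    """Number of pixels captured when a page is scanned at the given resolution."""
    return round(width_in * dpi) * round(height_in * dpi)


if __name__ == "__main__":
    low = pixels_per_page(300)
    high = pixels_per_page(400)
    print(f"300 dpi: {low:,} pixels")
    print(f"400 dpi: {high:,} pixels ({high / low:.2f} times as many)")
```

Moving from 300 to 400 dpi raises the pixel count by a factor of (400/300) squared, or about 1.78, which is why the higher setting is best reserved for complicated or damaged text.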

To create a training file:

  1. Creating a training file is an excellent option if you will be scanning a suite of texts with similar fonts, diacritical marks, or other features. The training file will teach TextBridge to recognize the idiosyncratic ways of your documents.
  2. A training file is not a good idea for clear text in a single font or for short projects, since it will add an extra step to the process without adding any value.
  3. Check Create Training File under the menu.
  4. Once you have finished your scanning session, you will be prompted to name your training file, and it will be saved.
  5. Next time you work on your project, be sure to select the Training File under the Recognize menu.

To capture the page image:

  1. Place the material on the scanner, as you would on a copy machine. It should be aligned with the upper right corner of the scanner.
  2. If you are using the document feeder, set your documents in order, face up, in the upper right corner of the feeder (on top of the machine). Make sure the green lever is straight up when you place your documents in the feeder. Once you are ready to scan, turn the lever to the right.
  3. When your document is properly aligned on the scanner, click on the Go button.
  4. You will be prompted to choose the format you would like for the final document. You can select from ASCII text, a variety of word processing applications, or even HTML. Once you've made a selection it will begin to process the document.
  5. You can, however, click on Stop at any time to stop the scanning or recognition process.

To edit your document:

  1. As TextBridge processes your document, it will stop at words it does not recognize. The highlighted characters will appear in a dialog box in the upper right corner of the screen. You may type in corrections as the document scans, or press return to edit the error later.
  2. If you are creating a Training File, however, you will want to enter the corrections as you go, so that TextBridge can use that information the next time you scan a document with similar spellings, fonts, or vocabulary.

To scan several pages as one file:

  1. As TextBridge finishes each page, it will prompt you to Continue or End. If you choose Quit at this point, your information will be lost! Once you have finished the entire document, a file is automatically created and saved to the desktop.
  2. You can then edit that file in SimpleText or in most word processing programs.



Scanning Text with OmniPage Pro

OmniPage Pro provides a wide range of options, including the ability to learn new characters, to scan only parts of documents (Manual Zone), to spellcheck, to recognize most European character sets as well as multilingual documents, and to save the resulting text in a variety of popular word-processing formats.

To select standard OCR settings:

For most OCR work, a few basic default OmniPage settings will produce satisfactory results. If your original document contains clear, readable text (such as a printed book or output from a laser printer), is arranged in a standard single- or multi-column format, and features a typeface of approximately 8 pt. or larger, select the following options:

  1. Set the main Process buttons to Scan Image.
  2. Select Auto Zones and then Perform OCR.
  3. In the Settings panel, open the Options section of the Scanner window, choose 3D OCR, and set the paper size and orientation.
  4. In the Zones window, select either Multiple Columns or Single Columns, depending on the format of the original document.
  5. In the OCR window, select Retain Font and Paragraph Formatting.

To customize settings:
  • From the Process menu, choose Select Process Settings...

    1. Under the OCR tab, you can choose the Speed of the OCR Method (Fastest to Most Accurate), the Character Type (Normal or Dot-Matrix/Monotype), and the Language of the document (12 to select from).
    2. Under the Scanner tab, you can specify the scanner, information about the size and orientation of the page, the brightness and contrast of the scan, and the type of scanner (ADF or Flatbed). ADF tells the software that you are using an automatic document feeder and gives you the option of scanning double-sided pages automatically (not recommended because of intermittent bugs).
    3. Under the Tables tab, you can tell OmniPage to look for tables in the document automatically, and set parameters of the table's border and inside grid.
    4. The Direct OCR tab lists tasks that OmniPage can perform automatically after the page has been scanned. You can tell it to draw zones of text automatically (which you can change manually later if necessary) and to proofread the text, and you can specify which applications it can use to save the text file.
    5. The Process tab lets you designate where you want newly scanned pages to be inserted in the left column, and whether OmniPage should perform more automatic tasks after scanning the page. These include proofreading, straightening the page, and changing the orientation of the page.
    6. When you are finished setting those options, click OK or Save Settings... to apply the custom settings.


  • In the upper left-hand corner of the main OmniPage window, click on AutoOCR. The settings selected here are also available in Manual OCR and OCR Wizard.

    1. There are four options in the Document Source drop-down menu. "Load File" lets you select an existing image. The other three scan a new image as Black and White (recommended for sharp text on white backgrounds), Grayscale (recommended for pages with background colors or run-together characters), or Color (for pages with color pictures). These three are listed in increasing order of the amount of image data they produce.
    2. Under the Original Layout menu, you can tell OmniPage whether the original page was in a Single Column (one block of text), Multiple Columns (separate blocks of text, possibly with intermittent graphics), Spreadsheet (arranged in rows and columns), or in a Mixed-Page layout (which lets OmniPage determine the format).
    3. The Output Format menu lets you decide how much of the original layout you would like to retain. The options are to Remove the Formatting, Retain the Paragraph and Font Information, Retain Flowing Columns, or True Page, which retains as much formatting information as possible. Unless you want exact fidelity to the original page, it is best to select "retain paragraph and font information"; otherwise, the scanning process will take longer and you will spend some time trying to adjust the scanned text to new formatting.
    4. Under the Export Destination menu, you can tell OmniPage to Save the scan as a File (under various formats), to Send it as Mail, to Copy the text to the Clipboard, or to read the text aloud. Your best bet is to choose "Save as File," then save the file in C:\My Documents\your_own_folder.
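
The difference between the three scanning modes in step 1 can be estimated with a little arithmetic. The sketch below assumes typical bit depths (1 bit per pixel for Black and White, 8 for Grayscale, 24 for Color), a letter-size page, and no compression; the actual values for a particular scanner and OmniPage version may differ, and the names used here are hypothetical.

```python
# Assumed bit depths; typical values, not confirmed for any specific scanner.
BITS_PER_PIXEL = {"black_and_white": 1, "grayscale": 8, "color": 24}


def uncompressed_bytes(mode: str, dpi: int = 300,
                       width_in: float = 8.5, height_in: float = 11.0) -> int:
    """Approximate size in bytes of one uncompressed page image."""
    pixels = round(width_in * dpi) * round(height_in * dpi)
    return pixels * BITS_PER_PIXEL[mode] // 8


if __name__ == "__main__":
    for mode in BITS_PER_PIXEL:
        size_mb = uncompressed_bytes(mode) / 1_000_000
        print(f"{mode}: about {size_mb:.1f} MB per page at 300 dpi")
```

On these assumptions, a color scan carries 24 times as much data as a black-and-white scan of the same page, which is one reason to choose Black and White whenever the text is sharp and the background is white.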

To capture the page image:
  1. After choosing the appropriate settings, place the document face down on the glass of the flatbed scanner, aligned in the upper right corner.
  2. If you are in AutoOCR, click on the green Start button to begin scanning the text. (In Manual OCR, click the first button to begin scanning.) A thumbnail of the scan will appear in the left Thumbnail View column, the original page image and its zone areas will appear in the Image View column, and the Text View column will display the text in editable form.
  3. Several Tools in the Image View (as shown in the diagram) allow you to select and edit the zones that OmniPage uses to select portions of text.
  4. If you chose to save the scan to a file, you will be prompted to choose the format you would like for the final document. You can select from ASCII text, a variety of word processing applications, or even HTML. Once you've made a selection it will begin to process the document. In general, you should probably save the file as an "rtf" (rich text format) file, since this format retains font and paragraph information and is readable by most word processors.
  5. Be sure to save your file approximately every ten scans. The File menu allows you to save your document, clear an existing document from the work area, load partially completed texts, load a pre-selected group of settings, or exit OmniPage.

To edit your text:
  1. The Edit menu contains cut-and-paste utilities for manipulating scanned texts, as well as a tool for moving quickly between pages.
  2. The Checkmark icon on the toolbar initiates spellcheck, which compares the text to a basic dictionary, pausing to let you correct any potential errors it finds.
  3. In most cases, it's faster and more reliable to complete spell-checking and formatting in a word processing program.
  4. Like most OCR programs, OmniPage will often over-format a text in an attempt to create exactly the look of the printed page. As a consequence, it fills the output document with word processing codes.
  5. Save the final corrected text in Microsoft RTF format.
  6. Load the saved file into Microsoft Word.
  7. Under the Edit drop-down menu, choose Select All. This marks the entire document for editing.

  8. On Word's main screen bar, click on the leftmost downward pointing arrow to call up the format options. Highlight and select Normal. This removes most of the annoying formatting in the document.
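
The dictionary comparison described in step 2 above can be illustrated in a few lines. This is a simplified sketch of the general idea, not OmniPage's actual spellchecker: it flags every word of the scanned text that is missing from a word list you supply. The function name and the tiny sample dictionary are hypothetical.

```python
import re


def flag_unknown_words(text: str, dictionary: set[str]) -> list[str]:
    """Return the words in text that are not in dictionary (case-insensitive)."""
    known = {word.lower() for word in dictionary}
    words = re.findall(r"[A-Za-z']+", text)
    return [word for word in words if word.lower() not in known]


if __name__ == "__main__":
    sample_dictionary = {"the", "dog", "ran", "home"}
    print(flag_unknown_words("Thc dog ran horne", sample_dictionary))
```

A check like this catches misreadings that produce non-words, but not those that produce a different valid word, so a final read-through is still worthwhile.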


A Final Word

If you are interested in further editing or marking up the text for the Web or in scanning and editing images, ask the Digital Media Center staff for assistance.

