Improving text quality with automatic majority editions

2017-02-10

Liam R E Quin (W3C)

This work is licensed under a Creative Commons Attribution 3.0 License, with attribution to W3C.

Copyright © 2017 W3C® (MIT, ERCIM, Keio, Beihang)

Making OCR More Awesome

Context

The Books

[photo of 32 book spines]

Retyping

  • Over 16,000 pages of text
  • [sample pages are shown]

My OCR Attempt

  • My flat-bed scanner would destroy the binding...
  • leaving me with thousands of pieces of paper...
  • But I did get very high quality with ABBYY FineReader, using higher-resolution images.

A chance discovery

  • One day I happened upon a copy of one of the texts online

Archive.Org

Conversion Quality

Majority Edition

Let’s look…

  • [script is shown]

Conversion strategy

Small stages

Results

Future Work

Links