Extracting Names Using Layout Clues in Genealogical Books Aaron Stewart




Extracting Names Using Layout Clues in Genealogical Books

  • Aaron Stewart

  • David W. Embley

  • March 20, 2010


Overview

  • Problem

  • Current solutions

  • Our solution

    • Preprocessing (briefly, images only)
    • Pattern approach
  • Future work



Problem





Finding Names

  • Name recognition in genealogical texts

  • Focus: Lists, Directories




BYU OntoES Ontology Extraction System

  • Dictionary

  • Regular Expressions
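
A minimal sketch of a dictionary-plus-regular-expression baseline, in the spirit of the two ingredients listed above. It is not OntoES itself: the word lists, the pattern, and the function name extract_names are hypothetical and only illustrate how the two ingredients can combine.

```python
import re

# Hypothetical toy dictionaries; a real system would load much larger lists.
GIVEN_NAMES = {"John", "Mary", "William", "Sarah"}
SURNAMES = {"Smith", "Stewart", "Embley", "Brown"}

# A capitalized word, an optional middle initial, then another capitalized word.
CANDIDATE = re.compile(r"\b([A-Z][a-z]+)(?: [A-Z]\.)? ([A-Z][a-z]+)\b")


def extract_names(text):
    """Return regex candidates whose first or last token is in a dictionary."""
    names = []
    for match in CANDIDATE.finditer(text):
        first, last = match.group(1), match.group(2)
        if first in GIVEN_NAMES or last in SURNAMES:
            names.append(match.group(0))
    return names


print(extract_names("John Smith married Mary A. Brown in Salem Village."))
# ['John Smith', 'Mary A. Brown']
```

The regular expression proposes candidates and the dictionaries filter them, so "Salem Village" is rejected even though it has the shape of a name.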



Part 1: Preprocessing



Ancestry.com Data

  • Word text

  • Word bounding boxes

  • Genres:

    • Genealogical Books
    • City Directories
    • Yearbooks
    • Newspapers


Ancestry.com Data

  • Inconsistent punctuation

    • Commas and periods
    • Present in some books, absent in others
  • Word ordering issue

    • Only some books are affected
    • Bug in OCR/layout analysis


Word Order



Word Order – Corrected




DEG/Ancestry OCR Reformatting

  • TLP original reordering code

  • Page separator

  • Line segment identifier

  • Line ordering

  • RANSAC margin finder



Page Separator




Line Segment Identifier

  • Combines words separated by no more than about two space widths into one line segment

  • Handles skew reasonably well
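
A minimal sketch of how a line segment identifier might group words from their bounding boxes. The Word class, the space_width parameter, and the 50% vertical-overlap test are assumptions made for illustration; the slides give only the "about 2 spaces" rule.

```python
from dataclasses import dataclass


@dataclass
class Word:
    text: str
    left: float
    top: float
    right: float
    bottom: float


def vertical_overlap(a, b):
    """Fraction of the shorter box's height that the two boxes share."""
    shared = min(a.bottom, b.bottom) - max(a.top, b.top)
    shorter = min(a.bottom - a.top, b.bottom - b.top)
    return shared / shorter if shorter > 0 else 0.0


def build_segments(words, space_width, max_spaces=2.0, min_overlap=0.5):
    """Group words into left-to-right segments, returned in top-to-bottom order."""
    segments = []
    for word in sorted(words, key=lambda w: w.left):
        for segment in segments:
            last = segment[-1]
            gap = word.left - last.right
            # Join only if the gap is small and the boxes share a text row.
            if gap <= max_spaces * space_width and vertical_overlap(last, word) >= min_overlap:
                segment.append(word)
                break
        else:
            segments.append([word])
    segments.sort(key=lambda s: s[0].top)
    return segments


words = [
    Word("Smith,", 0, 0, 50, 10),
    Word("John", 55, 1, 90, 11),
    Word("carpenter", 200, 0, 280, 10),
    Word("Brown", 0, 20, 45, 30),
]
for segment in build_segments(words, space_width=5):
    print([w.text for w in segment])
```

Each word is compared with the last word already in a segment rather than with a fixed baseline, so a line that drifts slightly up or down across the page can still be followed word by word.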




Line Ordering



RANSAC Margin Finder

  • Random Sample Consensus (RANSAC)

  • Fits a line to points in the presence of noise (outliers)

  • Effective for finding left-aligned margins, tab stops, table columns
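
A minimal sketch of RANSAC applied to margin finding, assuming each line segment contributes one (x, y) point for its left edge. The function name ransac_margin, the tolerance, the iteration count, and the sample data are hypothetical, not the authors' settings.

```python
import random


def ransac_margin(points, tolerance=2.0, iterations=200, seed=0):
    """Fit a near-vertical line x = slope * y + intercept to (x, y) points.

    Returns (slope, intercept, inliers) for the candidate with the most inliers.
    """
    rng = random.Random(seed)
    best = (0.0, 0.0, [])
    for _ in range(iterations):
        (x1, y1), (x2, y2) = rng.sample(points, 2)
        if y1 == y2:
            continue  # two points on the same row cannot define a margin
        slope = (x2 - x1) / (y2 - y1)
        intercept = x1 - slope * y1
        inliers = [(x, y) for x, y in points
                   if abs(x - (slope * y + intercept)) <= tolerance]
        if len(inliers) > len(best[2]):
            best = (slope, intercept, inliers)
    return best


# Ten left edges on a slightly skewed margin, plus three indented outliers.
margin = [(100 + 0.01 * y, float(y)) for y in range(0, 200, 20)]
noise = [(130.0, 30.0), (250.0, 90.0), (180.0, 150.0)]
slope, intercept, inliers = ransac_margin(margin + noise)
print(len(inliers), round(intercept, 3))
```

Parameterizing the line as x in terms of y keeps a vertical margin well defined (slope 0), which a y = m * x + b form could not represent.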





Margin Finder – Future Work

  • Line Wrap?



Margin Finder – Future Work

  • ABBYY FineReader handles:

    • Paragraphs
    • Newspaper columns
  • But has trouble with:

    • Hanging indents
    • Outline indentation (possibly)


Part 2: Pattern Finding



Pattern Finding

  • Apply baseline name extractor (OntoES)

  • Apply margin finder and insert markers

  • Find left and right context for each name

  • Apply common contexts to extract more names
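
The four steps above can be sketched as follows, assuming each line is already tokenized and a margin marker has been inserted as its own token. The marker string, the function names, and the sample directory lines are hypothetical; the baseline extractor is stood in for by a fixed set of names.

```python
from collections import Counter

MARGIN = "<M>"  # hypothetical marker inserted where the margin finder fired


def contexts(lines, known_names):
    """Count (left token, right token) pairs around names the baseline found."""
    counts = Counter()
    for tokens in lines:
        for start in range(len(tokens)):
            for end in range(start + 1, len(tokens) + 1):
                if " ".join(tokens[start:end]) in known_names:
                    left = tokens[start - 1] if start > 0 else None
                    right = tokens[end] if end < len(tokens) else None
                    counts[(left, right)] += 1
    return counts


def apply_context(lines, left, right, max_len=3):
    """Extract runs of up to max_len tokens that sit between left and right."""
    found = []
    for tokens in lines:
        for start in range(1, len(tokens)):
            if tokens[start - 1] != left:
                continue
            last_end = min(start + max_len, len(tokens) - 1)
            for end in range(start + 1, last_end + 1):
                if tokens[end] == right:
                    found.append(" ".join(tokens[start:end]))
                    break
    return found


lines = [
    [MARGIN, "Smith", "John", ",", "carpenter"],
    [MARGIN, "Brown", "Mary", ",", "teacher"],
    [MARGIN, "Zyx", "Quill", ",", "clerk"],
]
baseline_names = {"Smith John", "Brown Mary"}  # stand-in for baseline output

(left, right), _ = contexts(lines, baseline_names).most_common(1)[0]
print(apply_context(lines, left, right))
# ['Smith John', 'Brown Mary', 'Zyx Quill']
```

In this toy input the most common context is the margin marker on the left and a comma on the right, which recovers a name the dictionary-based baseline would have missed.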




Pattern Finding – Sample Results

  • Baseline Results

    • Precision: 40%
    • Recall: 31.25%
    • F1: 35.09%
  • Results of Most Salient Pattern

    • Precision: 51.52%
    • Recall: 53.12%
    • F1: 52.31%
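
As a check on the figures above, F1 is the harmonic mean of precision and recall, F1 = 2PR / (P + R). The short sketch below recomputes both F1 values from the reported precision and recall; the function name f1_score is just an illustrative helper.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


baseline = f1_score(0.40, 0.3125)
pattern = f1_score(0.5152, 0.5312)
print(f"{baseline:.2%}")  # 35.09%
print(f"{pattern:.2%}")   # 52.31%
```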



Output

  • Java Advanced Imaging library

    • JPEG 2000
    • TIF
  • Bounding Boxes



Contributions

  • Most of our work is just work

    • Practical
    • Not novel
  • Possible exception: RANSAC for Margins

    • Current research topic (2009)
    • Ray Smith / Tesseract
  • Possible exception: L/R patterns



Challenges

  • Evaluation

    • More aligned data
    • Annotation tool
  • Other books



Challenges (continued)

  • Additional Patterns

    • Not just city directories (too trivial?)
    • Include other books
  • Extent

  • Sanity check on non-pattern books



Possible Approach

  • Publish the work as it is?

  • Add centering and right alignment

  • Add another book…



My Preferred Approach

  • Build a useful interactive tool

  • Add features incrementally

  • When my friends say “wow!”, it’s time to publish.



Future Challenges

  • BYU Digital Collections

    • Not searchable for names
    • Needs further processing
  • Etc.



Work to Do…

  • Organize data into a collection

  • Index it

  • Provide a search interface



