Corpus Annotation I, Martin Volk, Universität Zürich






Overview

  • Clean-Up and Text Structure Recognition

  • Sentence Boundary Recognition

  • Proper Name Recognition and Classification

  • Part-of-Speech Tagging

  • Tagging Correction

  • Lemmatisation and Lemma Filtering

  • NP/PP Chunk Recognition

  • Recognition of local and temporal PPs

  • Clause Boundary Recognition



Starting Point

  • Raw text: no explicit markup (e.g. ComputerZeitung)

  • Text with XML markup for text structure (e.g. NZZ and TagesAnzeiger)



Clean-Up

  • remove hyphenation

  • remove line breaks

  • remove blanks in internet addresses

    • http://www. abc.de/path → http://www.abc.de/path
  • remove blanks in numbers

    • 10 000 → 10'000
  • coding of special characters (ä, ö, ß, é, <, …)
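
A minimal sketch of the first four clean-up steps, assuming plain-text input and Python's standard `re` module. The function names and the exact patterns are illustrative assumptions, not the rules of the original pipeline; in particular, the URL rule only repairs a blank directly after `www.`.

```python
import re


def remove_hyphenation(text: str) -> str:
    # Split word: "Bestands-\naufnahme" -> "Bestandsaufnahme".
    # Assumption: a lowercase letter after the break signals a split word.
    text = re.sub(r"(\w)-\n(?=[a-zäöüß])", r"\1", text)
    # True hyphenated compound: "Unix-\nSzene" -> "Unix-Szene".
    return re.sub(r"-\n(?=[A-ZÄÖÜ])", "-", text)


def remove_line_breaks(text: str) -> str:
    # Replace each line break (and surrounding blanks) by one blank.
    return re.sub(r"\s*\n\s*", " ", text).strip()


def fix_urls(text: str) -> str:
    # Remove blanks directly after "http://www." (only this case is handled).
    return re.sub(r"(https?://www\.)\s+", r"\1", text)


def fix_numbers(text: str) -> str:
    # "10 000" -> "10'000": 1-3 digits followed by groups of exactly 3 digits.
    pattern = r"(?<![\d'])\d{1,3}(?: \d{3})+(?!\d)"
    return re.sub(pattern, lambda m: m.group(0).replace(" ", "'"), text)


def clean_up(text: str) -> str:
    for step in (remove_hyphenation, remove_line_breaks, fix_urls, fix_numbers):
        text = step(text)
    return text
```

For example, `clean_up("Siehe http://www. abc.de/path mit 10 000\nSeiten")` gives `Siehe http://www.abc.de/path mit 10'000 Seiten`.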



Text Structure Recognition

  • mark document boundaries

  • mark document identifiers

  • distinguish headers (= titles) from text

  • mark list items

  • mark specific text elements

    • Examples from the ComputerZeitung
    • reference city
    • author name and author abbreviation


Sentence Boundary Recognition

  • Sentences end

    • at full stops (= dots), exclamation marks and question marks; possibly also at semicolons and colons (open question)
    • at the end of a header or list item or paragraph
  • Problem

    • the full stop dot is ambiguous with the abbreviation dot and the ordinal number dot (e.g. 3. Int. Conference)
  • Solution

    • use a language-specific list of abbreviations for disambiguation
    • still: the problem with an abbreviation in sentence-final position persists → correction after PoS-tagging
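
The abbreviation-list approach can be sketched as follows. This is only an illustration: `ABBREVIATIONS` is a tiny hypothetical list, the input is assumed to be whitespace-separated, and semicolons and colons are not treated as boundaries. As the slide notes, an abbreviation (or an ordinal number) in sentence-final position is still missed.

```python
import re

# Hypothetical excerpt; a real list is language-specific and much longer.
ABBREVIATIONS = {"Dr.", "Prof.", "Int.", "usw."}


def split_sentences(text: str) -> list[str]:
    sentences, current = [], []
    for token in text.split():
        current.append(token)
        if token.endswith(("!", "?")):
            boundary = True
        elif token.endswith("."):
            is_abbreviation = token in ABBREVIATIONS
            is_ordinal = re.fullmatch(r"\d+\.", token) is not None
            # A dot ends the sentence only if it is neither of the two.
            boundary = not (is_abbreviation or is_ordinal)
        else:
            boundary = False
        if boundary:
            sentences.append(" ".join(current))
            current = []
    if current:  # trailing material without final punctuation
        sentences.append(" ".join(current))
    return sentences
```

With this list, `split_sentences("Die 3. Int. Conference beginnt heute. Dr. Urban kommt!")` yields two sentences, because `3.` and `Int.` and `Dr.` are not treated as boundaries.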


Verticalization of the text

  • one word (= token) per line (punctuation marks and tags are also tokens)

  • Reason:

    • facilitates annotation (adding information per word as columns in the line)
    • facilitates processing (accessing, counting, sorting)
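
A minimal sketch of verticalization, assuming a simple regular-expression tokenizer. The token pattern is an assumption for illustration; it keeps XML-style tags, hyphenated compounds and names such as `AT&Ts` together and splits off punctuation, but it does not treat abbreviation dots specially.

```python
import re

# Order matters: tags first, then words (with inner "-" or "&"), then punctuation.
TOKEN = re.compile(r"<[^>]+>|\w+(?:[-&]\w+)*|[^\w\s]")


def tokenize(text: str) -> list[str]:
    return TOKEN.findall(text)


def verticalize(text: str) -> str:
    # One token per line; further annotation can be appended as
    # tab-separated columns on each line.
    return "\n".join(tokenize(text))
```

For the input `<s>Novell kauft AT&Ts Unix-Schmiede USL.</s>` this produces eight lines: `<s>`, `Novell`, `kauft`, `AT&Ts`, `Unix-Schmiede`, `USL`, `.` and `</s>`.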


Proper Name Recognition and Classification

  • Reasons:

    • solve the unknown-word problem for names
    • solve the multi-token problem for names
    • enable clustering



Example Text 1

  • Auferstanden aus Ruinen

  • "Getrennt marschieren und sich zusammenschlagen", könnte eine knappe Bestandsaufnahme der Unix-Szene lauten. Dabei wurde und wird der Ruf nach einem einheitlichen Unix immer wieder laut.

  • Novell-Chef Ray Noorda hat ihn mit seinem "Unify Unix!" hübsch marktschreierisch vorgetragen ... (1) Zuerst glaubte Noorda, das mit der Übernahme der Unix Systems Laboratories (USL) ... (2) Doch mußte er ... (3) Dann kam der pfiffige Netzwerker auf die Idee, ... (4) Die dritte Stufe der Novell-Rakete ... (5) Ursprünglich wollte Noorda nämlich hier, ...



Example Text 2

  • Noorda will durch neuen Merger Microsoft Paroli bieten

  • Novell kauft Mehrheit an Unix-Schmiede

  • San Francisco/München/Stuttgart. Novell-Boß Raymond J. Noorda will dem Softwaregiganten Microsoft das Fürchten lehren: Die Netzspezialisten aus Utah kaufen AT&Ts Unix-Schmiede USL. Bis Ende März 1993 soll der Deal unter Dach und Fach sein. Noorda bemüht sich schon seit längerem, sein Imperium zu erweitern.



Example Text 3

  • Als Mitverfasser des "objektorientierten Struktur-Designs" ist Anthony I. Wassermann nicht nur Universitäts-Insidern bekannt ... Dabei hat Wassermann wesentlich dazu beigetragen, daß die IDE-Entwicklungsumgebung sowohl von namhaften Forschungseinrichtungen, beispielsweise dem deutschen Fraunhofer-Institut, aber auch von Unternehmen wie Sun Microsystems, Motorola, Hewlett-Packard, SEL/Alcatel, Asea Brown Boveri, Rolls Royce oder der Swiss Bank eingesetzt werden. Die Fachwelt erhofft sich von dem neuen Werkzeug viel. Wassermanns jüngste Entwicklung, ...



Examples

  • Einer der Väter der heutigen Computer, John von Neumann warnte schon 1948:

  • sowie von Fuzzy-Spezialist Professor Dr. Hans-Jürgen Zimmermann von der Technischen Universität Aachen.

  • ``Im Juli werden die ersten Ergebnisse des San-Francisco-Projekts ausgeliefert'', veranschaulicht Julius Peter. ... ergänzt Lawsons Cheftechnologe Peter Patton.

  • Am Anfang war die Zukunftsvision von einem künftigen Operationssaals, die der Neurochirurg Volker Urban von der Dr.-Horst-Schmidt-Klinik ... räumt Urban Akzeptanzprobleme ein.



Named Entity Recognition

  • is complicated in German since all nouns are capitalized.

  • Named entity classification into

    • person names
    • geographical names (mostly cities and countries)
    • company names
    • (product names)


Recognition of person names

  • Strategy: learn – apply – forget

  • Start with a list of 16'000 person first names

  • Learn a last name if a capitalized word follows a known first name

  • Apply the learned last name if it occurs on its own.

    • Oreja, ..., beantwortete unter Zuhilfenahme von elf Übersetzern ...


Recognition of person names

  • Leads to two problems

  • → The program forgets a last name after 15 sentences. If the last name is used within this range, it is primed for an additional 5 sentences.
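
The learn – apply – forget strategy might be sketched like this. The first-name set is a hypothetical excerpt standing in for the 16'000-entry list, the input is assumed to be tokenized sentences, and "primed for an additional 5 sentences" is read here as extending the expiry by 5 sentences, which is only one possible interpretation.

```python
FIRST_NAMES = {"Ray", "Raymond", "Julius", "Volker"}  # hypothetical excerpt
MEMORY = 15   # sentences a learned last name is remembered
PRIMING = 5   # extension when the name is used again (assumed reading)


def tag_person_names(sentences: list[list[str]]) -> list[tuple[int, int, str]]:
    """Return (sentence index, token index, token) for each name token."""
    learned: dict[str, int] = {}  # last name -> last sentence index it is valid in
    hits = []
    for i, tokens in enumerate(sentences):
        # Forget: drop names whose memory window has expired.
        learned = {name: end for name, end in learned.items() if end >= i}
        j = 0
        while j < len(tokens):
            token = tokens[j]
            nxt = tokens[j + 1] if j + 1 < len(tokens) else ""
            if token in FIRST_NAMES and nxt.isalpha() and nxt[0].isupper():
                # Learn: capitalized word after a known first name.
                learned[nxt] = i + MEMORY
                hits += [(i, j, token), (i, j + 1, nxt)]
                j += 2
            elif token in learned:
                # Apply: stand-alone last name; prime it for longer.
                learned[token] += PRIMING
                hits.append((i, j, token))
                j += 1
            else:
                j += 1
    return hits
```

In this sketch a name learned in sentence 0 and reused in sentence 1 stays active through sentence 20 and is no longer tagged after that.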



Results for person names

  • Recall: 93% for full names, 74% for stand-alone last names

  • Precision: 92%

  • evaluated over 990 sentences with

    • 73 full names
    • 43 stand-alone last names


Recognition of geographical names

  • Strategy: list – learn – apply

  • Start with a list from the WWW:

    • 1000 city names
    • 250 country names
  • Learn additional city names from article location

    • Bonn (pg) – Bundesregierung und SPD kamen sich ...
  • Apply all names to the corpus
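
Learning a city name from the article location could look roughly like this. The dateline pattern (city, author abbreviation in parentheses, dash) is inferred from the single example above and is an assumption; hyphenated city names and slash-separated locations are not covered.

```python
import re

# City (one or more capitalized words), author abbreviation, then a dash.
DATELINE = re.compile(
    r"^(?P<city>[A-ZÄÖÜ][^\W\d_]+(?: [A-ZÄÖÜ][^\W\d_]+)*) \((?P<author>\w+)\) [–-] "
)


def learn_city(first_line: str, known: set[str]) -> set[str]:
    match = DATELINE.match(first_line)
    if match:
        known.add(match.group("city"))
    return known


def find_cities(text: str, known: set[str]) -> list[str]:
    if not known:
        return []
    # Longest names first so that multi-word names win over their parts.
    names = sorted(map(re.escape, known), key=len, reverse=True)
    return re.findall(r"\b(?:" + "|".join(names) + r")\b", text)
```

The seed list from the WWW would be passed in as `known`; each article's first line then adds to it before `find_cities` is applied to the corpus.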



Recognition of geographical names

  • Problems:

    • must include genitive forms
        • Hamburg  Hamburgs
        • Bad Harzburg  Bad Harzburgs
    • must include adjectival forms in two variants
      • cities (uninflected upper case adjective):
        • London  Londoner
        • München  Münchner
      • countries (inflected lower case adjective):
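
The genitive and city-adjective variants can be generated with a rule plus an exception table, sketched below. The exception table is a hypothetical one-entry excerpt, names ending in a sibilant are not handled, and country adjectives are left out because they need inflection.

```python
# Hypothetical exception table for adjectives that are not "name + er".
IRREGULAR_ADJECTIVES = {"München": "Münchner"}


def city_variants(city: str) -> dict[str, str]:
    return {
        "genitive": city + "s",  # Hamburg -> Hamburgs
        "adjective": IRREGULAR_ADJECTIVES.get(city, city + "er"),  # London -> Londoner
    }


def expand_city_list(cities: list[str]) -> set[str]:
    # All surface forms to look for in the corpus.
    forms = set(cities)
    for city in cities:
        forms.update(city_variants(city).values())
    return forms
```

Calling `expand_city_list(["London", "München"])` adds `Londons`, `Londoner`, `Münchens` and `Münchner` to the two base forms.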


Results for geographical names

  • 990 test sentences → 166 geographical names

  • Recall 91% (151 names)

  • Precision 81%
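
As a quick check of these figures: recall is the share of the 166 names that were found. The estimate of how many strings the program marked in total is not on the slide; it is derived here on the assumption that the 151 names are the correct hits, and it is only approximate because the percentages are rounded.

```python
def percent(part: float, whole: float) -> int:
    return round(100 * part / whole)


names_in_test_set = 166
names_found = 151

recall = percent(names_found, names_in_test_set)  # 151 / 166 = 0.9096...

# Precision = correct hits / all marked strings, so with 81% precision
# the program marked roughly 151 / 0.81 strings (assumption: 151 correct hits).
marked_estimate = round(names_found / 0.81)
```

Here `recall` comes out as 91 and `marked_estimate` as 186, that is, about 35 of the marked strings were not geographical names.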



Recognition of company names

  • Strategy: learn – filter – apply

  • Learn company name as sequence of capitalized words

    • following a keyword (Firma)
      • die Firma Electronic Book Technologies
    • preceding a keyword (GmbH, Ltd., Co.)
      • von J.D. Edwards & Co.
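
The two keyword patterns might be sketched with regular expressions as follows. The patterns are assumptions for illustration and expect sentence-final punctuation to be tokenized off already. They also show the boundary problem directly: any capitalized word adjacent to the name is swallowed.

```python
import re

CAP = r"[A-ZÄÖÜ][\w.&-]*"  # one capitalized word, e.g. "J.D." or "Edwards"

# Name follows the keyword: "die Firma Electronic Book Technologies"
AFTER_KEYWORD = re.compile(rf"\bFirma ((?:{CAP})(?: {CAP})*)")

# Name precedes (and includes) the keyword: "J.D. Edwards & Co."
BEFORE_KEYWORD = re.compile(rf"(?:{CAP} (?:& )?)+(?:GmbH|Ltd\.|Co\.)")


def learn_company_names(sentence: str) -> set[str]:
    names = {m.group(1) for m in AFTER_KEYWORD.finditer(sentence)}
    names |= {m.group(0) for m in BEFORE_KEYWORD.finditer(sentence)}
    return names
```

For the two examples above this returns `Electronic Book Technologies` and `J.D. Edwards & Co.` respectively.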


Recognition of company names

  • Learn company name as sequence of capitalized words (continued)

    • as initial part of hyphenated compound
      • die Zukunft der France-Télécom-Tochter ist ...
    • after fem. determiner + geographical adjective (specific pattern for ComputerZeitung!!)
      • ... hat die Münchner Ornetix einen Server entwickelt
  • Learn company name acronyms from complex names

    • die CCS Chipcard & Communications GmbH → CCS


Recognition of company names

  • Problems:

    • determination of correct front or end boundary
    • incorrectly learned (simple) names
  • Solution:

    • filter learned simple names against morphology system Gertwol


Filter of company names

  • Accept as company name all words

    • that are unknown to Gertwol (Acotec, Belgacom)
    • that are known to Gertwol as proper names (Alcatel, Apple)
    • that are recognized by Gertwol as abbreviations (AMD, AT&T)
    • that are not in an English general dictionary
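
A sketch of the filter, with loud caveats: Gertwol's real interface is not shown in these slides, so the lookup below is a stand-in dictionary, and the category labels are invented. How the four criteria combine is also an assumption; here the English-dictionary test is applied only to words unknown to Gertwol, since `Apple` is listed as accepted.

```python
# Stand-in for a Gertwol lookup (hypothetical data and category labels).
GERTWOL_STUB = {
    "Alcatel": "proper_name",
    "Apple": "proper_name",
    "AMD": "abbreviation",
    "AT&T": "abbreviation",
    "Zukunft": "noun",
}
ENGLISH_WORDS = {"network", "server", "apple"}  # hypothetical excerpt


def accept_company_name(word: str) -> bool:
    category = GERTWOL_STUB.get(word)
    if category in ("proper_name", "abbreviation"):
        return True
    if category is None:
        # Unknown to the morphology: accept unless it is plain English.
        return word.lower() not in ENGLISH_WORDS
    return False  # known as a regular German word
```

With these stand-in data, `Acotec`, `Belgacom`, `Alcatel`, `Apple`, `AMD` and `AT&T` are accepted, while `Zukunft` and `Network` are rejected.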


Results for company names

  • 990 test sentences → 348 company names

  • Completely recognized

    • Recall 81% (283 names)
    • Precision 76%
  • First token correctly recognized

    • Recall 86%
    • Precision 80%


Overview of the Results



Recognition of Product Names (A project by Jeannette Roth)

  • Proper names refer to unique objects.

  • Product names are different.

    • A product name may refer to many 'similar' objects.
      • Mercedes, MS Word, hohes C, dentalux
  • Recognition of product names is important because they are constantly introduced into the language.



Product Names

  • Method: Learn, Filter, Apply

  • but with coordination pattern

    • Product (und|sowie|oder) Product
    • Product, Product (und|sowie|oder) Product
  • Result:

    • Precision: > 90%
    • Recall: 20-30%
  • Product Names are not marked in the corpora :-(
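
The coordination pattern can be sketched as a learning step over tokenized text: if one member of a coordination is a known product, the other capitalized members are proposed as products. The seed set and the example sentence are invented for illustration, multi-token names such as `MS Word` are not handled, and the proposals would still have to pass the filter step.

```python
CONJUNCTIONS = {"und", "sowie", "oder"}


def coordination_members(tokens: list[str], i: int) -> list[str]:
    # tokens[i] is the conjunction: collect "A , B und C" right to left.
    members = [tokens[i - 1], tokens[i + 1]]
    j = i - 2
    while j >= 1 and tokens[j] == ",":
        members.append(tokens[j - 1])
        j -= 2
    return members


def learn_products(tokens: list[str], known: set[str]) -> set[str]:
    learned = set()
    for i in range(1, len(tokens) - 1):
        if tokens[i] in CONJUNCTIONS:
            members = coordination_members(tokens, i)
            if any(m in known for m in members):
                learned |= {m for m in members
                            if m not in known and m[:1].isupper()}
    return learned
```

For instance, with `known = {"Excel"}` the token sequence `mit Word , Excel und Lotus` proposes `Word` and `Lotus`. Because the pattern only fires next to an already known product, high precision with low recall is the expected behaviour.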



Corpora in Named Entity Recognition



Influence on PoS-Tagging

  • The distinction between a regular noun and a proper name is a frequent tagging error.

  • Proper name recognition eliminates most of these errors.



Conclusion

  • Interaction between recognition modules needs to be improved.

  • Coordinated constituents need to be exploited

  • Other name types need to be included

    • product names
    • organization names (administrative units)
    • event names (exhibitions, conferences)


