WCRE 2001, Stuttgart, October 2001




From an informal textual lexicon to a well-structured lexical database: An experiment in data reverse engineering

  • WCRE 2001,

  • Stuttgart, October 2001

  • Gérard Huet

  • INRIA


Introduction

  • We report on some experiments in data reverse engineering applied to computational linguistics resources.

  • We started from a Sanskrit-to-French dictionary in TeX input format.



History

  • 1994 - non-structured lexicon in TeX

  • 1996 - available on Internet

  • 1998 - 10000 entries - invariants design

  • 1999 - reverse engineering

  • 2000 - Hypertext version on the Web, euphony processor, grammatical engine





Initial semi-structured format

  • \word{$kumAra$}{kum\=ara} m.

  • gar{\c c}on, jeune homme

  • | prince; page; cavalier

  • | myth. np. de Kum\=ara ``Prince'',

  • \'epith. de Skanda % Renou 241

  • -- n. or pur

  • -- \fem{kum\=ar{\=\i}\/} adolescente, jeune

  • fille, vierge.



Systematic Structuring with macros

  • \word{$kumAra$}{kum\=ara}

  • \sm \sem{garçon, jeune homme}

  • \or \sem{prince; page; cavalier}

  • \or \myth{np. de Kum\=ara ``Prince'',

  • épith. de Skanda} % Renou 241

  • \role \sn \sem{or pur}

  • \role \fem{kum\=ar{\=\i}\/} \sem{adolescente,

  • jeune fille, vierge}

  • \fin



Uniform cross-referencing

  • \word{kumaara}

  • \sm

  • \sem{garçon, jeune homme}

  • \or \sem{prince; page; cavalier}

  • \or \myth{np. de \npd{Kumaara} "Prince",

  • épith. de \np{Skanda}} % Renou 241

  • \role \sn \sem{or pur}

  • \role \fem{kumaarii} \sem{adolescente,

  • jeune fille, vierge}

  • \fin



Structured Computations

  • Grinding: the data is parsed and compiled into typed abstract syntax, and a processor is applied to each entry in order to carry out some uniform computation. This processor is a parameter of the grinding functor.

  • For instance, a printing process may compile into a backend generator for some rendering format. It may itself be parametric, taking a virtual typesetter as its parameter. Thus the original TeX format may be restored, but an HTML processor is easily derived as well.

  • Other processes may, for instance, call a grammatical processor in order to generate a full lexicon of flexed forms. This lexicon of flexed forms is itself the basis of further computational linguistics tools (segmenter, tagger, syntax analyser, corpus editors, etc.).
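
The grinding idea described above can be sketched as an OCaml functor. This is only an illustration under stated assumptions: the `entry` record, the `PROCESSOR` signature, and the two printer modules are hypothetical and far simpler than the real abstract syntax, which the slides do not show.

```ocaml
(* Hypothetical, much-simplified abstract syntax for one lexicon entry. *)
type entry = { headword : string; senses : string list }

(* What the grinding functor expects of its parameter:
   a computation applied uniformly to each entry. *)
module type PROCESSOR = sig
  type result
  val process_entry : entry -> result
end

(* The grinding functor: given a processor, it yields a function that
   runs that processor over every entry of an already-parsed lexicon. *)
module Grind (P : PROCESSOR) = struct
  let run (lexicon : entry list) : P.result list =
    List.map P.process_entry lexicon
end

(* Two interchangeable back ends (illustrative output formats only). *)
module Tex_printer = struct
  type result = string
  let process_entry e =
    Printf.sprintf "\\word{%s} \\sem{%s} \\fin" e.headword
      (String.concat "; " e.senses)
end

module Html_printer = struct
  type result = string
  let process_entry e =
    Printf.sprintf "<dt>%s</dt><dd>%s</dd>" e.headword
      (String.concat "; " e.senses)
end

module To_tex = Grind (Tex_printer)
module To_html = Grind (Html_printer)

let lexicon =
  [ { headword = "kumaara"; senses = [ "prince"; "page"; "cavalier" ] } ]

let tex = To_tex.run lexicon
let html = To_html.run lexicon
```

Swapping the processor changes the output format without touching the parsing or traversal code, which is the point made in the bullets above.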







Processing chains



Close-up view of functionalities



Tools used

  • Printable document

  • Knuth TeX, Metafont, LaTeX2e

  • Velthuis devnag font & ligature comp

  • Adobe Postscript, Pdf, Acrobat



Requirements, comments

  • The abstract syntax ought to accommodate the freedom of style of the concrete syntax input, with the minimum changes needed to avoid ambiguities

  • This resulted in 3 mutually recursive parsers (generic dictionary, Sanskrit, French) with 253, 14, and 54 productions respectively. Such a structure, implicit in the data, would have been hard to design

  • On the other hand, the typed nature of the abstract syntax (~DTD) enforced a much stricter discipline than TeX allowed, which greatly improved the detection of input mistakes

  • Total processing of the source document (~2 MB of text) takes only 30 seconds on a current PC



Each entry is a (typed) tree



Grammatical information

  • type gender = [ Mas | Neu | Fem | Any ];

  • type number = [ Singular | Dual | Plural ];

  • type case = [ Nom | Acc | Ins | Dat | Abl | Gen | Loc | Voc ];
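
The declarations above appear to use OCaml's revised (Camlp4) syntax. As a minimal sketch, here they are transcribed into standard OCaml syntax, together with a hypothetical helper, `paradigm_cells`, that enumerates the cells of a declension table. The helper is an assumption made for illustration and is not taken from the slides.

```ocaml
(* The slide's three types, transcribed into standard OCaml syntax. *)
type gender = Mas | Neu | Fem | Any
type number = Singular | Dual | Plural
type case = Nom | Acc | Ins | Dat | Abl | Gen | Loc | Voc

let numbers = [ Singular; Dual; Plural ]
let cases = [ Nom; Acc; Ins; Dat; Abl; Gen; Loc; Voc ]

(* Hypothetical helper: every (case, number) cell of one declension
   table, listed case by case. *)
let paradigm_cells : (case * number) list =
  List.concat (List.map (fun c -> List.map (fun n -> (c, n)) numbers) cases)
```

With 8 cases and 3 numbers, the table has 24 cells, and the type checker rejects any cell built from a value outside these enumerations.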



The verb system

  • type voice = [ Active | Reflexive ]

  • and mode = [ Indicative | Imperative | Causative | Intensive | Desiderative ]

  • and tense = [ Present of mode | Perfect | Imperfect | Aorist | Future ]

  • and nominal = [ Pp | Ppr of voice | Ppft | Ger | Infi | Peri ]

  • and verbal = [ Conjug of (tense * voice)

  • | Passive | Absolutive | Conditional | Precative

  • | Optative of voice

  • | Nominal of nominal

  • | Derived of (verbal * verbal) ];
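
To show how such mutually recursive types support computation, the following sketch restates the declarations in standard OCaml syntax and adds a recursive printer. The `string_of_*` functions are hypothetical, and the labels simply echo the constructor names, since the slides do not spell out the abbreviations or the meaning of `Derived`.

```ocaml
(* The slide's verb system, transcribed into standard OCaml syntax. *)
type voice = Active | Reflexive
and mode = Indicative | Imperative | Causative | Intensive | Desiderative
and tense = Present of mode | Perfect | Imperfect | Aorist | Future
and nominal = Pp | Ppr of voice | Ppft | Ger | Infi | Peri
and verbal =
  | Conjug of (tense * voice)
  | Passive | Absolutive | Conditional | Precative
  | Optative of voice
  | Nominal of nominal
  | Derived of (verbal * verbal)

(* Hypothetical printers: one function per type, following its shape. *)
let string_of_voice = function Active -> "active" | Reflexive -> "reflexive"

let string_of_mode = function
  | Indicative -> "indicative" | Imperative -> "imperative"
  | Causative -> "causative" | Intensive -> "intensive"
  | Desiderative -> "desiderative"

let string_of_tense = function
  | Present m -> "present " ^ string_of_mode m
  | Perfect -> "perfect" | Imperfect -> "imperfect"
  | Aorist -> "aorist" | Future -> "future"

let string_of_nominal = function
  | Pp -> "pp" | Ppr v -> "ppr " ^ string_of_voice v
  | Ppft -> "ppft" | Ger -> "ger" | Infi -> "infi" | Peri -> "peri"

(* Only verbal is recursive, through the Derived constructor. *)
let rec string_of_verbal = function
  | Conjug (t, v) -> string_of_tense t ^ " " ^ string_of_voice v
  | Passive -> "passive" | Absolutive -> "absolutive"
  | Conditional -> "conditional" | Precative -> "precative"
  | Optative v -> "optative " ^ string_of_voice v
  | Nominal n -> string_of_nominal n
  | Derived (a, b) ->
    "derived(" ^ string_of_verbal a ^ ", " ^ string_of_verbal b ^ ")"

let example = string_of_verbal (Conjug (Present Imperative, Active))
```

Because the pattern matches are exhaustive, adding a constructor to any of the types makes the compiler flag every function that has not yet been updated.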



Key points

  • Each entry is a structured piece of data on which one may compute

  • Consistency and completeness checks:

    • every reference is defined exactly once, and there is no dangling reference
    • etymological origins, when known, are systematically listed
    • lexicographic ordering at every level is mechanically enforced
  • Specialised views are easily extracted

  • Search engines are easily programmable

  • Maintenance and transfer to new technologies are ensured

  • Independence from input format, diacritics conventions, etc.

  • The technology scales to much bigger corpora
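
Two of the consistency checks listed above (unique definitions with no dangling references, and enforced ordering) could be sketched as follows. The `entry` record is a hypothetical simplification. Plain string comparison also stands in for the Sanskrit lexicographic order, which a real implementation would have to supply.

```ocaml
(* Hypothetical simplified entry: a headword plus the headwords it cites. *)
type entry = { headword : string; refs : string list }

module S = Set.Make (String)

(* Headwords that are defined more than once, in order of appearance. *)
let duplicates entries =
  let rec go seen dups = function
    | [] -> List.rev dups
    | e :: rest ->
      if S.mem e.headword seen then go seen (e.headword :: dups) rest
      else go (S.add e.headword seen) dups rest
  in
  go S.empty [] entries

(* Pairs (citing headword, cited name) where the cited name is undefined. *)
let dangling entries =
  let defined =
    List.fold_left (fun s e -> S.add e.headword s) S.empty entries
  in
  List.concat
    (List.map
       (fun e ->
         List.filter_map
           (fun r -> if S.mem r defined then None else Some (e.headword, r))
           e.refs)
       entries)

(* True when headwords appear in strictly increasing order.
   Assumption: OCaml's built-in string order replaces the real collation. *)
let rec sorted = function
  | a :: (b :: _ as rest) -> compare a.headword b.headword < 0 && sorted rest
  | _ -> true

let lexicon =
  [ { headword = "kumaara"; refs = [ "skanda" ] };
    { headword = "kumaarii"; refs = [ "kumaara"; "devii" ] };
    { headword = "skanda"; refs = [] } ]

let report = (duplicates lexicon, dangling lexicon, sorted lexicon)
```

In this toy lexicon the only problem reported is the reference from `kumaarii` to the undefined name `devii`.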



Interactions lexicon-grammar

  • When the index engine is given a string that is not a stem defined in one of the lexicon entries, it looks the string up in the persistent database of flexed forms and, if it finds it there, proposes the corresponding lexicon entry or entries

  • From within the lexicon, the grammatical engine may be called online as a CGI which lists the declensions of a given stem. It is directly accessible from the gender declarations, because of an important scoping invariant:

    • every substantive stem is within the
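
The two-step lookup performed by the index engine (lexicon stems first, then the flexed forms) might look like the sketch below. Everything here is an assumption made for illustration: the `answer` type, the `lookup` function, and the in-memory map that stands in for the persistent database of flexed forms.

```ocaml
module M = Map.Make (String)

(* Hypothetical result of a query to the index engine. *)
type answer =
  | Entry of string            (* the string is itself a lexicon stem *)
  | Via_flexed of string list  (* stems having the string as a flexed form *)
  | Unknown

(* Try the stems first; fall back to the table of flexed forms. *)
let lookup ~stems ~flexed s =
  if List.mem s stems then Entry s
  else
    match M.find_opt s flexed with
    | Some (_ :: _ as candidates) -> Via_flexed candidates
    | _ -> Unknown

(* Illustrative data: "kumaarasya" is a genitive singular of "kumaara". *)
let stems = [ "kumaara"; "skanda" ]
let flexed = M.add "kumaarasya" [ "kumaara" ] M.empty

let direct = lookup ~stems ~flexed "skanda"
let indirect = lookup ~stems ~flexed "kumaarasya"
```

A list of candidate stems is returned because one flexed form may belong to several entries, matching the "entry or entries" wording above.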

