Etymological Classifications of English Words

Abstract
Can English words be classified by their most recent etymological source according to orthographic features? That is the question this experiment set out to test. A variety of classifiers were tested, using orthographic, phonological, and syntactic features to try to classify words by their source languages, and the results were moderately promising. Specifically, machine learning techniques demonstrated moderate to good success on source languages such as French, English, and Latin. However, partly because of English's minimal alphabet and haphazard transliteration patterns compared to other languages, this application is unlikely to be a candidate for improvement.

1. Introduction
Based on an earlier proof of concept that languages can be identified based only on their character-level bi-gram patterns, it was hypothesized that English words, with their diverse etymologies, should likewise be classifiable by source language according to orthographic characteristics. If a simple classifier can distinguish between German, Dutch, and Swedish, for example, a classifier should also be able to distinguish between a word derived from French and a word derived from German. However, the diversity in source languages of English's vocabulary makes this a particularly fine-grained task.

2. Source Materials and Data
After consideration of several .txt format dictionaries with etymological information, “An Etymological Dictionary of the English Language”, by the Reverend Walter W. Skeat of the University of Cambridge, published in 1888 and freely and publicly accessible on archive.org, was used as the source for word etymologies.[1] Though the dictionary is old and likely contains antiquated information (and certainly antiquated language references: “N. American” and “Peruvian” are among the source languages cited), its format and the precedence it gives to word etymologies made it preferable for extraction of etymological information. As complementary data, phonological and syntactic (part of speech) data for 19,528 tokens was drawn from a data file from an earlier homework.[2] Only words that appeared in both sources were used for the classification task.

3. Methods and Issues

3.1 Text Cleaning
Prior to any classification, etymological data was extracted from the etymological dictionary by a series of regular-expression and string manipulations of the text file. Though the dictionary provided full sequences of source languages for words[3] as well as definitions, references, and other information, only the word and the most recent language were ultimately included in the data, operating under the assumption that older etymologies would have left successively less trace in the orthography as time proceeded. Further, due to inconsistencies in the text and typos acquired during the transition from manuscript/typeset to computer text, errors had to be accounted for, and duplicate discrete references to the same language had to be merged.[4]

3.2 Classification
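As a minimal sketch of the kind of orthographic feature this experiment relies on, the character-level bi-grams mentioned in the Introduction could be collected into one feature dictionary per word. The function name `char_ngram_features`, the `#` boundary marker, and the `ng=` key prefix are illustrative assumptions, not details taken from the original system.

```python
from collections import Counter


def char_ngram_features(word, n=2, pad="#"):
    """Return a dict mapping character n-grams of `word` to their counts.

    The boundary marker `pad` and the "ng=" key prefix are assumptions
    made for this sketch; the original feature naming is not known.
    """
    # Mark word boundaries so that initial and final letters are distinct features.
    padded = pad + word.lower() + pad
    grams = (padded[i:i + n] for i in range(len(padded) - n + 1))
    return dict(Counter("ng=" + g for g in grams))


# One feature dict per word; "ballet" is just an example input.
features = char_ngram_features("ballet")
```

Dictionaries of this shape (feature name mapped to a numeric value) are the input format that scikit-learn's DictVectorizer accepts, which is why a dict is used here rather than a fixed-length list.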
Vectorization
Scikit-learn's DictVectorizer implementation was used to create sparse arrays from the feature vectors created in the previous task. The default methodology of performing dict.fit_transform() caused some problems when working with training and development sets of different vector sizes, so ultimately a vector of all features (that is, all seen n-grams, POSs, and CV patterns) was used to fit the vectorizer, and each vector was then transformed to that fit as needed for classifier fitting and prediction.

Results
Though the specific accuracy of the system varied between 55% and 61% depending on the classifier and feature set used, ultimately the system was shown to match the predicted class to the true class about 60% of the time on an unseen test set that had not been used during development. Precision and recall for different languages varied considerably around this 60% benchmark.
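To make the distinction between overall accuracy and per-language precision and recall concrete, the following is a minimal sketch in plain Python. The function name `per_class_precision_recall` and the toy labels are invented for illustration; they are not data or code from the experiment.

```python
from collections import Counter


def per_class_precision_recall(true_labels, predicted_labels):
    """Return {label: (precision, recall)} for every label that occurs."""
    true_count = Counter(true_labels)
    pred_count = Counter(predicted_labels)
    # Count correct predictions (true positives) per label.
    tp = Counter(t for t, p in zip(true_labels, predicted_labels) if t == p)
    scores = {}
    for label in set(true_count) | set(pred_count):
        # Precision: share of predictions of this label that were correct.
        precision = tp[label] / pred_count[label] if pred_count[label] else 0.0
        # Recall: share of true instances of this label that were found.
        recall = tp[label] / true_count[label] if true_count[label] else 0.0
        scores[label] = (precision, recall)
    return scores


# Toy labels, invented for illustration only (not results from the paper).
y_true = ["French", "French", "Latin", "English", "Latin"]
y_pred = ["French", "Latin", "Latin", "English", "French"]

scores = per_class_precision_recall(y_true, y_pred)
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
```

In this toy case the overall accuracy is 0.6, while precision and recall are both 1.0 for one class and 0.5 for the other two, which shows how a single accuracy figure can hide large differences between source languages.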