Etymological Classifications of English Words

Abstract

Can English words be classified by their most recent etymological source according to orthographic features? That is the question this experiment set out to test. Using orthographic, phonological, and syntactic features to classify words by their source languages, a variety of classifiers were tested, and the results were moderately promising. Specifically, machine learning techniques demonstrated moderate to good success on source languages such as French, English, and Latin. However, partly because of English's minimal alphabet and haphazard transliteration patterns compared to other languages, this application is unlikely to be a strong candidate for further improvement.

1. Introduction

Based on earlier proof-of-concept work showing that languages can be identified from their character-level bi-gram patterns alone, it was hypothesized that English words, with their diverse etymologies, should likewise be classifiable by source language according to orthographic characteristics. If a simple classifier can distinguish between German, Dutch, and Swedish, for example, a classifier should also be able to distinguish between a word derived from French and a word derived from German. However, the diversity of source languages in English's vocabulary makes this a particularly fine-grained task.
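The character-level bi-gram patterns mentioned above can be sketched as follows; the boundary markers (`^`, `$`) and function name are illustrative choices, not details from the experiment:

```python
from collections import Counter

def char_bigrams(word):
    """Return character-level bi-grams for a single word,
    with start/end boundary markers added."""
    padded = f"^{word}$"
    return [padded[i:i + 2] for i in range(len(padded) - 1)]

# Frequency profiles over bi-grams like these are what simple
# character-level language identifiers compare.
print(char_bigrams("chef"))                # ['^c', 'ch', 'he', 'ef', 'f$']
print(Counter(char_bigrams("guitar")))
```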

2. Source Materials and Data

After consideration of several .txt-format dictionaries with etymological information, “An Etymological Dictionary of the English Language”, by the Reverend Walter W. Skeat of the University of Cambridge, published in 1888 and freely and publicly accessible on archive.org, was used as the source for word etymologies.1 Though old, and likely containing antiquated information (and certainly antiquated language references: “N. American” and “Peruvian” are among the source languages cited), the format of the dictionary and the precedence given to word etymologies made it preferable for extraction of etymological information. As complementary data, phonological and syntactic (part-of-speech) data for 19,528 tokens were drawn from a data file from an earlier homework.2 Only words that appeared in both sources were used for the classification task.

3. Methods and Issues

3.1 Text Cleaning

Prior to any classification, etymological data was extracted from the etymological dictionary by a series of regular-expression and string manipulations of the text file. Though the dictionary provided full sequences of source languages for words,3 as well as definitions, references, and other information, only the word and its most recent source language were ultimately included in the data, under the assumption that older etymologies would have left successively less trace in the orthography as time proceeded. Further, due to inconsistencies in the text and typos acquired during the transition from manuscript/typeset to computer text, errors had to be accounted for, and duplicate discrete references to the same language had to be merged.4
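A minimal sketch of this kind of extraction is below. The entry format, the regular expression, and the language-merging table are all assumptions for illustration; the actual digitized layout of Skeat's dictionary, and the real set of variant language labels, are not shown in this report:

```python
import re

# Hypothetical entry layout: "WORD, definition. (Lang1 - Lang2)",
# where the first language in parentheses is the most recent source.
ENTRY_RE = re.compile(r"^(?P<word>[A-Z]+), .*\((?P<langs>[^)]+)\)")

# Merge duplicate discrete references to the same language onto one
# canonical name (illustrative mappings only).
LANG_MERGE = {"F.": "French", "Fr.": "French", "L.": "Latin", "Lat.": "Latin"}

def extract_etymology(line):
    """Return (word, most_recent_source_language), or None if the
    line does not look like a dictionary entry."""
    m = ENTRY_RE.match(line)
    if not m:
        return None
    first_lang = m.group("langs").split("-")[0].strip()
    return m.group("word").lower(), LANG_MERGE.get(first_lang, first_lang)

print(extract_etymology("CHIEF, a leader. (F. - L.)"))
```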

3.2 Classification

Inside the classification file, the phonological dictionary and the etymological dictionary were compared to find word instances which appeared in both dictionaries. The data for these instances (etymology, orthography, pronunciation, and part of speech) were merged into one data structure. Then, features were extracted from each instance and stored in a dictionary with the format features[feature] = True.

3.2.1. Features and Data Sets

Originally, only orthographic features were considered for inclusion in the words' feature vectors. These included n-grams ranging in size from n=1 to n=6. Classification was run with only orthographic features and n=2, 3, and 4; n=2, 3, 4, and 5; and finally n=1...6.

3.2.2. Vectorization

Scikit-learn's DictVectorizer implementation was used to create sparse arrays from the feature vectors created in the previous task. The default methodology of calling fit_transform() on each data set caused some problems when working with training and development sets of different vector sizes, so ultimately a vector of all features (that is, all seen n-grams, POSs, and CV patterns) was used to fit the vectorizer, and each feature vector was then transformed to that fit as needed for classifier fitting and prediction.
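The fit-once, transform-many pattern described above can be sketched as follows; the toy feature dicts stand in for the per-word feature vectors and are not data from the experiment:

```python
from sklearn.feature_extraction import DictVectorizer

# Toy feature dicts standing in for per-word feature vectors.
train_feats = [{"2gram=^c": True, "pos=NN": True},
               {"2gram=ch": True, "pos=VB": True}]
dev_feats = [{"2gram=^c": True, "pos=VB": True, "2gram=zz": True}]

# Fit the vectorizer once on the union of all seen feature dicts so
# that every later transform maps into the same column space,
# regardless of which features a particular set happens to contain.
vec = DictVectorizer(sparse=True)
vec.fit(train_feats + dev_feats)

X_train = vec.transform(train_feats)
X_dev = vec.transform(dev_feats)
print(X_train.shape, X_dev.shape)  # same number of columns in both
```

Fitting each set separately with fit_transform() would instead give the training and development matrices different widths, which is exactly the mismatch the report describes.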

4. Results

Though the specific accuracy of the system varied between 55% and 61% depending on the classifier and feature set used, ultimately the system was shown to match the predicted class to the true class about 60% of the time on an unseen and undeveloped test set. Precision and recall for individual languages varied quite extensively from this 60% benchmark.

