B. Mansurov and A. Mansurov
Uzbek Cyrillic-Latin-Cyrillic Machine Transliteration
2.3 Data
The words in Cyrillic and their Latin variants used in our experiments are taken from an Uzbek Cyrillic-to-Latin spelling dictionary [Togʻayev et al. 1999]. The dictionary, first and foremost, includes widely used words of the contemporary Uzbek literary language. Its authors tried not to include words that are straightforward to spell or whose derivational affixes are simple; words that are prone to spelling errors, however, are included. The dictionary consists of 13,855 entries and does not include abbreviations, acronyms, proper nouns, or morphologically inflected words (except for a few cases). After removing multi-word phrases and entries with punctuation marks, we are left with 12,418 words. These words are shuffled randomly and split into three sets: 9,499 words for training, 1,677 words for validation, and 1,242 words for testing.

2.4 Experiments

We experimented with a Naive Bayes and a decision tree classifier. The model based on the decision tree classifier achieved the highest scores. We think that the independence (naivety) assumption of the Naive Bayes classifier prevented it from learning important letter sequences, whereas the decision tree classifier can learn such sequences because our features are themselves sequential. We are not worried about overfitting because changes in language happen gradually over time and the number of words and affixes is limited in the short term.

We used scikit-learn's implementation of the decision tree classifier with the default parameters to carry out our experiments. To find the best hyperparameters, we created feature vectors consisting of a combination of zero to ten preceding and subsequent characters. For each combination of hyperparameters we converted the training, validation, and test datasets into feature vectors and classes, and removed duplicate data points within each dataset. We then trained models and picked the best one based on the character-level micro-averaged F1 score on the validation set.
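The data preparation above amounts to a plain shuffle-and-split. The following is a minimal sketch of that step; the seed value and the use of Python's random module are assumptions, since the paper does not state how the shuffling was done.

```python
import random

def split_words(words, n_train=9499, n_valid=1677, seed=0):
    """Shuffle the filtered dictionary words and split them into
    training, validation, and test sets (9,499 / 1,677 / 1,242)."""
    words = list(words)
    random.Random(seed).shuffle(words)  # seed is illustrative only
    train = words[:n_train]
    valid = words[n_train:n_train + n_valid]
    test = words[n_train + n_valid:]
    return train, valid, test

# Usage with placeholder strings; the real input is the 12,418 words kept
# from the Togʻayev et al. 1999 dictionary after filtering.
words = [f"word{i}" for i in range(12418)]
train, valid, test = split_words(words)
print(len(train), len(valid), len(test))  # 9499 1677 1242
```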
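To make the feature construction and training procedure concrete, here is a minimal sketch under several assumptions not stated in the paper: a one-to-one character alignment between source and target words, a padding symbol for positions outside the word, and an ordinal encoding of characters as numeric features. The toy word pairs and the window size are illustrative only.

```python
from sklearn.tree import DecisionTreeClassifier
from sklearn.preprocessing import OrdinalEncoder
from sklearn.metrics import f1_score

PAD = "_"  # hypothetical padding symbol for positions outside the word

def char_features(word, i, n_prev, n_next):
    """Context window around position i: n_prev preceding and n_next
    subsequent characters, padded at the word boundaries."""
    left = word[max(0, i - n_prev):i].rjust(n_prev, PAD)
    right = word[i + 1:i + 1 + n_next].ljust(n_next, PAD)
    return list(left) + [word[i]] + list(right)

def build_dataset(pairs, n_prev, n_next):
    # pairs: (source_word, target_word); this toy setup assumes equal
    # lengths, which real Cyrillic-Latin alignment does not always give.
    X, y = [], []
    for src, tgt in pairs:
        for i in range(len(src)):
            X.append(char_features(src, i, n_prev, n_next))
            y.append(tgt[i])
    return X, y

# Toy aligned Latin-Cyrillic pairs, just to keep the sketch runnable.
train_pairs = [("salom", "салом"), ("kitob", "китоб")]
valid_pairs = [("maktab", "мактаб")]

X_train, y_train = build_dataset(train_pairs, n_prev=2, n_next=2)
X_valid, y_valid = build_dataset(valid_pairs, n_prev=2, n_next=2)

# Decision trees need numeric inputs, so characters are encoded ordinally;
# characters unseen during training map to a sentinel value.
enc = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)
X_train_enc = enc.fit_transform(X_train)
X_valid_enc = enc.transform(X_valid)

clf = DecisionTreeClassifier()  # default parameters, as in the paper
clf.fit(X_train_enc, y_train)

pred = clf.predict(X_valid_enc)
print("character-level micro F1:", f1_score(y_valid, pred, average="micro"))
```

In the full hyperparameter sweep, this would be repeated for every combination of zero to ten preceding and subsequent characters, keeping the model with the highest validation micro-F1.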