B. Mansurov and A. Mansurov
Uzbek Cyrillic-Latin-Cyrillic Machine Transliteration
2.3 Data
The words in Cyrillic and their Latin variants used in our experiments are taken from an Uzbek Cyrillic-to-Latin spelling dictionary [Togʻayev et al. 1999]. The dictionary, first and foremost, includes widely used words of the contemporary Uzbek literary language. Its authors tried not to include words that are straightforward to spell or whose derivational affixes are simple; words that are prone to spelling errors, however, are included. The dictionary consists of 13,855 entries and does not include abbreviations, acronyms, proper nouns, or morphologically inflected words (except for a few cases). After removing multi-word phrases and entries with punctuation marks, we are left with 12,418 words. These words are shuffled randomly and split into three sets: 9,499 words for training, 1,677 words for validation, and 1,242 words for testing.

2.4 Experiments

We experimented with a Naive Bayes and a decision tree classifier. The model based on the decision tree classifier achieved the highest scores. We think that the independence (naivety) assumption of the Naive Bayes classifier prevented it from learning important letter sequences, whereas the decision tree classifier can learn such sequences because our features are themselves sequential. We are not worried about overfitting because changes in language happen gradually over time and the number of words and affixes is limited in the short term.

We used scikit-learn's implementation of the decision tree classifier with the default parameters to carry out our experiments. To find the best hyperparameters, we created feature vectors consisting of a combination of zero to ten preceding and subsequent characters. For each combination of hyperparameters we converted the training, validation, and test datasets into feature vectors and classes, and removed duplicate data points within each dataset. We then trained models and picked the best one based on the character-level micro-averaged F1 score on the validation set.
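The data preparation above amounts to a plain shuffle-and-split. The following is a minimal sketch of that step; the seed value and the use of Python's random module are assumptions, since the paper does not state how the shuffling was done.

```python
import random

def split_words(words, n_train=9499, n_valid=1677, seed=0):
    """Shuffle the filtered dictionary words and split them into
    training, validation, and test sets (9,499 / 1,677 / 1,242)."""
    words = list(words)
    random.Random(seed).shuffle(words)  # seed is illustrative only
    train = words[:n_train]
    valid = words[n_train:n_train + n_valid]
    test = words[n_train + n_valid:]
    return train, valid, test

# Usage with placeholder strings; the real input is the 12,418 words kept
# from the Togʻayev et al. 1999 dictionary after filtering.
words = [f"word{i}" for i in range(12418)]
train, valid, test = split_words(words)
print(len(train), len(valid), len(test))  # 9499 1677 1242
```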
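To make the feature construction and training procedure concrete, here is a minimal sketch under several assumptions not stated in the paper: a one-to-one character alignment between source and target words, a padding symbol for positions outside the word, and an ordinal encoding of characters as numeric features. The toy word pairs and the window size are illustrative only.

```python
from sklearn.tree import DecisionTreeClassifier
from sklearn.preprocessing import OrdinalEncoder
from sklearn.metrics import f1_score

PAD = "_"  # hypothetical padding symbol for positions outside the word

def char_features(word, i, n_prev, n_next):
    """Context window around position i: n_prev preceding and n_next
    subsequent characters, padded at the word boundaries."""
    left = word[max(0, i - n_prev):i].rjust(n_prev, PAD)
    right = word[i + 1:i + 1 + n_next].ljust(n_next, PAD)
    return list(left) + [word[i]] + list(right)

def build_dataset(pairs, n_prev, n_next):
    # pairs: (source_word, target_word); this toy setup assumes equal
    # lengths, which real Cyrillic-Latin alignment does not always give.
    X, y = [], []
    for src, tgt in pairs:
        for i in range(len(src)):
            X.append(char_features(src, i, n_prev, n_next))
            y.append(tgt[i])
    return X, y

# Toy aligned Latin-Cyrillic pairs, just to keep the sketch runnable.
train_pairs = [("salom", "салом"), ("kitob", "китоб")]
valid_pairs = [("maktab", "мактаб")]

X_train, y_train = build_dataset(train_pairs, n_prev=2, n_next=2)
X_valid, y_valid = build_dataset(valid_pairs, n_prev=2, n_next=2)

# Decision trees need numeric inputs, so characters are encoded ordinally;
# characters unseen during training map to a sentinel value.
enc = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)
X_train_enc = enc.fit_transform(X_train)
X_valid_enc = enc.transform(X_valid)

clf = DecisionTreeClassifier()  # default parameters, as in the paper
clf.fit(X_train_enc, y_train)

pred = clf.predict(X_valid_enc)
print("character-level micro F1:", f1_score(y_valid, pred, average="micro"))
```

In the full hyperparameter sweep, this would be repeated for every combination of zero to ten preceding and subsequent characters, keeping the model with the highest validation micro-F1.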