METADATA AND FORMATTING ISSUES IN THE UZBEK-ENGLISH PARALLEL CORPUS AND EXISTING NLP TOOLS FOR THE UZBEK LANGUAGE
DOI:
https://doi.org/10.47390/ts-v3i9y2025No3
Keywords:
parallel corpus, metadata, TEI, CoNLL-U, Uzbek language, NLP, morphological analysis, tagging, lemmatization, linguistic resources.
Abstract
This article examines metadata formatting, the use of the TEI and CoNLL-U standards, and existing Natural Language Processing (NLP) tools for the Uzbek language in the context of creating an Uzbek-English parallel corpus. It discusses each stage of corpus development, including text alignment, syntactic and morphological annotation, and structural encoding. It also evaluates the performance of Uzbek morphological analyzers, lemmatizers, and POS taggers, emphasizing their practical significance for constructing high-quality bilingual corpora. The results provide a methodological basis for accurately encoding parallel corpora, analyzing them automatically, and integrating them into linguistic search and processing systems.