METADATA AND FORMATTING ISSUES IN THE UZBEK-ENGLISH PARALLEL CORPUS AND EXISTING NLP TOOLS FOR THE UZBEK LANGUAGE

Authors

  • Botir Elov
  • Ma’rufjon Amirkulov
  • Malika Suyunova

DOI:

https://doi.org/10.47390/ts-v3i9y2025No3

Keywords:

parallel corpus, metadata, TEI, CoNLL-U, Uzbek language, NLP, morphological analysis, tagging, lemmatization, linguistic resources.

Abstract

This article examines metadata formatting, the application of the TEI and CoNLL-U standards, and existing Natural Language Processing (NLP) tools for the Uzbek language in the context of building an Uzbek-English parallel corpus. It discusses each stage of corpus development, including text alignment, syntactic and morphological annotation, and structural encoding. It also evaluates the performance of Uzbek morphological analyzers, lemmatizers, and POS taggers, emphasizing their practical significance for constructing high-quality bilingual corpora. The results provide a methodological basis for encoding parallel corpora accurately, analyzing them automatically, and integrating them into linguistic search and processing systems.
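
As an illustration of the kind of encoding the abstract refers to, the following minimal Python sketch reads one CoNLL-U sentence block together with its sentence-level metadata comments. It is not taken from the article: the sample sentence, the `sent_id` value, the `text_en` key used to carry the aligned English sentence, and the function name `parse_conllu` are all assumptions made for demonstration.

```python
# Minimal sketch (not the authors' tooling): parse one CoNLL-U sentence block.
# The sample sentence, its IDs, and the "text_en" metadata key are hypothetical.

# The ten standard CoNLL-U columns, in order.
FIELDS = ["id", "form", "lemma", "upos", "xpos",
          "feats", "head", "deprel", "deps", "misc"]

SAMPLE = (
    "# sent_id = uz-en-0001\n"
    "# text = Men kitob o'qidim.\n"
    "# text_en = I read a book.\n"
    "1\tMen\tmen\tPRON\t_\t_\t3\tnsubj\t_\t_\n"
    "2\tkitob\tkitob\tNOUN\t_\t_\t3\tobj\t_\t_\n"
    "3\to'qidim\to'qi\tVERB\t_\t_\t0\troot\t_\t_\n"
    "4\t.\t.\tPUNCT\t_\t_\t3\tpunct\t_\t_\n"
)


def parse_conllu(block):
    """Return (metadata dict, list of token dicts) for a single sentence block."""
    meta, tokens = {}, []
    for line in block.splitlines():
        if not line.strip():
            continue  # blank lines separate sentences; ignored in this sketch
        if line.startswith("#"):
            # Metadata comments have the form "# key = value".
            key, sep, value = line[1:].partition("=")
            if sep:
                meta[key.strip()] = value.strip()
            continue
        cols = line.split("\t")
        if len(cols) != len(FIELDS):
            raise ValueError(f"expected {len(FIELDS)} columns, got {len(cols)}")
        tokens.append(dict(zip(FIELDS, cols)))
    return meta, tokens


meta, tokens = parse_conllu(SAMPLE)
print(meta["sent_id"], "|", meta["text_en"])
for tok in tokens:
    print(tok["id"], tok["form"], tok["lemma"], tok["upos"])
```

Keeping the aligned English sentence in a comment line is only one possible design; a TEI-encoded corpus could instead hold the alignment in separate linked elements.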

Submitted

2025-10-22

Published

2025-10-23

How to Cite

Elov, B., Amirkulov, M., & Suyunova, M. (2025). METADATA AND FORMATTING ISSUES IN THE UZBEK-ENGLISH PARALLEL CORPUS AND EXISTING NLP TOOLS FOR THE UZBEK LANGUAGE. Techscience Uz - Topical Issues of Technical Sciences, 3(9), 14–21. https://doi.org/10.47390/ts-v3i9y2025No3
