MULTILINGUAL PARALLEL CORPUS ARCHITECTURE AND AUTOMATIC SENTENCE ALIGNMENT ALGORITHM
DOI:
https://doi.org/10.47390/ydif-y2026v2i8/n02
Keywords:
parallel corpus, multilingualism, architecture, sentence alignment, natural language processing, machine translation.
Abstract
This article proposes a flexible architecture and an automatic sentence alignment algorithm designed for the creation, storage, and efficient utilization of multilingual parallel corpora. The proposed architecture is built on a three-level hierarchical model consisting of work, sentence, and word layers. To identify parallel sentence pairs, a hybrid algorithm based on length similarity, positional proximity, and semantic similarity was developed.
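As an illustration only, the three-level model and the hybrid score described in the abstract can be sketched as follows. This is a minimal sketch, not the authors' implementation: the class names (Work, Sentence, Word), the weights, the threshold, the greedy best-match strategy, and the two-dimensional toy embeddings are all assumptions made for the example. In practice the embeddings would come from a multilingual sentence encoder.

```python
import math
from dataclasses import dataclass, field


@dataclass
class Word:
    text: str


@dataclass
class Sentence:
    words: list[Word]
    embedding: list[float]  # assumed: precomputed by some multilingual encoder

    @property
    def char_length(self) -> int:
        # Length in characters, ignoring spaces between words.
        return sum(len(w.text) for w in self.words)


@dataclass
class Work:
    language: str
    sentences: list[Sentence] = field(default_factory=list)


def length_similarity(a: Sentence, b: Sentence) -> float:
    # Ratio of the shorter to the longer sentence, in [0, 1].
    la, lb = a.char_length, b.char_length
    return min(la, lb) / max(la, lb) if max(la, lb) else 0.0


def positional_proximity(i: int, n: int, j: int, m: int) -> float:
    # Compare relative positions of sentence i of n and sentence j of m.
    pi = i / (n - 1) if n > 1 else 0.0
    pj = j / (m - 1) if m > 1 else 0.0
    return 1.0 - abs(pi - pj)


def cosine(u: list[float], v: list[float]) -> float:
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


def hybrid_score(src: Work, i: int, tgt: Work, j: int,
                 weights=(0.2, 0.2, 0.6)) -> float:
    # Weighted sum of the three signals; the weights are hypothetical.
    a, b = src.sentences[i], tgt.sentences[j]
    w_len, w_pos, w_sem = weights
    return (w_len * length_similarity(a, b)
            + w_pos * positional_proximity(i, len(src.sentences),
                                           j, len(tgt.sentences))
            + w_sem * cosine(a.embedding, b.embedding))


def align(src: Work, tgt: Work, threshold: float = 0.5) -> list[tuple[int, int]]:
    # Greedy matching: each source sentence takes its best-scoring target,
    # kept only if the score reaches the (assumed) threshold.
    pairs = []
    for i in range(len(src.sentences)):
        scores = [hybrid_score(src, i, tgt, j) for j in range(len(tgt.sentences))]
        if not scores:
            continue
        best = max(range(len(scores)), key=scores.__getitem__)
        if scores[best] >= threshold:
            pairs.append((i, best))
    return pairs


def make_sentence(text: str, embedding: list[float]) -> Sentence:
    return Sentence([Word(t) for t in text.split()], embedding)


# Toy data: the embeddings are invented for the example.
english = Work("en", [make_sentence("The book is on the table", [1.0, 0.0]),
                      make_sentence("She reads every day", [0.0, 1.0])])
german = Work("de", [make_sentence("Das Buch liegt auf dem Tisch", [0.9, 0.1]),
                     make_sentence("Sie liest jeden Tag", [0.1, 0.9])])
print(align(english, german))
```

A greedy per-sentence match keeps the sketch short; it does not handle one-to-many or many-to-one alignments, which a dynamic-programming search over the same score could cover.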