Fine-tuning the RuBERT language model to improve the accuracy of medical query analysis
https://doi.org/10.25881/18110193_2026_1_64
Abstract
The aim of the study was to improve the accuracy of semantic search over Russian-language medical information by fine-tuning the RuBERT language model on the specialized RuMedDaNet dataset using the Matryoshka Representation Learning method to create compact and efficient vector representations of text.
Materials and Methods. The study used the RuMedDaNet dataset of Russian-language medical texts. Several embedding-training techniques were applied to optimize performance, including the “matryoshka” approach, which makes it possible to reduce the dimensionality of vector representations without loss of quality.
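A minimal sketch of how such matryoshka-style fine-tuning can be set up with the sentence-transformers library is shown below; the base checkpoint name, the training pairs, and the list of dimensions are illustrative assumptions, not the authors' exact configuration.

```python
from datasets import Dataset
from sentence_transformers import SentenceTransformer, SentenceTransformerTrainer
from sentence_transformers.losses import MatryoshkaLoss, MultipleNegativesRankingLoss

# Base Russian BERT encoder; the exact checkpoint used in the paper is an assumption here.
model = SentenceTransformer("DeepPavlov/rubert-base-cased")

# Illustrative (query, relevant passage) pairs standing in for RuMedDaNet examples.
train_dataset = Dataset.from_dict({
    "anchor": [
        "Помогает ли аспирин при повышенной температуре?",
        "Какие препараты снижают артериальное давление?",
    ],
    "positive": [
        "Аспирин обладает жаропонижающим действием и применяется при лихорадке.",
        "Ингибиторы АПФ используются для снижения артериального давления.",
    ],
})

# MatryoshkaLoss wraps a base contrastive loss so that truncated prefixes of the
# embedding (here 768, 512, 256, 128 and 64 dimensions) are all trained to rank well.
base_loss = MultipleNegativesRankingLoss(model)
loss = MatryoshkaLoss(model, base_loss, matryoshka_dims=[768, 512, 256, 128, 64])

trainer = SentenceTransformerTrainer(model=model, train_dataset=train_dataset, loss=loss)
trainer.train()
```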
Results. Experiments demonstrated a significant improvement in key search metrics (NDCG, MRR) compared to the baseline RuBERT model. The language model trained in the study has been uploaded to the Hugging Face platform, where it is now available for open use.
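Retrieval quality of the kind reported here (NDCG, MRR) can be measured, for example, with the InformationRetrievalEvaluator from sentence-transformers; the queries, corpus, and relevance judgments below are toy placeholders rather than the actual evaluation split.

```python
from sentence_transformers import SentenceTransformer
from sentence_transformers.evaluation import InformationRetrievalEvaluator

model = SentenceTransformer("DeepPavlov/rubert-base-cased")  # or the fine-tuned checkpoint

# Toy queries, corpus, and relevance judgments standing in for the held-out split.
queries = {"q1": "Какие препараты снижают артериальное давление?"}
corpus = {
    "d1": "Ингибиторы АПФ применяются для снижения артериального давления.",
    "d2": "Антибиотики используются для лечения бактериальных инфекций.",
}
relevant_docs = {"q1": {"d1"}}

evaluator = InformationRetrievalEvaluator(
    queries=queries,
    corpus=corpus,
    relevant_docs=relevant_docs,
    ndcg_at_k=[10],
    mrr_at_k=[10],
    name="medical-retrieval-dev",
)
print(evaluator(model))  # reports NDCG@10, MRR@10 and related retrieval metrics
```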
Conclusion. The proposed RuBERT fine-tuning method proved effective for retrieval tasks in medical RAG (retrieval-augmented generation) systems. The current limitations of the approach and directions for further research are also discussed.
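As a usage illustration, the checkpoint published by the authors on Hugging Face (see reference 14) can be loaded as the retriever of a RAG pipeline; the 256-dimension truncation and the example texts below are assumptions made for this sketch, not settings from the paper.

```python
from sentence_transformers import SentenceTransformer

# Fine-tuned checkpoint published on Hugging Face (reference 14).
# truncate_dim keeps only the first 256 dimensions of each Matryoshka embedding;
# 256 is an assumed trade-off between index size and accuracy.
model = SentenceTransformer("TrungKiencding/Med-Bert-Matryoshka-v1", truncate_dim=256)

query_embedding = model.encode("симптомы железодефицитной анемии")
passage_embeddings = model.encode([
    "Железодефицитная анемия проявляется слабостью, бледностью и утомляемостью.",
    "Гипертоническая болезнь характеризуется стойким повышением артериального давления.",
])

# Cosine similarities rank candidate passages for the retrieval step of a RAG pipeline.
print(model.similarity(query_embedding, passage_embeddings))
```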
About the Authors
I. L. Kashirina
Russian Federation
DSc., Professor
Moscow
Yu. V. Starichkova
Russian Federation
PhD.
Moscow
T. K. Le
Russian Federation
Moscow
References
1. Lewis P, Perez J, Piktus A, et al. Retrieval-augmented generation for knowledge-intensive NLP tasks. Adv Neural Inf Process Syst. 2020; 33: 9459-9474. doi: 10.48550/arXiv.2005.11401.
2. Kuratov Y, Arkhipov M. Adaptation of deep bidirectional multilingual transformers for Russian language. arXiv. Preprint posted online May 17, 2019. doi: 10.48550/arXiv.1905.07213.
3. MedBench [Internet]. An open set of tasks in healthcare [cited 2025 May 3]. Available from: https://medbench.ru/
4. Blinov P, Chertok A, Drozdov A, et al. RuMedBench: a Russian medical language understanding benchmark. Artif Intell Med. 2022; 383-392. doi: 10.1007/978-3-031-09342-5_38.
5. Kusupati A, Bhatt G, Rege A, et al. Matryoshka representation learning. Adv Neural Inf Process Syst. 2022; 35: 30233-30249. doi: 10.48550/arXiv.2205.13147.
6. Devlin J, Chang MW, Lee K, Toutanova K. BERT: Pre-training of deep bidirectional transformers for language understanding. Proc NAACL-HLT. 2019; 1: 4171-4186. doi: 10.18653/v1/N19-1423.
7. Yalunin A, Nesterov A, Umerenkov D. RuBioRoBERTa: a pre-trained biomedical language model for Russian language biomedical text mining. arXiv. Preprint posted online April 8, 2022. doi: 10.48550/arXiv.2204.03951.
8. Reimers N, Gurevych I. Sentence-BERT: Sentence embeddings using Siamese BERT-networks. Proc EMNLP-IJCNLP. 2019; 3982-3992. doi: 10.18653/v1/D19-1410.
9. Dao T, Fu DY, Ermon S, et al. FlashAttention: Fast and memory-efficient exact attention with IO-awareness. Adv Neural Inf Process Syst. 2022; 35: 16344-16359. doi: 10.48550/arXiv.2205.14135.
10. Dao T. FlashAttention-2: Faster attention with better parallelism and work partitioning. arXiv. Preprint posted online July 17, 2023. doi: 10.48550/arXiv.2307.08691.
11. Vaswani A, Shazeer N, Parmar N, et al. Attention is all you need. Adv Neural Inf Process Syst. 2017; 30. doi: 10.48550/arXiv.1706.03762.
12. Wang Y, Wang L, Li Y, et al. A theoretical analysis of NDCG type ranking measures. Proc Conf Learn Theory (COLT), JMLR W&CP. 2013; 30: 25-54. doi: 10.48550/arXiv.1304.6480.
13. Craswell N. Mean Reciprocal Rank. In: Liu L, Özsu MT, eds. Encyclopedia of Database Systems. Springer; 2009. doi: 10.1007/978-0-387-39940-9_488.
14. TrungKiencding. Med-Bert-Matryoshka-v1 [Internet]. Hugging Face [cited 2025 May 3]. Available from: https://huggingface.co/TrungKiencding/Med-Bert-Matryoshka-v1.
15. GigaChatEmbeddings [Internet]. [cited 2025 May 3]. Available from: https://deepwiki.com/ai-forever/gigachain/3-gigachatembeddings.
16. Sentence-transformers/all-MiniLM-L12-v2 [Internet]. Hugging Face [cited 2025 May 3]. Available from: https://huggingface.co/sentence-transformers/all-MiniLM-L12-v2.
17. Cointegrated/rubert-tiny2 [Internet]. Hugging Face [cited 2025 May 3]. Available from: https://huggingface.co/cointegrated/rubert-tiny2.
18. Ai-forever/sbert_large_nlu_ru [Internet]. Hugging Face [cited 2025 May 3]. Available from: https://huggingface.co/ai-forever/sbert_large_nlu_ru.
19. Jin D, Pan E, Oufattole N, Weng WH, Fang H, Szolovits P. What disease does this patient have? A large-scale open domain question answering dataset from medical exams. Appl Sci. 2021; 11(14): 6421. doi: 10.3390/app11146421.
20. Radyush DV. Knowledge injection methods in question answering. Russian Technological Journal. 2025; 13(3): 21-43. (In Russ). doi: 10.32362/2500-316X-2025-13-3-21-43.
For citations:
Kashirina I.L., Starichkova Yu.V., Le T.K. Fine-tuning the RuBERT language model to improve the accuracy of medical query analysis. Medical Doctor and Information Technologies. 2026;(1):64-73. (In Russ.) https://doi.org/10.25881/18110193_2026_1_64
