Fine-tuning the RuBERT language model to improve the accuracy of medical query analysis
https://doi.org/10.25881/18110193_2026_1_64
Abstract
The aim of the study was to improve the accuracy of semantic search over Russian-language medical information by fine-tuning the RuBERT language model on the specialized RuMedDaNet dataset using the Matryoshka Representation Learning method to create compact and efficient vector representations of text.
Materials and Methods. The study used the RuMedDaNet dataset of Russian-language medical texts. Several embedding-training techniques were applied to optimize performance, including the “matryoshka” approach, which makes it possible to reduce the dimensionality of vector representations without loss of quality.
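A minimal sketch of how such matryoshka-style fine-tuning can be set up with the sentence-transformers library is shown below; the base checkpoint name, the training pairs, and the list of dimensions are illustrative assumptions, not the authors' exact configuration.

```python
from datasets import Dataset
from sentence_transformers import SentenceTransformer, SentenceTransformerTrainer
from sentence_transformers.losses import MatryoshkaLoss, MultipleNegativesRankingLoss

# Base Russian BERT encoder; the exact checkpoint used in the paper is an assumption here.
model = SentenceTransformer("DeepPavlov/rubert-base-cased")

# Illustrative (query, relevant passage) pairs standing in for RuMedDaNet examples.
train_dataset = Dataset.from_dict({
    "anchor": [
        "Помогает ли аспирин при повышенной температуре?",
        "Какие препараты снижают артериальное давление?",
    ],
    "positive": [
        "Аспирин обладает жаропонижающим действием и применяется при лихорадке.",
        "Ингибиторы АПФ используются для снижения артериального давления.",
    ],
})

# MatryoshkaLoss wraps a base contrastive loss so that truncated prefixes of the
# embedding (here 768, 512, 256, 128 and 64 dimensions) are all trained to rank well.
base_loss = MultipleNegativesRankingLoss(model)
loss = MatryoshkaLoss(model, base_loss, matryoshka_dims=[768, 512, 256, 128, 64])

trainer = SentenceTransformerTrainer(model=model, train_dataset=train_dataset, loss=loss)
trainer.train()
```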
Results. Experiments demonstrated a significant improvement in key search metrics (NDCG, MRR) compared to the baseline RuBERT model. The language model trained in the study has been uploaded to the Hugging Face platform, where it is now available for open use.
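Retrieval quality of the kind reported here (NDCG, MRR) can be measured, for example, with the InformationRetrievalEvaluator from sentence-transformers; the queries, corpus, and relevance judgments below are toy placeholders rather than the actual evaluation split.

```python
from sentence_transformers import SentenceTransformer
from sentence_transformers.evaluation import InformationRetrievalEvaluator

model = SentenceTransformer("DeepPavlov/rubert-base-cased")  # or the fine-tuned checkpoint

# Toy queries, corpus, and relevance judgments standing in for the held-out split.
queries = {"q1": "Какие препараты снижают артериальное давление?"}
corpus = {
    "d1": "Ингибиторы АПФ применяются для снижения артериального давления.",
    "d2": "Антибиотики используются для лечения бактериальных инфекций.",
}
relevant_docs = {"q1": {"d1"}}

evaluator = InformationRetrievalEvaluator(
    queries=queries,
    corpus=corpus,
    relevant_docs=relevant_docs,
    ndcg_at_k=[10],
    mrr_at_k=[10],
    name="medical-retrieval-dev",
)
print(evaluator(model))  # reports NDCG@10, MRR@10 and related retrieval metrics
```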
Conclusion. The proposed RuBERT fine-tuning method proved effective for retrieval tasks in medical RAG (retrieval-augmented generation) systems. The current limitations of the approach and directions for further research are also discussed.
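As a usage illustration, the checkpoint published by the authors on Hugging Face (see reference 14) can be loaded as the retriever of a RAG pipeline; the 256-dimension truncation and the example texts below are assumptions made for this sketch, not settings from the paper.

```python
from sentence_transformers import SentenceTransformer

# Fine-tuned checkpoint published on Hugging Face (reference 14).
# truncate_dim keeps only the first 256 dimensions of each Matryoshka embedding;
# 256 is an assumed trade-off between index size and accuracy.
model = SentenceTransformer("TrungKiencding/Med-Bert-Matryoshka-v1", truncate_dim=256)

query_embedding = model.encode("симптомы железодефицитной анемии")
passage_embeddings = model.encode([
    "Железодефицитная анемия проявляется слабостью, бледностью и утомляемостью.",
    "Гипертоническая болезнь характеризуется стойким повышением артериального давления.",
])

# Cosine similarities rank candidate passages for the retrieval step of a RAG pipeline.
print(model.similarity(query_embedding, passage_embeddings))
```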
About the Authors
I. L. Kashirina
Russian Federation
DSc., Professor
Moscow
Yu. V. Starichkova
Russian Federation
PhD.
Moscow
T. K. Le
Russian Federation
Moscow
References
1. Lewis P, Perez J, Piktus A, et al. Retrieval-augmented generation for knowledge-intensive NLP tasks. Adv Neural Inf Process Syst. 2020; 33: 9459-9474. doi: 10.48550/arXiv.2005.11401.
2. Kuratov Y, Arkhipov M. Adaptation of deep bidirectional multilingual transformers for Russian language. arXiv. Preprint posted online May 17, 2019. doi: 10.48550/arXiv.1905.07213.
3. MedBench [Internet]. An open set of tasks in healthcare [cited 2025 May 3]. Available from: https://medbench.ru/
4. Blinov P, Chertok A, Drozdov A, et al. RuMedBench: a Russian medical language understanding benchmark. Artif Intell Med. 2022; 383-392. doi: 10.1007/978-3-031-09342-5_38.
5. Kusupati A, Bhatt G, Rege A, et al. Matryoshka representation learning. Adv Neural Inf Process Syst. 2022; 35: 30233-30249. doi: 10.48550/arXiv.2205.13147.
6. Devlin J, Chang MW, Lee K, Toutanova K. BERT: Pre-training of deep bidirectional transformers for language understanding. Proc NAACL-HLT. 2019; 1: 4171-4186. doi: 10.18653/v1/N19-1423.
7. Yalunin A, Nesterov A, Umerenkov D. RuBioRoBERTa: a pre-trained biomedical language model for Russian language biomedical text mining. arXiv. Preprint posted online April 8, 2022. doi: 10.48550/arXiv.2204.03951.
8. Reimers N, Gurevych I. Sentence-BERT: Sentence embeddings using Siamese BERT-networks. Proc EMNLP-IJCNLP. 2019; 3982-3992. doi: 10.18653/v1/D19-1410.
9. Dao T, Fu DY, Ermon S, et al. FlashAttention: Fast and memory-efficient exact attention with IO-awareness. Adv Neural Inf Process Syst. 2022; 35: 16344-16359. doi: 10.48550/arXiv.2205.14135.
10. Dao T. FlashAttention-2: Faster attention with better parallelism and work partitioning. arXiv. Preprint posted online July 17, 2023. doi: 10.48550/arXiv.2307.08691.
11. Vaswani A, Shazeer N, Parmar N, et al. Attention is all you need. Adv Neural Inf Process Syst. 2017; 30. doi: 10.48550/arXiv.1706.03762.
12. Wang Y, Wang L, Li Y, et al. A theoretical analysis of NDCG type ranking measures. Proc Conf Learn Theory (COLT), JMLR W&CP. 2013; 30: 25-54. doi: 10.48550/arXiv.1304.6480.
13. Craswell N. Mean Reciprocal Rank. In: Liu L, Özsu MT, eds. Encyclopedia of Database Systems. Springer; 2009. doi: 10.1007/978-0-387-39940-9_488.
14. TrungKiencding. Med-Bert-Matryoshka-v1 [Internet]. Hugging Face [cited 2025 May 3]. Available from: https://huggingface.co/TrungKiencding/Med-Bert-Matryoshka-v1.
15. GigaChatEmbeddings [Internet]. [cited 2025 May 3]. Available from: https://deepwiki.com/ai-forever/gigachain/3-gigachatembeddings.
16. Sentence-transformers/all-MiniLM-L12-v2 [Internet]. Hugging Face [cited 2025 May 3]. Available from: https://huggingface.co/sentence-transformers/all-MiniLM-L12-v2.
17. Cointegrated/rubert-tiny2 [Internet]. Hugging Face [cited 2025 May 3]. Available from: https://huggingface.co/cointegrated/rubert-tiny2.
18. Ai-forever/sbert_large_nlu_ru [Internet]. Hugging Face [cited 2025 May 3]. Available from: https://huggingface.co/ai-forever/sbert_large_nlu_ru.
19. Jin D, Pan E, Oufattole N, Weng WH, Fang H, Szolovits P. What disease does this patient have? A large-scale open domain question answering dataset from medical exams. Appl Sci. 2021; 11(14): 6421. doi: 10.3390/app11146421.
20. Radyush DV. Knowledge injection methods in question answering. Russian Technological Journal. 2025; 13(3): 21-43. (In Russ). doi: 10.32362/2500-316X-2025-13-3-21-43.
For citations:
Kashirina I.L., Starichkova Yu.V., Le T.K. Fine-tuning the RuBERT language model to improve the accuracy of medical query analysis. Medical Doctor and Information Technologies. 2026;(1):64-73. (In Russ.) https://doi.org/10.25881/18110193_2026_1_64
