<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Publishing DTD v1.3 20210610//EN" "JATS-journalpublishing1-3.dtd">
<article article-type="research-article" dtd-version="1.3" xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xml:lang="ru"><front><journal-meta><journal-id journal-id-type="publisher-id">vitj</journal-id><journal-title-group><journal-title xml:lang="ru">Врач и информационные технологии</journal-title><trans-title-group xml:lang="en"><trans-title>Medical Doctor and Information Technologies</trans-title></trans-title-group></journal-title-group><issn pub-type="ppub">1811-0193</issn><issn pub-type="epub">2413-5208</issn><publisher><publisher-name>Pirogov National Medical and Surgical Center</publisher-name></publisher></journal-meta><article-meta><article-id pub-id-type="doi">10.25881/18110193_2026_1_64</article-id><article-id custom-type="elpub" pub-id-type="custom">vitj-313</article-id><article-categories><subj-group subj-group-type="heading"><subject>Research Article</subject></subj-group><subj-group subj-group-type="section-heading" xml:lang="ru"><subject>ОРИГИНАЛЬНЫЕ ИССЛЕДОВАНИЯ</subject></subj-group><subj-group subj-group-type="section-heading" xml:lang="en"><subject>ORIGINAL RESEARCH</subject></subj-group></article-categories><title-group><article-title>Тонкая настройка языковой модели RuBERT для повышения точности анализа медицинских запросов</article-title><trans-title-group xml:lang="en"><trans-title>Fine-tuning the RuBERT language model to improve the accuracy of medical query analysis</trans-title></trans-title-group></title-group><contrib-group><contrib contrib-type="author" corresp="yes"><contrib-id contrib-id-type="orcid">https://orcid.org/0000-0002-8664-9817</contrib-id><name-alternatives><name name-style="eastern" xml:lang="ru"><surname>Каширина</surname><given-names>И. Л.</given-names></name><name name-style="western" xml:lang="en"><surname>Kashirina</surname><given-names>I. 
L.</given-names></name></name-alternatives><bio xml:lang="ru"><p>д.т.н., профессор</p><p>Москва</p></bio><bio xml:lang="en"><p>DSc., Professor</p><p>Moscow</p></bio><email xlink:type="simple">kashirina@mirea.ru</email><xref ref-type="aff" rid="aff-1"/></contrib><contrib contrib-type="author" corresp="yes"><name-alternatives><name name-style="eastern" xml:lang="ru"><surname>Старичкова</surname><given-names>Ю. В.</given-names></name><name name-style="western" xml:lang="en"><surname>Starichkova</surname><given-names>Yu. V.</given-names></name></name-alternatives><bio xml:lang="ru"><p>к.т.н.</p><p>Москва</p></bio><bio xml:lang="en"><p>PhD.</p><p>Moscow</p></bio><email xlink:type="simple">starichkova@mirea.ru</email><xref ref-type="aff" rid="aff-1"/></contrib><contrib contrib-type="author" corresp="yes"><name-alternatives><name name-style="eastern" xml:lang="ru"><surname>Ле</surname><given-names>Ч. К.</given-names></name><name name-style="western" xml:lang="en"><surname>Le</surname><given-names>T. 
K.</given-names></name></name-alternatives><bio xml:lang="ru"><p>Москва</p></bio><bio xml:lang="en"><p>Moscow</p></bio><email xlink:type="simple">letrungkienlk4@gmail.com</email><xref ref-type="aff" rid="aff-1"/></contrib></contrib-group><aff-alternatives id="aff-1"><aff xml:lang="ru">МИРЭА – Российский технологический университет<country>Россия</country></aff><aff xml:lang="en">MIREA – Russian Technological University<country>Russian Federation</country></aff></aff-alternatives><pub-date pub-type="collection"><year>2026</year></pub-date><pub-date pub-type="epub"><day>29</day><month>03</month><year>2026</year></pub-date><volume>0</volume><issue>1</issue><fpage>64</fpage><lpage>73</lpage><permissions><copyright-statement>Copyright &#x00A9; Каширина И.Л., Старичкова Ю.В., Ле Ч.К., 2026</copyright-statement><copyright-year>2026</copyright-year><copyright-holder xml:lang="ru">Каширина И.Л., Старичкова Ю.В., Ле Ч.К.</copyright-holder><copyright-holder xml:lang="en">Kashirina I.L., Starichkova Y.V., Le T.K.</copyright-holder><license license-type="creative-commons-attribution" xlink:href="https://creativecommons.org/licenses/by/4.0/" xlink:type="simple"><license-p>This work is licensed under a Creative Commons Attribution 4.0 License.</license-p></license></permissions><self-uri xlink:href="https://www.vit-j.ru/jour/article/view/313">https://www.vit-j.ru/jour/article/view/313</self-uri><abstract><p>Цель исследования состояла в повышении точности семантического поиска медицинской информации на русском языке путем тонкой настройки языковой модели RuBERT на специализированном датасете RuMedDaNet с применением метода обучения Matryoshka Representation Learning для создания компактных и эффективных векторных представлений текста.</p><sec><title>Материалы и методы</title><p>Материалы и методы. В исследовании использовался датасет RuMedDaNet, содержащий русскоязычные медицинские тексты. 
Для оптимизации производительности поиска применялись различные техники обучения эмбеддингов (векторных представлений текста), включая подход «матрёшка», позволяющий уменьшить размерность векторных представлений без существенной потери качества.</p></sec><sec><title>Результаты</title><p>Результаты. Эксперименты показали значительное улучшение ключевых метрик поиска (NDCG, MRR) по сравнению с базовой моделью RuBERT. Обученная в исследовании языковая модель загружена на платформу Hugging Face, где теперь она доступна для открытого использования заинтересованными специалистами.</p></sec><sec><title>Заключение</title><p>Заключение. Предложенный метод тонкой настройки RuBERT эффективен для задач поиска в медицинских RAG (Retrieval Augmented Generation)-системах. В статье обсуждаются текущие ограничения предлагаемого подхода и направления дальнейших исследований.</p></sec></abstract><trans-abstract xml:lang="en"><p>The aim of the study was to improve the accuracy of semantic search of medical information in Russian by finetuning the RuBERT language model on the specialized RuMedDaNet dataset using the Matryoshka Representation Learning method to create compact and efficient vector representations of text.</p><sec><title>Materials and Methods</title><p>Materials and Methods. The study utilized the RuMedDaNet dataset, which contains Russian-language medical texts. Various embedding training techniques were applied to optimize performance, including the “matryoshka” approach, which enables reducing the dimensionality of vector representations without loss of quality.</p></sec><sec><title>Results</title><p>Results. Experiments demonstrated a significant improvement in key search metrics (NDCG, MRR) compared to the baseline RuBERT model. The language model trained in the study has been uploaded to the Hugging Face platform, where it is now available for open use.</p></sec><sec><title>Conclusion</title><p>Conclusion. 
The proposed RuBERT fine-tuning method was effective for search tasks in medical RAG systems. The current limitations of the approach and directions for further research are discussed.</p></sec></trans-abstract><kwd-group xml:lang="ru"><kwd>RuBERT</kwd><kwd>тонкая настройка</kwd><kwd>RuMedDaNet</kwd><kwd>медицинские тексты</kwd><kwd>векторный поиск</kwd><kwd>Matryoshka Representation Learning</kwd></kwd-group><kwd-group xml:lang="en"><kwd>RuBERT</kwd><kwd>fine-tuning</kwd><kwd>RuMedDaNet</kwd><kwd>medical texts</kwd><kwd>vector search</kwd><kwd>Matryoshka Representation
Learning</kwd></kwd-group><funding-group xml:lang="ru"><funding-statement>Авторы заявляют, что не получали финансовой поддержки при проведении данного исследования, написании и/или публикации данной статьи.</funding-statement></funding-group></article-meta></front><back><ref-list><title>References</title><ref id="cit1"><label>1</label><citation-alternatives><mixed-citation xml:lang="ru">Lewis P, Perez J, Piktus A, et al. Retrieval-augmented generation for knowledge-intensive NLP tasks. Adv Neural Inf Process Syst. 2020; 33: 9459-9474. doi: 10.48550/arXiv.2005.11401.</mixed-citation><mixed-citation xml:lang="en">Lewis P, Perez J, Piktus A, et al. Retrieval-augmented generation for knowledge-intensive NLP tasks. Adv Neural Inf Process Syst. 2020; 33: 9459-9474. doi: 10.48550/arXiv.2005.11401.</mixed-citation></citation-alternatives></ref><ref id="cit2"><label>2</label><citation-alternatives><mixed-citation xml:lang="ru">Kuratov Y, Arkhipov M. Adaptation of deep bidirectional multilingual transformers for Russian language. arXiv. Preprint posted online May 17, 2019. doi: 10.48550/arXiv.1905.07213.</mixed-citation><mixed-citation xml:lang="en">Kuratov Y, Arkhipov M. Adaptation of deep bidirectional multilingual transformers for Russian language. arXiv. Preprint posted online May 17, 2019. doi: 10.48550/arXiv.1905.07213.</mixed-citation></citation-alternatives></ref><ref id="cit3"><label>3</label><citation-alternatives><mixed-citation xml:lang="ru">MedBench [Internet]. Открытый набор задач в области здравоохранения [cited 2025 May 3]. Available from: https://medbench.ru/</mixed-citation><mixed-citation xml:lang="en">MedBench [Internet]. Открытый набор задач в области здравоохранения [cited 2025 May 3]. Available from: https://medbench.ru/</mixed-citation></citation-alternatives></ref><ref id="cit4"><label>4</label><citation-alternatives><mixed-citation xml:lang="ru">Blinov P, Chertok A, Drozdov A, et al. RuMedBench: a Russian medical language understanding benchmark. 
Artif Intell Med. 2022; 383-392. doi: 10.1007/978-3-031-09342-5_38.</mixed-citation><mixed-citation xml:lang="en">Blinov P, Chertok A, Drozdov A, et al. RuMedBench: a Russian medical language understanding benchmark. Artif Intell Med. 2022; 383-392. doi: 10.1007/978-3-031-09342-5_38.</mixed-citation></citation-alternatives></ref><ref id="cit5"><label>5</label><citation-alternatives><mixed-citation xml:lang="ru">Kusupati A, Ordonez V, Parikh D, et al. Matryoshka representation learning. Adv Neural Inf Process Syst. 2022; 35: 30233-30249. doi: 10.48550/arXiv.2205.13147.</mixed-citation><mixed-citation xml:lang="en">Kusupati A, Ordonez V, Parikh D, et al. Matryoshka representation learning. Adv Neural Inf Process Syst. 2022; 35: 30233-30249. doi: 10.48550/arXiv.2205.13147.</mixed-citation></citation-alternatives></ref><ref id="cit6"><label>6</label><citation-alternatives><mixed-citation xml:lang="ru">Devlin J, Chang MW, Lee K, Toutanova K. BERT: Pre-training of deep bidirectional transformers for language understanding. Proc NAACL-HLT. 2019; 1: 4171-4186. doi: 10.18653/v1/N19-1423.</mixed-citation><mixed-citation xml:lang="en">Devlin J, Chang MW, Lee K, Toutanova K. BERT: Pre-training of deep bidirectional transformers for language understanding. Proc NAACL-HLT. 2019; 1: 4171-4186. doi: 10.18653/v1/N19-1423.</mixed-citation></citation-alternatives></ref><ref id="cit7"><label>7</label><citation-alternatives><mixed-citation xml:lang="ru">Yalunin A, Nesterov A, Umerenkov D. RuBioRoBERTa: a pre-trained biomedical language model for Russian language biomedical text mining. arXiv. Preprint posted online April 8, 2022. doi: 10.48550/arXiv.2204.03951.</mixed-citation><mixed-citation xml:lang="en">Yalunin A, Nesterov A, Umerenkov D. RuBioRoBERTa: a pre-trained biomedical language model for Russian language biomedical text mining. arXiv. Preprint posted online April 8, 2022. 
doi: 10.48550/arXiv.2204.03951.</mixed-citation></citation-alternatives></ref><ref id="cit8"><label>8</label><citation-alternatives><mixed-citation xml:lang="ru">Reimers N, Gurevych I. Sentence-BERT: Sentence embeddings using Siamese BERT-networks. Proc EMNLP-IJCNLP. 2019; 3982-3992. doi: 10.18653/v1/D19-1410.</mixed-citation><mixed-citation xml:lang="en">Reimers N, Gurevych I. Sentence-BERT: Sentence embeddings using Siamese BERT-networks. Proc EMNLP-IJCNLP. 2019; 3982-3992. doi: 10.18653/v1/D19-1410.</mixed-citation></citation-alternatives></ref><ref id="cit9"><label>9</label><citation-alternatives><mixed-citation xml:lang="ru">Dao T, Fu T, Ermon S, et al. FlashAttention: Fast and memory-efficient exact attention with IOawareness. Adv Neural Inf Process Syst. 2022; 35: 16344-16359. doi: 10.48550/arXiv.2205.14135.</mixed-citation><mixed-citation xml:lang="en">Dao T, Fu T, Ermon S, et al. FlashAttention: Fast and memory-efficient exact attention with IOawareness. Adv Neural Inf Process Syst. 2022; 35: 16344-16359. doi: 10.48550/arXiv.2205.14135.</mixed-citation></citation-alternatives></ref><ref id="cit10"><label>10</label><citation-alternatives><mixed-citation xml:lang="ru">Dao T. Flashattention-2: Faster attention with better parallelism and work partitioning. arXiv. Preprint posted online July 17, 2023. arXiv: 2307.08691. doi: 10.48550/arXiv.2307.08691.</mixed-citation><mixed-citation xml:lang="en">Dao T. Flashattention-2: Faster attention with better parallelism and work partitioning. arXiv. Preprint posted online July 17, 2023. arXiv: 2307.08691. doi: 10.48550/arXiv.2307.08691.</mixed-citation></citation-alternatives></ref><ref id="cit11"><label>11</label><citation-alternatives><mixed-citation xml:lang="ru">Vaswani A, Shazeer N, Parmar N, et al. Attention is all you need. Adv Neural Inf Process Syst. 2017; 30. doi: 10.48550/arXiv.1706.03762.</mixed-citation><mixed-citation xml:lang="en">Vaswani A, Shazeer N, Parmar N, et al. Attention is all you need. 
Adv Neural Inf Process Syst. 2017; 30. doi: 10.48550/arXiv.1706.03762.</mixed-citation></citation-alternatives></ref><ref id="cit12"><label>12</label><citation-alternatives><mixed-citation xml:lang="ru">Wang Y, et al. A theoretical analysis of NDCG type ranking measures. J Mach Learn Res. 2013; 25-54. doi: 10.48550/arXiv.1304.6480.</mixed-citation><mixed-citation xml:lang="en">Wang Y, et al. A theoretical analysis of NDCG type ranking measures. J Mach Learn Res. 2013; 25-54. doi: 10.48550/arXiv.1304.6480.</mixed-citation></citation-alternatives></ref><ref id="cit13"><label>13</label><citation-alternatives><mixed-citation xml:lang="ru">Craswell N. Mean Reciprocal Rank. In: Liu L, Özsu MT, eds. Encyclopedia of Database Systems. Springer; 2009. doi: 10.1007/978-0-387-39940-9_488.</mixed-citation><mixed-citation xml:lang="en">Craswell N. Mean Reciprocal Rank. In: Liu L, Özsu MT, eds. Encyclopedia of Database Systems. Springer; 2009. doi: 10.1007/978-0-387-39940-9_488.</mixed-citation></citation-alternatives></ref><ref id="cit14"><label>14</label><citation-alternatives><mixed-citation xml:lang="ru">TrungKiencding. Med-Bert-Matryoshka-v1 [Internet]. Hugging Face [cited 2025 May 3]. Available from: https://huggingface.co/TrungKiencding/Med-Bert-Matryoshka-v1.</mixed-citation><mixed-citation xml:lang="en">TrungKiencding. Med-Bert-Matryoshka-v1 [Internet]. Hugging Face [cited 2025 May 3]. Available from: https://huggingface.co/TrungKiencding/Med-Bert-Matryoshka-v1.</mixed-citation></citation-alternatives></ref><ref id="cit15"><label>15</label><citation-alternatives><mixed-citation xml:lang="ru">GigaChatEmbeddings [Internet]. [cited 2025 May 3]. Available from: https://deepwiki.com/ai-forever/gigachain/3-gigachatembeddings.</mixed-citation><mixed-citation xml:lang="en">GigaChatEmbeddings [Internet]. [cited 2025 May 3]. 
Available from: https://deepwiki.com/ai-forever/gigachain/3-gigachatembeddings.</mixed-citation></citation-alternatives></ref><ref id="cit16"><label>16</label><citation-alternatives><mixed-citation xml:lang="ru">Sentence-transformers/all-MiniLM-L12-v2 [Internet]. Hugging Face [cited 2025 May 3]. Available from: https://huggingface.co/sentence-transformers/all-MiniLM-L12-v2.</mixed-citation><mixed-citation xml:lang="en">Sentence-transformers/all-MiniLM-L12-v2 [Internet]. Hugging Face [cited 2025 May 3]. Available from: https://huggingface.co/sentence-transformers/all-MiniLM-L12-v2.</mixed-citation></citation-alternatives></ref><ref id="cit17"><label>17</label><citation-alternatives><mixed-citation xml:lang="ru">Cointegrated/rubert-tiny2 [Internet]. Hugging Face [cited 2025 May 3]. Available from: https://huggingface.co/cointegrated/rubert-tiny2.</mixed-citation><mixed-citation xml:lang="en">Cointegrated/rubert-tiny2 [Internet]. Hugging Face [cited 2025 May 3]. Available from: https://huggingface.co/cointegrated/rubert-tiny2.</mixed-citation></citation-alternatives></ref><ref id="cit18"><label>18</label><citation-alternatives><mixed-citation xml:lang="ru">Ai-forever/sbert_large_nlu_ru [Internet]. Hugging Face [cited 2025 May 3]. Available from: https://huggingface.co/ai-forever/sbert_large_nlu_ru.</mixed-citation><mixed-citation xml:lang="en">Ai-forever/sbert_large_nlu_ru [Internet]. Hugging Face [cited 2025 May 3]. Available from: https://huggingface.co/ai-forever/sbert_large_nlu_ru.</mixed-citation></citation-alternatives></ref><ref id="cit19"><label>19</label><citation-alternatives><mixed-citation xml:lang="ru">Jin, D., Pan, E., Oufattole, N., Weng, W. H., Fang, H., &amp; Szolovits, P. (2021). What Disease Does This Patient Have? A Large-Scale Open Domain Question Answering Dataset from Medical Exams. Applied Sciences, 11(14), 6421. doi: 10.3390/app11146421.</mixed-citation><mixed-citation xml:lang="en">Jin, D., Pan, E., Oufattole, N., Weng, W. 
H., Fang, H., &amp; Szolovits, P. (2021). What Disease Does This Patient Have? A Large-Scale Open Domain Question Answering Dataset from Medical Exams. Applied Sciences, 11(14), 6421. doi: 10.3390/app11146421.</mixed-citation></citation-alternatives></ref><ref id="cit20"><label>20</label><citation-alternatives><mixed-citation xml:lang="ru">Радюш Д.В. Методы интеграции знаний для разработки вопросно-ответных систем. Russian Technological Journal. — 2025. — №13(3). — С.21-43.</mixed-citation><mixed-citation xml:lang="en">Radyush DV. Knowledge injection methods in question answering. Russian Technological Journal. 2025; 13(3): 21-43. (In Russ). doi: 10.32362/2500-316X-2025-13-3-21-43.</mixed-citation></citation-alternatives></ref></ref-list><fn-group><fn fn-type="conflict"><p>The authors declare that there are no conflicts of interest present.</p></fn></fn-group></back></article>
