Assessing the quality of large generative models for basic healthcare applications
https://doi.org/10.25881/18110193_2025_3_64
Abstract
Large generative models (LGMs) have significant potential for healthcare and medical science. While publications are growing exponentially, LGM studies lack quality and breakthrough findings. Research articles call for standardized approaches to ensure safe and effective integration of LGMs into clinical practice. Currently, the Moscow healthcare system is testing LGMs as tools for supporting medical decision-making, which has required development of specialized methods and techniques for assessing LGM quality. This paper presents two methods for assessing the quality of large generative models. Both methods are based on analysis of literature data (over 200 sources), results from comprehensive testing of 204 LGMs, and hands-on experience in assessing model quality using a sample of more than 12,000 cases. Designed for two main LGM application scenarios, the methods incorporate a dedicated approach to building test samples, tailored and validated questionnaires, testing methodologies, and unified requirements for the composition and structure of quality assessment outputs.
About the Authors
R. V. Reshetnikov
Russian Federation
PhD
Moscow
I. A. Tyrov
Russian Federation
Moscow
Yu. A. Vasilev
Russian Federation
PhD
Moscow
Yu. F. Shumskaya
Russian Federation
Moscow
A. V. Vladzymyrskyy
Russian Federation
DSc
Moscow
D. A. Akhmedzyanova
Russian Federation
Moscow
K. Yu. Bezhenova
Russian Federation
Moscow
M. D. Varyukhina
Russian Federation
PhD
Moscow
M. V. Sokolova
Russian Federation
Moscow
I. A. Blokhin
Russian Federation
PhD
Moscow
D. A. Voytenko
Russian Federation
Moscow
O. I. Mynko
Russian Federation
Moscow
M. R. Kodenko
Russian Federation
PhD
Moscow
O. V. Omelyanskaya
Russian Federation
Moscow
Review
For citations:
Reshetnikov R.V., Tyrov I.A., Vasilev Yu.A., Shumskaya Yu.F., Vladzymyrskyy A.V., Akhmedzyanova D.A., Bezhenova K.Yu., Varyukhina M.D., Sokolova M.V., Blokhin I.A., Voytenko D.A., Mynko O.I., Kodenko M.R., Omelyanskaya O.V. Assessing the quality of large generative models for basic healthcare applications. Medical Doctor and Information Technologies. 2025;(3):64-75. (In Russ.) https://doi.org/10.25881/18110193_2025_3_64