Medical Doctor and Information Technologies

Russian unstructured clinical texts processing and probabilistic classification of disease groups

https://doi.org/10.25881/18110193_2022_4_52

Abstract

Background. The development and implementation of medical information systems make it possible to simplify and automate many processes in medical organizations. At the same time, patient health data are accumulating steadily, which makes it possible to address many problems in disease prediction and diagnosis.

Aim. To study approaches to processing unstructured medical texts in Russian and to predicting certain groups of diseases using machine learning methods.

The initial data were an array of depersonalized records from medical organizations in the Orenburg region, 119,780 records in total. Three approaches to probabilistic prediction of disease groups from unstructured Russian-language texts of patient complaints were studied: a rule-based approach, a logistic regression-based approach, and an approach using BERT transformer models.

Results. Comparative analysis showed that the logistic regression-based approach combined with the TfidfVectorizer method achieved the best results in Precision (0.8296), F1-score (0.8269) and Matthews correlation coefficient (0.7695).
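
For readers unfamiliar with the three reported metrics, the following is a minimal sketch of how Precision, F1-score and the Matthews correlation coefficient can be computed for a multi-class problem such as disease-group prediction. It is an illustration only, not the authors' evaluation code: the abstract does not state which averaging scheme was used, so support-weighted averaging is an assumption here, and the function names and the labels `group_a`, `group_b`, `group_c` are hypothetical.

```python
from collections import Counter
from math import sqrt


def weighted_precision_f1(y_true, y_pred):
    """Support-weighted Precision and F1 over all classes.

    Assumption: weighted averaging; the abstract does not name the scheme.
    """
    labels = set(y_true) | set(y_pred)
    pairs = Counter(zip(y_true, y_pred))  # (true, predicted) -> count
    support = Counter(y_true)             # true samples per class
    predicted = Counter(y_pred)           # predicted samples per class
    n = len(y_true)

    precision = f1 = 0.0
    for k in labels:
        tp = pairs[(k, k)]
        p_k = tp / predicted[k] if predicted[k] else 0.0
        r_k = tp / support[k] if support[k] else 0.0
        f_k = 2 * p_k * r_k / (p_k + r_k) if (p_k + r_k) else 0.0
        weight = support[k] / n
        precision += weight * p_k
        f1 += weight * f_k
    return precision, f1


def multiclass_mcc(y_true, y_pred):
    """Multi-class generalization of the Matthews correlation coefficient."""
    n = len(y_true)
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    support = Counter(y_true)
    predicted = Counter(y_pred)
    labels = set(support) | set(predicted)

    numerator = correct * n - sum(support[k] * predicted[k] for k in labels)
    spread_pred = n * n - sum(predicted[k] ** 2 for k in labels)
    spread_true = n * n - sum(support[k] ** 2 for k in labels)
    denominator = sqrt(spread_pred * spread_true)
    # By convention the coefficient is 0 when the denominator vanishes.
    return numerator / denominator if denominator else 0.0


# Hypothetical toy labels, not data from the study.
y_true = ["group_a", "group_a", "group_b", "group_b", "group_c", "group_c"]
y_pred = ["group_a", "group_b", "group_b", "group_b", "group_c", "group_a"]

precision, f1 = weighted_precision_f1(y_true, y_pred)
mcc = multiclass_mcc(y_true, y_pred)
print(f"Precision={precision:.4f}  F1={f1:.4f}  MCC={mcc:.4f}")
```

Unlike Precision and F1, the Matthews coefficient uses the whole confusion matrix, which is one reason it is often reported alongside them for imbalanced class distributions.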

Conclusion. The traditional rule-based approach was the least effective of the studied methods (Precision = 0.7182), but it allowed the classifier's results to be interpreted through a visualization of the decision tree. The logistic regression-based approach (Precision = 0.8296) and the approach using BERT transformer models (Precision = 0.8164) showed the best classification results and can serve as a basis for building and developing medical decision support systems for use in medical practice.

About the Authors

L. V. Legashev
Orenburg State University
Russian Federation

PhD

Orenburg



A. E. Shukhman
Orenburg State University
Russian Federation

PhD

Orenburg



I. P. Bolodurina
Orenburg State University
Russian Federation

DSc, Prof.

Orenburg



L. S. Grishina
Orenburg State University
Russian Federation

Orenburg



A. Yu. Zhigalov
Orenburg State University
Russian Federation

Orenburg




For citations:


Legashev L.V., Shukhman A.E., Bolodurina I.P., Grishina L.S., Zhigalov A.Yu. Russian unstructured clinical texts processing and probabilistic classification of disease groups. Medical Doctor and Information Technologies. 2022;(4):52-63. (In Russ.) https://doi.org/10.25881/18110193_2022_4_52


This work is licensed under a Creative Commons Attribution 4.0 License.


ISSN 1811-0193 (Print)
ISSN 2413-5208 (Online)