Medical Doctor and Information Technologies

Feature selection for medical prognostic models

https://doi.org/10.25881/18110193_2022_3_54

Abstract

Building predictive models in medicine requires balancing simplicity against effectiveness. The predictors included in a model determine its quality and practical relevance, but selecting them is not always easy. The aim of the study is to compare different predictor selection methods for building medical prognostic models.
Methods. We compare simple methods, such as correlation, predictor filtering based on basic statistics, and Hosmer-Lemeshow univariate analysis, with more complex methods often used in machine learning, such as recursive feature elimination, LASSO regression, and classification trees. The predictive models were built using binary multiple logistic regression. Statistical analysis was carried out in R (version 3.4.2).
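The simplest of the approaches listed above, correlation-based filtering, can be sketched in a few lines. This is only an illustration in Python, not the authors' R code: the function names `pearson` and `correlation_filter`, the threshold of 0.3, and the toy data are all assumptions made for the example.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    n = len(x)
    mx = sum(x) / n
    my = sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        # A constant variable carries no information; treat it as uncorrelated.
        return 0.0
    return cov / math.sqrt(vx * vy)

def correlation_filter(features, outcome, threshold=0.3):
    """Keep predictors whose absolute correlation with the outcome
    reaches the (hypothetical) threshold."""
    return [
        name
        for name, values in features.items()
        if abs(pearson(values, outcome)) >= threshold
    ]

# Hypothetical toy data: a binary outcome and three candidate predictors.
outcome = [0, 0, 0, 1, 1, 1, 0, 1]
features = {
    "age": [34, 51, 47, 62, 70, 58, 29, 66],
    "noise": [1, 2, 1, 2, 2, 1, 2, 1],
    "constant": [5, 5, 5, 5, 5, 5, 5, 5],
}

selected = correlation_filter(features, outcome)
print(selected)
```

A filter of this kind scores each predictor in isolation, so it cannot account for interactions or multicollinearity between predictors.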
Results. The most accurate predictive models (those with the lowest AIC values) were built with predictors selected by the LASSO, random forest, and stepwise regression methods. The Hosmer-Lemeshow method and basic statistical methods were the least effective.
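For reference, the AIC of a model with k estimated parameters and maximized likelihood L is AIC = 2k - 2 ln L, and lower values are better. The sketch below shows how two candidate logistic models could be ranked by this criterion. It is a Python illustration under stated assumptions, not the study's R workflow: the model names, predicted probabilities, and parameter counts are hypothetical, and the probabilities are supplied directly rather than fitted.

```python
import math

def log_likelihood(y_true, p_pred):
    """Bernoulli log-likelihood of binary outcomes given predicted probabilities."""
    eps = 1e-12  # clip probabilities to avoid log(0)
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1 - eps)
        total += y * math.log(p) + (1 - y) * math.log(1 - p)
    return total

def aic(y_true, p_pred, n_params):
    """Akaike information criterion: 2k - 2 ln L."""
    return 2 * n_params - 2 * log_likelihood(y_true, p_pred)

def best_by_aic(candidates, y_true):
    """Return the name of the candidate with the lowest AIC and all scores.

    `candidates` maps a model name to (predicted probabilities, parameter count).
    """
    scores = {name: aic(y_true, p, k) for name, (p, k) in candidates.items()}
    return min(scores, key=scores.get), scores

# Hypothetical outcomes and predicted probabilities from two candidate models.
y = [0, 0, 1, 1]
candidates = {
    "model_a": ([0.1, 0.2, 0.8, 0.9], 3),  # intercept + 2 predictors
    "model_b": ([0.4, 0.4, 0.6, 0.6], 2),  # intercept + 1 predictor
}

best, scores = best_by_aic(candidates, y)
print(best, scores)
```

Because the penalty term 2k grows with the number of parameters, a model with more predictors is preferred only if it improves the log-likelihood enough to offset that penalty.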
Conclusion. Predictor selection methods often substantially reduce the number of predictors by filtering out non-informative ones, which improves the quality of the predictive model.

About the Authors

A. S. Luchinin
The Federal State-Financed Scientific Institution Kirov Research Institute of Hematology and Blood Transfusion under the Federal Medical Biological Agency
Russian Federation

PhD

Kirov



A. V. Lyanguzov
The Federal State-Financed Scientific Institution Kirov Research Institute of Hematology and Blood Transfusion under the Federal Medical Biological Agency
Russian Federation

PhD

Kirov




Review

For citations:


Luchinin A.S., Lyanguzov A.V. Feature selection for medical prognostic models. Medical Doctor and Information Technologies. 2022;(3):54-67. (In Russ.) https://doi.org/10.25881/18110193_2022_3_54

Creative Commons License
This work is licensed under a Creative Commons Attribution 4.0 License.


ISSN 1811-0193 (Print)
ISSN 2413-5208 (Online)