id_844. IMPROVING ENSEMBLE CLASSIFICATION OF STROKE USING A FEATURE SELECTION PIPELINE
Agata Leszczak1,2, Jan Argasiński2, Luca Gherardini3, Cemal Koba2
1 Sano Centre for Computational Personalized Medicine, Computational Neuroscience Team, Czarnowiejska 36, Kraków, Poland
2 Jagiellonian University, Faculty of Biophysics, Biochemistry and Biotechnology, Gronostajowa 7, Kraków, Poland
3 Sano Centre for Computational Personalized Medicine, Computational Intelligence Team, Czarnowiejska 36, Kraków, Poland
INTRODUCTION: Stroke is a neurological injury caused by disrupted cerebral blood flow. Neuroimaging techniques such as fMRI are fundamental for assessing post-stroke brain changes. They yield Functional Connectivity (FC) matrices that capture the brain's high-dimensional functional organization; however, this complexity challenges clinical interpretation and makes it difficult to fit Machine Learning (ML) models, especially when few samples are available.
AIM(S): We aim to provide a rigorous feature selection pipeline to enhance classification performance by preserving only the most relevant biomarkers.
METHOD(S): From fMRI data, we extracted whole-brain FC matrices (using the Schaefer parcellation with 400 regions) for 31 controls and 154 stroke subjects. To address the high dimensionality of the FC matrices, we developed a dual-stage feature selection pipeline comprising Lasso regression and Variance Inflation Factor (VIF) filtering, which preserves informative and mutually independent features. An ensemble of ML methods was then used to classify subjects in a case study on stroke. To evaluate the effect of the filtering, we compared the balanced accuracy achieved by the ensemble in predicting the presence of stroke using the full and the filtered feature sets.
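The two selection stages and the ensemble evaluation described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the data are synthetic, the stage order (Lasso first, VIF second), the Lasso penalty `alpha`, the VIF threshold of 5, the train/test split, and the three ensemble members are all hypothetical choices that the abstract does not specify.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import Lasso, LogisticRegression
from sklearn.metrics import balanced_accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC


def vectorize_fc(fc):
    """Keep the unique off-diagonal entries of a symmetric FC matrix."""
    return fc[np.triu_indices(fc.shape[0], k=1)]


def lasso_select(X, y, alpha=0.05):
    """Stage 1: indices of features with a non-zero Lasso coefficient."""
    coef = Lasso(alpha=alpha, max_iter=10_000).fit(X, y).coef_
    return np.flatnonzero(coef)


def vif(X):
    """Variance inflation factor per column: 1 / (1 - R^2)."""
    Xc = X - X.mean(axis=0)
    scores = np.empty(X.shape[1])
    for j in range(X.shape[1]):
        target = Xc[:, j]
        others = np.delete(Xc, j, axis=1)
        beta, *_ = np.linalg.lstsq(others, target, rcond=None)
        resid = target - others @ beta
        r2 = 1.0 - (resid @ resid) / (target @ target)
        scores[j] = 1.0 / max(1.0 - r2, 1e-12)
    return scores


def vif_filter(X, threshold=5.0):
    """Stage 2: drop the highest-VIF column until all are <= threshold."""
    keep = list(range(X.shape[1]))
    while len(keep) > 1:
        scores = vif(X[:, keep])
        worst = int(np.argmax(scores))
        if scores[worst] <= threshold:
            break
        keep.pop(worst)
    return np.array(keep, dtype=int)


# A 400-region matrix has 400 * 399 / 2 unique connections.
print(vectorize_fc(np.zeros((400, 400))).shape)

# Synthetic stand-in for the cohort (31 controls, 154 stroke); NOT real data.
rng = np.random.default_rng(0)
y = np.array([0] * 31 + [1] * 154)
X = rng.normal(size=(185, 300))
X[:, :5] += 1.0 * y[:, None]                       # five informative features
X[:, 5] = X[:, 0] + 0.05 * rng.normal(size=185)    # near copy of feature 0

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)
scaler = StandardScaler().fit(X_tr)
X_tr, X_te = scaler.transform(X_tr), scaler.transform(X_te)

# Selection uses the training split only, to avoid leaking test information.
stage1 = lasso_select(X_tr, y_tr)
stage2 = stage1[vif_filter(X_tr[:, stage1])]

# Hypothetical ensemble members; the abstract does not list the actual ones.
ensemble = VotingClassifier(
    estimators=[
        ("logreg", LogisticRegression(max_iter=1000, class_weight="balanced")),
        ("svm", SVC(class_weight="balanced")),
        ("forest", RandomForestClassifier(
            n_estimators=200, class_weight="balanced", random_state=0)),
    ],
    voting="hard",
)

for name, cols in [("full", np.arange(X.shape[1])), ("filtered", stage2)]:
    ensemble.fit(X_tr[:, cols], y_tr)
    score = balanced_accuracy_score(y_te, ensemble.predict(X_te[:, cols]))
    print(f"{name}: {len(cols)} features, balanced accuracy {score:.3f}")
```

Fitting the scaler and both selection stages on the training split alone is a design choice of this sketch; selecting features on the full dataset before splitting would tend to inflate the reported score.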
RESULTS: Our feature selection pipeline substantially improved classification performance compared to the unfiltered feature set. The best ensemble on the original dataset achieved a balanced accuracy of 59.95%, whereas the best ensemble on the filtered dataset achieved 74.6%, an absolute gain of 14.65 percentage points and a relative increase of 24.44%. The proposed pipeline thus proved effective in increasing the balanced accuracy of the tested ensembles.
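The arithmetic behind the reported figures can be checked with a short sketch. It also illustrates, as an assumption about the motivation rather than a statement from the abstract, why balanced accuracy suits a cohort of 31 controls and 154 stroke subjects: a hypothetical classifier that always predicts "stroke" scores high on plain accuracy but only 50% on balanced accuracy. The helper `balanced_accuracy` and the always-stroke baseline are illustrative and not taken from the study.

```python
def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall, so each class counts equally."""
    recalls = []
    for label in sorted(set(y_true)):
        idx = [i for i, t in enumerate(y_true) if t == label]
        recalls.append(sum(y_pred[i] == label for i in idx) / len(idx))
    return sum(recalls) / len(recalls)


# Cohort sizes from the abstract: 31 controls (0) and 154 stroke subjects (1).
y_true = [0] * 31 + [1] * 154

# Hypothetical baseline that labels every subject as stroke.
always_stroke = [1] * len(y_true)
plain_accuracy = sum(t == p for t, p in zip(y_true, always_stroke)) / len(y_true)
baseline = balanced_accuracy(y_true, always_stroke)
print(f"plain accuracy {plain_accuracy:.3f}, balanced accuracy {baseline:.3f}")

# Reported balanced accuracies (percent) before and after feature selection.
before, after = 59.95, 74.60
absolute_gain = after - before                 # percentage points
relative_gain = 100 * absolute_gain / before   # percent of the original score
print(f"absolute gain {absolute_gain:.2f} pp, relative gain {relative_gain:.2f}%")
```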
CONCLUSIONS: Future work will focus on testing the strength of the feature selection pipeline by adding multimodal data to the models, exploring its utility in predicting long-term clinical outcomes, and validating these biomarkers in larger and more diverse cohorts.
FINANCIAL SUPPORT: This project was funded by the National Science Center, Poland, grant no. 2024/55/D/NZ5/02998. This project has received funding from the European Union’s Horizon 2020 research and innovation programme under grant agreement No. 857533 and from the International Research Agendas Programme of the Foundation for Polish Science, No. MAB PLUS/2019/13. The project was created within the project of the Minister of Science and Higher Education “Support for the activity of Centers of Excellence established in Poland under Horizon 2020” on the basis of contract number MEiN/2023/DIR/3796. We acknowledge the Polish high-performance computing infrastructure PLGrid (HPC Center: ACK Cyfronet AGH) for providing computer facilities and support within computational grant no. PLG/2025/01828.