How You Split Matters: Data Leakage and Subject Characteristics Studies in Longitudinal Brain MRI Analysis

Published in MICCAI FAIMI, 2023

Dewinda Julianensi Rumala

MICCAI FAIMI 2023

This work has been honored with the Best Poster Presentation Award at MICCAI FAIMI 2023 workshop.

Introduction

Neglect of Longitudinal Data: In medical image analysis, the significance of longitudinal data is often overlooked. Longitudinal data, with its repeated observations over time, holds a wealth of information critical for understanding changes and patterns over the progression of diseases, offering insights into treatment efficacy.
Data Handling in Longitudinal Analysis: Proper handling of longitudinal data is a crucial aspect that often receives inadequate attention. Inconsistent or improper data handling procedures can introduce biases and compromise the robustness and reliability of the results, even in 3D-based medical image analysis.
Deep Learning Performance Challenges: Deep learning models, despite their high performance, can be misleading in the medical imaging domain. Biases, such as data leakage, can give overly optimistic results during evaluation, potentially leading to incorrect conclusions. Understanding and mitigating these challenges are vital for trustworthy AI-powered medical diagnoses.

Methods

Deep Learning model: 3D CNN (DenseNet121)
Datasets: 3D Longitudinal T1-weighted and T2-weighted MRI from ADNI (Dataset statistics see Table S1)
Data Processing: CAT12 with VBM pipeline
Purpose: Three-way classification (CN, MCI, and AD)

Table S1. Dataset Statistics. The ratio of females to males and the average age are based on the number of images rather than the number of subjects.

Collection	Data Group	No of Subjects	Female / Male	Age	No of Scans
Collection	Data Group	No of Subjects	Female / Male	Age	Before Augmentation	After Augmentation
5-Fold data	CN	41	85/65	76.68 ± 4.15	150	300
	MCI	45	50/100	74.19 ± 8.57	150	300
	AD	25	30/20	74.22 ± 8.90	50	300
Hold-out data	CN	8	17/13	74.81 ± 3.13	30	-
	MCI	11	8/22	75.42 ± 7.26	30	-
	AD	7	16/1	78.26 ± 6.52	30	-

Evaluation Scheme

Data Splitting Strategies during CV:

Subject-wise Splitting
Record-wise Splitting
Late Splitting

Illustration showing different data split strategies for longitudinal brain MRI: subject-wise, record-wise, and late splitting. — A toy example of different data split strategies for longitudinal brain MRI. (a) Subject-wise splitting groups all image scans based on the subjects into k-folds. (b) Record-wise splitting groups image scans based on different visit times into k-folds. (c) Late splitting groups image scans based on transformation technique into k-folds.

Training Setups

Table S2. Overview of the parameters used across all experiments: the learning rate was reduced using the ReduceLROnPlateau scheduler from TensorFlow by a factor of 0.1 when validation loss did not decrease after 10 epochs. Adam was used as the optimizer.

Parameter	Value
Learning rate	0.0001
Epsilon	0.0001
Beta 1	0.9
Beta 2	0.99
Epoch	100
Batch size	24

Results and Analysis

Data Splitting Strategy Impact:
- Data splitting strategy significantly influences model performance (P=0.0389).
- Record-wise split excels during CV, closely followed by late split.
- Surprisingly, subject-wise split performs poorest during CV but generalizes best to hold-out data.
MRI Sequence Influence:
- Choice between T1 and T2 MRI sequences does not significantly affect overall classification performance (P=0.7921).
- Both sequences (T1 and T2) prove suitable for the three-way classification task of CN, MCI, and AD.
Insights from GradCAM Visualization:
- Presence of shortcut learning found frem certain data splitting strategies: record-wise and late splits.

GradCAM Visualization

Gif Samples

Data Group	Subject-wise Split	Record-wise Split	Late Split
CN	[O] Correctly classified	[O] Correctly classified	[O] Correctly classified
CN	[X] Misclassified	[X] Misclassified	[X] Misclassified
AD	[O] Correctly classified	[O] Correctly classified	[O] Correctly classified
AD	[X] Misclassified	[X] Misclassified	[X] Misclassified

Discussion

How You Split Matters
- The choice of data splitting strategy during cross-validation significantly influences the performance and robustness of deep learning models in longitudinal medical image analysis.
Data Leakage and Identity Confounding
- Improper data splitting can lead to data leakage, causing identity confounding within the models. This compromises their generalization and can result in misleadingly optimistic performance assessments.
Shortcut Learning Revealed by GradCAM
- Visualization using GradCAM highlights potential shortcut learning in models, particularly from record-wise and late-wise splitting strategies. This suggests that models may learn unintended patterns during cross-validation.
Validating Robustness with Subject-Wise Split
- This study validates previous findings suggesting a promising approach—subject-wise split demonstrates relative robustness and less vulnerability to data leakage compared to record-wise and late splitting strategies. However, challenges like underperformance indicate the need for further investigation, potentially with a larger and more diverse dataset.
Future Directions
- Promoting Subject-wise split: future research should strongly consider subject-wise split for more reliable model evaluation and development.
- Investigating data variance and sensitive attributes: Further research should delve into the correlation between data splitting strategies and data variance, particularly exploring the influence of sensitive attributes such as age and sex. Understanding these relationships can lead to more nuanced and fair models.

Limitations

Generalization Challenges:
- Limited generalization to hold-out data likely influenced by data variance in sensitive attributes (e.g., age and gender imbalance).
Robustness of Subject-Wise Split:
- Despite a significant drop in performance, subject-wise splitting demonstrates relative robustness. Addressing potential underfitting by incorporating a more diverse subject pool could enhance this approach’s performance. Subject-wise splitting strategy shows more robustness but still suffers from performance drops, possibly due to underfitting, which can be mitigated with a larger and more diverse dataset.

References

Chaibub Neto, E., Pratap, A., Perumal, T.M., Tummalacherla, M., Snyder, P., Bot, B.M., Trister, A.D., Friend, S.H., Mangravite, L., Omberg, L.: Detecting the impact of subject characteristics on machine learning-based diagnostic applications. npj Digital Medicine 2(1), 99 (Oct 2019). https://doi.org/10.1038/s41746-019-0178-x
Yagis, E., Atnafu, S.W., García Seco de Herrera, A., Marzi, C., Scheda, R., Giannelli, M., Tessa, C., Citi, L., Diciotti, S.: Effect of data leakage in brain MRI classification using 2D convolutional neural networks. Scientific Reports 11(1), 22544 (Nov 2021). https://doi.org/10.1038/s41598-021-01681-w

Acknowledgement

Special thanks to Directorate General of Higher Education and Research Technology, Indonesia, Prof. I Ketut Eddy Purnama (Sepuluh Nopember Institute of Technology, Indonesia) and Prof. Tae-Seong Kim (Kyung Hee University, South Korea).

Share on

Bluesky Facebook LinkedIn X (formerly Twitter)

Dr. Dewinda J Rumala