The Alzheimer’s Disease Neuroimaging Initiative (ADNI) contains extensive patient measurements (e.g., magnetic resonance imaging [MRI], biometrics, and RNA expression) from Alzheimer’s disease (AD) cases and controls, and machine learning algorithms have recently used these data to evaluate AD onset and progression. While using a variety of biomarkers is essential to AD research, highly correlated input features can significantly decrease machine learning model generalizability and performance. Additionally, redundant features unnecessarily increase the computational time and resources needed to train predictive models. Therefore, we used 49,288 biomarkers and 793,600 extracted MRI features to assess feature correlation within the ADNI dataset and to determine the extent to which this issue might impact large-scale analyses using these data. We found that 93.457% of biomarkers, 92.549% of the gene expression values, and 100% of MRI features were strongly correlated with at least one other feature in ADNI based on our Bonferroni-corrected α (p-value ≤ 1.40754 × 10⁻¹³). We provide a comprehensive mapping of all ADNI biomarkers to highly correlated features within the dataset. Additionally, we show that significant correlation within the ADNI dataset should be resolved before performing bulk data analyses, and we provide recommendations to address these issues. We anticipate that these recommendations and resources will help guide researchers utilizing the ADNI dataset to increase model performance and reduce the cost and complexity of their analyses.
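
The threshold quoted in the abstract can be checked with a short calculation. The following is a minimal sketch, not the authors' pipeline (their scripts are in the linked repository). It assumes a family-wise α of 0.05 and one test per unordered pair of the 842,888 features (49,288 + 793,600), which reproduces the reported value to the printed precision. The function names, the toy data, and the use of a Fisher z approximation for the Pearson p-value are illustrative assumptions.

```python
from itertools import combinations
from math import atanh, comb, erfc, sqrt


def bonferroni_alpha(n_features, family_alpha=0.05):
    """Per-test threshold when every unordered feature pair is tested once."""
    return family_alpha / comb(n_features, 2)


def pearson_r(x, y):
    """Plain Pearson correlation coefficient for two equal-length lists."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def approx_p_value(x, y):
    """Two-sided p-value via the Fisher z approximation (assumed test)."""
    r = pearson_r(x, y)
    r = max(min(r, 1 - 1e-12), -(1 - 1e-12))  # keep atanh finite
    z = atanh(r) * sqrt(len(x) - 3)
    # erfc keeps precision in the far tail, which matters for tiny thresholds.
    return erfc(abs(z) / sqrt(2))


def correlated_pairs(features, alpha):
    """Name pairs whose correlation p-value is at or below alpha."""
    return [
        (a, b)
        for a, b in combinations(features, 2)
        if approx_p_value(features[a], features[b]) <= alpha
    ]


# Feature counts quoted in the abstract.
N_FEATURES = 49_288 + 793_600
ALPHA = bonferroni_alpha(N_FEATURES)
print(f"{ALPHA:.5e}")

# Toy data (hypothetical values, not taken from ADNI).
toy = {
    "volume_a": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0],
    "volume_b": [2.1, 3.9, 6.2, 8.1, 9.8, 12.1, 14.0, 16.2],
    "noise": [5.0, 1.0, 4.0, 2.0, 6.0, 3.0, 8.0, 2.0],
}
flagged = correlated_pairs(toy, bonferroni_alpha(len(toy)))
print(flagged)
```

In this toy example only the two near-collinear columns are flagged. A list of flagged pairs like this is the kind of input a later pruning step could use to keep one representative feature from each correlated group before model training.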

Notes/Citation Information

Published in Genes, v. 12, issue 11, 1661.

© 2021 by the authors. Licensee MDPI, Basel, Switzerland.

This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Funding Information

This work was supported by the BrightFocus Foundation [A2020118F to Miller] and the National Institutes of Health [1P30AG072946-01 to the University of Kentucky Alzheimer’s Disease Research Center]. Data collection and sharing for this project was funded by the National Institute on Aging [R01AG046171, RF1AG051550 and 3U01AG024904-09S4 to the Alzheimer’s Disease Metabolomics Consortium].

Related Content

All custom scripts for processing and analyzing our data are available online at https://github.com/jmillerlab/ADNI_Correlation. The tables containing the number of correlated features for each feature in our constructed data set are available online for the Bonferroni corrected α at https://github.com/jmillerlab/ADNI_Correlation/blob/main/data/sig-freqs/bonferroni-sig-freqs.csv and for the maximally significant α at https://github.com/jmillerlab/ADNI_Correlation/blob/main/data/sig-freqs/maximum-sig-freqs.csv. The pickle file that maps each non-MRI feature to the non-MRI features it is correlated with is available online at https://drive.google.com/file/d/1uRuT6rhDVDeeBuRYPif3Ate3u1UVs-hO/view?usp=sharing.

Supplementary materials are available online at https://www.mdpi.com/article/10.3390/genes12111661/s1. They are also available for download as the additional file listed at the end of this record.

genes-12-01661-s001.zip (1240 kB)
Supplementary file