Bioinformatics and Machine Learning-Based Screening of Key Genes in Alzheimer's Disease

Meng Ting Hou, Juan Bao, Shu Xiong Zheng, Si Tong Li, Xi Yu Li
Copyright: © 2024 | Pages: 17
DOI: 10.4018/IJWSR.349590

Abstract

Objective: To provide theoretical support for the study of Alzheimer's disease (AD) pathogenesis and therapeutic targets. Methods: AD data were downloaded from the GEO database, and differential expression analysis was performed to obtain differentially expressed genes (DEGs). GO and KEGG pathway enrichment analyses followed, machine learning models were built to screen key genes, and a risk prediction model and transcription factor predictions were derived from the key genes. In addition, consensus clustering analysis was performed on the AD samples. Results: Seven key genes were screened, and the risk prediction model built on these seven genes had an AUC of 0.877. Cluster analysis divided the AD samples into two subtypes, which differed significantly in immune infiltration. Conclusion: This study provides new perspectives and potential therapeutic targets for exploring the mechanisms by which mitochondrial autophagy affects AD, and suggests directions for individualised treatment of AD.

Methods

Data Collection and Preprocessing

The GEO database (https://www.ncbi.nlm.nih.gov/geo/) is a public repository for storing and sharing gene expression data. It was created in 2000 and is maintained by the National Center for Biotechnology Information (NCBI), and it is the most comprehensive public gene expression database available today (Barrett et al., 2009). The expression matrices and platform information of the AD-related datasets GSE5281 (Liang et al., 2007), GSE9770 (Readhead et al., 2018), and GSE28146 (Blalock et al., 2011) were obtained from GEO using the “GEOquery” package in R (version 4.3.1, run in RStudio, where all subsequent analyses were performed), and the expression data were extracted and normalized. Probe IDs were then mapped to genes using the “hgu133plus2.db” annotation package. Finally, the three datasets were merged and the batch effect was removed.
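
The ID conversion step can be illustrated with a minimal Python sketch (the study itself used R). It assumes that several probes may map to the same gene and that such probes are averaged; the source does not state how duplicate probes were handled, so the averaging rule, the function name `collapse_probes`, and all probe IDs, gene names, and values below are hypothetical.

```python
from collections import defaultdict
from statistics import mean


def collapse_probes(expr_by_probe, probe_to_gene):
    """Average probe-level rows that map to the same gene.

    expr_by_probe: dict of probe ID -> list of expression values (one per sample)
    probe_to_gene: dict of probe ID -> gene name (unannotated probes are absent)
    """
    grouped = defaultdict(list)
    for probe, values in expr_by_probe.items():
        gene = probe_to_gene.get(probe)
        if gene is None:
            continue  # assumption: probes without annotation are dropped
        grouped[gene].append(values)
    # zip(*rows) walks sample by sample across all probes of one gene
    return {gene: [mean(col) for col in zip(*rows)] for gene, rows in grouped.items()}


# Hypothetical probe IDs, gene names, and values, for illustration only
expr = {
    "probe_1": [2.0, 4.0],
    "probe_2": [4.0, 6.0],
    "probe_3": [1.0, 1.0],
    "probe_4": [9.0, 9.0],  # has no annotation below, so it is dropped
}
mapping = {"probe_1": "GENE_A", "probe_2": "GENE_A", "probe_3": "GENE_B"}

gene_expr = collapse_probes(expr, mapping)
print(gene_expr)
```

Averaging is only one possible rule; keeping the probe with the highest mean expression is another common choice.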

GeneCards (https://www.genecards.org/) is an online database that integrates human genetic information from multiple authoritative data sources. From this database, the authors searched for “mitophagy” and screened and downloaded genes with scores >1.

Acquisition of DEGs and Intersecting Genes

Gene expression values were analyzed with the “limma” package in R to identify differentially expressed genes (DEGs). The inclusion criteria were P < 0.05 and |log2FC| > 1: genes with log2FC > 1 were classed as upregulated DEGs and genes with log2FC < -1 as downregulated DEGs. The DEGs were intersected with the mitophagy-related genes (score > 1) to obtain the intersecting genes of interest. Volcano plots and heatmaps were drawn for the DEGs, and Venn diagrams and heatmaps for the intersecting genes.
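
The thresholds above can be sketched in a few lines of Python (the study itself used limma in R). The sketch assumes per-gene log2FC and P values are already available; the function name `classify_deg` and all gene names, statistics, and scores are hypothetical.

```python
def classify_deg(log2fc, p_value, fc_cutoff=1.0, p_cutoff=0.05):
    """Label a gene 'up', 'down', or 'not_significant' using the stated cutoffs."""
    if p_value >= p_cutoff:
        return "not_significant"
    if log2fc > fc_cutoff:
        return "up"
    if log2fc < -fc_cutoff:
        return "down"
    return "not_significant"


# Hypothetical differential expression results: gene -> (log2FC, P value)
results = {
    "GENE_A": (1.8, 0.001),
    "GENE_B": (-1.4, 0.020),
    "GENE_C": (0.3, 0.004),   # significant P value but small fold change
    "GENE_D": (2.1, 0.300),   # large fold change but not significant
}
# Hypothetical GeneCards scores for the "mitophagy" search
mitophagy_scores = {"GENE_A": 3.2, "GENE_B": 0.6, "GENE_C": 5.0, "GENE_E": 2.4}

labels = {gene: classify_deg(fc, p) for gene, (fc, p) in results.items()}
degs = {gene: label for gene, label in labels.items() if label != "not_significant"}

# Keep mitophagy-related genes with a score > 1, then intersect with the DEGs
mitophagy_genes = {gene for gene, score in mitophagy_scores.items() if score > 1}
intersecting = sorted(set(degs) & mitophagy_genes)
print(degs, intersecting)
```

Here GENE_B is a DEG but is excluded from the intersection because its score does not exceed 1.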

GO and KEGG Enrichment Analysis

The DEGs were analyzed for gene ontology (GO) and Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway enrichment using the “clusterProfiler” package in R. A P value < 0.05 was considered significant, and the 15 most significant terms were plotted as bubble and bar graphs. GO enrichment analysis was then run on the intersecting genes with the same threshold (P < 0.05), and the results were displayed as a chord diagram.
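
Over-representation analysis of this kind is commonly based on a hypergeometric test. The Python sketch below shows that calculation for a single term as an illustration of the idea, not as the package's implementation; the function name `enrichment_p_value` and the counts in the example are hypothetical.

```python
from math import comb


def enrichment_p_value(k, n, K, N):
    """Upper-tail hypergeometric P value, P(X >= k).

    N: genes in the background
    K: background genes annotated to the term
    n: genes in the query list (for example, the DEGs)
    k: query genes annotated to the term
    """
    upper = min(n, K)
    favourable = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return favourable / comb(N, n)


# Hypothetical counts: 20 background genes, 5 annotated to the term,
# 4 genes in the query list, 2 of which carry the annotation
p = enrichment_p_value(k=2, n=4, K=5, N=20)
print(round(p, 4))
```

In practice the P values of all tested terms are usually adjusted for multiple testing before a threshold is applied.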

Screening for Key Genes

To screen key genes, the intersecting genes were scored for importance using four methods: Lasso regression, XGBoost, Boruta, and random forest (RF). For Lasso, the expression data of the intersecting genes and the classification labels were used for feature selection and regression modeling, and the model results were saved and visualized. The XGBoost model was built on the intersecting genes with a maximum depth of 6, a learning rate of 0.5, and 25 training rounds; the importance scores were calculated and visualized using the functions in the “xgboost” package. The Boruta algorithm was run on the intersecting genes with a maximum of 500 iterations, and genes that were irrelevant or uncertain with respect to the classification results were eliminated to reduce the feature dimension and improve the model's interpretability and predictive performance. A random forest model with 1000 decision trees was built, the importance scores were calculated, the 15 most important genes were selected, and the training error plot and feature importance plot were drawn. Finally, the results of the four methods were intersected to obtain the key genes, and a Venn diagram was plotted.
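
The final intersection step can be sketched in Python as follows. The sketch assumes each method has already produced its own list of selected genes; the helper `top_n`, the gene names, and the importance scores are hypothetical, and the example keeps the top 4 genes rather than the top 15 only to stay small.

```python
def top_n(importance, n):
    """Return the n genes with the highest importance scores."""
    ranked = sorted(importance.items(), key=lambda kv: kv[1], reverse=True)
    return {gene for gene, _ in ranked[:n]}


# Hypothetical random forest importance scores
rf_importance = {
    "GENE_A": 0.30, "GENE_B": 0.25, "GENE_C": 0.20,
    "GENE_E": 0.15, "GENE_F": 0.10,
}

# Hypothetical gene sets selected by each of the four methods
selected = {
    "lasso": {"GENE_A", "GENE_B", "GENE_C", "GENE_D"},
    "xgboost": {"GENE_A", "GENE_B", "GENE_D", "GENE_E"},
    "boruta": {"GENE_A", "GENE_B", "GENE_F"},
    "random_forest": top_n(rf_importance, 4),
}

# A key gene must be selected by all four methods
key_genes = sorted(set.intersection(*selected.values()))
print(key_genes)
```

Requiring agreement across all four methods is a strict rule: a gene missed by any single method is dropped, which trades sensitivity for robustness.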
