Ferritinophagy-related prognostic genes UBE2Q1, NEDD4L, and TCP11L2 for prognosis prediction in sepsis - Scientific Reports

Sepsis-related datasets GSE65682 (platform: GPL13667) and GSE167363 (platform: GPL24676) were obtained from the Gene Expression Omnibus (GEO, http://www.ncbi.nlm.nih.gov/geo/) database. GSE165682 encompasses 760 whole blood samples from sepsis patients, with survival data available for 479 of these patients, along with 42 blood samples from healthy individuals. The GSE167363 dataset consisted of peripheral blood mononuclear cell (PBMCs) samples from 10 patients with sepsis and two healthy individuals. Twenty ferritinophagy-related prognostic genes (FRGs) were cited in the literature, including USP24, PCBP1, ATG16L1, ATG7, SNCA, FBXW7, TRIM7, TNF, ATG5, WDR45, CYB561A3, FTH1, NCOA4, HERC2, ALOX15. BECN1, ELAVL1, ZFP36, BCAT2, and FTL. The expression quantitative trait loci (eQTL) of exposure factors and GWAS data for sepsis (ieu-b-4980) were obtained from the Integrative Epidemiology Unit (IEU) Open GWAS database (https://gwas.mrcieu.ac.uk/). The ieu-b-4980 dataset comprised 12,243,539 single nucleotide polymorphisms (SNPs) from 486,484 Europeans. The above GWAS IDs were selected following the principle of the most recent year with the highest number of SNPs, and the data were downloaded on 2 April 2024.

Initially, the R package 'limma' (version 3.54.0) was used for differential expression analysis to select DEGs between sepsis patients and controls in the GSE65682 dataset (|Logfold-change (FC)|> 0.5, p < 0.05). All DEGs and the top 10 genes upregulated in the sepsis group are shown using a volcano plot and heat map, respectively. The result was visualized using 'ggplot2' (version 3.4.4).

The FRGs scores of the sepsis group and control group in the GSE65682 dataset were evaluated based on the FRGs gene set using the ssGSEA algorithm integrated in the R package "GSVA". Sepsis samples in the GSE65682 dataset were split into high- and low-scoring groups in compliance with the median FRGs score. Kaplan-Meier (K-M) analysis was performed using the R package 'survminer' (version 0.4.9) to assess the survival discrepancies of septic patients in the two scoring groups (p < 0.05). Furthermore, key module genes linked to FRGs score were identified via WGCNA, which was supported by R package 'WGCNA' (version 1.7.1). Concretely, the median absolute deviation (MAD) of genes in GSE65682 dataset was discovered and removed, and genes with MAD values in the bottom 50% as well as ineligible samples were removed utilizing the 'goodSamplesGenes' function. Subsequently, the sepsis samples were subjected to cluster analysis using the 'hclust' function to ensure the accuracy of the WGCNA by removing outlier samples. To ensure that the gene interactions closely followed a scale-free distribution, the ideal soft threshold was determined when R exceeded 0.85, and the mean connectivity tended to be 0. Dissimilarity coefficients were calculated according to the adjacency and similarity among genes, and a systematic clustering tree was generated. Following the standards of the hybrid dynamic tree cutting algorithm, a minimum of 100 genes per module was established, and a module merging threshold of 0.3 was applied to identify the clustering modules. Using FRGs scores as traits, Pearson's correlation analysis was performed between the module eigengene (ME) scores of the clustering modules and the FRGs scores (|cor|> 0.3, p < 0.05). The clustering modules with the highest positive and negative correlations with FRGs scores were identified as the key modules. Genes included in key modules are called key module genes. The intersecting DEGs and key module genes served as the candidate genes.

To investigate the functions of the candidate genes, enrichment analyses were conducted using Gene Ontology (GO) (p < 0.05) and the Kyoto Encyclopedia of Genes and Genome (KEGG) (p < 0.05). These analyses were performed using the 'clusterProfiler' R package (version 4.7.1.003). In addition, to explore the interactions between candidate genes at the protein level, a protein-protein interaction (PPI) network was synthesized for the candidate genes using the Search Tool for the Retrieval of Interacting Genes (STRING, https://string-db.org) website (confidence = 0.4).

A univariate MR (UVMR) study was performed to identify key genes by evaluating the causality between candidate genes (exposure factor) and sepsis (outcome). In this study, SNPs were used as the IVs. Firstly, the 'extract_instruments' and 'extract_outcome_data' functions from the R package 'TwoSampleMR' (version 0.5.7) were utilized to determine exposure factors and outcomes, respectively. SNPs related to exposure factors but unrelated to outcomes were ascertained (p < 5 × 10, proxies = TRUE, rsq = 0.8). Simultaneously, SNPs with linkage disequilibrium (LD) were excluded (r = 0.001, kb = 100). Data on exposure factors and outcomes were merged via the ' harmonize _data()' function to generate SNPs for subsequent screening. Additionally, F-statistics of the SNPs were calculated, and weak SNPs with F less than 10 were eliminated. If the exposure factor had fewer than three SNPs, it was excluded. The F-statistic was calculated as the following equation:

The N was sample size for the GWAS, the K was the number of IVs, and R indicated the extent to which the IV explained the exposure factors.

The effect alleles and effect sizes were standardized using the ' harmonize _data' function. The 'mr' function was then invoked to perform UVMR analysis, employing MR Egger, Weighted median, Simple mode, Inverse variance weighted (IVW) method and Weighted mode. The IVW method is the most commonly used. The weighting formula for IVW was calculated as follows:

For the ith IV, βi denoted the estimate of the effect of the IV on the exposure factor, γi denoted the estimate of the effect of the IV on the outcome, and se denoted the standard error. If there were fewer missing values, the observations containing the missing values were deleted directly, and the na.omit function could be used to remove the missing values. If there were more missing values, multiple interpolation would be used to fill in the missing values, with mice package.

Candidate genes were considered to have significant causality with sepsis if the p-value from the IVW method was less than 0.05. Genes with hazard ratios (HRs) greater than 1 were considered risk factors for sepsis, whereas those with HRs less than 1 were considered protective factors. Scatter plots, forest plots, and funnel plots were created to present the UVMR findings.

Sensitivity analyses including heterogeneity, horizontal pleiotropy, and leave-one-out (LOO) tests were performed to assess the dependability of the UVMR study. A heterogeneity test was conducted using the 'mr_heterogeneity' function to determine whether heterogeneity existed among samples. A Q_ p value greater than 0.05 indicated no heterogeneity. To evaluate potential confounding factors in UVMR study, horizontal pleiotropy was assessed with the 'mr_pleiotropy_test' function, supplemented by the 'mr_presso' function. A P value greater than 0.05, suggested no horizontal pleiotropy. The 'mr_leaveoneout' function was employed to perform a LOO test, which examined the impact of each SNP on the outcome. Significant changes in the results upon removal of a variable indicated that the SNP had a substantial impact on the outcome. A Steiger test was employed to confirm the directionality of UVMR analysis. Only candidate genes with a correct causal direction marked as TRUE and a significance level of p < 0.05, were deemed as key genes.

First, in the GSE65682 dataset, univariate Cox regression analysis was used to identify prognostic genes associated with survival in patients with sepsis from key genes (HR ≠ 1, p < 0.01). Prognostic genes that passed the proportional hazards (PH) assumption test (p > 0.05) were selected for least absolute shrinkage and selection operator regression (LASSO) regression analysis using the R package 'glmnet' (version. 4.1-4). Samples with complete survival information in the GSE65682 dataset were randomly zoned as a training set and a validation set at cut of 7:3. In the training set, a risk model was developed based on the expression levels of prognostic genes and the overall survival (OS) information of patients with sepsis. A risk score equation was determined based on the LASSO coefficient (coef) and prognostic gene expression (expr) as follows: . Subsequently, in the training set, patients with sepsis were divided into high- and low-risk groups, according to the median risk score. Ulteriorly, in training and validation sets, K-M curves were created via the R package 'survminer' to compare survival discrepancies between two risk groups (p < 0.05). Moreover, the R package 'survivalROC' (version. 1.0.3) was used to draw receiver operating characteristic (ROC) curves to evaluate the effectiveness of the risk model by outputting the area under the curve (AUC) values for 9, 18, and 27-day. An expression heatmap of prognostic genes in the two risk groups was generated.

A nomogram model was established via the R package' rms' (version 6.5-0) in compliance with the expression of each prognostic gene. The predictive precision of the nomogram was appraised using calibration curve, decision, and ROC curves. To investigate the relation between risk score and clinical characteristics, in the training set, age, gender, diabetes, endotype class, ICUA, and pneumonia diagnoses were categorized into disparate clinical subgroups. The Wilcoxon rank-sum test was used to distinguish discrepancies in risk scores among these subgroups (p < 0.05). Stacked bar charts were generated to illustrate the proportion of patients in the two risk groups across various clinical characteristics.

The occurrence of the disease is usually inseparable from the role of the immune system; therefore, studying the distribution of immune cells in sepsis can improve our understanding of its pathogenesis. First, the relative percentages of 22 immune cells in the high- and low-risk groups of the training set were estimated using the CIBERSORT algorithm. Discrepancies in immune cell infiltration between the two risk groups were compared via Wilcoxon rank-sum test (p < 0.05). Discrepancies in the expression levels of 28 immune checkpoints between the two risk groups were estimated using the Wilcoxon rank-sum test (p < 0.05). Spearman's correlation analysis, supported by the R package" Psych ' (version 2.2.9) was used to explore the correlations between prognostic genes and discrepant immune cells as well as between prognostic genes and immune checkpoints.

To delve deeper into the signaling pathways associated with prognostic genes and the potential mechanisms of prognostic genes influencing sepsis, a Spearman correlation analysis was conducted on prognostic genes and all genes in the training set through the 'psych' package in R. These genes were then ranked based on their correlation coefficients from high to low, and GSEA was conducted on the sorted genes using the R package 'clusterProfiler' (p < 0.05). To understand the expression patterns of prognostic genes, upstream molecules targeting these genes were identified. Key miRNAs targeting prognostic genes were obtained by crossing the miRNAs predicted by the miRanda (http://www.miranda-im.pl/) and miRDB (https://mirdb.org/ mining.html) databases. Subsequently, the molecules upstream of these key miRNAs and lncRNAs were predicted using the starBase database (https://starbase.info/). Additionally, the NetworkAnalyst database (https://www.networkanalyst.ca/) was used to retrieve transcription factors (TFs) target prognostic genes. Cytoscape software (version 3.7.2) was used to construct the mRNA-miRNA-lncRNA and TF-mRNA networks. To further understand the suborganelles of prognostic genes, the UniProt database (https://www.uniprot.org/) was used for subcellular mapping of proteins coded by prognostic genes.

Disease association analysis aims to identify the causes and factors influencing diseases through systematic research. In this study, diseases associated with prognostic genes were explored using the DisGeNet database (https://www.disgenet.org/) and a prognostic gene-disease co-expression network was visualized using Cytoscape software. To investigate the phosphorylation sites of prognostic genes at the protein level, the PhosphoSitePlus database (https://www.phosphosite.org/homeAction.action) was utilized for analyzing the phosphorylation sites of the corresponding proteins. The findings were presented visually on a Laplace plot. The Drug Signatures Database (DsigDB, https://ngdc.cncb.ac.cn/databasecommons/database/id/4603) was used to predict targeted drugs that were likely to act on prognostic genes, and a prognostic gene-drug network was built using Cytoscape software.

Using the GSE167363 dataset, the 'Seurat' package in R (version 4.3.0) was used for the data filtering. Specifically, cells with fewer than 200 genes, genes present in fewer than three cells, cells expressing fewer than 350 genes or more than 3,000 genes, and genes with counts ≥ 15,000 were excluded. Additionally, the proportion of mitochondrial genes was limited to less than 15%. After data normalization, the 'vst' method was used to extract the top 2,000 highly variable genes (HVGs) between cells for further analysis. Furthermore, the samples were subjected to principal component analysis (PCA), the principal components (PCs) were extracted via the 'JackStrawPlot' and 'JackStraw' functions, and a PCA inflection plot was drawn to show the percentage of variance represented by each PC. The 'FindNeighbors' and 'FindClusters' functions were executed to do an unsupervised cluster analysis of the cells. The Uniform Manifold Approximation and Projection (UMAP) method was used to cluster the cells, resulting in cell clusters (resolution = 0.3). Cell types were obtained based on the expression of marker genes in cell clusters, which were obtained from a previous study. Simultaneously, the proportion of each cell type in the discrepant samples from the GSE167363 dataset was determined.

The expression of prognostic genes in different cell types was estimated, and key cell types with significantly different expression of prognostic genes in sepsis patients and controls were identified using the Wilcoxon rank-sum test (p < 0.05). To assess the patterns of interactions between cell types, cell communication analysis was performed using the R package 'CellChat' (version 1.6.1). In addition, the patterns of receptor-ligand interactions between different cell types were revealed. Unk 'Monocle' (version 2.26.0) package in R was used to conduct a pseudo-time analysis of key cell types to reveal the dynamic perspective of key cell differentiation state transformation and prognostic gene expression.

To explore changes in key cell-enriched pathways, GSVA was proceeded via the 'GSVA' package (version 1.42.0), with the reference gene set h.all.v2023.2. Hs.symbols.gmt, from the Molecular Signatures Database ( https://www.gsea-msigdb.org/gsea/msigdb).

Bioinformatics analysis was performed using the R software (version 4.2.2). Differential expression analysis was performed using limma (version 3.54.0) to identify differential genes between disease and normal samples; a co-expression network was constructed via WGCNA and key modules were screened; clusterProfiler (version 4.7.1.003) was used to conduct GO/KEGG functional enrichment analysis on candidate genes; TwoSampleMR was employed to perform Mendelian randomization analysis for inferring the causal relationship between genes and sepsis; the construction of a prognostic model using survival (version 3.5-3) combined univariate Cox regression and glmnet (version 4.1-4) LASSO regression, and rms (version 6.5-0) and nomogramFormula (version 1.2.0.0) were utilized to construct and validate the nomogram model; immune infiltration analysis was carried out to evaluate cell composition through CIBERSORT; for single-cell data analysis, Seurat (version 4.3.0) was used for quality control, standardization and clustering, CellChat (version 1.6.1) was applied to analyze intercellular communication, and monocle (version 2.26.0) was used to perform pseudotime analysis; in addition, GSVA (version 1.42.0) and enrichplot (version 1.18.3) were also used to conduct pathway enrichment analysis. Discrepancies between the two groups were evaluated using the Wilcoxon rank-sum test, with a p-value of less than 0.05 deemed statistically significant.

Ferritinophagy-related prognostic genes UBE2Q1, NEDD4L, and TCP11L2 for prognosis prediction in sepsis - Scientific Reports

POPULAR CATEGORY

misc

entertainment

corporate

research

wellness

athletics