Colorectal cancer (CRC) is a significant global health issue, and its incidence is increasing among young people [1,2]. Despite advances in treatment, CRC patients still face challenges related to metastasis and variable prognosis. In recent years, a growing number of biomarkers for CRC metastasis and prognosis have been identified through analysis of clinical information and expression profiles in public databases [3]. However, most of these markers and models are derived solely from invasive biopsy results and therefore fail to capture the changes that tumour cells undergo during metastasis. According to existing research, tumour cells can detach from solid tumour lesions (primary and metastatic) and enter the bloodstream as circulating tumour cells (CTCs), which can reveal information about the pathways and gene expression changes that drive tumour growth or metastasis [4].
This study aims to identify relevant pathways by analysing CTCs from CRC patients. Particular attention is given to the transport pathways of vitamins, nucleosides, and related molecules, because cancer cells must adapt to a constantly changing microenvironment during tumourigenesis and metastasis and enhance the uptake of these molecules to meet the metabolic demands of rapid proliferation and migration [5]. In addition, fatty acids and other molecules that form phospholipids through esterification are crucial components of the cell membrane. They influence its physical properties, such as fluidity, elasticity, and permeability, and thereby promote the migration and invasion of tumour cells. These transport pathways may therefore be closely associated with CRC metastasis. The study further uses machine learning methods to identify genes within these pathways that are closely linked to tumour growth or metastasis in both primary and metastatic CRC tissues, and analyses the roles of these genes in CRC tumour growth or metastasis. This contributes to a better understanding of the mechanisms underlying CRC metastasis and establishes a model for identifying CRC metastasis, providing valuable information for clinical decision-making.
Additionally, data from the National Health and Nutrition Examination Survey (NHANES) database for 2017 to 2020 are analysed statistically to investigate the relationship between cancer occurrence and the daily dietary intake of total fat, saturated fatty acids, and non-fatty acid components. This helps to clarify the correlation between fat intake and cancer occurrence, raising awareness of dietary choices and providing guidance for cancer prevention.
The Gene Expression Omnibus (GEO; https://www.ncbi.nlm.nih.gov/geo/) is a large, free, open-access database containing gene expression data for various diseases. Two datasets were obtained from GEO: GSE131418, with 1135 samples, and GSE31023, with 9 samples. The GSE131418 dataset was profiled on the Rosetta/Merck Human RSTA Custom Affymetrix 2.0 microarray platform, while the GSE31023 dataset was profiled on the Agilent-026652 Whole Human Genome Microarray 4 × 44 K v2 platform. The National Health and Nutrition Examination Survey (NHANES; https://wwwn.cdc.gov/nchs/nhanes) uses a complex, multistage sampling method to select participants from across the United States for annual surveys. The surveys include personal interviews, physical examinations, laboratory tests, and nutritional assessments. This study used survey data from 2017 to 2020.
The study workflow is illustrated in Fig. 1.
Multidimensional scaling (MDS) and hierarchical clustering are established methods for similarity analysis and preliminary clustering of transcriptomic datasets. MDS reduces dimensionality and visualises inter-sample similarity by transforming a high-dimensional distance matrix into low-dimensional coordinates, so that sample relationships can be plotted in two or three dimensions and patterns of similarity and dissimilarity among specimens become apparent. Hierarchical clustering is an unsupervised learning approach that groups samples according to their pairwise similarity, building nested clusters either by iterative agglomeration or by divisive partitioning.
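The agglomerative variant of hierarchical clustering described above can be illustrated with a minimal sketch. It is written in Python with the standard library only and is not the code used in this study: the Euclidean distance, the average-linkage rule, the function names, and the toy two-feature samples are all assumptions chosen for illustration.

```python
from itertools import combinations
from math import sqrt


def euclidean(a, b):
    """Euclidean distance between two expression vectors (assumed metric)."""
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def average_linkage(samples, n_clusters):
    """Merge the two closest clusters until n_clusters remain.

    Returns a sorted list of clusters, each a sorted list of sample indices.
    """
    n = len(samples)
    # Pairwise distance matrix, stored once for index pairs (i, j) with i < j.
    dist = {(i, j): euclidean(samples[i], samples[j])
            for i, j in combinations(range(n), 2)}

    def cluster_distance(c1, c2):
        # Average linkage: mean distance over all cross-cluster sample pairs.
        total = sum(dist[(min(i, j), max(i, j))] for i in c1 for j in c2)
        return total / (len(c1) * len(c2))

    clusters = [[i] for i in range(n)]  # start with every sample on its own
    while len(clusters) > n_clusters:
        a, b = min(combinations(range(len(clusters)), 2),
                   key=lambda p: cluster_distance(clusters[p[0]], clusters[p[1]]))
        merged = sorted(clusters[a] + clusters[b])
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)]
        clusters.append(merged)
    return sorted(clusters)


# Hypothetical toy data: five samples, two features each.
samples = [[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0], [10.0, 0.0]]
clusters = average_linkage(samples, n_clusters=3)
print(clusters)
```

In practice the distance matrix would be computed from the expression profiles, and the same matrix could be passed to an MDS routine for visualisation.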
Receiver operating characteristic (ROC) curves and the area under the curve (AUC) are standard tools for assessing the discriminatory capacity of binary classification models. In the present study, the 'pROC' R package was used to generate ROC curves and compute AUC values. The ROC curve shows the trade-off between the true positive rate (TPR, sensitivity) and the false positive rate (FPR, 1 − specificity) across decision thresholds, illustrating how well the model identifies positive instances relative to how well it excludes negative ones. The AUC summarises overall model performance as a single number: it equals the probability that the model ranks a randomly selected positive instance higher than a randomly selected negative instance.
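The two views of the AUC described above, area under the ROC curve and probability of correct ranking, can be checked against each other with a short sketch. It is written in plain Python and does not reproduce the pROC interface; the function names and the toy labels and scores are hypothetical.

```python
def roc_points(labels, scores):
    """Return (FPR, TPR) pairs obtained by thresholding at each distinct score."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    points = [(0.0, 0.0)]
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 1)
        fp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 0)
        points.append((fp / n_neg, tp / n_pos))
    return points


def auc_trapezoid(points):
    """Area under the ROC curve by the trapezoidal rule."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


def auc_rank(labels, scores):
    """Probability that a random positive outscores a random negative (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0 for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical toy data: 1 = metastatic, 0 = non-metastatic.
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
area = auc_trapezoid(roc_points(labels, scores))
rank = auc_rank(labels, scores)
print(round(area, 4), round(rank, 4))
```

Both computations give 8/9 for this toy example, because eight of the nine positive-negative pairs are ranked correctly.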
The vip (Variable Importance Plots) package is an R toolkit designed to quantify the contribution of individual features to machine learning models. It extracts feature importance through several approaches, depending on the structure and parameters of the model in question. Model-specific methods use the weights, coefficients, or analogous quantities associated with each feature, whereas model-agnostic methods measure how model performance or predictions change when a feature is perturbed. Two commonly used model-agnostic techniques are permutation importance and Shapley Additive Explanations (SHAP) values. Permutation importance randomly permutes the values of one feature at a time, which preserves the feature's marginal distribution but breaks its association with the outcome, and quantifies the resulting decrease in model performance; a large decrease indicates an important feature. SHAP values derive from cooperative game theory and allocate contributions to features based on their marginal effects across all possible coalitions of input variables. The vip package also provides visualisation tools that display feature importance graphically, making the relative impact of variables on model predictions easier to interpret and supporting model refinement.
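Permutation importance as described above can be sketched in a few lines. This is an illustrative Python version, not the vip package's interface: the accuracy metric, the number of repeats, the threshold "model", and the toy data are all assumptions.

```python
import random


def accuracy(model, X, y):
    """Fraction of rows for which the model's prediction matches the label."""
    return sum(model(row) == label for row, label in zip(X, y)) / len(y)


def permutation_importance(model, X, y, n_repeats=10, seed=0):
    """Mean drop in accuracy after shuffling each feature column in turn."""
    rng = random.Random(seed)
    baseline = accuracy(model, X, y)
    importances = []
    for j in range(len(X[0])):
        drops = []
        for _ in range(n_repeats):
            column = [row[j] for row in X]
            rng.shuffle(column)  # same values, association with y broken
            X_perm = [row[:j] + [v] + row[j + 1:] for row, v in zip(X, column)]
            drops.append(baseline - accuracy(model, X_perm, y))
        importances.append(sum(drops) / n_repeats)
    return importances


# Hypothetical toy data: feature 0 determines the label, feature 1 is noise.
data_rng = random.Random(42)
X = [[i % 2, data_rng.random()] for i in range(40)]
y = [row[0] for row in X]


def toy_model(row):
    """Stand-in for a fitted classifier; it only looks at feature 0."""
    return 1 if row[0] > 0.5 else 0


importances = permutation_importance(toy_model, X, y)
print(importances)
```

Shuffling the noise feature leaves accuracy unchanged, so its importance is exactly zero, while shuffling the informative feature lowers accuracy and yields a positive importance.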
The Regression Modeling Strategies (rms) package is an R suite for regression modelling and survival analysis. It supports the systematic construction of risk and survival models: predictor variables are selected, and the event indicator and follow-up time are specified through model-specific fitting functions. The rms framework also provides evaluation and validation procedures that allow survival models to be compared and their predictive accuracy and reliability to be assessed. Statistical summaries of the fitted model quantify how well its predictions agree with observed outcomes, supporting robust use in clinical or observational applications.
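One common way to quantify agreement between predicted risk and observed survival is Harrell's concordance index (C-index). Whether this exact statistic was used here is an assumption; the sketch below is plain Python rather than the rms interface, and the function name and toy data are hypothetical.

```python
def concordance_index(times, events, risk_scores):
    """Harrell's C-index for right-censored data.

    A pair (i, j) is comparable when subject i had the event (events[i] == 1)
    and a strictly shorter follow-up time than subject j. The pair is
    concordant when the subject who failed earlier has the higher risk score;
    tied risk scores count as half.
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        if events[i] != 1:
            continue  # censored subjects cannot anchor a comparable pair
        for j in range(n):
            if times[i] < times[j]:
                comparable += 1
                if risk_scores[i] > risk_scores[j]:
                    concordant += 1.0
                elif risk_scores[i] == risk_scores[j]:
                    concordant += 0.5
    return concordant / comparable


# Hypothetical toy data: follow-up times, event indicators (1 = event,
# 0 = censored), and model-predicted risk scores.
times = [2, 4, 6, 8]
events = [1, 1, 0, 1]
risk_scores = [0.9, 0.7, 0.8, 0.1]
c_index = concordance_index(times, events, risk_scores)
print(c_index)
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect ranking; in the toy data four of the five comparable pairs are concordant.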
The colorectal cancer cell lines Caco-2 and T84 were obtained from the Cell Bank of the Type Culture Collection of the Chinese Academy of Sciences (Shanghai, China) and cultured in Dulbecco's Modified Eagle's Medium (DMEM, Gibco, USA) supplemented with 20% FBS (HyClone, USA). The medium was changed every 2 days. Once the cells had grown to a confluent monolayer, they were digested with 0.25% trypsin and passaged.
The following antibodies were used: anti-SLC27A1 was purchased from Sangon Biotech (Shanghai, China), and anti-β-actin and other antibodies were purchased from Beyotime (Shanghai, China). The SLC27A1 Human Pre-designed siRNA Set was purchased from MedChemExpress (New Jersey, USA).
Colorectal cancer cell lines were lysed using Cell Lysis Buffer (50 mM, pH 7.5) which contains the following components: 150 mM NaCl, 1% Triton X-100, 2 mM sodium pyrophosphate, 25 mM β-glycerophosphate, 1 mM EDTA, 1 mM Na₃VO₄, and 0.5 µg/mL leupeptin (Beyotime, P0013). The protein concentration of the samples was determined with the Enhanced BCA Protein Assay Kit (P0009B, Beyotime). Subsequently, equal amounts of protein were separated by 12% SDS-PAGE. The proteins were then transferred to PVDF membranes (Roche Incorp., Germany), and these membranes were incubated overnight at 4 °C with either rabbit polyclonal anti-SLC27A1 (D161127, Sangon Biotech) or mouse monoclonal anti-actin (A0208, Beyotime). The protein bands on the blots were visualized using HRP-conjugated secondary anti-rabbit or anti-mouse antibodies along with BeyoECL Plus (P0018, Beyotime).
The cells were handled in the same way as in the MTT assay, except that they were plated in 6-well plates at a density of 5 × 10⁵ cells per well and cultured overnight. A yellow pipette tip was then used to create a straight scratch across the cell layer, and the cells were rinsed three times with PBS (pH 7.4). Next, the cells were incubated for 24 h in DMEM supplemented with 1% FBS. The scratches were observed every 24 h using a microscope camera system, and the scratch area was measured with Image Pro Plus 6.0.
25 µg Matrigel basement membrane matrix (356237, Corning, USA) was coated onto the upper chamber of a 6.5 mm Transwell insert with an 8.0 μm-pore polycarbonate membrane (3422, Corning Costar, USA). Then 600 µl of serum-free medium was added to the upper and lower chambers for 12 h. The cells were then plated in the upper chamber in serum-free medium at a density of 1 × 10 cells/well and incubated in a cell incubator for 24 h. Cells that had not crossed the basement membrane were removed from the upper side of the membrane; the cells that had invaded were stained with 0.1% crystal violet solution for 30 min and counted.
All experiments (Western blot, motility, and invasion) were performed with 3 biological replicates (independent cell cultures) and 3 technical replicates (assays per biological sample) to ensure statistical robustness.
R constitutes a freely distributed computational environment and programming language, operating under the GNU General Public License, specifically engineered for statistical inference and data visualisation. It comprises an extensive array of methodologies, encompassing linear and nonlinear regression modelling, hypothesis testing frameworks, time series analysis, discriminant function analysis, clustering algorithms, and multidimensional graphical representation techniques. The analytical workflows delineated in this study were implemented using R software (version 4.1.2; https://www.r-project.org/), which provides a robust, extensible architecture for reproducible statistical computation and evidence-based inference.