In this study, leveraging our recently assembled multicentre UK cohort of 227 patients with primary cSCC with known metastasis outcomes and associated clinical archival tissue, we present the development and evaluation of cSCCNet, a two-step DL model for predicting metastatic risk from WSI of primary cSCC. To eliminate the need for time-consuming pathologist annotations, cSCCNet first selects the prognostically relevant area within a WSI and then predicts metastatic risk. We show that our histology AI model outperforms conventional clinicopathologic classifications and our recently developed 20-GEP molecular model, and that it predicts metastasis independently of histopathological classifications.
cSCCNet consists of two models: Model 1 for 'automated area selection' and Model 2 for 'prediction of metastatic risk' (Fig. 1a). As WSI often contain artefacts and normal tissue that can potentially confound the prediction of tumour characteristics and behaviour, our Model 1 automatically selects ROI. ROI are defined as tumour, intratumoral inflammatory cells, and peri-tumoral stroma, as these have been shown to contribute to tumour progression. Tiles within ROI are then extracted and used as input for Model 2 to predict metastatic risk for the sample of interest. Tile-level prediction is performed first to determine the predictive risk of each tile within a WSI. Informative tiles (i.e., confidently labelled high-risk or low-risk for metastasis by the model) are identified, and data is then amalgamated to generate a tumour-level prediction of metastasis (Fig. 1a).
To develop Model 1, WSI from all 227 cSCC from four centres (Supplementary Fig. 1a) were first annotated by an expert dermatopathologist (HR) to select the ROI, defined as all tumour regions, intratumoral inflammatory cells, and a thin rim of peri-tumoral stroma (Fig. 1b). Non-overlapping image tiles of 512 × 512 pixels at 20× magnification were extracted, resulting in 167,814 ROI and 295,665 non-ROI tiles. Following colour normalisation (Fig. 1c) and the separation of the hold-out testing cohort, Model 1 was trained on the remaining 145,425 ROI and 237,069 non-ROI tiles from the training cohort (n = 187 cSCC). The split into training and testing cohorts is explained in Supplementary Fig. 1a. KerasTuner was used for systematic comparison of different DL architectures and training parameters (Supplementary Fig. 1b,c). The best performing model was based on ResNet50, with dropout 0.2 and an initial learning rate of 1e-4. In 5-fold cross-validation, mean tile-level accuracy across folds was >90% in both training and validation, with consistent performance across folds and no evidence of overfitting. The final model was re-trained on the entire training cohort (187 cSCC) for 40 epochs. The optimal threshold for area selection was determined based on accuracy compared to the histopathologist-annotated ROI in the training cohort. A tile prediction score of 0.65 was selected as the cutoff for selection.
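The tiling step can be illustrated with a minimal sketch that enumerates the top-left corners of non-overlapping 512 × 512 tiles over a slide region. The function name `tile_origins` and the decision to drop partial tiles at the right and bottom edges are assumptions made for illustration; the text does not state how slide edges were handled.

```python
def tile_origins(width, height, tile=512):
    """Yield (x, y) top-left corners of non-overlapping square tiles.

    Assumption (not stated in the text): partial tiles at the right
    and bottom edges of the region are dropped.
    """
    for y in range(0, height - tile + 1, tile):
        for x in range(0, width - tile + 1, tile):
            yield (x, y)


# Toy region of 2048 x 1100 pixels (hypothetical dimensions).
origins = list(tile_origins(2048, 1100))
```

Because the step equals the tile size, no two tiles share pixels, which is what "non-overlapping" requires.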
To evaluate the performance of Model 1, predictions were generated on 80,985 tiles from the WSI in the testing cohort (n = 40 cSCC), which were not previously seen by the model. The 22,389 tiles within the histopathologist-annotated ROI had a median (IQR) prediction score of 0.97 (0.84-0.99), whereas the 58,596 tiles outside the ROI had a median (IQR) prediction score of 0.01 (6e-4-0.07) (Fig. 2a).
As determined in the training phase, tiles with a Model 1 prediction score ≥0.65 were classified as 'ROI', and tiles with lower scores were classified as 'non-ROI'. Using this predefined threshold, Model 1 achieved an AUC of 0.97 (95% CI 0.97-0.98) in identifying ROI compared to pathologist annotations. Visual inspection of heatmaps by an expert dermatopathologist (HR) confirmed that all the relevant areas were adequately included across all WSI, with negligible inclusion of non-tumour regions (Fig. 2b).
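The area-selection rule reduces to a single comparison per tile. The sketch below applies the reported cutoff (score ≥ 0.65 is 'ROI'); the function name `select_roi_tiles`, the dictionary layout, and the toy scores are hypothetical and are not taken from the study code.

```python
ROI_THRESHOLD = 0.65  # Model 1 cutoff reported in the text


def select_roi_tiles(tile_scores, threshold=ROI_THRESHOLD):
    """Return the ids of tiles whose Model 1 score is >= threshold."""
    return [tile_id for tile_id, score in tile_scores.items()
            if score >= threshold]


# Hypothetical Model 1 scores for four tiles.
scores = {"t1": 0.97, "t2": 0.01, "t3": 0.65, "t4": 0.64}
roi = select_roi_tiles(scores)
```

In this toy input, `t3` sits exactly on the cutoff and is kept, while `t4` falls just below it and is discarded.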
For training Model 2, a total of 129,187 ROI tiles were obtained from pathologist-annotated WSI of 172 cSCC meeting inclusion criteria: 80,380 tiles from metastasising (n = 64) and 48,807 tiles from non-metastasising (n = 108) cSCC. Tumour size varied, with a median (IQR) of 1064 (555-1634) and 317 (148-591) tiles per metastasising and non-metastasising cSCC, respectively. To avoid overfitting to larger tumours, up to 500 tiles were randomly selected per tumour, resulting in 27,920 and 32,711 tiles from metastasising and non-metastasising tumours, respectively. Tile labels were inherited from tumour-level labels, as either 'Metastasising' or 'Non metastasising'.
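A minimal sketch of the per-tumour cap follows. It assumes that tumours with fewer than 500 tiles contribute all of their tiles, which is consistent with the reported totals being below 500 tiles per tumour on average but is not stated explicitly. The function name `cap_tiles_per_tumour`, the fixed seed, and the toy inputs are hypothetical.

```python
import random


def cap_tiles_per_tumour(tiles_by_tumour, max_tiles=500, seed=0):
    """Randomly keep at most `max_tiles` tiles for each tumour.

    Assumption: tumours with fewer tiles than the cap keep all of them.
    """
    rng = random.Random(seed)  # fixed seed for a reproducible sample
    capped = {}
    for tumour_id, tiles in tiles_by_tumour.items():
        if len(tiles) > max_tiles:
            capped[tumour_id] = rng.sample(tiles, max_tiles)  # without replacement
        else:
            capped[tumour_id] = list(tiles)
    return capped


# Toy tumours sized like the reported medians (1064 and 317 tiles).
capped = cap_tiles_per_tumour({"A": list(range(1064)), "B": list(range(317))})
```

Sampling without replacement keeps every retained tile distinct, so a large tumour cannot dominate training through repeated tiles.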
Using KerasTuner, the best performing model was based on ResNet50, pretrained on ImageNet, with initial learning rate 1e-4, batch size 64, dropout 0.2, sigmoid activation function in the last dense layer, binary cross-entropy as the loss function, and the Adam algorithm for optimisation. Comparisons to additional architectures (Inception, ResNet101, ResNetV2), learning rates, tile sizes (256 × 256 pixels), omission of colour normalisation, or no pre-training on ImageNet did not improve model performance (Supplementary Fig. 1c). Additionally, the dual model cSCCNet outperformed a single model, based on all tiles of the entire WSI (Supplementary Fig. 1d-f).
To assess generalisability, five-fold cross-validation was performed using the best performing model. Mean tile-level accuracy across folds was 0.92 for training and 0.76 for validation after 20 epochs (Fig. 3a and Supplementary Fig. 2a, b). Following five-fold cross-validation, the final model was re-trained on the entire training cohort (172 cSCC) for 20 epochs (Supplementary Fig. 2c).
Next, we used the training cohort to select a threshold for Model 2. Median (IQR) tile scores were 0.99 (0.88-1.00) and 0.01 (1e-3-0.07) for tiles from metastasising and non-metastasising cSCC, respectively. To select a tumour-level threshold, various aggregate scores were compared. Excluding tiles with borderline scores (0.3-0.7) achieved greater separation between the two groups. A median tile score >0.2 achieved 99% accuracy (correct for all training cases, except one non-metastasising cSCC). Applying both models in series in the training cohort, with ROI tiles selected by Model 1 analysed by Model 2, achieved 98% tumour-level accuracy in predicting which tumours metastasised (correct for 63/64 metastasising and 106/108 non-metastasising tumours, Supplementary Fig. 3a-d).
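The tumour-level rule described above (drop borderline tiles, then threshold the median of the remaining tile scores) can be sketched as follows. Whether the 0.3 and 0.7 bounds are themselves excluded is an assumption, as is the handling of tumours with no informative tiles. The function name `tumour_risk` and the example scores are hypothetical.

```python
from statistics import median


def tumour_risk(tile_scores, low=0.3, high=0.7, cutoff=0.2):
    """Aggregate Model 2 tile scores into a tumour-level call.

    Assumptions: scores in [low, high] (bounds included) are treated as
    borderline and dropped; a tumour with no informative tiles is left
    unclassified (None).
    """
    informative = [s for s in tile_scores if not (low <= s <= high)]
    if not informative:
        return None
    return "high-risk" if median(informative) > cutoff else "low-risk"


# Hypothetical tile scores for two tumours.
call_a = tumour_risk([0.99, 0.95, 0.50, 0.10])
call_b = tumour_risk([0.01, 0.02, 0.60, 0.90])
```

Using the median rather than the mean means that a few confidently high-scoring tiles cannot by themselves move a tumour above the cutoff.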
cSCCNet performance was next evaluated on the testing cohort (n = 40 cSCC) using both models applied in series and the predefined thresholds (Threshold 1: scores ≥0.65; Threshold 2: median >0.2 after excluding tiles with Model 2 scores of 0.3-0.7). Model 1 selected 12,295 tiles from metastasising primaries and 9,856 tiles from non-metastasising primaries. Model 2 predictions had median (IQR) values of 0.87 (0.45-0.99) for tiles from metastasising primaries and 0.02 (1e-3-0.17) for tiles from non-metastasising primaries (Fig. 3b). cSCCNet correctly classified 38/40 cases: 13/14 metastasising cSCC were classified as high-risk and 25/26 non-metastasising cSCC were classified as low-risk by the model (Fig. 3c). Data from most cases (n = 38/40) was available for comparison with clinicopathologic classifications, including UICC8/AJCC8, BWH and British Association of Dermatologists' cSCC guidelines (BAD), and with our published 20-GEP test. cSCCNet achieved an AUC of 0.95 (95% CI 0.87-1), exceeding that of the 20-GEP test (AUC 0.80, 95% CI 0.67-0.94), although this difference was not significant. cSCCNet significantly outperformed all clinicopathologic classifications (AUC range: 0.69-0.71, DeLong test, p < 0.006) (Fig. 3d, Table 1). On comparison, using data from the whole cohort (172 training and 40 testing samples), cSCCNet maintained superior performance in predicting metastasising and non-metastasising cases (AUC = 0.98), followed by the 20-GEP signature (0.86), whilst the clinicopathologic classifications had inferior performances (0.74-0.78) (Supplementary Fig. 4a).
Performance in predicting risk of cSCC metastasis, based on cSCCNet prediction, the 20-GEP model outcome derived from k-nearest neighbours analysis, the 8th edition Union for International Cancer Control/AJCC staging manual (UICC8/AJCC8) stages T3 or higher, BWH stages T2b or higher, and the British Association of Dermatologists' cSCC guidelines (BAD) based on 'High/Very high' risk or 'Very High' risk only. The GEP signature and clinicopathologic classifications were not available for all tumours; the column on the right shows AUC results for the 35 tumours with complete data. AUC: area under the receiver operating characteristic curve; FN: false negatives; FP: false positives; NPV: negative predictive value; PPV: positive predictive value; TN: true negatives; TP: true positives. The 95% confidence intervals are in brackets.
Upon investigating other benchmarking measures, cSCCNet achieved the highest accuracy (95%) and specificity (96%) in predicting which tumours metastasised in the testing cohort, outperforming the other risk stratification tools (Table 1). cSCCNet reached 93% sensitivity, superior to all other criteria except the BAD 'High/Very high' risk category. The Pearson correlation between the 20-GEP test and cSCCNet score was 0.66 (p = 6e-6) for 37 cases, indicating a potential association between histopathological and molecular features (Supplementary Fig. 4b, c). On univariate analysis, features predictive of metastasis (p < 0.05) in the testing cohort included the cSCCNet classification, 20-GEP, UICC8/AJCC8, BWH, BAD Very High risk grade, tumour diameter, differentiation, thickness, and presence of lymphovascular invasion. Age, sex, site of primary cSCC, and presence of perineural invasion were not statistically significant in the testing cohort; however, all were significant (p < 0.05) when assessed in the entire cohort (n = 212), suggesting an impact of sample size (Supplementary Fig. 4d-i). On multivariate analysis, cSCCNet was a predictor of metastasis independent of UICC8/AJCC8 (multivariate Wald test, p = 0.002) and BWH (p = 6.9e-4) (Supplementary Fig. 4j,k).
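The headline testing-cohort figures follow directly from the reported case counts (13 of 14 metastasising and 25 of 26 non-metastasising cSCC correctly classified). The sketch below reproduces that arithmetic; the helper `binary_metrics` is hypothetical, and the PPV and NPV it returns are derived from these counts rather than quoted from Table 1.

```python
def binary_metrics(tp, fn, tn, fp):
    """Standard confusion-matrix metrics for a binary classifier."""
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "accuracy": (tp + tn) / (tp + fn + tn + fp),
        "ppv": tp / (tp + fp),
        "npv": tn / (tn + fn),
    }


# Counts from the testing cohort: 13/14 metastasising cSCC called
# high-risk, 25/26 non-metastasising cSCC called low-risk.
m = binary_metrics(tp=13, fn=1, tn=25, fp=1)
```

Rounded to whole percentages, these counts give the 93% sensitivity, 96% specificity, and 95% accuracy reported above.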
To evaluate whether inter-centre variability affects model performance, we trained a risk prediction model (Model 2) on cases from only three study centres, and tested this model on the fourth centre (i.e., not seen during training).
Two centre-split experiments were performed: Model BCD (trained on centres B, C, and D, and tested on centre A) and Model ABD (trained on centres A, B, and D, and tested on centre C). Results are presented in Supplementary Fig. 5. Although performance declined using the centre-split models, both models retained reasonable predictive ability when tested on entirely unseen centres, especially Model ABD, which achieved an accuracy of 73% and sensitivity of 85% but a poorer specificity of 58%. Of note, the training cohorts in the centre-split models were very unbalanced, with a lower proportion of metastasising cases likely contributing to poorer performance. These findings support our training strategy for cSCCNet, which incorporates cases from all four centres to optimise data diversity and model generalisability.
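A centre-split experiment amounts to holding out every case from one centre and training on the rest. The sketch below shows that partition on toy records; the function name `centre_split`, the record layout, and the example cases are hypothetical and do not reflect the real cohort composition.

```python
def centre_split(cases, held_out_centre):
    """Split cases so the held-out centre is never seen in training."""
    train = [c for c in cases if c["centre"] != held_out_centre]
    test = [c for c in cases if c["centre"] == held_out_centre]
    return train, test


# Toy cohort; ids and labels are invented for illustration.
cases = [
    {"id": 1, "centre": "A", "metastasised": True},
    {"id": 2, "centre": "B", "metastasised": False},
    {"id": 3, "centre": "C", "metastasised": True},
    {"id": 4, "centre": "D", "metastasised": False},
    {"id": 5, "centre": "A", "metastasised": False},
]

# Analogue of "Model BCD": train on centres B, C, D and test on centre A.
train, test = centre_split(cases, held_out_centre="A")
```

Splitting by centre rather than by tile or by case is what makes the held-out slides unseen with respect to that centre's staining and scanning characteristics.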
Heatmaps of Model 2 outputs were interrogated for both metastasising and non-metastasising cases. Marked intratumoral heterogeneity was observed in some cSCC, with both low- and high-risk areas present within the same WSI (Fig. 3e). An expert dermatopathologist (HR) reviewed the most predictive tiles in correctly classified cases, with review of all H&E in the testing cohort and the available IHC slides. This preliminary observational analysis did not aim to fully explain model scores, but rather to explore whether histopathological features may explain low or high scores across the cohort.
The model consistently assigned high scores (indicating higher risk) to areas of poorly differentiated carcinoma, which were often characterised by deeply basophilic staining secondary to large nuclei and scant cytoplasm (Fig. 4A, B). Additionally, areas with necrosis, single cell infiltration (Fig. 4C, D), acantholysis, or prominent desmoplasia surrounding carcinoma (Fig. 4E, F) often received borderline or high scores. Conversely, low scores (indicating lower metastatic risk) were assigned to regions containing predominantly near-normal epidermis, well-differentiated carcinoma (Fig. 4G, H), lymphocyte aggregation at the tumour edge (peritumoral infiltrate) (Fig. 4I, J), or cystic regions. Regions with dense, deeply eosinophilic stroma and keratin (Fig. 4G, H) were also consistently assigned low scores. Of note, tumour areas containing abundant blood vessels (Fig. 4K, L) often received high scores; however, it was unclear whether vascularisation itself was being recognised as a poor prognostic feature or whether the vessels were mimicking poorly differentiated carcinoma. Certain model predictions could not be fully explained, suggesting that cSCCNet may be detecting features beyond known histopathological risk factors.
Multiplex IHC was performed on a further 5 metastasising and 5 non-metastasising cases, allowing improved separation of different cell types and more detailed assessment of cell type composition in individual tiles (Fig. 5). Keratinocytes were identified by anti-AE1/AE3 (stained with DAB). T lymphocytes were highlighted with anti-CD3 (stained in green). The third cell marker, αSMA (alpha-smooth muscle actin, in red), is expressed by several cell types, including cancer-associated fibroblasts, tumour stroma, and by cells surrounding blood vessels, including capillaries. Qualitative analysis revealed greater T cell infiltration within metastasising tumours (i.e., intratumoral infiltrate) (Fig. 5C, D) compared to non-metastasising tumours (Fig. 5G, H). Quantitative analysis using HALO-AI estimated the median (IQR) proportion of CD3-positive cells within tumour regions (tumour-infiltrating T cells) as 6% (3-9%) within metastasising cSCC and 2% (2-3%) in non-metastasising cSCC (Fig. 5K), although this difference did not reach statistical significance (Mann-Whitney U test, p = 0.09), likely due to the small sample size.
Two cases in the testing cohort were misclassified by the model. One non-metastasising scalp cSCC received a high model score (0.75, Supplementary Fig. 6a). On histopathological review, it was poorly differentiated, invaded beyond the subcutis, and was classified as high-grade by UICC8/AJCC8 (T3) and BWH (T2b). Examination of cSCCNet heatmaps revealed that Model 1 had failed to select >60% of the ROI, and that the small number of tiles passed to Model 2 were deeply basophilic. In this case, we attributed the misclassification to sampling bias and difficulty of the case. One metastasising pinna cSCC with incomplete excision margins received a low model score (0.10, Supplementary Fig. 6b). The majority of the tumour was moderately differentiated with good keratinisation; however, there was extension beyond cartilage. A small area of poorly differentiated carcinoma was present and was correctly classified as 'high-risk' by the model. It was staged UICC8/AJCC8 T3 and BWH T2b. Of note, this tumour had initial incomplete margins and underwent re-excision.