Data
Table 1 summarises the characteristics of the training data. Of the 480 651 individuals included, 7168 (1.5%) had MALO diagnosed during follow-up. The median follow-up was 27.7 (interquartile range 22.6-32.2) years, corresponding to 12.3 million person years of follow-up. Figure 1 describes the observed MALO events in the training data. Of the 7168 MALO events, 1331 (19%) occurred during the first 10 years, giving a cumulative incidence at 10 years of 0.27% (95% confidence interval 0.26% to 0.29%).
Graphical summary of major adverse liver outcomes (MALO) in training data. Top: histogram of distribution of event times for 7168 incident cases of MALO, of which 1331 occurred in first 10 years. Middle: estimated cumulative incidence (%) of MALO with 95% confidence region. 10 year cumulative incidence was 0.27% (95% confidence interval 0.26% to 0.29%). Bottom: number of person years at risk of MALO over time, in thousands. Cumulative total of 12.3 million person years were observed, with 4.6 million person years in first 10 years of follow-up
Table 2 compares the three datasets used in this study. The 10 year incidence of MALO was higher in the validation datasets: 0.40% in the UK Biobank and 0.42% in FINRISK compared with 0.27% in the training data. The participants in the UK Biobank were older than those in the other two cohorts, and concentrations of γ-glutamyl transferase and aspartate aminotransferase were higher in the validation data than in the training data.
Comparison of characteristics in training data (AMORIS) and validation datasets (UK Biobank and FINRISK). Values are median (interquartile range) unless stated otherwise
Model development
We developed one flexible parametric survival model for MALO and one for the competing event of non-MALO death. We used these two models to calculate (using numerical integration) the predicted 10 year risk of MALO, accounting for the competing event. These predictions represent the “full model.”
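The numerical integration step can be illustrated with a minimal sketch. With cause specific hazards h1 (MALO) and h2 (non-MALO death), the cumulative incidence of MALO by time t is the integral from 0 to t of h1(u) × exp(−(H1(u) + H2(u))), where H1 and H2 are the cumulative hazards. The function name, the trapezoid rule, and the constant hazards in the usage lines are illustrative assumptions; the study's hazards come from flexible parametric models that are not reproduced here.

```python
import numpy as np


def cumulative_incidence(hazard_event, hazard_competing, horizon, n_steps=10_000):
    """Cumulative incidence of the event of interest by `horizon`,
    accounting for one competing event.

    Both hazard arguments are hypothetical callables that take an array of
    times and return the cause specific hazard at each time.
    """
    u = np.linspace(0.0, horizon, n_steps + 1)
    du = u[1] - u[0]
    h1 = hazard_event(u)
    h2 = hazard_competing(u)

    # Cumulative all-cause hazard H1(u) + H2(u) by the trapezoid rule.
    total = h1 + h2
    cum_total = np.concatenate(([0.0], np.cumsum((total[1:] + total[:-1]) / 2.0 * du)))

    # Probability of being free of both events just before time u.
    event_free = np.exp(-cum_total)

    # Integrate h1(u) * event_free(u) over [0, horizon], again by trapezoids.
    integrand = h1 * event_free
    return float(np.sum((integrand[1:] + integrand[:-1]) / 2.0 * du))


# Illustrative constant hazards per person year (assumed values, not study estimates).
malo_hazard = lambda u: np.full_like(u, 0.0003)
death_hazard = lambda u: np.full_like(u, 0.01)
risk_10y = cumulative_incidence(malo_hazard, death_hazard, horizon=10.0)
```

Ignoring the competing event would give 1 − exp(−H1(t)), which is always at least as large as the value returned here, because people who die of other causes can no longer develop MALO.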
We then logit-transformed the predicted risks of the full model. We chose this transformation instead of log-log and probit because it produced the highest R² between the predicted values of the full model and the final simplified model. When we did the backwards elimination on the logit-transformed predicted risks of the full model, γ-glutamyl transferase and aspartate aminotransferase were consistently the two strongest predictors. The third most important predictor was alanine aminotransferase, although the third place ranking varied slightly during the bootstrap validation among alanine aminotransferase, albumin, cholesterol, and platelets, with alanine aminotransferase generally performing best. Sex and age (always known and “free” to collect) were automatically included in this simplified model. We observed strong interaction effects between these and a few of the laboratory based predictors, and they are known to be strongly associated with the competing risk. Age, sex, γ-glutamyl transferase, aspartate aminotransferase, and alanine aminotransferase could together approximate the predictions of the full model with an R² of 96%. As this approximation was very good, we did not need to consider any additional biomarkers.
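The idea of approximating the full model can be sketched as follows: the logit of the full model's predicted risk is regressed on a reduced set of predictors, and the R² of that regression measures how well the reduced set reproduces the full model. This is a simplified illustration only. The function names are hypothetical, and the plain linear fit is an assumption; the simplified model described above also involves interaction effects, which are not shown here.

```python
import numpy as np


def logit(p):
    """Log odds of a probability (or array of probabilities) strictly between 0 and 1."""
    p = np.asarray(p, dtype=float)
    return np.log(p / (1.0 - p))


def approximation_r2(full_model_risk, reduced_predictors):
    """R² of an ordinary least squares fit of logit(full model risk)
    on a reduced predictor matrix (rows = people, columns = predictors)."""
    y = logit(full_model_risk)
    design = np.column_stack([np.ones(len(y)), reduced_predictors])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    fitted = design @ coef
    ss_res = np.sum((y - fitted) ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    return float(1.0 - ss_res / ss_tot)
```

In a backwards elimination of this kind, the predictor whose removal lowers this R² the least is dropped at each step, so the predictors that survive longest are those that carry most of the full model's information.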
On the basis of these decisions, we finalised the simplified model with the predictors γ-glutamyl transferase, aspartate aminotransferase, alanine aminotransferase, age, and sex, which became the Cirrhosis Outcome Risk Estimator (CORE) model. A web calculator is available at www.core-model.com, and R code for calculating CORE is available at https://github.com/rickstra/CORE. The mathematical formula for CORE is provided in the supplementary materials.
Figure 2 summarises the model predicted risks in the training data. On average, the predicted 10 year risk of MALO was 0.26%; 4%, 0.5%, 0.2%, and 0.05% of participants had a predicted risk of at least 1%, 5%, 10%, and 20%, respectively.
Two representations of predicted risks by CORE in training data. Predicted risks are separated into participants who had diagnosis of major adverse liver outcomes (MALO) within 10 years and those who did not. Top: density distributions, which show how CORE model separates participants by outcome. Bottom: cumulative distributions, ie, proportions of participants with predicted risks at or above value indicated on horizontal axis. For example, 57% of those with diagnosis of MALO in 10 years had predicted risk of ≥1%, whereas proportion was only 0.4% of those without diagnosis. Note that horizontal axis is on log10 scale
Model performance
Figure 3 shows receiver operating characteristic curves comparing CORE and FIB-4 in the training data. At every level of sensitivity, CORE has a better specificity than FIB-4; at every level of specificity, CORE has a higher sensitivity than FIB-4. For example, FIB-4 at the cut-off value of 1.30 had a sensitivity of 0.53 and a specificity of 0.86. The CORE cut-off value that gives the same sensitivity of 0.53 has a specificity of 0.96, and the CORE cut-off value that gives the same specificity of 0.86 has a sensitivity of 0.73. Similarly, the FIB-4 cut-off value of 2.67 had a sensitivity of 0.18 and a specificity of 0.992; CORE achieved a sensitivity of 0.34 at the same specificity and a specificity of 0.998 at the same sensitivity. The area under each receiver operating characteristic curve was 0.88 (95% confidence interval 0.87 to 0.89) for CORE and 0.79 (0.78 to 0.80) for FIB-4.
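The matched comparison of cut-off values can be sketched in a few lines. The sketch below treats the 10 year outcome as a simple binary label and ignores censoring, which is an assumption made for brevity and may differ from the estimation used in the study; both function names are hypothetical.

```python
import numpy as np


def sensitivity_specificity(score, event, cutoff):
    """Sensitivity and specificity of the rule `score >= cutoff`."""
    score = np.asarray(score, dtype=float)
    event = np.asarray(event, dtype=bool)
    positive = score >= cutoff
    sensitivity = np.mean(positive[event])      # flagged among those with the event
    specificity = np.mean(~positive[~event])    # not flagged among those without
    return float(sensitivity), float(specificity)


def cutoff_for_sensitivity(score, event, target_sensitivity):
    """Highest cut-off whose sensitivity is at least `target_sensitivity`."""
    score = np.asarray(score, dtype=float)
    event = np.asarray(event, dtype=bool)
    case_scores = np.sort(score[event])[::-1]   # scores of cases, highest first
    # Number of cases that must be flagged to reach the target sensitivity.
    n_needed = max(int(np.ceil(target_sensitivity * len(case_scores))), 1)
    return float(case_scores[n_needed - 1])
```

To compare two scores at a fixed sensitivity, one would compute the sensitivity of the reference score at its established cut-off, pass that value to `cutoff_for_sensitivity` for the new score, and then compare the two specificities.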
Figure 4 presents time dependent areas under the curves in the training and validation cohorts. The area under the curve of CORE (optimism corrected using bootstrap validation) ranges from 0.94 (95% confidence interval 0.91 to 0.98) at one year to 0.88 (0.87 to 0.89) at 10 years. This is consistently much higher than the area under the curve for FIB-4, which ranges from 0.86 (0.84 to 0.89) at one year to 0.79 (0.78 to 0.80) at 10 years. In the validation cohorts, the 10 year area under the curve for CORE was lower than in the training data: 0.81 (0.77 to 0.87) in FINRISK and 0.79 (0.78 to 0.80) in the UK Biobank. However, in the UK Biobank (for which FIB-4 was available), the 10 year area under the curve for FIB-4 was 0.73 (0.72 to 0.74), so FIB-4 remained lower than CORE, as in the training data.
Time dependent area under curve (AUC) statistics for time horizons from 1 to 10 years, comparing CORE and FIB-4 in training and validation datasets, with 95% confidence intervals. Note that FIB-4 could not be calculated in FINRISK as platelet count was not available. Values for CORE in training data (AMORIS) have been optimism corrected using bootstrap validation
Figure 5 shows calibration plots for CORE in the training and validation cohorts. Overall, the calibration was good in all three datasets, showing good agreement between predicted and observed risks. Some indication of overprediction exists in AMORIS and FINRISK for the especially high risks (>30%), but very few individuals make up those levels of risk (as indicated by the very wide error bars). The UK Biobank data instead show very good calibration at those high risks but indicate slight overprediction at the intermediate risks around 3-7%.
Calibration plots for CORE in training data and two validation datasets. Predicted risks are stratified into groups for which 10 year cumulative incidence is estimated, representing average observed risk in each group. Points are placed at average predicted risks within each group and observed risk; heights of error bars represent 95% confidence intervals of observed risk; widths of error bars indicate range of predicted risks included in respective group. Shaded regions are approximate 95% prediction intervals based on bootstrapping and locally estimated scatterplot smoothing in training data. Values in training data (left) have been optimism corrected using bootstrap validation
The top panel of figure 6 shows decision curves in the training data for four different theoretical treatment strategies for people at risk of MALO within 10 years: treating everyone as if they will develop MALO, treating none, or using either FIB-4 or CORE for risk stratification followed by treatment of those above the risk threshold indicated on the horizontal axis. “Treatment” in this regard can be seen as the need for further evaluation—for example, by referral to hepatology or for vibration controlled transient elastography. We can see that, for all potential risk thresholds below 20%—meaning the scenarios in which the benefit of correctly treating one person who will have MALO outweighs the harm of unnecessarily treating four or more people who will not—using CORE is the optimal strategy of the four for balancing the benefits and harms of true positives and false positives.
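The quantity plotted in a decision curve is the net benefit, which at risk threshold p is (true positives / n) − (false positives / n) × p / (1 − p). At a threshold of 20% the weight p / (1 − p) equals 0.25, which is where the "one person versus four people" trade-off above comes from. The sketch below is a minimal version that treats the 10 year outcome as binary and ignores censoring (an assumption); the function names are hypothetical.

```python
import numpy as np


def net_benefit(risk, event, threshold):
    """Net benefit of treating everyone with predicted risk >= threshold."""
    risk = np.asarray(risk, dtype=float)
    event = np.asarray(event, dtype=bool)
    treated = risk >= threshold
    n = len(event)
    true_pos = np.sum(treated & event)
    false_pos = np.sum(treated & ~event)
    # Each false positive is weighted by the odds of the risk threshold.
    return float(true_pos / n - false_pos / n * threshold / (1.0 - threshold))


def net_benefit_treat_all(event, threshold):
    """Net benefit of the default strategy of treating everyone."""
    prevalence = float(np.mean(np.asarray(event, dtype=bool)))
    return prevalence - (1.0 - prevalence) * threshold / (1.0 - threshold)
```

The "treat none" strategy has a net benefit of zero at every threshold, so a risk model is useful at a given threshold only if its net benefit exceeds both zero and the "treat all" value.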
Decision curves comparing net benefit of CORE and FIB-4. Top: net benefit in training data comparing net benefit of default strategies (“treat all” and “treat none”) with using risk stratification with either CORE or FIB-4. Bottom: net benefit of CORE in both training data and validation data, plus FIB-4 in training data. Net benefit values for CORE in training data (AMORIS) have been optimism corrected using bootstrap validation
In the bottom panel of figure 6, we have added the decision curves for using CORE in the two validation datasets. We see that the net benefit is consistent overall across all three datasets. As the validation datasets had a higher incidence of MALO than the training data, the net benefit starts higher when the risk threshold is low but—as the discrimination performance was lower—the net benefit drops off faster and coincidentally converges with the net benefit of the training data. This indicates that CORE would bring a similar net benefit in all three populations.
In supplementary figure D, we break down the types of MALO events (between compensated cirrhosis, decompensated cirrhosis, and hepatocellular carcinoma) detected by CORE and FIB-4 in the training data. We here use hypothetical cut-off values of 0.4% and 5% for CORE, which would refer the same number of individuals as the established FIB-4 cut-off values of 1.30 and 2.67, respectively. We see that CORE finds more cases of all three types and especially more cases of compensated cirrhosis when using the lower cut-off values (0.4% and 1.30).
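A cut-off value that refers the same number of individuals as an established cut-off on another score can be found by rank matching, as in the sketch below. The function name and the rank matching approach are assumptions made for illustration; how the values of 0.4% and 5% were derived is not detailed in this section.

```python
import numpy as np


def matched_cutoff(reference_score, reference_cutoff, new_score):
    """Cut-off on `new_score` that flags as many people as
    `reference_score >= reference_cutoff` does (more if `new_score` has ties)."""
    reference_score = np.asarray(reference_score, dtype=float)
    new_score = np.asarray(new_score, dtype=float)
    n_referred = int(np.sum(reference_score >= reference_cutoff))
    if n_referred == 0:
        return float("inf")  # nobody is referred by the reference rule
    ordered = np.sort(new_score)[::-1]  # highest new score first
    # The n-th highest new score becomes the cut-off.
    return float(ordered[n_referred - 1])
```

Because both rules then refer equally many people, any difference in the number of events detected reflects how well each score ranks individuals rather than how many are referred.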
As the current screening recommendations are aimed at certain risk groups, we repeated the internal validation of CORE in a subpopulation of 52 202 people who would qualify for screening according to European Association for the Study of the Liver guidelines.8 How this subpopulation was derived is explained in the supplementary materials. This subpopulation had a 10 year risk of MALO of 0.59% (95% confidence interval 0.53% to 0.66%) and an estimated 10 year area under the curve of 0.894 (95% confidence interval 0.855 to 0.929).
Because the estimated predictive performance in the training data was based on partially imputed data of aspartate aminotransferase, alanine aminotransferase, and γ-glutamyl transferase, we also calculated the area under the curve of a complete case dataset with respect to these three variables as a sensitivity analysis. The complete case sample size was 398 149 (83% of total) and the estimated 10 year area under the curve was 0.882 (0.869 to 0.891).