Conventional machine learning-based prediction models did not outperform the International IgA Nephropathy Prediction Tool
Abstract
Background
Immunoglobulin A nephropathy (IgAN) is a major cause of end-stage kidney disease (ESKD). The International IgA Nephropathy Prediction Tool (IIgAN-PT) predicts IgAN prognosis, but improvement in the prediction performance using machine learning (ML)-based methods is needed.
Methods
We analyzed 4,425 biopsy-confirmed patients with IgAN and ≥6 months of follow-up from nine tertiary university hospitals in Korea. The study population was divided into development and validation cohorts. Using the collected 87 clinicodemographic and pathological variables, ML-based prediction models for ESKD or estimated glomerular filtration rate decline (50% reduction or <15 mL/min/1.73 m²) were constructed: 1) the conventional CatBoost model, 2) the optimized CatBoost model with Cox proportional hazards, 3) the deep Cox proportional hazards model, and 4) the deep Cox mixture model. The area under the curve (AUC) and calibration plots were used to investigate the discriminative and calibration performance of the models, which were then compared with those of the IIgAN-PT full model.
Results
The full model showed excellent performance (AUC [95% confidence interval] for 5-year outcome, 0.896 [0.853–0.940]), with acceptable calibration results. The ML-based models showed good performance in predicting adverse kidney outcomes and revealed acceptable discrimination performance in the external validation (AUC [95% confidence interval] for the 5-year outcome: 1) 0.829 [0.791–0.866]; 2) 0.847 [0.804–0.890]; 3) 0.823 [0.784–0.862]; and 4) 0.832 [0.794–0.870]), although the models showed underestimation in calibration analysis of the external validation cohort. With the validation data, the overall performance of the IIgAN-PT was non-inferior to that of the ML-based model.
Conclusions
Our ML-based models showed good performance in predicting adverse kidney outcomes in patients with IgAN but they did not outperform the IIgAN-PT.
Introduction
Immunoglobulin A nephropathy (IgAN) is the most prevalent primary glomerulonephritis worldwide [1]. The clinical presentation and overall prognosis of IgAN are extremely heterogeneous. IgAN may be worsened by high blood pressure, significant proteinuria, the presence of kidney dysfunction, or unfavorable pathologic characteristics. Approximately one-third of patients with IgAN progress to end-stage kidney disease (ESKD) in middle age, making the disease an important cause of the socioeconomic burden related to kidney failure, especially in Asian countries [2–4]. However, a certain proportion of patients with IgAN exhibit a benign course without notable deterioration of kidney function. Therefore, the current KDIGO (Kidney Disease: Improving Global Outcomes) guideline for glomerular diseases recommends stratifying the kidney progression risk of patients with IgAN based on clinical and histologic data and quantifying progression risk at diagnosis using the International IgA Nephropathy Prediction Tool (IIgAN-PT) [5,6]. The IIgAN-PT is the prediction model that includes the largest number of patients with IgAN from various regions of the world [5]. The prognostic performance of the IIgAN-PT has also been validated in several external cohorts, including children, supporting the validity of the model [7–10].
Artificial intelligence (AI) provides an emerging opportunity to develop automatic clinical/pathological image annotations, construct clinical decision support systems, and build robust prediction models. However, the ability of AI to handle complex high-dimensional data without being affected by characteristics of parameters or statistical assumptions remains to be determined [11]. Machine learning (ML)-based methods, a subfield of AI that teaches machines to learn from past data without explicit programming, have also been trialed for the prognostic IgAN model [12–14]; however, a widely validated deep learning (DL)-based model has not been established. Additional studies implementing the AI approach to integrate the complex clinicopathological information of patients with IgAN may improve the performance of prognostic strategies for the disease.
This study aimed to develop ML-based models to predict the prognosis of IgAN. We trained and validated ML- and DL-based models using a comprehensive collection of 87 demographic, clinical, and pathologic variables from a large-scale multicenter cohort in South Korea. We also validated the full IIgAN-PT model in the Korean population and compared the performance of the AI models with that of the IIgAN-PT model derived from the conventional Cox proportional hazards model.
Methods
Ethics considerations
The study was approved by the Institutional Review Board of Seoul National University Hospital/Seoul National University Bundang Hospital/SMG-SNU Boramae Medical Center (No. H-2103-091-1205), Severance Hospital (No. 4-2021-0376), The Catholic University of Korea, Yeouido St. Mary’s Hospital (No. SC21RIDI0090), Asan Medical Center (No. 2021-1333), Kyungpook National University Hospital (No. 2021-04-036), Chungbuk National University Hospital (No. 2021-09-004), and Gangwon National University Hospital (No. KNUH-A-2021-08-012-001). Data on all study participants were collected from each hospital and sent to the central analysis laboratory after anonymization using the standard protocol. The requirement for informed consent was waived because this was a retrospective observational study without medical intervention. The study was conducted in accordance with the principles of the Declaration of Helsinki.
Study setting
This multicenter study included biopsy-confirmed IgAN cases from nine tertiary hospitals throughout Korea. We first collected the diverse demographic, clinical, and pathological characteristics of patients with IgAN by reviewing their electronic health records. Next, we implemented multiple ML-based approaches to construct prediction models for kidney disease progression in IgAN. Finally, we compared the discriminative and calibration performances of the models with those of the IIgAN-PT.
Study population
We included all available biopsy-confirmed native IgAN cases from the electronic medical records of the study hospitals (Fig. 1). Patients who progressed to the adverse kidney outcome within 6 months were excluded because such acute aggravation is not the target of the current study. The development cohort for the ML-based model included patients with IgAN from Seoul National University Hospital, Seoul National University Bundang Hospital, and SMG-SNU Boramae Medical Center. The three hospitals are all affiliated with the Seoul National University College of Medicine and may share a distinct medical environment; thus, using the data from the other hospitals as the validation cohort strengthened the external validation.
Study outcome
The study outcome was a composite of a decline in estimated glomerular filtration rate (eGFR) to less than half of the baseline value or ESKD, defined as kidney replacement therapy or an eGFR of <15 mL/min/1.73 m². Follow-up ended at the time of the outcome, and patients who did not reach the outcome were censored at loss to follow-up.
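The outcome definition can be illustrated with a minimal sketch. The `FollowUp` record and its field names are hypothetical and introduced only for illustration; they do not describe the study database, and censoring at the last recorded visit is an assumption.

```python
from dataclasses import dataclass


@dataclass
class FollowUp:
    """One follow-up visit (hypothetical record layout)."""
    baseline_egfr: float       # eGFR at biopsy, mL/min/1.73 m^2
    egfr: float                # eGFR at this visit, mL/min/1.73 m^2
    years_since_biopsy: float
    on_krt: bool = False       # kidney replacement therapy started


def reached_outcome(v: FollowUp) -> bool:
    """Composite outcome: eGFR below half of baseline, or ESKD."""
    halved = v.egfr < 0.5 * v.baseline_egfr
    eskd = v.on_krt or v.egfr < 15.0
    return halved or eskd


def time_to_event(visits):
    """Return (time in years, event flag) for one patient.

    The event time is the first visit meeting the outcome; otherwise the
    patient is censored at the last recorded visit (assumption).
    """
    last = 0.0
    for v in sorted(visits, key=lambda v: v.years_since_biopsy):
        if reached_outcome(v):
            return v.years_since_biopsy, 1
        last = v.years_since_biopsy
    return last, 0
```

For example, a patient with a baseline eGFR of 80 whose eGFR falls to 38 at 3.5 years is counted as an event at 3.5 years, whereas a patient last seen at 2 years with an eGFR of 75 is censored at 2 years.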
Data collection for model variables
A total of 87 demographic, clinical, and pathological variables were collected and included in the model. We reviewed all variables included in the full IIgAN-PT model, as well as social habits (e.g., smoking), various laboratory test results (e.g., serum electrolyte levels, serum protein/albumin levels, and complete blood counts including white blood cells, hemoglobin, and platelets), anthropometric measures (e.g., body mass index), and pathological features, including light microscopy findings (e.g., global sclerosis, segmental sclerosis, or cellular or fibrocellular crescent) and electron or immunofluorescence microscopy findings. Because our longitudinal cohort covered a long period, some patients with IgAN were diagnosed before their institution adopted the Oxford classification for the pathologic diagnosis of IgAN. The pathologic parameters were assessed by each pathologist in the study hospitals, and we retrospectively collected the pathology reports. Supplementary Table 1 (available online) provides a complete list of the collected variables.
Machine learning-based model construction
We used two ML-based and two DL-based models to construct a prognostic prediction model for IgAN. For the ML-based model, the conventional CatBoost [15] and optimized CatBoost with the Cox proportional hazards were trained using the collected data. CatBoost is a gradient-boosted decision tree model [16] with ordered target statistics and boosting and is a powerful tool for classification and regression. As a decision tree-based algorithm, it is well-suited to ML tasks involving categorical, heterogeneous data and can also compute feature importance [17]. CatBoost with the Cox proportional hazards is a model with a modified loss function for survival regression. Unlike common supervised tasks in which the target variable is known and observed during the entire period in the training dataset, survival regression can handle partially observed or censored target variables. Therefore, unlike the CatBoost method, which requires the development of a separate model for each time section to handle censored data, CatBoost with the Cox proportional hazards can handle various time sections using a single model that optimizes the log partial likelihood derived from the hazard function for Cox proportional hazards.
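The log partial likelihood mentioned above can be written out as a short sketch. This is a plain-Python illustration of the standard Cox objective (Breslow-style handling of ties), not the CatBoost implementation, whose internal details may differ; the function name and arguments are illustrative.

```python
import math


def neg_log_partial_likelihood(times, events, scores):
    """Negative Cox log partial likelihood.

    times  : follow-up time of each patient
    events : 1 if the outcome was observed, 0 if censored
    scores : model output f(x_i), interpreted as a log-hazard ratio
    """
    total = 0.0
    for i, (t_i, e_i) in enumerate(zip(times, events)):
        if not e_i:
            # Censored patients contribute only through the risk sets.
            continue
        # Risk set: everyone still under observation at time t_i.
        risk = sum(math.exp(s) for t, s in zip(times, scores) if t >= t_i)
        total -= scores[i] - math.log(risk)
    return total
```

Because each event is compared only with the patients still at risk at that time, censored patients inform the loss without requiring a known event time, and a single fitted model can be evaluated at any time horizon.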
For the DL-based model, deep logistic hazards [18] and deep Cox mixture [19] were used for survival regression. Deep logistic hazards is a discrete-time survival prediction method with neural networks that parameterizes discrete hazards and optimizes the survival likelihood. We used a multilayer perceptron with two hidden layers to implement deep logistic hazards. The deep Cox mixture is another survival prediction method that generalizes the proportional hazards assumption via a mixture model by assuming that there are latent groups and that within each group, the proportional hazards assumption holds. This method is not restricted by the strong assumption of proportional hazards, which allows the model to choose these latent groups and build a more expressive survival prediction model.
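The discrete-time formulation can be sketched as follows. The network itself is omitted: `logits` stands for the per-interval outputs of a hypothetical network for one patient, and the functions show only how discrete hazards translate into a survival curve and a likelihood. This is an illustration under those assumptions, not the pycox implementation.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def survival_from_logits(logits):
    """Survival curve S(t_j) = prod_{k <= j} (1 - h_k), with h_k = sigmoid(logit_k)."""
    surv, s = [], 1.0
    for z in logits:
        s *= 1.0 - sigmoid(z)
        surv.append(s)
    return surv


def neg_log_likelihood(logits, interval_idx, event):
    """Negative log-likelihood of one patient.

    The patient survives every interval before `interval_idx`; in that
    interval the outcome either occurs (event=True) or the patient is
    censored (event=False).
    """
    nll = 0.0
    for j in range(interval_idx):
        nll -= math.log(1.0 - sigmoid(logits[j]))
    h = sigmoid(logits[interval_idx])
    nll -= math.log(h) if event else math.log(1.0 - h)
    return nll
```

The predicted risk at a given horizon is then one minus the survival probability at the corresponding interval, which is the type of prediction score used for time-specific discrimination analyses.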
The variables that contributed to the prognostic ability of the models were weighted by feature importance analysis in the ML-based models, including the CatBoost model for 5-year adverse kidney outcomes and the CatBoost model with Cox proportional hazards.
During model development, we preprocessed the original data, including missing value imputation, data standardization, and data normalization. Both CatBoost and CatBoost with the Cox proportional hazards were implemented using PyCaret [20] and the official CatBoost Python package. Deep logistic hazards and deep Cox mixtures were implemented using the pycox Python package [18] and the official deep Cox mixture repository. Missing values were masked (categorical) or mean-imputed (numerical) according to the data type to keep the method simple for further application in external datasets. The training/validation ratio for the DL methods was 9:1, and performance stability was assessed by bootstrapping. The proportional hazards assumption of the Cox-based DL models was assessed by visualizing the Kaplan-Meier survival curves and checking whether the curves crossed during follow-up, which may indicate a violation of the assumption (Supplementary Fig. 1, available online).
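The missing-value rule described above can be sketched in a few lines. The mask token, column names, and the choice to learn the means from the development data and reuse them on external data are assumptions made for illustration; the study's actual preprocessing code is not reproduced here.

```python
from statistics import mean

MISSING = "__missing__"  # hypothetical mask token for categorical variables


def fit_imputer(rows, numeric_cols, categorical_cols):
    """Learn fill values: the column mean for numerical variables,
    a mask token for categorical variables."""
    fill = {}
    for c in numeric_cols:
        fill[c] = mean(r[c] for r in rows if r.get(c) is not None)
    for c in categorical_cols:
        fill[c] = MISSING
    return fill


def apply_imputer(rows, fill):
    """Return copies of the rows with missing entries replaced."""
    out = []
    for r in rows:
        filled = dict(r)
        for c, v in fill.items():
            if filled.get(c) is None:
                filled[c] = v
        out.append(filled)
    return out
```

Keeping the rule this simple means that an external dataset requires only the stored fill values rather than a fitted imputation model.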
Statistical analysis
For validation, the prediction scores extracted from the ML-based models were used to inspect the discriminative and calibration performance in the validation set. As with the IIgAN-PT, the CatBoost model with the Cox proportional hazards function provided a predictor for survival analysis, allowing calculation of the c-index. All four models provided prediction scores at specific time points of the outcomes, and we extracted the prediction scores to calculate the receiver-operating characteristic area under the curve (ROC-AUC) values at the 1-, 3-, 5-, and 10-year points to assess discriminative power. Calibration was assessed using a calibration plot comparing the observed and predicted risks for 5-year adverse outcomes. The ROC-AUC values were directly compared with those of the IIgAN-PT calculated by the full model, among patients with complete information on the variables required to apply the IIgAN-PT (e.g., Oxford classification), using the DeLong test. As a sensitivity analysis, we additionally constructed ML-based models among patients with complete information for the IIgAN-PT application and again compared the results in the validation set with the available data. Statistical significance was set at p < 0.05. Clinical statistical analysis was performed using R software (version 3.6.2; R Foundation for Statistical Computing). Censoring of the data was considered to occur in a random manner.
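A time-specific ROC-AUC and a binned calibration check can be sketched as follows. This is a simplified illustration, not the study's R code: patients censored before the horizon are simply excluded from the AUC (an assumption), and the calibration helper takes a known binary outcome status, so it ignores censoring.

```python
def auc_at_horizon(times, events, risks, horizon):
    """ROC-AUC for the outcome occurring by `horizon` years.

    Cases had the event by the horizon; controls were followed beyond it.
    Patients censored before the horizon are excluded (simplification).
    """
    cases = [r for t, e, r in zip(times, events, risks) if e and t <= horizon]
    controls = [r for t, r in zip(times, risks) if t > horizon]
    if not cases or not controls:
        raise ValueError("need at least one case and one control")
    # Fraction of case-control pairs ranked correctly; ties count as half.
    wins = sum(1.0 if c > k else 0.5 if c == k else 0.0
               for c in cases for k in controls)
    return wins / (len(cases) * len(controls))


def calibration_table(risks, outcomes, n_bins):
    """Mean predicted risk vs observed event fraction per equal-size risk bin."""
    pairs = sorted(zip(risks, outcomes))
    table = []
    for b in range(n_bins):
        chunk = pairs[b * len(pairs) // n_bins:(b + 1) * len(pairs) // n_bins]
        if not chunk:
            continue
        predicted = sum(r for r, _ in chunk) / len(chunk)
        observed = sum(o for _, o in chunk) / len(chunk)
        table.append((predicted, observed))
    return table
```

In such a table, bins in which the observed fraction exceeds the mean predicted risk correspond to underestimation of risk on a calibration plot.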
Results
Baseline characteristics
A total of 5,075 biopsy-confirmed IgAN cases were screened in this study. Supplementary Table 2 (available online) summarizes the characteristics of the cohort. The overall characteristics differed between the study hospitals, and the median age of patients ranged from 32 to 44 years. Approximately 5% and 30%–40% of the study participants had diabetes mellitus and hypertension, respectively. The proportion of patients with a history of immunosuppressive drug treatment at the time of biopsy was mostly less than 10%, while the proportion of those treated with renin-angiotensin-aldosterone blockade ranged from 24% to 58%.
After excluding patients with IgAN and a follow-up of less than 6 months, we constructed development and validation datasets comprising 2,439 and 1,986 patients with IgAN, respectively; Table 1 summarizes the characteristics of these patients. The median follow-up duration in the development cohort was 5.8 years (interquartile range [IQR], 2.6–10.1 years) with a median biopsy date of March 2011 (IQR, April 2004–March 2016), and that in the validation cohort was 3.8 years (IQR, 1.5–7.3 years) with a median biopsy date of August 2015 (IQR, January 2011–June 2018). Among them, 1,240 and 1,125 patients with IgAN had complete information on the Oxford classification, respectively; thus, they were included in the additional analysis, in which models were developed within the subset with full Oxford classification information (Supplementary Table 3, available online).
Performance of the IIgAN-PT
We first applied the IIgAN-PT full model to the collected dataset, which contained the complete Oxford classification information (n = 2,178). In study subjects with complete information for the IIgAN-PT, the IIgAN-PT showed acceptable performance, with AUC values of 0.836 (95% confidence interval [CI], 0.752–0.920), 0.873 (95% CI, 0.840–0.906), 0.857 (95% CI, 0.828–0.885), and 0.799 (95% CI, 0.757–0.840) for 1-, 3-, 5-, and 10-year outcomes, respectively. The overall calibration was acceptable when inspected using a calibration plot (Supplementary Fig. 2, available online).
Performance of the machine learning-based models
We then developed ML-based models using the 2,439 patients with or without Oxford classification information, and their performance was tested in the validation set (n = 1,717). In the validation set, the conventional CatBoost, optimized CatBoost with the Cox proportional hazards, deep logistic hazards, and deep Cox mixture models provided AUC values mostly ranging from 0.7 to 0.8 (Table 2, Fig. 2). No single model was clearly superior to the others, although the conventional CatBoost model showed low discriminative power (AUC, 0.512) for the 10-year outcome. When assessing the calibration of the developed models, the four models showed generally acceptable calibration results, as no significant deviation was identified in the calibration plots. However, a slight underestimation of the risk of adverse kidney outcomes was identified in the models developed using these four methods. When the composite outcome was divided into ESKD and eGFR 50% reduction, the performance was better for the ESKD outcome than for the eGFR 50% reduction (Supplementary Table 4, available online).
Feature importance
We inspected the feature importance, which indicates the variables on which the models constructed using ML-based methods relied most for their predictions (Fig. 3). In the CatBoost model for 5-year outcomes and the CatBoost model with Cox proportional hazards, when we tested the results in the development cohort and the cohort with complete information for the IIgAN-PT, serum creatinine, global sclerosis (%), eGFR, and proteinuria levels ranked among the top five variables. The number of glomeruli in the entire biopsy specimen, serum uric acid, blood urea nitrogen, serum albumin, and segmental sclerosis (%) appeared among the top 20 variables in all four models.
Performance comparison between the IIgAN-PT and machine learning-based models
We compared the model performance within the validation dataset with complete information for the IIgAN-PT (n = 1,125) (Table 3, Fig. 4). The IIgAN-PT again showed acceptable discriminative performance within the validation cohort, as the AUC values ranged from 0.834 to 0.896 for adverse outcomes at 1, 3, 5, and 10 years. The performances of the ML-based methods were similar, and none of the ML-based models was statistically superior to the IIgAN-PT in discriminative performance. Similarly, the calibration results were acceptable for both the IIgAN-PT and ML-based models. However, some underestimation of the risks of adverse kidney outcomes was noted in all tested ML-driven models.
Discussion
In this study, we developed ML-driven prediction models for the kidney prognosis of IgAN by incorporating various clinicopathological variables. The constructed models demonstrated good discrimination and calibration performance in the external validation. As a reference, the full IIgAN-PT model showed excellent performance in our large-scale cohort study. The overall performance of the IIgAN-PT was non-inferior to that of ML-based models, additionally supporting the clinical utility of the IIgAN-PT in patients with IgAN.
Accurate prediction of IgAN kidney prognosis is crucial for appropriate risk stratification, scheduling follow-up visits, determining treatment strategies, and counseling patients. The IIgAN-PT is the most widely validated prognostic model for IgAN, and the full model includes age, blood pressure, baseline eGFR, proteinuria amounts, treatment history with renin-angiotensin-aldosterone blockade or immunosuppressive drugs, and the Oxford classification, with or without ethnicity [5]. The IIgAN-PT has been well validated in various cohorts [7–9]. However, the Korean population was not included in the development of the model, and some underestimation of kidney risk was suspected in a previous report [8]. Herein, we demonstrated that the IIgAN-PT also showed acceptable predictive performance in this multicenter Korean IgAN cohort.
There is a relevant question regarding whether AI can develop a more advanced prediction model for IgAN, as this approach has recently proliferated and opened a new field of clinical prediction modeling. The ML-based approach is now actively used in clinical image reading systems [21,22] and has shown excellent performance in risk stratification, combining hundreds of complex clinical features [23,24]. As the prediction of IgAN kidney prognosis may be improved by additional medical information, the AI-based approach is a promising method for developing a model with better prognostic performance. A previous deep learning-based model showed predictive performance non-inferior to that of the IIgAN-PT; however, superior performance has not yet been reported [11]. In the current study, we developed multiple ML-based models, enhanced by deep learning-related approaches, including a wide range of variables of IgAN patients at the time of diagnosis. These models generally demonstrated acceptable performance for the prognosis of IgAN. However, the clinical utility of the IIgAN-PT was well validated in our cohort, and its performance was non-inferior to that of the models despite trialing multiple ML- and DL-based methods. In addition, the results showing the validity of the IIgAN-PT for 10-year kidney outcomes support that the model can be useful in predicting the long-term prognosis of IgAN patients [9]. Considering the generalizability, interpretability, and accessibility that have been demonstrated for the IIgAN-PT model, our current AI models are unlikely to replace the IIgAN-PT model unless they clearly outperform it in predicting the kidney prognosis of IgAN. Therefore, the current study supports the clinical utility of the IIgAN-PT, as the model is easy to use without collecting extensive medical information, unlike AI-based models.
The ML-based models failed to show superior performance compared with the IIgAN-PT, despite the inclusion of a wide range of medical information, which can be explained by several factors. First, the variables included in the full IIgAN-PT model are not mere predictors but have significant causal effects on kidney prognosis or directly reflect kidney health. Elevated blood pressure or high amounts of proteinuria are not only common in chronic kidney disease but also directly damage the kidney [25,26], and the baseline eGFR reflects the underlying kidney function impairment. The Oxford classification includes mesangial proliferation, subsequent glomerular alteration, or active inflammation such as crescent formation, and final tubulointerstitial pathology; thus, it reflects the overall pathophysiologic aspect of IgAN progression from the initial stages to late pathologic consequences [27]. Because the IIgAN-PT is constructed from these very important clinicopathologic features, variables not included in the IIgAN-PT may have only a minor impact on the prognosis of IgAN; thus, combining the effects of the variables by ML-based methods may have only a small advantage. Second, IgAN cohorts are relatively small compared with the big data that are widely used when applying AI-based methods. Although some AI-based methods are targeted at constructing prediction models for medium- to small-sized data, the superiority of ML- or DL-based methods may be weakened in datasets with a few thousand samples. A larger dataset may be required to develop a superior prognostic model; however, collecting standardized medical information from multiple cohorts and countries is challenging.
This study had some limitations that should be addressed in future research. First, as noted above, the study sample size may not have been sufficient to develop a superior model using an ML-based approach, even though we included >3,000 patients with IgAN from multiple hospitals. A multinational consortium may collect a wide range of clinical information to develop an ML-based prediction model for IgAN using a larger sample size. Second, AI can deal with additional complex data, such as digital pathologic images, combined multiomics data, and time-sequenced information [11]. Rather than the current analysis using cross-sectional baseline information, additional studies may include datasets having multiple dimensions with complex features for which the AI-based approach has superiority. Third, this study included a population with a single ethnic background. Similar to the original multinational cohort used for IIgAN-PT development, the AI-based approach may also be trialed in populations of various ethnicities. Lastly, some heterogeneity in the collection of the study variables (e.g., pathology parameters) might have existed because we retrospectively collated the information from the study hospitals.
In conclusion, the IIgAN-PT performance was validated in the current large-scale IgAN cohort in Korea. Although ML-based prediction models may provide acceptable prediction performance for IgAN prognosis, a prediction model combining diverse baseline features may not be sufficient to develop an advanced model that is superior to IIgAN-PT. Future efforts, including large-scale and high-level data on IgAN, are warranted to improve the performance of the IgAN prognostic prediction models.
Supplementary Materials
Supplementary data are available at Kidney Research and Clinical Practice online (https://doi.org/10.23876/j.krcp.23.212).
Notes
Conflicts of interest
All authors have no conflicts of interest to declare.
Funding
This study was supported by the National Research Foundation of Korea (NRF) grant funded by the Korean government (MSIT) (No. RS-2023-00219548) and cooperative research funding from the Korean Nephrology Research Foundation 2020.
Data sharing statement
The data presented in this study are available from the corresponding author upon reasonable request.
Authors’ contributions
Conceptualization, Formal analysis, Methodology: SP, YK, KCM, YGK, HL
Data curation: CHB, HC, JIP, ESK, JPL, SHP, HWK, SSH, HJC, DKK
Funding acquisition: SP, KCM, YGK, HL
Investigation: SP, CHB, HC, JIP, ESK, JPL, SHP, HWK, SHH, HJC, DKK
Writing–original draft: SP, YK, KCM, YGK, HL
Writing–review & editing: All authors
All authors read and approved the final manuscript.