Interobserver agreement analysis among renal pathologists in classification of lupus nephritis using a digital pathology image dataset: after a third evaluation
Article information
Abstract
Background
Classification of lupus nephritis is well known for low interobserver concordance, and there has been no agreement analysis of lupus nephritis among Korean renal pathologists. Inconsistent diagnosis causes confusion, increases medical costs, and can result in failure to provide appropriate therapeutic interventions. This study aimed to assess the level of agreement among Korean renal pathologists regarding the classification of lupus nephritis.
Methods
Representative glomerular images from patients diagnosed with lupus nephritis were obtained from four hospitals. Twenty-five questions were formulated, each a multiple-choice question with 14 options consisting of characteristic histopathological findings of lupus nephritis. Three rounds of surveys were conducted, with educational sessions held before the second and third surveys.
Results
Agreement was calculated using Fleiss’ κ; the mean for each round was as follows: Survey 1, 0.42 (range, 0.18–0.61); Survey 2, 0.42 (range, 0.19–0.64); and Survey 3, 0.47 (range, 0.23–0.65). Although κ after the first educational session showed no significant difference from the initial κ (p = 0.95), κ after the second educational session increased significantly compared with the initial κ (p < 0.001). The κ for each item generally increased with each educational session, but the increases were not statistically significant (p = 0.46 and p = 0.17). Additionally, the rankings of agreement for each item were relatively consistent across surveys.
Conclusion
This study conducted an interobserver agreement analysis of Korean pathologists for lupus nephritis, with the goal of increasing agreement through education. Although education increased overall agreement, items such as “mesangial hypercellularity,” “endocapillary hypercellularity,” and “neutrophils and/or karyorrhexis” remained inconsistent, likely owing to their inherent subjectivity and the limited effect of education.
Introduction
With recent developments in artificial intelligence (AI) using deep learning, AI-related studies in the field of digital pathology are expanding. A key consideration in AI-based pathology image analysis is that high reproducibility and accuracy of the pathological diagnosis are basic premises of AI learning. If there is no consensus as to what constitutes the “gold standard” diagnosis, and different pathologists make different diagnoses for the same pathology image, the basis for AI learning is compromised and the reliability of the results is low [1,2]. Therefore, a high level of diagnostic agreement is a prerequisite for AI-powered analysis.
Factors that affect diagnostic agreement among pathologists include the clarity of the definitions of diagnostic terms, cognitive biases, the method of evaluating pathological findings (quantitative or qualitative), institutional variability, cross-training, and the level of experience or training of the pathologists [1]. Several studies have investigated or attempted to improve agreement in the diagnosis of renal disease [3–10]. In particular, agreement on the diagnosis of lupus nephritis, which has a major impact on clinical prognosis and treatment decisions, is poor among pathologists [4,6,7,9].
Two previous studies on diagnostic agreement have been conducted by The Renal Pathology Study Group of the Korean Society of Pathologists (RPS-KSP) [11,12]. These studies standardized the terminologies of renal pathology and improved diagnostic agreement through training with a virtual slide atlas. In the Nephrotic Syndrome Study Network (NEPTUNE) study, intra- and interobserver variability were significantly reduced after two rounds of web-based cross-training [5]. Web-based cross-training provides pathologists with the opportunity to meet across time and space, which can dramatically improve diagnostic agreement.
In this study, we identified interobserver variability among pathologists for the accurate diagnosis and classification of lupus nephritis and attempted to improve agreement through educational training. Thus, we aimed to improve the quality and accuracy of the diagnosis of lupus nephritis and provide a gold standard for future AI-powered studies.
Methods
Ethical approval
The study protocol was approved by the Institutional Review Board (IRB) of Wonju Severance Christian Hospital (No. CR321144). Written consent to publish was waived by the IRB due to the retrospective nature of the study and the lack of access to patient clinical information.
Case selection and survey
Histopathological slides from patients diagnosed with lupus nephritis were gathered from four hospitals: Severance Hospital, Gangnam Severance Hospital, Wonju Severance Christian Hospital, and CHA University CHA Bundang Medical Center. Representative glomerular images were chosen from the slides and captured with a digital camera at 400× magnification (Olympus) at the discretion of each institutional pathologist. Twenty-five questions were formulated for the questionnaire. Each question referenced four images of a glomerulus stained with hematoxylin and eosin (H&E), periodic acid-Schiff (PAS), trichrome, and periodic acid-methenamine silver (PAMS). Multiple-choice questions with 14 options, consisting of characteristic histopathological findings of lupus nephritis, were provided. Google Forms was used as the survey platform (Fig. 1), and the survey was conducted among members of the RPS-KSP; the entire membership was provided with a web link to the questionnaire. In addition to completing the questionnaire, participants were asked how many years they had practiced as a renal pathologist and how many renal biopsies they reported per year. Three surveys were administered, with an interval of 3 months between Survey 1 and Survey 2 and 7 months between Survey 2 and Survey 3. Educational sessions were held 2 to 4 weeks before Survey 2 and Survey 3: the first was an approximately 10-minute online lecture for RPS-KSP members only, and the second an approximately 35-minute online lecture for RPS-KSP members and clinicians. The content included the previous survey results, a literature review of discordant findings, and a brief overview of diagnostic pitfalls. After the sessions, an educational presentation file covering the items with the highest discrepancies and their authoritative definitions was provided to RPS-KSP members.
Statistical analysis
Fleiss’ kappa (Fleiss’ κ) evaluates agreement among more than two raters. Fleiss’ κ was calculated for each question and for each item (0 = no agreement, 1 = perfect agreement), and separately for respondents with more than 10 years of experience. The presence of a histopathological finding was coded as 1 and its absence as 0. For question-by-question agreement, the 14 items within each question were treated as subjects, whereas for item-by-item agreement, the 25 questions were treated as subjects. A κ-value of <0.2, 0.2–0.4, 0.4–0.6, 0.6–0.8, or >0.8 was considered to reflect “poor,” “fair,” “moderate,” “good,” or “very good” agreement, respectively. Statistical significance was set at p < 0.05.
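For illustration only (the analyses in this study were performed in SPSS, as noted below), a minimal Python sketch of the item-by-item Fleiss’ κ computation on a hypothetical binary ratings matrix could look as follows; the matrix dimensions and values here are assumptions, not study data:

```python
import numpy as np
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

# Hypothetical binary ratings for one item: rows = the 25 glomerulus questions
# (subjects), columns = raters (19, the Survey 3 respondent count);
# 1 = finding marked present, 0 = absent.
rng = np.random.default_rng(0)
ratings = rng.integers(0, 2, size=(25, 19))

# aggregate_raters converts subject-by-rater codes into subject-by-category counts
counts, _ = aggregate_raters(ratings)
kappa = fleiss_kappa(counts, method="fleiss")
print(f"Fleiss' kappa for this item: {kappa:.3f}")
```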
Pre-educational agreement was compared with post-educational agreement twice: Survey 1 (pre-education) versus Survey 2 (after the first educational session), and Survey 1 versus Survey 3 (after the second educational session). Paired t tests were used with a significance level of 0.05.
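Again as a hedged sketch with synthetic values (the reported p-values come from the SPSS analysis, not this code), the paired comparison of per-question κ-values could be reproduced as:

```python
import numpy as np
from scipy import stats

# Placeholder per-question kappa values for Survey 1 and Survey 3;
# in the study, each survey comprised 25 questions.
rng = np.random.default_rng(1)
kappa_s1 = rng.uniform(0.18, 0.61, size=25)
kappa_s3 = np.clip(kappa_s1 + rng.normal(0.05, 0.05, size=25), 0, 1)

# Paired t test across the 25 questions, alpha = 0.05
t_stat, p_value = stats.ttest_rel(kappa_s1, kappa_s3)
print(f"paired t = {t_stat:.2f}, p = {p_value:.4f}")
```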
An intrinsic issue of kappa-based agreement analysis is that highly unbalanced marginal distributions (i.e., when almost all raters give the same response) inflate the chance-expected agreement, so kappa can be low or even negative despite high observed agreement. To address this issue, Gwet’s AC1 statistics were additionally computed [13].
All analyses, except for Gwet’s AC1 statistics, were performed using IBM SPSS (version 27.0; IBM Corp.). Gwet’s AC1 statistics were calculated using the following website: https://play93.shinyapps.io/Gwet_Scott/. For Gwet’s AC1, <0.2, 0.2–0.4, 0.4–0.6, 0.6–0.8, or >0.8 were considered to reflect “slight,” “fair,” “moderate,” “substantial,” or “almost perfect” agreement, respectively.
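For readers without access to the Shiny application, the standard Gwet’s AC1 formula for a single binary item rated by multiple raters can be implemented directly. The following Python sketch (with synthetic counts, not study data) follows Gwet’s published definition rather than the exact workflow of the web tool:

```python
import numpy as np

def gwet_ac1(counts):
    """Gwet's AC1 from an n_subjects x n_categories table of rating counts."""
    counts = np.asarray(counts, dtype=float)
    n, q = counts.shape
    r_i = counts.sum(axis=1)  # number of raters per subject
    # Observed agreement: average pairwise agreement across subjects
    p_a = (np.sum(counts * (counts - 1), axis=1) / (r_i * (r_i - 1))).mean()
    # Chance agreement per Gwet: based on average category prevalence
    pi_k = (counts / r_i[:, None]).mean(axis=0)
    p_e = np.sum(pi_k * (1 - pi_k)) / (q - 1)
    return (p_a - p_e) / (1 - p_e)

# Synthetic example: 25 glomeruli, "present"/"absent" calls from 19 raters
rng = np.random.default_rng(2)
present = rng.integers(0, 20, size=25)  # raters calling the finding present
table = np.column_stack([present, 19 - present])
print(f"Gwet's AC1: {gwet_ac1(table):.3f}")
```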
Results
Forty-three RPS-KSP members responded to at least one survey: 31 to Survey 1, 28 to Survey 2, and 19 to Survey 3. Of these, 16, 14, and 12, respectively, had more than 10 years of experience reporting renal biopsies for the differential diagnosis of internal medicine conditions.
The number of renal biopsies reported per year varied among the respondents. Seven respondents reported more than 300 biopsies per year, one reported 200 to 300, 11 reported 100 to 200, 12 reported 51 to 100, and seven reported 50 or fewer. Among the highly experienced pathologists, six reported more than 300 renal biopsies per year, six reported 100 to 200 per year, four reported 51 to 100 per year, and four reported 50 or fewer per year.
The κ-values for each question are presented in Supplementary Fig. 1 (available online) and Supplementary Table 1 (available online). The κ-values by item are presented in Fig. 2 and Supplementary Table 2 (available online). The mean ± standard deviation (SD) of the κ-values across the 25 questions was as follows: Survey 1, 0.417 ± 0.011; Survey 2, 0.412 ± 0.010; and Survey 3, 0.472 ± 0.013. The overall κ-value of Survey 3 was significantly higher than those of Surveys 1 and 2 (p < 0.001 and p = 0.001, respectively). The κ-values for highly experienced pathologists, who had practiced renal pathology for more than 10 years, were generally higher than those for all pathologists (Fig. 3). The mean ± SD of the κ-values for the highly experienced pathologists was as follows: Survey 1, 0.475 ± 0.019; Survey 2, 0.427 ± 0.011; and Survey 3, 0.474 ± 0.015. The κ-value of Survey 3 for the highly experienced pathologists was significantly higher than that of Survey 2 (p = 0.009). Question 8 showed poor agreement, with a κ of 0.2 or less in all three concordance assessments (Supplementary Fig. 1, available online), but substantial agreement of 0.6 or more on Gwet’s AC1 (Supplementary Fig. 2, available online).

Fig. 2. The κ-values for each item from all pathologists (inexperienced and experienced).
There was “fair” agreement for endocapillary hypercellularity and neutrophils and/or karyorrhexis (κ < 0.4), and “poor” agreement for mesangial hypercellularity (κ < 0.2) across all three surveys. However, the agreement for endocapillary hypercellularity and mesangial hypercellularity increased slightly after the educational sessions.

Fig. 3. The κ-values for each item for the experienced pathologists.
The κ-values for highly experienced pathologists were generally higher than those for all pathologists, and agreement in Survey 3 increased after the educational sessions compared with the previous survey.
The mean ± SD of the κ-values across the 14 items of lupus nephritis was as follows: Survey 1, 0.251 ± 0.033; Survey 2, 0.276 ± 0.042; and Survey 3, 0.309 ± 0.015. The κ-values for highly experienced pathologists were higher than those for all pathologists. Overall item agreement for all pathologists did not differ statistically after the educational sessions. However, for highly experienced pathologists, agreement in Survey 3 increased after the educational sessions compared with the previous survey. There was “fair” agreement for endocapillary hypercellularity and for neutrophils and/or karyorrhexis, with values of less than 0.4 in both Fleiss’ κ and Gwet’s AC1 analyses among all pathologists and among the highly experienced pathologists (Table 1, Figs. 2–4). Mesangial hypercellularity showed “poor” agreement, with both Fleiss’ κ and Gwet’s AC1 values of 0.2 or less in all three surveys. Agreement on the identification of mesangial hypercellularity and endocapillary hypercellularity increased after the two educational sessions relative to baseline, whereas agreement on the identification of neutrophils and/or karyorrhexis decreased. For highly experienced pathologists, only agreement on the identification of mesangial hypercellularity increased after the two educational sessions, while agreement on endocapillary hypercellularity and neutrophils and/or karyorrhexis decreased from pre-educational levels (Fig. 3; Supplementary Fig. 3, available online).

Table 1. Agreement of the 14 lupus nephritis descriptor items by survey (1, 2, and 3), for all pathologists (inexperienced and experienced) and for experienced pathologists

Fig. 4. Gwet’s AC1 values for each item for all pathologists (inexperienced and experienced).
Normal, global sclerosis, spike or intramembranous hole formation, fibrous crescent, and double contour, which had “poor” or even negative κ-values (κ < 0.2), showed “almost perfect” agreement on Gwet’s AC1 analysis (AC1 > 0.8). However, the agreement values for endocapillary hypercellularity and mesangial hypercellularity were less than “fair.”
In Survey 3, segmental sclerosis and adhesion between the tuft and capsule had lower κ-values than in Survey 1 (Survey 3, 0.231 and 0.345; Survey 1, 0.289 and 0.361, respectively). However, Gwet’s AC1 values for these two items were higher in Survey 3 (0.722 and 0.722, respectively) than in Survey 1 (0.685 and 0.688, respectively). Items such as normal, global sclerosis, spike or intramembranous hole formation, fibrous crescent, and double contour showed highly unbalanced marginal distributions, and thus their κ-values were uninformative. For these items, Gwet’s AC1 values indicated almost perfect agreement, with scores of at least 0.8 (Fig. 4; Supplementary Fig. 4, available online).
It is possible that respondents who dropped out of one of the three surveys differed in diligence from those who completed all three. For a rigorous comparison of pre- and post-educational agreement, agreement was therefore also analyzed only among those who completed all three surveys (Supplementary Table 3, available online). In this subgroup, the agreement for each item was slightly higher than that among participants who responded to one or more of the surveys, and the trend remained similar (Fig. 5). The increase in agreement from pre- to post-education varied by item. Of the three items with the lowest agreement, two (mesangial hypercellularity and endocapillary hypercellularity) increased in agreement after education (Gwet’s AC1, from 0.184 to 0.194 and from 0.329 to 0.334, respectively), and one (neutrophils and/or karyorrhexis) decreased (from 0.574 to 0.357). The κ and Gwet’s AC1 values of Survey 2 for the highly experienced pathologists were significantly lower than those of Survey 1 (p = 0.015 and p = 0.004, respectively).

Fig. 5. The κ and Gwet’s AC1 values for agreement among all-three-survey responders for each item.
When the analysis was restricted to respondents who completed all three surveys, agreement increased for most items; however, the trend remained similar.
Agreement between experienced and inexperienced pathologists was also compared among all-three-survey respondents. For this comparison, the definition of “experienced” was narrowed to more than 10 years of renal pathology practice and at least 100 renal biopsies diagnosed per year. The Gwet’s AC1 values of the experienced pathologists varied from item to item relative to those of the inexperienced (Supplementary Fig. 5, available online). Before the education, the experienced group (n = 6) had higher AC1 values than the inexperienced group (n = 8) for six items, including mesangial hypercellularity, endocapillary hypercellularity, fibrous crescent, wire loop lesion and/or hyaline thrombi, and double contour; after the education, this decreased to four items (endocapillary hypercellularity, spike or intramembranous hole formation, and fibrocellular and fibrous crescents) (Fig. 6). There was no significant difference between the two groups in overall agreement.

Fig. 6. Comparison of κ and Gwet’s AC1 values between experienced and inexperienced pathologists among all-three-survey responders.
Experienced was defined as practicing for at least 10 years and diagnosing at least 100 cases per year. There was no significant difference between the two groups in terms of overall agreement.
Discussion
There are few studies on concordance among pathologists in the diagnosis of lupus nephritis, and the reported concordance is low [6]. To the best of our knowledge, this is the first study to assess concordance in the identification of the pathological lesions of lupus nephritis in Korea. Since the 2018 International Society of Nephrology/Renal Pathology Society (ISN/RPS) revision of the classification of lupus nephritis, some histopathological descriptors that comprise the activity and chronicity indices have been modified or redefined [14]. With the revision, the definitions of mesangial hypercellularity, crescent, adhesion, and fibrinoid necrosis were revised, and endocapillary proliferation was renamed endocapillary hypercellularity. This is the first study to evaluate concordance using the new histopathological descriptors from the 2018 ISN/RPS revision, along with other histopathological features used for the diagnosis of lupus nephritis.
Dasari et al. [6] systematically reviewed inter-pathologist agreement on lupus nephritis and concluded that concordance was “poor” to “moderate.” In their review, leukocyte infiltration, a term similar to neutrophils in the modified activity/chronicity index, exhibited “poor” agreement, in line with our results (κ-value for neutrophils and/or karyorrhexis, <0.4). However, the agreement for endocapillary hypercellularity in our study was lower than in previous studies, which showed “moderate” agreement (intraclass correlation coefficient [ICC] or κ-value, >0.4) [6,7,9,15], despite two educational sessions. This is likely due to the inclusion of mesangial hypercellularity as an option, unlike in previous studies, or to unclear definitions. Most studies used a crude assessment, scoring the percentage of involvement of the total glomeruli in the slide according to a cutoff [7,9,15], whereas this study used a more rigorous evaluation of endocapillary hypercellularity per glomerulus. Although mesangial hypercellularity and endocapillary hypercellularity often coexist, the 2018 revision does not provide criteria for distinguishing between them. The Oxford Working Group reported that the concordance of segmental endocapillary hypercellularity was “fair” [3]. They also reported that mesangial cellularity was difficult to score in segments with endocapillary hypercellularity; therefore, they scored glomeruli as “indeterminate” for mesangial cellularity in the presence of global endocapillary hypercellularity. Cellular and fibrous crescents improved from “poor” to “moderate” agreement previously (cellular ICC, 0.5 and 0.55 ± 0.07; fibrous ICC, 0.25 ± 0.09 and 0.58) to “good” to “almost perfect” agreement in this study (cellular κ, >0.6; fibrous Gwet’s AC1, >0.9) [15,16]. This may be attributed, first, to the lowering of the cutoff for extracapillary proliferation from 25% to 10% [14], which reduced uncertainty by allowing previously borderline lesions to be classified as crescentic, thereby improving agreement. Second, a more detailed definition of a fibrocellular/fibrous crescent [14], which was not previously available, may have helped improve concordance. Although fibrinoid necrosis was defined in detail for the first time in the revision [14], its degree of agreement (“fair” to “moderate” on κ-values, 0.32 to 0.47; “substantial” on Gwet’s AC1, 0.61 to 0.76) remained similar to that previously reported for fibrinoid necrosis/karyorrhexis (ICC, 0.26, 0.48, and 0.45 ± 0.09) [6], possibly because it is now assessed separately rather than combined with karyorrhexis. In both the NEPTUNE and the Nephrotic Syndrome Study Network Digital Pathology Scoring System studies, agreement was higher than before after grouping individual descriptors [5,10].
Mesangial hypercellularity is not a component of the activity or chronicity index, but it is a key feature that can be diagnostic of class II lupus nephritis when present with appropriate immunofluorescence or electron microscopy findings, and it has not been addressed in previous lupus nephritis concordance studies [17]. The definition of mesangial hypercellularity in the ISN/RPS revision was taken from the definition in the Oxford classification of immunoglobulin A (IgA) nephropathy, and the cutoff was increased from three cells to four cells, which was emphasized in the educational sessions of this study. Despite the more detailed definition and a minimal increase in concordance after two educational sessions, mesangial hypercellularity had the lowest agreement among the items; this has also been frequently observed in other studies [18–20]. According to concordance studies on IgA nephropathy, there was “moderate” to “poor” agreement in determining the presence of mesangial hypercellularity in more than half of the biopsied glomeruli, suggesting that agreement on the presence of mesangial hypercellularity in a single glomerulus would be even lower. Furthermore, it is not yet known whether a clear-cut distinction between mesangial hypercellularity and endocapillary hypercellularity can be made in class III and IV lesions [14]. It is also unclear whether the cutoff of four cells for mesangial hypercellularity refers to mesangial cells alone or also includes inflammatory cells [14]. More specific definitions will be required in the future (Supplementary Table 4, available online).
Some items had low κ-values despite high observed agreement, owing to the “prevalence paradox” of Fleiss’ κ [13,21,22]: when responses are very unevenly distributed across categories, chance-expected agreement is inflated, and κ can be far lower than the observed agreement. To compensate, we performed Gwet’s AC1 analysis. Given that these limitations of kappa, pointed out in previous studies, were also evident in some items of this study, Gwet’s AC1 is a more appropriate measure of agreement than Fleiss’ κ, especially when observed agreement is high [23–25].
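To make the paradox concrete, consider a hypothetical simplified two-rater example (the study itself used the multi-rater forms of both statistics): two raters evaluate 100 glomeruli for a rare finding, both call it absent in 90, and each alone calls it present in 5, never agreeing on “present.”

```latex
\begin{align*}
p_o &= 0.90 \\
p_e^{\kappa} &= (0.05)(0.05) + (0.95)(0.95) = 0.905,
  & \kappa &= \frac{0.90 - 0.905}{1 - 0.905} \approx -0.05 \\
\pi &= \tfrac{1}{2}(0.05 + 0.05) = 0.05, \quad
  p_e^{\mathrm{AC1}} = 2\pi(1 - \pi) = 0.095,
  & \mathrm{AC1} &= \frac{0.90 - 0.095}{1 - 0.095} \approx 0.89
\end{align*}
```

Despite 90% observed agreement, κ is negative while AC1 is “almost perfect,” which is exactly the pattern observed here for rare items such as global sclerosis and spikes.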
It is noteworthy that, even with the narrower definition of experienced pathologists, fewer than half of the items showed higher agreement among the experienced than among the inexperienced, the differences were not significant, and the gap narrowed further after education. This differs from previous studies that showed higher concordance among experts [5,6] and suggests that, at least among Korean nephropathologists, the level of experience does not necessarily correlate with higher concordance on lupus nephritis glomeruli. However, this study also found that agreement on some items increased with the educational sessions, underscoring the importance of regular training of pathologists, at least for those items.
This study is more detailed and systematic than previous work: it used digital images to assess agreement on the components of the activity and chronicity indices of lupus nephritis for each glomerulus, and it is the first concordance study to use the definitions of the 2018 ISN/RPS revision. It is also more objective and generalizable than agreement assessments based on a small number of pathologists, as it included a relatively large number of pathologists and achieved a high response rate. Each question included four images, stained with H&E, PAS, trichrome, and PAMS, to represent the diagnostic setting. The educational sessions were successful in improving agreement, and the benefits were immediately applicable in the clinic, as the majority of the pathologists worked at multiple institutions.
This study has some limitations. It included only glomeruli and did not evaluate agreement on tubulointerstitial or vascular lesions. Glomerular selection bias was unavoidable. Few glomeruli showed global sclerosis or spikes; therefore, the reliability of the agreement for these two items is questionable. A post hoc review of the glomerular images revealed no typical images in which spikes or global sclerosis were easily identifiable; additional images should therefore be included in future assessments. The education was a one-way lecture, which is likely less effective than an interactive open-round meeting. Especially for experienced pathologists, an interactive open-round meeting, in which attendees could comment on one another’s assessments and discuss problematic points in depth, might lead to better agreement. Finally, the study was limited to Korean patients and pathologists.
The treatment of lupus nephritis is based on histopathological classification and the activity/chronicity indices, and appropriate treatment affects patient prognosis. Moreover, to train a machine-learning model effectively, the training data must be highly reliable, which is difficult to achieve when histopathological diagnostic agreement among pathologists is low. This study showed improvement in agreement after two educational sessions, which is immediately applicable in clinical practice and provides a basis for the development of accurate AI models.
Supplementary Materials
Supplementary data are available at Kidney Research and Clinical Practice online (https://doi.org/10.23876/j.krcp.24.185).
Notes
Conflicts of interest
The authors have no conflicts of interest to declare.
Funding
This study was supported by a grant from the KOREAN NEPHROLOGY RESEARCH FOUNDATION (Renal Pathology Research Grant 2021 to ME). The sponsor had no role in the study design, data collection, or analyses.
Acknowledgments
We would like to thank Dr. Dongwook Kim for his help with the statistical analysis and the RPS-KSP members for their active participation.
Data sharing statement
The data presented in this study are available from the corresponding author upon reasonable request.
Authors’ contributions
Conceptualization: JYP, BJL, ME, SEC
Data collection: All authors
Formal analysis: JYP, SEC, NJ
Funding acquisition: ME
Writing–original draft: JYP, SEC
Writing–review & editing: JYP, SEC, ME
All authors read and approved the final manuscript.