Challenge Overview

The LNDb challenge is made up of four different parts related to the automatic classification of CT scans according to the 2017 Fleischner Society pulmonary nodule guidelines for patient follow-up recommendation:

  • Main Challenge - Fleischner Classification: From chest CT scans, participants must predict the correct follow-up according to the 2017 Fleischner guidelines;
  • Sub-Challenge A - Nodule Detection: From chest CT scans, participants must detect pulmonary nodules;
  • Sub-Challenge B - Nodule Segmentation: Given a list of >3mm nodule centroids, participants must segment the nodules in the corresponding chest CT scans;
  • Sub-Challenge C - Nodule Texture Characterization: Given a list of nodule centroids, participants must classify nodules into three texture classes - solid, part solid and GGO.

Participants may choose to take part in the main challenge only, in one or more sub-challenges, or in all challenges. Each task will be evaluated separately, and a prize will be awarded to the best performing method in each challenge.

Main Challenge - Fleischner Classification

The main challenge is the automatic classification of CT scans according to the 2017 Fleischner Society pulmonary nodule guidelines for patient follow-up recommendation. The Fleischner guidelines are widely used for patient management in the case of nodule findings and consist of four classes:

        0) No routine follow-up required or optional CT at 12 months according to patient risk;

        1) CT at 6-12 months required;

        2) CT at 3-6 months required;

        3) CT, PET/CT or tissue sampling at 3 months required.

The Fleischner score can be computed directly from the radiologist nodule annotations according to a set of rules taking into account the number of nodules (single or multiple), their volume (<100mm³, 100-250mm³ and ⩾250mm³) and texture (solid, part solid and GGO).

Because most CTs in LNDb contain annotations from several radiologists, these annotations were merged to obtain a single Fleischner score per CT, as provided in trainFleischner.csv. The Fleischner scores were computed as follows:

  • Findings annotated by different radiologists in the same scan were considered to be a unique finding if the Euclidean distance between their centroids was smaller than or equal to the maximum equivalent diameter of the two findings. For findings of equivalent diameter smaller than 3mm, an equivalent diameter of 3mm was considered;
  • Findings marked as a nodule by a radiologist were considered to be a nodule independent of other radiologist annotations;
  • Nodule volume was computed from the segmentation. If multiple radiologists identified the nodule, the volume considered was the average of the volumes of the individual segmentations;
  • Nodule texture was recast from the five classes in the LNDb annotation (1-GGO, 2-intermediate, 3-part solid, 4-intermediate, 5-solid) into the three classes of the Fleischner guidelines by considering 1-2 as GGO, 3 as part solid and 4-5 as solid. If multiple radiologists identified the nodule, the average texture was computed and mapped to the three Fleischner classes by considering averages below 2.3(3) as GGO, between 2.3(3) and 3.6(6) as part solid and above 3.6(6) as solid;
  • The resulting nodule list was used to calculate the Fleischner score. A script to compute the Fleischner score for a given CT and respective class probabilities is available for download (calcFleischner.py).
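The merging and recasting rules above can be expressed compactly in code. The sketch below is illustrative only: the finding representation and function names are assumptions, the equivalent diameter is taken to be that of a sphere of the same volume, and the official scoring logic is the one shipped in calcFleischner.py.

```python
import numpy as np

def equivalent_diameter(volume_mm3):
    # Diameter of the sphere with the same volume (assumed definition),
    # floored at 3 mm as stated in the rules above.
    return max(3.0, (6.0 * volume_mm3 / np.pi) ** (1.0 / 3.0))

def recast_texture(mean_texture):
    # 5-class LNDb texture -> 3 Fleischner classes (thresholds 2.3(3), 3.6(6)).
    if mean_texture < 7.0 / 3.0:
        return 'GGO'
    if mean_texture <= 11.0 / 3.0:
        return 'part solid'
    return 'solid'

def merge_findings(findings):
    """Group per-radiologist findings into unique findings.

    Each finding is an illustrative dict with keys 'centroid' (xyz, mm),
    'volume' (mm^3), 'is_nodule' (bool) and 'texture' (1-5 scale).
    """
    groups = []
    for f in findings:
        for group in groups:
            # Same finding if centroid distance <= max equivalent diameter.
            if any(np.linalg.norm(np.subtract(f['centroid'], g['centroid']))
                   <= max(equivalent_diameter(f['volume']),
                          equivalent_diameter(g['volume']))
                   for g in group):
                group.append(f)
                break
        else:
            groups.append([f])
    return [{
        # A nodule if any radiologist marked it as one.
        'is_nodule': any(g['is_nodule'] for g in group),
        # Average volume and texture over radiologists.
        'volume': float(np.mean([g['volume'] for g in group])),
        'texture': recast_texture(np.mean([g['texture'] for g in group])),
    } for group in groups]
```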

The evaluation of the submitted methods for the main challenge takes into account only the final Fleischner score obtained. For this, participants should submit a csv file that contains one scan per line. Each line holds the LNDb CT ID and the predicted probability of the scan belonging to each of the Fleischner classes. An example submission is available for download (predictedFleischner.csv). For the purpose of score calculations, the class with maximum probability will be treated as the predicted Fleischner class. If two classes tie for the maximum probability, the class with the higher index will be treated as the predicted Fleischner class.
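Note that this tie-breaking rule is the opposite of NumPy's default: np.argmax returns the lowest index among tied maxima. A minimal sketch of the higher-index rule stated above (the variable names are illustrative):

```python
import numpy as np

probs = np.array([0.4, 0.1, 0.4, 0.1])  # example per-class probabilities

# np.argmax returns the lowest index on ties; reversing the array first
# breaks ties toward the higher class index, as required above.
pred_class = len(probs) - 1 - int(np.argmax(probs[::-1]))
assert pred_class == 2  # plain np.argmax(probs) would give 0
```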


The submitted Fleischner score predictions are compared to the ground truth, and agreement is computed according to the Fleiss-Cohen weighted Cohen's kappa:

$$\kappa_w = \frac{\sum_{i=1}^{k}\sum_{j=1}^{k} w_{ij}\,p_{ij} \;-\; \sum_{i=1}^{k}\sum_{j=1}^{k} w_{ij}\,p_{i*}\,p_{*j}}{1 \;-\; \sum_{i=1}^{k}\sum_{j=1}^{k} w_{ij}\,p_{i*}\,p_{*j}},$$

where $p_{ij}$ is the proportion of cases rated by observer 1 as class $i$ and by observer 2 as class $j$, $*$ is a wildcard so that $p_{*j}$ is the proportion of cases rated by observer 2 as class $j$, and $w_{ij}$ is the weight for class combination $ij$ according to

$$w_{ij} = 1 - \frac{(i-j)^2}{(k-1)^2},$$

for a rating consisting of $k$ classes $(C_1, C_2, \ldots, C_k)$.
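The same metric is used for sub-challenge C below. For reference, a minimal sketch of the computation follows; sklearn.metrics.cohen_kappa_score with weights='quadratic' computes the same quantity, so the function below is illustrative rather than the official evaluation code.

```python
import numpy as np

def fleiss_cohen_kappa(y_true, y_pred, k):
    """Quadratically (Fleiss-Cohen) weighted Cohen's kappa for k classes.

    Equivalent to sklearn.metrics.cohen_kappa_score(y_true, y_pred,
    weights='quadratic'); labels are assumed to be integers in 0..k-1.
    """
    p = np.zeros((k, k))
    for t, q in zip(y_true, y_pred):
        p[t, q] += 1
    p /= p.sum()                                   # joint proportions p_ij
    i, j = np.indices((k, k))
    w = 1.0 - (i - j) ** 2 / (k - 1) ** 2          # Fleiss-Cohen weights w_ij
    po = (w * p).sum()                             # observed weighted agreement
    pe = (w * np.outer(p.sum(1), p.sum(0))).sum()  # agreement expected by chance
    return (po - pe) / (1.0 - pe)
```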


Sub-Challenge A - Nodule Detection

Sub-challenge A is the automatic detection of pulmonary nodules in LNDb CTs. All nodules, independent of their size and characteristics (including nodules <3mm), should be detected.

Ground truth nodule locations were obtained using the same methodology as for the main challenge:

  • Findings annotated by different radiologists in the same scan were considered to be a unique finding if the Euclidean distance between their centroids was smaller than or equal to the maximum equivalent radius of the two findings, or smaller than or equal to 3mm;
  • Findings marked as a nodule by a radiologist were considered to be a nodule independent of other radiologist annotations;
  • The centroid of a finding marked by several radiologists was considered to be the average of the centroids of the individual segmentations.

The evaluation of the submitted methods for sub-challenge A takes into account the sensitivity and false positive rate per scan. For this, participants should submit a csv file that contains one nodule candidate per line. Each line holds the LNDb CT ID, the xyz coordinates of the finding in world coordinates and a float between 0 and 1 corresponding to the predicted probability of the finding being a nodule. An example submission is available for download (predictedNodulesA.csv).


The submitted nodule candidates are then compared to the ground truth annotations. A candidate is considered a true positive if it matches a nodule from the ground truth. A candidate matches a ground truth nodule if the Euclidean distance between the predicted centroid and the nodule's centroid is smaller than or equal to the maximum equivalent diameter of the ground truth nodule. For nodules of equivalent diameter smaller than 3mm, an equivalent diameter of 3mm is considered. True nodules with no match are considered false negatives. A candidate is considered a false positive if there is no ground truth nodule for which this rule holds. A candidate that matches a non-nodule will also be considered a false positive.
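A minimal sketch of this matching rule, assuming each ground truth nodule is given as a (centroid, volume) pair; the official evaluation code may organise the data differently:

```python
import numpy as np

def is_true_positive(candidate_xyz, gt_nodules):
    """Check a candidate centroid against ground truth nodules.

    `gt_nodules` is an illustrative list of (centroid_xyz, volume_mm3) pairs.
    """
    for centroid, volume in gt_nodules:
        # Equivalent diameter of a sphere of the same volume, floored at 3 mm.
        d = max(3.0, (6.0 * volume / np.pi) ** (1.0 / 3.0))
        if np.linalg.norm(np.asarray(candidate_xyz) - np.asarray(centroid)) <= d:
            return True
    return False
```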

As in previous challenges on lung nodule detection, analysis is performed on the free receiver operating characteristic (FROC) curve. To obtain a point on the FROC curve, only the findings whose degree of suspicion (probability of being a true nodule) is above a threshold t are selected, and the sensitivity and average number of false positives per scan are determined. All thresholds that define a unique point on the FROC curve are evaluated. The point with the lowest false positive rate is connected to (0,0). For points with a false positive rate higher than the computed maximum, the sensitivity of the point with the highest false positive rate is considered. Mean sensitivity is computed at 7 predefined false positive rates: 1/8, 1/4, 1/2, 1, 2, 4, and 8 FPs per scan:

$$s_{mean} = \frac{1}{7} \sum_{i \in \{1/8,\, 1/4,\, 1/2,\, 1,\, 2,\, 4,\, 8\}} s(i),$$

where $s(i)$ is the sensitivity at false positive rate $i$. This performance metric was introduced in the ANODE09 challenge and is described in detail in the ANODE09 paper.
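A sketch of the mean-sensitivity computation under these rules, using linear interpolation between FROC points (an assumption; the official script may interpolate differently):

```python
import numpy as np

FP_RATES = [1/8, 1/4, 1/2, 1, 2, 4, 8]

def mean_froc_sensitivity(fps, sens):
    """Mean sensitivity at the 7 predefined false positive rates.

    `fps` and `sens` are the FROC points sorted by increasing false
    positives per scan, as produced by sweeping the threshold t.
    """
    fps = np.concatenate(([0.0], fps))    # connect the curve to (0, 0)
    sens = np.concatenate(([0.0], sens))
    # np.interp holds the last sensitivity for rates beyond the curve maximum,
    # matching the rule stated above.
    return float(np.mean(np.interp(FP_RATES, fps, sens)))
```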

To account for observer variability, average sensitivity is computed at different agreement levels. Two different FROC curves are computed considering: 1) all nodules (agreement level 1); 2) nodules marked by at least two radiologists (agreement level 2). In this way, the more consensual nodules (marked by a higher number of radiologists) carry more weight in the final score, as they appear in all agreement levels. The detection performance score is thus computed as the average of the FROC mean sensitivity at each agreement level:
$$\text{score} = \frac{1}{2} \sum_{a=1}^{2} s_{mean}(a),$$

where $s_{mean}(a)$ is the mean sensitivity at the predefined false positive rates for agreement level $a$.


Sub-Challenge B - Nodule Segmentation

Sub-challenge B is the automatic segmentation of pulmonary nodules ⩾3mm in LNDb CTs.

For the test set CTs, a list of nodule centroids will be provided, mixed with a high number of false positives (in order not to invalidate the main challenge and sub-challenge A). Note that this list should not be used for any purpose related to the main challenge and sub-challenge A; using it will lead to disqualification of the team/participant. The list will be given in a csv file (testNodules.csv) that contains one finding per line. Each line holds the LNDb CT ID, the finding's unique ID and the xyz coordinates of the finding in world coordinates. For evaluation, only the lines corresponding to true nodules ⩾3mm will be considered.

Participants should submit an 80x80x80 cube with voxel size 0.6375mm centered on the nodule centroid with the predicted segmentation for each nodule. A script to extract such a cube from a CT given the nodule centroid (getNoduleCubes.py) as well as an example submission (predictedNodulesB.zip) are provided for download. Each cube must be saved in a separate file in NumPy format and named LNDb-XXXX_findingN, where XXXX is the LNDb CT ID and N is the finding's ID according to testNodules.csv. Cubes should be populated with 0s and 1s according to whether each voxel belongs to the nodule or not. For the purpose of score calculations (both distance metrics and volume metrics), the biggest interconnected object will be treated as the predicted nodule segmentation.
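A minimal sketch of preparing one submission file under these rules, assuming a binary (80, 80, 80) prediction and 4-digit zero-padded CT IDs (the padding is an assumption):

```python
import numpy as np
from scipy import ndimage

def save_prediction(cube, lndb_id, finding_id):
    """Keep the biggest interconnected object and save in the required layout."""
    labels, n = ndimage.label(cube)  # label 3D connected components
    if n > 1:
        # Keep only the largest component, as used for score calculations.
        sizes = ndimage.sum(cube, labels, range(1, n + 1))
        cube = labels == (np.argmax(sizes) + 1)
    # File name per the stated convention (zero-padding assumed).
    np.save('LNDb-{:04d}_finding{}.npy'.format(lndb_id, finding_id),
            cube.astype(np.uint8))
```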

To measure the degree of accuracy of the segmentation, three distance metrics will be computed:

  • Modified Jaccard index (J*) computed as a measure of overlap between the predicted volume (V) and the reference volume (Vr), giving a measurement value between 0 (full overlap) and 1 (no overlap):
    $$J^* = 1 - \frac{|V \cap V_r|}{|V \cup V_r|};$$

  • Mean average distance (MAD) between the predicted surface (S) and the reference surface (Sr) defined as:
    $$MAD = \frac{1}{2}\left(d_{mean}(S, S_r) + d_{mean}(S_r, S)\right),$$
    where $d_{mean}(S_1,S_2)$ is the mean of distances between every surface voxel in $S_1$ and the closest surface voxel in $S_2$;
  • Hausdorff distance (HD) between the predicted surface (S) and the reference surface (Sr) defined as:
    $$HD = \max\left(d_{max}(S, S_r),\; d_{max}(S_r, S)\right),$$
    where $d_{max}(S_1,S_2)$ is the maximum of distances between every surface voxel in $S_1$ and the closest surface voxel in $S_2$.

To measure the degree of accuracy of the segmentation for extraction of clinical indices (volume), three metrics are computed comparing the predicted volume and the reference volume:

  • Modified Pearson correlation coefficient r*=1-r, where r is the Pearson correlation coefficient between predicted and reference volumes;
  • Bias (b) computed as the mean absolute difference of predicted and reference volume;
  • Standard deviation (σ) of the difference of predicted and reference volumes.

Note that J*, MAD and HD will be computed in reference to the segmentation of each radiologist and then averaged per nodule. In this way, a nodule annotated by several radiologists has the same weight for the final score as a nodule annotated by a single radiologist. However, r*, b and σ will be computed in comparison to the average volume computed from the segmentations of all radiologists.
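A sketch of the three volume metrics, assuming one predicted volume and one reference (radiologist-averaged) volume per nodule; the function name and data layout are illustrative:

```python
import numpy as np

def volume_metrics(pred_vols, ref_vols):
    """r*, bias and standard deviation for predicted vs reference volumes."""
    pred = np.asarray(pred_vols, dtype=float)
    ref = np.asarray(ref_vols, dtype=float)
    r_star = 1.0 - np.corrcoef(pred, ref)[0, 1]  # modified Pearson r* = 1 - r
    bias = np.mean(np.abs(pred - ref))           # mean absolute difference
    sigma = np.std(pred - ref)                   # SD of the differences
    return r_star, bias, sigma
```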

The final score, through which all submitted methods will be ranked, is calculated as the average of all six metrics after normalization according to the maximum among all participants, so that each individual metric takes a value between 0 (worst case among all participants) and 1 (perfect fit between the reference and predicted segmentation):

$$\text{score} = \frac{1}{6} \sum_{m \in \{J^*,\, MAD,\, HD,\, r^*,\, b,\, \sigma\}} \left(1 - \frac{m}{m_{max}}\right),$$

where $m_{max}$ is the maximum value of metric $m$ among all participants.

Sub-Challenge C - Nodule Texture Characterization

Sub-challenge C is the automatic characterization of the texture of pulmonary nodules. Three texture classes will be considered, following the classification in the Fleischner guidelines: 1) Ground glass opacities (GGO), 2) Part solid nodules, 3) Solid nodules.

Ground truth nodule texture was recast from the five classes in the LNDb annotation (1-GGO, 2-intermediate, 3-part solid, 4-intermediate, 5-solid) into the three classes by considering 1-2 as GGO, 3 as part solid and 4-5 as solid. If multiple radiologists identified the nodule, the average texture was computed and mapped to the three Fleischner classes by considering averages below 2.3(3) as GGO, between 2.3(3) and 3.6(6) as part solid and above 3.6(6) as solid.

For the test set CTs, a list of nodule centroids will be provided, mixed with a high number of false positives (in order not to invalidate the main challenge and sub-challenge A). Note that this list should not be used for any purpose related to the main challenge and sub-challenge A; using it will lead to disqualification of the team/participant. The list will be given in a csv file (testNodules.csv) that contains one finding per line. Each line holds the LNDb CT ID, the finding's unique ID and the xyz coordinates of the finding in world coordinates. For evaluation, only the lines corresponding to true nodules will be considered.

Participants should submit a csv file that contains one nodule per line. Each line holds the LNDb CT ID, the finding's unique ID, the xyz coordinates of the finding in world coordinates and three columns representing the predicted probability of the nodule belonging to each of the three texture classes. An example submission is available for download (predictedNodulesC.csv). For the purpose of score calculations, the class with maximum probability will be treated as the predicted texture class. If two classes tie for the maximum probability, the class with the lower index will be treated as the predicted texture class.


The submitted texture predictions are compared to the ground truth, and agreement is computed according to the Fleiss-Cohen weighted Cohen's kappa:

$$\kappa_w = \frac{\sum_{i=1}^{k}\sum_{j=1}^{k} w_{ij}\,p_{ij} \;-\; \sum_{i=1}^{k}\sum_{j=1}^{k} w_{ij}\,p_{i*}\,p_{*j}}{1 \;-\; \sum_{i=1}^{k}\sum_{j=1}^{k} w_{ij}\,p_{i*}\,p_{*j}},$$

where $p_{ij}$ is the proportion of cases rated by observer 1 as class $i$ and by observer 2 as class $j$, $*$ is a wildcard so that $p_{*j}$ is the proportion of cases rated by observer 2 as class $j$, and $w_{ij}$ is the weight for class combination $ij$ according to

$$w_{ij} = 1 - \frac{(i-j)^2}{(k-1)^2},$$

for a rating consisting of $k$ classes $(C_1, C_2, \ldots, C_k)$.


Cross-Validation

The complete dataset was divided into 5 subsets taking into account a balanced Fleischner score distribution among the four classes. One subset was reserved for testing (released upon paper submission to ICIAR 2020). The other 4 subsets should be used for 4-fold cross-validation. The list of CTs belonging to each subset is available in a csv file (trainFolds.csv), where each column lists the CTs belonging to one subset.
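A minimal sketch of iterating over the folds, assuming trainFolds.csv has a header row and one subset of CT IDs per column:

```python
import csv

# Read one subset of CT IDs per column (header row assumed and skipped).
with open('trainFolds.csv') as f:
    rows = list(csv.reader(f))
folds = [[row[c] for row in rows[1:] if c < len(row) and row[c]]
         for c in range(len(rows[0]))]

# 4-fold cross-validation: hold out one subset per fold.
for i, val_ids in enumerate(folds):
    train_ids = [ct for j, fold in enumerate(folds) if j != i for ct in fold]
    # ... train on train_ids, validate on val_ids ...
```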