
PermRanker

About

PermRanker is a statistically-aware ranking framework for fair and informative benchmarking of AI models, designed to be agnostic to computational workloads (e.g., segmentation, classification, registration) and underlying data types (e.g., 2D pathology images, 3D MRI scans, or even non-imaging data).

PermRanker's functionality is based on two stages: (i) a ranking score, based on case-wise cumulative rankings aggregated across multiple metrics and testing cases, and (ii) a rigorous statistical significance analysis via pairwise permutation testing across the ranked order of the AI models.

PermRanker has served as the official ranking mechanism for over 33 international challenges between 2017 and 2025, including the BraTS, FeTS, and ISLES challenges, held in conjunction with the Annual Scientific Meeting of the Medical Image Computing and Computer Assisted Intervention (MICCAI) Society. While mainly applied in biomedical AI challenges, PermRanker aims to address the unmet need for fair and informative benchmarking of AI models beyond this scope, to tackle real-world conditions, and thereby to contribute to streamlining the clinical translation of AI models.

Algorithm

The Ranker class compares the performance of different methods based on a set of metrics. It takes as input a dictionary of CSV files, where each file represents a method and contains the scores for a set of subjects on a set of metrics.

The ranking algorithm consists of the following steps:

  1. Combine CSVs and Scores: The class first combines all the input CSV files into a single DataFrame. This DataFrame has a hierarchical column structure, where the top level represents the metrics and the bottom level represents the subjects.

  2. Rank Methods: The class then ranks the methods based on their scores for each metric and subject. The ranking can be done using different methods, such as 'average', 'min', 'max', 'first', or 'dense'.

  3. Handle Metric Reversal: For metrics where lower values are better (e.g., error rates), the class can reverse the ranks so that lower scores get higher ranks.

  4. Aggregate Ranks: The class then aggregates the ranks across all metrics for each subject to get a per-subject average rank for each method.

  5. Calculate Cumulative Rank: The per-subject average ranks are then summed up to get a cumulative rank for each method.

  6. Determine Final Rank: The methods are then ranked based on their cumulative ranks to determine the final ranking.

  7. Perform Permutation Test: Finally, the class performs a permutation test to determine the statistical significance of the differences in the ranks of the methods. The permutation test is a non-parametric method that does not make any assumptions about the distribution of the data.

The output of the Ranker class is a pair of DataFrames: one containing the final rankings of the methods, and another containing the p-values from the permutation test.
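Steps 1 through 6 above can be sketched with pandas. This is a minimal illustration of the ranking scheme, not the package's actual implementation; the method names, metric names, and scores are all hypothetical:

```python
import pandas as pd

# Hypothetical per-subject scores for two methods on two metrics
# (dice: higher is better; hausdorff: lower is better).
scores = {
    "method_a": pd.DataFrame({"dice": [0.90, 0.80], "hausdorff": [3.0, 4.0]},
                             index=["000", "001"]),
    "method_b": pd.DataFrame({"dice": [0.85, 0.82], "hausdorff": [3.5, 5.0]},
                             index=["000", "001"]),
}

# Step 1: combine into one DataFrame (rows = methods, columns = subject/metric pairs)
combined = pd.concat({m: df.stack() for m, df in scores.items()}, axis=1).T

# Step 2: rank methods per subject/metric column; higher score -> better rank (1 is best)
ranks = combined.rank(axis=0, method="average", ascending=False)

# Step 3: re-rank "lower is better" metrics in ascending order instead
reversed_cols = [c for c in combined.columns if "hausd" in c[1].lower()]
ranks[reversed_cols] = combined[reversed_cols].rank(axis=0, method="average",
                                                    ascending=True)

# Step 4: average the ranks across metrics, per subject
per_subject = ranks.T.groupby(level=0).mean().T

# Step 5: sum the per-subject average ranks into a cumulative rank per method
cumulative = per_subject.sum(axis=1)

# Step 6: final ranking (lower cumulative rank -> better final rank)
final_rank = cumulative.rank(method="min")
```

Here `method_a` wins on both metrics for subject `000` and splits with `method_b` on subject `001`, so it ends up with the lower cumulative rank and final rank 1.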

Permutation Test

The permutation test is a non-parametric method for testing the statistical significance of an observed difference between two groups. In this case, the two groups are the ranks of two different methods.

The null hypothesis is that the two methods are equivalent, and any observed difference in their ranks is due to chance. The alternative hypothesis is that the two methods are not equivalent, and the observed difference in their ranks is statistically significant.

The test works by repeatedly shuffling the ranks between the two methods and calculating the difference in their sums. The p-value is the proportion of permutations that result in a difference as or more extreme than the observed difference.
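A minimal sketch of such a pairwise permutation test, assuming per-subject ranks for two methods as input (illustrative only, not the package's exact implementation):

```python
import numpy as np

def permutation_pvalue(ranks_a, ranks_b, iterations=1000, seed=0):
    """Two-sided permutation test on the difference of the two rank sums."""
    rng = np.random.default_rng(seed)
    ranks_a = np.asarray(ranks_a, dtype=float)
    ranks_b = np.asarray(ranks_b, dtype=float)
    observed = abs(ranks_a.sum() - ranks_b.sum())
    extreme = 0
    for _ in range(iterations):
        # Under the null hypothesis the two methods are exchangeable,
        # so their ranks can be swapped independently for each subject.
        swap = rng.random(ranks_a.size) < 0.5
        perm_a = np.where(swap, ranks_b, ranks_a)
        perm_b = np.where(swap, ranks_a, ranks_b)
        if abs(perm_a.sum() - perm_b.sum()) >= observed:
            extreme += 1
    return extreme / iterations
```

When the two rank sums are identical the observed difference is zero, so every permutation is at least as extreme and the p-value is 1.0; two methods that consistently beat each other yield a p-value near 0.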

Installation

(base) user@location $> git clone https://github.com/IUCompPath/PermRanker.git
(base) user@location $> cd PermRanker
(base) user@PermRanker $> conda create -p ./venv python=3.12 -y
(base) user@PermRanker $> conda activate ./venv
(PermRanker/venv) user@PermRanker $> pip install uv # a faster dependency manager
(PermRanker/venv) user@PermRanker $> uv pip install -e .

Verify installation

(PermRanker/venv) user@PermRanker $> ranker --help

Usage

Inputs

  1. A folder containing CSVs, or a comma-separated list of files, with extracted metrics, where each row is a subject and each column is a metric. Each CSV corresponds to a method (or a challenge participant's results) to be compared and should have the following format:

    SubjectID,metric_a,metric_b,metric_c,...
    000,float_000_a,float_000_b,float_000_c,...
    001,float_001_a,float_001_b,float_001_c,...
    002,float_002_a,float_002_b,float_002_c,...
    ...
    N,float_N_a,float_N_b,float_N_c,...
    • Only the SubjectID column is mandatory. The rest of the columns can be named as desired.
    • If the number of subjects or metrics is inconsistent across the CSVs, the package will raise an error and not proceed with the analysis.
    • Each metric should be a float value where higher values are better. If there is a metric where lower values are better, please add it to the --metrics-for-reversal CLI argument.
    • For examples, please check the sample data folder.
  2. Metrics for reversal normalization: a comma-separated list of metrics that need to be normalized in reverse. For metrics such as Hausdorff Distance and communication cost (used in the FeTS Challenge) which are defined as "higher is worse", PermRanker can normalize in reverse order.

    • This is checked in a case-insensitive manner, so C,F is equivalent to c,f.
    • The check is done by checking for the presence of the string in the metric header, rather than a "hard" check. For example, passing hausd will match hausd* in the metric headers, and will be case-insensitive. This is done to allow for flexibility in the metric names.
    • The metric string needs to be present. For example, passing dsc will not match for dice* in the metric headers.
  3. Ranking method: the tie-handling method used to rank the methods. The available options (following pandas' rank semantics) are:

    • average (default): average rank of the group
    • min: lowest rank in the group
    • max: highest rank in the group
    • first: ranks assigned in the order the methods appear in the input (not recommended)
    • dense: similar to min, but rank always increases by 1 between groups
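The tie-handling options above match pandas' rank behavior; a quick illustration of how they differ, using hypothetical scores with a tie between the top two methods:

```python
import pandas as pd

# Hypothetical scores: method_a and method_b tie, method_c trails
scores = pd.Series([0.9, 0.9, 0.7], index=["method_a", "method_b", "method_c"])

# Rank descending (higher score -> better rank) under each tie-handling method
tie_ranks = {how: scores.rank(method=how, ascending=False).tolist()
             for how in ["average", "min", "max", "first", "dense"]}
# average -> [1.5, 1.5, 3.0]   min -> [1.0, 1.0, 3.0]   max -> [2.0, 2.0, 3.0]
# first   -> [1.0, 2.0, 3.0]   dense -> [1.0, 1.0, 2.0]
```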

Outputs

  1. pvals.csv: a CSV file containing the p-values indicating the statistical significance of each method's ranking relative to every other method.
  2. ranks.csv: a CSV file containing the ranks of each method. This contains the following columns:
    1. method: the method name, which corresponds to the input CSV file name.
    2. cumulative_rank: the sum of the ranks of each method.
    3. final_rank: the final rank of each method.
    4. ${metric}_${subject}: the rank of each method for each metric and subject combination.
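The resulting ranks.csv can be inspected with pandas. A hypothetical example of its content, with the column layout described above and invented values:

```python
import pandas as pd

# Hypothetical ranks.csv content (values invented for illustration)
ranks = pd.DataFrame({
    "method": ["method_a", "method_b"],
    "cumulative_rank": [2.5, 3.5],
    "final_rank": [1, 2],
    "dice_000": [1.0, 2.0],
    "dice_001": [1.5, 1.5],
})

# Winning method according to the final rank
best = ranks.sort_values("final_rank").iloc[0]["method"]
```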

Example command line usage

(PermRanker/venv) user@PermRanker $> ranker ./data/ \  # [REQUIRED] the input folder containing CSVs
./data/output/ \  # [REQUIRED] the output folder to save the outputs (pvals.csv and ranks.csv)
--metrics-for-reversal "C,F" \  # [OPTIONAL] the metrics that need to be reversed normalized
--iterations 1000  # [OPTIONAL] the number of iterations to perform for the permutation analysis

To get detailed help, please run ranker --help.

Acknowledgements

This tool was partly supported by the Informatics Technology for Cancer Research (ITCR) program of the National Cancer Institute (NCI) at the National Institutes of Health (NIH) under award numbers U01CA242871 and U24CA279629 (PI: Spyridon Bakas). The content of this tool is solely the responsibility of the authors and does not represent the official views of the NIH.

Citation

@article{bakas2025permranker,
	author={Bakas, Spyridon and Thakur, Siddhesh and Baid, Ujjwal and Linardos, Akis and Pati, Sarthak and Doshi, Jimit and Shinohara, Russell T},
	title={My Model Is Better Than Yours! Statistically-aware Ranking for Fair Benchmarking of AI Models},
	year={2025}
}
