
Feature: Calibration Error Metrics and Losses #8505

@theo-barfoot

Description


Is your feature request related to a problem? Please describe.

Currently, MONAI does not provide built-in support for calibration metrics, such as Expected Calibration Error, or for auxiliary calibration losses that can be used to improve the calibration of medical image segmentation networks. I have implemented these features using the MONAI framework; the code, with unit tests, can be found at this repo, and the corresponding publications are here and here.


Describe the solution you’d like

I propose adding my calibration metrics and handlers, as well as auxiliary calibration losses to MONAI.

  1. Calibration Metrics

    • CalibrationErrorMetric (subclass of CumulativeIterationMetric) supporting Expected (ECE), Average (ACE), and Maximum (MCE) reductions, with batched, per-class, and background-exclusion settings.
  2. Differentiable Calibration Losses

    • HardL1ACELoss and its compound variants (HardL1ACEandCELoss, HardL1ACEandDiceLoss, HardL1ACEandDiceCELoss)
    • SoftL1ACELoss and its compound variants (SoftL1ACEandCELoss, SoftL1ACEandDiceLoss, SoftL1ACEandDiceCELoss)
      These losses implement the L1 Average Calibration Error (ACE) in hard- and soft-binned form, with options for background exclusion, one-hot encoding, custom activation, and class weighting.
  3. Ignite Handlers

    • CalibrationError inheriting from IgniteMetricHandler, to attach calibration metrics to training and evaluation engines, with automatic logging and CSV export of per-image details.
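To make the metric concrete: the class names above are the proposal's, but the underlying hard-binned calibration error can be sketched generically. The function below (illustrative only, not the proposed MONAI API) bins predicted confidences and accumulates the bin-weighted |accuracy − confidence| gap that defines ECE; ACE and MCE differ only in replacing the weighted sum with an unweighted mean or a maximum over non-empty bins.

```python
import torch

def expected_calibration_error(probs, labels, num_bins=15):
    """Hard-binned ECE: bin-weighted average of |accuracy - confidence|.

    probs:  (N,) predicted probabilities (e.g. per-voxel foreground confidence)
    labels: (N,) binary ground-truth labels (or correctness indicators)
    Illustrative sketch only -- not the API proposed in this issue.
    """
    accuracies = labels.float()
    bin_edges = torch.linspace(0.0, 1.0, num_bins + 1)
    ece = torch.zeros(1)
    for lo, hi in zip(bin_edges[:-1], bin_edges[1:]):
        in_bin = (probs > lo) & (probs <= hi)
        prop = in_bin.float().mean()  # fraction of samples in this bin
        if prop > 0:
            acc = accuracies[in_bin].mean()   # empirical accuracy in bin
            conf = probs[in_bin].mean()       # mean confidence in bin
            ece += prop * (acc - conf).abs()  # ECE term; ACE/MCE vary here
    return ece.item()
```

For example, predictions all at confidence 0.75 with only half correct yield an ECE of 0.25.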

Visualisation Utilities (optional)

I also have visualisation methods for reliability diagrams and reliability dataset histograms; however, these may be harder to integrate cleanly into the current MONAI framework.
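The data behind a reliability diagram is simple to compute even if the plotting utilities themselves prove awkward to integrate: per bin, the mean confidence, empirical accuracy, and sample count. A hedged sketch (function name and signature are illustrative, not from the repo):

```python
import torch

def reliability_bins(probs, labels, num_bins=10):
    """Per-bin mean confidence, empirical accuracy, and counts for a
    reliability diagram. Illustrative sketch; plotting is left to the user."""
    edges = torch.linspace(0.0, 1.0, num_bins + 1)
    # assign each probability to a bin via the interior edges
    idx = torch.bucketize(probs, edges[1:-1], right=False)
    confs, accs, counts = [], [], []
    for b in range(num_bins):
        mask = idx == b
        counts.append(int(mask.sum()))
        if counts[-1]:
            confs.append(probs[mask].mean().item())
            accs.append(labels[mask].float().mean().item())
        else:  # empty bin: no defined confidence/accuracy
            confs.append(float("nan"))
            accs.append(float("nan"))
    return confs, accs, counts
```

A reliability diagram then plots `accs` against `confs`, with the diagonal marking perfect calibration; `counts` gives the companion dataset histogram.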

All components are already implemented and thoroughly unit-tested in Average-Calibration-Losses, and integrate into MONAI’s bundle-based pipelines out of the box.
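To illustrate why a soft-binned variant matters for training: hard binning is non-differentiable in the bin assignments, whereas soft assignments let gradients flow to the predicted probabilities. The sketch below uses a Gaussian-kernel soft membership around bin centres; this is one possible formulation for exposition, not the implementation in Average-Calibration-Losses.

```python
import torch

def soft_l1_ace(probs, labels, num_bins=10, temperature=0.01):
    """Differentiable L1 ACE sketch: soft bin assignments via a Gaussian
    kernel around bin centres, then the mean |accuracy - confidence| over
    bins. Illustrative only -- not the implementation proposed here."""
    centres = (torch.arange(num_bins) + 0.5) / num_bins
    # soft membership weights, shape (N, num_bins); each row sums to 1
    w = torch.softmax(-(probs.unsqueeze(1) - centres) ** 2 / temperature, dim=1)
    mass = w.sum(dim=0) + 1e-12                                # weight per bin
    conf = (w * probs.unsqueeze(1)).sum(dim=0) / mass          # soft confidence
    acc = (w * labels.float().unsqueeze(1)).sum(dim=0) / mass  # soft accuracy
    return (conf - acc).abs().mean()
```

Because every operation is differentiable in `probs`, the result can be added to a Dice or cross-entropy term as an auxiliary loss, which is the role the compound classes above play.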


Describe alternatives you’ve considered

  • Third-party libraries: calibration error metrics do exist in torchmetrics and net:cal, but they are not implemented via the CumulativeIterationMetric base class used in MONAI. Differentiable calibration losses for semantic segmentation are not implemented elsewhere.

Integrating these features into MONAI directly will give users a standardised, well-tested interface for calibration, reduce duplication, and promote reproducible, robust model evaluation.


I’m happy to contribute these components as a PR to MONAI. Please let me know which of these features would be useful and how best to align with MONAI’s architecture and naming conventions; any feedback on API design or testing guidelines is welcome!
