Category: [B2]; Team name: NPL; Dataset: Chordonomicon by PierrickLeroy · Pull Request #238 · geometric-intelligence/TopoBench

PierrickLeroy · 2025-11-24T00:43:49Z

Checklist

My pull request has a clear and explanatory title.
My pull request passes the Linting test.
I added appropriate unit tests and I made sure the code passes all unit tests. (refer to comment below)
My PR follows PEP8 guidelines. (refer to comment below)
My code is properly documented, using numpy docs conventions, and I made sure the documentation renders properly.
I linked to issues and PRs that are relevant to this PR.

Description

This pull request introduces a benchmark task based on musical chord synergy and redundancy.
The goal is to assess how well topological models capture the structure of chords, which are naturally occurring higher-order objects.
The task consists of predicting a theoretically grounded information-theoretic quantity, the local O-information [1], computed from real musical data, thereby enabling evaluation at the hyperedge level.

Concretely, we propose a hyperedge regression task with the following characteristics:

Labels: local O-information [1]
Hyperedge features: empirical frequencies of chords
Node features: one-hot encoding of notes

Dataset

We integrated a chord dataset, derived from the CHORDONOMICON dataset [3], which contains over 600,000 songs annotated with their chord progressions. We preprocess this dataset by standardizing chord notation, mapping every chord to a common alphabet of 12 notes in the base version.
We then aggregate occurrences to construct a hypergraph in which each of the 226 hyperedges corresponds to a unique chord. The resulting processed datasets are publicly available on Hugging Face: link.
We additionally provide the pre-aggregation data, which can support alternative tasks such as predicting musical genre from hyperedges.
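The aggregation step described above can be sketched as follows. This is a minimal illustration on made-up chord occurrences, not the preprocessing code of this PR; all variable names (`occurrences`, `incidence`, `hyperedge_freq`) are hypothetical.

```python
from collections import Counter

import numpy as np

# Toy chord occurrences as sets of pitch classes (0 = C, ..., 11 = B).
# These values are illustrative and not taken from the real dataset.
occurrences = [
    frozenset({0, 4, 7}),   # C major
    frozenset({0, 4, 7}),   # C major again
    frozenset({9, 0, 4}),   # A minor
    frozenset({7, 11, 2}),  # G major
]

NUM_NOTES = 12

# Aggregate: one hyperedge per unique chord, with its occurrence count.
counts = Counter(occurrences)
hyperedges = sorted(counts, key=sorted)  # deterministic ordering

# Node features: one-hot encoding of the 12 notes.
node_features = np.eye(NUM_NOTES)

# Incidence matrix: entry (i, e) is 1 when note i belongs to chord e.
incidence = np.zeros((NUM_NOTES, len(hyperedges)), dtype=int)
for e, chord in enumerate(hyperedges):
    incidence[sorted(chord), e] = 1

# Hyperedge features: empirical frequency of each unique chord.
total = sum(counts.values())
hyperedge_freq = np.array([counts[c] / total for c in hyperedges])
```

Each column of `incidence` is one hyperedge, so the number of columns plays the role of the 226 (or 4,313) hyperedges mentioned in this description.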

The dataset is available in 2 versions, depending on how musical scales are treated:

single_scale: notes are merged across octaves (e.g., C♯2 and C♯3 are treated as the same pitch class), yielding 12 distinct notes.
all_scales: octave information is preserved, and notes at different octaves are treated as distinct, yielding 38 total notes. In this case the number of hyperedges is 4,313.
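The difference between the two versions comes down to whether the octave is kept in the node identifier. The sketch below assumes notes are written in scientific pitch notation (such as `C♯2`); the actual note format in the dataset may differ, and the helper names `parse_note`, `single_scale_id` and `all_scales_id` are hypothetical.

```python
import re

# Semitone offsets of the natural notes within an octave.
NATURAL = {"C": 0, "D": 2, "E": 4, "F": 5, "G": 7, "A": 9, "B": 11}
ACCIDENTAL = {"": 0, "#": 1, "♯": 1, "b": -1, "♭": -1}
NOTE_RE = re.compile(r"^([A-G])([#♯b♭]?)(-?\d+)$")


def parse_note(name):
    """Return (pitch_class, octave) for names such as 'C#2' or 'C♯3'."""
    match = NOTE_RE.match(name)
    if match is None:
        raise ValueError(f"unrecognised note name: {name!r}")
    letter, accidental, octave = match.groups()
    # Go through an absolute semitone number so that enharmonic spellings
    # crossing an octave boundary (e.g. Cb4 == B3) are handled correctly.
    semitone = 12 * (int(octave) + 1) + NATURAL[letter] + ACCIDENTAL[accidental]
    return semitone % 12, semitone // 12 - 1


def single_scale_id(name):
    # Octave is discarded: C#2 and C#3 collapse to the same node.
    return parse_note(name)[0]


def all_scales_id(name):
    # Octave is kept: C#2 and C#3 are distinct nodes.
    return parse_note(name)
```

With this convention, `single_scale_id` can only return 12 distinct values, whereas `all_scales_id` returns one identifier per (pitch class, octave) pair.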

The choice of which dataset to load is made in the configuration file (chordonomicon.yaml) or directly with an argument in the dataset class (ChordonomiconDataset).

Issue

This benchmark task introduces local O-information into the evaluation of topological deep learning (TDL) models. Local O-information is a mathematically rigorous measure of synergy and redundancy in multivariate systems.
By using it as a regression target, we aim to set up a task for which:

The ground truth has clear information-theoretic meaning.
Model performance directly reflects ability to capture and transfer information across different orders of interactions (see below).

Expressivity of TDL models

Local O-information [1], derived from the O-information [2], is an interesting quantity because it assesses, for each hyperedge, whether the information it contains is redundant (recoverable from lower-order interactions) or synergistic (emerging only from higher-order interactions).
Concretely, the local O-information for an $n$-tuple $x^n$ is given as follows (see eq. 4 in [1] for more details):

$$\omega(x^n) = (n - 2)\,h(x^n) + \sum_{j=1}^n \left[ h(x_j) - h(x^n_{-j}) \right]$$

where $h(\cdot) = -\log p(\cdot)$ is the information-content function, $x^n$ is the joint pattern corresponding to a hyperedge, $x_j$ is the value of variable $j$ alone (its marginal is obtained by summing out the other $n-1$ variables), and $x^n_{-j}$ is the pattern with variable $j$ marginalised out, that is, a function of all variables except $j$.
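As an illustration of the formula, the sketch below estimates the local O-information of a pattern from a list of samples, using plain counting for the probabilities and base-2 logarithms. It assumes generic discrete variables and is not the label-computation code of this PR; the function name `local_o_information` is hypothetical.

```python
import math


def local_o_information(pattern, samples):
    """Local O-information of `pattern` under the empirical distribution of `samples`.

    pattern: tuple of n discrete values that occurs at least once in `samples`.
    samples: list of n-tuples used to estimate all probabilities by counting.
    """
    n = len(pattern)
    total = len(samples)

    def h(indices):
        # Surprisal -log2 p of the pattern restricted to `indices`,
        # where p is the empirical marginal probability.
        hits = sum(all(s[i] == pattern[i] for i in indices) for s in samples)
        if hits == 0:
            raise ValueError("pattern does not occur in samples")
        return -math.log2(hits / total)

    everything = list(range(n))
    omega = (n - 2) * h(everything)
    for j in everything:
        rest = [i for i in everything if i != j]
        omega += h([j]) - h(rest)
    return omega


# Three copies of the same bit: redundancy-dominated, positive value.
copies = [(0, 0, 0), (1, 1, 1)]
# Third variable is the XOR of the first two: synergy-dominated, negative value.
xor = [(a, b, a ^ b) for a in (0, 1) for b in (0, 1)]

omega_copies = local_o_information((0, 0, 0), copies)
omega_xor = local_o_information((0, 0, 0), xor)
```

The two toy systems show the sign convention: the copied bits give a positive local O-information (redundancy), while the XOR system gives a negative one (synergy).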

Computing local O-information requires contrasting information across different orders, so we think it could serve as a good evaluation of model expressivity, in the same spirit as the WL tests for GNNs.

Additional context

Limitations

Due to the structured nature of musical harmony, chords follow patterns and only cover a fraction of all possible combinations of the 12 pitch classes.
As a result, the number of hyperedges in the single-scale setting remains relatively modest (226).
When all scales are included, the number increases substantially (4,313), but the corresponding empirical frequencies become more variable and contain more outliers, which in turn introduces additional noise into the labels.

References

[1] "Quantifying high-order interdependencies on individual patterns via the local O-information: Theory and applications to music analysis", Scagliarini et al.
[2] "Quantifying high-order interdependencies via multivariate extensions of the mutual information", Rosas et al.
[3] "CHORDONOMICON: A Dataset of 666,000 Songs and their Chord Progressions", Kantarelis et al.

review-notebook-app · 2025-11-24T00:43:54Z

Check out this pull request on ReviewNB

See visual diffs & provide feedback on Jupyter Notebooks.

Powered by ReviewNB

PierrickLeroy · 2025-11-24T22:33:20Z

topobench/nn/readouts/base.py

small possible workaround to allow edge level prediction, at least with NoReadOut

PierrickLeroy · 2025-11-24T22:33:40Z

topobench/nn/readouts/identical.py

small possible workaround to allow edge level prediction, at least with NoReadOut

xuanchen-liu-97 and others added 10 commits November 18, 2025 12:32

Cleaning chords data

c6d599e

1. Find the segments of chords based on the <verse> label. 2. Infer the tone for each segment. 3. Transpose the chords into Roman numeral notation.

update data cleaning

9a92db5

The file now includes a systematic analysis, but it has not yet been applied to every song in the dataframe. It also includes a dictionary for looking up the mapping between chords and scales (see the last part).

dataset intergration first part

03e956b

correct none data problem

0921b31

slight dataset modifications

16bb74e

bypass restrictive task levels if using the dummy readout NoReadOut

aa50f0c

Add files via upload

5739eb3

allow edge task level with noreadout

5a5fb10

uploaded real data, pipeline passes

c30a047

reducing number of epoch speeds up pipeline test

ba85db3

levtelyatnikov added the category-b2 Submission to TDL Challenge 2025: Mission B, Category 2. label Nov 24, 2025

PierrickLeroy added 3 commits November 24, 2025 16:01

delete processing notebook

0621096

improve code on PEP8 standards

cbdc29e

added the possibility to switch before single scale and all scales datasets

e1eca87

improved PEP

54849fc

PierrickLeroy wants to merge 14 commits into geometric-intelligence:main from PierrickLeroy:music

3 participants