Mondrian fix #1835
Conversation
PR online-ml#1801 fixed the missing copy issue for the branch case. However, a similar bug still exists in the leaf case. When a leaf is split and generates two new leaves, the leaf that contains the new sample gets updated later, but the other leaf—which is supposed to inherit the old leaf's state—fails to copy the bounding boxes correctly.
Fix missing dictionary copy in Mondrian tree replant

When replanting a MondrianNode, `memory_range_min` and `memory_range_max` were being assigned by reference rather than by value. Because these attributes are dictionaries, this shared reference caused unintended bounding box corruption when the original leaf's boundaries were modified. Appending `.copy()` to both dictionary assignments ensures they are copied by value, resolving the bounding box overlap issue. Fixes online-ml#1834
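For readers unfamiliar with this bug class, the shared-reference problem the commit fixes can be reproduced with plain dictionaries. This is a minimal standalone sketch, not river code; the variable names only mirror the `memory_range_min` attribute mentioned above.

```python
# Minimal reproduction of the shared-reference bug fixed in this PR.
# `memory_range_min` is a dict of per-feature lower bounds in MondrianNode.

old_leaf_range_min = {"x0": 0.1, "x1": 0.4}

# Buggy replant: the new leaf shares the SAME dict object as the old leaf.
new_leaf_range_min = old_leaf_range_min

# Mutating the old leaf's bounding box now corrupts the new leaf too.
old_leaf_range_min["x0"] = -5.0
assert new_leaf_range_min["x0"] == -5.0  # unintended corruption

# Fixed replant: copy by value, so the two leaves evolve independently.
old_leaf_range_min = {"x0": 0.1, "x1": 0.4}
new_leaf_range_min = old_leaf_range_min.copy()
old_leaf_range_min["x0"] = -5.0
assert new_leaf_range_min["x0"] == 0.1  # unaffected by the mutation
```

Since only the dictionary values are floats here, a shallow `.copy()` is sufficient; nested mutable values would need `copy.deepcopy`.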
This reverts commit 08b40c3.
---
Additionally, looking at onelearn/datasets/loaders.py, it appears that the paper applies Min-Max scaling (to the [0, 1] range) to numerical features and one-hot encoding to categorical data. This preprocessing step might account for the discrepancy in accuracy reported in #1825. I also want to add that the onelearn version assumes the total number of classes is known a priori, whereas the river version dynamically counts only the seen classes (as mentioned in #1170).
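The preprocessing described above can be sketched in plain Python, independent of onelearn. These helpers are illustrative, not functions from either library: numeric columns are rescaled to [0, 1] using the global batch min/max, and categorical columns become 0/1 indicator features.

```python
def min_max_scale(column):
    """Scale a numeric column to [0, 1] using the global min/max (batch mode)."""
    lo, hi = min(column), max(column)
    span = (hi - lo) or 1.0  # avoid division by zero on constant columns
    return [(v - lo) / span for v in column]


def one_hot(column):
    """One-hot encode a categorical column as dicts of 0/1 indicators."""
    categories = sorted(set(column))
    return [{c: int(v == c) for c in categories} for v in column]


scaled = min_max_scale([2.0, 4.0, 10.0])     # -> [0.0, 0.25, 1.0]
encoded = one_hot(["red", "blue", "red"])    # first row -> {"blue": 0, "red": 1}
```

Note that this batch version needs the whole column up front, which is exactly the assumption a streaming pipeline cannot make.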
---
Your fix makes sense. Could you try seeing what happens if you scale the features in [0, 1]? Also, once you're done, you'll have to add a release note to
**Benchmark results**

I ran the benchmarks from #1825 against this branch.

**Benchmark script**

```python
"""Benchmark for Mondrian tree fix (PR #1835)."""
import time

from river import datasets, forest, metrics
from river.tree.mondrian.mondrian_tree_classifier import MondrianTreeClassifier


def bench_single_tree_phishing():
    model = MondrianTreeClassifier(seed=1)
    metric = metrics.Accuracy()
    t0 = time.perf_counter()
    for x, y in datasets.Phishing():
        y_pred = model.predict_one(x)
        metric.update(y, y_pred)
        model.learn_one(x, y)
    elapsed = time.perf_counter() - t0
    print(f"MondrianTreeClassifier on Phishing: {metric.get():.4%} ({elapsed:.2f}s)")


def bench_amf_phishing():
    model = forest.AMFClassifier(n_estimators=10, seed=1)
    metric = metrics.Accuracy()
    t0 = time.perf_counter()
    for x, y in datasets.Phishing():
        y_pred = model.predict_one(x)
        metric.update(y, y_pred)
        model.learn_one(x, y)
    elapsed = time.perf_counter() - t0
    print(f"AMFClassifier(n=10) on Phishing: {metric.get():.4%} ({elapsed:.2f}s)")


def bench_amf_bananas():
    model = forest.AMFClassifier(n_estimators=10, seed=1)
    metric = metrics.Accuracy()
    t0 = time.perf_counter()
    for x, y in datasets.Bananas():
        y_pred = model.predict_one(x)
        metric.update(y, y_pred)
        model.learn_one(x, y)
    elapsed = time.perf_counter() - t0
    print(f"AMFClassifier(n=10) on Bananas: {metric.get():.4%} ({elapsed:.2f}s)")


def bench_amf_elec2():
    model = forest.AMFClassifier(n_estimators=10, seed=1)
    metric = metrics.Accuracy()
    t0 = time.perf_counter()
    for i, (x, y) in enumerate(datasets.Elec2()):
        if i >= 10_000:
            break
        y_pred = model.predict_one(x)
        metric.update(y, y_pred)
        model.learn_one(x, y)
    elapsed = time.perf_counter() - t0
    print(f"AMFClassifier(n=10) on Elec2[:10k]: {metric.get():.4%} ({elapsed:.2f}s)")


if __name__ == "__main__":
    bench_single_tree_phishing()
    bench_amf_phishing()
    bench_amf_bananas()
    bench_amf_elec2()
```

**Results**
---
**Benchmark results with MinMaxScaler**

Same benchmarks as above, but with a `MinMaxScaler` in front of each model.

**Benchmark script**

```python
"""Benchmark for Mondrian tree fix (PR #1835) with MinMaxScaler."""
import time

from river import compose, datasets, forest, metrics, preprocessing
from river.tree.mondrian.mondrian_tree_classifier import MondrianTreeClassifier


def run_bench(name, model, dataset, limit=None):
    pipe = compose.Pipeline(preprocessing.MinMaxScaler(), model)
    metric = metrics.Accuracy()
    t0 = time.perf_counter()
    for i, (x, y) in enumerate(dataset):
        if limit and i >= limit:
            break
        y_pred = pipe.predict_one(x)
        metric.update(y, y_pred)
        pipe.learn_one(x, y)
    elapsed = time.perf_counter() - t0
    print(f"{name}: {metric.get():.4%} ({elapsed:.2f}s)")


if __name__ == "__main__":
    run_bench("MondrianTreeClassifier on Phishing", MondrianTreeClassifier(seed=1), datasets.Phishing())
    run_bench("AMFClassifier(n=10) on Phishing", forest.AMFClassifier(n_estimators=10, seed=1), datasets.Phishing())
    run_bench("AMFClassifier(n=10) on Bananas", forest.AMFClassifier(n_estimators=10, seed=1), datasets.Bananas())
    run_bench("AMFClassifier(n=10) on Elec2[:10k]", forest.AMFClassifier(n_estimators=10, seed=1), datasets.Elec2(), limit=10_000)
```

**Results**
Scaling narrows the gap substantially.
---
I forgot to mention that the min-max scaling is applied in batched mode rather than in a streaming fashion, since the author of the paper applies it to the data before sending it to the forest.

I downloaded the data and ran the experiments using seeds 0 through 9 (10 runs), collecting the mean and standard deviation. The parameters used were:

It is interesting to note that the different mechanisms of applying min-max scaling lead to different accuracies. Since I am not entirely familiar with the exact mechanism of min-max scaling in stream mode, I reused your code and placed my results at the end. I have also included tables below detailing the differences between the models and the batched vs. streaming scaling versions.

Table 1: Model Comparison (Batched Min-Max Data)
Table 2: Scaling Mechanism Comparison (RiverML)
Personally, applying min-max scaling in a streaming fashion seems much more reasonable. In a real data stream, you cannot assume the global minimum and maximum of a feature in advance, and concept drift could easily change those boundaries over time. However, this approach becomes problematic with categorical features, as we cannot apply standard min-max scaling to categorical data. Because of this, I'm honestly not sure what the best approach is here and would appreciate your input on how to proceed.

P.S. I will add the release notes to
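For reference, streaming min-max works roughly like this. The sketch below is simplified and is not river's actual `preprocessing.MinMaxScaler` implementation: the running min and max are updated online from the samples seen so far, so early samples are scaled with incomplete statistics.

```python
class StreamingMinMax:
    """Toy streaming Min-Max scaler: no global statistics assumed up front."""

    def __init__(self):
        self.min = {}
        self.max = {}

    def learn_one(self, x):
        # Update the running per-feature min/max with this sample.
        for k, v in x.items():
            self.min[k] = min(self.min.get(k, v), v)
            self.max[k] = max(self.max.get(k, v), v)

    def transform_one(self, x):
        # Scale with the statistics seen SO FAR; unseen features map to 0.
        out = {}
        for k, v in x.items():
            lo, hi = self.min.get(k, v), self.max.get(k, v)
            out[k] = (v - lo) / (hi - lo) if hi > lo else 0.0
        return out


scaler = StreamingMinMax()
for x in [{"f": 2.0}, {"f": 4.0}, {"f": 10.0}]:
    scaler.learn_one(x)
# After seeing 2, 4, 10: min=2, max=10, so 4.0 maps to 0.25.
```

This makes the batched-vs-streaming accuracy difference easy to see: in batch mode every sample is scaled with the final min/max, while in stream mode the first samples are scaled against a range that is still growing.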
---
Thank you for these thorough benchmarks, they are helpful. Can you just confirm that the results you obtained take into account the change you made in this PR? I just want to confirm that your fix brings an improvement.

Indeed, the batch min-max is not realistic and goes against River's guiding principles. For now, regular streaming min-max is fine. To be really performant, we should consider some kind of rolling min-max scaling, to account for drift.

Indeed, categorical features and trees don't go nicely together. But it's a separate problem that we can tackle elsewhere. You don't have to worry about it here.
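The rolling idea mentioned above could look something like this. This is purely a hypothetical sketch; `RollingMinMax` and its `window_size` parameter do not exist in river. Min and max are computed over a sliding window of recent values, so stale extremes expire as the stream drifts.

```python
from collections import deque


class RollingMinMax:
    """Hypothetical rolling Min-Max scaler over the last `window_size` values."""

    def __init__(self, window_size=1000):
        self.window_size = window_size
        self.windows = {}  # feature name -> deque of recent values

    def learn_one(self, x):
        for k, v in x.items():
            # deque(maxlen=...) silently drops the oldest value when full.
            self.windows.setdefault(k, deque(maxlen=self.window_size)).append(v)

    def transform_one(self, x):
        out = {}
        for k, v in x.items():
            w = self.windows.get(k)
            if not w:
                out[k] = 0.0
                continue
            lo, hi = min(w), max(w)
            out[k] = (v - lo) / (hi - lo) if hi > lo else 0.0
        return out


scaler = RollingMinMax(window_size=3)
for v in [1.0, 5.0, 9.0, 2.0]:
    scaler.learn_one({"f": v})
# The window now holds [5, 9, 2]; the early extreme 1.0 has expired.
```

Recomputing min/max over the window is O(window) per feature; a monotonic-deque variant would bring that to amortized O(1), but the simple version is enough to illustrate the idea.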
---
Please note that for testing, I actually used the mondrian-test branch. The only difference is that this version does not include the fix for the regressor tree, as it is outside the scope of my thesis. Additionally, I temporarily disabled the depth update for the Mondrian tree to allow for faster execution.

I apologize for the oversight! I forgot to switch branches before running because I currently have some datasets processing in the background. If you'd prefer, I can switch to the
---
Co-authored-by: Max Halford <maxhalford25@gmail.com>
---
So far, all the fixes you have suggested have led to an accuracy improvement. So yes, I'd like to see if this is the case with this fix too :)
---
Here are the results. Just a few notes to ensure I am doing this correctly.

This is the version of River I tested:

These are the parameters I used for the test:

```python
SEEDS_TO_TEST = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
TREES_TO_TEST = 10
STEP_VAL = 1.0
DIRICHLET_VAL = 0.5
USE_AGGREGATION = True
```

However, the results are identical, and there is no difference between the two branches.
---
Just a minor observation regarding performance: since split_pure is usually set to False, doing the split check before computing the range and split time would provide a nice little performance boost. This is because calculating the range is the most computationally expensive part of a Mondrian tree. I haven't touched the Cython implementation, though, as I'm not very familiar with it yet.
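The reordering being suggested can be sketched as follows. This is illustrative plain Python with hypothetical helper names; the real logic lives in river's Mondrian tree and its Cython port.

```python
def node_is_pure(node):
    """A node is pure when all samples it holds share one class label."""
    return len(set(node["labels"])) <= 1


def range_extension(node, x):
    """Per-feature amount by which x extends the node's bounding box.

    Stands in for the expensive range computation discussed above.
    """
    return {
        k: max(0.0, node["min"][k] - v) + max(0.0, v - node["max"][k])
        for k, v in x.items()
    }


def split_candidate(node, x, split_pure=False):
    # Cheap purity check FIRST: with split_pure=False (the default), a pure
    # node can never split, so the expensive range computation is skipped.
    if not split_pure and node_is_pure(node):
        return None
    return range_extension(node, x)


pure_node = {"labels": [1, 1, 1], "min": {"f": 0.0}, "max": {"f": 1.0}}
mixed_node = {"labels": [0, 1], "min": {"f": 0.0}, "max": {"f": 1.0}}
assert split_candidate(pure_node, {"f": 2.0}) is None        # range never computed
assert split_candidate(mixed_node, {"f": 2.0}) == {"f": 1.0}
```

The saving scales with how often learning reaches a pure leaf, which matches the depth-dependent gains discussed below.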
---
That's a good observation. It brings a small ~5% improvement when I profile it. I've opened a PR. Out of curiosity, do you have access to Claude Code? It's quite powerful for running this kind of benchmark.
---
I wouldn't expect a massive improvement since pure nodes are usually the leaves, so the performance gain depends heavily on the tree depth. For a depth of 100, it's probably only around a 1% gain. However, for highly unbalanced data where a large chunk of early samples share the same class, you would definitely save on those range computations. I only brought it up because I noticed the discrepancy while reading the code line by line and comparing it with onelearn.

Regarding Claude Code, no, I don't have access to it. I mostly use AI for writing small Python scripts and fixing my grammar.

P.S. Is there anything else I need to do for this PR?
---
Makes sense, thanks for the additional explanations. All improvements, even minor ones, are acceptable! We can merge this PR once the CI is green, and for that the doctests have to be updated. Could you change the doctests to use
When `split_pure=False` (the default), check node purity before computing range extensions. Pure nodes will never split, so the expensive `range_extension_c` call can be skipped entirely. Benchmarks show ~3-5% speedup on datasets with 50+ features. Credit: @shi-zq for the observation in #1835. Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
---
Thanks for the clarification! I was actually just about to ask how to get the CI green. I took a look at the sample, added an explanation for why we should use MinMaxScaler, and updated the final value in the doctest. One minor detail: for the Bananas dataset, your script outputs 84.04%, but when I run those same settings in the doctest, I get 84.05%. Also, there is a huge improvement for the regressor, going from 0.279747 to 0.427341.
---
For the regression it's not an improvement (lower is better). But it's probably just noise and is acceptable.
---
Thank you for the PR! Now you're officially a contributor to the package :)
It resolves an issue with the leaf case in the split method (#1801) and fixes a shared reference bug in replant by properly copying bounding box dictionaries by value rather than by reference (#1834).
Since this is my first pull request, please let me know if I missed any project conventions or if anything needs to be adjusted. I'm happy to make changes