
Mondrian fix #1835

Merged
MaxHalford merged 15 commits into online-ml:main from shi-zq:mondrian-fix
Apr 28, 2026

Conversation

@shi-zq
Contributor

@shi-zq shi-zq commented Apr 22, 2026

This PR resolves an issue with the leaf case in the split method (#1801) and fixes a shared reference bug in replant by properly copying the bounding box dictionaries by value rather than by reference (#1834).

Since this is my first pull request, please let me know if I missed any project conventions or if anything needs to be adjusted. I'm happy to make changes.
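
To illustrate the second bug, the replant issue boils down to Python's reference semantics for dictionaries. Here is a minimal sketch with simplified names, not the actual river node code:

```python
# Minimal illustration of the shared-reference bug (simplified, not river's actual code).
old_leaf_min, old_leaf_max = {"f0": 0.0}, {"f0": 1.0}

# Buggy: the replanted node shares the same dict objects as the old leaf.
new_min, new_max = old_leaf_min, old_leaf_max
old_leaf_max["f0"] = 2.0
assert new_max["f0"] == 2.0  # the new node's bounding box was silently corrupted

# Fixed: copy the bounding box dicts by value.
fixed_min, fixed_max = old_leaf_min.copy(), old_leaf_max.copy()
old_leaf_max["f0"] = 3.0
assert fixed_max["f0"] == 2.0  # the copy is unaffected
```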

shi-zq added 3 commits April 22, 2026 10:29
PR online-ml#1801 fixed the missing copy issue for the branch case. However, a similar bug still exists in the leaf case. When a leaf is split and generates two new leaves, the leaf that contains the new sample gets updated later, but the other leaf—which is supposed to inherit the old leaf's state—fails to copy the bounding boxes correctly.
Fix missing dictionary copy in Mondrian tree replant

When replanting a MondrianNode, `memory_range_min` and `memory_range_max`
were being assigned by reference rather than by value. Because these
attributes are dictionaries, this shared reference caused unintended
bounding box corruption when the original leaf's boundaries were modified.

Appending `.copy()` to both dictionary assignments ensures they are
copied by value, completely resolving the bounding box overlap issue.

Fixes online-ml#1834
@kulbachcedric kulbachcedric linked an issue Apr 22, 2026 that may be closed by this pull request
@shi-zq
Contributor Author

shi-zq commented Apr 26, 2026

Additionally, looking at onelearn/datasets/loaders.py, it appears that the paper applies Min-Max scaling (to the [0, 1] range) for numerical features and one-hot encoding for categorical data. This preprocessing step might account for the discrepancy in accuracy reported in #1825. I also want to add that the onelearn version assumes the total number of classes is known a priori, whereas the river version dynamically counts only the seen classes (as mentioned in #1170).
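
For reference, the batch preprocessing amounts to roughly the following (a scikit-learn sketch with made-up toy data; the actual loaders in onelearn may differ):

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder

# Toy data: two numerical columns and one categorical column (illustrative only).
X = np.array([
    [1.0, 10.0, "a"],
    [2.0, 20.0, "b"],
    [3.0, 30.0, "a"],
], dtype=object)

preprocess = ColumnTransformer([
    ("num", MinMaxScaler(), [0, 1]),                       # scale to [0, 1]
    ("cat", OneHotEncoder(handle_unknown="ignore"), [2]),  # one-hot encode
])

# Batch mode: the scaler sees the whole dataset's min/max up front.
X_processed = preprocess.fit_transform(X)
```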

@MaxHalford
Member

Your fix makes sense. Could you try seeing what happens if you scale the features in [0, 1]? Also, once you're done, you'll have to add a release note to unreleased.md.

@MaxHalford
Member

MaxHalford commented Apr 27, 2026

Benchmark results

I ran the benchmarks from #1825 on main (which already includes the #1825 fix) and on this PR branch.

Benchmark script

```python
"""Benchmark for Mondrian tree fix (PR #1835)."""

import time
from river import datasets, forest, metrics
from river.tree.mondrian.mondrian_tree_classifier import MondrianTreeClassifier


def bench_single_tree_phishing():
    model = MondrianTreeClassifier(seed=1)
    metric = metrics.Accuracy()
    t0 = time.perf_counter()
    for x, y in datasets.Phishing():
        y_pred = model.predict_one(x)
        metric.update(y, y_pred)
        model.learn_one(x, y)
    elapsed = time.perf_counter() - t0
    print(f"MondrianTreeClassifier on Phishing: {metric.get():.4%}  ({elapsed:.2f}s)")


def bench_amf_phishing():
    model = forest.AMFClassifier(n_estimators=10, seed=1)
    metric = metrics.Accuracy()
    t0 = time.perf_counter()
    for x, y in datasets.Phishing():
        y_pred = model.predict_one(x)
        metric.update(y, y_pred)
        model.learn_one(x, y)
    elapsed = time.perf_counter() - t0
    print(f"AMFClassifier(n=10) on Phishing:    {metric.get():.4%}  ({elapsed:.2f}s)")


def bench_amf_bananas():
    model = forest.AMFClassifier(n_estimators=10, seed=1)
    metric = metrics.Accuracy()
    t0 = time.perf_counter()
    for x, y in datasets.Bananas():
        y_pred = model.predict_one(x)
        metric.update(y, y_pred)
        model.learn_one(x, y)
    elapsed = time.perf_counter() - t0
    print(f"AMFClassifier(n=10) on Bananas:     {metric.get():.4%}  ({elapsed:.2f}s)")


def bench_amf_elec2():
    model = forest.AMFClassifier(n_estimators=10, seed=1)
    metric = metrics.Accuracy()
    t0 = time.perf_counter()
    for i, (x, y) in enumerate(datasets.Elec2()):
        if i >= 10_000:
            break
        y_pred = model.predict_one(x)
        metric.update(y, y_pred)
        model.learn_one(x, y)
    elapsed = time.perf_counter() - t0
    print(f"AMFClassifier(n=10) on Elec2[:10k]: {metric.get():.4%}  ({elapsed:.2f}s)")


if __name__ == "__main__":
    bench_single_tree_phishing()
    bench_amf_phishing()
    bench_amf_bananas()
    bench_amf_elec2()
```

Results

| Benchmark | main | pr/1835 | Δ |
|---|---|---|---|
| MondrianTreeClassifier on Phishing | 85.28% | 73.44% | -11.84pp |
| AMFClassifier(n=10) on Phishing | 89.92% | 84.16% | -5.76pp |
| AMFClassifier(n=10) on Bananas | 89.23% | 70.34% | -18.89pp |
| AMFClassifier(n=10) on Elec2[:10k] | 82.89% | 84.96% | +2.07pp |

@MaxHalford
Member

MaxHalford commented Apr 27, 2026

Benchmark results with MinMaxScaler

Same benchmarks as above but with preprocessing.MinMaxScaler() piped before the model, since Mondrian trees are sensitive to feature scale.

Benchmark script

```python
"""Benchmark for Mondrian tree fix (PR #1835) with MinMaxScaler."""

import time
from river import datasets, forest, metrics, preprocessing, compose
from river.tree.mondrian.mondrian_tree_classifier import MondrianTreeClassifier


def run_bench(name, model, dataset, limit=None):
    pipe = compose.Pipeline(preprocessing.MinMaxScaler(), model)
    metric = metrics.Accuracy()
    t0 = time.perf_counter()
    for i, (x, y) in enumerate(dataset):
        if limit and i >= limit:
            break
        y_pred = pipe.predict_one(x)
        metric.update(y, y_pred)
        pipe.learn_one(x, y)
    elapsed = time.perf_counter() - t0
    print(f"{name}: {metric.get():.4%}  ({elapsed:.2f}s)")


if __name__ == "__main__":
    run_bench("MondrianTreeClassifier on Phishing", MondrianTreeClassifier(seed=1), datasets.Phishing())
    run_bench("AMFClassifier(n=10) on Phishing", forest.AMFClassifier(n_estimators=10, seed=1), datasets.Phishing())
    run_bench("AMFClassifier(n=10) on Bananas", forest.AMFClassifier(n_estimators=10, seed=1), datasets.Bananas())
    run_bench("AMFClassifier(n=10) on Elec2[:10k]", forest.AMFClassifier(n_estimators=10, seed=1), datasets.Elec2(), limit=10_000)
```

Results

| Benchmark (with MinMaxScaler) | main | pr/1835 | Δ |
|---|---|---|---|
| MondrianTreeClassifier on Phishing | 83.20% | 82.08% | -1.12pp |
| AMFClassifier(n=10) on Phishing | 90.08% | 87.52% | -2.56pp |
| AMFClassifier(n=10) on Bananas | 85.57% | 84.04% | -1.53pp |
| AMFClassifier(n=10) on Elec2[:10k] | 82.34% | 81.67% | -0.67pp |

Scaling narrows the gap substantially.

@shi-zq
Contributor Author

shi-zq commented Apr 27, 2026

I forgot to mention that the min-max scaling is applied in batched mode rather than in a streaming fashion. Since the author of the paper applies it to the data before sending it to the forest in onelearn/datasets/loaders.py, I am doing it this way as well.

I downloaded the phishing, elec2 (10k), and bananas datasets using River, applied min-max scaling and one-hot encoding, and stored them as phishing_processed.arff, elec2_processed.arff, and bananas_processed.arff. I saved them in ARFF format because it is easier to integrate with my previous code, which uses ARFF instead of CSV.

I ran the experiments using seeds 0 through 9 (10 runs) and collected the mean and standard deviation. The parameters used were: trees=10, step=1.0, dirichlet=0.5.

It is interesting to note that the different mechanisms of applying min-max scaling lead to different accuracies. Since I am not entirely familiar with the exact mechanism of min-max scaling in stream mode, I reused your code and placed my results at the end. I have also included tables below detailing the differences between the models and the batched vs. stream scaling versions.
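
My rough understanding of the streaming version (please correct me if river does something different) is that it keeps running minima and maxima and scales each sample using only the statistics seen so far:

```python
# Sketch of streaming min-max scaling with running statistics
# (my understanding only; river's actual implementation may differ).
class StreamingMinMax:
    def __init__(self):
        self.mins: dict = {}
        self.maxs: dict = {}

    def learn_one(self, x: dict):
        for k, v in x.items():
            self.mins[k] = min(self.mins.get(k, v), v)
            self.maxs[k] = max(self.maxs.get(k, v), v)

    def transform_one(self, x: dict) -> dict:
        out = {}
        for k, v in x.items():
            lo, hi = self.mins.get(k, v), self.maxs.get(k, v)
            out[k] = (v - lo) / (hi - lo) if hi > lo else 0.0
        return out
```

Early samples would then be scaled with very partial min/max estimates, which could explain part of the gap between the batched and streaming numbers.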

Table 1: Model Comparison (Batched Min-Max Data)

| Model | phishing_processed.arff | elec2_processed.arff | bananas_processed.arff |
|---|---|---|---|
| RiverML (Mean ± Std) | 88.21% ± 2.00% | 89.99% ± 0.26% | 70.36% ± 0.13% |
| OneLearn (Mean ± Std) | 86.41% ± 1.77% | 90.03% ± 0.27% | 70.22% ± 0.18% |
| Difference | +1.80% | -0.04% | +0.14% |

Table 2: Scaling Mechanism Comparison (RiverML)

| Scaling Method | phishing_processed.arff | elec2_processed.arff | bananas_processed.arff |
|---|---|---|---|
| RiverML (Batched Min-Max) | 88.21% | 89.99% | 70.36% |
| RiverML (Stream Min-Max) | 87.52% | 81.67% | 84.04% |
| Difference | +0.69% | +8.32% | -13.68% |

Personally, applying min-max scaling in a streaming fashion seems much more reasonable. In a real data stream, you cannot assume the global minimum and maximum of a feature in advance, and concept drift could easily change those boundaries over time.

However, this approach becomes problematic with categorical features, as we cannot apply standard min-max scaling to categorical data. Because of this, I'm honestly not sure what the best approach is here and would appreciate your input on how to proceed.

P.S. I will add the release notes to unreleased.md once we decide on the best path forward. Also, since onelearn is quite an old package, I am using Python 3.8 to run these experiments.

@MaxHalford
Member

Thank you for these thorough benchmarks; they're helpful. Can you confirm that the results you obtained take into account the change you made in this PR? I just want to check that your fix brings an improvement.

> Personally, applying min-max scaling in a streaming fashion seems much more reasonable. In a real data stream, you cannot assume the global minimum and maximum of a feature in advance, and concept drift could easily change those boundaries over time.

Indeed, the batch min-max is not realistic and goes against River's guiding principles. For now, regular streaming min-max is fine. To be really performant, we should consider some kind of rolling min-max scaling to account for drift, something along the lines of the sketch below.
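
Conceptually (a naive sketch that recomputes the window min/max in O(window) per sample; a real implementation would use river's rolling statistics or monotonic deques):

```python
from collections import deque

# Hedged sketch of rolling min-max scaling over a sliding window,
# so the scale can adapt when the feature distribution drifts.
class RollingMinMaxScaler:
    def __init__(self, window_size: int = 1000):
        self.windows: dict = {}
        self.window_size = window_size

    def learn_one(self, x: dict):
        for k, v in x.items():
            w = self.windows.setdefault(k, deque(maxlen=self.window_size))
            w.append(v)

    def transform_one(self, x: dict) -> dict:
        out = {}
        for k, v in x.items():
            w = self.windows.get(k)
            if not w:
                out[k] = 0.0
                continue
            lo, hi = min(w), max(w)  # O(window); fine for a sketch
            out[k] = (v - lo) / (hi - lo) if hi > lo else 0.0
        return out
```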

> However, this approach becomes problematic with categorical features, as we cannot apply standard min-max scaling to categorical data. Because of this, I'm honestly not sure what the best approach is here and would appreciate your input on how to proceed.

Indeed, categorical features and trees don't mix nicely. But it's a separate problem that we can tackle elsewhere. You don't have to worry about it here.

@shi-zq
Contributor Author

shi-zq commented Apr 27, 2026

Please note that for testing, I actually used the mondrian-test branch. The only difference is that this version does not include the fix for the regressor tree, as it is outside the scope of my thesis. Additionally, I temporarily disabled the depth update for the Mondrian tree to allow for faster execution.

I apologize for the oversight! I forgot to switch branches before running because I currently have some datasets processing in the background. If you'd prefer, I can switch to the mondrian-fix branch and re-test everything once my current runs are finished.

| Fix | mondrian-test | mondrian-fix |
|---|---|---|
| Replant fix | e01eaf6 (missing regressor fix) | e750211 |
| Split fix | 2e66f26 (missing regressor fix) | 7987e60 |
| Depth update | e01eaf6 | Not done |

@MaxHalford
Member

So far, all the fixes you have suggested led to an accuracy improvement. So yes, I'd like to see whether this is the case with this fix too :)

@shi-zq
Contributor Author

shi-zq commented Apr 28, 2026

Here are the results of the mondrian-fix branch.

Just a few notes to ensure I am doing this correctly. This is the version of River I tested: git+https://github.com/shi-zq/river.git@5f2cb4cc3999ae4c2ffb8186412ecf7fe75d05b3

These are the parameters I used for the test:

```python
SEEDS_TO_TEST = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
TREES_TO_TEST = 10
STEP_VAL = 1.0
DIRICHLET_VAL = 0.5
USE_AGGREGATION = True
```

However, the results are identical: there is no difference between the two branches.

| Model | phishing_processed.arff | elec2_processed.arff | bananas_processed.arff |
|---|---|---|---|
| RiverML (mondrian-test) (Mean ± Std) | 88.21% ± 2.00% | 89.99% ± 0.26% | 70.36% ± 0.13% |
| RiverML (mondrian-fix) (Mean ± Std) | 88.21% ± 2.00% | 89.99% ± 0.26% | 70.36% ± 0.13% |
| Difference | 0.00% | 0.00% | 0.00% |

@shi-zq
Contributor Author

shi-zq commented Apr 28, 2026

```cython
cpdef object go_downwards_classifier_c(
    object root,
    dict x,
    int y_idx,
    int n_classes,
    double dirichlet,
    bint use_aggregation,
    double step,
    bint split_pure,
    int iteration,
    int max_nodes,
    int n_nodes,
    object rng_random,
    object rng_choices,
    object rng_uniform,
    object split_fn,
):
    """Full _go_downwards loop for classifier in Cython.

    Returns (leaf_node, new_root_or_None, n_nodes_added).
    """
    cdef object current_node = root
    cdef dict extensions
    cdef list counts, ext_features, ext_weights
    cdef double extensions_sum, split_time, split_time_candidate, T
    cdef double x_f, range_min_f, range_max_f, threshold, child_time
    cdef int count_val, branch_no, nodes_added
    cdef bint do_split_check, is_right_extension, was_leaf, is_leaf
    cdef object left, right, parent, feature
    cdef object new_root = None
    nodes_added = 0

    if iteration == 0:
        update_downwards_classifier_c(
            current_node, x, y_idx, dirichlet, use_aggregation, step,
            False, n_classes,
        )
        return current_node, new_root, nodes_added

    branch_no = -1
    while True:
        # Compute range extension
        extensions_sum, extensions = range_extension_c(
            current_node.memory_range_min, current_node.memory_range_max, x
        )

        # Compute split time
        split_time = 0.0
        if max_nodes >= 0 and (n_nodes + nodes_added) >= max_nodes:
            pass  # max_nodes reached, no split
        elif extensions_sum > 0:
            do_split_check = split_pure  # check the split_pure parameter
            if not do_split_check:
                counts = current_node.counts
                count_val = <int>counts[y_idx] if y_idx < len(counts) else 0
                if current_node.n_samples != count_val:
                    do_split_check = True
            if do_split_check:
                T = -log(1.0 - <double>rng_random()) / extensions_sum
                split_time_candidate = <double>current_node.time + T
                is_leaf = current_node.is_leaf
                if is_leaf:
                    split_time = split_time_candidate
                else:
                    child_time = <double>current_node.children[0].time
                    if split_time_candidate < child_time:
                        split_time = split_time_candidate

        if split_time > 0:
            # Select split feature weighted by extensions (sorted for determinism)
            ext_features = sorted(extensions.keys())
            ext_weights = [extensions[f] for f in ext_features]
            feature = rng_choices(ext_features, ext_weights, k=1)[0]

            x_f = <double>x[feature]
            range_min_f = <double>current_node.memory_range_min[feature]
            range_max_f = <double>current_node.memory_range_max[feature]
            is_right_extension = x_f > range_max_f
            if is_right_extension:
                threshold = <double>rng_uniform(range_max_f, x_f)
            else:
                threshold = <double>rng_uniform(x_f, range_min_f)

            was_leaf = current_node.is_leaf
            current_node = split_fn(
                current_node, split_time, threshold,
                feature, is_right_extension,
            )
            nodes_added += 2

            if current_node.parent is None:
                new_root = current_node
            elif was_leaf:
                parent = current_node.parent
                if branch_no == 0:
                    parent.children = (current_node, parent.children[1])
                else:
                    parent.children = (parent.children[0], current_node)

            update_downwards_classifier_c(
                current_node, x, y_idx, dirichlet, use_aggregation, step,
                True, n_classes,
            )

            left, right = current_node.children
            if is_right_extension:
                current_node = right
            else:
                current_node = left

            update_downwards_classifier_c(
                current_node, x, y_idx, dirichlet, use_aggregation, step,
                False, n_classes,
            )
            return current_node, new_root, nodes_added
        else:
            update_downwards_classifier_c(
                current_node, x, y_idx, dirichlet, use_aggregation, step,
                True, n_classes,
            )
            if current_node.is_leaf:
                return current_node, new_root, nodes_added
            else:
                feature = current_node.feature
                if feature in x:
                    if <double>x[feature] <= <double>current_node.threshold:
                        branch_no = 0
                        current_node = current_node.children[0]
                    else:
                        branch_no = 1
                        current_node = current_node.children[1]
                else:
                    branch_no, current_node = current_node.most_common_path()
```

Just a minor observation regarding performance: since split_pure is usually set to False, doing the purity check before computing the range extension and split time would provide a nice little performance boost, because computing the range extension is the most computationally expensive part of a Mondrian tree. I haven't touched the Cython implementation, though, as I'm not very familiar with it yet; a rough sketch of what I mean is below.
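
In plain Python terms (an untested sketch with simplified names, not the actual Cython code):

```python
from math import log

def candidate_split_time(node, x, y_idx, split_pure, rng_random, range_extension):
    """Return the candidate split time for `node`, or 0.0 if it cannot split.

    Reordered so the purity check runs *before* the expensive
    range-extension computation (simplified names, not river's API).
    """
    if not split_pure:
        count_val = node.counts[y_idx] if y_idx < len(node.counts) else 0
        if node.n_samples == count_val:
            return 0.0  # pure node: it will never split, skip range_extension

    extensions_sum, _ = range_extension(node.memory_range_min, node.memory_range_max, x)
    if extensions_sum <= 0:
        return 0.0

    candidate = node.time - log(1.0 - rng_random()) / extensions_sum
    if node.is_leaf or candidate < node.children[0].time:
        return candidate
    return 0.0
```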

@MaxHalford
Member

MaxHalford commented Apr 28, 2026

That's a good observation. It brings a small ~5% improvement when I profile it. I've opened a PR.

Out of curiosity, do you have access to Claude Code? It's quite powerful for running this kind of benchmark.

@shi-zq
Contributor Author

shi-zq commented Apr 28, 2026

I wouldn't expect a massive improvement since pure nodes are usually the leaves, so the performance gain depends heavily on the tree depth. For a depth of 100, it's probably only around a 1% gain. However, for highly unbalanced data where a large chunk of early samples share the same class, you would definitely save on those range computations. I only brought it up because I noticed the discrepancy while reading the code line-by-line and comparing it with onelearn.

Regarding Claude Code, no, I don't have access to it. I mostly use AI for writing small Python scripts and fixing my grammar.

P.S. Is there anything else I need to do for this PR?

@MaxHalford
Member

MaxHalford commented Apr 28, 2026

Makes sense, thanks for the additional explanations. All improvements, even minor ones, are acceptable!

We can merge this PR once the CI is green, and for that the doctests have to be updated. Could you change the doctests to use MinMaxScaler? I think it's important to make clear that using it is encouraged. We should also document in the docstring why this is the case, like what @kulbachcedric did here.
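
For the classifier, I'm picturing something along these lines (a sketch; the expected score should be whatever the pipeline actually produces, presumably close to the 84.04% from my benchmark above):

```python
>>> from river import datasets, evaluate, forest, metrics, preprocessing

>>> dataset = datasets.Bananas()

>>> model = preprocessing.MinMaxScaler() | forest.AMFClassifier(
...     n_estimators=10,
...     seed=1,
... )

>>> metric = metrics.Accuracy()

>>> evaluate.progressive_val_score(dataset, model, metric)
Accuracy: 84.04%
```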

MaxHalford added a commit that referenced this pull request Apr 28, 2026
When split_pure=False (default), check node purity before computing
range extensions. Pure nodes will never split, so the expensive
range_extension_c call can be skipped entirely. Benchmarks show ~3-5%
speedup on datasets with 50+ features.

Credit: @shi-zq for the observation in #1835.

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@shi-zq
Contributor Author

shi-zq commented Apr 28, 2026

Thanks for the clarification! I was actually just about to ask how to get the CI green. I took a look at the sample, added an explanation for why we should use MinMaxScaler, and updated the final value in the doctest.

One minor detail: for the Bananas dataset, your script outputs 84.04%, but when I run those same settings in the doctest, I get 84.05%. Also, there is a huge improvement for the regressor, going from 0.279747 to 0.427341.

@MaxHalford
Member

For the regression it's not an improvement (lower is better). But it's probably just noise and is acceptable.

@MaxHalford MaxHalford merged commit e09dd75 into online-ml:main Apr 28, 2026
1 check passed
@shi-zq shi-zq deleted the mondrian-fix branch April 28, 2026 19:32
@MaxHalford
Member

Thank you for the PR! Now you're officially a contributor to the package :)
