Mondrian fix #1835
Conversation
PR online-ml#1801 fixed the missing copy issue for the branch case. However, a similar bug still exists in the leaf case. When a leaf is split and generates two new leaves, the leaf that contains the new sample gets updated later, but the other leaf—which is supposed to inherit the old leaf's state—fails to copy the bounding boxes correctly.
Fix missing dictionary copy in Mondrian tree replant

When replanting a MondrianNode, `memory_range_min` and `memory_range_max` were being assigned by reference rather than by value. Because these attributes are dictionaries, this shared reference caused unintended bounding box corruption when the original leaf's boundaries were modified. Appending `.copy()` to both dictionary assignments ensures they are copied by value, resolving the bounding box overlap issue. Fixes online-ml#1834
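For readers unfamiliar with this bug class, the shared-reference problem the commit fixes can be reproduced with plain dictionaries. This is a minimal standalone sketch, not river code; the variable names only mirror the `memory_range_min` attribute mentioned above.

```python
# Minimal reproduction of the shared-reference bug fixed in this PR.
# `memory_range_min` is a dict of per-feature lower bounds in MondrianNode.

old_leaf_range_min = {"x0": 0.1, "x1": 0.4}

# Buggy replant: the new leaf shares the SAME dict object as the old leaf.
new_leaf_range_min = old_leaf_range_min

# Mutating the old leaf's bounding box now corrupts the new leaf too.
old_leaf_range_min["x0"] = -5.0
assert new_leaf_range_min["x0"] == -5.0  # unintended corruption

# Fixed replant: copy by value, so the two leaves evolve independently.
old_leaf_range_min = {"x0": 0.1, "x1": 0.4}
new_leaf_range_min = old_leaf_range_min.copy()
old_leaf_range_min["x0"] = -5.0
assert new_leaf_range_min["x0"] == 0.1  # unaffected by the mutation
```

Since only the dictionary values are floats here, a shallow `.copy()` is sufficient; nested mutable values would need `copy.deepcopy`.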
This reverts commit 08b40c3.
---
Additionally, looking at onelearn/datasets/loaders.py, it appears that the paper applies Min-Max scaling (to the [0, 1] range) to numerical features and one-hot encoding to categorical data. This preprocessing step might account for the discrepancy in accuracy reported in #1825. I also want to add that the onelearn version assumes the total number of classes is known a priori, whereas the river version dynamically counts only the seen classes (as mentioned in #1170).
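The preprocessing described above can be sketched in plain Python, independent of onelearn. These helpers are illustrative, not functions from either library: numeric columns are rescaled to [0, 1] using the global batch min/max, and categorical columns become 0/1 indicator features.

```python
def min_max_scale(column):
    """Scale a numeric column to [0, 1] using the global min/max (batch mode)."""
    lo, hi = min(column), max(column)
    span = (hi - lo) or 1.0  # avoid division by zero on constant columns
    return [(v - lo) / span for v in column]


def one_hot(column):
    """One-hot encode a categorical column as dicts of 0/1 indicators."""
    categories = sorted(set(column))
    return [{c: int(v == c) for c in categories} for v in column]


scaled = min_max_scale([2.0, 4.0, 10.0])     # -> [0.0, 0.25, 1.0]
encoded = one_hot(["red", "blue", "red"])    # first row -> {"blue": 0, "red": 1}
```

Note that this batch version needs the whole column up front, which is exactly the assumption a streaming pipeline cannot make.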
---
Your fix makes sense. Could you try seeing what happens if you scale the features in [0, 1]? Also, once you're done, you'll have to add a release note to
**Benchmark results**

I ran the benchmarks from #1825 against this branch.

**Benchmark script**

```python
"""Benchmark for Mondrian tree fix (PR #1835)."""
import time

from river import datasets, forest, metrics
from river.tree.mondrian.mondrian_tree_classifier import MondrianTreeClassifier


def bench_single_tree_phishing():
    model = MondrianTreeClassifier(seed=1)
    metric = metrics.Accuracy()
    t0 = time.perf_counter()
    for x, y in datasets.Phishing():
        y_pred = model.predict_one(x)
        metric.update(y, y_pred)
        model.learn_one(x, y)
    elapsed = time.perf_counter() - t0
    print(f"MondrianTreeClassifier on Phishing: {metric.get():.4%} ({elapsed:.2f}s)")


def bench_amf_phishing():
    model = forest.AMFClassifier(n_estimators=10, seed=1)
    metric = metrics.Accuracy()
    t0 = time.perf_counter()
    for x, y in datasets.Phishing():
        y_pred = model.predict_one(x)
        metric.update(y, y_pred)
        model.learn_one(x, y)
    elapsed = time.perf_counter() - t0
    print(f"AMFClassifier(n=10) on Phishing: {metric.get():.4%} ({elapsed:.2f}s)")


def bench_amf_bananas():
    model = forest.AMFClassifier(n_estimators=10, seed=1)
    metric = metrics.Accuracy()
    t0 = time.perf_counter()
    for x, y in datasets.Bananas():
        y_pred = model.predict_one(x)
        metric.update(y, y_pred)
        model.learn_one(x, y)
    elapsed = time.perf_counter() - t0
    print(f"AMFClassifier(n=10) on Bananas: {metric.get():.4%} ({elapsed:.2f}s)")


def bench_amf_elec2():
    model = forest.AMFClassifier(n_estimators=10, seed=1)
    metric = metrics.Accuracy()
    t0 = time.perf_counter()
    for i, (x, y) in enumerate(datasets.Elec2()):
        if i >= 10_000:
            break
        y_pred = model.predict_one(x)
        metric.update(y, y_pred)
        model.learn_one(x, y)
    elapsed = time.perf_counter() - t0
    print(f"AMFClassifier(n=10) on Elec2[:10k]: {metric.get():.4%} ({elapsed:.2f}s)")


if __name__ == "__main__":
    bench_single_tree_phishing()
    bench_amf_phishing()
    bench_amf_bananas()
    bench_amf_elec2()
```

**Results**
---
**Benchmark results with MinMaxScaler**

Same benchmarks as above, but with a `MinMaxScaler` in front of each model.

**Benchmark script**

```python
"""Benchmark for Mondrian tree fix (PR #1835) with MinMaxScaler."""
import time

from river import compose, datasets, forest, metrics, preprocessing
from river.tree.mondrian.mondrian_tree_classifier import MondrianTreeClassifier


def run_bench(name, model, dataset, limit=None):
    pipe = compose.Pipeline(preprocessing.MinMaxScaler(), model)
    metric = metrics.Accuracy()
    t0 = time.perf_counter()
    for i, (x, y) in enumerate(dataset):
        if limit and i >= limit:
            break
        y_pred = pipe.predict_one(x)
        metric.update(y, y_pred)
        pipe.learn_one(x, y)
    elapsed = time.perf_counter() - t0
    print(f"{name}: {metric.get():.4%} ({elapsed:.2f}s)")


if __name__ == "__main__":
    run_bench("MondrianTreeClassifier on Phishing", MondrianTreeClassifier(seed=1), datasets.Phishing())
    run_bench("AMFClassifier(n=10) on Phishing", forest.AMFClassifier(n_estimators=10, seed=1), datasets.Phishing())
    run_bench("AMFClassifier(n=10) on Bananas", forest.AMFClassifier(n_estimators=10, seed=1), datasets.Bananas())
    run_bench("AMFClassifier(n=10) on Elec2[:10k]", forest.AMFClassifier(n_estimators=10, seed=1), datasets.Elec2(), limit=10_000)
```

**Results**
Scaling narrows the gap substantially.
---
I forgot to mention that the min-max scaling is applied in batched mode rather than in a streaming fashion, since the author of the paper applies it to the data before sending it to the forest.

I downloaded the data and ran the experiments using seeds 0 through 9 (10 runs), collecting the mean and standard deviation. The parameters used were:

It is interesting to note that the different mechanisms of applying min-max scaling lead to different accuracies. Since I am not entirely familiar with the exact mechanism of min-max scaling in stream mode, I reused your code and placed my results at the end. I have also included tables below detailing the differences between the models and the batched vs. streaming scaling versions.

Table 1: Model Comparison (Batched Min-Max Data)
Table 2: Scaling Mechanism Comparison (RiverML)
Personally, applying min-max scaling in a streaming fashion seems much more reasonable. In a real data stream, you cannot assume the global minimum and maximum of a feature in advance, and concept drift could easily change those boundaries over time. However, this approach becomes problematic with categorical features, as we cannot apply standard min-max scaling to categorical data. Because of this, I'm honestly not sure what the best approach is here and would appreciate your input on how to proceed.

P.S. I will add the release notes to
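For reference, streaming min-max works roughly like this. The sketch below is simplified and is not river's actual `preprocessing.MinMaxScaler` implementation: the running min and max are updated online from the samples seen so far, so early samples are scaled with incomplete statistics.

```python
class StreamingMinMax:
    """Toy streaming Min-Max scaler: no global statistics assumed up front."""

    def __init__(self):
        self.min = {}
        self.max = {}

    def learn_one(self, x):
        # Update the running per-feature min/max with this sample.
        for k, v in x.items():
            self.min[k] = min(self.min.get(k, v), v)
            self.max[k] = max(self.max.get(k, v), v)

    def transform_one(self, x):
        # Scale with the statistics seen SO FAR; unseen features map to 0.
        out = {}
        for k, v in x.items():
            lo, hi = self.min.get(k, v), self.max.get(k, v)
            out[k] = (v - lo) / (hi - lo) if hi > lo else 0.0
        return out


scaler = StreamingMinMax()
for x in [{"f": 2.0}, {"f": 4.0}, {"f": 10.0}]:
    scaler.learn_one(x)
# After seeing 2, 4, 10: min=2, max=10, so 4.0 maps to 0.25.
```

This makes the batched-vs-streaming accuracy difference easy to see: in batch mode every sample is scaled with the final min/max, while in stream mode the first samples are scaled against a range that is still growing.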
---
Thank you for these thorough benchmarks, they are helpful. Can you just confirm that the results you obtained take into account the change you made in this PR? I just want to confirm that your fix brings an improvement.

Indeed, the batch min-max is not realistic and goes against River's guiding principles. For now, regular streaming min-max is fine. To be really performant, we should consider some kind of rolling min-max scaling, to account for drift.

Indeed, categorical features and trees don't go nicely together. But it's a separate problem that we can tackle elsewhere. You don't have to worry about it here.
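The rolling idea mentioned above could look something like this. This is purely a hypothetical sketch; `RollingMinMax` and its `window_size` parameter do not exist in river. Min and max are computed over a sliding window of recent values, so stale extremes expire as the stream drifts.

```python
from collections import deque


class RollingMinMax:
    """Hypothetical rolling Min-Max scaler over the last `window_size` values."""

    def __init__(self, window_size=1000):
        self.window_size = window_size
        self.windows = {}  # feature name -> deque of recent values

    def learn_one(self, x):
        for k, v in x.items():
            # deque(maxlen=...) silently drops the oldest value when full.
            self.windows.setdefault(k, deque(maxlen=self.window_size)).append(v)

    def transform_one(self, x):
        out = {}
        for k, v in x.items():
            w = self.windows.get(k)
            if not w:
                out[k] = 0.0
                continue
            lo, hi = min(w), max(w)
            out[k] = (v - lo) / (hi - lo) if hi > lo else 0.0
        return out


scaler = RollingMinMax(window_size=3)
for v in [1.0, 5.0, 9.0, 2.0]:
    scaler.learn_one({"f": v})
# The window now holds [5, 9, 2]; the early extreme 1.0 has expired.
```

Recomputing min/max over the window is O(window) per feature; a monotonic-deque variant would bring that to amortized O(1), but the simple version is enough to illustrate the idea.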
---
Please note that for testing, I actually used the mondrian-test branch. The only difference is that this version does not include the fix for the regressor tree, as it is outside the scope of my thesis. Additionally, I temporarily disabled the depth update for the Mondrian tree to allow for faster execution.

I apologize for the oversight! I forgot to switch branches before running because I currently have some datasets processing in the background. If you'd prefer, I can switch to the
---
Co-authored-by: Max Halford <maxhalford25@gmail.com>
---
So far, all the fixes you have suggested have led to an accuracy improvement. So yes, I'd like to see if this is the case with this fix too :)
---
Here are the results. Just a few notes to ensure I am doing this correctly.

This is the version of River I tested:

These are the parameters I used for the test:

```python
SEEDS_TO_TEST = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
TREES_TO_TEST = 10
STEP_VAL = 1.0
DIRICHLET_VAL = 0.5
USE_AGGREGATION = True
```

However, the results are identical, and there is no difference between the two branches.
---
Just a minor observation regarding performance: since split_pure is usually set to False, doing the split check before computing the range and split time would provide a nice little performance boost. This is because calculating the range is the most computationally expensive part of a Mondrian tree. I haven't touched the Cython implementation, though, as I'm not very familiar with it yet.
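The reordering being suggested can be sketched as follows. This is illustrative plain Python with hypothetical helper names; the real logic lives in river's Mondrian tree and its Cython port.

```python
def node_is_pure(node):
    """A node is pure when all samples it holds share one class label."""
    return len(set(node["labels"])) <= 1


def range_extension(node, x):
    """Per-feature amount by which x extends the node's bounding box.

    Stands in for the expensive range computation discussed above.
    """
    return {
        k: max(0.0, node["min"][k] - v) + max(0.0, v - node["max"][k])
        for k, v in x.items()
    }


def split_candidate(node, x, split_pure=False):
    # Cheap purity check FIRST: with split_pure=False (the default), a pure
    # node can never split, so the expensive range computation is skipped.
    if not split_pure and node_is_pure(node):
        return None
    return range_extension(node, x)


pure_node = {"labels": [1, 1, 1], "min": {"f": 0.0}, "max": {"f": 1.0}}
mixed_node = {"labels": [0, 1], "min": {"f": 0.0}, "max": {"f": 1.0}}
assert split_candidate(pure_node, {"f": 2.0}) is None        # range never computed
assert split_candidate(mixed_node, {"f": 2.0}) == {"f": 1.0}
```

The saving scales with how often learning reaches a pure leaf, which matches the depth-dependent gains discussed below.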
---
That's a good observation. It brings a small ~5% improvement when I profile it. I've opened a PR. Out of curiosity, do you have access to Claude Code? It's quite powerful for running this kind of benchmark.
---
I wouldn't expect a massive improvement since pure nodes are usually the leaves, so the performance gain depends heavily on the tree depth. For a depth of 100, it's probably only around a 1% gain. However, for highly unbalanced data where a large chunk of early samples share the same class, you would definitely save on those range computations. I only brought it up because I noticed the discrepancy while reading the code line by line and comparing it with onelearn.

Regarding Claude Code, no, I don't have access to it. I mostly use AI for writing small Python scripts and fixing my grammar.

P.S. Is there anything else I need to do for this PR?
---
Makes sense, thanks for the additional explanations. All improvements, even minor ones, are acceptable! We can merge this PR once the CI is green, and for that the doctests have to be updated. Could you change the doctests to use
When `split_pure=False` (the default), check node purity before computing range extensions. Pure nodes will never split, so the expensive `range_extension_c` call can be skipped entirely. Benchmarks show ~3-5% speedup on datasets with 50+ features. Credit: @shi-zq for the observation in #1835. Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
---
Thanks for the clarification! I was actually just about to ask how to get the CI green. I took a look at the sample, added an explanation for why we should use MinMaxScaler, and updated the final value in the doctest. One minor detail: for the Bananas dataset, your script outputs 84.04%, but when I run those same settings in the doctest, I get 84.05%. Also, there is a huge improvement for the regressor, going from 0.279747 to 0.427341.
---
For the regression it's not an improvement (lower is better). But it's probably just noise and is acceptable.
---
Thank you for the PR! Now you're officially a contributor to the package :)
It resolves an issue with the leaf case in the split method (#1801) and fixes a shared reference bug in replant by properly copying bounding box dictionaries by value rather than by reference (#1834).
Since this is my first pull request, please let me know if I missed any project conventions or if anything needs to be adjusted. I'm happy to make changes