Speed up `on` by iguinn · Pull Request #48 · gipert/dbetto

iguinn · 2026-02-16T03:08:10Z

Changes to TextDB and Props to speed up repeated calls to on. After these changes, I was able to get it so that when calling on for each cycle, the longest time is taken by yaml in reading the data.

Note these break some features (that I'm not sure are important). If these features are important, then an alternative to avoid processing time associated with calling on frequently when no new data is being read is in this PR: #46

Do not use deepcopy in Props.add_to. This shouldn't have unintended side-effects because the output nested dict structure is being constructed anew rather than copying from the individual dicts; unintended consequences could only arise if one were to change one of the values inside of the dict.
on is using rglob to find paths; this replaces it with just checking if the path is valid. This means that wildcards will no longer work, and paths will not be found nested in other paths.
Do not call resolve on relative paths in TextDB.getitem. I think this could break using environment variables as a part of a relative path.
Do not call subst_vars in on. Subst_vars is called when initially reading in the AttrsDicts, and so calling it again should be redundant.

codecov · 2026-02-16T08:57:19Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 92.33%. Comparing base (5b17622) to head (dddf6ad).
⚠️ Report is 2 commits behind head on main.

Additional details and impacted files

@@            Coverage Diff             @@
##             main      #48      +/-   ##
==========================================
+ Coverage   91.72%   92.33%   +0.60%     
==========================================
  Files           6        6              
  Lines         568      587      +19     
==========================================
+ Hits          521      542      +21     
+ Misses         47       45       -2

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

gipert

I think we could merge this, I don't think it would break anything.

But it's important to document these limitations in the docstring. Could you do that please?

iguinn · 2026-04-28T22:08:10Z

I've updated the docstrings with limitations. I didn't add a note about the rglob one, since I don't think the definition here https://legend-exp.github.io/legend-data-format-specs/dev/metadata/#Specifying-metadata-validity-in-time-(and-system) implies an expectation that wildcards are allowed. If this PR is accepted, #46 can be closed

ggmarshall · 2026-04-28T23:16:24Z

Removing the deep copies worries me, I have a recollection of this being needed

ggmarshall · 2026-04-28T23:17:46Z

From memory without the deepcopy repeated calls to the database were not giving the expected results (not sure how well this is covered by our unit tests)

ggmarshall · 2026-04-28T23:23:28Z

Ok I found the commit : legend-exp/pylegendmeta@2e6e900 , and the reason was when calculating exposures we found the psd_statuses as they were nested were being changed by the dictionaries being mutable, here was our test case (old code now)

from legendmeta import LegendMetadata

lmeta = LegendMetadata("./legend-metadata")
chmap = lmeta.channelmap()

exposure = 0
exposure_psd = 0

for period, runs in lmeta.dataprod.config.analysis_runs.items():
    for run in runs:
        if "phy" not in lmeta.dataprod.runinfo[period][run]:
            continue

        channels = lmeta.channelmap(on=lmeta.dataprod.runinfo[period][run].phy.start_key)
        for hpge, info in channels.group("system").geds.map("name").items():

            if info.analysis.usability == "on":
                this_exposure = (
                    lmeta.dataprod.runinfo[period][run].phy.livetime_in_s / (60 * 60 * 24 * 365)
                    * info.production.mass_in_g / 1000
                )
                exposure += this_exposure

            pinfo = info.analysis.psd
            if "is_bb_like" in pinfo and pinfo.is_bb_like != "missing":
                if all([pinfo.status[psd.strip()] == "valid" for psd in pinfo.is_bb_like.split("&")]):
                    exposure_psd += this_exposure


print(exposure, "kg yr")
print(exposure_psd, "kg yr")

gipert · 2026-04-29T08:32:48Z

yes let's try to reconstruct this problem, i also remember it now. we should try to cover it in the unit tests

iguinn · 2026-04-29T14:31:04Z

Deep copying has a huge impact on performance when you are repeatedly reading the same thing, this has an order of magnitude effect:

The above was calling channelmap and on, timing the first read (that's the first time), timeit-ing repeated reads with the same timestamp (which is equivalent to looping over cycles and using each cycle's timestamp), and then profiling. The timeit is highlighted because this is the issue being solved with this. The profile shows that the deepcopy is indeed responsible for the slow down.

ggmarshall · 2026-04-29T15:21:05Z

I agree it would be good to remove it but need a way to make sure that we are not going to run in mutable dictionary issues again

ggmarshall · 2026-04-29T15:27:02Z

Im also surprised hit pars is so much faster than channelmap despite it being a larger dictionary with a more complex structure? We should look into why

ggmarshall · 2026-04-29T17:09:15Z

Claude suggests we use a lightweight copy:

@staticmethod
def _copy(v):
    if isinstance(v, dict):
        return type(v)({k: Props._copy(vv) for k, vv in v.items()})
    if isinstance(v, list):
        return [Props._copy(vv) for vv in v]
    return v  # str, int, float, bool, None — immutable

Props._copy(b[key])

ggmarshall · 2026-04-29T17:13:31Z

Actually I like your solution more

iguinn · 2026-04-29T17:27:19Z

Added a test to catch this problem.

So, I tried an equivalent solution before the one I just pushed, and it helps but not much.

A solution that works well is to minimize the number of copies. The logic is to do a shallow-copy of a whole branch if nothing in the branch is getting updated. If something new is added, we also can do a shallow-copy. If we are doing an update of a branch, then we will create a new dict to avoid update side-effects, and then repeat the previous logic to minimize how deep we have to recurse and shallow-copy whenever that is acceptable. This performs quite well, and passes the test:

Note that I think this will create some potentially unpredictable/confusing results if you try modifying the result of on. Basically, it will update whatever underlying dict was the source of the information, unless it recursed all the way to the value itself, in which case it won't replace even in the underlying dict. I don't think we expect people to modify the results of on, but it's possible we want a way to enforce that on should be read-only somehow (would be much easier in C++ with const...).

gipert · 2026-04-29T17:39:36Z

can we raise an exception if AttrsDict.__setattr__() is called?

iguinn · 2026-04-30T17:52:19Z

Im also surprised hit pars is so much faster than channelmap despite it being a larger dictionary with a more complex structure? We should look into why

This is just because it's doing a shallow copy of the full dictionary without having to walk through any of it, while channelmap has to update a few things (note that the first read takes 3 s, much longer than channelmap).

Otherwise, I added AttrsDict_RO as a read only version of AttrsDict, by raising an exception when __setattr__ and __setitem__ are called, and returning a AttrsDict_RO instead of AttrsDict when __getattr__ and __getitem__ are called. Also deepcopy on an AttrsDict_RO returns an AttrsDict since this removes the risk of side-effects. It's quite easy to work around these restrictions, python being python, but it's harder to do so accidentally. on is updated to return a read only dict.

This change will break channelmap in pylegendmeta; this is because it calls on on multiple databases and then combines them. The fix is to deepcopy the first one we read and then combine the others into it. This performs ~3x faster than the old (since it's not doing quite as much deep copying), but the added deep copy is ~2x slower than my test above (but the fast version I think has some side effects that we don't want). I'm not sure if there are other situations where people are trying to modify their AttrsDicts from on, but if they are this would also break.

iguinn · 2026-05-01T14:07:14Z

Hmm well the AttrsDict_RO solution has some problems. Pickling in particular is going to be hard (impossible?) to make work, since the class derives from dict. I will investigate other solutions to this...

…sDict

iguinn · 2026-05-01T17:36:22Z

Ok attempt 2 at readonly: now I've added a __readonly__ attribute to AttrsDict. This will raise an Error if you call __setitem__. When you set __readonly__ it will propagate to any child AttrsDicts; this takes some time, but it's not terrible; it's barely noticeable for channelmap, while for the hit pars, it goes from 8 ms to 25 ms, but given how fast it was before this isn't a problem. It will also emit a warning if you try to change __readonly__ from True to False. I have also made it so that __setstate__ (which is used for deepcopy and pickling) will set __readonly__ to False regardless of what the original object was since the copy won't have unintended consequences. on will set __readonly__ to True.

I have also added more tests than my first attempt. Note that if not explicitly implmented, deepcopy uses __getstate__ and __setstate__ if not explicitly defined, so by testing deep copy it is implicitly testing pickle as well.

gipert · 2026-05-04T08:58:48Z

so are we sure now that this pr does not introduce any regression? (see snipped posted by george)

iguinn · 2026-05-06T13:15:46Z

As sure as we are in the completeness of the tests. For what it's worth, I reran the snippet from George and it worked correctly. I've also run a few tests with the querying functions which use this quite a bit under the hood and haven't seen any problems yet.

iguinn and others added 5 commits February 15, 2026 18:45

Do not deepcopy in add_to

9a3da7a

Do not use rglob to find paths in on

077b460

Do not try call resolve on relative paths in TextDB.__getitem__

d153e7d

Do not call subst_vars in on

6ed6fd7

style: pre-commit fixes

c9b92c7

iguinn mentioned this pull request Feb 16, 2026

Add is_valid to AttrsDict #46

Closed

gipert self-requested a review March 11, 2026 08:54

gipert requested changes Mar 11, 2026

View reviewed changes

iguinn added 3 commits April 28, 2026 14:53

Merge branch 'main' of https://github.com/gipert/dbetto into speed

3516c49

Improved docstrings + test coverage

773beec

Merge branch 'speed' of https://github.com/iguinn/dbetto into speed

3af6db6

iguinn and others added 2 commits April 29, 2026 08:04

Added test for calling on with deeply nested dicts

27a4c4d

style: pre-commit fixes

2aaab61

iguinn and others added 2 commits April 29, 2026 09:16

Props.add_to constructs new dictionary with minimal copying

b2c3c1b

style: pre-commit fixes

380a3b0

iguinn added 2 commits April 30, 2026 10:50

Add AttrsDict_RO, tests, and return AttrsDict_RO with on

081ae25

Merge branch 'speed' of https://github.com/iguinn/dbetto into speed

e456859

iguinn mentioned this pull request Apr 30, 2026

Use AttrsDict_RO for channelmap legend-exp/pylegendmeta#132

Closed

iguinn and others added 3 commits May 1, 2026 10:19

Remove AttrsDict_RO class; instead add __readonly__ attribute to Attr…

578bc30

…sDict

style: pre-commit fixes

d3909ed

pre-commit

dddf6ad

iguinn mentioned this pull request May 1, 2026

channelmap returns readonly AttrsDict legend-exp/pylegendmeta#133

Merged

gipert approved these changes May 4, 2026

View reviewed changes

gipert merged commit d5f8955 into gipert:main May 6, 2026
14 checks passed

Conversation

iguinn commented Feb 16, 2026

Uh oh!

codecov Bot commented Feb 16, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

gipert left a comment

Choose a reason for hiding this comment

Uh oh!

iguinn commented Apr 28, 2026

Uh oh!

ggmarshall commented Apr 28, 2026

Uh oh!

ggmarshall commented Apr 28, 2026

Uh oh!

ggmarshall commented Apr 28, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

gipert commented Apr 29, 2026

Uh oh!

iguinn commented Apr 29, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

ggmarshall commented Apr 29, 2026

Uh oh!

ggmarshall commented Apr 29, 2026

Uh oh!

ggmarshall commented Apr 29, 2026

Uh oh!

ggmarshall commented Apr 29, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

iguinn commented Apr 29, 2026

Uh oh!

gipert commented Apr 29, 2026

Uh oh!

iguinn commented Apr 30, 2026

Uh oh!

iguinn commented May 1, 2026

Uh oh!

iguinn commented May 1, 2026

Uh oh!

gipert commented May 4, 2026

Uh oh!

iguinn commented May 6, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

codecov Bot commented Feb 16, 2026 •

edited

Loading

ggmarshall commented Apr 28, 2026 •

edited

Loading

iguinn commented Apr 29, 2026 •

edited

Loading

ggmarshall commented Apr 29, 2026 •

edited

Loading