Skip to content

Add is_valid to AttrsDict#46

Closed
iguinn wants to merge 8 commits into
gipert:mainfrom
iguinn:main
Closed

Add is_valid to AttrsDict#46
iguinn wants to merge 8 commits into
gipert:mainfrom
iguinn:main

Conversation

@iguinn
Copy link
Copy Markdown
Contributor

@iguinn iguinn commented Feb 13, 2026

Track validity file and yaml/json files used to create AttrsDict when calling on. Added is_valid to check if a given timestamp would produce the same AttrsDict so that the user can avoid unnecessary calls to on.

…k if AttrsDict is valid for a given timestamp
@gipert
Copy link
Copy Markdown
Owner

gipert commented Feb 14, 2026

hey ian, i think on() is stupidly slow and can easily sped up. maybe more useful?

@codecov
Copy link
Copy Markdown

codecov Bot commented Feb 14, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 91.18%. Comparing base (09ec900) to head (60f721c).
⚠️ Report is 15 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main      #46      +/-   ##
==========================================
- Coverage   91.77%   91.18%   -0.60%     
==========================================
  Files           6        6              
  Lines         547      601      +54     
==========================================
+ Hits          502      548      +46     
- Misses         45       53       +8     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Adds validity/file provenance tracking to AttrsDict instances returned by TextDB.on(...) and introduces AttrsDict.is_valid(...) to let callers check whether a different timestamp would resolve to the same underlying file selection (to avoid redundant on() calls).

Changes:

  • Pass validity file path + resolved file list into the AttrsDict created by TextDB.on(...).
  • Extend AttrsDict to carry __validity_files__ / __files__ through recursion, mapping, merging, and pickling.
  • Add AttrsDict.is_valid(...) and corresponding tests.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 3 comments.

File Description
tests/test_textdb.py Adds assertions covering the new AttrsDict.is_valid(...) behavior.
src/dbetto/textdb.py Constructs the AttrsDict returned by on() with validity/file provenance.
src/dbetto/attrsdict.py Adds provenance fields + is_valid, propagates metadata in operations, updates pickling state.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread src/dbetto/textdb.py
Comment thread src/dbetto/attrsdict.py Outdated
Comment thread src/dbetto/attrsdict.py Outdated
@iguinn
Copy link
Copy Markdown
Contributor Author

iguinn commented Feb 15, 2026

hey ian, i think on() is stupidly slow and can easily sped up. maybe more useful?

From profiling it, reading/parsing the yaml files is what's taking most of the time. I think to speed that part up, maybe switching to json would help (https://stackoverflow.com/questions/46774298/why-is-parsing-yaml-slower-than-an-equivalent-larger-amount-of-json); I think for the very large files like pargen files, human readability matters less and speed matters more. I tested another yaml package (pyyaml12) and found it was about the same as cyaml.

Otherwise, reducing the number of reads is the only way to speed it up. That was the idea behind this approach, but I agree there could be better approaches by caching the files read by on so that you don't need to reread the files when you re-call on, which would be both faster and easier to use than this is_valid approach...I can think of a few ways you could do this
1) Every time a file is read cache it; I think this would be the fastest since you would never have to reread a file again, but it could run into memory issues (e.g. pargen fully replaces the file contents each run, and would get quite large if you were to scan through the full dataset)
2) Every time you call on, cache only the files that were used, and clear any that weren't. This would work quite well for usage patterns that are scanning through metadata sequentially and would avoid the memory problems since you are typically replacing the large files meaning they would get cleared. There might be some usage patterns where this is slow, though (e.g., if someone is alternating between cal and phy and they have different file histories or something)
3) Set some other limit on cached files that causes them to clear at some point...Maybe if it isn't used for 5 calls of on you clear it, or something like that...This would be a little more annoying to implement though

I'll maybe give approach 2 a shot, shouldn't actually be that hard, and since it would keep this all under the hood we could change it later without causing problems.

Edit: Actually I misunderstood how on was working, #1 is already what's happening implicitly, so I tried profiling again and it's deepcopy, called in Props.add_to that is taking the bulk of the time on repeated reads (and yaml is the bulk of the time on the first read). I tried removing the deepcopy (side-effects be damned), and the limiting steps seem to be Props.subst_vars and a lot of the path arithmetic in TextDB.__getitem__.

@iguinn iguinn mentioned this pull request Feb 16, 2026
@iguinn
Copy link
Copy Markdown
Contributor Author

iguinn commented Feb 16, 2026

Opened a second pull request to try to actually speed up on here #48. However, that pull request may be a little risky as it reverts a few features related to wildcards and environment variables and such, and it is potentially less safe.

@gipert
Copy link
Copy Markdown
Owner

gipert commented Mar 11, 2026

Is this still needed if we merge #48?

@iguinn
Copy link
Copy Markdown
Contributor Author

iguinn commented Apr 28, 2026

This is not needed if we merge #48

@gipert gipert closed this May 6, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants