feat: add chunked histograms #685
Assisted-by: Kimi-K2.6
Signed-off-by: Henry Schreiner <henryfs@princeton.edu>
Pull request overview
This PR introduces a new ChunkedHist implementation to store histograms with categorical (chunk) axes as a dict of dense backing arrays keyed by categorical values, avoiding repeated dense reallocations when categories grow (as in issue #684).
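The storage idea described above can be illustrated with a minimal sketch in plain NumPy. This is not the code from `src/hist/chunked.py`; the class `ChunkedCounts` and its methods are hypothetical names invented for illustration, and a single regular dense axis is assumed.

```python
import numpy as np


class ChunkedCounts:
    """Toy stand-in for the dict-of-dense-arrays idea (hypothetical, not the PR's class)."""

    def __init__(self, n_bins, lo, hi):
        self.edges = np.linspace(lo, hi, n_bins + 1)
        self.chunks = {}  # categorical key -> dense counts array

    def fill(self, key, values):
        counts, _ = np.histogram(values, bins=self.edges)
        if key not in self.chunks:
            # A new category allocates only its own chunk;
            # existing chunks are neither copied nor resized.
            self.chunks[key] = np.zeros(len(self.edges) - 1, dtype=np.int64)
        self.chunks[key] += counts

    def to_dense(self):
        # Materialize one dense (n_categories, n_bins) array on demand.
        keys = sorted(self.chunks)
        if not keys:
            return keys, np.zeros((0, len(self.edges) - 1), dtype=np.int64)
        return keys, np.stack([self.chunks[k] for k in keys])


h = ChunkedCounts(4, 0.0, 4.0)
h.fill("a", [0.5, 1.5, 1.7])
h.fill("b", [3.2])
h.fill("a", [0.1])
keys, dense = h.to_dense()
```

The point of the layout is that adding category `"b"` never touches the array already stored for `"a"`; only `to_dense()` pays the cost of building one contiguous array.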
Changes:
- Added `src/hist/chunked.py` implementing `ChunkedHist`, chunk-key selection (including wildcard support for `StrCategory`), materialization via `to_hist()`, and merging via `+`/`+=`.
- Added `tests/test_chunked.py` covering construction, filling, selection, merging, materialization, and basic utility behaviors.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 6 comments.
| File | Description |
|---|---|
| `src/hist/chunked.py` | Implements the `ChunkedHist` data structure, fill/materialize/select/merge logic, and helper utilities for chunk-key normalization and dense-view accumulation. |
| `tests/test_chunked.py` | Adds a comprehensive test suite for the new `ChunkedHist` API and expected behaviors. |
pfackeldey left a comment
Looks good to me!
There are a few small things where I'm not sure what your preferences are, @henryiii, e.g., allowing wildcard matching of categories in `__getitem__`, or the fix in `_save_chunk_view` (which sounds like something that should be fixed...).
`ChunkedHist` now always owns saved chunk arrays (no aliasing), `normalize_chunk_selection` accepts numpy/scalar-like values, `to_hist()` preserves declared categorical keys even with empty chunks, wildcard selections with no matches now return an empty `ChunkedHist` instead of raising, and `__repr__` reflects actual axis representations instead of hardcoding `growth=True`. I also added regression tests covering no-alias merge behavior, empty wildcard selection, numpy scalar selection in `__getitem__`, declared-category preservation on materialization, and repr growth handling.

Assisted-by: Copilot:GPT-5.3-Codex
Signed-off-by: Henry Schreiner <henryfs@princeton.edu>
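Two of the behaviors listed in this commit, owning saved chunk arrays and returning an empty result for an unmatched wildcard, might look roughly like the sketch below. The helpers `save_chunk` and `select_chunks` are hypothetical and are not the PR's `_save_chunk_view` or `__getitem__`. The commit does not say which wildcard syntax is supported, so shell-style patterns via the standard library's `fnmatch` are an assumption here.

```python
import fnmatch

import numpy as np


def save_chunk(chunks, key, view):
    # Store a copy so the saved chunk never aliases the caller's buffer
    # (for example a scratch array that is reused on the next fill).
    chunks[key] = np.array(view, copy=True)


def select_chunks(chunks, pattern):
    # Shell-style matching is an assumption; a pattern with no matches
    # yields an empty dict instead of raising.
    return {k: v for k, v in chunks.items() if fnmatch.fnmatchcase(k, pattern)}


chunks = {}
scratch = np.array([1, 2, 3])
save_chunk(chunks, "data_2018", scratch)
scratch[:] = 0  # reusing the scratch buffer must not change the saved chunk

matched = select_chunks(chunks, "data_*")
unmatched = select_chunks(chunks, "mc_*")
```

Copying on save costs one allocation per new chunk, but it removes the class of bugs where a merge or a later fill silently mutates data held by another histogram.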
Close #684.
🤖 Suggested followups
Here are the natural followups, roughly ordered by impact vs. effort:
1. Top-level export (`from hist import ChunkedHist`)

   Currently only `from hist.chunked import ChunkedHist` works. Adding it to `hist/__init__.py` is trivial and would make it discoverable.

2. Array-valued chunk axes in `fill()`

   Right now `fill()` requires scalar chunk-axis values. Supporting array-valued chunk axes would group by chunk key and dispatch to multiple chunks in one call. This is a common user expectation.

   Tradeoff: More complex because you need to group the dense-axis data by chunk key and call `dense_hist.fill()` per group.

3. Native chunked UHI serialization

   Right now round-tripping through JSON/bytes requires `to_hist()` first, which is expensive for large histograms. A native format that serializes chunk metadata + individual chunk arrays would avoid materialization.

4. Custom `__getstate__`/`__setstate__`

   For pickle/dill interop. Without this, pickling a `ChunkedHist` won't work correctly (it has unpicklable internal state like the scratch hist reference).

5. Reporter-style operations: `*`, `-`, `/`, `**`

   Only `+`/`+=` are implemented. Multiplication, subtraction, division could be useful for e.g. weighted subtraction of backgrounds.

6. Relax `Mean`/`WeightedMean` storage restriction

   The scratch-histogram reuse trick is trickier with structured storages, but it's solvable with per-field accumulation.

7. Support transformed `Regular` axes

   Currently `Regular(..., transform=...)` is rejected. This is just a validation gate that can be lifted once tested.

8. Thread-safe filling

   The current `fill()` reuses a single scratch `Hist` per instance. Parallel filling from multiple threads would race on that scratch buffer. Options: … `fill()`; … `Hist` per fill (slower but simpler).

9. `chunk_view()` on missing keys returns zeros instead of raising

   Currently missing chunks raise `KeyError`. Some workflows might prefer getting a zeroed view for missing chunks (like `dict.get()`).

10. Documentation page

    A short user-guide section explaining when to reach for `ChunkedHist` vs. plain `Hist` vs. daskhist.

My suggestion for priority order
What resonates with you?
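For followup 2, the grouping step could be sketched as below. This is only an illustration in plain NumPy, assuming one dense axis with explicit bin edges. `fill_grouped` is a hypothetical helper, and the per-group `np.histogram` call stands in for the per-group `dense_hist.fill()` mentioned in the tradeoff.

```python
import numpy as np


def fill_grouped(chunks, edges, keys, values):
    """Group values by chunk key and accumulate into one dense array per key."""
    keys = np.asarray(keys)
    values = np.asarray(values, dtype=float)
    # inverse[j] is the index into unique_keys of the key for entry j.
    unique_keys, inverse = np.unique(keys, return_inverse=True)
    for i, key in enumerate(unique_keys):
        counts, _ = np.histogram(values[inverse == i], bins=edges)
        key = key.item()  # numpy scalar -> plain Python dict key
        if key in chunks:
            chunks[key] += counts
        else:
            chunks[key] = counts.copy()


grouped = {}
edges = np.array([0.0, 1.0, 2.0])
fill_grouped(grouped, edges, ["a", "b", "a", "b", "b"], [0.5, 0.5, 1.5, 1.5, 1.2])
```

The boolean mask per key makes this O(n × number of distinct keys); sorting by `inverse` and splitting once would scale better when a single call touches many chunks.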
🤖 Assisted-by: Kimi-K2.6