feat: add chunked histograms by henryiii · Pull Request #685 · scikit-hep/hist

henryiii · 2026-05-06T20:53:32Z

Close #684.

🤖 Suggested followups

Here are the natural followups, roughly ordered by impact vs. effort:

1. Top-level export (`from hist import ChunkedHist`)

Currently only `from hist.chunked import ChunkedHist` works. Adding it to `hist/__init__.py` is trivial and would make it discoverable.

2. Array-valued chunk axes in `fill()`

Right now `fill()` requires scalar chunk-axis values:

h.fill(x=[0.1, 0.2], cat="a")   # OK
h.fill(x=[0.1, 0.2], cat=["a", "b"])  # raises

Supporting array-valued chunk axes would group by chunk key and dispatch to multiple chunks in one call. This is a common user expectation.

Tradeoff: More complex because you need to group the dense-axis data by chunk key and call dense_hist.fill() per group.
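A rough sketch of the grouping step, with plain lists standing in for the real arrays. `group_by_chunk_key` is a hypothetical helper, not something in this PR, and the per-group call into the dense fill is left out:

```python
from collections import defaultdict


def group_by_chunk_key(x, cat):
    """Split dense-axis values into one list per chunk key.

    Hypothetical helper: `x` holds dense-axis values and `cat` holds one
    chunk key per entry (same length as `x`).
    """
    if len(x) != len(cat):
        raise ValueError("x and cat must have the same length")
    groups = defaultdict(list)
    for value, key in zip(x, cat):
        groups[key].append(value)
    return dict(groups)


# Each group would then go to the per-chunk dense fill, one call per key.
groups = group_by_chunk_key([0.1, 0.2, 0.3], ["a", "b", "a"])
```

A real implementation would presumably do the grouping with array operations rather than a Python loop; the loop is only here to show the shape of the dispatch.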

3. Native chunked UHI serialization

Right now round-tripping through JSON/bytes requires to_hist() first, which is expensive for large histograms. A native format that serializes chunk metadata + individual chunk arrays would avoid materialization.
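To make the idea concrete, here is a minimal sketch of a per-chunk round trip. The layout is made up for illustration (it is not the UHI schema), and `dump_chunks` / `load_chunks` are hypothetical names:

```python
import json


def dump_chunks(axes_meta, chunks):
    """Serialize chunk metadata plus each chunk's values, chunk by chunk.

    Assumed layout: `axes_meta` is any JSON-compatible description of the
    axes, and `chunks` maps a tuple chunk key to a flat list of bin values.
    """
    payload = {
        "axes": axes_meta,
        # JSON object keys must be strings, so store each key as a list.
        "chunks": [
            {"key": list(key), "values": list(values)}
            for key, values in chunks.items()
        ],
    }
    return json.dumps(payload)


def load_chunks(text):
    """Inverse of dump_chunks: rebuild the metadata and the chunk dict."""
    payload = json.loads(text)
    chunks = {tuple(c["key"]): c["values"] for c in payload["chunks"]}
    return payload["axes"], chunks
```

Only chunks that exist are written, so nothing has to be materialized into a dense histogram first.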

4. Custom `__getstate__` / `__setstate__`

For pickle/dill interop. Without this, pickling a ChunkedHist won't work correctly (it has unpicklable internal state like the scratch hist reference).
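One possible shape for this, sketched on a stand-in class. `ChunkedSketch` and its attribute names are hypothetical; the only point is that the scratch reference is dropped on the way out and rebuilt lazily afterwards:

```python
import pickle


class ChunkedSketch:
    """Stand-in for ChunkedHist; names and layout are assumptions."""

    def __init__(self):
        self.chunks = {}      # chunk key -> list of bin values
        self._scratch = None  # reusable buffer, not worth persisting

    def __getstate__(self):
        state = self.__dict__.copy()
        state["_scratch"] = None  # drop the scratch reference
        return state

    def __setstate__(self, state):
        self.__dict__.update(state)
        self._scratch = None      # would be rebuilt lazily on the next fill


if __name__ == "__main__":
    h = ChunkedSketch()
    h.chunks[("a",)] = [1.0, 2.0]
    h._scratch = lambda: None  # something pickle cannot handle on its own
    restored = pickle.loads(pickle.dumps(h))
    print(restored.chunks, restored._scratch)
```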

5. `Reporter`-style operations: `*`, `-`, `/`, `**`

Only `+` / `+=` are implemented. Multiplication, subtraction, and division could be useful, e.g. for weighted subtraction of backgrounds.
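As an illustration of the background-subtraction case, here is a sketch over plain dicts of lists. Both helpers are hypothetical, and treating a chunk that is missing on one side as all zeros is an assumption, not behavior defined by this PR:

```python
def scale_chunks(chunks, factor):
    """Multiply every bin in every chunk by a scalar."""
    return {key: [v * factor for v in values] for key, values in chunks.items()}


def subtract_chunks(left, right):
    """Compute left - right, treating a missing chunk as all zeros."""
    result = {}
    for key in left.keys() | right.keys():
        a = left.get(key)
        b = right.get(key)
        if a is None:
            result[key] = [-v for v in b]
        elif b is None:
            result[key] = list(a)
        else:
            result[key] = [x - y for x, y in zip(a, b)]
    return result


data = {("a",): [4.0, 6.0]}
background = {("a",): [2.0, 2.0], ("b",): [1.0, 2.0]}
# Weighted subtraction: data - 0.5 * background
diff = subtract_chunks(data, scale_chunks(background, 0.5))
```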

6. Relax `Mean` / `WeightedMean` storage restriction

The scratch-histogram reuse trick is trickier with structured storages, but it's solvable with per-field accumulation.

7. Support transformed `Regular` axes

Currently `Regular(..., transform=...)` is rejected. This is just a validation gate that can be lifted once tested.

8. Thread-safe filling

The current fill() reuses a single scratch Hist per instance. Parallel filling from multiple threads would race on that scratch buffer. Options:

- A thread-local scratch buffer
- A lock around `fill()`
- Remove the scratch buffer entirely and create a temporary `Hist` per fill (slower but simpler)
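The first option could look roughly like this. `ScratchPerThread` is a hypothetical holder, and a plain list stands in for the scratch `Hist`:

```python
import threading


class ScratchPerThread:
    """Hypothetical holder giving each thread its own scratch buffer."""

    def __init__(self, size):
        self._size = size
        self._local = threading.local()

    def get(self):
        # Attributes on threading.local() are private to the calling thread.
        buf = getattr(self._local, "buf", None)
        if buf is None:
            buf = [0.0] * self._size
            self._local.buf = buf
        return buf
```

This only removes the race on the scratch buffer itself. Accumulating the result into shared chunk storage would presumably still need its own synchronization.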

9. `chunk_view()` on missing keys returns zeros instead of raising

Currently missing chunks raise KeyError. Some workflows might prefer getting a zeroed view for missing chunks (like dict.get()).
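A sketch of the opt-in variant, again over a plain dict. The helper name and the `default_zeros` flag are hypothetical, not part of the current `chunk_view()` signature:

```python
def chunk_view_or_zeros(chunks, key, n_bins, default_zeros=False):
    """Return the stored chunk, or zeros / KeyError when the key is missing.

    Assumptions: `chunks` maps chunk key -> list of bin values, and
    `n_bins` is the flattened size of one dense chunk.
    """
    try:
        return chunks[key]
    except KeyError:
        if default_zeros:
            # A fresh list, so writing to it never creates a chunk by accident.
            return [0.0] * n_bins
        raise
```

Keeping the raising behavior as the default would leave existing callers unaffected.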

10. Documentation page

A short user-guide section explaining when to reach for ChunkedHist vs. plain Hist vs. dask hist.

My suggestion for priority order

1. Top-level export (trivial, high visibility)
2. Pickle support (needed for real workflows)
3. Array-valued chunk axes (biggest UX improvement)
4. Native serialization (needed for large-scale I/O)
5. Thread safety (needed for parallel analysis)

What resonates with you?

🤖 Assisted-by: Kimi-K2.6

Assisted-by: Kimi-K2.6
Signed-off-by: Henry Schreiner <henryfs@princeton.edu>

Copilot

Pull request overview

This PR introduces a new ChunkedHist implementation to store histograms with categorical (chunk) axes as a dict of dense backing arrays keyed by categorical values, avoiding repeated dense reallocations when categories grow (as in issue #684).

Changes:

- Added `src/hist/chunked.py` implementing `ChunkedHist`, chunk-key selection (including wildcard support for `StrCategory`), materialization via `to_hist()`, and merging via `+` / `+=`.
- Added `tests/test_chunked.py` covering construction, filling, selection, merging, materialization, and basic utility behaviors.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 6 comments.

| File | Description |
| --- | --- |
| `src/hist/chunked.py` | Implements the `ChunkedHist` data structure, fill/materialize/select/merge logic, and helper utilities for chunk-key normalization and dense-view accumulation. |
| `tests/test_chunked.py` | Adds a comprehensive test suite for the new `ChunkedHist` API and expected behaviors. |


pfackeldey

Looks good to me!

There are a few small things where I'm not sure what your preferences are, @henryiii, e.g. allowing wildcard matching of categories in `__getitem__`, or the fix in `_save_chunk_view` (which sounds like something that should be fixed...).

`ChunkedHist` now always owns saved chunk arrays (no aliasing), `normalize_chunk_selection` accepts numpy/scalar-like values, `to_hist()` preserves declared categorical keys even with empty chunks, wildcard selections with no matches now return an empty `ChunkedHist` instead of raising, and `__repr__` reflects actual axis representations instead of hardcoding `growth=True`. I also added regression tests covering no-alias merge behavior, empty wildcard selection, numpy scalar selection in `__getitem__`, declared-category preservation on materialization, and repr growth handling.

Assisted-by: Copilot:GPT-5.3-Codex
Signed-off-by: Henry Schreiner <henryfs@princeton.edu>

henryiii force-pushed the henryiii/feat/chunked branch 2 times, most recently from 92b722a to 0d1fe90 Compare May 6, 2026 21:07

feat: add chunked histograms

bf707f1

Assisted-by: Kimi-K2.6
Signed-off-by: Henry Schreiner <henryfs@princeton.edu>

henryiii force-pushed the henryiii/feat/chunked branch from 0d1fe90 to bf707f1 Compare May 6, 2026 22:17

henryiii requested review from Copilot May 6, 2026 22:26

Copilot started reviewing on behalf of henryiii May 6, 2026 22:26

Copilot AI reviewed May 6, 2026


henryiii requested a review from pfackeldey May 7, 2026 02:49

pfackeldey reviewed May 7, 2026


henryiii and others added 2 commits May 13, 2026 11:47

style: pre-commit fixes

19e2165

henryiii wants to merge 3 commits into main from henryiii/feat/chunked


3 participants
