Skip to content

feat: add chunked histograms#685

Draft
henryiii wants to merge 3 commits into
mainfrom
henryiii/feat/chunked
Draft

feat: add chunked histograms#685
henryiii wants to merge 3 commits into
mainfrom
henryiii/feat/chunked

Conversation

@henryiii
Copy link
Copy Markdown
Member

@henryiii henryiii commented May 6, 2026

Close #684.

🤖 Suggested followups

Here are the natural followups, roughly ordered by impact vs. effort:


1. Top-level export (from hist import ChunkedHist)

Currently only from hist.chunked import ChunkedHist works. Adding it to hist/__init__.py is trivial and would make it discoverable.


2. Array-valued chunk axes in fill()

Right now fill() requires scalar chunk-axis values:

h.fill(x=[0.1, 0.2], cat="a")   # OK
h.fill(x=[0.1, 0.2], cat=["a", "b"])  # raises

Supporting array-valued chunk axes would group by chunk key and dispatch to multiple chunks in one call. This is a common user expectation.

Tradeoff: More complex because you need to group the dense-axis data by chunk key and call dense_hist.fill() per group.


3. Native chunked UHI serialization

Right now round-tripping through JSON/bytes requires to_hist() first, which is expensive for large histograms. A native format that serializes chunk metadata + individual chunk arrays would avoid materialization.


4. Custom __getstate__ / __setstate__

For pickle/dill interop. Without this, pickling a ChunkedHist won't work correctly (it has unpicklable internal state like the scratch hist reference).


5. Reporter-style operations: *, -, /, **

Only + / += are implemented. Multiplication, subtraction, division could be useful for e.g. weighted subtraction of backgrounds.


6. Relax Mean / WeightedMean storage restriction

The scratch-histogram reuse trick is trickier with structured storages, but it's solvable with per-field accumulation.


7. Support transformed Regular axes

Currently Regular(..., transform=...) is rejected. This is just a validation gate that can be lifted once tested.


8. Thread-safe filling

The current fill() reuses a single scratch Hist per instance. Parallel filling from multiple threads would race on that scratch buffer. Options:

  • A thread-local scratch buffer
  • A lock around fill()
  • Remove the scratch buffer entirely and create a temporary Hist per fill (slower but simpler)

9. chunk_view() on missing keys returns zeros instead of raising

Currently missing chunks raise KeyError. Some workflows might prefer getting a zeroed view for missing chunks (like dict.get()).


10. Documentation page

A short user-guide section explaining when to reach for ChunkedHist vs. plain Hist vs. dask hist.


My suggestion for priority order

  1. Top-level export (trivial, high visibility)
  2. Pickle support (needed for real workflows)
  3. Array-valued chunk axes (biggest UX improvement)
  4. Native serialization (needed for large-scale I/O)
  5. Thread safety (needed for parallel analysis)

What resonates with you?

🤖 Assisted-by: Kimi-K2.6

@henryiii henryiii force-pushed the henryiii/feat/chunked branch 2 times, most recently from 92b722a to 0d1fe90 Compare May 6, 2026 21:07
Assisted-by: Kimi-K2.6
Signed-off-by: Henry Schreiner <henryfs@princeton.edu>
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR introduces a new ChunkedHist implementation to store histograms with categorical (chunk) axes as a dict of dense backing arrays keyed by categorical values, avoiding repeated dense reallocations when categories grow (as in issue #684).

Changes:

  • Added src/hist/chunked.py implementing ChunkedHist, chunk-key selection (including wildcard support for StrCategory), materialization via to_hist(), and merging via + / +=.
  • Added tests/test_chunked.py covering construction, filling, selection, merging, materialization, and basic utility behaviors.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 6 comments.

File Description
src/hist/chunked.py Implements the ChunkedHist data structure, fill/materialize/select/merge logic, and helper utilities for chunk-key normalization and dense-view accumulation.
tests/test_chunked.py Adds a comprehensive test suite for the new ChunkedHist API and expected behaviors.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread src/hist/chunked.py Outdated
Comment thread src/hist/chunked.py Outdated
Comment thread src/hist/chunked.py
Comment thread src/hist/chunked.py Outdated
Comment thread src/hist/chunked.py Outdated
Comment thread tests/test_chunked.py
@henryiii henryiii requested a review from pfackeldey May 7, 2026 02:49
Copy link
Copy Markdown

@pfackeldey pfackeldey left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me!

There are a few small things, I'm not sure what your preference are here @henryiii, e.g., allowing wildcard matching of categories in getitem, or the fix in _save_chunk_view (which sounds like something that should be fixed...).

henryiii and others added 2 commits May 13, 2026 11:47
ChunkedHist now always owns saved chunk arrays (no aliasing), normalize_chunk_selection accepts numpy/scalar-like values, to_hist() preserves declared categorical keys even with empty
chunks, wildcard selections with no matches now return an empty ChunkedHist instead of raising, and __repr__ reflects actual axis representations instead of hardcoding growth=True. I
also added regression tests covering no-alias merge behavior, empty wildcard selection, numpy scalar selection in __getitem__, declared-category preservation on materialization, and
repr growth handling

Assisted-by: Copilot:GPT-5.3-Codex
Signed-off-by: Henry Schreiner <henryfs@princeton.edu>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[FEATURE] ChunkedHist

3 participants