
Conversation

@jhnwu3 (Collaborator) commented on Dec 17, 2025

This pull request introduces new benchmarking scripts for MIMIC-IV mortality prediction and refactors example scripts for clarity and maintainability. It also includes some library improvements and minor bug fixes. The main changes are grouped as follows:

1. New Benchmarking Scripts

  • Added two new benchmarking scripts, benchmark_workers_1.py and benchmark_workers_4.py, to the examples/benchmark_perf directory. These scripts measure dataset loading time, task processing time, cache sizes, and peak memory usage for MIMIC-IV mortality prediction with different numbers of workers, and offer optional memory-limit enforcement and detailed reporting (see the sketch below).
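As a rough illustration of what such a harness measures (not the PR's actual scripts; dataset/task names come from this PR's description, everything else is an assumption):

```python
"""Sketch of a worker benchmark of this shape; details are assumptions."""
import time
import tracemalloc

from pyhealth.datasets import MIMIC4EHRDataset, create_sample_dataset, get_dataloader
from pyhealth.tasks import MortalityPredictionMIMIC4  # assumed task class

NUM_WORKERS = 1  # benchmark_workers_4.py would presumably use 4


def timed(label, fn):
    """Run fn(), print its wall-clock time, and return its result."""
    start = time.perf_counter()
    result = fn()
    print(f"{label}: {time.perf_counter() - start:.2f}s")
    return result


if __name__ == "__main__":
    tracemalloc.start()
    dataset = timed("dataset load", lambda: MIMIC4EHRDataset(
        root="/data/mimiciv", cache_dir="./cache"))  # placeholder paths
    samples = timed("task processing", lambda: create_sample_dataset(
        dataset, task=MortalityPredictionMIMIC4()))
    # Assumes get_dataloader forwards num_workers to torch's DataLoader.
    loader = get_dataloader(samples, batch_size=32, num_workers=NUM_WORKERS)
    timed("one full iteration pass", lambda: sum(1 for _ in loader))
    _, peak = tracemalloc.get_traced_memory()
    # tracemalloc sees only Python-heap allocations in this process; the
    # real scripts may track process RSS instead.
    print(f"peak traced memory: {peak / 1e6:.1f} MB")
```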

2. Example Script Refactoring and Addition

  • Refactored examples/memtest.py to provide a more structured and readable example for using StageNet on MIMIC-IV, including data loading, task application, dataset splitting, model training, evaluation, and prediction inspection. The script is now wrapped in a __main__ block and includes detailed output for each step.
  • Added a new example script examples/benchmark_perf/memtest.py with similar functionality to the refactored memtest.py, demonstrating the end-to-end pipeline for StageNet on MIMIC-IV.
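The end-to-end shape of such an example looks roughly like the following; the pipeline steps come from this PR's description, but class names and signatures beyond it (the task class, split ratios, hyperparameters) are assumptions:

```python
"""Sketch of the refactored example's overall structure; not the PR's code."""
from pyhealth.datasets import (MIMIC4EHRDataset, create_sample_dataset,
                               get_dataloader, split_by_patient)
from pyhealth.models import StageNet
from pyhealth.tasks import MortalityPredictionMIMIC4  # assumed task class
from pyhealth.trainer import Trainer


def main() -> None:
    # Data loading; cache_dir is the new constructor parameter from this PR.
    dataset = MIMIC4EHRDataset(root="/data/mimiciv", cache_dir="./cache")
    # Task application via the helper this PR exports at package level.
    samples = create_sample_dataset(dataset, task=MortalityPredictionMIMIC4())
    # Patient-level split, then training, evaluation, and inspection.
    train, val, test = split_by_patient(samples, [0.8, 0.1, 0.1])
    model = StageNet(dataset=samples)  # model config args omitted/assumed
    trainer = Trainer(model=model)
    trainer.train(
        train_dataloader=get_dataloader(train, batch_size=32, shuffle=True),
        val_dataloader=get_dataloader(val, batch_size=32, shuffle=False),
        epochs=5,
    )
    print(trainer.evaluate(get_dataloader(test, batch_size=32, shuffle=False)))


if __name__ == "__main__":
    main()
```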

3. Library Improvements and API Changes

  • Changed imports from polars to narwhals in several dataset modules (bmd_hs.py, mimic3.py); narwhals is a thin compatibility layer over dataframe libraries, decoupling these modules from a specific backend (see the sketch after this list).
  • Updated the SampleDataset import in pyhealth/datasets/__init__.py to also include SampleBuilder and create_sample_dataset, making these utilities available at the package level.
  • Modified the MIMIC4EHRDataset constructor to accept a cache_dir parameter, allowing the cache directory to be customized when instantiating the dataset.
  • Updated the memory logging utility in mimic4.py to add a type: ignore comment for compatibility with static type checkers.
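To show what the polars → narwhals change buys, here is generic narwhals usage (not code from this PR): code written against narwhals' API runs unchanged whichever supported backend (polars, pandas, and others) supplies the native frame.

```python
import narwhals as nw
import polars as pl


def filter_adults(native_df):
    """Backend-agnostic filter: the same code works on polars or pandas."""
    df = nw.from_native(native_df)
    return df.filter(nw.col("age") >= 18).to_native()


# Called with a polars frame here, but a pandas.DataFrame works the same way.
print(filter_adults(pl.DataFrame({"age": [15, 30, 42]})))
```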

4. Bug Fixes

  • Fixed a bug in _filter_by_time_range_fast in pyhealth/data/data.py by ensuring that time comparisons use np.datetime64 for accurate filtering (illustrated below).
  • Updated the type annotation in covariate_label.py so that the cal_dataset parameter of the calibrate method now accepts an IterableDataset instead of a Subset, improving flexibility.
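As a hedged illustration of why the datetime coercion matters (the function name is real; the body below is a guess at its general shape, not the PR's code): coercing the range bounds with np.datetime64 keeps the comparison against a datetime64 array exact and type-consistent, whether callers pass ISO strings, datetime objects, or pandas Timestamps.

```python
import numpy as np


def filter_by_time_range(times: np.ndarray, start, end) -> np.ndarray:
    """Keep timestamps in [start, end). Sketch only.

    np.datetime64(...) accepts ISO strings, datetime.datetime, and
    pandas.Timestamp alike, so the elementwise comparison below always
    happens between matching datetime64 values.
    """
    start64, end64 = np.datetime64(start), np.datetime64(end)
    mask = (times >= start64) & (times < end64)
    return times[mask]


events = np.array(["2024-01-05", "2024-03-01", "2024-07-20"],
                  dtype="datetime64[ns]")
print(filter_by_time_range(events, "2024-02-01", "2024-06-30"))
```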

Copilot AI (Contributor) left a comment

Pull request overview

This PR introduces a significant architectural refactoring to improve memory efficiency by migrating from in-memory dataset processing to streaming/disk-backed processing using litdata, dask, and narwhals. The changes modernize the data pipeline while maintaining the public API surface through a compatibility layer.

Key Changes:

  • Replaced the in-memory SampleDataset with a streaming litdata.StreamingDataset architecture (see the sketch after this list)
  • Migrated data processing from polars LazyFrames to dask DataFrames for better memory management
  • Introduced SampleBuilder for processor fitting and the create_sample_dataset helper function
  • Updated all processors to accept an Iterable instead of a List of samples
  • Refactored BaseDataset to use caching and lazy evaluation throughout
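A tiny sketch of the litdata pattern the review describes (generic litdata usage, not this PR's code): samples are serialized into chunked files once, then streamed back without holding the whole dataset in memory.

```python
from litdata import StreamingDataset, StreamingDataLoader, optimize


def to_sample(index: int) -> dict:
    # In PyHealth's case each item would be a processed patient sample.
    return {"patient_id": index, "label": index % 2}


if __name__ == "__main__":
    # One-time pass: write samples into chunked files on disk.
    optimize(fn=to_sample, inputs=list(range(1000)),
             output_dir="cache/samples", chunk_bytes="64MB")
    # Later: stream samples back without materializing them in memory.
    dataset = StreamingDataset("cache/samples")
    loader = StreamingDataLoader(dataset, batch_size=32)
    print(next(iter(loader))["label"].shape)
```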

Reviewed changes

Copilot reviewed 66 out of 66 changed files in this pull request and generated 19 comments.

| File | Description |
| --- | --- |
| pyhealth/datasets/sample_dataset.py | Complete rewrite: introduces SampleBuilder, streaming SampleDataset, InMemorySampleDataset, and the create_sample_dataset helper |
| pyhealth/datasets/base_dataset.py | Major refactoring: adds a streaming parquet writer, dask-based data loading, and improved caching with content-addressed directories |
| pyhealth/datasets/utils.py | Updates get_dataloader to work with streaming datasets by calling set_shuffle |
| pyhealth/datasets/splitter.py | Updates all split functions to use dataset.subset() instead of torch.utils.data.Subset |
| pyhealth/processors/*.py | Changes fit method signatures from List[Dict] to Iterable[Dict[str, Any]] (sketched below) |
| pyhealth/models/*.py | Updates documentation examples to use create_sample_dataset instead of SampleDataset |
| tests/core/*.py | Updates all tests to the create_sample_dataset API and adapts them to streaming dataset behavior |
| pyproject.toml | Adds new dependencies: dask, litdata, pyarrow, narwhals; updates the polars version |
| examples/memtest.py | Refactored into a structured example with a __main__ guard |
| examples/benchmark_perf/*.py | Adds new benchmarking scripts for performance testing with different worker configurations |
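As an illustration of the new processor contract in the table above (a hypothetical processor, not one from the codebase): fit now takes any iterable of sample dicts and makes a single pass, so samples can stream from disk rather than sit in a list.

```python
from typing import Any, Dict, Iterable


class LabelProcessor:
    """Hypothetical processor following the Iterable-based fit signature."""

    def __init__(self) -> None:
        self.vocab: Dict[str, int] = {}

    def fit(self, samples: Iterable[Dict[str, Any]]) -> None:
        # One streaming pass; never requires the samples as a list.
        for sample in samples:
            self.vocab.setdefault(sample["label"], len(self.vocab))

    def process(self, value: str) -> int:
        return self.vocab[value]
```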


Comment on lines +444 to +455
def __iter__(self) -> Iterable[Dict[str, Any]]:  # type: ignore
    """Returns an iterator over all samples in the dataset.

    Returns:
        An iterator yielding processed sample dictionaries.
    """
    if self._shuffle:
        # Shuffle a copy so the underlying sample order is preserved.
        shuffled_data = self._data[:]
        random.shuffle(shuffled_data)
        return iter(shuffled_data)
    else:
        return iter(self._data)
Copilot AI commented Dec 17, 2025

The InMemorySampleDataset.__iter__ method creates a copy of self._data and shuffles it in place when shuffle is enabled. However, the shuffled copy is only used for the current iteration. If the dataloader iterates multiple times (multiple epochs), it will need to call __iter__ again, but since set_shuffle is called only once before dataloader creation (in get_dataloader), subsequent iterations will use the same shuffled order. This doesn't provide per-epoch shuffling like PyTorch's standard DataLoader shuffle behavior.
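For comparison, the standard PyTorch behavior the comment references (a generic PyTorch example, unrelated to this PR's code): with a map-style dataset, DataLoader(shuffle=True) draws a fresh permutation of indices every epoch without any extra bookkeeping.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

dataset = TensorDataset(torch.arange(8))
loader = DataLoader(dataset, batch_size=4, shuffle=True)

# Each `for` over the loader invokes the sampler again, so the order
# differs from epoch to epoch.
for epoch in range(2):
    print(epoch, [batch[0].tolist() for batch in loader])
```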

@jhnwu3 merged commit 4078fa3 into master on Dec 17, 2025
1 check passed
@jhnwu3 deleted the mem-9 branch on December 17, 2025 at 22:08