
Conversation

@jhnwu3 (Collaborator) commented on Dec 17, 2025

This pull request introduces new benchmarking scripts for MIMIC-IV mortality prediction and refactors example scripts for clarity and maintainability. It also includes some library improvements and minor bug fixes. The main changes are grouped as follows:

1. New Benchmarking Scripts

  • Added two new benchmarking scripts, benchmark_workers_1.py and benchmark_workers_4.py, to the examples/benchmark_perf directory. These scripts measure dataset loading time, task processing time, cache sizes, and peak memory usage for MIMIC-IV mortality prediction with different numbers of workers, and offer optional memory-limit enforcement and detailed reporting (see the sketch below).
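As a rough illustration of what such a harness measures (not the PR's actual scripts; dataset/task names come from this PR's description, everything else is an assumption):

```python
"""Sketch of a worker benchmark of this shape; details are assumptions."""
import time
import tracemalloc

from pyhealth.datasets import MIMIC4EHRDataset, create_sample_dataset, get_dataloader
from pyhealth.tasks import MortalityPredictionMIMIC4  # assumed task class

NUM_WORKERS = 1  # benchmark_workers_4.py would presumably use 4


def timed(label, fn):
    """Run fn(), print its wall-clock time, and return its result."""
    start = time.perf_counter()
    result = fn()
    print(f"{label}: {time.perf_counter() - start:.2f}s")
    return result


if __name__ == "__main__":
    tracemalloc.start()
    dataset = timed("dataset load", lambda: MIMIC4EHRDataset(
        root="/data/mimiciv", cache_dir="./cache"))  # placeholder paths
    samples = timed("task processing", lambda: create_sample_dataset(
        dataset, task=MortalityPredictionMIMIC4()))
    # Assumes get_dataloader forwards num_workers to torch's DataLoader.
    loader = get_dataloader(samples, batch_size=32, num_workers=NUM_WORKERS)
    timed("one full iteration pass", lambda: sum(1 for _ in loader))
    _, peak = tracemalloc.get_traced_memory()
    # tracemalloc sees only Python-heap allocations in this process; the
    # real scripts may track process RSS instead.
    print(f"peak traced memory: {peak / 1e6:.1f} MB")
```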

2. Example Script Refactoring and Addition

  • Refactored examples/memtest.py to provide a more structured and readable example for using StageNet on MIMIC-IV, including data loading, task application, dataset splitting, model training, evaluation, and prediction inspection. The script is now wrapped in a __main__ block and includes detailed output for each step.
  • Added a new example script examples/benchmark_perf/memtest.py with similar functionality to the refactored memtest.py, demonstrating the end-to-end pipeline for StageNet on MIMIC-IV.
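The end-to-end shape of such an example looks roughly like the following; the pipeline steps come from this PR's description, but class names and signatures beyond it (the task class, split ratios, hyperparameters) are assumptions:

```python
"""Sketch of the refactored example's overall structure; not the PR's code."""
from pyhealth.datasets import (MIMIC4EHRDataset, create_sample_dataset,
                               get_dataloader, split_by_patient)
from pyhealth.models import StageNet
from pyhealth.tasks import MortalityPredictionMIMIC4  # assumed task class
from pyhealth.trainer import Trainer


def main() -> None:
    # Data loading; cache_dir is the new constructor parameter from this PR.
    dataset = MIMIC4EHRDataset(root="/data/mimiciv", cache_dir="./cache")
    # Task application via the helper this PR exports at package level.
    samples = create_sample_dataset(dataset, task=MortalityPredictionMIMIC4())
    # Patient-level split, then training, evaluation, and inspection.
    train, val, test = split_by_patient(samples, [0.8, 0.1, 0.1])
    model = StageNet(dataset=samples)  # model config args omitted/assumed
    trainer = Trainer(model=model)
    trainer.train(
        train_dataloader=get_dataloader(train, batch_size=32, shuffle=True),
        val_dataloader=get_dataloader(val, batch_size=32, shuffle=False),
        epochs=5,
    )
    print(trainer.evaluate(get_dataloader(test, batch_size=32, shuffle=False)))


if __name__ == "__main__":
    main()
```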

3. Library Improvements and API Changes

  • Changed imports from polars to narwhals in several dataset modules (bmd_hs.py, mimic3.py); narwhals is a thin compatibility layer over dataframe libraries, decoupling these modules from a specific backend (see the sketch after this list).
  • Updated the SampleDataset import in pyhealth/datasets/__init__.py to also include SampleBuilder and create_sample_dataset, making these utilities available at the package level.
  • Modified the MIMIC4EHRDataset constructor to accept a cache_dir parameter, allowing the cache directory to be customized when instantiating the dataset.
  • Updated the memory logging utility in mimic4.py to add a type: ignore comment for compatibility with static type checkers.
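To show what the polars → narwhals change buys, here is generic narwhals usage (not code from this PR): code written against narwhals' API runs unchanged whichever supported backend (polars, pandas, and others) supplies the native frame.

```python
import narwhals as nw
import polars as pl


def filter_adults(native_df):
    """Backend-agnostic filter: the same code works on polars or pandas."""
    df = nw.from_native(native_df)
    return df.filter(nw.col("age") >= 18).to_native()


# Called with a polars frame here, but a pandas.DataFrame works the same way.
print(filter_adults(pl.DataFrame({"age": [15, 30, 42]})))
```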

4. Bug Fixes

  • Fixed a bug in _filter_by_time_range_fast in pyhealth/data/data.py by ensuring that time comparisons use np.datetime64 for accurate filtering (illustrated below).
  • Updated the type annotation in covariate_label.py so that the cal_dataset parameter of the calibrate method now accepts an IterableDataset instead of a Subset, improving flexibility.
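As a hedged illustration of why the datetime coercion matters (the function name is real; the body below is a guess at its general shape, not the PR's code): coercing the range bounds with np.datetime64 keeps the comparison against a datetime64 array exact and type-consistent, whether callers pass ISO strings, datetime objects, or pandas Timestamps.

```python
import numpy as np


def filter_by_time_range(times: np.ndarray, start, end) -> np.ndarray:
    """Keep timestamps in [start, end). Sketch only.

    np.datetime64(...) accepts ISO strings, datetime.datetime, and
    pandas.Timestamp alike, so the elementwise comparison below always
    happens between matching datetime64 values.
    """
    start64, end64 = np.datetime64(start), np.datetime64(end)
    mask = (times >= start64) & (times < end64)
    return times[mask]


events = np.array(["2024-01-05", "2024-03-01", "2024-07-20"],
                  dtype="datetime64[ns]")
print(filter_by_time_range(events, "2024-02-01", "2024-06-30"))
```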

Copilot AI (Contributor) left a comment

Pull request overview

This PR introduces a significant architectural refactoring to improve memory efficiency by migrating from in-memory dataset processing to streaming/disk-backed processing using litdata, dask, and narwhals. The changes modernize the data pipeline while maintaining the public API surface through a compatibility layer.

Key Changes:

  • Replaced the in-memory SampleDataset with a streaming litdata.StreamingDataset architecture (see the sketch after this list)
  • Migrated data processing from polars LazyFrames to dask DataFrames for better memory management
  • Introduced SampleBuilder for processor fitting and the create_sample_dataset helper function
  • Updated all processors to accept an Iterable instead of a List of samples
  • Refactored BaseDataset to use caching and lazy evaluation throughout
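A tiny sketch of the litdata pattern the review describes (generic litdata usage, not this PR's code): samples are serialized into chunked files once, then streamed back without holding the whole dataset in memory.

```python
from litdata import StreamingDataset, StreamingDataLoader, optimize


def to_sample(index: int) -> dict:
    # In PyHealth's case each item would be a processed patient sample.
    return {"patient_id": index, "label": index % 2}


if __name__ == "__main__":
    # One-time pass: write samples into chunked files on disk.
    optimize(fn=to_sample, inputs=list(range(1000)),
             output_dir="cache/samples", chunk_bytes="64MB")
    # Later: stream samples back without materializing them in memory.
    dataset = StreamingDataset("cache/samples")
    loader = StreamingDataLoader(dataset, batch_size=32)
    print(next(iter(loader))["label"].shape)
```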

Reviewed changes

Copilot reviewed 66 out of 66 changed files in this pull request and generated 19 comments.

| File | Description |
| --- | --- |
| pyhealth/datasets/sample_dataset.py | Complete rewrite: introduces SampleBuilder, streaming SampleDataset, InMemorySampleDataset, and the create_sample_dataset helper |
| pyhealth/datasets/base_dataset.py | Major refactoring: adds a streaming parquet writer, dask-based data loading, and improved caching with content-addressed directories |
| pyhealth/datasets/utils.py | Updates get_dataloader to work with streaming datasets by calling set_shuffle |
| pyhealth/datasets/splitter.py | Updates all split functions to use dataset.subset() instead of torch.utils.data.Subset |
| pyhealth/processors/*.py | Changes fit method signatures from List[Dict] to Iterable[Dict[str, Any]] (sketched below) |
| pyhealth/models/*.py | Updates documentation examples to use create_sample_dataset instead of SampleDataset |
| tests/core/*.py | Updates all tests to the create_sample_dataset API and adapts them to streaming dataset behavior |
| pyproject.toml | Adds new dependencies: dask, litdata, pyarrow, narwhals; updates the polars version |
| examples/memtest.py | Refactored into a structured example with a __main__ guard |
| examples/benchmark_perf/*.py | Adds new benchmarking scripts for performance testing with different worker configurations |
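As an illustration of the new processor contract in the table above (a hypothetical processor, not one from the codebase): fit now takes any iterable of sample dicts and makes a single pass, so samples can stream from disk rather than sit in a list.

```python
from typing import Any, Dict, Iterable


class LabelProcessor:
    """Hypothetical processor following the Iterable-based fit signature."""

    def __init__(self) -> None:
        self.vocab: Dict[str, int] = {}

    def fit(self, samples: Iterable[Dict[str, Any]]) -> None:
        # One streaming pass; never requires the samples as a list.
        for sample in samples:
            self.vocab.setdefault(sample["label"], len(self.vocab))

    def process(self, value: str) -> int:
        return self.vocab[value]
```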


Comment on lines +444 to +455
def __iter__(self) -> Iterable[Dict[str, Any]]:  # type: ignore
    """Returns an iterator over all samples in the dataset.

    Returns:
        An iterator yielding processed sample dictionaries.
    """
    if self._shuffle:
        # Shuffle a copy so the underlying sample order is preserved.
        shuffled_data = self._data[:]
        random.shuffle(shuffled_data)
        return iter(shuffled_data)
    else:
        return iter(self._data)
Copilot AI commented Dec 17, 2025

The InMemorySampleDataset.__iter__ method creates a copy of self._data and shuffles it in place when shuffle is enabled. However, the shuffled copy is only used for the current iteration. If the dataloader iterates multiple times (multiple epochs), it will need to call __iter__ again, but since set_shuffle is called only once before dataloader creation (in get_dataloader), subsequent iterations will use the same shuffled order. This doesn't provide per-epoch shuffling like PyTorch's standard DataLoader shuffle behavior.
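For comparison, the standard PyTorch behavior the comment references (a generic PyTorch example, unrelated to this PR's code): with a map-style dataset, DataLoader(shuffle=True) draws a fresh permutation of indices every epoch without any extra bookkeeping.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

dataset = TensorDataset(torch.arange(8))
loader = DataLoader(dataset, batch_size=4, shuffle=True)

# Each `for` over the loader invokes the sampler again, so the order
# differs from epoch to epoch.
for epoch in range(2):
    print(epoch, [batch[0].tolist() for batch in loader])
```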

@jhnwu3 merged commit 4078fa3 into master on Dec 17, 2025
1 check passed
@jhnwu3 deleted the mem-9 branch on December 17, 2025 at 22:08