@jhnwu3 jhnwu3 commented Apr 8, 2025

Massive refactor:

This pull request includes several changes to improve the PyHealth framework, particularly in dataset management and configuration. The most important changes include the addition of a new hackathon announcement, updates to dataset initialization and configuration, and the introduction of a new base dataset class.

Hackathon Announcement:

Dataset Initialization:

New Base Dataset Class:

  • pyhealth/datasets/base_dataset.py: Introduced a new BaseDataset class with detailed methods for loading and processing dataset tables, handling joins, and generating task-specific sample datasets.

Configuration Updates:

Miscellaneous Fixes:

zzachw and others added 24 commits March 25, 2025 16:17
* hackathon 2024 draft

* Update 202410-sunlab-hackthon.md

* update intro based on 10/18 meeting

* add more intro

* update mentor

* merged commits for multimodal pr

* fix small bug in Mortality30DaysMIMIC4

---------

Co-authored-by: Jimeng Sun <jimeng.sun@gmail.com>
Testing PR.

Deleted the file I accidentally pushed earlier. The contribution guide seems to be correct so far.
1. Patient is now a sequence of events.
2. Updated Patient class to initialize with a Polars DataFrame for event management.
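
For reviewers, here is a minimal sketch of the "patient as a sequence of events" idea. It is not the code in this PR: the actual `Patient` class is initialized with a Polars DataFrame, while this sketch uses plain Python lists to stay dependency-free. The `Event` fields and the `get_events` method are assumptions made for illustration.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import Any, Dict, List, Optional


@dataclass
class Event:
    """One timestamped record from a source table (hypothetical shape)."""

    event_type: str
    timestamp: datetime
    attributes: Dict[str, Any] = field(default_factory=dict)


class Patient:
    """A patient viewed as a time-ordered sequence of events (illustrative only)."""

    def __init__(self, patient_id: str, events: List[Event]) -> None:
        self.patient_id = patient_id
        # Sort once so every query sees events in chronological order.
        self.events = sorted(events, key=lambda e: e.timestamp)

    def get_events(self, event_type: Optional[str] = None) -> List[Event]:
        """Return all events, or only those of the given type."""
        if event_type is None:
            return list(self.events)
        return [e for e in self.events if e.event_type == event_type]


patient = Patient(
    "p1",
    [
        Event("diagnoses", datetime(2020, 1, 3), {"code": "I10"}),
        Event("admissions", datetime(2020, 1, 1), {"ward": "ICU"}),
        Event("diagnoses", datetime(2020, 1, 2), {"code": "E11"}),
    ],
)
```

With this shape, every modality (diagnoses, admissions, and so on) is queried the same way, for example `patient.get_events("diagnoses")`.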
- Unified APIs for all modalities.
- Enabled data loading based on YAML configs.
- Switched to Polars backend.
- Removed deprecated base_dataset, sample_dataset.
- Renamed base_dataset_v2 as base_dataset.
- Renamed sample_dataset_v2 as sample_dataset.
- Moved padding to collate_fn.
- Cleaned up unused featurizer classes.
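
To illustrate what "moved padding to collate_fn" means in practice, here is a rough sketch in which samples keep their natural lengths and are padded only when a batch is assembled. The padding value `PAD`, the dict-of-lists sample layout, and the function body are assumptions; the real implementation presumably works on tensors rather than Python lists.

```python
from typing import Dict, List

PAD = 0  # assumed padding index, not taken from the PR


def collate_fn(batch: List[Dict[str, List[int]]]) -> Dict[str, List[List[int]]]:
    """Pad each feature to the longest sequence found in this batch."""
    collated: Dict[str, List[List[int]]] = {}
    for key in batch[0]:
        max_len = max(len(sample[key]) for sample in batch)
        collated[key] = [
            sample[key] + [PAD] * (max_len - len(sample[key])) for sample in batch
        ]
    return collated


batch = [{"codes": [5, 3]}, {"codes": [7, 2, 9, 4]}]
padded = collate_fn(batch)
```

Padding per batch rather than per dataset means short batches are not inflated to the global maximum length.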
Simplified MIMIC4Dataset class by merging loading functions
Introduced a YAML configuration file for dataset management, detailing file paths and attributes for various tables.
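
As a hedged illustration of config-driven loading, the sketch below shows what a parsed YAML config could look like as a Python dict, plus a loader that keeps only the configured columns. Every key name (`tables`, `file_path`, `patient_id`, `timestamp`, `attributes`) and the `load_table` helper are hypothetical; the actual schema lives in the YAML files added by this PR, and the real loader uses Polars rather than the `csv` module.

```python
import csv
import io
from typing import Any, Dict, List, TextIO

# What a parsed YAML config might look like (all keys are hypothetical).
CONFIG: Dict[str, Any] = {
    "tables": {
        "diagnoses": {
            "file_path": "diagnoses.csv",
            "patient_id": "subject_id",
            "timestamp": "charttime",
            "attributes": ["icd_code"],
        }
    }
}


def load_table(name: str, config: Dict[str, Any], handle: TextIO) -> List[Dict[str, str]]:
    """Read one table and keep only the columns named in its config entry."""
    spec = config["tables"][name]
    keep = [spec["patient_id"], spec["timestamp"], *spec["attributes"]]
    return [{col: row[col] for col in keep} for row in csv.DictReader(handle)]


# An in-memory CSV stands in for the file at spec["file_path"].
raw = io.StringIO("subject_id,charttime,icd_code,unused\n1,2020-01-02,E11,x\n")
rows = load_table("diagnoses", CONFIG, raw)
```

The point of the design is that adding a table or attribute becomes a config change rather than a new loading function.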
- Renamed `TaskTemplate` to `BaseTask`.
- Introduced `InHospitalMortalityMIMIC4`.
- Introduced `Readmission30DaysMIMIC4`.
- Introduced a new processor registry to manage different data processors.
- Implemented base processor classes: `Processor`, `FeatureProcessor`, `SampleProcessor`, and `DatasetProcessor`.
- Added specific processors for images (`ImageProcessor`), labels (`BinaryLabelProcessor`, `MultiClassLabelProcessor`, `MultiLabelProcessor`, `RegressionLabelProcessor`), sequences (`SequenceProcessor`), signals (`SignalProcessor`), and time series (`TimeseriesProcessor`).
- Each processor includes methods for processing data and managing state, with appropriate error handling and configuration options.
- Updated `BaseModel` to streamline initialization and remove deprecated parameters.
- Introduced `EmbeddingModel` for handling embedding layers for various input types.
- Refactored `RNN` class to utilize `EmbeddingModel` for embedding inputs, enhancing modularity.
- Cleaned up unused code and improved type annotations for better clarity and maintainability.
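
The processor registry mentioned in the notes above could look roughly like the following. This is a generic registry pattern, not the PR's implementation: `register_processor`, `get_processor`, `PROCESSOR_REGISTRY`, and the toy classes are all assumed names, and the real `Processor` hierarchy has more methods than shown here.

```python
from typing import Any, Callable, Dict, Type


class Processor:
    """Minimal stand-in for a processor base class."""

    def process(self, value: Any) -> Any:
        raise NotImplementedError


# Maps a short string key to the processor class registered under it.
PROCESSOR_REGISTRY: Dict[str, Type[Processor]] = {}


def register_processor(name: str) -> Callable[[Type[Processor]], Type[Processor]]:
    """Class decorator that records a processor under `name`."""

    def decorator(cls: Type[Processor]) -> Type[Processor]:
        if name in PROCESSOR_REGISTRY:
            raise ValueError(f"Processor '{name}' is already registered")
        PROCESSOR_REGISTRY[name] = cls
        return cls

    return decorator


def get_processor(name: str) -> Processor:
    """Instantiate the processor registered under `name`."""
    if name not in PROCESSOR_REGISTRY:
        raise KeyError(f"Unknown processor '{name}'")
    return PROCESSOR_REGISTRY[name]()


@register_processor("toy_binary")
class ToyBinaryLabelProcessor(Processor):
    """Toy example: maps any truthy label to 1 and any falsy label to 0."""

    def process(self, value: Any) -> int:
        return 1 if value else 0


label = get_processor("toy_binary").process(True)
```

A registry like this lets a task declare its input and output types as strings, with the matching processor looked up at dataset-build time.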
Co-authored-by: John Wu
Major Refactor: Unified Event Stream, YAML Config, Multimodal Processor, Simplified Model

I think it looks good so far; we can iterate if we find more issues. Given our small team size, it's easier to break things in the dev branch and then hotfix later.
@jhnwu3 jhnwu3 requested a review from zzachw April 8, 2025 22:06
@zzachw zzachw merged commit 694e2e9 into master Apr 9, 2025
zzachw commented Apr 9, 2025

FYI: We’ve decided to remove the dev branch going forward. The master branch will now always contain the latest code, and future development branches will be squashed and merged directly into master.

4 participants