Integrate SmartDiskCache for hash-based persistent caching #1411
Open
BitcrushedHeart wants to merge 15 commits into Nerogar:master
Replaces DiskCache with SmartDiskCache in all dataloaders, adds sourceless_training config field with UI toggle, adds Clean Cache button with preview dialog, updates clear_cache_before_training default to False, and adds xxhash to requirements.
SmartCache validates incrementally and detects model type changes automatically, so the old warning about disabling cache clearing is no longer accurate.
SmartDiskCache import was placed after CollectPaths/DecodeVAE instead of in alphabetical order after SingleAspectCalculation.
Text encoder training requires re-tokenizing prompts from source files, which are not available in sourceless mode. Raise a clear error at dataset creation time rather than failing mid-training.
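A minimal sketch of the fail-fast check described above. `SketchConfig` and the `train_text_encoder` field name are hypothetical stand-ins for illustration; the actual config fields and error type in the PR may differ.

```python
from dataclasses import dataclass


@dataclass
class SketchConfig:
    # Hypothetical stand-in for the relevant TrainConfig fields.
    sourceless_training: bool = False
    train_text_encoder: bool = False  # assumed name, for illustration only


def check_sourceless_compatible(config: SketchConfig) -> None:
    """Fail at dataset creation time instead of partway through training."""
    if config.sourceless_training and config.train_text_encoder:
        raise ValueError(
            "Sourceless training cannot be combined with text encoder training: "
            "prompts must be re-tokenized from source files, which are not available."
        )
```

Calling this before any modules are built means the user sees the conflict immediately, not after the first cached batch has already been consumed.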
- Fix source_path_in_name: prompt_path -> image_path for text cache
- Add stop_check_fun to SmartDiskCache for interruptible caching
- Catch CachingStoppedException in trainer epoch loop
- Closes Nerogar#109

- Text cache now validates against sample_prompt_path instead of image_path
- Clean button disabled while training is running to prevent concurrent access
Upstream mgds SmartCache added f65c2de 'Add fast validation to skip expensive per-file cache checks', replacing the 20+ min full stat walk with a directory-mtime + sampled spot-check path that returns in under a second on unchanged datasets.
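The idea of a directory-mtime check plus a sampled spot-check can be sketched roughly as follows. This is an illustration under assumptions, not the mgds implementation: the function names, the snapshot layout, and the sample size are all hypothetical.

```python
import os
import random


def snapshot(directory: str) -> dict:
    """Record each file's (size, mtime) and the directory mtime at cache build time."""
    files = {}
    for name in sorted(os.listdir(directory)):
        path = os.path.join(directory, name)
        if os.path.isfile(path):
            st = os.stat(path)
            files[name] = (st.st_size, st.st_mtime_ns)
    return {"dir_mtime_ns": os.stat(directory).st_mtime_ns, "files": files}


def fast_validate(directory: str, snap: dict, sample_size: int = 8, seed: int = 0) -> bool:
    """Return True if the cheap checks pass; False means fall back to a full walk."""
    # Adding or removing an entry updates the directory's own mtime.
    if os.stat(directory).st_mtime_ns != snap["dir_mtime_ns"]:
        return False
    names = list(snap["files"])
    rng = random.Random(seed)
    # Spot-check a small random sample instead of stat-ing every file.
    for name in rng.sample(names, min(sample_size, len(names))):
        try:
            st = os.stat(os.path.join(directory, name))
        except FileNotFoundError:
            return False
        if (st.st_size, st.st_mtime_ns) != snap["files"][name]:
            return False
    return True
```

The trade-off is that an in-place edit to a file outside the sample can go unnoticed, so a sketch like this only makes sense when a full validation remains available as a fallback.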
Upstream mgds SmartCache now caches validated source filepaths in a per-process set and short-circuits start-of-epoch validation when every required path is already in that set. Before, even with the fast-validate path available, each epoch still re-stat'd the dataset. After, only the first epoch validates; every epoch after that returns immediately.
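A rough sketch of the per-process short-circuit, assuming a module-level set and a caller-supplied validation callback. Both `_validated_paths` and `validate_for_epoch` are hypothetical names, not the mgds API.

```python
from typing import Callable, Iterable

# Per-process memo of source paths that already passed validation (hypothetical name).
_validated_paths: set[str] = set()


def validate_for_epoch(required: Iterable[str], validate_one: Callable[[str], bool]) -> int:
    """Validate only paths this process has not seen yet; return how many checks ran."""
    pending = [path for path in required if path not in _validated_paths]
    if not pending:
        # Every required path is already known to be valid: return immediately.
        return 0
    checks = 0
    for path in pending:
        checks += 1
        if validate_one(path):
            _validated_paths.add(path)
    return checks
```

Because the set lives in process memory, a new training run starts empty and validates once again, which matches the "only the first epoch validates" behaviour described above.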
Pulls in the SmartDiskCache change that backfills missing split/aggregate names (e.g. 'latent_mask') into existing .pt files when settings like masked_training are toggled, instead of crashing downstream readers with KeyError. Old caches keep working without a full rebuild.
Replaces the previous 905efb2 augment-in-place with invalidate-and-rebuild. The augment path could write latent_mask at a shape that didn't match the cached latent_image (mask_augmentation modules added by enabling masked_training change crop_resolution), causing collate_fn to fail with 'stack expects each tensor to be equal size' on the first batch. Rebuilding the affected entries fresh produces all keys in one upstream pass so shapes stay consistent. The new mgds also auto-detects caches stamped by the prior augment code (via SCHEMA_METHOD marker) and rebuilds them on the next run.
…augment) Reverts the pin to mgds 51b3f19 (rebuild-on-schema-drift) which was a non-starter on big caches -- 100k entries means an unacceptable full VAE re-encode. Switches to mgds bfb3544 which keeps the augment-in-place strategy but fixes the shape-mismatch bug at source: per cached entry, augmented values are forced onto the spatial shape of the already-cached latent_image (bilinear interpolation when upstream returns a divergent crop_resolution). Existing caches whose latent_mask was written shape-divergently by the previous augment get re-augmented automatically via the bumped SCHEMA_METHOD marker.
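To illustrate the augment-in-place fix, here is a plain-Python sketch that backfills a missing key and forces it onto the spatial shape of the cached `latent_image`. It uses nested lists in place of tensors and a hand-written bilinear resize, so it is a stand-in for whatever interpolation call mgds actually uses; `resize_bilinear` and `backfill_entry` are hypothetical names.

```python
def resize_bilinear(grid, out_h, out_w):
    """Bilinear resize of a 2D list, mapping pixel centres (half-pixel convention)."""
    in_h, in_w = len(grid), len(grid[0])
    out = []
    for y in range(out_h):
        sy = min(max((y + 0.5) * in_h / out_h - 0.5, 0.0), in_h - 1)
        y0 = int(sy)
        y1 = min(y0 + 1, in_h - 1)
        fy = sy - y0
        row = []
        for x in range(out_w):
            sx = min(max((x + 0.5) * in_w / out_w - 0.5, 0.0), in_w - 1)
            x0 = int(sx)
            x1 = min(x0 + 1, in_w - 1)
            fx = sx - x0
            top = grid[y0][x0] * (1 - fx) + grid[y0][x1] * fx
            bottom = grid[y1][x0] * (1 - fx) + grid[y1][x1] * fx
            row.append(top * (1 - fy) + bottom * fy)
        out.append(row)
    return out


def backfill_entry(entry, name, compute, reference="latent_image"):
    """Add a missing key to a cached entry, matching the reference's spatial shape."""
    if name in entry:
        return entry  # nothing to backfill
    ref = entry[reference]
    target_h, target_w = len(ref), len(ref[0])
    value = compute()
    if (len(value), len(value[0])) != (target_h, target_w):
        # Upstream returned a divergent resolution: resize onto the cached shape.
        value = resize_bilinear(value, target_h, target_w)
    entry[name] = value
    return entry
```

The point of anchoring on the cached `latent_image` is that every key in an entry ends up with the same spatial shape, which is what batching requires, without re-encoding the image itself.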
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
SmartDiskCache Integration
What This Is
Wires OneTrainer into the new 'SmartDiskCache' module from the companion mgds PR (Nerogar/mgds#49). The cache becomes persistent and content-addressed. It grows over time and only rebuilds what's genuinely stale, rather than wiping and rebuilding every time a file changes.
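As a rough illustration of what "content-addressed" means here, the sketch below keys cache entries by a hash of the source bytes, so only sources whose content changed count as stale. It uses `hashlib.blake2b` from the standard library as a stand-in for xxhash, and folding the model type into the key is an assumption about how model type changes could be detected, not a description of SmartDiskCache internals.

```python
import hashlib


def content_key(path: str, modeltype: str) -> str:
    """Hash the file bytes plus the model type into a stable cache key."""
    digest = hashlib.blake2b(digest_size=16)
    digest.update(modeltype.encode("utf-8"))  # assumption: model type is part of the key
    with open(path, "rb") as handle:
        # Read in 1 MiB chunks so large images are never fully held in memory.
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def stale_sources(paths: list[str], index: dict[str, str], modeltype: str) -> list[str]:
    """Return only the sources whose content key has no cached entry yet."""
    return [path for path in paths if content_key(path, modeltype) not in index]
```

With a key derived from content, renaming or moving a file leaves its entry valid, while editing a single file invalidates only that file's entry.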
What Changed
Config
- 'sourceless_training' field added to 'TrainConfig' with migration ('migration_10'). Default 'False'.
- 'clear_cache_before_training' default changed to 'False' since SmartCache makes forced rebuilds unnecessary in most cases.
UI
- [...] '.pt' files without source images/text
- [...] 'clear_cache_before_training' tooltip to reflect that SmartCache validates incrementally and detects model type changes automatically
Dataloaders
All dataloaders that previously used 'DiskCache' now use 'SmartDiskCache' through 'DataLoaderText2ImageMixin._cache_modules()'. The mixin passes 'modeltype', 'source_path_in_name', and 'sourceless' to the SmartDiskCache constructor.
When 'sourceless_training' and 'latent_caching' are both enabled, '_create_dataset()' short-circuits to '[cache_modules, output_modules]', skipping file enumeration, loading, augmentation, and preparation modules entirely.
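The short-circuit can be pictured with the sketch below. Only `cache_modules` and `output_modules` come from the description above; the other group names and the `select_module_groups` helper are hypothetical and do not mirror the real `_create_dataset()` signature.

```python
def select_module_groups(sourceless_training: bool, latent_caching: bool,
                         groups: dict[str, list]) -> list[list]:
    """Pick which module groups the dataset pipeline is built from."""
    if sourceless_training and latent_caching:
        # The cached .pt files are the input, so everything upstream of the cache is skipped.
        return [groups["cache_modules"], groups["output_modules"]]
    # Hypothetical names for the enumeration, loading, augmentation and preparation stages.
    full_order = ["enumerate", "load", "augment", "prepare", "cache_modules", "output_modules"]
    return [groups[name] for name in full_order]
```

Note that both flags must be set: with 'latent_caching' disabled there is nothing to read from, so the full pipeline is still required.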
Interruptible Caching
Pressing "Stop Training" during caching now finishes the current file, saves the cache index, and stops gracefully. The next run picks up where it left off.
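The stop-and-resume behaviour can be sketched as below. `CachingStoppedException` and `stop_check_fun` are names taken from the commit messages, but the class defined here is a local stand-in and `build_cache` with its parameters is hypothetical.

```python
class CachingStoppedException(Exception):
    """Local stand-in for the exception raised when caching is interrupted."""


def build_cache(files, index, process, stop_check_fun, save_index):
    """Cache each file once; on a stop request, persist progress and bail out."""
    for path in files:
        if path in index:
            continue  # already cached by an earlier, interrupted run
        index[path] = process(path)  # the current file is always finished
        if stop_check_fun():
            save_index(index)
            raise CachingStoppedException()
    save_index(index)
```

A trainer loop would wrap the call in `try`/`except CachingStoppedException` and exit cleanly. Since the index is saved before the exception propagates, the next run skips every file that was already processed.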
GenericTrainer
'__clear_cache()' now prints a message explaining that SmartCache makes clearing unnecessary. The wipe logic is preserved (deletes 'image/', 'text/', and 'epoch-*' dirs) but the default is off.
Dependencies
Requires Nerogar/mgds#49 (SmartDiskCache module).
Testing
Test branch: 'SmartcacheTests' on the mgds repo contains 69 tests covering the full cache system.
Closes #280
Closes #109