Add SmartDiskCache module with hash-based persistent caching #49
BitcrushedHeart wants to merge 26 commits into Nerogar:master from
Introduces SmartDiskCache as a drop-in replacement for DiskCache with per-file xxhash64 content validation, content-addressed cache filenames, a cache.json index with deduplication support, atomic writes with crash recovery, garbage collection, sourceless training mode, and a sample selection fix for the SAMPLES balancing strategy.
- Rebuild validation status now cleans hash_index before re-queuing, matching the behavior of content_changed/resolution_changed/missing_pt
- Remove unused all_input_files set from __refresh_cache
- Store loss_weight, type, name, path, seed from concept dict in .pt files at build time (follows existing __cache_version pattern)
- In sourceless mode, reconstruct concept dict from stored metadata so OutputPipelineModule can resolve concept.loss_weight
- Add concept to sourceless get_outputs() so pipeline resolution finds SmartDiskCache instead of walking back to ConceptPipelineModule
- Bump CACHE_VERSION to 2 (forces cache rebuild for sourceless mode; normal mode unaffected)
Call before_cache_fun before falling through to upstream pipeline modules in get_item, so the model is on the correct device when re-encoding uncached items at training time.
The real bug was in OneTrainer passing 'prompt_path' (nonexistent) as source_path_in_name for the text cache, causing every text lookup to miss. With the correct key ('image_path'), the fallback path should never be reached after a fresh cache build.
- Add .pt existence check on mtime fast-path to prevent FileNotFoundError
- Replace shutil.move with os.replace for atomic writes on Windows (see the sketch after this list)
- Rewrite _load_cache_index with 3-stage fallback (cache.json → .tmp → .bak)
- Extend _index_lock to cover the full save operation (write + backup + rename)
- Switch to a time-based flush interval (30s) with compact JSON for intermediate flushes
- Cache os.path.realpath once in __init__; use _real_pt_path consistently
- Cache source paths at epoch start, eliminating per-item pipeline traversal
- Load aggregate data into RAM at epoch start and serve from memory in get_item
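A minimal sketch of the tmp/bak/os.replace save pattern this commit describes (function and file names are hypothetical, not the actual diff):

```python
import json
import os

def save_index_atomic(index: dict, path: str) -> None:
    tmp = path + ".tmp"
    bak = path + ".bak"
    with open(tmp, "w", encoding="utf-8") as f:
        json.dump(index, f, separators=(",", ":"))  # compact intermediate flush
        f.flush()
        os.fsync(f.fileno())                        # data on disk before rename
    if os.path.exists(path):
        os.replace(path, bak)                       # keep a backup for crash recovery
    os.replace(tmp, path)                           # atomic on Windows, unlike shutil.move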
Shows tqdm progress during the validation loop and aggregate cache loading so the terminal doesn't appear frozen between phases.
The generator expression caused as_completed to submit futures lazily, one at a time, preventing the executor from pipelining the next item while the current one's I/O completes.
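The fix in miniature; 'build_one' and 'items' are stand-ins, not the actual code:

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def build_one(item):          # stand-in for the per-item cache build
    return item

items = range(8)
with ThreadPoolExecutor() as executor:
    # A generator here would submit lazily, one future per as_completed pull;
    # a list submits everything up front so the executor can pipeline.
    futures = [executor.submit(build_one, i) for i in items]
    for future in as_completed(futures):
        future.result()
```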
On repeat runs where nothing changed, cache validation was taking 20+ minutes due to stat-ing every source file individually. This adds a fast path that checks directory mtimes and spot-checks a sample of entries, reducing validation to under a second for unchanged datasets. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
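A rough sketch of that fast path (names and the sampling policy are assumptions, not the actual diff):

```python
import os
import random

def fast_validate(entries: dict, stored_dir_mtimes: dict, sample_size: int = 32) -> bool:
    # Cheap check: has any watched parent directory changed?
    parents = {os.path.dirname(p) for p in entries}
    if any(os.path.getmtime(d) != stored_dir_mtimes.get(d) for d in parents):
        return False
    # Spot-check a small sample of individual entries.
    sample = random.sample(list(entries), min(sample_size, len(entries)))
    return all(os.path.getmtime(p) == entries[p]["mtime"] for p in sample)
```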
Cache validation was running at the start of every epoch, even when the same filepaths were being delivered (which is the common case, since users configure repeats rather than custom samples_per_epoch). On larger datasets the per-file validation loop was noticeable at each epoch boundary despite no actual dataset change.

Track validated filepaths in a per-process set and short-circuit _reshuffle_and_prepare when every required path is already in that set and still present in the on-disk index. Fall through to the existing fast-validate / full-validate paths otherwise.

Trade-off: within-run edits to source files are no longer detected. Cross-run detection (via cache.json + fast validation) is unchanged. Training against a mutating dataset within a single process was never well-defined anyway.
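The short-circuit in sketch form (names hypothetical):

```python
_validated_paths: set[str] = set()  # per-process; gone when the process exits

def can_skip_validation(required_paths: list[str], index_entries: dict) -> bool:
    # Skip only if every required path was validated earlier in this run
    # and is still present in the on-disk index.
    return all(p in _validated_paths and p in index_entries for p in required_paths)
```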
Fix device mismatch on cache miss during training. Call before_cache_fun before falling through to upstream pipeline modules in get_item, so the model is on the correct device when re-encoding uncached items at training time. The fallback is reachable whenever individual files fail to cache (build_failed / missing / hash_failed), so the band-aid from c22be2f was removed prematurely in 28795b1.
Persist a zero-tensor sentinel during cache validation using any successful entry as a shape template. On cache miss, return the sentinel directly instead of re-running upstream encoders.

Rationale: files that fail to cache (build_failed / missing / hash_failed) leave gaps in the index. At training time the text encoder is on the temp device (CPU) and bringing it back to GPU to re-encode a single sample risks both a device mismatch and an OOM since the main model is already on GPU.

The before_cache_fun re-encode path is kept as a last-resort fallback for the edge case where no valid entries exist yet (e.g. caching interrupted before any file succeeded).
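The sentinel idea in miniature (names hypothetical): zero tensors shaped like any known-good cached entry, served directly on cache miss:

```python
import torch

def build_blank_sentinel(template_item: dict) -> dict:
    # Zero out every tensor in a successfully cached entry; pass through
    # non-tensor metadata unchanged.
    return {k: torch.zeros_like(v) if torch.is_tensor(v) else v
            for k, v in template_item.items()}
```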
When the env var is set, skip per-file mtime/hash/.pt-existence checks and the upstream _get_resolution_string call (which can trigger per-image I/O on slow cloud storage). Filepaths already in the on-disk index are trusted; only missing filepaths are cached. Modeltype mismatch still raises to prevent silent cross-model cache reuse. Driven by the --skip-cache-validation CLI flag in OneTrainer/scripts/train.py.
Toggling settings like masked_training between runs adds keys (e.g. 'latent_mask') to split_names/aggregate_names that aren't present in existing .pt files, so downstream readers (AspectBatchSorting et al.) crashed with KeyError instead of silently dropping the missing field.

Stamp split+aggregate names into cache.json as 'schema' so we can detect drift on startup. When drift is found, walk every entry, run only the missing names through the upstream pipeline, and merge them into the existing .pt (preserving all other keys). Atomic via tmp + os.replace, parallelised through the existing executor.

_ensure_blank_sentinel now rebuilds when the sentinel doesn't cover all currently-required keys, and get_item borrows zero-tensors from the sentinel for any key still missing from a per-file augmentation failure -- no single bad entry can crash training.
Previous augment-in-place fix re-ran _get_previous_item('latent_mask', in_index) through the upstream pipeline to backfill missing keys, then wrote them into the existing .pt next to the already-cached latent_image. That breaks when toggling settings adds modules to the upstream chain (e.g. enabling masked_training pulls in mask_augmentation_modules and changes 'mask' to be cropped alongside 'image'), which can produce a different crop_resolution than the one stored in the cache. Result: latent_mask written at a shape that doesn't match the cached latent_image, then collate_fn crashes with 'stack expects each tensor to be equal size' once a batch mixes samples whose mask shapes diverged.

Switch to invalidate-and-rebuild: when schema drift is detected, drop every entry from the index, delete the .pt files, and let the existing build loop rebuild each entry in a single upstream pass so all keys share the same crop_resolution and shape.

Add a SCHEMA_METHOD marker stamped into cache.json. Caches that were schema-stamped by the prior augment-based code (schema set, schema_method unset) are auto-invalidated on the next run, so users who already trained on shape-corrupted .pt files get a clean rebuild without manually nuking their cache_dir.
This reverts commit 51b3f19.
Augmenting a cache built under different settings (e.g. masked_training toggled, which adds mask_augmentation modules to the upstream chain) re-runs the upstream pipeline for the missing names. The fresh run can produce a different crop_resolution than the one stored alongside latent_image, so the augmented latent_mask ends up at a different spatial shape -- collate_fn then crashes with 'stack expects each tensor to be equal size' once a batch mixes samples whose mask shapes diverged.

Fix at the source:
- Per cached entry, derive a reference spatial shape from the already-cached latent_image.
- For every target name, recompute via _get_previous_item only when the cached value is missing OR its spatial shape mismatches the reference. Names that already match are left untouched.
- Force the recomputed value onto the reference shape via bilinear interpolation when upstream returns something divergent (a sketch follows below). The mask is approximate when the cache crosses pipelines, but it's much cheaper than rebuilding 100k entries from scratch.

Stamp a SCHEMA_METHOD marker into cache.json. Caches stamped by the prior augment that didn't shape-check (schema set, schema_method unset/different) are auto re-augmented on the next run, fixing the already-broken on-disk values without manual cache_dir cleanup.
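A sketch of the shape guard (names hypothetical, not the actual diff): conform a recomputed tensor to the spatial shape of the cached latent_image.

```python
import torch
import torch.nn.functional as F

def conform_to_reference(value: torch.Tensor, ref_hw: tuple[int, int]) -> torch.Tensor:
    if tuple(value.shape[-2:]) == ref_hw:
        return value                         # already matches, leave untouched
    squeezed = value.ndim == 3               # interpolate expects (N, C, H, W)
    if squeezed:
        value = value.unsqueeze(0)
    value = F.interpolate(value, size=ref_hw, mode="bilinear", align_corners=False)
    return value.squeeze(0) if squeezed else value
```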
Pure refactor, no behavior change.
- I001: sort the import block (mgds-internal imports treated as the first-party section, per the lint config).
- UP035: import Callable from collections.abc instead of typing.
- UP008: drop redundant super() arguments.
- SIM105: replace try/except/pass with contextlib.suppress.
- SIM118: drop .keys() in 'in dict' membership checks.
- SIM108: collapse if/else into a ternary where it fits.
- SIM113: fold a manual build_count into enumerate(start=1).
- C416: rewrite a list comprehension as list().
- RET503: add an explicit return None on the no-match path.
- RSE102: drop the empty parentheses on raise.
- B007: drop the unused fp/i loop variables.
… dir

SmartDiskCache validation regressed dramatically vs the old DiskCache: a 30k-image cache validated in ~40 minutes instead of ~10. Each entry was firing 1 getmtime + V isfile syscalls plus two pipeline traversals, all serial. Under Windows Defender / EDR filters this slowed to 4 it/s.

Five bundled changes reduce a fresh-pipeline validation pass on 30k images from minutes to seconds:

1. _scan_existing_pt_files(): one os.scandir of the cache dir replaces N×V os.path.isfile calls during validation, dedup, and build.
2. _bulk_stat_source_files(): parallel os.scandir per source parent dir via the existing executor; harvests mtimes in K syscalls (K = #parent dirs) instead of N getmtime calls (sketched below).
3. The validation loop iterates the unique in_index once instead of needed_variations × N. _validate_entry is invariant in in_variation; the build phase still iterates all V internally.
4. Resolution short-circuit: _get_resolution_string is only called when an entry is missing or invalidated, not on every cache hit.
5. Per-watched-file directory fingerprint: replaces the parent-dir mtime check in _fast_validate. Touching an unrelated sidecar file (caption .txt, mask, .npz) in a watched dir no longer invalidates the fast path. Stored as cache_index['watched_fingerprints']; legacy caches without the field run one full validation pass to write it, then take the fast path on subsequent runs.

Also fixes pre-existing ruff violations in tests/test_smartcache.py (import sort, unused vars, set comprehensions, zip strict=).

Tests: 15 new behaviour-parity tests (TestBulkScanCorrectness, TestBulkStatCorrectness, TestResolutionShortCircuit, TestVariationDedup, TestWatchedFingerprint) plus 3 timing benchmarks. Headline benchmark on this machine: 200-file cold validation 2.5s, fresh-pipeline warm fast-validate 177ms, full validation after one file touch 177ms. Existing tests unchanged; one (test_rebuild_cleans_hash_index) updated to drive the same code path through file-content change rather than patching os.path.getmtime, which the bulk-stat path no longer uses. Pre-existing GC tests (test_gc_preview_empty, test_gc_clean) still fail; the blank_sentinel.pt orphan is created by an unrelated upstream module and is out of scope for this commit.
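A sketch of the bulk-stat idea from change 2 (serial for brevity; names hypothetical):

```python
import os
from collections import defaultdict

def bulk_stat(paths: list[str]) -> dict[str, float]:
    by_parent: dict[str, set[str]] = defaultdict(set)
    for p in paths:
        by_parent[os.path.dirname(p)].add(os.path.basename(p))
    mtimes: dict[str, float] = {}
    for parent, names in by_parent.items():
        # One directory scan per parent; on Windows the DirEntry stat info
        # comes back with the scan, avoiding per-file getmtime syscalls.
        with os.scandir(parent) as it:
            for entry in it:
                if entry.name in names:
                    mtimes[os.path.join(parent, entry.name)] = entry.stat().st_mtime
    return mtimes
```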
The validation loop was still calling _get_resolution_string for every valid cache entry, which chains AspectBucketing -> CalcAspect -> LoadImage and opens the source image to read its dimensions. On a 33k dataset this was the dominant remaining cost (~5 it/s, ~hour-and-a-half total) even after the bulk-scan fixes — the per-image decode dwarfed the syscalls we'd already eliminated. Trust the cached resolution on the happy path. Same contract as the original DiskCache: bucket config changes require a manual cache clear. schema_method drift is detected earlier in __refresh_cache via _detect_cache_schema_drift / _augment_cache_with_missing_names and remains intact. The rebuild branch still calls _get_resolution_string for files that genuinely need rebuilding, which is correct and small in the steady state.
CACHE_VERSION 2 -> 3. Each entry now stores a ``variants`` dict keyed by resolution string (e.g. ``"896x640"``) instead of single ``cache_file``/``resolution`` fields. v2 indices migrate in place on load -- no .pt rebuild required.

When AspectBucketing config changes between runs (e.g. the user edits target_resolutions), drift recovery derives the new bucket assignment for each entry purely from the cached aspect ratio (parse "HxW" -> aspect, run the same argmin against the new bucket_aspects; see the sketch after this list). Any pre-existing .pt file matching a derived key is reused; missing keys queue rebuilds of just that variant. No source images are decoded for unchanged resolutions.

The image cache thus becomes a multi-resolution store: training at 512 yesterday and 768 today doesn't invalidate the 512 variants -- both coexist. Wired through DataLoaderText2ImageMixin and StableDiffusionFineTuneVaeDataLoader via ``bucket_method_provider`` and ``rebucket_provider`` callbacks.

Other changes:
- gc_preview/gc_clean walk every variant and honour the v2->v3 migrator.
- blank_sentinel.pt is now correctly recognised as referenced (a latent bug present pre-CACHE_VERSION 3 too).
- _validate_entry returns 'missing_variant' for variant-level rebuilds that preserve the parent entry and other variants.
- AspectBucketing exposes bucket_for_aspect() and compute_bucket_method_hash() for the cache to call without re-entering the LoadImage chain.
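A sketch of the pure-aspect rebucketing (the real bucket_for_aspect() signature may differ; this is illustrative):

```python
def bucket_for_aspect(resolution_key: str, bucket_resolutions: list[tuple[int, int]]) -> str:
    # Parse the stored "HxW" key into an aspect ratio...
    h, w = (int(x) for x in resolution_key.split("x"))
    aspect = h / w
    # ...and run the same closest-aspect argmin the bucketing module uses.
    best = min(bucket_resolutions, key=lambda hw: abs(hw[0] / hw[1] - aspect))
    return f"{best[0]}x{best[1]}"
```

For example, bucket_for_aspect("896x640", [(1024, 1024), (896, 640)]) returns "896x640", so the existing variant is linked in without decoding the source image.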
Previously, drift recovery only fired when the cache.json had a stored ``bucket_method`` AND it differed from the current one. v2 caches migrated to v3 have ``stored == None``, so drift was skipped and stale variants kept being served unchanged -- even when the user's target_resolution had changed since the cache was originally built. This manifested as OOM (latents larger than the trainer expected) and batch shape-stack errors (mixed-resolution caches grouping inconsistently).

Fix: trigger drift recovery whenever ``stored != current`` (treating ``None`` as "old / unknown"). On the no-change happy path the recovery is a no-op, since the aspect math produces the already-cached variant keys. When keys do differ, existing pre-built variants are linked in if their .pt files exist on disk, and only entries with no matching variant trigger rebuilds.

Also bump AspectBucketing's bucket_method version from ``aspect_v1`` to ``aspect_v2`` so users who already validated under v3 (and got an aspect_v1 hash stamped) re-run drift recovery once to catch any inconsistencies the original v2 -> v3 migration missed.
SmartDiskCache - Hash-Based Persistent Caching
What This Is
A replacement for 'DiskCache' that makes caching persistent and content-addressed rather than ephemeral. Adding one image to a 100k dataset caches one file, not 100k. Editing one caption recaches one text embedding, not all of them. Moving files between concepts (same content, different path) reuses existing cache via hash matching. Switching between training configs that differ only in non-cache-relevant settings never triggers recaching.
The cache becomes a content-addressed store that grows over time and only rebuilds what's genuinely stale.
How It Works
Hashing
Every source file gets an xxhash64 hash of its contents. xxhash64 is faster than MD5/SHA-256 and has excellent collision resistance for non-cryptographic purposes. The full 64-bit hash is used internally for comparison. Cache filenames use a 12 hex char truncation (48 bits, ~281 trillion possible values) to keep paths manageable.
Image cache files: '{hash12}_{resolution}_{variation}.pt'
Text cache files: '{hash12}_{variation}.pt'
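A minimal sketch of the scheme, assuming the Python 'xxhash' package (helper names are illustrative, not the module's actual API):

```python
import xxhash

def file_hash(path: str) -> str:
    h = xxhash.xxh64()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):  # stream in 1 MiB chunks
            h.update(chunk)
    return h.hexdigest()          # full 64-bit hash, 16 hex chars

def image_cache_name(path: str, resolution: str, variation: int) -> str:
    return f"{file_hash(path)[:12]}_{resolution}_{variation}.pt"  # 48-bit prefix
```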
Validation Flow
Per-file validation runs for each file needed in the current epoch. The mtime check is the fast path; hash computation only happens when the mtime changes. This means validation of a 100k dataset where nothing changed is essentially free - it's 100k 'stat()' calls, no file reads.
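In sketch form (names hypothetical; file_hash is the helper from the hashing sketch above):

```python
import os

def validate_entry(path: str, entry: dict) -> str:
    if not os.path.isfile(path):
        return "missing"
    mtime = os.path.getmtime(path)
    if mtime == entry["mtime"]:       # fast path: one stat(), no file read
        return "valid"
    if file_hash(path) == entry["hash"]:
        entry["mtime"] = mtime        # mtime changed but content didn't
        return "valid"
    return "content_changed"          # re-encode this one file only
```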
Cache Index
Each cache directory ('image/' and 'text/') maintains a 'cache.json' index with per-file entries (filename, hash, mtime, modeltype, resolution, cache_file, cache_version) and a 'hash_index' mapping hashes to lists of filepaths for dedup lookups. The index uses atomic writes (write to '.tmp', backup to '.bak', rename) with crash recovery on startup.
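A hypothetical illustration of the index layout, shown as a Python literal (field names beyond those listed above are assumptions):

```python
example_index = {
    "entries": {
        "concept1/img_001.png": {
            "hash": "a3f2b9c1d4e5f607",            # full xxhash64 hex
            "mtime": 1713200000.0,
            "modeltype": "...",
            "resolution": "896x640",
            "cache_file": "a3f2b9c1d4e5_896x640_0.pt",
            "cache_version": 2,
        },
    },
    "hash_index": {
        "a3f2b9c1d4e5f607": ["concept1/img_001.png", "concept2/dup.png"],
    },
}
```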
Deduplication
When a new file is encountered, its hash is checked against the 'hash_index'. If a match exists with the same modeltype and resolution, the existing cache entry is reused - no encoding needed. This handles the common case of the same image appearing in multiple concepts.
When one copy of a deduplicated file is edited, it gets a new hash and new cache files. The unedited copy still points to the old cache entry. When all references to a hash are gone, the cache files become eligible for garbage collection.
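The dedup lookup in sketch form (names hypothetical):

```python
def find_dedup_entry(h: str, modeltype: str, resolution: str, index: dict) -> dict | None:
    for other_path in index["hash_index"].get(h, []):
        entry = index["entries"].get(other_path)
        if entry and entry["modeltype"] == modeltype and entry["resolution"] == resolution:
            return entry    # same content, same config: reuse, no encoding needed
    return None             # genuinely new content, must encode
```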
Sourceless Training
If all necessary training data is embedded in the '.pt' cache files, users can train from cache alone without the source images/text files. A 'sourceless_training' toggle in the config enables this. When active, the dataloader skips file enumeration, loading, and augmentation modules entirely - the pipeline collapses to just '[cache_modules, output_modules]'.
On startup in sourceless mode, 'SmartDiskCache' validates that all cache entries have sufficient 'cache_version', correct 'modeltype', and existing '.pt' files. Clear errors are raised if anything is missing.
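A sketch of that startup check (names hypothetical):

```python
import os

def validate_sourceless(index: dict, cache_dir: str, modeltype: str, min_version: int) -> None:
    for path, entry in index["entries"].items():
        if entry.get("cache_version", 0) < min_version:
            raise ValueError(f"{path}: cache entry too old for sourceless training")
        if entry["modeltype"] != modeltype:
            raise ValueError(f"{path}: cached for a different model type")
        if not os.path.isfile(os.path.join(cache_dir, entry["cache_file"])):
            raise FileNotFoundError(f"{path}: cache file missing")
```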
This enables dataset sharing without distributing original files. Cached latents can't be decoded back to pixel-space images without the VAE decoder, so this is a one-way transform - useful for privacy-sensitive datasets.
Garbage Collection
A "Clean Cache" button in the UI identifies orphaned cache files (source file no longer exists, or '.pt' files with no 'cache.json' entry) and shows a preview with file counts and sizes before deleting anything. Dedup-shared '.pt' files are preserved as long as at least one source file still references them.
Sample Selection Fix
The SAMPLES balancing strategy now shuffles the full file pool then takes N, rather than taking the first N then shuffling. This gives genuinely random sampling across epochs when using large datasets with sample limits.
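The fix in miniature (names hypothetical):

```python
import random

def pick_samples(files: list[str], n: int, rng: random.Random) -> list[str]:
    pool = list(files)
    rng.shuffle(pool)    # shuffle the full pool first...
    return pool[:n]      # ...then take N, so every file can be drawn each epoch
```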
What Changed
New File
'src/mgds/pipelineModules/SmartDiskCache.py' - the entire module: a 'PipelineModule' + 'SingleVariationRandomAccessPipelineModule' subclass, a drop-in replacement for 'DiskCache' with additional constructor params ('modeltype', 'source_path_in_name', 'sourceless').
Testing
Test branch: 'SmartcacheTests' - 69 tests covering hashing, cache validation flow, deduplication, atomic writes/crash recovery, garbage collection, sourceless training, sample selection, DiskCache regression, and issue regression scenarios.
Why not replace DiskCache?
While mgds is built for OneTrainer, I have no idea what else could be using mgds - so this allows existing repos to continue using DiskCache even as OneTrainer shifts to SmartDiskCache. If desired, we could raise a deprecation warning when DiskCache is used, if this is merged.
Closes #41