make DiB_fileStats skip invalid files (fileSize <= 0) to prevent negative totals and bogus allocation #4487
Conversation
Hi @neiljohari! Thank you for your pull request and welcome to our community.

Action Required: In order to merge any pull request (code, docs, etc.), we require contributors to sign our Contributor License Agreement, and we don't seem to have one on file for you.

Process: In order for us to review and merge your suggested changes, please sign at https://code.facebook.com/cla. If you are contributing on behalf of someone else (e.g. your employer), the individual CLA may not be sufficient and your employer may need to sign the corporate CLA. Once the CLA is signed, our tooling will perform checks and validations. Afterwards, the pull request will be tagged with
If you have received this in error or have any questions, please contact us at cla@meta.com. Thanks!
Cyan4973
left a comment
Indeed, `DiB_getFileSize()` returns -1 when the file doesn't exist.
Great work @neiljohari !
Note @neiljohari that signing the CLA is required before getting the code merged.

Thanks Yann! Going through our internal process for approval, will be done shortly 😄
Summary
- Fixes an inconsistency between `DiB_fileStats()` and `DiB_loadFiles()` that could drive `totalSizeToLoad` negative when invalid inputs are present (e.g., broken symlinks, disappearing files, IO errors).
- If `totalSizeToLoad` goes negative, it later gets cast to `size_t` and wraps to a huge value, leading to large allocation sizes and "not enough memory for DiB_trainFiles".
- Aligns `DiB_fileStats()` with `DiB_loadFiles()`: both now skip non-positive file sizes.

To be 100% transparent, I think you need a pretty pathological case to see crashes -- to the point where there's something severely wrong with your sample dir. I probably wouldn't put up a PR for this normally, but since it is possible to not crash yet still get slightly incorrect totals, I thought it was worth it.
Motivation and background
While investigating dictionary-training hanging with empty files, I found that my environment (conda-forge zstd labeled v1.5.2) didn’t include the upstream fix for the empty-file infinite loop (#3081). Took me a while to figure that out though, and while investigating I looked more broadly at dibio.c and noticed this possible issue.
The bug fixed in this PR did not impact me (it's unrelated to my empty-file hang); I just happened to notice it via code inspection.
Root cause
`DiB_getFileSize()` returns the file's size on success, and a negative value (-1) when the file doesn't exist or can't be read.

- `DiB_loadFiles()` correctly skips files with `fileSize <= 0`.
- `DiB_fileStats()`, however, only skipped `fileSize == 0`. It counted negative sizes as samples and added `MIN(fileSize, SAMPLESIZE_MAX)` to `totalSizeToLoad`. For `fileSize == -1`, this subtracts 1 from `totalSizeToLoad` per invalid entry and inflates the expected sample count.

Impact
With enough invalid entries in the training set (or just a few if the valid samples are small), `totalSizeToLoad` can go negative. The memory sizing logic casts negative totals to `size_t`, wrapping to a massive value. This can result in:

- `malloc(huge)` → failure → `EXM_THROW(12)` "not enough memory for DiB_trainFiles", or
- minor-to-large drift between the intended and actual allocation sizes, depending on the exact inputs.
Even when it doesn't crash, this mismatch can under- or over-allocate sample memory. I'm not an expert in the trainer tool, though, so I'm not sure whether that really matters or whether it can lead to suboptimal trained dictionaries.
Reproduction (added in 96fdb9b)
Minimal setup: a handful of small “good” files plus many nonexistent paths passed on the command line (nonexistent paths bypass symlink filtering).
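A sketch of that setup might look like the following (file names and counts invented for illustration; the `zstd` invocation is guarded in case the binary isn't on PATH):

```shell
# Create a handful of small valid samples (hypothetical names).
mkdir -p samples
for i in 1 2 3 4 5; do
  head -c 64 /dev/urandom > "samples/good_$i"
done

# Build a long list of nonexistent paths. Passed explicitly on the command
# line, these bypass the directory-scan symlink filtering and reach
# DiB_getFileSize() as -1.
missing=$(printf 'samples/missing_%d ' $(seq 1 200))

# Train; before the fix, enough missing entries drive totalSizeToLoad negative.
command -v zstd >/dev/null && zstd --train samples/good_* $missing -o repro.dict || true
```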
The temporary debug logs show negative file sizes being counted by `DiB_fileStats()`, final totals going negative, and the resulting huge `loadedSize` after casting.
Before (current behavior)
With 5 valid samples plus many nonexistent files:
After (this change, 85f4a7e):
The "before" output produces a negative total and a huge `loadedSize`, ending in `EXM_THROW(12)`. The "after" output shows consistent `nbSamples`, sane totals, and modest allocations, proceeding to training as expected.