Description
Opening an issue because I have exhausted the docs and am still not sure how to go about this.
Context
I'm trying to train a dictionary on a relatively large data set and I keep bumping into the memory limit. The training set consists of ~100000 .json files, each with a similar schema and between 2 and 8 MB in size, amounting to ~300GB of training data in total. I have previously trained a ~10MB dictionary on 25GB of such data and got decent compression with it, without encountering the memory limit.
In the current case I encounter the following errors, without changing any parameters other than the dictionary size:
ᢹ rt-zstd-training-data.root zstd --train -r /mnt/volume_sfo3_01/messages/training -o dict25MB.zstd --maxdict=200000000 -1
! Warning : some sample(s) are very large
! Note that dictionary is only useful for small samples.
! As a consequence, only the first 131072 bytes of each sample are loaded
Training samples set too large (14342 MB); training on 2048 MB only...
I think, short of doing something nonsensical, my problem is that I can't quite get the combination of memory-limit knobs right. To my understanding, and according to the docs, these are:

- `-M#`, `--memory=#`: limits memory for dictionary training. I definitely want to increase that from the default 2GB (right?).
- `--maxdict=#`: limits the dictionary to the specified size (default: 112640).
- `-B#`: splits input files into blocks of size # (default: no split).
- `--size-hint`
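For reference, this is the invocation I think I'm aiming for, using zstd's documented unit suffixes (KiB/MiB) instead of raw byte counts. The sizes and the output name are my own guesses, not recommendations, and note that in this build any `-M` value above 2^32-1 bytes trips the 32-bit overflow error quoted below:

```sh
# Sketch only: the same training run with unit suffixes instead of raw bytes.
# --memory (-M) caps how much sample data is loaded, --maxdict caps the
# dictionary size, -B splits each input file into fixed-size blocks.
zstd --train -r /mnt/volume_sfo3_01/messages/training \
     -o dict200MB.zstd \
     --maxdict=200MiB \
     --memory=2000MiB \
     -B128KiB -1
```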
Issue
My issues are:
- When I try to add `-M5000000000`, which should be roughly ~5GB, I get `error: numeric value overflows 32-bit unsigned int`. I don't suppose this is related to the build: `zstd --version` yields `*** zstd command line interface 64-bits v1.5.2, by Yann Collet ***`.
- Given that I can only get the memory limit up to 2048 MB, that severely limits the size of my dictionary.
- I'm not sure how to use the `-B` flag:
zstd --train -r /mnt/volume_sfo3_01/messages/training -vvv -o dict.zstd --maxdict=200000000 -1 -B80000 -M500000 -vvv
Shuffling input files
Found training data 115002 files, 298066576 KB, 3872950 samples
! Warning : setting manual memory limit for dictionary training data at 0 MB
Training samples set too large (291080 MB); training on 0 MB only...
Loaded 468 KB total training data, 6 nb samples
Trying 5 different sets of parameters
d=8
Total number of training samples is 4 and is invalid
Failed to initialize context
dictionary training failed : Src size is incorrect
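Writing this up, I suspect the "0 MB" line above is self-inflicted: `-M` takes a byte count, so `-M500000` is 500 KB, which rounds down to 0 MB in the log. A quick sanity check of the arithmetic (the 2^20 divisor is my assumption about how the log rounds to MB):

```shell
# -M500000 means 500000 bytes; integer-dividing by 2^20 shows why
# the training log reports a 0 MB limit.
echo $((500000 / 1048576))      # prints 0
# The ~5GB limit I actually wanted would have been:
echo $((5000000000 / 1048576))  # prints 4768 (MB)
```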
It's a little unsatisfying that the only error log there is, even in `-vvv` mode, is `Src size is incorrect`.
It's an external package, but incidentally, I kept seeing this error line when using the `zstd::dict` Rust crate separately from the CLI, and was confused for a while. For example, when calling:

```rust
// ...
let in_path = Path::new("~/training_files");
let zdict = zstd::dict::from_files([in_path], 1024 * 1024 * 10)?;
```

I do realize that Ian explains in this issue that files over ~1MB are considered large and the returns probably diminish relative to the size.
My question is whether there is an inherent limit on how large my training set (and correspondingly, the dictionary) can be, and how to properly feed it into memory given that my samples are on average larger than 1MB. Again, in my case it's roughly 300GB worth of JSON files at ~5MB each, though I'd like to understand the general case.
Should I chunk the files manually, provide `--size-hint`, or something else entirely?
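To illustrate what I mean by chunking manually: pre-splitting every sample into 131072-byte blocks (the per-sample load limit from the warning above) with coreutils `split`. The `samples/` and `chunks/` paths and the demo file are made up for the sketch:

```shell
mkdir -p samples chunks
# Make a ~300 KB stand-in for one of my multi-MB JSON samples.
head -c 300000 /dev/zero | tr '\0' 'a' > samples/demo.json
# Split every sample into 131072-byte chunk files so that each
# training sample fits under zstd's per-sample load limit.
for f in samples/*.json; do
    split -b 131072 -d "$f" "chunks/$(basename "$f")."
done
```

Of course this breaks the JSON framing of each file, which is part of why I'm asking whether `-B` or `--size-hint` is the intended way instead.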
Thanks a lot in advance.
I'm running a fresh Ubuntu and built zstd with `make install`:
Linux version 5.4.0-97-generic (buildd@lcy02-amd64-032)
(gcc version 9.3.0 (Ubuntu 9.3.0-17ubuntu1~20.04))
*** zstd command line interface 64-bits v1.5.2, by Yann Collet ***