Description
Opening an issue because I have exhausted the docs and am still not sure how to go about this.
Context
I'm trying to train a dictionary on a relatively large data set and I keep bumping into the memory limit. The training set consists of ~100000 .json files, each with a similar schema and between 2 and 8 MB in size, amounting to ~300GB of training data in total. I have previously trained a ~10MB dictionary on 25GB of such data and got decent compression with it, without encountering the memory limit.
In the current case I encounter the following errors, without changing any parameters other than the dictionary size:
ᢹ rt-zstd-training-data.root zstd --train -r /mnt/volume_sfo3_01/messages/training -o dict25MB.zstd --maxdict=200000000 -1
! Warning : some sample(s) are very large
! Note that dictionary is only useful for small samples.
! As a consequence, only the first 131072 bytes of each sample are loaded
Training samples set too large (14342 MB); training on 2048 MB only...
I think, short of doing something nonsensical, my problem is that I can't quite get the combination of memory-limit knobs right. To my understanding, and according to the docs, these are:

- `-M#`, `--memory=#`: limits memory for dictionary training. I definitely want to increase that from the default 2GB (right?).
- `--maxdict=#`: limits the dictionary to the specified size (default: 112640).
- `-B#`: splits input files into blocks of size # (default: no split).
- `--size-hint`
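For reference, this is the invocation I think I'm aiming for, using zstd's documented unit suffixes (KiB/MiB) instead of raw byte counts. The sizes and the output name are my own guesses, not recommendations, and note that in this build any `-M` value above 2^32-1 bytes trips the 32-bit overflow error quoted below:

```sh
# Sketch only: the same training run with unit suffixes instead of raw bytes.
# --memory (-M) caps how much sample data is loaded, --maxdict caps the
# dictionary size, -B splits each input file into fixed-size blocks.
zstd --train -r /mnt/volume_sfo3_01/messages/training \
     -o dict200MB.zstd \
     --maxdict=200MiB \
     --memory=2000MiB \
     -B128KiB -1
```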
Issue
My issues are:
- When I try to add `-M5000000000`, which should be roughly ~5GB, I get `error: numeric value overflows 32-bit unsigned int`. I don't suppose this is related to the build: `zstd --version` yields `*** zstd command line interface 64-bits v1.5.2, by Yann Collet ***`.
- Given that I can only get the memory limit up to 2048 MB, that severely limits the size of my dictionary.
- I'm not sure how to use the `-B` flag:
zstd --train -r /mnt/volume_sfo3_01/messages/training -vvv -o dict.zstd --maxdict=200000000 -1 -B80000 -M500000 -vvv
Shuffling input files
Found training data 115002 files, 298066576 KB, 3872950 samples
! Warning : setting manual memory limit for dictionary training data at 0 MB
Training samples set too large (291080 MB); training on 0 MB only...
Loaded 468 KB total training data, 6 nb samples
Trying 5 different sets of parameters
d=8
Total number of training samples is 4 and is invalid
Failed to initialize context
dictionary training failed : Src size is incorrect
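Writing this up, I suspect the "0 MB" line above is self-inflicted: `-M` takes a byte count, so `-M500000` is 500 KB, which rounds down to 0 MB in the log. A quick sanity check of the arithmetic (the 2^20 divisor is my assumption about how the log rounds to MB):

```shell
# -M500000 means 500000 bytes; integer-dividing by 2^20 shows why
# the training log reports a 0 MB limit.
echo $((500000 / 1048576))      # prints 0
# The ~5GB limit I actually wanted would have been:
echo $((5000000000 / 1048576))  # prints 4768 (MB)
```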
It's a little unsatisfying that the only error log there is, even in `-vvv` mode, is `Src size is incorrect`.
It's an external package, but incidentally, I kept seeing this error line when using the `zstd::dict` Rust crate separately from the CLI, and was confused for a while. For example, when calling:

```rust
// ...
let in_path = Path::new("~/training_files");
let zdict = zstd::dict::from_files([in_path], 1024 * 1024 * 10)?;
```

I do realize that Ian explains in this issue that files over ~1MB are considered large and the returns probably diminish relative to the size.
My question is whether there is an inherent limit on how large my training set (and correspondingly, the dictionary) can be, and how to properly feed it into memory given that my samples are on average larger than 1MB. Again, in my case it's roughly 300GB worth of JSON files at ~5MB each, though I'd like to understand the general case.
Should I chunk the files manually, provide `--size-hint`, or something else entirely?
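To illustrate what I mean by chunking manually: pre-splitting every sample into 131072-byte blocks (the per-sample load limit from the warning above) with coreutils `split`. The `samples/` and `chunks/` paths and the demo file are made up for the sketch:

```shell
mkdir -p samples chunks
# Make a ~300 KB stand-in for one of my multi-MB JSON samples.
head -c 300000 /dev/zero | tr '\0' 'a' > samples/demo.json
# Split every sample into 131072-byte chunk files so that each
# training sample fits under zstd's per-sample load limit.
for f in samples/*.json; do
    split -b 131072 -d "$f" "chunks/$(basename "$f")."
done
```

Of course this breaks the JSON framing of each file, which is part of why I'm asking whether `-B` or `--size-hint` is the intended way instead.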
Thanks a lot in advance.
I'm running a fresh Ubuntu and built zstd with `make install`:
Linux version 5.4.0-97-generic (buildd@lcy02-amd64-032)
(gcc version 9.3.0 (Ubuntu 9.3.0-17ubuntu1~20.04))
*** zstd command line interface 64-bits v1.5.2, by Yann Collet ***