Inaccurate Training Data File Size #400

@dcwardell7

I am attempting to train a dictionary on a number of mp4 files. When I begin training, zstd reports the total size of these files as significantly lower than the actual size.

Here is my command and some of its output. The 10 files actually total ~26 MB.

zstd --train -vvv data/* -o out/dictionary
*** zstd command line interface 64-bits v1.1.0, by Yann Collet ***
sorting 10 files of total size 1 MB ...                               
finding patterns ... 
minimum ratio : 4 

found 490 matches of length >= 7 at pos      12  
Selected ref at position 621832, of length 31 : saves 7641 (ratio: 246.48)  

found  10 matches of length >= 7 at pos      43  
Selected ref at position 393259, of length 63 : saves 540 (ratio: 8.57)  

found  10 matches of length >= 7 at pos     112  

found  30 matches of length >= 7 at pos     121  
Selected ref at position 69807, of length 17 : saves 168 (ratio: 9.88)  

...

found   4 matches of length >= 7 at pos 1299012  

found   4 matches of length >= 7 at pos 1304503  

found   4 matches of length >= 7 at pos 1304588  

 80 segments found, of total size 1771 
list 25 best segments 
  1: 91 bytes at pos   655575, savings   13451 bytes |........................................| 
  2: 31 bytes at pos   621832, savings    7641 bytes |............................!.T| 
  3: 63 bytes at pos   427700, savings    5370 bytes |.................@......................| 
  4: 63 bytes at pos   389781, savings    3462 bytes |.............@..........................| 
  5: 63 bytes at pos   189452, savings    2320 bytes |.................@......................| 
  6: 32 bytes at pos   342374, savings    2178 bytes |..............................!.| 
  7: 90 bytes at pos   127725, savings    2116 bytes |.@......................................| 
  8: 63 bytes at pos   384640, savings    1495 bytes |........B........@......................| 
  9: 33 bytes at pos   865135, savings    1066 bytes |..............................!.T| 
 10: 43 bytes at pos   655882, savings     584 bytes |........................................| 
 11: 63 bytes at pos   393259, savings     540 bytes |........................................| 
 12: 63 bytes at pos   156095, savings     390 bytes |.....................@..................| 
 13: 17 bytes at pos   162461, savings     339 bytes |.................| 
 14: 17 bytes at pos   751637, savings     325 bytes |T................| 
 15: 17 bytes at pos  1021895, savings     248 bytes |.................| 
 16: 17 bytes at pos  1133975, savings     234 bytes |_................| 
 17: 17 bytes at pos  1214643, savings     222 bytes |0p...............| 
 18: 17 bytes at pos   427934, savings     211 bytes |.................| 
 19: 57 bytes at pos   427869, savings     209 bytes |.....A..................................| 
 20: 17 bytes at pos    69807, savings     168 bytes |.F...............| 
 21: 17 bytes at pos   997075, savings     156 bytes |.................| 
 22: 17 bytes at pos   328272, savings     132 bytes |.................| 
 23: 17 bytes at pos   608203, savings     117 bytes |xz...............| 
 24: 51 bytes at pos   438492, savings      89 bytes |........................................| 
 25: 51 bytes at pos   161544, savings      60 bytes |.@......................................| 
!  warning : selected content significantly smaller than requested (1771 < 112640) 
statistics ...                                                        
HUF_writeCTable error 
dictionary training failed : Error (generic) 
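
To cross-check the "total size 1 MB" figure in the log above, the on-disk sizes can be summed independently of zstd. The following is a minimal sketch, not part of the original report: the helper name `total_size` and its `cap` parameter are hypothetical, and `data/*` is taken from the command shown above. The optional cap only illustrates one hypothesis worth testing, namely that a trainer might read a bounded amount of each sample (the highest position in the log, 1304588, is close to 10 × 128 KiB); it is an assumption, not confirmed zstd behaviour.

```python
import glob
import os
from typing import Iterable, Optional


def total_size(paths: Iterable[str], cap: Optional[int] = None) -> int:
    """Sum the sizes of regular files, optionally capping each file at `cap` bytes."""
    total = 0
    for path in paths:
        if not os.path.isfile(path):
            continue  # skip directories and anything that is not a regular file
        size = os.path.getsize(path)
        # `cap` is a hypothetical per-file limit, used only for comparison
        total += size if cap is None else min(size, cap)
    return total


if __name__ == "__main__":
    files = sorted(glob.glob("data/*"))  # same glob as in the zstd command
    print(f"{len(files)} files, {total_size(files)} bytes on disk")
    print(f"capped at 128 KiB each: {total_size(files, cap=128 * 1024)} bytes")
```

If the capped total is close to what zstd prints while the uncapped total is near 26 MB, that would suggest the discrepancy comes from how much of each file is sampled rather than from a miscount.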
