dataprint

High-performance tool for large JSONL file fingerprinting and structural comparisons.


dataprint is a high-performance tool for schema drift detection or deduplication of very large JSONL files. It uses the structure and semantics of JSON to compute meaningful fingerprints and similarity scores.


Features

  • Hashing: computes a plain BLAKE3 hash of a file
  • Fingerprint: creates a structural fingerprint of a JSONL file
  • Compare: measures how similar two JSONL files are
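
To make the idea of a structural fingerprint concrete, here is a minimal Python sketch. It is not dataprint's implementation, which is not described in this README. It assumes a record's "shape" is the set of (path, type) pairs found in it, and that two files can be compared by the Jaccard similarity of their sets of shapes. The function names (structure_of, record_shape, file_shapes, jaccard) are hypothetical, and BLAKE2b from the Python standard library stands in for BLAKE3, which the standard library does not provide.

```python
import hashlib
import json


def scalar_type(value):
    """Map a JSON scalar to its JSON type name."""
    if value is None:
        return "null"
    # bool is checked before numbers because bool is a subclass of int in Python.
    if isinstance(value, bool):
        return "boolean"
    if isinstance(value, (int, float)):
        return "number"
    return "string"


def structure_of(value, path=""):
    """Yield (path, type) pairs describing the shape of a parsed JSON value."""
    if isinstance(value, dict):
        yield (path, "object")
        for key in value:
            yield from structure_of(value[key], f"{path}.{key}")
    elif isinstance(value, list):
        yield (path, "array")
        for item in value:
            # All elements share one path, so array length does not change the shape.
            yield from structure_of(item, f"{path}[]")
    else:
        yield (path, scalar_type(value))


def record_shape(line):
    """Hash the sorted, de-duplicated shape of one JSONL record.

    Assumption: BLAKE2b is used here only as a stand-in for BLAKE3.
    """
    pairs = sorted(set(structure_of(json.loads(line))))
    encoded = json.dumps(pairs).encode("utf-8")
    return hashlib.blake2b(encoded, digest_size=16).hexdigest()


def file_shapes(lines):
    """Collect the distinct record shapes found in an iterable of JSONL lines."""
    return {record_shape(line) for line in lines if line.strip()}


def jaccard(a, b):
    """Jaccard similarity of two sets; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)
```

Under these assumptions, key order and field values do not affect a record's shape, while a change of type (for example, an id that turns from a number into a string) produces a new shape. That is the kind of change schema drift detection is meant to surface.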

Installation

From Source

git clone https://github.com/ttarvis/dataprint.git
cd dataprint
make
make install

Usage

Basic Examples

dataprint fingerprint file.jsonl
dataprint fingerprint file1.jsonl file2.jsonl file3.jsonl ...
dataprint fingerprint -T file.jsonl
dataprint compare file1.jsonl file2.jsonl
dataprint hash file1.jsonl file2.jsonl ...

More Examples

See Examples.


Benchmarks

Benchmarks were run on an AWS c6a.8xlarge instance (AMD EPYC 7R13, 16 physical cores) with a warm OS page cache. The best of 3 runs was taken. Test files were produced by the included file generators, so the sizes listed are the actual sizes tested.
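
For readers who want to time their own workloads the same way, a best-of-N measurement can be sketched as below. This is an illustrative Python helper, not the repository's benchmark script; the name best_of and its parameters are hypothetical.

```python
import time


def best_of(func, runs=3):
    """Call func `runs` times and return the shortest wall-clock time in seconds."""
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        func()
        timings.append(time.perf_counter() - start)
    # The minimum is the run least disturbed by other activity on the machine.
    return min(timings)
```

Taking the minimum rather than the mean reduces the influence of one-off noise such as a cold cache on the first run.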

Fingerprint — flat JSONL

File Size | Single Threaded | Multithreaded | Speedup | Throughput
1.1 GB    | 5.76s           | 0.81s         | 8.12x   | 1.54 GB/s
11 GB     | 57.81s          | 7.05s         | 7.54x   | 1.43 GB/s

Running benchmarks yourself

cd benchmarks
make generate  # generates test files (~30 min for 11GB)
make run       # runs benchmarks and saves results to benchmarks/results/

Docs

See Docs for technical details on the implementation and additional notes.


License

Apache 2.0 © Terence Tarvis
