High-performance tool for large JSONL file fingerprinting and structural comparisons.
dataprint is a high-performance tool for detecting schema drift in very large JSONL files and for deduplicating them. It uses the structure and semantics of the JSON records to compute meaningful fingerprints and similarity.
- Hash: computes a plain BLAKE3 hash of a file
- Fingerprint: creates a structural fingerprint of a JSONL file
- Compare: measures how similar two JSONL files are
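
To make the idea of a structural fingerprint concrete, here is a minimal Python sketch. It is not dataprint's implementation, which this README does not describe. It assumes a record's "structure" is the set of its key paths and JSON value types, and it uses `hashlib.blake2b` from the standard library as a stand-in for BLAKE3. The function names (`shape`, `record_fingerprint`, `file_fingerprints`, `jaccard`) are hypothetical and exist only for this illustration.

```python
import hashlib
import json


def shape(value, path=""):
    """Yield (path, type-name) pairs describing the structure of a JSON value."""
    if isinstance(value, dict):
        yield path, "object"
        for key in sorted(value):
            yield from shape(value[key], f"{path}.{key}")
    elif isinstance(value, list):
        yield path, "array"
        for item in value:
            yield from shape(item, f"{path}[]")
    elif value is None:
        yield path, "null"
    elif isinstance(value, bool):
        # Checked before numbers because bool is a subclass of int in Python.
        yield path, "boolean"
    elif isinstance(value, (int, float)):
        yield path, "number"
    else:
        yield path, "string"


def record_fingerprint(line):
    """Hash the sorted set of (path, type) pairs; values and key order are ignored."""
    pairs = sorted(set(shape(json.loads(line))))
    # Assumption: blake2b stands in for BLAKE3, which is not in the stdlib.
    return hashlib.blake2b(repr(pairs).encode(), digest_size=16).hexdigest()


def file_fingerprints(lines):
    """Collect the distinct record fingerprints found in an iterable of JSONL lines."""
    return {record_fingerprint(line) for line in lines if line.strip()}


def jaccard(a, b):
    """Jaccard similarity of two fingerprint sets (1.0 when both are empty)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


if __name__ == "__main__":
    file_a = ['{"id": 1, "name": "x"}', '{"id": 2, "name": "y"}']
    file_b = ['{"name": "z", "id": 3}', '{"id": 4, "tags": ["a"]}']
    print(jaccard(file_fingerprints(file_a), file_fingerprints(file_b)))  # 0.5
```

In this sketch, records with the same keys and value types share a fingerprint regardless of key order or the values themselves, so a new or retyped field shows up as a new fingerprint. Jaccard similarity over the fingerprint sets is one simple choice of comparison metric; the metric dataprint actually uses may differ.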
git clone https://github.com/ttarvis/dataprint.git
cd dataprint
make
make install

dataprint fingerprint file.jsonl
dataprint fingerprint file1.jsonl file2.json file3.json ...
dataprint fingerprint -T file.jsonl
dataprint compare file1.jsonl file2.jsonl
dataprint hash file1.jsonl file2.jsonl ...

See Examples.
Benchmarks were run on an AWS c6a.8xlarge instance (AMD EPYC 7R13, 16 physical cores)
with a warm OS page cache. Each result is the best of 3 runs. Test files were produced by the
included file generators, so the sizes listed are the actual sizes tested.
| File Size | Single Threaded | Multithreaded | Speedup | Throughput |
|---|---|---|---|---|
| 1.1 GB | 5.76s | 0.81s | 8.12x | 1.54 GB/s |
| 11 GB | 57.81s | 7.05s | 7.54x | 1.43 GB/s |
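
The "best of 3 runs" method can be sketched in a few lines of Python. This is an illustration only, not the benchmark harness shipped in this repository; `best_of` is a hypothetical helper, and the command it times here is a placeholder that you would replace with the command you want to measure.

```python
import subprocess
import sys
import time


def best_of(cmd, runs=3):
    """Run cmd `runs` times and return the fastest wall-clock time in seconds."""
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        subprocess.run(cmd, check=True, stdout=subprocess.DEVNULL)
        timings.append(time.perf_counter() - start)
    return min(timings)


if __name__ == "__main__":
    # Placeholder command so the sketch runs anywhere; substitute the
    # command you actually want to time.
    cmd = [sys.executable, "-c", "pass"]
    print(f"best of 3: {best_of(cmd):.3f}s")
```

Taking the minimum rather than the mean reduces the influence of one-off noise, such as a first run that has to read the file from disk before it is in the page cache.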
cd benchmarks
make generate # generates test files (~30 min for 11GB)
make run # runs benchmarks and saves results to benchmarks/results/

See Docs for more technical details on the implementation and other notes.
Apache 2.0 © Terence Tarvis