dataprint

High-performance tool for large JSONL file fingerprinting and structural comparisons.

dataprint can be used for schema drift detection and deduplication of very large JSONL files. It uses the structure and semantics of JSON to compute meaningful fingerprints and similarity scores.

Features

Hashing: computes a plain BLAKE3 hash of a file
Fingerprint: creates a structural fingerprint of a JSONL file
Compare: measures how similar two JSONL files are
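
To make the idea of a structural fingerprint concrete, here is a minimal Python sketch. It is not dataprint's algorithm, which this README does not describe. It assumes a fingerprint can be modeled as a hash over the set of (path, type) pairs seen in a file. The names `shape` and `structural_fingerprint` are hypothetical, and `hashlib.blake2b` stands in for BLAKE3, which is not in the Python standard library.

```python
import hashlib
import json


def shape(value, path=""):
    """Yield (path, type-name) pairs describing a JSON value's structure."""
    if isinstance(value, dict):
        yield (path, "object")
        for key in sorted(value):
            yield from shape(value[key], f"{path}.{key}")
    elif isinstance(value, list):
        yield (path, "array")
        for item in value:
            yield from shape(item, f"{path}[]")
    elif isinstance(value, bool):  # check bool before int: bool subclasses int
        yield (path, "boolean")
    elif isinstance(value, (int, float)):
        yield (path, "number")
    elif value is None:
        yield (path, "null")
    else:
        yield (path, "string")


def structural_fingerprint(lines):
    """Hash the set of distinct (path, type) pairs seen across all records.

    Assumption for illustration only: BLAKE2b replaces BLAKE3 here.
    """
    seen = set()
    for line in lines:
        line = line.strip()
        if line:  # skip blank lines
            seen.update(shape(json.loads(line)))
    digest = hashlib.blake2b(digest_size=16)
    for path, kind in sorted(seen):
        digest.update(f"{path}:{kind}\n".encode("utf-8"))
    return digest.hexdigest()


a = ['{"id": 1, "tags": ["x"]}', '{"id": 2, "tags": []}']
b = ['{"id": 99, "tags": ["y", "z"]}']
c = ['{"id": "1", "tags": ["x"]}']
print(structural_fingerprint(a) == structural_fingerprint(b))  # True
print(structural_fingerprint(a) == structural_fingerprint(c))  # False
```

Under this model, files with the same fields and types share a fingerprint regardless of their values, while a type change such as a numeric id becoming a string produces a different one.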

Installation

From Source

git clone https://github.com/ttarvis/dataprint.git
cd dataprint
make
make install

Usage

Basic Examples

dataprint fingerprint file.jsonl

dataprint fingerprint file1.jsonl file2.jsonl file3.jsonl ...

dataprint fingerprint -T file.jsonl

dataprint compare file1.jsonl file2.jsonl

dataprint hash file1.jsonl file2.jsonl ...
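
For intuition about what a structural comparison of two JSONL files can measure, the sketch below scores two inputs by the Jaccard similarity of their record shapes. This is an illustrative assumption, not the metric that `dataprint compare` uses, which this README does not specify. The helper names `record_shape`, `shape_set`, and `jaccard` are hypothetical, and every line is assumed to be a flat JSON object.

```python
import json


def record_shape(record):
    """Describe a flat record by its field names and Python value types."""
    return frozenset((key, type(val).__name__) for key, val in record.items())


def shape_set(lines):
    """Collect the distinct record shapes found in an iterable of JSONL lines."""
    return {record_shape(json.loads(line)) for line in lines if line.strip()}


def jaccard(lines_a, lines_b):
    """Return |A ∩ B| / |A ∪ B| over the two files' shape sets."""
    shapes_a, shapes_b = shape_set(lines_a), shape_set(lines_b)
    if not shapes_a and not shapes_b:
        return 1.0  # two empty inputs are treated as identical
    return len(shapes_a & shapes_b) / len(shapes_a | shapes_b)


old = ['{"id": 1, "name": "a"}', '{"id": 2, "name": "b"}']
new = ['{"id": 3, "name": "c"}',
       '{"id": 4, "name": "d", "email": "d@example.com"}']
print(jaccard(old, new))  # 0.5
```

A score of 1.0 means both files contain exactly the same set of record shapes. In the example, the new `email` field introduces a second shape, so the score drops to 0.5, which is the kind of signal that schema drift detection looks for.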

More Examples

See Examples.

Benchmarks

Benchmarks were run on an AWS c6a.8xlarge instance (AMD EPYC 7R13, 16 physical cores) with a warm OS page cache. Each result is the best of 3 runs. Test files were produced by the included file generators, so the sizes listed are the actual sizes tested.

Fingerprint — flat JSONL

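| File size | Single-threaded | Multithreaded | Speedup | Throughput |
|-----------|-----------------|---------------|---------|------------|
| 1.1 GB    | 5.76 s          | 0.81 s        | 8.12x   | 1.54 GB/s  |
| 11 GB     | 57.81 s         | 7.05 s        | 7.54x   | 1.43 GB/s  |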
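
This README does not say how the multithreaded mode divides a file. One common approach for line-delimited formats, sketched here as an assumption rather than as dataprint's implementation, is to split the file into byte ranges that end on newline boundaries so that each worker parses only whole records. The function name `newline_aligned_ranges` is hypothetical.

```python
import io


def newline_aligned_ranges(stream, size, parts):
    """Split [0, size) into at most `parts` byte ranges ending after a newline.

    `stream` must be a seekable binary stream whose length is `size`.
    """
    ranges = []
    start = 0
    for i in range(1, parts + 1):
        target = size * i // parts  # ideal split point for an even division
        if target <= start:
            continue  # an earlier range already covers this split point
        if target < size:
            stream.seek(target - 1)
            stream.readline()  # move to just past the next newline
            end = stream.tell()
        else:
            end = size  # the final range always runs to the end of the file
        ranges.append((start, end))
        start = end
    return ranges


data = b'{"a":1}\n{"a":2}\n{"a":3}\n{"a":4}\n'  # four 8-byte records
print(newline_aligned_ranges(io.BytesIO(data), len(data), 3))
# [(0, 16), (16, 24), (24, 32)]
```

Each range can then be handed to a separate worker, with per-range results merged at the end. Ranges may be uneven when records are long relative to the file size, as the first range in the example shows.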

Running benchmarks yourself

cd benchmarks
make generate  # generates test files (~30 min for 11GB)
make run       # runs benchmarks and saves results to benchmarks/results/

Docs

See Docs for technical details on the implementation and additional notes.
