Merged

CLI #25

4 changes: 2 additions & 2 deletions .vscode/settings.json
@@ -1,3 +1,3 @@
{
-    "rust-analyzer.cargo.features": ["cli"]
-}
+    "rust-analyzer.cargo.features": "all",
+}
8 changes: 4 additions & 4 deletions README.md
@@ -19,9 +19,9 @@ our gymnast friend, serde.
- Works with any self-describing format with a Serde implementation.
- Suitable for large files.
- Keeps track of some useful info for each type.
-- Keeps track of null/normal/missing/duplicate values separately.
+- Keeps track of null/missing/duplicate values separately.
- Integrates with [Schemars](https://github.com/GREsau/schemars) and
-  [json_typegen](https://github.com/evestera/json_typegen) to produce types and json schema if needed.
+  [json_typegen](https://github.com/evestera/json_typegen) to produce types and a json schema if needed.
- There's a demo website [here](https://schema-analysis.com/).

### Installation
@@ -42,7 +42,7 @@ cargo install schema_analysis --features cli --locked

### CLI Usage

-The `schema_analysis` binary can infer schemas and generate types directly from the command line.
+`schema_analysis` can infer schemas and generate types from data directly from the command line.

```
schema_analysis [OPTIONS] [FILES]...
@@ -59,7 +59,7 @@ and reads from stdin if no files are provided.
| `--output <OUTPUT>` | Output mode (`schema`, `rust`, `typescript`, `typescript-alias`, `kotlin`, `kotlin-kotlinx`, `json-schema`, `shape`) | `schema` |
| `--name <NAME>` | Root type name for code generation | `Root` |
| `--compact` | Compact JSON output (no pretty printing) | |
-| `--no-analysis` | Skip analysis info (counts, samples, min/max, etc.), outputting only the schema structure | |
+| `--minimal` | Skip analysis info (counts, samples, min/max, etc.), outputting only the schema structure | |

**Examples:**

108 changes: 108 additions & 0 deletions packages/python/README.md
@@ -0,0 +1,108 @@
# schema_analysis

## Universal-ish Schema Analysis

Ever wished you could figure out what was in that json file? Or maybe it was xml... Ehr, yaml?
It was definitely toml.

Alas, many great tools will only work with one of those formats, and the internet is not so
nice a place as to finally understand that no, xml is not an acceptable data format.

Enter this neat little tool, a single interface to any self-describing format supported by
our gymnast friend, serde.

### Features

- Works with any self-describing format with a Serde implementation.
- Suitable for large files.
- Keeps track of some useful info for each type (opt out with --minimal).
- Keeps track of null/missing/duplicate values separately.
- Integrates with [Schemars](https://github.com/GREsau/schemars) and
[json_typegen](https://github.com/evestera/json_typegen) to produce types and a json schema if needed.
- There's a demo website [here](https://schema-analysis.com/).

### Installation

```bash
# Run without installing
uvx schema_analysis data.json
# or
pipx run schema_analysis data.json

# Install
pip install schema_analysis
# or
uv tool install schema_analysis
# or
cargo install schema_analysis --features cli --locked
```

### CLI Usage

`schema_analysis` can infer schemas and generate types from data directly from the command line.

```
schema_analysis [OPTIONS] [FILES]...
```

It auto-detects the input format from file extensions (`.json`, `.yaml`/`.yml`, `.xml`, `.toml`, `.cbor`, `.bson`)
and reads from stdin if no files are provided.

**Options:**

| Option | Description | Default |
| --- | --- | --- |
| `--format <FORMAT>` | Override input format (`json`, `yaml`, `xml`, `toml`, `cbor`, `bson`) | auto-detected |
| `--output <OUTPUT>` | Output mode (`schema`, `rust`, `typescript`, `typescript-alias`, `kotlin`, `kotlin-kotlinx`, `json-schema`, `shape`) | `schema` |
| `--name <NAME>` | Root type name for code generation | `Root` |
| `--compact` | Compact JSON output (no pretty printing) | |
| `--minimal` | Skip analysis info (counts, samples, min/max, etc.), outputting only the schema structure | |

**Examples:**

```bash
# Infer a schema from a JSON file
schema_analysis data.json

# Generate Rust types
schema_analysis data.json --output rust --name MyData

# Generate TypeScript interfaces
schema_analysis api.json --output typescript --name ApiResponse

# Generate JSON Schema
schema_analysis data.json --output json-schema

# Merge multiple files into a single schema
schema_analysis file1.json file2.json file3.json

# Read from stdin
cat data.json | schema_analysis --format json
```

### Library Usage

For use as a library, see the [Rust crate](https://crates.io/crates/schema_analysis/) or the [repo](https://github.com/QuartzLibrary/schema_analysis).

### Performance

> These are not proper benchmarks, but should give a vague idea of the performance on an i7-7700HQ laptop (2017) with the raw data already loaded into memory.

| Size | wasm (MB/s) | native (MB/s) | Format | File # |
| --------------------- | ------------ | ------------- | ------ | ------ |
| [~180MB] | ~20s (9) | ~5s (36) | json | 1 |
| [~650MB] | ~150s (4.3) | ~50s (13) | json | 1 |
| [~1.7GB] | ~470s (3.6) | ~145s (11.7) | json | 1 |
| [~2.1GB] | <sup>a</sup> | ~182s (11.5) | json | 1 |
| [~13.3GB]<sup>b</sup> | | ~810s (16.4) | xml | ~200k |

<sup>a</sup> This one seems to go over some kind of browser limit when fetching the data in the Web Worker; I believe I would have to split large files to handle it.

<sup>b</sup> ~2.7GB compressed. This one seems like a worst-case scenario because it includes decompression overhead, and the files had a section of formatted text that resulted in crazy schemas. (The pretty-printed json schema was almost 0.5GB!)


[~180MB]: https://github.com/zemirco/sf-city-lots-json/blob/master/citylots.json
[~650MB]: https://catalog.data.gov/dataset/forestry-planting-spaces
[~1.7GB]: https://catalog.data.gov/dataset/nys-thruway-origin-and-destination-points-for-all-vehicles-15-minute-intervals-2018-q4
[~2.1GB]: https://catalog.data.gov/dataset/turnstile-usage-data-2016
[~13.3GB]: https://ftp.ncbi.nlm.nih.gov/pub/pmc/oa_bulk/
1 change: 1 addition & 0 deletions packages/python/pyproject.toml
@@ -6,6 +6,7 @@ build-backend = "maturin"
name = "schema_analysis"
version = "0.6.0"
description = "Infer schemas from JSON, YAML, XML, TOML, CBOR, and BSON"
+readme = "README.md"
license = { text = "MIT OR Apache-2.0" }
requires-python = ">=3.8"
authors = [{ name = "QuartzLibrary" }]
8 changes: 6 additions & 2 deletions schema_analysis/README.md
@@ -19,9 +19,9 @@ our gymnast friend, serde.
- Works with any self-describing format with a Serde implementation.
- Suitable for large files.
- Keeps track of some useful info for each type.
-- Keeps track of null/normal/missing/duplicate values separately.
+- Keeps track of null/missing/duplicate values separately.
- Integrates with [Schemars](https://github.com/GREsau/schemars) and
-  [json_typegen](https://github.com/evestera/json_typegen) to produce types and json schema if needed.
+  [json_typegen](https://github.com/evestera/json_typegen) to produce types and a json schema if needed.
- There's a demo website [here](https://schema-analysis.com/).

### Usage
@@ -52,6 +52,10 @@ Check [Schema](https://docs.rs/schema_analysis/latest/schema_analysis/enum.Schem
to see what info you get, and [targets](https://github.com/QuartzLibrary/schema_analysis/blob/HEAD/schema_analysis/src/targets)
to see the available integrations (which include code and json schema generation).

+### CLI Usage
+
+You can use this crate as a CLI; more info in the [repo](https://github.com/QuartzLibrary/schema_analysis).
+
### Advanced Usage

I know, I know, the internet is evil and has decided to plague you with not one, but thousands,
4 changes: 2 additions & 2 deletions schema_analysis/src/main.rs
@@ -39,7 +39,7 @@ struct Cli {

/// Only output the schema structure, without analysis info (counts, samples, min/max, etc.)
#[arg(long)]
-no_analysis: bool,
+minimal: bool,
}

#[derive(Clone, Copy, PartialEq, Eq, clap::ValueEnum)]
@@ -73,7 +73,7 @@ fn main() -> Result<()> {
}
let format = resolve_format(&cli)?;

-let output = if cli.no_analysis {
+let output = if cli.minimal {
let mut schema = infer_schema::<()>(format, &cli.files)?;
if format == InputFormat::Xml {
cleanup_xml_schema(&mut schema);
6 changes: 3 additions & 3 deletions schema_analysis/tests/cli.rs
@@ -100,14 +100,14 @@ fn compact_flag() {
}

#[test]
-fn no_analysis_flag() {
+fn minimal_flag() {
cmd()
.arg(input("sample.json"))
-.arg("--no-analysis")
+.arg("--minimal")
.assert()
.success()
.stdout(include_str!(
-"cli_fixtures/expected/json_schema_no_analysis.json"
+"cli_fixtures/expected/json_schema_minimal.json"
));
}

Expand Down