Ontology extraction for knowledge graphs — semantic layer for GraphRAG
From Greek σῆμα — "sign" or "meaning"
Sema automatically extracts a semantic ontology from any data warehouse, builds a knowledge graph, and serves it as a structured context layer for downstream consumers — NL2SQL engines, AI agents, data catalogs, lineage tools, and more.
```
Warehouse Metadata → LLM Interpretation → Knowledge Graph → Hybrid Search (GraphRAG) → Consumer
                             ↑
                    External Enrichment
                    (Atlan, etc.) [optional]
```
Sema reads your warehouse catalog and produces a Semantic Context Object (SCO) — a query-relevant slice of the knowledge graph that any consumer can use without knowing about graph internals, embeddings, or LLM details.
Build (one-time per catalog):

```
Databricks Catalog
  → L1 Structural Extraction   — deterministic schema parsing
  → L2 Semantic Interpretation — LLM-powered entity and property detection
  → L3 Vocabulary Detection    — pattern matching + LLM synonym expansion
  → Commit to Neo4j
  → Embed nodes for vector search
```

Retrieve (per question):

```
Natural language question
  → Hybrid search (vector + lexical)
  → Normalize + dedup seed hits
  → Type-aware graph expansion
  → Dedup expanded artifacts
  → Visibility policy pruning
  → SCO
```
The SCO contains entities, physical assets, join paths, governed values, metrics, ancestry, and semantic type annotations — everything a consumer needs to understand the data.
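To make that concrete, here is a trimmed sketch of what an SCO might carry, expressed as a Python dict; the field names and values are illustrative, not the exact schema:

```python
# Trimmed, hypothetical SCO shape; field names are illustrative only.
sco = {
    "entities": [{"name": "Patient", "description": "A person receiving care"}],
    "assets": [{"table": "oncology.patients", "columns": ["patient_id", "tnm_stage"]}],
    "join_paths": [{"from": "Patient", "to": "Diagnosis",
                    "predicates": ["patients.patient_id = diagnoses.patient_id"]}],
    "governed_values": {"tnm_stage": ["I", "II", "III", "IV"]},
    "metrics": [{"name": "patient_count", "aggregates": "patient_id"}],
    "ancestry": {"oncology.patients": ["bronze.raw_patients"]},
    "semantic_types": {"tnm_stage": "categorical"},
}
```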
Sema builds a multi-layer knowledge graph in Neo4j. Every node has a stable id (UUID); source-backed nodes also carry a ref that anchors them to the external system.
| Layer | Nodes | Purpose |
|---|---|---|
| Physical | `DataSource`, `Catalog`, `Schema`, `Table`, `Column` | Mirrors your warehouse structure. Each carries a platform-scoped `ref` (e.g. `databricks://workspace/catalog/schema/table`) |
| Semantic | `Entity`, `Property`, `Alias`, `Metric` | LLM-inferred business concepts. Entities map to tables via `ENTITY_ON_TABLE`, properties map to columns via `PROPERTY_ON_COLUMN`. Aliases capture synonyms, with `is_preferred` and `description` fields |
| Vocabulary | `Vocabulary`, `ValueSet`, `Term` | Named vocabularies, coded value sets, term hierarchies (ICD-10, AJCC, etc.), and aliases for search expansion |
| Joins | `JoinPath` | First-class join artifacts with ordered `join_predicates`, `hop_count`, `cardinality_hint`, and optional `sql_snippet`. Linked to tables via `USES` and to entities via `FROM_ENTITY`/`TO_ENTITY` |
| Provenance | `Assertion` | Every fact is backed by an assertion with source, confidence, and status (`auto`, `accepted`, `rejected`, `pinned`, `superseded`). Selective `SUBJECT`/`OBJECT` edges link assertions to their resolved nodes |
| Relationship | From | To | Purpose |
|---|---|---|---|
| `IN_SOURCE` | Catalog | DataSource | Catalog belongs to data source |
| `IN_CATALOG` | Schema | Catalog | Schema belongs to catalog |
| `IN_SCHEMA` | Table | Schema | Table belongs to schema |
| `IN_TABLE` | Column | Table | Column belongs to table |
| `ENTITY_ON_TABLE` | Entity | Table | Entity is implemented by table |
| `PROPERTY_ON_COLUMN` | Property | Column | Property is implemented by column |
| `HAS_PROPERTY` | Entity | Property | Entity has semantic property |
| `REFERS_TO` | Alias | Entity/Property/Term | Alias refers to canonical node |
| `HAS_VALUE_SET` | Column | ValueSet | Column has a set of permissible values |
| `MEMBER_OF` | Term | ValueSet | Term belongs to value set |
| `PARENT_OF` | Term | Term | Hierarchical term relationship |
| `CLASSIFIED_AS` | Property | Vocabulary | Property classified under vocabulary |
| `IN_VOCABULARY` | Term | Vocabulary | Term belongs to vocabulary |
| `MEASURES` | Metric | Entity | Metric measures entity |
| `AGGREGATES` | Metric | Property | Metric aggregates property |
| `FILTERS_BY` | Metric | Property/Term | Metric filters by property or term |
| `AT_GRAIN` | Metric | Property/Term | Metric operates at this grain |
| `FROM_ENTITY` | JoinPath | Entity | Join starts from entity |
| `TO_ENTITY` | JoinPath | Entity | Join ends at entity |
| `USES` | JoinPath | Table/Column | Join uses physical asset |
| `SUBJECT` | Assertion | Node | Assertion is about this node |
| `OBJECT` | Assertion | Node | Assertion references this node |
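As a sketch of how these edges compose, the following Python (using the official `neo4j` driver) walks from a semantic entity to the physical columns that implement its properties. It relies only on the labels and relationship types above, and assumes nodes expose a `name` property; the entity name and credentials are placeholders:

```python
from neo4j import GraphDatabase

# Walk Entity -> Property -> Column -> Table using the edges documented above.
# Assumes nodes expose a `name` property; "Patient" is a placeholder entity.
QUERY = """
MATCH (e:Entity {name: $entity})-[:HAS_PROPERTY]->(p:Property),
      (p)-[:PROPERTY_ON_COLUMN]->(c:Column)-[:IN_TABLE]->(t:Table)
RETURN e.name AS entity, p.name AS property, t.name AS table, c.name AS column
"""

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))
with driver.session() as session:
    for record in session.run(QUERY, entity="Patient"):
        print(record["entity"], record["property"], record["table"], record["column"])
driver.close()
```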
All extracted and inferred facts are stored as first-class Assertion records before being resolved into the graph. This enables:
- Conflict resolution — multiple sources can assert different facts; winner selection uses `pinned > accepted > source_precedence > confidence`
- Human overrides — pin or reject assertions without losing the original extraction
- Auditability — every node traces back to the assertions that created it
- Safe rebuilds — wipe and rebuild from source; human overrides are exported and re-imported via `translate_ref()`
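A minimal sketch of that winner-selection ordering, assuming assertions are plain dicts with the fields named above (the actual records live in Neo4j):

```python
# Sketch of the conflict-resolution ordering described above:
# pinned > accepted > source_precedence > confidence.
STATUS_RANK = {"pinned": 2, "accepted": 1, "auto": 0}

def pick_winner(assertions: list[dict]) -> dict:
    """Resolve conflicting assertions about the same fact."""
    live = [a for a in assertions if a["status"] not in ("rejected", "superseded")]
    return max(
        live,
        key=lambda a: (
            STATUS_RANK[a["status"]],
            a.get("source_precedence", 0),
            a.get("confidence", 0.0),
        ),
    )
```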
The Semantic Context Object filters candidates by assertion status and confidence policy before serving them to consumers:
| Status | Included in SCO |
|---|---|
| `pinned` | Always |
| `accepted` | Always |
| `auto` | If confidence >= threshold |
| `rejected` | Never |
| `superseded` | Never |
Confidence thresholds are determined by confidence_policy on each candidate: structural uses 0.5, semantic uses 0.7. When confidence_policy is absent, the threshold is inferred from the source field for backward compatibility.
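Putting the table and the policy together, admission logic amounts to something like this sketch (field names and the `infer_policy` fallback are illustrative, not Sema's actual API):

```python
# Illustrative sketch; assumes candidates expose status, confidence, source,
# and an optional confidence_policy field, per the policy described above.
THRESHOLDS = {"structural": 0.5, "semantic": 0.7}

def infer_policy(source: str) -> str:
    # Hypothetical backward-compat inference from the source field:
    # treat deterministic extraction as structural, everything else as semantic.
    return "structural" if source == "structural_extraction" else "semantic"

def include_in_sco(candidate: dict) -> bool:
    status = candidate["status"]
    if status in ("pinned", "accepted"):
        return True
    if status in ("rejected", "superseded"):
        return False
    # status == "auto": admit only at or above the policy threshold
    policy = candidate.get("confidence_policy") or infer_policy(candidate["source"])
    return candidate["confidence"] >= THRESHOLDS[policy]
```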
Sema is domain-agnostic. If your warehouse has tables and columns, Sema can extract the ontology.
| Domain | Example Vocabularies |
|---|---|
| Healthcare | ICD-10 codes, AJCC staging, TNM classification, patient registries |
| Finance | NAICS codes, currency codes, transaction types, risk categories |
| Retail | SKU hierarchies, product categories, UPC codes, brand taxonomies |
| Manufacturing | BOM hierarchies, part classifications, process codes |
| PropTech | Zoning codes, property types, building classifications, land use codes |
| Any warehouse | Tables, columns, coded values, entity relationships |
Sema currently ships with a Databricks connector. Additional warehouse connectors (Snowflake, BigQuery, etc.) are on the roadmap — the connector interface is pluggable.
For LLM and embedding providers, bring your own model. Sema works with any provider through a unified interface:
| | Supported Providers |
|---|---|
| LLM | OpenRouter, Anthropic, OpenAI, Databricks Model Serving, any OpenAI-compatible endpoint |
| Embeddings | OpenRouter, OpenAI, sentence-transformers (local), Databricks, any OpenAI-compatible endpoint |
- Python 3.12+
- uv — Python package manager (recommended)
- Neo4j 5.x — local via Docker or remote
- Databricks SQL Warehouse — data source (currently the only supported connector)
- LLM API key — any supported provider above
- Embedding API key — any supported provider above, or use local sentence-transformers
```bash
git clone git@github.com:Nine-Sigma/sema.git
cd sema
uv sync  # installs all dependencies into a virtual environment
```

Using pip instead of uv:

```bash
pip install -e .
```

Next, configure your environment:

```bash
cp .env.example .env
```

Edit `.env` with your credentials — see `.env.example` for the full list. Settings can also be passed via CLI flags or a YAML config file (`--config path/to/config.yaml`). CLI flags override env vars, and env vars override config file values.
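For illustration, a config file might look like the sketch below; the key names are hypothetical, so check `.env.example` for the authoritative settings:

```yaml
# Hypothetical config sketch; key names are illustrative, values are the
# documented defaults. CLI flags and env vars override anything set here.
llm:
  provider: openrouter
  model: anthropic/claude-sonnet-4
  timeout: 120
embeddings:
  provider: openrouter
  model: google/gemini-embedding-001
neo4j:
  uri: bolt://localhost:7687
```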
Start Neo4j:

```bash
docker compose up -d
```

This starts Neo4j 5.26 with APOC on `bolt://localhost:7687`. Browser UI at http://localhost:7474.
Build the knowledge graph:

```bash
uv run sema build --catalog my_catalog --schemas schema1,schema2
```

This runs the full pipeline: structural extraction, semantic interpretation, vocabulary detection, graph materialization, and embedding computation. On a typical catalog with ~50 tables, expect 5-15 minutes depending on your LLM provider.
Then retrieve context or query:

```bash
# Get a Semantic Context Object (JSON)
uv run sema context --question "How many patients have stage III breast cancer?"

# Generate SQL from natural language
uv run sema query --question "Average age of patients by cancer type"
```

All commands are run with `uv run sema` (or just `sema` if you installed with pip).
Sema supports `openrouter` (default), `anthropic`, `openai`, `databricks` (Mosaic AI Model Serving), and `custom` (any OpenAI-compatible endpoint) as LLM providers, and the same set plus `sentence-transformers` for embeddings. For Databricks Mosaic-specific operation — endpoint discovery, supported vs. unsupported endpoints, profile-based auth, dimension-guard resolution, and the baseline/candidate eval workflow — see `docs/runbooks/databricks-mosaic-provider.md`.
Build the knowledge graph from your warehouse catalog.
```bash
uv run sema build --catalog my_catalog --schemas schema1,schema2
```

| Flag | Description | Default |
|---|---|---|
| `--catalog` | Catalog name to extract from | — |
| `--schemas` | Comma-separated schema names | all schemas |
| `--table-pattern` | Glob pattern to filter tables | `*` |
| `--table-workers` | Parallel table workers | 4 |
| `--llm-provider` | LLM provider | `openrouter` |
| `--llm-model` | LLM model name | `anthropic/claude-sonnet-4` |
| `--llm-timeout` | LLM request timeout in seconds | 120 |
| `--skip-embeddings` | Create indexes only, skip embeddings | false |
| `--resume` | Skip tables already in the graph | false |
| `--config` | Path to YAML config file | — |
| `--verbose` | Enable verbose output | false |
Retrieve a Semantic Context Object — a query-relevant slice of the knowledge graph.
```bash
uv run sema context --question "How many patients have stage III breast cancer?"
```

| Flag | Description | Default |
|---|---|---|
| `--question` | Natural language question | required |
| `--consumer` | Consumer type for pruning: `nl2sql`, `discovery` | `nl2sql` |
Generate and optionally execute SQL from natural language. Uses the NL2SQL consumer with plan/explain/execute operations.
```bash
# Plan — generate SQL without executing
uv run sema query --question "Average age of patients by cancer type" --operation plan

# Explain — generate SQL and show the execution plan
uv run sema query --question "Average age of patients by cancer type" --operation explain

# Execute — generate and run SQL against Databricks
uv run sema query --question "Average age of patients by cancer type" --operation execute
```

| Flag | Description | Default |
|---|---|---|
| `--question` | Natural language question | required |
| `--operation` | `plan`, `explain`, or `execute` | `plan` |
| `--consumer` | Consumer type | `nl2sql` |
| `--llm-provider` | LLM provider | `openrouter` |
| `--llm-model` | LLM model name | `anthropic/claude-sonnet-4` |
| `--llm-timeout` | LLM request timeout in seconds | 120 |
| `--embedding-provider` | Embedding provider | `openrouter` |
| `--embedding-model` | Embedding model name | `google/gemini-embedding-001` |
| `--verbose` | Return full response JSON | false |
Export low-confidence assertions for human review. Useful for identifying extraction results that may need manual correction.
```bash
# Print to stdout
uv run sema review --threshold 0.85

# Save to file
uv run sema review --threshold 0.7 --output review.json
```

| Flag | Description | Default |
|---|---|---|
| `--threshold` | Confidence threshold — assertions below this are exported | 0.85 |
| `--output` | Output file path | stdout |
Sema uses a pluggable consumer protocol. Consumers receive a pruned SCO and produce task-specific outputs.
The built-in NL2SQL consumer generates constrained SQL from natural language questions:
- Plan — generates SQL with closed-world validation against the SCO
- Explain — generates SQL and shows the Databricks execution plan
- Execute — generates, validates, executes SQL, and synthesizes results
The consumer receives the SQL dialect explicitly and the prompt includes:
- Entity context (name + description)
- Table/column listing with semantic type annotations (e.g., `tnm_stage (categorical)`)
- Join paths with predicates
- Governed filter values (exact values for WHERE clauses)
- Metric definitions (name, formula, aggregates, filters, grains)
- Term hierarchy context
- Dialect-specific guidance (Databricks SQL rules, ANSI fallback)
Prompt truncation follows a strict deterministic cut order when the budget is exceeded.
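The cut order itself is internal to Sema; mechanically, though, deterministic truncation amounts to dropping whole sections in a fixed priority order until the prompt fits, roughly like this sketch (the section names and their order here are hypothetical):

```python
# Illustrative only: the real cut order and section names are defined by Sema.
CUT_ORDER = ["term_hierarchy", "governed_values", "metrics", "join_paths"]

def truncate(sections: dict[str, str], budget: int) -> dict[str, str]:
    """Drop whole sections in a fixed order until the prompt fits the budget."""
    sections = dict(sections)
    for name in CUT_ORDER:
        if sum(len(text) for text in sections.values()) <= budget:
            break
        sections.pop(name, None)
    return sections
```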
Implement the Consumer protocol in src/sema/consumers/base.py:
```python
class MyConsumer:
    name: str = "my_consumer"
    capabilities: set[str] = {"analyze"}

    def context_profile(self) -> ContextProfile: ...
    def run(self, request, sco, deps) -> ConsumerResult: ...
```

Register it in `src/sema/consumers/__init__.py` and it becomes available via `--consumer my_consumer`.
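As a hedged illustration, a trivial consumer that just lists the tables surfaced by the SCO might look like this; the `ContextProfile`/`ConsumerResult` construction and the SCO attribute access are assumptions, so check `base.py` for the real signatures:

```python
# Hypothetical sketch: a consumer that reports which tables the SCO surfaced.
# ContextProfile()/ConsumerResult(...) construction and `sco.assets` access
# are assumptions; see src/sema/consumers/base.py for the actual protocol.
from sema.consumers.base import ConsumerResult, ContextProfile


class TableLister:
    name: str = "table_lister"
    capabilities: set[str] = {"analyze"}

    def context_profile(self) -> ContextProfile:
        return ContextProfile()

    def run(self, request, sco, deps) -> ConsumerResult:
        tables = sorted({asset.table for asset in sco.assets})
        return ConsumerResult(output={"tables": tables})
```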
```bash
# Unit tests (no external services needed)
uv run pytest

# Integration tests (requires Neo4j running)
uv run pytest -m integration

# All tests with coverage
uv run pytest --cov=sema --cov-report=term-missing

# Type checking
uv run mypy src/sema/
```

Apache License 2.0. See LICENSE.
