Ephemeral semantic embedding service. Ingests text, chunks it, embeds it with GTR-T5-base (768d), and returns ranked results. No persistence — all state is in-process and lost on restart.
Rust + ONNX Runtime. Runs in Docker on port 8080.
- Ingest — chunks text by sentence boundaries, embeds each chunk with GTR-T5-base (768d local, always on)
- Search — cosine similarity with optional temporal decay weighting
- Dual embedding — local GTR-T5-base for `organize` role (768d), optional OpenAI text-embedding-ada-002 for `retrieve` role (1536d)
- Per-agent encryption — orthogonal matrix rotation on embeddings; preserves cosine similarity, keys ephemeral
- Vec2text inversion — reconstruct approximate text from an embedding vector (optional, requires inverter models)
- Temp store — named ephemeral vector stores with 2 hr TTL, separate from session store
- Auth — nuts-auth JWT + API token verification (optional; open dev mode if `NUTS_AUTH_JWKS_URL` unset)
The Docker image uses CUDA 12.6. Host driver must be 545+.
```bash
# Check
nvidia-smi

# Ubuntu/Debian upgrade
sudo apt-get install -y nvidia-driver-590 && sudo reboot
```

Windows: download from nvidia.com, run installer, reboot.
Linux:

```bash
curl -fsSL https://get.docker.com | sh
sudo usermod -aG docker $USER
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey \
  | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg
curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list \
  | sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' \
  | sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
sudo apt-get update && sudo apt-get install -y nvidia-container-toolkit
sudo nvidia-ctk runtime configure --runtime=docker && sudo systemctl restart docker
```

Windows: Docker Desktop with WSL 2 backend includes GPU support automatically.
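To confirm the runtime is wired up end to end, a quick smoke test (the CUDA image tag is only an example; any CUDA-enabled image works):

```bash
# Should print the same GPU table as running nvidia-smi on the host
docker run --rm --gpus all nvidia/cuda:12.6.2-base-ubuntu22.04 nvidia-smi
```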
Models are not in the repo. Run once before building the Docker image:
```bash
bash scripts/fetch_models.sh
```

This checks for Python deps, installs them if needed, exports GTR-T5-base + vec2text to `models/` (~280 MB), and runs a verification pass. Pass `--force` to re-export.

Manual export:

```bash
pip install torch transformers sentence-transformers vec2text onnx onnxruntime
python scripts/export_gtr_models.py --output_dir models/ --verify
```

```bash
docker compose up -d
```

Builds from source with CUDA, starts on :8080. No volume needed.
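Once the container is up, `/health` should answer (the exact JSON shape is whatever the service returns; this just checks liveness):

```bash
# Liveness check: status, model info, and session/chunk counts
curl -s http://localhost:8080/health
```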
| Method | Endpoint | Description |
|---|---|---|
| GET | `/health` | Status, model info, session/chunk counts |
| GET | `/sessions` | List all sessions |
| POST | `/sessions/:id/ingest` | Ingest text (chunk + embed) |
| GET | `/sessions/:id/search?q=...` | Semantic search |
| GET | `/sessions/:id` | Session metadata |
| DELETE | `/sessions/:id` | Delete session |
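A minimal session lifecycle, assuming `text` is the only required ingest field:

```bash
# Ingest, search, inspect metadata, then delete the session
curl -X POST http://localhost:8080/sessions/demo/ingest \
  -H "Content-Type: application/json" \
  -d '{"text": "First note. Second note."}'
curl "http://localhost:8080/sessions/demo/search?q=note&n=3"
curl http://localhost:8080/sessions/demo
curl -X DELETE http://localhost:8080/sessions/demo
```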
| Method | Endpoint | Description |
|---|---|---|
| GET | `/temp` | List named temp stores (with TTL) |
| POST | `/temp/:name/ingest` | Ingest into ephemeral named store (2 hr TTL) |
| GET | `/temp/:name/search?q=...` | Search temp store |
| GET | `/temp/:name` | Dump all chunks |
| DELETE | `/temp/:name` | Delete temp store |
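Temp stores follow the same shape as sessions, but are evicted after the 2 hr TTL:

```bash
# Ingest into an ephemeral named store, search it, then clean up early
curl -X POST http://localhost:8080/temp/scratch/ingest \
  -H "Content-Type: application/json" \
  -d '{"text": "Throwaway context for this task."}'
curl "http://localhost:8080/temp/scratch/search?q=context"
curl -X DELETE http://localhost:8080/temp/scratch   # or just let the TTL expire
```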
| Method | Endpoint | Description |
|---|---|---|
| POST | `/agent/:id/register` | Register per-agent orthogonal key |
| POST | `/agent/:id/encrypt` | Encrypt embeddings |
| POST | `/agent/:id/decrypt` | Decrypt embeddings |
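A sketch of the flow; the request bodies are not documented here, so the `embeddings` field name is a guess — check the actual schema before relying on it:

```bash
# Register an ephemeral orthogonal key for agent "a1" (lost on restart)
curl -X POST http://localhost:8080/agent/a1/register

# Encrypt a batch of vectors; cosine similarity between them is preserved
curl -X POST http://localhost:8080/agent/a1/encrypt \
  -H "Content-Type: application/json" \
  -d '{"embeddings": [[0.1, 0.2, 0.3]]}'
```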
| Method | Endpoint | Description |
|---|---|---|
| POST | `/invert` | Reconstruct approximate text from embedding |
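An illustrative call, assuming the request carries a 768d GTR vector under an `embedding` field (the field name is an assumption; a zero vector is used only to keep the example short):

```bash
# Reconstruct approximate text from an embedding (requires inverter models)
curl -X POST http://localhost:8080/invert \
  -H "Content-Type: application/json" \
  -d "$(python3 -c 'import json; print(json.dumps({"embedding": [0.0]*768, "max_length": 64}))')"
```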
```bash
curl -X POST http://localhost:8080/sessions/my-session/ingest \
  -H "Content-Type: application/json" \
  -d '{
    "text": "The harbor was quiet at dawn. Only the sound of halyards against aluminum masts.",
    "source": "journal",
    "emotion_primary": "calm"
  }'
```

```bash
# Basic
curl "http://localhost:8080/sessions/my-session/search?q=morning+at+the+marina&n=5"

# With temporal decay (half-life 24 hours, 30% time weight)
curl "http://localhost:8080/sessions/my-session/search?q=marina&n=5&time_weight=0.3&decay_halflife_hours=24"

# retrieve role (requires OPENAI_API_KEY)
curl "http://localhost:8080/sessions/my-session/search?q=marina&role=retrieve"
```

| Variable | Default | Description |
|---|---|---|
| `PORT` | `8080` | Listen port |
| `MODEL_PATH` | `models/gtr-t5-base.onnx` | GTR-T5-base ONNX embedder |
| `TOKENIZER_PATH` | `models/tokenizer.json` | GTR tokenizer |
| `OPENAI_API_KEY` | — | Enables text-embedding-ada-002 retrieve role |
| `OPENAI_EMBEDDING_MODEL` | `text-embedding-ada-002` | Override OpenAI model |
| `NUTS_AUTH_JWKS_URL` | — | Enable auth (open dev mode if unset) |
| `NUTS_AUTH_VALIDATE_URL` | `https://auth.nuts.services/api/validate` | API token validation |
| `INVERTER_PROJECTION_PATH` | `models/inverter/projection.onnx` | Vec2text projection |
| `INVERTER_ENCODER_PATH` | `models/inverter/encoder.onnx` | Vec2text T5 encoder |
| `INVERTER_DECODER_PATH` | `models/inverter/decoder.onnx` | Vec2text T5 decoder |
| `INVERTER_TOKENIZER_PATH` | `models/inverter/tokenizer.json` | Vec2text tokenizer |
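Variables are passed as ordinary container environment. For example, to enable the OpenAI retrieve role and auth at startup (`<image>` and the JWKS path are placeholders — substitute your own):

```bash
# Run with the retrieve role and nuts-auth enabled
docker run -d -p 8080:8080 --gpus all \
  -e OPENAI_API_KEY=sk-... \
  -e NUTS_AUTH_JWKS_URL=https://auth.nuts.services/.well-known/jwks.json \
  <image>
```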
| Parameter | Default | Description |
|---|---|---|
| `q` | required | Search query text |
| `n` | `5` | Number of results |
| `role` | `organize` | `organize` (768d local) or `retrieve` (1536d OpenAI) |
| `time_weight` | `0.0` | Blend semantic score with recency (0–1) |
| `decay_halflife_hours` | `168` | Recency decay half-life in hours |
| `include_nearby` | — | Include temporally adjacent chunks in results |
| `time_window_minutes` | `30` | Window for nearby chunks |
| `agent_id` | — | Agent ID for encrypted search |
| `max_length` | `64` | Max tokens for inverter output |
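As a sketch of how `time_weight` and `decay_halflife_hours` presumably combine — the server-side formula is not documented here, so this assumes a standard exponential half-life blend:

```bash
# Assumed blend: final = (1 - time_weight) * cosine + time_weight * 2^(-age_hours / halflife)
python3 -c '
cosine, age_hours = 0.82, 24.0
time_weight, halflife = 0.3, 24.0
recency = 2 ** (-age_hours / halflife)   # 0.5 after exactly one half-life
print(round((1 - time_weight) * cosine + time_weight * recency, 3))
'
# → 0.724
```

With `time_weight=0`, ranking reduces to pure cosine similarity; with `time_weight=1`, only recency matters.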
If `NUTS_AUTH_JWKS_URL` is set, the service enforces nuts-auth:
- organize role (local GTR) — always free, no token required
- retrieve role (OpenAI) — requires `Authorization: Bearer <jwt>` or `Authorization: ahp_<token>`

Leave `NUTS_AUTH_JWKS_URL` unset for open dev mode.
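An authorized retrieve-role call looks like (JWT variant shown; `ahp_` API tokens go in the same header):

```bash
# $JWT holds a nuts-auth RS256 JWT
curl -H "Authorization: Bearer $JWT" \
  "http://localhost:8080/sessions/my-session/search?q=marina&role=retrieve"
```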
- Rust — axum, tokio, ort (ONNX Runtime 2.0)
- Embedding — GTR-T5-base (sentence-transformers), 768d, L2-normalized
- Storage — ephemeral `RwLock<HashMap>`, no disk persistence
- Inversion — vec2text gtr-base (projection + T5 encoder/decoder, optional)
- Auth — nuts-auth RS256 JWT + `ahp_` API tokens
Proprietary. All rights reserved.