
feat: vector features, Query abstraction, and domain-specific similarity search #70

@rorybyrne

Description

Summary

Hooks produce features via Feature subclasses. Feature fields can be scalar (tabular storage) or Vector[N] (pgvector storage). A new Query abstraction binds a hook's vector feature to a query-time encode function, enabling domain-specific similarity search as a generic platform capability. OSA deploys the encode function as a long-lived service and auto-generates search endpoints.

Context

The old approach (post-publication index fan-out with ChromaDB + sentence-transformers) has been removed. Embeddings should be hook-produced features like any other derived data. But vector features also need a query-time component: when a user searches by molecule, protein sequence, or image, that input must be encoded into the same vector space. This encoding is domain-specific and must not live in the OSA codebase.

SDK Design

Feature base class

Every hook returns a Feature subclass. Field types determine storage:

class ConplexFeatures(Feature):
    embedding: Vector[1024]       # → pgvector column

class PocketFeatures(Feature):
    pocket_id: str                # → text column
    score: float                  # → float column
    volume: float                 # → float column

class TextEmbeddings(Feature):
    embedding: Vector[1536]       # → pgvector column

  • Scalar fields (str, float, int, bool, datetime) → typed PG columns (existing behavior, driven by ColumnDef / column_mapper.py)
  • Vector[N] fields → vector(N) pgvector columns with HNSW index
  • list[Feature] return type (many cardinality) → multiple rows per record in the feature table

The SDK generates column definitions from Feature class type hints at deploy time and sends them as part of the convention manifest. The server-side ColumnDef / column_mapper.py / build_feature_table() infrastructure stays as-is — it just gains a vector type.

Query abstraction

Query is a declaration (not a decorator) that composes a hook, a feature field, an encode function, and a search metric into a searchable capability:

@hook
def conplex(record: Record[PDBSchema]) -> ConplexFeatures:
    return ConplexFeatures(embedding=encode_protein(record.sequence))

def encode_smiles(smiles: str) -> list[float]:
    return morgan_fingerprint_and_project(smiles)

pocket_match = Query(
    name="pocket-match",
    hook=conplex,
    feature="embedding",
    encode=encode_smiles,
    metric="cosine",
)

Key properties of Query:

  • Not a decorator. It's a noun — a thing you declare, not a function you annotate.
  • Binds the write side (hook) to the read side (encode function). These share an embedding space and can't meaningfully exist independently.
  • The encode function is just a plain Python function. No special annotations. The SDK packages it into a service container because Query references it.
  • Metric determines the PG operator. "cosine" → pgvector <=>, "tanimoto" → custom operator, "fulltext" → tsvector @@.
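The metric-to-operator mapping from the last bullet might be sketched like this; the tanimoto entry is deliberately left as a placeholder, since the design only says a custom operator will be installed:

```python
# Illustrative mapping from Query metric names to Postgres operators.
PG_OPERATORS = {
    "cosine": "<=>",      # pgvector cosine-distance operator
    "fulltext": "@@",     # tsvector match operator
    # "tanimoto": custom operator, installed by a migration (TBD)
}


def order_clause(feature: str, metric: str) -> str:
    """Build the ORDER BY fragment for a similarity search query."""
    op = PG_OPERATORS[metric]
    return f"ORDER BY {feature} {op} $1"
```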

At osa deploy time, the SDK sees the Query and:

  1. Builds the hook container as usual (batch entrypoint via osa-run-hook)
  2. Builds a service container from the same codebase — new osa-run-service entrypoint wraps the encode function in a lightweight HTTP server
  3. Registers both with the OSA server, plus the search config (name, target feature, metric, encoder service URL)

Three SDK primitives

| Primitive | Purpose | Container lifecycle |
| --- | --- | --- |
| @hook | Batch compute, produces Feature instances | K8s Job (run and exit) |
| @ingester | Batch data import | K8s Job (run and exit) |
| Query(...) | Declares a searchable capability, binds hook + encode function | Encode function → K8s Deployment (long-lived) |

The pattern generalizes

Protein-drug similarity (vector, cosine):

class ConplexFeatures(Feature):
    embedding: Vector[1024]

@hook
def conplex(record: Record[PDBSchema]) -> ConplexFeatures:
    return ConplexFeatures(embedding=encode_protein(record.sequence))

def encode_smiles(smiles: str) -> list[float]:
    return morgan_fingerprint_and_project(smiles)

pocket_match = Query(
    name="pocket-match", hook=conplex, feature="embedding",
    encode=encode_smiles, metric="cosine",
)

Protein sequence similarity (vector, cosine):

class ESMFeatures(Feature):
    embedding: Vector[1024]

@hook
def esm(record: Record[PDBSchema]) -> ESMFeatures:
    return ESMFeatures(embedding=esm_encode(record.metadata.sequence))

def encode_sequence(sequence: str) -> list[float]:
    return esm_encode(sequence)

sequence_search = Query(
    name="sequence-similarity", hook=esm, feature="embedding",
    encode=encode_sequence, metric="cosine",
)

Semantic text search (vector, cosine, via OpenAI API):

class TextEmbeddings(Feature):
    embedding: Vector[1536]

@hook
def text_embed(record: Record[PDBSchema]) -> TextEmbeddings:
    text = f"{record.metadata.title} {record.metadata.organism}"
    return TextEmbeddings(embedding=openai_embed(text))

def encode_text(user_input: str) -> list[float]:
    return openai_embed(user_input)

semantic_search = Query(
    name="semantic", hook=text_embed, feature="embedding",
    encode=encode_text, metric="cosine",
)

Chemical substructure search (fingerprint, tanimoto):

class ChemFeatures(Feature):
    fingerprint: Vector[2048]

@hook
def fingerprints(record: Record[ADMETSchema]) -> ChemFeatures:
    return ChemFeatures(fingerprint=compute_morgan_fp(record.metadata.smiles))

def encode_substructure(smiles: str) -> list[int]:
    return compute_morgan_fp(smiles)

substructure_search = Query(
    name="substructure", hook=fingerprints, feature="fingerprint",
    encode=encode_substructure, metric="tanimoto",
)

Boundary with GraphQL: Query handles domain-specific similarity search where user input needs transformation. Standard filtering ("records where resolution < 2.0") is handled by the GraphQL API (#76) over the same feature tables. The two are complementary, not overlapping.

Server-side design

pgvector + vector column type

  • Install pgvector extension (migration)
  • Add vector to column_mapper.py: maps to Vector(dim) column type
  • build_feature_table() creates HNSW index on vector columns
  • Storage: 200K records × 1024-dim × float32 ≈ 800MB raw, ~2-4GB with HNSW index

Generic search endpoint

OSA auto-generates search endpoints from convention search config:

POST /conventions/{srn}/search/{query-name}

The handler is generic — no per-convention code:

  1. Look up convention, find search config for the query name
  2. Call encoder service: POST http://{service}:{port}/encode with user input
  3. Execute pgvector query: SELECT record_srn FROM features.{hook} ORDER BY {feature} <=> $1 LIMIT N
  4. Return ranked records with similarity scores
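The four steps above could be condensed into a handler core like the following sketch, with the encoder call and DB execution injected so the logic stays testable. The table layout (record_srn, features.{hook}) follows the design above; everything else, including the hardcoded cosine operator, is illustrative:

```python
from typing import Callable, Sequence


def run_search(
    cfg: dict,                                   # {"hook": ..., "feature": ...}
    user_input: str,
    encode: Callable[[str], list[float]],        # POSTs to the encoder service in practice
    execute: Callable[[str, tuple], Sequence],   # a DB cursor in practice
    limit: int = 10,
) -> Sequence:
    # 1.-2. Encode the user's input into the hook's embedding space
    #       by calling the long-lived encoder service.
    vector = encode(user_input)

    # 3. Rank records by pgvector distance over the hook's feature table.
    #    Operator is hardcoded to cosine here; the real handler would map
    #    cfg["metric"] to the right operator.
    sql = (
        f"SELECT record_srn FROM features.{cfg['hook']} "
        f"ORDER BY {cfg['feature']} <=> %s LIMIT %s"
    )

    # 4. Return the ranked rows.
    return execute(sql, (str(vector), limit))
```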

Encoder sidecar lifecycle

The encoder service is deployed as a K8s Deployment + ClusterIP Service (internal only) when the convention is registered. The same K8s client infrastructure manages it.

osa-run-service entrypoint

New SDK entrypoint (~50 lines). Discovers the encode function from the registry (same pattern as osa-run-hook discovers hooks via OSA_HOOK_NAME), wraps it in a uvicorn/FastAPI server, serves on a port. Generic — works for any encode function.
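A stdlib-only sketch of the osa-run-service idea: discover an encode function by name and expose it over HTTP. The real entrypoint would use uvicorn/FastAPI as described above; the ENCODERS registry stand-in, the OSA_SERVICE_NAME variable, and the /encode request shape are illustrative assumptions.

```python
import json
import os
from http.server import BaseHTTPRequestHandler, HTTPServer

# Stand-in for the SDK registry that maps Query names to encode functions.
ENCODERS = {"pocket-match": lambda smiles: [0.0] * 1024}


def make_handler(encode):
    """Wrap an arbitrary encode function in a minimal HTTP handler."""

    class EncodeHandler(BaseHTTPRequestHandler):
        def do_POST(self):
            if self.path != "/encode":
                self.send_error(404)
                return
            length = int(self.headers["Content-Length"])
            payload = json.loads(self.rfile.read(length))
            body = json.dumps({"vector": encode(payload["input"])}).encode()
            self.send_response(200)
            self.send_header("Content-Type", "application/json")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)

    return EncodeHandler


if __name__ == "__main__":
    # Discover the encode function by name, mirroring how osa-run-hook
    # discovers hooks via OSA_HOOK_NAME.
    encode = ENCODERS[os.environ["OSA_SERVICE_NAME"]]
    HTTPServer(("", 8000), make_handler(encode)).serve_forever()
```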

Network and secrets

Hooks/encoders that call external APIs (e.g., OpenAI) need:

  • Network access: network: true in OCI config (overrides the default dnsPolicy=None)
  • Secrets: env_from_secret in hook config, injected from K8s Secrets

Implementation steps

Phase 1: pgvector + vector feature type

  • Install pgvector extension (migration)
  • Add vector type to column_mapper.py with dimension parameter
  • Update build_feature_table() to create HNSW index on vector columns
  • Update feature insertion path to handle vector data from hook output

Phase 2: SDK Feature base class + Vector type

  • Feature base class in SDK osa/authoring/
  • Vector[N] type hint that maps to vector column definition
  • Generate ColumnDef list from Feature class type hints at deploy time
  • Update hook entrypoint to handle vector serialization in features.jsonl
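The last bullet (vector serialization in features.jsonl) might look like this minimal sketch; the one-JSON-object-per-line layout with a record_srn key is an assumption about the jsonl schema:

```python
import json


def feature_to_jsonl_line(record_srn: str, fields: dict) -> str:
    """Serialize one Feature instance as a features.jsonl line.

    Vector fields arrive as sequences of floats and are written as plain
    JSON arrays; scalar fields pass through unchanged.
    """
    row = {"record_srn": record_srn}
    for name, value in fields.items():
        if isinstance(value, (list, tuple)):
            row[name] = [float(x) for x in value]
        else:
            row[name] = value
    return json.dumps(row)
```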

Phase 3: Query abstraction + osa-run-service

  • Query declaration type in SDK osa/authoring/
  • osa-run-service entrypoint in SDK osa/runtime/ (wraps encode function in HTTP server)
  • osa deploy generates service container for Query encode functions
  • Convention manifest includes search config from Query declarations

Phase 4: Server-side search endpoint + sidecar lifecycle

  • Convention model gains searches config (from Query declarations)
  • Generic search route handler (call encoder, query pgvector, return results)
  • Deploy encoder as K8s Deployment + ClusterIP on convention registration
  • Health checking and teardown for encoder services

Phase 5: ConPLex hook + encoder (first use case)

  • Package ConPLex (ESM-1b + projection) as hook container
  • Add SMILES encoder entrypoint (RDKit + projection)
  • GPU resource requests in K8s runner (nvidia.com/gpu: 1)
  • Test end-to-end on PDB data

Supersedes

Depends on

Related


    Labels

    design-needed: Needs architectural discussion before implementation
    refactor: Internal restructuring, no behavior change
