Summary
Hooks produce features via Feature subclasses. Feature fields can be scalar (tabular storage) or Vector[N] (pgvector storage). A new Query abstraction binds a hook's vector feature to a query-time encode function, enabling domain-specific similarity search as a generic platform capability. OSA deploys the encode function as a long-lived service and auto-generates search endpoints.
Context
The old approach (post-publication index fan-out with ChromaDB + sentence-transformers) has been removed. Embeddings should be hook-produced features like any other derived data. But vector features also need a query-time component: when a user searches by molecule, protein sequence, or image, that input must be encoded into the same vector space. This encoding is domain-specific and must not live in the OSA codebase.
SDK Design
Feature base class
Every hook returns a Feature subclass. Field types determine storage:
```python
class ConplexFeatures(Feature):
    embedding: Vector[1024]   # → pgvector column

class PocketFeatures(Feature):
    pocket_id: str            # → text column
    score: float              # → float column
    volume: float             # → float column

class TextEmbeddings(Feature):
    embedding: Vector[1536]   # → pgvector column
```
- Scalar fields (`str`, `float`, `int`, `bool`, `datetime`) → typed PG columns (existing behavior, driven by `ColumnDef` / `column_mapper.py`)
- `Vector[N]` fields → `vector(N)` pgvector columns with HNSW index
- `list[Feature]` return type (many cardinality) → multiple rows per record in the feature table
The SDK generates column definitions from Feature class type hints at deploy time and sends them as part of the convention manifest. The server-side ColumnDef / column_mapper.py / build_feature_table() infrastructure stays as-is — it just gains a vector type.
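To make the deploy-time generation concrete, here is a minimal sketch of deriving column definitions from `Feature` type hints. The `Vector`, `Feature`, and `ColumnDef` classes below are illustrative stand-ins, not the SDK's actual implementations, and the PG type names are assumptions:

```python
from dataclasses import dataclass
from datetime import datetime
from typing import get_type_hints

class Vector:
    """Hypothetical parameterizable Vector[N] annotation."""
    dim: int = 0

    def __class_getitem__(cls, dim: int):
        # Vector[1024] produces a subclass carrying the dimension
        return type(f"Vector{dim}", (cls,), {"dim": dim})

class Feature:
    """Base class for hook outputs; subclass fields declare columns."""

@dataclass
class ColumnDef:
    """Stand-in for the server-side ColumnDef (illustrative fields)."""
    name: str
    pg_type: str

_SCALARS = {str: "text", float: "double precision", int: "bigint",
            bool: "boolean", datetime: "timestamptz"}

def columns_for(feature_cls: type) -> list[ColumnDef]:
    """Derive column definitions from a Feature subclass's type hints."""
    cols = []
    for name, hint in get_type_hints(feature_cls).items():
        if isinstance(hint, type) and issubclass(hint, Vector):
            cols.append(ColumnDef(name, f"vector({hint.dim})"))
        else:
            cols.append(ColumnDef(name, _SCALARS[hint]))
    return cols

class ConplexFeatures(Feature):
    embedding: Vector[1024]

class PocketFeatures(Feature):
    pocket_id: str
    score: float
```

The key point is that the mapping is purely mechanical: nothing beyond standard type-hint introspection is needed to produce the manifest's column list.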
Query abstraction
Query is a declaration (not a decorator) that composes a hook, a feature field, an encode function, and a search metric into a searchable capability:
```python
@hook
def conplex(record: Record[PDBSchema]) -> ConplexFeatures:
    return ConplexFeatures(embedding=encode_protein(record.sequence))

def encode_smiles(smiles: str) -> list[float]:
    return morgan_fingerprint_and_project(smiles)

pocket_match = Query(
    name="pocket-match",
    hook=conplex,
    feature="embedding",
    encode=encode_smiles,
    metric="cosine",
)
```
Key properties of Query:
- Not a decorator. It's a noun — a thing you declare, not a function you annotate.
- Binds the write side (hook) to the read side (encode function). These share an embedding space and can't meaningfully exist independently.
- The encode function is just a plain Python function. No special annotations. The SDK packages it into a service container because `Query` references it.
- Metric determines the PG operator: `"cosine"` → pgvector `<=>`, `"tanimoto"` → custom operator, `"fulltext"` → tsvector `@@`.
At `osa deploy` time, the SDK sees the `Query` and:
- Builds the hook container as usual (batch entrypoint via `osa-run-hook`)
- Builds a service container from the same codebase — new `osa-run-service` entrypoint wraps the encode function in a lightweight HTTP server
- Registers both with the OSA server, plus the search config (name, target feature, metric, encoder service URL)
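The doc doesn't pin down the wire format of that search config, but a hypothetical registration payload derived from a `Query` might look like this (all field names and the encoder service naming convention are illustrative assumptions):

```python
from types import SimpleNamespace

def search_config(query) -> dict:
    """Hypothetical shape of the search config a Query contributes to the
    convention manifest at deploy time; field names are illustrative,
    not OSA's actual wire format."""
    return {
        "name": query.name,                          # e.g. "pocket-match"
        "hook": query.hook.__name__,                 # feature table to search
        "feature": query.feature,                    # vector column name
        "metric": query.metric,                      # selects the PG operator
        "encoder_service": f"{query.name}-encoder",  # assumed K8s Service name
    }

# stand-in for the pocket_match declaration above
pocket_match = SimpleNamespace(
    name="pocket-match",
    hook=SimpleNamespace(__name__="conplex"),
    feature="embedding",
    metric="cosine",
)
```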
Three SDK primitives
| Primitive | Purpose | Container lifecycle |
| --- | --- | --- |
| `@hook` | Batch compute, produces `Feature` instances | K8s Job (run and exit) |
| `@ingester` | Batch data import | K8s Job (run and exit) |
| `Query(...)` | Declares a searchable capability, binds hook + encode function | Encode function → K8s Deployment (long-lived) |
The pattern generalizes
Protein-drug similarity (vector, cosine):
```python
class ConplexFeatures(Feature):
    embedding: Vector[1024]

@hook
def conplex(record: Record[PDBSchema]) -> ConplexFeatures:
    return ConplexFeatures(embedding=encode_protein(record.sequence))

def encode_smiles(smiles: str) -> list[float]:
    return morgan_fingerprint_and_project(smiles)

pocket_match = Query(
    name="pocket-match", hook=conplex, feature="embedding",
    encode=encode_smiles, metric="cosine",
)
```
Protein sequence similarity (vector, cosine):
```python
class ESMFeatures(Feature):
    embedding: Vector[1024]

@hook
def esm(record: Record[PDBSchema]) -> ESMFeatures:
    return ESMFeatures(embedding=esm_encode(record.metadata.sequence))

def encode_sequence(sequence: str) -> list[float]:
    return esm_encode(sequence)

sequence_search = Query(
    name="sequence-similarity", hook=esm, feature="embedding",
    encode=encode_sequence, metric="cosine",
)
```
Semantic text search (vector, cosine, via OpenAI API):
```python
class TextEmbeddings(Feature):
    embedding: Vector[1536]

@hook
def text_embed(record: Record[PDBSchema]) -> TextEmbeddings:
    text = f"{record.metadata.title} {record.metadata.organism}"
    return TextEmbeddings(embedding=openai_embed(text))

def encode_text(user_input: str) -> list[float]:
    return openai_embed(user_input)

semantic_search = Query(
    name="semantic", hook=text_embed, feature="embedding",
    encode=encode_text, metric="cosine",
)
```
Chemical substructure search (fingerprint, tanimoto):
```python
class ChemFeatures(Feature):
    fingerprint: Vector[2048]

@hook
def fingerprints(record: Record[ADMETSchema]) -> ChemFeatures:
    return ChemFeatures(fingerprint=compute_morgan_fp(record.metadata.smiles))

def encode_substructure(smiles: str) -> list[int]:
    return compute_morgan_fp(smiles)

substructure_search = Query(
    name="substructure", hook=fingerprints, feature="fingerprint",
    encode=encode_substructure, metric="tanimoto",
)
```
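The `"tanimoto"` metric needs a custom PG operator, but the similarity itself is simple: on binary fingerprints it is the ratio of shared set bits to total set bits. A minimal reference implementation just to pin the formula down (this is not the PG operator, only the math it would implement):

```python
def tanimoto(a: list[int], b: list[int]) -> float:
    """Tanimoto similarity on binary fingerprints: |a AND b| / |a OR b|."""
    if len(a) != len(b):
        raise ValueError("fingerprints must have equal length")
    both = sum(1 for x, y in zip(a, b) if x and y)      # bits set in both
    either = sum(1 for x, y in zip(a, b) if x or y)     # bits set in either
    return both / either if either else 0.0
```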
Boundary with GraphQL: Query handles domain-specific similarity search where user input needs transformation. Standard filtering ("records where resolution < 2.0") is handled by the GraphQL API (#76) over the same feature tables. The two are complementary, not overlapping.
Server-side design
pgvector + vector column type
- Install `pgvector` extension (migration)
- Add `vector` to `column_mapper.py`: maps to `Vector(dim)` column type
- `build_feature_table()` creates HNSW index on vector columns
- Storage: 200K records × 1024-dim × float32 ≈ 800MB raw, ~2-4GB with HNSW index
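Assuming `build_feature_table()` emits plain SQL, the statements for a vector column could look like the following sketch. The HNSW index syntax and `vector_cosine_ops` operator class are pgvector's real ones; the helper itself and the table/column names are illustrative:

```python
def vector_column_ddl(table: str, column: str, dim: int) -> list[str]:
    """Sketch of the SQL a build_feature_table()-style helper might emit
    for a Vector[N] field (OSA's actual DDL generation is not shown here)."""
    return [
        "CREATE EXTENSION IF NOT EXISTS vector;",
        f"ALTER TABLE {table} ADD COLUMN {column} vector({dim});",
        # vector_cosine_ops matches the <=> (cosine distance) operator
        f"CREATE INDEX ON {table} USING hnsw ({column} vector_cosine_ops);",
    ]
```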
Generic search endpoint
OSA auto-generates search endpoints from convention search config: `POST /conventions/{srn}/search/{query-name}`

The handler is generic — no per-convention code:
- Look up the convention, find the search config for the query name
- Call the encoder service: `POST http://{service}:{port}/encode` with the user input
- Execute the pgvector query: `SELECT record_srn FROM features.{hook} ORDER BY {feature} <=> $1 LIMIT N`
- Return ranked records with similarity scores
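The database step of the generic handler reduces to one parameterized query per search config. A sketch of how it might be built, assuming only identifiers from the validated config are interpolated while the encoded user vector always rides in as a bind parameter:

```python
def search_sql(hook: str, feature: str, metric: str, limit: int = 10) -> str:
    """Build the ranked-retrieval SQL for one search config (sketch).

    `hook`, `feature`, and `metric` come from the server-side convention
    config, never from user input; the user's encoded vector is passed
    separately as the $1 bind parameter.
    """
    ops = {"cosine": "<=>"}        # "tanimoto" would map to a custom operator
    op = ops[metric]
    return (
        f"SELECT record_srn, {feature} {op} $1 AS distance "
        f"FROM features.{hook} "
        f"ORDER BY {feature} {op} $1 LIMIT {int(limit)}"
    )
```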
Encoder sidecar lifecycle
The encoder service is deployed as a K8s Deployment + ClusterIP Service (internal only) when the convention is registered. The same K8s client infrastructure manages it.
osa-run-service entrypoint
New SDK entrypoint (~50 lines). Discovers the encode function from the registry (same pattern as osa-run-hook discovers hooks via OSA_HOOK_NAME), wraps it in a uvicorn/FastAPI server, serves on a port. Generic — works for any encode function.
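To show how little such an entrypoint needs, here is a self-contained sketch using only the stdlib `http.server` (the real entrypoint uses uvicorn/FastAPI and discovers the encode function from the registry; the handler shape, `/encode` path, and JSON payload here are assumptions):

```python
import json
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

def make_encode_server(encode, port: int = 0) -> HTTPServer:
    """Wrap a plain encode function in a minimal HTTP server (sketch)."""
    class EncodeHandler(BaseHTTPRequestHandler):
        def do_POST(self):
            if self.path != "/encode":
                self.send_error(404)
                return
            length = int(self.headers.get("Content-Length", 0))
            payload = json.loads(self.rfile.read(length))
            vector = encode(payload["input"])        # domain-specific encoding
            body = json.dumps({"vector": vector}).encode()
            self.send_response(200)
            self.send_header("Content-Type", "application/json")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)

        def log_message(self, *args):                # silence request logging
            pass

    # port=0 lets the OS assign a free port; K8s would pin a fixed one
    return HTTPServer(("127.0.0.1", port), EncodeHandler)
```

Because `encode` is just a function argument, the same wrapper serves any encoder, which is exactly the "generic — works for any encode function" property described above.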
Network and secrets
Hooks/encoders that call external APIs (e.g., OpenAI) need:
- Network access: `network: true` in OCI config (overrides the default `dnsPolicy=None`)
- Secrets: `env_from_secret` in hook config, injected from K8s Secrets
Implementation steps
Phase 1: pgvector + vector feature type
- `vector` type in `column_mapper.py` with dimension parameter
- `build_feature_table()` to create HNSW index on vector columns

Phase 2: SDK Feature base class + Vector type
- `Feature` base class in SDK `osa/authoring/`
- `Vector[N]` type hint that maps to a vector column definition
- `ColumnDef` list generated from `Feature` class type hints at deploy time

Phase 3: Query abstraction + osa-run-service
- `Query` declaration type in SDK `osa/authoring/`
- `osa-run-service` entrypoint in SDK `osa/runtime/` (wraps encode function in HTTP server)
- `osa deploy` generates service container for Query encode functions

Phase 4: Server-side search endpoint + sidecar lifecycle
- Convention `searches` config (from Query declarations)

Phase 5: ConPLex hook + encoder (first use case)
- GPU resource request (`nvidia.com/gpu: 1`)

Supersedes
- `VectorIndexHandler` / `FanOutToIndexBackends` / ChromaDB approach (already removed)

Depends on

Related