
feat: vector features, Query abstraction, and domain-specific similarity search #70

@rorybyrne

Description

Summary

Hooks produce features via Feature subclasses. Feature fields can be scalar (tabular storage) or Vector[N] (pgvector storage). A new Query abstraction binds a hook's vector feature to a query-time encode function, enabling domain-specific similarity search as a generic platform capability. OSA deploys the encode function as a long-lived service and auto-generates search endpoints.

Context

The old approach (post-publication index fan-out with ChromaDB + sentence-transformers) has been removed. Embeddings should be hook-produced features like any other derived data. But vector features also need a query-time component: when a user searches by molecule, protein sequence, or image, that input must be encoded into the same vector space. This encoding is domain-specific and must not live in the OSA codebase.

SDK Design

Feature base class

Every hook returns a Feature subclass. Field types determine storage:

class ConplexFeatures(Feature):
    embedding: Vector[1024]       # → pgvector column

class PocketFeatures(Feature):
    pocket_id: str                # → text column
    score: float                  # → float column
    volume: float                 # → float column

class TextEmbeddings(Feature):
    embedding: Vector[1536]       # → pgvector column

  • Scalar fields (str, float, int, bool, datetime) → typed PG columns (existing behavior, driven by ColumnDef / column_mapper.py)
  • Vector[N] fields → vector(N) pgvector columns with HNSW index
  • list[Feature] return type (many cardinality) → multiple rows per record in the feature table

The SDK generates column definitions from Feature class type hints at deploy time and sends them as part of the convention manifest. The server-side ColumnDef / column_mapper.py / build_feature_table() infrastructure stays as-is — it just gains a vector type.

Query abstraction

Query is a declaration (not a decorator) that composes a hook, a feature field, an encode function, and a search metric into a searchable capability:

@hook
def conplex(record: Record[PDBSchema]) -> ConplexFeatures:
    return ConplexFeatures(embedding=encode_protein(record.sequence))

def encode_smiles(smiles: str) -> list[float]:
    return morgan_fingerprint_and_project(smiles)

pocket_match = Query(
    name="pocket-match",
    hook=conplex,
    feature="embedding",
    encode=encode_smiles,
    metric="cosine",
)

Key properties of Query:

  • Not a decorator. It's a noun — a thing you declare, not a function you annotate.
  • Binds the write side (hook) to the read side (encode function). These share an embedding space and can't meaningfully exist independently.
  • The encode function is just a plain Python function. No special annotations. The SDK packages it into a service container because Query references it.
  • Metric determines the PG operator. "cosine" → pgvector <=>, "tanimoto" → custom operator, "fulltext" → tsvector @@.
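The metric-to-operator mapping from the last bullet might be sketched like this; the tanimoto entry is deliberately left as a placeholder, since the design only says a custom operator will be installed:

```python
# Illustrative mapping from Query metric names to Postgres operators.
PG_OPERATORS = {
    "cosine": "<=>",      # pgvector cosine-distance operator
    "fulltext": "@@",     # tsvector match operator
    # "tanimoto": custom operator, installed by a migration (TBD)
}


def order_clause(feature: str, metric: str) -> str:
    """Build the ORDER BY fragment for a similarity search query."""
    op = PG_OPERATORS[metric]
    return f"ORDER BY {feature} {op} $1"
```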

At osa deploy time, the SDK sees the Query and:

  1. Builds the hook container as usual (batch entrypoint via osa-run-hook)
  2. Builds a service container from the same codebase — new osa-run-service entrypoint wraps the encode function in a lightweight HTTP server
  3. Registers both with the OSA server, plus the search config (name, target feature, metric, encoder service URL)

Three SDK primitives

| Primitive | Purpose | Container lifecycle |
| --- | --- | --- |
| @hook | Batch compute, produces Feature instances | K8s Job (run and exit) |
| @ingester | Batch data import | K8s Job (run and exit) |
| Query(...) | Declares a searchable capability, binds hook + encode function | Encode function → K8s Deployment (long-lived) |

The pattern generalizes

Protein-drug similarity (vector, cosine):

class ConplexFeatures(Feature):
    embedding: Vector[1024]

@hook
def conplex(record: Record[PDBSchema]) -> ConplexFeatures:
    return ConplexFeatures(embedding=encode_protein(record.sequence))

def encode_smiles(smiles: str) -> list[float]:
    return morgan_fingerprint_and_project(smiles)

pocket_match = Query(
    name="pocket-match", hook=conplex, feature="embedding",
    encode=encode_smiles, metric="cosine",
)

Protein sequence similarity (vector, cosine):

class ESMFeatures(Feature):
    embedding: Vector[1024]

@hook
def esm(record: Record[PDBSchema]) -> ESMFeatures:
    return ESMFeatures(embedding=esm_encode(record.metadata.sequence))

def encode_sequence(sequence: str) -> list[float]:
    return esm_encode(sequence)

sequence_search = Query(
    name="sequence-similarity", hook=esm, feature="embedding",
    encode=encode_sequence, metric="cosine",
)

Semantic text search (vector, cosine, via OpenAI API):

class TextEmbeddings(Feature):
    embedding: Vector[1536]

@hook
def text_embed(record: Record[PDBSchema]) -> TextEmbeddings:
    text = f"{record.metadata.title} {record.metadata.organism}"
    return TextEmbeddings(embedding=openai_embed(text))

def encode_text(user_input: str) -> list[float]:
    return openai_embed(user_input)

semantic_search = Query(
    name="semantic", hook=text_embed, feature="embedding",
    encode=encode_text, metric="cosine",
)

Chemical substructure search (fingerprint, tanimoto):

class ChemFeatures(Feature):
    fingerprint: Vector[2048]

@hook
def fingerprints(record: Record[ADMETSchema]) -> ChemFeatures:
    return ChemFeatures(fingerprint=compute_morgan_fp(record.metadata.smiles))

def encode_substructure(smiles: str) -> list[int]:
    return compute_morgan_fp(smiles)

substructure_search = Query(
    name="substructure", hook=fingerprints, feature="fingerprint",
    encode=encode_substructure, metric="tanimoto",
)

Boundary with GraphQL: Query handles domain-specific similarity search where user input needs transformation. Standard filtering ("records where resolution < 2.0") is handled by the GraphQL API (#76) over the same feature tables. The two are complementary, not overlapping.

Server-side design

pgvector + vector column type

  • Install pgvector extension (migration)
  • Add vector to column_mapper.py: maps to Vector(dim) column type
  • build_feature_table() creates HNSW index on vector columns
  • Storage: 200K records × 1024-dim × float32 ≈ 800MB raw, ~2-4GB with HNSW index

Generic search endpoint

OSA auto-generates search endpoints from convention search config:

POST /conventions/{srn}/search/{query-name}

The handler is generic — no per-convention code:

  1. Look up convention, find search config for the query name
  2. Call encoder service: POST http://{service}:{port}/encode with user input
  3. Execute pgvector query: SELECT record_srn FROM features.{hook} ORDER BY {feature} <=> $1 LIMIT N
  4. Return ranked records with similarity scores
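The four steps above could be condensed into a handler core like the following sketch, with the encoder call and DB execution injected so the logic stays testable. The table layout (record_srn, features.{hook}) follows the design above; everything else, including the hardcoded cosine operator, is illustrative:

```python
from typing import Callable, Sequence


def run_search(
    cfg: dict,                                   # {"hook": ..., "feature": ...}
    user_input: str,
    encode: Callable[[str], list[float]],        # POSTs to the encoder service in practice
    execute: Callable[[str, tuple], Sequence],   # a DB cursor in practice
    limit: int = 10,
) -> Sequence:
    # 1.-2. Encode the user's input into the hook's embedding space
    #       by calling the long-lived encoder service.
    vector = encode(user_input)

    # 3. Rank records by pgvector distance over the hook's feature table.
    #    Operator is hardcoded to cosine here; the real handler would map
    #    cfg["metric"] to the right operator.
    sql = (
        f"SELECT record_srn FROM features.{cfg['hook']} "
        f"ORDER BY {cfg['feature']} <=> %s LIMIT %s"
    )

    # 4. Return the ranked rows.
    return execute(sql, (str(vector), limit))
```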

Encoder sidecar lifecycle

The encoder service is deployed as a K8s Deployment + ClusterIP Service (internal only) when the convention is registered. The same K8s client infrastructure manages it.

osa-run-service entrypoint

New SDK entrypoint (~50 lines). Discovers the encode function from the registry (same pattern as osa-run-hook discovers hooks via OSA_HOOK_NAME), wraps it in a uvicorn/FastAPI server, serves on a port. Generic — works for any encode function.
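A stdlib-only sketch of the osa-run-service idea: discover an encode function by name and expose it over HTTP. The real entrypoint would use uvicorn/FastAPI as described above; the ENCODERS registry stand-in, the OSA_SERVICE_NAME variable, and the /encode request shape are illustrative assumptions.

```python
import json
import os
from http.server import BaseHTTPRequestHandler, HTTPServer

# Stand-in for the SDK registry that maps Query names to encode functions.
ENCODERS = {"pocket-match": lambda smiles: [0.0] * 1024}


def make_handler(encode):
    """Wrap an arbitrary encode function in a minimal HTTP handler."""

    class EncodeHandler(BaseHTTPRequestHandler):
        def do_POST(self):
            if self.path != "/encode":
                self.send_error(404)
                return
            length = int(self.headers["Content-Length"])
            payload = json.loads(self.rfile.read(length))
            body = json.dumps({"vector": encode(payload["input"])}).encode()
            self.send_response(200)
            self.send_header("Content-Type", "application/json")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)

    return EncodeHandler


if __name__ == "__main__":
    # Discover the encode function by name, mirroring how osa-run-hook
    # discovers hooks via OSA_HOOK_NAME.
    encode = ENCODERS[os.environ["OSA_SERVICE_NAME"]]
    HTTPServer(("", 8000), make_handler(encode)).serve_forever()
```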

Network and secrets

Hooks/encoders that call external APIs (e.g., OpenAI) need:

  • Network access: network: true in OCI config (overrides the default dnsPolicy=None)
  • Secrets: env_from_secret in hook config, injected from K8s Secrets

Implementation steps

Phase 1: pgvector + vector feature type

  • Install pgvector extension (migration)
  • Add vector type to column_mapper.py with dimension parameter
  • Update build_feature_table() to create HNSW index on vector columns
  • Update feature insertion path to handle vector data from hook output

Phase 2: SDK Feature base class + Vector type

  • Feature base class in SDK osa/authoring/
  • Vector[N] type hint that maps to vector column definition
  • Generate ColumnDef list from Feature class type hints at deploy time
  • Update hook entrypoint to handle vector serialization in features.jsonl
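The last bullet (vector serialization in features.jsonl) might look like this minimal sketch; the one-JSON-object-per-line layout with a record_srn key is an assumption about the jsonl schema:

```python
import json


def feature_to_jsonl_line(record_srn: str, fields: dict) -> str:
    """Serialize one Feature instance as a features.jsonl line.

    Vector fields arrive as sequences of floats and are written as plain
    JSON arrays; scalar fields pass through unchanged.
    """
    row = {"record_srn": record_srn}
    for name, value in fields.items():
        if isinstance(value, (list, tuple)):
            row[name] = [float(x) for x in value]
        else:
            row[name] = value
    return json.dumps(row)
```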

Phase 3: Query abstraction + osa-run-service

  • Query declaration type in SDK osa/authoring/
  • osa-run-service entrypoint in SDK osa/runtime/ (wraps encode function in HTTP server)
  • osa deploy generates service container for Query encode functions
  • Convention manifest includes search config from Query declarations

Phase 4: Server-side search endpoint + sidecar lifecycle

  • Convention model gains searches config (from Query declarations)
  • Generic search route handler (call encoder, query pgvector, return results)
  • Deploy encoder as K8s Deployment + ClusterIP on convention registration
  • Health checking and teardown for encoder services

Phase 5: ConPLex hook + encoder (first use case)

  • Package ConPLex (ESM-1b + projection) as hook container
  • Add SMILES encoder entrypoint (RDKit + projection)
  • GPU resource requests in K8s runner (nvidia.com/gpu: 1)
  • Test end-to-end on PDB data

Supersedes

Depends on

Related


    Labels

    design-needed: Needs architectural discussion before implementation
    refactor: Internal restructuring, no behavior change
