
feat: add OpenAI embedding backend for vector indexing #33

@rorybyrne

Summary

Add an OpenAI embedding backend as an alternative to the local sentence-transformers model. This enables much faster bulk indexing at minimal cost.

Motivation

The current local embedding model (MiniLM-L6-v2) takes ~30 seconds per batch of 64 records on a small Fly.io instance. For 250k records, that translates to ~75 hours of indexing time.

OpenAI's embedding API can process the same dataset in minutes for under $1:

  • 250k records × ~150 tokens = ~37.5M tokens
  • text-embedding-3-small: $0.02/1M tokens = ~$0.75 total
  • text-embedding-3-large: $0.13/1M tokens = ~$5 total
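
As a quick sanity check on the numbers above (a back-of-envelope script using the per-1M-token prices quoted in the bullets):

```python
records = 250_000
tokens_per_record = 150  # rough average per record
total_tokens_m = records * tokens_per_record / 1_000_000  # millions of tokens

# Per-1M-token prices from the bullets above
cost_small = total_tokens_m * 0.02  # text-embedding-3-small
cost_large = total_tokens_m * 0.13  # text-embedding-3-large
```

This gives 37.5M tokens, ~$0.75 for `text-embedding-3-small` and ~$4.88 for `text-embedding-3-large`.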

Proposed Implementation

  1. New backend class: OpenAIStorageBackend in osa/infrastructure/index/openai/

  2. Config:

    class OpenAIBackendConfig(BackendConfig):
        api_key: str | None = None  # falls back to the OPENAI_API_KEY env var
        model: str = "text-embedding-3-small"
        batch_size: int = 2048  # OpenAI supports up to 2048 inputs per request
        dimensions: int | None = None  # optional dimensionality reduction
  3. Backend implementation:

    • Use openai Python SDK (async client)
    • Batch requests (up to 2048 embeddings per API call)
    • Store in ChromaDB (same as current vector backend)
    • Handle rate limits with exponential backoff
  4. Config selection: Allow choosing backend type in index config:

    indexes:
      vector:
        type: openai  # or "local" for sentence-transformers
        model: text-embedding-3-small
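
The batching and rate-limit handling in step 3 could be sketched roughly as below. This is a minimal sketch, not the final implementation: `chunk`, `embed_with_backoff`, and the placeholder error type are illustrative; the real backend would catch `openai.RateLimitError` from the async SDK client and call `client.embeddings.create(model=..., input=batch)`.

```python
import asyncio
import random

MAX_BATCH = 2048  # OpenAI's per-request input limit


def chunk(texts, size=MAX_BATCH):
    """Split a list of texts into batches no larger than `size`."""
    return [texts[i:i + size] for i in range(0, len(texts), size)]


async def embed_with_backoff(call, batch, max_retries=5, base_delay=1.0):
    """Retry `call(batch)` with exponential backoff plus jitter.

    `call` stands in for the async OpenAI embeddings request; the
    RuntimeError below is a placeholder for openai.RateLimitError.
    """
    for attempt in range(max_retries):
        try:
            return await call(batch)
        except RuntimeError:  # placeholder for openai.RateLimitError
            if attempt == max_retries - 1:
                raise
            delay = base_delay * 2 ** attempt + random.uniform(0, 0.1)
            await asyncio.sleep(delay)
```

Each chunk's embeddings would then be upserted into ChromaDB exactly as the current vector backend does.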

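The backend selection in step 4 might map the `type` field of the parsed index config to a backend class roughly like this (the names here are illustrative, not the project's actual API):

```python
# Hypothetical registry mapping the `type` field in the index config
# to a backend; class names are placeholders for illustration.
BACKENDS = {
    "local": "SentenceTransformersBackend",
    "openai": "OpenAIStorageBackend",
}


def resolve_backend(index_config: dict) -> str:
    """Pick a backend from the parsed YAML index config, defaulting to local."""
    backend_type = index_config.get("type", "local")
    try:
        return BACKENDS[backend_type]
    except KeyError:
        raise ValueError(f"unknown vector backend type: {backend_type!r}")
```

Defaulting to `local` keeps existing configs working without changes.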
Alternatives Considered

  • Voyage AI: Similar pricing, good quality, but OpenAI is more widely used
  • Cohere: Slightly more expensive ($0.10/1M tokens)
  • Larger Fly instance: More expensive than API costs for bulk indexing
  • GPU instance: Overkill for this use case

Tasks

  • Add openai to dependencies
  • Create OpenAIBackendConfig
  • Implement OpenAIStorageBackend with batching and rate limit handling
  • Update config to support backend type selection
  • Add integration test with mocked OpenAI responses
  • Update documentation

Metadata

Labels

feature (New functionality)
