Implement Synix v0.9: multi-provider, concurrent, hybrid search #3
…d search, and full observability

Module restructure (S02):
- New package layout: core/, build/, adapters/, cli/, search/
- Old modules become thin re-export shims for backward compatibility
- Core models extracted to core/models.py, re-exported from __init__.py

Logging & observability (S01):
- SynixLogger with JSONL file logging per run (build/logs/)
- RunLog/StepLog dataclasses tracking LLM calls, tokens, cache hits
- -v/-vv verbosity flags on CLI

Multi-provider LLM (S03):
- LLMClient wrapping openai.OpenAI(base_url=...) for any OpenAI-compatible API
- LLMConfig/EmbeddingConfig dataclasses with config precedence
- Replaced anthropic SDK dependency with openai SDK

Concurrent execution (S04):
- LLMExecutor ABC with LLMRequest/LLMResult
- SequentialExecutor and ConcurrentExecutor (ThreadPool + Semaphore + backoff)
- --concurrency/-j CLI flag, wired into runner for by_conversation grouping

Embeddings & hybrid search (S05):
- EmbeddingProvider with content-hash caching (binary float32 files)
- HybridRetriever: keyword, semantic, and hybrid modes
- RRF score fusion (k=60), --mode/--top-k CLI flags

synix plan (S06):
- BuildPlan with per-layer estimates (LLM calls, tokens, cost)
- --json/--save flags for plan output

Shadow index swap (S07):
- Build search index to search_shadow.db, atomic os.replace() on success
- Old index preserved on build failure

Artifact diffing (S08):
- diff_builds(), diff_artifact() with unified diff output

synix verify (S09):
- 8 integrity checks: build_exists, manifest, artifacts, provenance, search_index, content_hashes, no_orphans, merge_integrity
- --check flag for selective verification, --json output

Text adapter (S10):
- YAML frontmatter parsing, filename date inference, turn detection
- Adapter registry with auto-detection by file extension

Merge transform:
- Jaccard similarity clustering with union-find
- Natural language constraint parsing (e.g., "NEVER merge different customer_id")
- Threshold + constraints in cache key

Search CLI extensions:
- --step (layer filter), --trace (provenance tree), --customer (metadata filter)

Test infrastructure:
- Mock LLM server (OpenAI-compatible HTTP, deterministic fixtures, error injection)
- 3 demo corpora: personal (30 conv), startup (50 conv, 10 customers), incident (100 conv, 20 customers)
- 477 tests (0 skipped, 0 failed): unit, integration, and E2E
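The "RRF score fusion (k=60)" item can be illustrated with a short sketch of Reciprocal Rank Fusion. This is not the HybridRetriever code from this PR; the function name `rrf_fuse` and the list-of-ids input shape are assumptions made for illustration. Each document scores the sum of 1 / (k + rank) over every ranking it appears in.

```python
from collections import defaultdict


def rrf_fuse(rankings, k=60, top_k=10):
    """Fuse ranked lists of document ids with Reciprocal Rank Fusion.

    Hypothetical helper: `rankings` is a list of lists, each ordered
    best-first (for example one keyword ranking and one semantic ranking).
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            # A document contributes 1 / (k + rank) per list it appears in.
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by id so output is deterministic.
    fused = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return fused[:top_k]


keyword_hits = ["a", "b", "c"]
semantic_hits = ["b", "d", "a"]
fused = rrf_fuse([keyword_hits, semantic_hits])
```

Because only ranks enter the formula, keyword scores and embedding similarities never need to be normalized onto a common scale before fusing.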
Review Response

Accepted all P0 and P1 items. Working on fixes now.

P0 (adapter correctness):
P1 (perf + reliability):

Note on anthropic dep: reviewer was incorrect. Pushing fix commit shortly.
P0 (adapter correctness):
- ChatGPT: follow current_node path for linearization, fall back to first-child traversal for exports without current_node
- ChatGPT: filter to user/assistant roles only (exclude system/tool/plugin)
- Claude: normalize sender labels (human → user) for cross-source consistency
- Claude: proper ISO-8601 timestamp parsing with fallback

P1 (performance):
- Merge transform: pre-tokenize inputs once, eliminating O(n²) re-tokenization in pairwise Jaccard similarity comparisons
- Semantic search: in-memory embedding cache keyed by content hash, loaded once per session instead of re-embedding all rows per query

P1 (architecture):
- Relax single root-layer constraint to allow multi-source pipelines (e.g., separate ChatGPT and Claude level-0 layers)
- Deep-copy config dict before passing to concurrent transform workers to prevent race conditions on shared mutable state

P1 (reliability):
- Atomic cache writes (temp file + fsync + os.replace) for artifact store, provenance tracker, and embedding manifest
- Actionable verify output: fix_hint field on VerifyCheck with specific remediation commands for each failure type

Tests: 481 passed (4 new adapter tests, updated pipeline validation tests)
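The "temp file + fsync + os.replace" pattern named under reliability can be sketched as follows. This is a minimal illustration, not the artifact store code; the helper name `atomic_write_bytes` is hypothetical. The temp file is created in the target's directory so that `os.replace` stays on one filesystem.

```python
import os
import tempfile


def atomic_write_bytes(path, data):
    """Write `data` to `path` so readers see the old or the new file, never a partial one."""
    directory = os.path.dirname(os.path.abspath(path))
    # Same directory as the target, so the final rename does not cross filesystems.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".tmp-")
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
            tmp.flush()
            os.fsync(tmp.fileno())  # push file contents to disk before the rename
        os.replace(tmp_path, path)  # atomic swap over any existing file
    except BaseException:
        # Leave no stray temp file behind if the write or rename failed.
        try:
            os.unlink(tmp_path)
        except FileNotFoundError:
            pass
        raise
```

This sketch does not fsync the containing directory, so on some filesystems the rename itself may not survive a power loss; whether the PR handles that case is not stated.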
Conversation create_time only reflects when the conversation started, not when content was last produced. Both ChatGPT and Claude adapters now derive last_message_date from individual per-message timestamps.
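A minimal sketch of that derivation, assuming messages carry optional timezone-aware datetimes. The function signature and the fallback to the conversation's creation time when no message has a timestamp are assumptions, not confirmed adapter behavior.

```python
from datetime import datetime, timezone


def last_message_date(conversation_created, message_timestamps):
    """Return the latest per-message timestamp.

    Assumed fallback: if no message has a usable timestamp, use the
    conversation's creation time instead.
    """
    valid = [ts for ts in message_timestamps if ts is not None]
    return max(valid) if valid else conversation_created


created = datetime(2024, 1, 1, tzinfo=timezone.utc)
messages = [
    datetime(2024, 1, 1, 0, 5, tzinfo=timezone.utc),
    None,  # some exports omit timestamps on individual messages
    datetime(2024, 3, 2, 9, 30, tzinfo=timezone.utc),
]
latest = last_message_date(created, messages)
```

Here `latest` is the March timestamp, two months after `created`, which is the gap the change is meant to capture.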
Consolidate design documents under docs/: DESIGN.md, sprint-checklist.md, demo-test-specs.md, v09-build-plan.md. Add BACKLOG.md capturing deferred items from v0.9 PR review (episode chunking, retrospective provenance docs, full tree export, projections as DAG nodes, etc.).
* feat: add Python SDK for programmatic access to synix projects

  Introduces synix.open(path) and synix.init(path, pipeline=...) entry points with Project, Release, SearchHandle, and typed error hierarchy. Supports build, release, search, artifact inspection, and ref listing through a stable Python API without touching CLI or internals directly. 68 e2e tests + 18 incremental cache tests covering build idempotency, release lifecycle, search correctness, and content-addressed dedup.

* fix: address PR #93 review (API safety, error types, naming)
  - Rename `open()` to `open_project()` to avoid shadowing Python builtin; keep `open` as deprecated alias for backwards compatibility
  - Use UUID-based scratch dir for `release("HEAD")`; prevents concurrent stomping (GPT critical finding)
  - Deep-copy pipeline in `build()` to prevent caller mutation (both reviews)
  - Remove `source()` fallback to undeclared sources; now raises SdkError with list of declared sources (GPT warning)
  - Add `ProjectionNotFoundError`; `flat_file()` no longer raises `SearchNotAvailableError` (GPT minor, wrong error taxonomy)
  - Extract `_resolve_flat_file_path()` to dedup flat_file/flat_file_path (Claude concern)
  - Wrap `_get_closure()` receipt parsing in SdkError (Claude concern)
  - Remove dead `SdkArtifact._from_snapshot_dict` (Claude nit)

* fix: address round-2 review (remove open alias, path validation, scratch race)
  - Remove `open` re-export from __init__.py so it no longer shadows the builtin (both reviews, v2)
  - Export `SdkError` from __init__.py for user-facing error handling
  - Add path traversal validation in SdkSource: rejects `../` and absolute paths in add_text/remove (GPT critical)
  - Fix scratch release race: _get_closure reads snapshot_oid from receipt written by execute_release, not by re-resolving HEAD independently (Claude concern #3)
  - Scratch close() cleans up both work/ and releases/ dirs
  - Update sdk-design.md examples to use open_project (Claude question #4)
  - Update sdk.md quick start to use open_project
  - All tests use open_project; 3 new tests for path traversal + undeclared source

* fix: round-3 review (export all error types, remove stale docs)
  - Export full error hierarchy from __init__.py (SynixNotFoundError, ReleaseNotFoundError, ArtifactNotFoundError, SearchNotAvailableError, EmbeddingRequiredError, PipelineRequiredError, ProjectionNotFoundError)
  - Remove stale deprecated-alias note from sdk.md (open was already removed)
  - Update sdk.md error import example to use `from synix import` (not sdk)
  - Fix variable name `l` → `layer` in list comprehension

* docs: fix SDK documentation gaps
  - Fix stale synix.open() → open_project() in sdk.md and sdk-design.md
  - Fix incorrect BuildResult attributes in sdk-design.md (layers_built, cost_estimate → built, total_time, snapshot_oid)
  - Fix incorrect release_to() return type in sdk-design.md (dict, not object)
  - Document path traversal validation in SdkSource
  - Document build() deep-copy behavior
  - Update CLAUDE.md module comment to open_project
  - Add SDK link to README Learn More table
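The path traversal check described for SdkSource (rejecting `../` and absolute paths) might look roughly like the sketch below. The function name `validate_relative_name` and the local stand-in `SdkError` class are hypothetical; the real SdkSource logic is not shown in these commits.

```python
from pathlib import PurePosixPath, PureWindowsPath


class SdkError(Exception):
    """Local stand-in for the SDK's error type, defined here for illustration only."""


def validate_relative_name(name):
    """Reject names that could escape the source directory; return the name if safe."""
    if not name:
        raise SdkError("path must not be empty")
    # Check both POSIX and Windows notions of "absolute" regardless of host OS.
    if PurePosixPath(name).is_absolute() or PureWindowsPath(name).is_absolute():
        raise SdkError(f"absolute paths are not allowed: {name!r}")
    # Treat backslashes as separators too, then look for parent-directory hops.
    parts = name.replace("\\", "/").split("/")
    if ".." in parts:
        raise SdkError(f"path traversal is not allowed: {name!r}")
    return name
```

Checking path components rather than searching for the substring `..` keeps legitimate names such as `notes..v2.txt` valid while still rejecting `a/../../b`.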
Closes #62

P0 trust/correctness:
- Resolve relative source_dir/build_dir against pipeline file, not cwd (#3)
- Clear synix_dir on --build-dir override to prevent stale routing (#4)
- Propagate source load failures instead of silently succeeding (#5)
- Add Layer.level read-only property to fix info crash (#8)
- Rewrite info/status to read .synix/ snapshot store, not legacy build/ (#9)
- Diff uses RefStore run history instead of legacy versions/ dir (#11)

P1 operator consistency:
- Planner uses estimated-count placeholders for downstream cardinality (#1)
- Standardize invalid ref handling to sys.exit(1) across all inspectors (#10)
- Clean also removes refs/releases/ ref files (#12)

P2 docs/discoverability:
- Mesh commands honor SYNIX_MESH_ROOT env var via resolve_mesh_root() (#2)
- Batch planner tracks DAG cardinality instead of estimate_output_count(1) (#6)
- Fix llms.txt diff syntax to match actual CLI (#7)
- Add refs/plans to refs list prefix scan (#13)
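The first P0 item, resolving relative source_dir/build_dir against the pipeline file rather than the current working directory, can be sketched like this. The helper name `resolve_against_pipeline` is hypothetical and the sketch makes no claim about how the loader is actually structured.

```python
from pathlib import Path


def resolve_against_pipeline(pipeline_file, configured_dir):
    """Resolve a configured directory relative to the pipeline file's folder.

    Absolute paths are returned unchanged; relative ones are anchored at the
    directory containing the pipeline file, so the result does not depend on
    where the command was launched from.
    """
    base = Path(pipeline_file).resolve().parent
    candidate = Path(configured_dir)
    if candidate.is_absolute():
        return candidate
    return (base / candidate).resolve()
```

With this anchoring, running the same pipeline from the project root or from a subdirectory yields the same source and build locations.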
Summary
- New package layout (`core/`, `build/`, `adapters/`, `cli/`, `search/`) with backward-compat shims for old import paths
- `LLMClient` wrapping `openai.OpenAI(base_url=...)`, which works with any OpenAI-compatible API (OpenAI, Anthropic, Ollama, vLLM, DeepSeek)
- `ConcurrentExecutor` with ThreadPool + Semaphore + exponential backoff, plus a `--concurrency`/`-j` CLI flag
- `EmbeddingProvider` with content-hash caching, `HybridRetriever` with keyword/semantic/hybrid modes, RRF fusion (k=60)
- `SynixLogger` with per-run JSONL file logs, `RunLog`/`StepLog` tracking LLM calls/tokens/cache hits, `-v`/`-vv` verbosity
- Search index built to `search_shadow.db`, atomic `os.replace()` on success
- `diff_builds()`, `diff_artifact()` with unified diff output
- `--step` (layer filter), `--trace` (provenance tree), `--customer` (metadata filter)

Test plan
- `uv run pytest tests/ -v`: 477 passed, 0 skipped, 0 failed (~20s)
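For readers unfamiliar with the "ThreadPool + Semaphore + exponential backoff" combination listed in the summary, here is a rough sketch of the idea. It is not the `ConcurrentExecutor` from this PR; `run_concurrently` and all of its parameters are assumed names, and retrying on every exception type is a simplification.

```python
import random
import threading
import time
from concurrent.futures import ThreadPoolExecutor


def run_concurrently(call, requests, concurrency=4, max_workers=8,
                     max_retries=3, base_delay=0.01):
    """Run `call(request)` for each request, returning results in input order."""
    # The semaphore caps in-flight calls; a worker sleeping in backoff
    # has already released its slot, so it does not block other requests.
    semaphore = threading.Semaphore(concurrency)

    def run_one(request):
        for attempt in range(max_retries + 1):
            try:
                with semaphore:
                    return call(request)
            except Exception:
                if attempt == max_retries:
                    raise  # retries exhausted, surface the last error
                # Exponential backoff with a little jitter to avoid retry bursts.
                delay = base_delay * (2 ** attempt) + random.uniform(0, base_delay)
                time.sleep(delay)

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves the order of `requests` in its results.
        return list(pool.map(run_one, requests))
```

A production executor would likely retry only on rate-limit or transient network errors rather than on every exception, and would take its delays from configuration.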