Skip to content

feat(server): Phase 4 — gRPC proto extensions + soft-delete + reaper#170

Merged
thebtf merged 14 commits into
mainfrom
feat/engram-server-proto-b0
Apr 15, 2026
Merged

feat(server): Phase 4 — gRPC proto extensions + soft-delete + reaper#170
thebtf merged 14 commits into
mainfrom
feat/engram-server-proto-b0

Conversation

@thebtf
Copy link
Copy Markdown
Owner

@thebtf thebtf commented Apr 15, 2026

Summary

Phase 4 of the engram v4.4.0 loom integration train. Adds the server-side infrastructure that the daemon-side serverevents bridge (Phase 5) will consume: two new gRPC RPCs, an HTTP DELETE /api/projects/{id} endpoint, an in-process event bus, a DB migration, and an hourly soft-delete reaper job.

Implements spec FR-6 (SyncProjectState RPC), FR-7 (ProjectEvents streaming RPC), FR-8 (DELETE endpoint + soft-delete + lifecycle event source) + C3 clarification (soft-delete with 30-day retention window) + US2 (P1 project removal end-to-end).

This is the largest PR in the v4.4.0 train (~1,700 LOC production + tests).

What ships

Proto layer (T021 + T022)

  • proto/engram/v1/engram.proto — appended SyncProjectState unary RPC + ProjectEvents server-streaming RPC + 4 new message types (SyncProjectStateRequest/Response, ProjectEventsRequest, ProjectEvent) + ProjectEventType enum. Existing 3 RPCs (CallTool, Initialize, Ping) byte-identical — engramcore contract preserved.
  • proto/engram/v1/engram.pb.go + engram_grpc.pb.go — regenerated via protoc.

Database layer (T023)

  • Migration 082_projects_lifecycle (NOT 025 as plan.md assumed — real highest engram migration is 081; plan.md spec was based on a stale semantic-search result). Adds removed_at TIMESTAMPTZ NULL + last_heartbeat TIMESTAMPTZ DEFAULT NOW() columns to the projects table. Idempotent IF NOT EXISTS guards. Rollback drops both columns.

In-process event bus (T024 + T025)

  • internal/worker/projectevents/ — new package with Bus + Event types. Synchronous fan-out, panic-recovered per subscriber (matches loom.EventBus pattern). Subscribe(handler) func() returns unsubscribe closure. Tests cover concurrent emit, unsubscribe idempotency, panic recovery.

HTTP endpoint (T026 through T029)

  • internal/worker/handlers_projects.gohandleDeleteProject method on Service. Extracts {id} from chi.URLParam, validates non-empty + hash shape, UPDATEs projects SET removed_at = NOW() WHERE id = $1 AND removed_at IS NULL inside a transaction, emits project_removed event on success, returns 200 {id, removed_at} on success / 400 malformed / 404 not found (or already deleted) / 500 DB error.
  • Route r.Delete("/api/projects/{id}", s.handleDeleteProject) mounted in the require-ready chi.Group alongside existing admin routes — inherits ApiKeyAuth Bearer token middleware.
  • Service struct gains eventBus *projectevents.Bus field, initialised in NewService, passed through to handlers via receiver.

gRPC handlers (T030 through T033)

  • internal/grpcserver/sync_project_state.go — implements FR-6. Input validation: ≤10 000 IDs → INVALID_ARGUMENT, non-empty client_idINVALID_ARGUMENT. Queries projects table for the union of input IDs. Returns removed[] for IDs with non-NULL removed_at OR not in table. Returns unknown[] for IDs with last_heartbeat > 24h AND removed_at IS NULL. UPDATEs last_heartbeat = NOW() for every reported ID as a side effect. Returns server_time_unix_ms for clock-skew detection.
  • internal/grpcserver/project_events.go — implements FR-7. Server-streaming handler. Subscribes to projectevents.Bus on entry, forwards each Event to the gRPC stream. Returns OUT_OF_RANGE if since_event_id is non-empty (replay reserved for v0.2.0). Unsubscribes cleanly on ctx cancel or Send error. Terminates with OK on engram-server shutdown.
  • Both handlers tested via bufconn fake server with 9 total test cases.

Soft-delete visibility audit (T034)

  • Audited internal/search/, internal/worker/handlers_* for project-scoped queries via semantic codebase search. Two queries updated to filter removed_at IS NULL:
    • internal/worker/handlers_data.go:1179getProjectDisplayName() lookup
    • internal/db/gorm/project_store.go:90ResolveProjectID() legacy alias lookup
  • Intentionally not filtered (correct behavior documented): handleDeleteProject UPDATE (needs to find the live row to mark deleted), SyncProjectState SELECT (needs to see soft-deleted rows to return them as removed[]), reaper DELETE (hard-deletes the soft-deleted rows), migration scripts.

Reaper job (T035 + T036 + T037)

  • internal/worker/reaper/reaper.go — hourly soft-delete cleanup job. Reaper struct with Start(ctx) and Stop(). time.NewTicker(1 * time.Hour) loop. Each tick executes DELETE FROM projects WHERE removed_at < NOW() - retention_interval, where retention is driven by ENGRAM_PROJECT_RETENTION_DAYS env var (default 30). Respects ctx.Done for graceful shutdown. Reaper started in Service.Start after DB is ready, stopped in shutdown path.

CASCADE-DELETE FK AUDIT (key finding): engram has no ON DELETE CASCADE foreign keys referencing projects.id. All project-scoped tables (observations, sdk_sessions, injection_log, issues, project_settings, patterns, embeddings, collections, memory_blocks) carry denormalized project TEXT columns managed by their own maintenance workflows. The reaper therefore deletes only from the projects table — child data is NOT cascade-deleted and remains visible until the respective tables' own cleanup runs. This is a pre-existing architectural choice in engram; it's documented here so future maintainers don't assume reaper does a deep cascade.

10 atomic commits

Commit Task Description
aac4a47 T021 proto RPCs + messages added
6a80085 T022 regenerated pb.go + grpc.pb.go
1583734 T023 migration 082_projects_lifecycle
12cb0db T024 + T025 projectevents.Bus + tests
7f50812 T026 handlers_projects.go handler (75 LOC)
45bc756 T027 handlers_projects_test.go (302 LOC, 7 tests)
4c911fa T028 + T029 + T037 Service.eventBus field + DELETE route + reaper start
416baaa T030 grpcserver/sync_project_state.go (144 LOC)
407c057 T031 sync_project_state_test.go (202 LOC, 5 tests)
14c9370 T032 grpcserver/project_events.go (102 LOC)
64ea59b T033 project_events_test.go (202 LOC, 4 bufconn tests)
a70db66 T034 soft-delete visibility audit (2 files)
c4160b4 T035 reaper/reaper.go (148 LOC)
8f4ab4b T036 reaper/reaper_test.go (173 LOC, 4 tests)

Verification

  • go build ./... — clean
  • go vet ./... — clean
  • go test ./... -count=1 — all 48 packages PASS, zero regressions
  • Migration 082 idempotent + rollback verified locally
  • DB-requiring tests (T027, T031, T036) skip cleanly when DATABASE_DSN is unset; validation-only tests always run

Compatibility

  • engramcore byte-identity preserved — NFR-5 contract test still passes. Existing 3 gRPC RPCs unchanged.
  • Plugin wire format unchanged — daemon-server protocol additions do not affect the client plugin.
  • Zero new CGO dependencies — NFR-6 holds. Only stdlib + existing engram deps.
  • Constitution §15 — server + plugin will unify to v4.4.0 in Phase 6 ceremony after all PRs merge.

What does NOT change

Related

thebtf added 14 commits April 15, 2026 18:48
Command used:
  protoc --go_out=. --go_opt=paths=source_relative \
         --go-grpc_out=. --go-grpc_opt=paths=source_relative \
         proto/engram/v1/engram.proto

Tools:
  protoc-gen-go v1.36.11
  protoc-gen-go-grpc v1.6.1
  protoc (libprotoc) 34.0

Existing CallTool/Initialize/Ping symbols unchanged.
New symbols: SyncProjectStateRequest, SyncProjectStateResponse,
ProjectEventsRequest, ProjectEvent, ProjectEventType.
…t_heartbeat

Strictly additive — two new nullable columns on projects table:
  - removed_at TIMESTAMPTZ NULL (soft-delete timestamp, NULL = live)
  - last_heartbeat TIMESTAMPTZ DEFAULT NOW() (daemon heartbeat tracking)

Indexes:
  - idx_projects_removed_at (partial, WHERE removed_at IS NOT NULL)
  - idx_projects_last_heartbeat

Rollback drops both columns and indexes idempotently.
Also adds RemovedAt/*time.Time + LastHeartbeat/*time.Time to Project GORM model.
…nt bus

Bus struct with Subscribe(func(Event)) func() + Emit(Event).
Synchronous fan-out in registration order; panic per subscriber
recovered individually so one bad handler cannot block others.
Atomic uint64 subscription IDs; idempotent unsubscribe closure.
Safe for concurrent use.
Extracts {id} from chi URL params, validates via ValidateProjectName,
UPDATE projects SET removed_at = NOW() WHERE id = ? AND removed_at IS NULL,
returns 400/404/500/200 per status codes in FR-8.
Emits projectevents.Event on success for gRPC ProjectEvents fanout.
MalformedId (400), MissingId (400), HappyPath (200+removed_at set),
AlreadyDeleted (404), NotFound (404), EmitsEvent (bus receives event),
SoftDeletedInvisibleToGetProjects (T034 AC).
DB-backed tests skip without DATABASE_DSN.
T028: Add eventBus *projectevents.Bus field, initialized in NewService.
      handleDeleteProject receives bus via Service receiver.
T029: Register r.Delete("/api/projects/{id}", s.handleDeleteProject)
      in requireReady group (same chi.Group as other data routes).
T037: Start reaper goroutine in initializeAsync after DB ready;
      reaper.New(store.DB).Start(s.ctx) stopped by context cancel.
      Add projectReaper *reaper.Reaper field to Service.
Imports: projectevents, reaper packages added.
Server gains db *gorm.DB + bus *projectevents.Bus fields.
SetDB/SetBus injectors called from worker.initializeAsync.
SyncProjectState: validates client_id + len(ids) <= 10000,
queries removed_at for reported IDs, classifies removed vs unknown,
UPDATEs last_heartbeat for live IDs, returns server_time_unix_ms.
withDB(db) helper for test injection.
TooManyIds returns InvalidArgument (no DB needed).
EmptyClientId returns InvalidArgument (no DB needed).
HappyPath, RemovedDetected, HeartbeatUpdated require DATABASE_DSN.
All t.Parallel; DB tests skip without DATABASE_DSN.
Subscribes to projectevents.Bus on entry; forwards events over gRPC stream.
Returns OUT_OF_RANGE for non-empty since_event_id (v0.1.0 does not support replay).
Exits cleanly on context cancel (client disconnect or server shutdown).
Unsubscribes on exit. event_id is monotonically-increasing process-scoped seq.
Buffer of 64 events per stream; overflow is dropped (heartbeat provides fallback).
SinceEventIdRejected returns OUT_OF_RANGE.
HappyPath: one event reaches client via bufconn.
MultipleEvents: 3 events in order.
ContextCancel: stream terminates cleanly.
All t.Parallel; no real DB required.
…LL filters

Audited internal/search/, internal/worker/handlers_*, internal/db/gorm/project_store.go
for project-scoped queries that read from the projects table.

Changes:
- internal/worker/handlers_data.go:1179 getProjectDisplayName()
  SELECT display_name FROM projects now filters removed_at IS NULL.
- internal/db/gorm/project_store.go ResolveProjectID()
  Legacy alias lookup now excludes soft-deleted projects.

No-change paths (intentionally read all rows including soft-deleted):
- handleDeleteProject UPDATE — the soft-delete write itself
- SyncProjectState SELECT id, removed_at — needs to see removed rows to classify them
- Reaper DELETE — needs to see removed rows to hard-delete them
- migration scripts — admin data management

GetAllProjects() queries sdk_sessions.project (text column), not projects table.
Zero remaining unfiltered lookups in non-admin non-reaper non-audit context.
Reaper struct with Start(ctx)/Stop()/PurgeOnce(ctx).
time.NewTicker(1h) loop; DELETE FROM projects WHERE removed_at < NOW()-retention.
ENGRAM_PROJECT_RETENTION_DAYS env var (default 30); idempotent; ctx-aware.

FK audit (no cascade needed):
- observations, sdk_sessions, injection_log: project is TEXT, no FK
- project_settings: standalone table, no FK to projects.id
- issues: source_project/target_project are TEXT, no FK
- patterns, embeddings, collections, memory_blocks: no FK to projects
VERDICT: no ON DELETE CASCADE FK exists; reaper deletes from projects only.
…+ DB

PurgesExpired: project removed 60d ago is hard-deleted.
PreservesUnexpired: project removed 1d ago survives default 30d retention.
RespectsRetentionEnvVar: ENGRAM_PROJECT_RETENTION_DAYS=1 purges 2d-old row.
StopsOnContextCancel: goroutine exits cleanly within 5s.
All t.Parallel; DB tests skip without DATABASE_DSN.
@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented Apr 15, 2026

Warning

Rate limit exceeded

@thebtf has exceeded the limit for the number of commits that can be reviewed per hour. Please wait 7 minutes and 6 seconds before requesting another review.

Your organization is not enrolled in usage-based pricing. Contact your admin to enable usage-based pricing to continue reviews beyond the rate limit, or try again in 7 minutes and 6 seconds.

⌛ How to resolve this issue?

After the wait time has elapsed, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout.

Please see our FAQ for further information.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 123328ec-d2c3-463b-a875-d24488e0897e

📥 Commits

Reviewing files that changed from the base of the PR and between ff57fec and 8f4ab4b.

⛔ Files ignored due to path filters (2)
  • proto/engram/v1/engram.pb.go is excluded by !**/*.pb.go
  • proto/engram/v1/engram_grpc.pb.go is excluded by !**/*.pb.go
📒 Files selected for processing (17)
  • internal/db/gorm/migrations.go
  • internal/db/gorm/models.go
  • internal/db/gorm/project_store.go
  • internal/grpcserver/project_events.go
  • internal/grpcserver/project_events_test.go
  • internal/grpcserver/server.go
  • internal/grpcserver/sync_project_state.go
  • internal/grpcserver/sync_project_state_test.go
  • internal/worker/handlers_data.go
  • internal/worker/handlers_projects.go
  • internal/worker/handlers_projects_test.go
  • internal/worker/projectevents/bus.go
  • internal/worker/projectevents/bus_test.go
  • internal/worker/reaper/reaper.go
  • internal/worker/reaper/reaper_test.go
  • internal/worker/service.go
  • proto/engram/v1/engram.proto
✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch feat/engram-server-proto-b0

Comment @coderabbitai help to get the list of available commands and usage tips.

@thebtf
Copy link
Copy Markdown
Owner Author

thebtf commented Apr 15, 2026

@coderabbitai review focus:security, performance, best-practices

Copy link
Copy Markdown

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request implements project lifecycle management, including soft-deletion, heartbeat tracking, and real-time event notifications. Key changes include database migrations for removed_at and last_heartbeat columns, the addition of gRPC endpoints for project state synchronization and event streaming, and an HTTP handler for soft-deleting projects. It also introduces an in-process event bus and a periodic reaper job to hard-delete projects after a configurable retention period. Feedback was provided regarding a date typo in the reaper's audit documentation.

Comment thread internal/worker/reaper/reaper.go
@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented Apr 15, 2026

Tip

For best results, initiate chat on the files or code changes.

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
echo "=== handlers_projects.go ==="
fd -e go -g 'handlers_projects.go' --exec cat {}

echo "=== sync_project_state.go ==="
fd -e go -g 'sync_project_state.go' --exec cat {}

echo "=== project_events.go (grpcserver) ==="
fd -e go -g 'project_events.go' --exec cat {}

echo "=== reaper.go ==="
fd -e go -g 'reaper.go' --exec cat {}

echo "=== projectevents bus ==="
fd -e go -p 'projectevents' --exec cat {}

echo "=== project_store.go ==="
fd -e go -g 'project_store.go' --exec cat {}

echo "=== handlers_data.go lines around 1179 ==="
fd -e go -g 'handlers_data.go' --exec sed -n '1160,1200p' {}

echo "=== Service struct + NewService ==="
fd -e go -g 'service.go' -p 'worker' --exec cat {}

echo "=== migration 082 ==="
fd -e sql -g '*082*' --exec cat {}
fd -e go -g '*082*' --exec cat {}

Length of output: 25068


🏁 Script executed:

#!/bin/bash
echo "=== Service struct (find service.go in worker) ==="
find . -type f -name 'service.go' | xargs grep -l 'eventBus\|Service struct' 2>/dev/null | head -5 | xargs cat

echo "=== Migration 082 ==="
find . -type f | grep -i '082' | head -10 | xargs cat 2>/dev/null

echo "=== ValidateProjectName ==="
rg -n 'func ValidateProjectName' -A 20

echo "=== Route registration (chi Delete route) ==="
rg -n 'handleDeleteProject\|r\.Delete' -C 3

echo "=== grpcserver Server struct ==="
rg -n 'type Server struct' -A 10 --type go

echo "=== grpcserver token auth middleware ==="
rg -n 'token\|ApiKeyAuth\|Bearer\|Authorization' --type go -l

echo "=== ProjectEvents auth check ==="
rg -n 'token\|auth\|intercept' internal/grpcserver/ -A 3

echo "=== nowUnixMs unused? ==="
rg -n 'nowUnixMs' --type go

echo "=== retentionDays called per tick ==="
rg -n 'retentionDays' --type go -A 3

echo "=== ANY(?) postgres driver compatibility ==="
rg -n 'ANY\(\?\)' --type go -B 2 -A 2

echo "=== Stop called without Start? done channel block ==="
rg -n 'func.*Stop\(\)' internal/worker/reaper/ -A 15

echo "=== sync_project_state unknown includes caller's own IDs? ==="
rg -n 'unknown\|staleIDs' internal/grpcserver/sync_project_state.go

echo "=== grpc server unary interceptors ==="
rg -n 'UnaryInterceptor\|StreamInterceptor\|grpc.NewServer' --type go -A 5

Length of output: 157253


@thebtf Here is the focused security, performance, and best-practices review of this PR.


🔴 Security

1. ANY(?) without pq.Array() — runtime query failure

internal/grpcserver/sync_project_state.go, Lines 64 & 90

The two new raw queries pass a bare []string to PostgreSQL's ANY(?) operator:

// Line 64
Raw("SELECT id, removed_at FROM projects WHERE id = ANY(?)", localIDs).

// Line 90
Exec("UPDATE projects SET last_heartbeat = ? WHERE id = ANY(?) AND removed_at IS NULL", now, localIDs)

The existing codebase wraps Go slices in pq.Array() for this exact pattern (see internal/db/gorm/injection_log_store.go line 52: pq.Array(observationIDs)). Without the wrapper, lib/pq cannot encode a []string as a PostgreSQL array literal, causing a runtime driver error. This will silently succeed in unit tests (which skip when DATABASE_DSN is unset) but fail against a real database.

Fix:

import "github.com/lib/pq"

// Line 64
Raw("SELECT id, removed_at FROM projects WHERE id = ANY(?)", pq.Array(localIDs)).

// Line 90
Exec("UPDATE projects SET last_heartbeat = ? WHERE id = ANY(?) AND removed_at IS NULL",
    now, pq.Array(localIDs))

2. SyncProjectState — global information disclosure in unknown[]

internal/grpcserver/sync_project_state.go, Lines 96–105

The stale/orphan query is unbounded and not scoped to the caller's reported IDs or client_id:

s.db.WithContext(ctx).
    Raw(
        "SELECT id FROM projects WHERE removed_at IS NULL AND (last_heartbeat IS NULL OR last_heartbeat < ?)",
        staleThreshold,
    ).
    Pluck("id", &staleIDs)

Any authenticated daemon receives the full list of all stale projects across the entire server, including those belonging to unrelated daemons. This is an information-disclosure issue: daemon A can enumerate project IDs it has no relationship to.

Recommended fix: scope the query to IDs the caller has previously reported, or introduce a client_id-to-project ownership model. At minimum, add a LIMIT to cap the response size:

"SELECT id FROM projects WHERE removed_at IS NULL AND (last_heartbeat IS NULL OR last_heartbeat < ?) LIMIT 1000"

3. Reaper.Stop() deadlocks if called before Start()

internal/worker/reaper/reaper.go, Line 91–99

func (r *Reaper) Stop() {
    select {
    case <-r.stop:   // already closed
    default:
        close(r.stop)
    }
    <-r.done  // ← blocks forever if Start() was never called
}

r.done is only closed by the goroutine launched in Start(). If Stop() is ever called on an unstarted Reaper (e.g., in test teardown or after a startup error), it deadlocks permanently. The service Shutdown() path relies on context cancellation rather than Stop(), which avoids the issue in production — but the Stop() API is publicly exposed and unsafe.

Fix:

func (r *Reaper) Stop() {
    select {
    case <-r.stop:
    default:
        close(r.stop)
    }
    select {
    case <-r.done:
    default:
        // Start() was never called; nothing to wait for.
    }
}

🟡 Performance

4. retentionDays() performs an os.Getenv syscall on every hourly purge

internal/worker/reaper/reaper.go, Line 108

func (r *Reaper) purge(ctx context.Context) {
    retention := time.Duration(retentionDays()) * 24 * time.Hour  // env read every tick

retentionDays() re-reads ENGRAM_PROJECT_RETENTION_DAYS from the environment on every invocation. For an hourly job this is negligible in absolute terms, but it's inconsistent with the expectation that Start() logs the retention once. More importantly, it means the retention window can silently change between ticks without any log, making it harder to audit.

Fix: read and store the value once in New() or Start():

type Reaper struct {
    db            *gorm.DB
    stop          chan struct{}
    done          chan struct{}
    retentionDays int  // resolved once at construction
}

func New(db *gorm.DB) *Reaper {
    return &Reaper{
        db:            db,
        stop:          make(chan struct{}),
        done:          make(chan struct{}),
        retentionDays: retentionDays(),
    }
}

5. Unbounded unknown[] stale query — no LIMIT

Already noted in finding #2. For a table with thousands of projects, the SELECT id FROM projects WHERE removed_at IS NULL AND (last_heartbeat IS NULL OR last_heartbeat < ?) scan returns an unbounded result set included in every SyncProjectState response. Add a LIMIT aligned with maxSyncProjectIDs:

SELECT id FROM projects
WHERE removed_at IS NULL
  AND (last_heartbeat IS NULL OR last_heartbeat < ?)
LIMIT 10000

🔵 Best Practices

6. nowUnixMs() is dead code

internal/grpcserver/project_events.go, Lines 99–101

// nowUnixMs returns the current time as Unix milliseconds.
func nowUnixMs() int64 {
    return time.Now().UnixMilli()
}

This function is defined but never called anywhere in the codebase. It will produce a go vet / linter warning and adds noise. Remove it or replace the inline time.Now().UnixMilli() calls in busEventToProto with it for consistency — but don't leave it unused.


7. Heartbeat update errors silently discarded with no telemetry

internal/grpcserver/sync_project_state.go, Lines 87–93

// Ignore errors — heartbeat is best-effort.
if len(localIDs) > 0 {
    s.db.WithContext(ctx).
        Exec("UPDATE projects SET last_heartbeat = ? WHERE id = ANY(?) AND removed_at IS NULL", now, localIDs)
}

While best-effort is a legitimate design choice, completely discarding the error—even at debug level—makes it impossible to distinguish "DB is down" from "IDs don't exist." A single log.Debug().Err(result.Error) preserves the intent while improving diagnosability.


8. PurgeOnce() always returns nil — swallows purge errors from callers

internal/worker/reaper/reaper.go, Lines 143–149

func (r *Reaper) PurgeOnce(ctx context.Context) error {
    if r.db == nil {
        return fmt.Errorf("reaper: db is nil")
    }
    r.purge(ctx)  // ← errors logged internally but not returned
    return nil
}

purge() logs errors via log.Error() but PurgeOnce always returns nil. Integration tests using PurgeOnce cannot distinguish a successful purge from a failed one, defeating the purpose of the error return type. Fix:

func (r *Reaper) PurgeOnce(ctx context.Context) error {
    if r.db == nil {
        return fmt.Errorf("reaper: db is nil")
    }
    return r.purge(ctx)  // propagate
}

// Change purge() signature to return error
func (r *Reaper) purge(ctx context.Context) error { ... }

9. Redundant inner len(localIDs) > 0 guard

internal/grpcserver/sync_project_state.go, Lines 80–93

The heartbeat UPDATE block at line 80 is already inside the outer if len(localIDs) > 0 block (line 47). The inner guard is dead:

// Outer guard at line 47 already ensures len(localIDs) > 0
if len(localIDs) > 0 {   // ← redundant inner check
    s.db.WithContext(ctx).Exec(...)
}

Remove the inner check to reduce cognitive noise.


Summary Table

# Area Severity File Finding
1 Security 🔴 Critical sync_project_state.go ANY(?) without pq.Array() — runtime DB failure
2 Security 🔴 High sync_project_state.go unknown[] is a global scan — information disclosure + unbounded response
3 Security 🟠 Medium reaper/reaper.go Stop() deadlocks if called before Start()
4 Performance 🟡 Low reaper/reaper.go os.Getenv on every hourly tick
5 Performance 🟡 Low sync_project_state.go No LIMIT on stale project scan
6 Best Practice 🔵 Low project_events.go nowUnixMs() dead code — remove
7 Best Practice 🔵 Low sync_project_state.go Heartbeat errors silently discarded
8 Best Practice 🔵 Low reaper/reaper.go PurgeOnce() always returns nil
9 Best Practice 🔵 Info sync_project_state.go Redundant inner len(localIDs) > 0 guard

Finding #1 (pq.Array()) is the most urgent — it will cause SyncProjectState to fail against any real PostgreSQL deployment.

@thebtf thebtf merged commit d324f01 into main Apr 15, 2026
8 checks passed
@thebtf thebtf deleted the feat/engram-server-proto-b0 branch April 15, 2026 20:06
thebtf added a commit that referenced this pull request Apr 15, 2026
Phase 6 release ceremony for the v4.4.0 train. Bumps the unified
engram-server + plugin version after the 5-PR train completed:

  #167 — Phase B-1 plumbing tenant (loom integration)
  #168 — Phase 2 obs.MeterFor helper + loom OTel wiring
  #169 — Phase 3 4 loom_* tools + CLI worker with allowlist
  #170 — Phase 4 server-side gRPC proto extensions + soft-delete reaper
  #171 — Phase 5 daemon serverevents bridge for real OnProjectRemoved e2e

Per Constitution §15 the daemon version and plugin version must move
together, so this commit bumps:

  - cmd/engram/main.go daemonVersion "v4.3.0" → "v4.4.0"
    Reported to gRPC Initialize + logged in structured output.
  - plugin/engram/.claude-plugin/plugin.json version "4.3.0" → "4.4.0"
    Drives Claude Code plugin cache invalidation on /reload-plugins.
    Without this bump the marketplace update would not be detected.

No behaviour change, no test change. Release notes + git tag + gh
release + marketplace sync land in subsequent steps.
thebtf added a commit that referenced this pull request Apr 18, 2026
Integration test TestCredentialDecryptRoundTripAfterMigration in
internal/grpcserver/credential_migration_test.go verifies the core US3
invariant: credentials migrated from `observations` into the dedicated
`credentials` table via migration 090 preserve AES-256-GCM ciphertext
byte-for-byte, and the vault can still decrypt every migrated secret
back to its original plaintext with 100% fidelity.

Why this gate exists:
  Migration 090 copies encrypted_secret from observations into
  credentials via a SQL SELECT into a BYTEA column. If either end
  silently mutated the bytes (encoding, trimming, pad handling),
  AES-GCM decrypt would fail authentication-tag check and the
  production vault would become unusable. That would lose all 13
  prod credentials irrecoverably.

What the test does:
  1. Opens a real Postgres + pgvector DB, applies migrations through 090.
  2. Creates a real crypto.Vault with a deterministic test key
     (not the prod key — the test is scoped to a unique test project
     slug to avoid colliding with other data).
  3. Encrypts 5 diverse plaintexts via the real vault:
     - short ASCII
     - empty plaintext (GCM edge case: 0-byte plaintext produces
       nonce+tag = 28-byte ciphertext)
     - unicode mixed (Cyrillic + CJK + emoji, multi-byte UTF-8)
     - long plaintext (~1KB of repeated bytes)
     - binary-like (embedded NUL bytes + high-bit bytes, guards
       against any future code that reads the secret as a text string
       instead of raw bytes)
  4. Seeds each plaintext as an observations row with the ciphertext
     + fingerprint + type='credential' + the required NOT NULL columns.
  5. Seeds 3 ADVERSARIAL EXCLUDED observations (is_suppressed=true,
     is_archived=1, is_superseded=1 — one each). Migration 090 filters
     these out via its WHERE clause; if the filters ever get loosened,
     one of the assertions in sub-test adversarial_excluded_rows_not_
     migrated will fire.
  6. Seeds 1 ORPHAN observation encrypted with a different (wrong)
     fingerprint to exercise CountWithDifferentFingerprint.
  7. Re-runs the exact migration 090 credentials INSERT SQL scoped to
     the test project slug (the SQL is copy-pasted verbatim from
     internal/db/gorm/migrations.go — drift will fail the test).
  8. Reads each credential back via CredentialStore.Get and asserts:
     - encrypted_secret bytes == original ciphertext (byte-for-byte)
     - encryption_key_fingerprint == vault fingerprint (exact)
     - vault.Decrypt(migrated_ciphertext) == original plaintext
  9. Three adversarial sub-tests:
     - orphan_credential_visible_via_fingerprint_check: verifies the
       badFP orphan is visible in List and counted globally by
       CountWithDifferentFingerprint(goodFP).
     - adversarial_excluded_rows_not_migrated: verifies each of the
       three excluded observations (suppressed / archived / superseded)
       is NOT present in the credentials table. This is the load-
       bearing test for the WHERE clause filters.
     - tampered_ciphertext_fails_decrypt: flips one byte in the
       migrated ciphertext and verifies AES-GCM authentication
       catches the mutation. This guards against "decrypt accepts
       anything" — which would make the gate a stub.

Anti-stub proof:
  Swap vault.Decrypt body -> return "", nil ⇒ short_ascii / unicode /
  long / binary sub-tests all fail (decrypted != plaintext) AND
  tampered_ciphertext_fails_decrypt fails (require.Error vs nil).
  Test cannot pass with a stubbed decrypt path.

Resource hygiene:
  NewStore's *Store is captured and closed via defer gormStore.Close().
  A separate direct gorm.Open handle is used for raw INSERT seeds and
  the project-scoped migration re-run, explicitly closed via its
  sqlDB.Close() defer.

Scope guard:
  ONE new file: internal/grpcserver/credential_migration_test.go.
  No production code changed. Uses only exported APIs:
  crypto.NewVault, crypto.Vault.Encrypt/Decrypt/Fingerprint,
  gorm.NewStore, gorm.Config, gorm.Store.Close, gorm.NewCredentialStore,
  gorm.CredentialStore.Get/List/CountWithDifferentFingerprint.

Verified:
  DATABASE_DSN=postgres://.../engram_f1_test — TestCredentialDecrypt-
  RoundTripAfterMigration PASS (1 top-level + 8 sub-tests, ~1.8s wall).
  Sub-tests: short_ascii, empty_plaintext, unicode_mixed,
  long_plaintext_1kb, binary_like_with_nul_and_high_bytes,
  orphan_credential_visible_via_fingerprint_check,
  adversarial_excluded_rows_not_migrated, tampered_ciphertext_fails_decrypt.

Does NOT verify (intentional — out of scope for CI F-1 gate):
  - The production invariants credential_count=13, fingerprint=
    aa78e55cf896508c, mismatch_count=0. Those belong to the
    staging dry-run against the real 2026-04-17 pg_dump snapshot,
    run out of band by the operator before Commit G lands. F-1 CI
    gate proves the PRIMITIVE works (round-trip byte-preserving);
    staging dry-run proves the MIGRATION ran correctly on prod data.

Pre-existing test failures (NOT regressions from this commit):
  - TestSyncProjectState_* (internal/grpcserver) — introduced in
    d324f01 (PR #170, Phase 4) with a minimal schema that conflicts
    with the full projects-table migration. Fails on any DB with
    migrations 045+ applied.
  - TestHandleMemoryTriggers_BashCommand* (internal/worker) — from
    fd8ce26 (PR #135, learning-memory-v4) — returns 500 not 200
    on fresh pgvector DB. Unrelated to US3.

Ref: .agent/specs/engram-v5-baseline/changes/CR-001-initial-scope/
tasks.md T022 AC 1-5. Addresses two review findings:
(1) NewStore return value leaked pool — fixed via capture+Close.
(2) Exclusion filter coverage gap — fixed via adversarial sub-test.
thebtf added a commit that referenced this pull request Apr 18, 2026
Two pre-existing regressions in internal/grpcserver/sync_project_state.go
and its test, shipped since PR #170 (d324f01, 2026-04-18 gRPC Phase 4):

1. Raw(...ANY(?)..., localIDs) and Exec(...ANY(?)..., now, localIDs)
   passed []string directly. The gorm postgres driver does NOT encode
   []string as a PostgreSQL array for positional params, so the query
   failed with ERROR: malformed array literal (SQLSTATE 22P02). Fix:
   wrap localIDs with pq.Array() at both call sites. Mirrors the working
   pattern in internal/sessions/store.go:189.

2. TestSyncProjectState_HeartbeatUpdated flaked on Windows: Go's
   time.Now().UTC() has 100-nanosecond resolution from the monotonic
   clock; PostgreSQL timestamptz stores microsecond precision. When the
   captured 'before' fraction was in the same microsecond bucket as the
   server's UPDATE timestamp, the round-tripped newHB was strictly
   less-than 'before' and the assertion fired. Fix: Truncate(time.Microsecond)
   on 'before' so the comparison operates at the precision the DB stores.

Verified:
  go test -count=3 -run ^TestSyncProjectState ./internal/grpcserver/
  — 5 tests x 3 iterations = 15 passes, none flaked.
thebtf added a commit that referenced this pull request Apr 18, 2026
…ules (PR-A, US3) (#181)

* feat(us3): migration 087_credentials + CredentialStore skeleton (Commit A, PR-A)

Create dedicated credentials table as additive schema for US3 observations split.

Migration ID note: spec tasks.md says 077_credentials, but US1+US2 consumed 083-086
(083 drop orphan joins, 084 Gemini-fix Rollback, 085 drop content_chunks,
086 drop used_vector). US3 migrations therefore shift to 087 onwards (next-free).
This commit creates 087; subsequent commits in PR-A will create 088 (memories),
089 (behavioral_rules), 090 (data migration 3-way split).

Changes:
- internal/db/gorm/migrations.go: add migration 087_credentials
  CREATE TABLE credentials per spec.md §Data Model (id BIGSERIAL PK, project,
  key, encrypted_secret BYTEA, encryption_key_fingerprint, scope, version,
  edited_by, created_at, updated_at, deleted_at, UNIQUE(project, key)).
  Partial indexes on project and encryption_key_fingerprint WHERE deleted_at
  IS NULL. Rollback drops indexes + table.
- pkg/models/credential.go (new): domain Credential struct, no GORM tags
  (mirrors pkg/models/observation.go pattern — infra concerns stay in
  internal/db/gorm).
- internal/db/gorm/models.go: add GORM Credential row type with tags matching
  all columns + TableName() returning "credentials".
- internal/db/gorm/credential_store.go (new): CredentialStore with Create,
  Get, List, Delete (soft-delete via UPDATE deleted_at), CountCredentials,
  CountWithDifferentFingerprint, DeleteOrphanedByFingerprint. Signatures
  mirror ObservationStore vault helpers so callers in handlers_vault.go
  can swap receiver types without touching call sites. Conversion via
  credentialRowToModel() keeps pkg/models isolated from gorm internals.
- internal/db/gorm/credential_store_test.go (new): integration smoke tests
  skip unless DATABASE_DSN is set (matches migrations_integration_test.go
  pattern). Exercises Create→Get→List→Count→Delete round-trip against a
  real pgvector/pgvector:pg17 instance. Anti-stub: replacing any method
  body with `return nil` causes the round-trip test to fail.

Protected invariants (NONE violated by this commit — additive-only):
- observations table UNTOUCHED (still has 2775 live rows / 13 credentials)
- content table UNTOUCHED
- vault ciphertext bytes unchanged
- fingerprint aa78e55cf896508c / credential_count 13 / mismatch_count 0
  (verified pre-flight 2026-04-17, nothing in this commit touches prod data)

Verification:
- go build ./... — green
- go vet ./internal/db/gorm/... — green
- go test ./internal/db/gorm/... -run Credential — 3/3 PASS against
  pgvector/pgvector:pg17 with full migration chain 001→087 applied fresh
- TestMigrationsIntegration failure is pre-existing on main 83bbc62
  (stale vector-dim assertion, legacy US2 leftover) — not introduced here

Task: T017 Commit A
Spec: .agent/specs/engram-v5-baseline/spec.md §Data Model §credentials
Plan: .agent/specs/engram-v5-baseline/plan.md §Phase 5
Phase 5 gate: F-1 decrypt round-trip (T022) — not yet

* feat(us3): migrations 088_memories + 089_behavioral_rules (Commit B, PR-A)

CREATE TABLE only — no Go store code, no data migration, no handlers.
Commit C will add memory_store + behavioral_rules_store + gRPC dispatch.
Commit D will migrate data from observations.

Migration ID shift: spec.md says 078/079; Commit A took 087; next-free IDs
are 088 (memories) + 089 (behavioral_rules).

Schemas match spec.md §Data Model verbatim (column-for-column verified):

MEMORIES (internal/db/gorm/migrations.go migration 088):
- id BIGSERIAL PK, project TEXT NOT NULL, content TEXT NOT NULL,
  tags JSONB DEFAULT '[]', source_agent TEXT, version INT DEFAULT 1,
  edited_by TEXT, created_at/updated_at TIMESTAMPTZ DEFAULT NOW(),
  deleted_at TIMESTAMPTZ, search_vector tsvector GENERATED ALWAYS AS
  STORED (dual-dict english + simple per migration 076 pattern).
- Indexes: idx_memories_project_created partial WHERE deleted_at IS NULL,
  idx_memories_fts GIN, idx_memories_tags GIN.
- Rollback drops indexes (reverse order) then table.

BEHAVIORAL_RULES (migration 089):
- id BIGSERIAL PK, project TEXT (NULLable — NULL = global rule),
  content TEXT NOT NULL, priority INT DEFAULT 0, version INT DEFAULT 1,
  edited_by TEXT, created_at/updated_at TIMESTAMPTZ DEFAULT NOW(),
  deleted_at TIMESTAMPTZ.
- Indexes: idx_behavioral_rules_project_priority partial WHERE
  deleted_at IS NULL, idx_behavioral_rules_global partial WHERE
  project IS NULL AND deleted_at IS NULL.
- Rollback drops indexes (reverse order) then table.

NOT carried over from observations: importance_score, relevance_score,
effectiveness_*, cited, inject_count — per spec S1 (scoring dropped).
No FTS on behavioral_rules — rules are injected unconditionally at
session-start, not searched.

Protected invariants (NONE violated — additive only):
- observations table UNTOUCHED (still 2775 live rows / 13 credentials)
- credentials table from Commit A UNTOUCHED
- content table UNTOUCHED
- vault ciphertext bytes unchanged
- fingerprint aa78e55cf896508c / credential_count 13 / mismatch_count 0

Verification:
- go build ./... — green
- Migration chain 001→089 "all migrations passed" against
  pgvector/pgvector:pg17 fresh DB
- TestCredentialStore_* 3/3 PASS + rotation sub-test + 5 validation
  sub-tests (regression for Commit A — unchanged)
- psql SELECT tablename FROM pg_tables WHERE tablename IN
  ('credentials','memories','behavioral_rules') — 3 rows
- 11 indexes created (pkey + idx_* across 3 tables)
- TestMigrationsIntegration vector-dim failure is pre-existing on
  main 83bbc62 (US2 leftover, not introduced here)

Task: T018 Commit B
Spec: .agent/specs/engram-v5-baseline/spec.md §Data Model §memories + §behavioral_rules

* feat(us3): memory_store + behavioral_rules_store + store_rule/list_rules MCP tools (Commit C, PR-A)

Implement CRUD stores for the new static-entity tables (migrations 088_memories,
089_behavioral_rules from Commit B) and wire two new MCP tools (store_rule,
list_rules) that read/write through them.

Scope boundary respected (per tasks.md T019):
- handleStoreMemory / handleRecall STILL route to observations (Commit E switches them)
- handlers_vault.go + credential_store.go UNTOUCHED (Commit E scope)
- No data migrations touched (Commit D = T020)
- No observations dropped (Commit G = T023)

FILES CREATED (6 new):

- pkg/models/memory.go: domain Memory struct (no GORM tags, Tags []string real slice,
  search_vector intentionally absent from domain — it's a GENERATED column).
- pkg/models/behavioral_rule.go: domain BehavioralRule struct (Project *string for
  NULLable global-rule semantics; *time.Time pointers for all timestamps).
- internal/db/gorm/memory_store.go: MemoryStore with Create/Get/List/Update/Delete.
  Create returns a new *models.Memory (does not mutate input — per Commit A
  code-review HIGH-1 immutability rule). Delete is HARD (Unscoped) per Commit A
  code-review HIGH-2 rotation analysis — deleted_at column remains in schema
  for future use but is not populated. Update uses gorm.Expr("version + 1") +
  re-fetch. Empty-arg guards on Get/List.
- internal/db/gorm/memory_store_test.go: 3 integration tests skip without
  DATABASE_DSN. Covers full round-trip (Create→Get→Update→List→Delete),
  validation errors (nil/empty/empty), and per-project filtering (2 projects,
  3 memories, asserts isolation).
- internal/db/gorm/behavioral_rules_store.go: BehavioralRulesStore with same CRUD
  shape + global-rule semantics: List(nil, limit) returns only rows where
  project IS NULL; List(&proj, limit) returns rows WHERE project = proj OR
  project IS NULL (global rules always apply).
- internal/db/gorm/behavioral_rules_store_test.go: 3 integration tests.
  TestBehavioralRulesStore_List_GlobalRulesAlwaysIncluded validates the
  union semantics: 1 global + 1 project-scoped row; List(&"p1") returns both,
  List(nil) returns only the global.
- internal/mcp/tools_rules.go: handleStoreRule + handleListRules. Both
  validate inputs, wrap errors with %w, return JSON responses with RFC3339
  timestamps. handleStoreRule requires content, defaults priority to 0,
  project optional (nil = global). handleListRules bounds limit 1-500
  (default 50), project optional.

FILES MODIFIED (2):

- internal/db/gorm/models.go: +38 lines appending GORM row structs for Memory
  (Tags via models.JSONStringArray matching existing JSONB convention; deleted_at
  *time.Time; TableName returns "memories") and BehavioralRule (Project *string
  NULLable; TableName returns "behavioral_rules").
- internal/mcp/server.go: +47 lines.
  * Server struct: 2 new fields memoryStore + behavioralRulesStore after issueStore.
  * 2 new setters SetMemoryStore + SetBehavioralRulesStore (mirrors SetIssueStore
    pattern — called by worker bootstrap after async init).
  * handleToolsList: 2 new Tool entries (store_rule, list_rules) at tierUseful.
  * callTool switch: 2 new cases routing to tools_rules.go handlers (after
    merge_observations, before get_observation).
- internal/worker/service.go: +9 lines.
  * Construct memoryStore and behavioralRulesStore via gorm.New*Store(store)
    alongside reasoningStore.
  * Wire both into mcpServer via the new setters, alongside SetReasoningStore
    + SetIssueStore (consistent bootstrap order).

Protected invariants (NONE violated — additive only, no observation writes):
- observations table UNTOUCHED (still 2775 live rows / 13 credentials)
- credentials table UNTOUCHED
- memories + behavioral_rules tables gain functional CRUD (no production data yet)
- content table UNTOUCHED
- vault ciphertext bytes unchanged
- fingerprint aa78e55cf896508c / credential_count 13 / mismatch_count 0

Verification:
- go build ./... — green
- go vet ./internal/db/gorm/... ./internal/mcp/... — clean
- go test ./internal/db/gorm/... -run "Memory|BehavioralRules|Credential" -v
  against pgvector/pgvector:pg17 fresh DB:
  PASS TestBehavioralRulesStore_CreateGetUpdateListDelete (65s — includes migration chain)
  PASS TestBehavioralRulesStore_Create_ValidationErrors (2 sub-tests)
  PASS TestBehavioralRulesStore_List_GlobalRulesAlwaysIncluded
  PASS TestCredentialStore_CreateGetCountDelete (regression preserved)
  PASS TestCredentialStore_DeleteOrphanedByFingerprint (regression preserved)
  PASS TestCredentialStore_Create_ValidationErrors (5 sub-tests — regression)
  PASS TestMemoryStore_CreateGetUpdateListDelete
  PASS TestMemoryStore_Create_ValidationErrors (3 sub-tests)
  PASS TestMemoryStore_List_FiltersByProject
  TOTAL: 9 top-level + 10 sub-tests, all green, ~78s incl. migration chain.

Task: T019 Commit C
Spec: .agent/specs/engram-v5-baseline/spec.md §Data Model + §FR-4

* feat(us3): migration 090_observations_to_static_entities (Commit D, PR-A)

3-way data split: observations → credentials + behavioral_rules + memories.
Ciphertext preserved byte-for-byte; DO-block enforces 2 invariants.

Migration ID shift: plan.md originally drafted this as 080_...; actual next-free
is 090 (US1+US2 consumed 083-086; Commit A=087, Commit B=088+089).

Plan.md column mapping amended 2026-04-18 (see plan.md §Amendment 2026-04-18) —
the original draft assumed columns that do not exist on observations:
- content (creds)   → title                  (per ObservationStore.GetCredential)
- content (mem/rul) → COALESCE(NULLIF(TRIM(narrative),''), title, '')
- always_inject col → concepts @> ["always-inject"]::jsonb
- creation_path fil → dropped (migrate ALL live non-credential non-always-inject)
- tags              → concepts (JSONStringArray stored jsonb)
- source_agent      → agent_source
- updated_at        → reuse created_at (observations has no updated_at)
- timestamp cast    → TO_TIMESTAMP(created_at_epoch) — unambiguous from int64
- priority (rules)  → derived from importance_score (≥0.8→10, ≥0.5→5, else 0)
- live filter       → is_suppressed=false AND is_archived=0 AND is_superseded=0

Three INSERTs in order:

1. credentials — type='credential' rows with encrypted_secret IS NOT NULL AND
   encryption_key_fingerprint IS NOT NULL AND title IS NOT NULL AND live.
   Ciphertext + fingerprint bytes preserved verbatim (no re-encryption).
2. behavioral_rules — observations where concepts @> '["always-inject"]'::jsonb
   AND type != 'credential' AND live AND has content. priority from importance.
3. memories — remaining live non-credential non-always-inject observations with
   non-empty content. tags = concepts verbatim. project NULL → '' fallback.

Sanity check DO block (BOTH invariants active tripwires):
- (a) dst_count < src_count/2 → RAISE EXCEPTION
- (b) credentials.count != observations WHERE type='credential' live count
     → RAISE EXCEPTION — every vault credential MUST migrate byte-for-byte

Rollback: TRUNCATE the 3 static tables in reverse order (memories → behavioral_rules
→ credentials). Forward migration only READS observations, so rollback == undo
the data copy. Schema remains (migrations 087-089 Rollback handles schema).

Protected invariants (NONE violated by this commit — observations UNTOUCHED):
- observations table unchanged (2775 live / 13 credentials in production)
- credentials/memories/behavioral_rules tables GAIN data; no lossy transform
- vault ciphertext bytes preserved verbatim
- production fingerprint aa78e55cf896508c / count 13 / mismatch 0

Verification:
- go build ./... — green
- go vet ./internal/db/gorm/... — clean
- Migration chain 001→090 "all migrations passed" on empty pgvector/pgvector:pg17
- Credential/Memory/BehavioralRules tests 9 top-level + 10 sub-tests — all green
- Synthetic-data test: seeded 6 observations (5 live + 1 archived), ran migration
  logic via psql; got credentials=2 memories=2 behavioral_rules=1 (archived row
  correctly excluded); DO block NOTICE fired with correct counts
- Invariant (b) tripwire confirmed: DELETE 1 credential row → RAISE EXCEPTION
  "credential invariant FAIL cred=1 cred_live=2"
- Invariant (a) tripwire confirmed: TRUNCATE all static tables → RAISE EXCEPTION
  "sanity FAIL dst=0 src=5" (fires first since order is a → b)
- TestMigrationsIntegration vector-dim failure remains pre-existing (US2 leftover,
  not introduced here)

Next: Commit E (T021) switches handlers (handleStoreMemory/handleRecall) to use
the new stores; current commit leaves handlers routed to observations still.
After Commit E comes Commit F-1 (HARD GATE decrypt round-trip test) — BLOCKER
for PR-B drop migrations (commits G+H) per spec §FR-9 + §NFR-7.

Task: T020 Commit D
Spec: .agent/specs/engram-v5-baseline/spec.md §FR-4 + plan.md §Phase 5 (Amendment 2026-04-18)

* feat(us3): switch vault handlers to credentialStore + add /api/memories routes (Commit E, PR-A)

Route vault HTTP handlers through the new credentialStore, expose a new set of
/api/memories HTTP endpoints backed by memoryStore, and wire all three static-
entity stores (credentialStore + memoryStore + behavioralRulesStore) as fields
on the worker Service struct.

Dual-read preserved: this commit DOES NOT touch handleStoreMemory / handleRecall
inside internal/mcp - those still route to observationStore. The actual MCP
handler switch happens in Commit G (post Commit F-1 decrypt round-trip gate).

Files changed:
- internal/worker/service.go: 3 new fields on Service struct
  (credentialStore, memoryStore, behavioralRulesStore); initializeAsync
  constructs credentialStore alongside existing memoryStore + behavioralRulesStore
  (now promoted from locals to fields); initMu block wires all 3 pointers;
  setupRoutes registers POST/GET/DELETE /api/memories routes inside the
  authenticated router group.
- internal/worker/handlers_vault.go: handleVaultStatus switched to
  credentialStore.CountCredentials + .CountWithDifferentFingerprint;
  handleDeleteOrphanedCredentials switched to
  credentialStore.DeleteOrphanedByFingerprint. Signatures adapted to new
  method names on CredentialStore.
- internal/worker/handlers_memories.go (NEW): thin HTTP wrappers over
  memoryStore. handleStoreMemoryExplicit validates project+content non-empty,
  returns 201 on success, 503 if store not initialised. handleListMemories
  requires project query param, defaults limit 50, returns [] not null.
  handleDeleteMemoryByID parses chi URL param, maps gorm.ErrRecordNotFound
  to 404.
- internal/worker/handlers_memories_test.go (NEW): 4 integration tests
  (skip without DATABASE_DSN) - RoundTrip (POST+GET), ValidationErrors
  (2 sub-tests for empty project + empty content), DeleteByID RoundTrip
  (POST+DELETE+GET returns empty), DeleteByID NotFound (404 on missing id).
- .gitignore: add .tmp_gocache/ and .tmp_gomodcache/ underscore variants
  (GoCache and GoModCache directories use either hyphen or underscore
  depending on tool version).

Scope boundary respected:
- handleStoreMemory / handleRecall in internal/mcp UNTOUCHED (dual-read preserved)
- observationStore.CountCredentials / DeleteOrphanedCredentials methods still
  exist - only CALLERS migrated. Commit G removes the methods when observations
  drops.
- No migration changes (087-090 frozen)
- No observations data writes from this commit

Protected invariants (NONE violated - additive + caller redirection only):
- observations table UNTOUCHED (dual-read still works via MCP store/recall)
- credentials / memories / behavioral_rules schemas UNTOUCHED
- Vault ciphertext bytes unchanged
- Production fingerprint aa78e55cf896508c / count 13 / mismatch 0 preserved -
  vault handlers now read the same 13 credentials from the credentials table
  populated by migration 090 instead of observations

Verification:
- go build ./... - green
- go vet ./internal/worker/... ./internal/mcp/... ./internal/db/gorm/... - clean
- go test ./internal/worker/ -run "HandleStoreMemoryExplicit|HandleDeleteMemoryByID" - all 4 PASS + 2 sub-tests
- go test ./internal/db/gorm/ -run "Credential|Memory|BehavioralRules" - 9 top-level + 10 sub-tests PASS (regression)

Task: T021 Commit E
Spec: .agent/specs/engram-v5-baseline/spec.md FR-4 + plan.md Phase 5

* chore: ignore .cache/ directory (Go test cache artifacts)

* feat(us3): Commit F-1 — DECRYPT ROUND-TRIP HARD GATE (T022)

Integration test TestCredentialDecryptRoundTripAfterMigration in
internal/grpcserver/credential_migration_test.go verifies the core US3
invariant: credentials migrated from `observations` into the dedicated
`credentials` table via migration 090 preserve AES-256-GCM ciphertext
byte-for-byte, and the vault can still decrypt every migrated secret
back to its original plaintext with 100% fidelity.

Why this gate exists:
  Migration 090 copies encrypted_secret from observations into
  credentials via a SQL SELECT into a BYTEA column. If either end
  silently mutated the bytes (encoding, trimming, pad handling),
  AES-GCM decrypt would fail authentication-tag check and the
  production vault would become unusable. That would lose all 13
  prod credentials irrecoverably.

What the test does:
  1. Opens a real Postgres + pgvector DB, applies migrations through 090.
  2. Creates a real crypto.Vault with a deterministic test key
     (not the prod key — the test is scoped to a unique test project
     slug to avoid colliding with other data).
  3. Encrypts 5 diverse plaintexts via the real vault:
     - short ASCII
     - empty plaintext (GCM edge case: 0-byte plaintext produces
       nonce+tag = 28-byte ciphertext)
     - unicode mixed (Cyrillic + CJK + emoji, multi-byte UTF-8)
     - long plaintext (~1KB of repeated bytes)
     - binary-like (embedded NUL bytes + high-bit bytes, guards
       against any future code that reads the secret as a text string
       instead of raw bytes)
  4. Seeds each plaintext as an observations row with the ciphertext
     + fingerprint + type='credential' + the required NOT NULL columns.
  5. Seeds 3 ADVERSARIAL EXCLUDED observations (is_suppressed=true,
     is_archived=1, is_superseded=1 — one each). Migration 090 filters
     these out via its WHERE clause; if the filters ever get loosened,
     one of the assertions in sub-test adversarial_excluded_rows_not_
     migrated will fire.
  6. Seeds 1 ORPHAN observation encrypted with a different (wrong)
     fingerprint to exercise CountWithDifferentFingerprint.
  7. Re-runs the exact migration 090 credentials INSERT SQL scoped to
     the test project slug (the SQL is copy-pasted verbatim from
     internal/db/gorm/migrations.go — drift will fail the test).
  8. Reads each credential back via CredentialStore.Get and asserts:
     - encrypted_secret bytes == original ciphertext (byte-for-byte)
     - encryption_key_fingerprint == vault fingerprint (exact)
     - vault.Decrypt(migrated_ciphertext) == original plaintext
  9. Three adversarial sub-tests:
     - orphan_credential_visible_via_fingerprint_check: verifies the
       badFP orphan is visible in List and counted globally by
       CountWithDifferentFingerprint(goodFP).
     - adversarial_excluded_rows_not_migrated: verifies each of the
       three excluded observations (suppressed / archived / superseded)
       is NOT present in the credentials table. This is the load-
       bearing test for the WHERE clause filters.
     - tampered_ciphertext_fails_decrypt: flips one byte in the
       migrated ciphertext and verifies AES-GCM authentication
       catches the mutation. This guards against "decrypt accepts
       anything" — which would make the gate a stub.

Anti-stub proof:
  Swap vault.Decrypt body -> return "", nil ⇒ short_ascii / unicode /
  long / binary sub-tests all fail (decrypted != plaintext) AND
  tampered_ciphertext_fails_decrypt fails (require.Error vs nil).
  Test cannot pass with a stubbed decrypt path.

Resource hygiene:
  NewStore's *Store is captured and closed via defer gormStore.Close().
  A separate direct gorm.Open handle is used for raw INSERT seeds and
  the project-scoped migration re-run, explicitly closed via its
  sqlDB.Close() defer.

Scope guard:
  ONE new file: internal/grpcserver/credential_migration_test.go.
  No production code changed. Uses only exported APIs:
  crypto.NewVault, crypto.Vault.Encrypt/Decrypt/Fingerprint,
  gorm.NewStore, gorm.Config, gorm.Store.Close, gorm.NewCredentialStore,
  gorm.CredentialStore.Get/List/CountWithDifferentFingerprint.

Verified:
  DATABASE_DSN=postgres://.../engram_f1_test — TestCredentialDecrypt-
  RoundTripAfterMigration PASS (1 top-level + 8 sub-tests, ~1.8s wall).
  Sub-tests: short_ascii, empty_plaintext, unicode_mixed,
  long_plaintext_1kb, binary_like_with_nul_and_high_bytes,
  orphan_credential_visible_via_fingerprint_check,
  adversarial_excluded_rows_not_migrated, tampered_ciphertext_fails_decrypt.

Does NOT verify (intentional — out of scope for CI F-1 gate):
  - The production invariants credential_count=13, fingerprint=
    aa78e55cf896508c, mismatch_count=0. Those belong to the
    staging dry-run against the real 2026-04-17 pg_dump snapshot,
    run out of band by the operator before Commit G lands. F-1 CI
    gate proves the PRIMITIVE works (round-trip byte-preserving);
    staging dry-run proves the MIGRATION ran correctly on prod data.

Pre-existing test failures (NOT regressions from this commit):
  - TestSyncProjectState_* (internal/grpcserver) — introduced in
    d324f01 (PR #170, Phase 4) with a minimal schema that conflicts
    with the full projects-table migration. Fails on any DB with
    migrations 045+ applied.
  - TestHandleMemoryTriggers_BashCommand* (internal/worker) — from
    fd8ce26 (PR #135, learning-memory-v4) — returns 500 not 200
    on fresh pgvector DB. Unrelated to US3.

Ref: .agent/specs/engram-v5-baseline/changes/CR-001-initial-scope/
tasks.md T022 AC 1-5. Addresses two review findings:
(1) NewStore return value leaked pool — fixed via capture+Close.
(2) Exclusion filter coverage gap — fixed via adversarial sub-test.

* fix(grpcserver): SyncProjectState pq.Array + heartbeat precision

Two pre-existing regressions in internal/grpcserver/sync_project_state.go
and its test, shipped since PR #170 (d324f01, 2026-04-18 gRPC Phase 4):

1. Raw(...ANY(?)..., localIDs) and Exec(...ANY(?)..., now, localIDs)
   passed []string directly. The gorm postgres driver does NOT encode
   []string as a PostgreSQL array for positional params, so the query
   failed with ERROR: malformed array literal (SQLSTATE 22P02). Fix:
   wrap localIDs with pq.Array() at both call sites. Mirrors the working
   pattern in internal/sessions/store.go:189.

2. TestSyncProjectState_HeartbeatUpdated flaked on Windows: Go's
   time.Now().UTC() has 100-nanosecond resolution from the monotonic
   clock; PostgreSQL timestamptz stores microsecond precision. When the
   captured 'before' fraction was in the same microsecond bucket as the
   server's UPDATE timestamp, the round-tripped newHB was strictly
   less-than 'before' and the assertion fired. Fix: Truncate(time.Microsecond)
   on 'before' so the comparison operates at the precision the DB stores.

Verified:
  go test -count=3 -run ^TestSyncProjectState ./internal/grpcserver/
  — 5 tests x 3 iterations = 15 passes, none flaked.

* fix(gorm): command prefix matcher + correct test seed format

Two interconnected regressions in GetObservationsByCommandPrefix and its
test harness, shipped in PR #135 (fd8ce26, learning-memory-v4). Never
passed on a fresh pgvector DB.

1. SQL: ESCAPE '\\\\' in Go double-quoted source produces '\\\\' in the SQL
   text, which under standard_conforming_strings=on (PostgreSQL 9.1+ default)
   is a 2-character escape string. PostgreSQL rejects multi-character ESCAPE
   with SQLSTATE 22025 ('invalid escape string'). Any call to this function
   against a real Postgres returned 500 from handleMemoryTriggers.

2. Semantics: the function name says 'ByCommandPrefix' — intent is to find
   observations whose stored commands_run entry is a PREFIX of the incoming
   query (e.g. stored 'git push --force' matches query 'git push --force
   origin main'). The old SQL implemented the reverse direction:
   cmd LIKE 'query%' finds stored values that START WITH the query,
   which no real case exercises.

Fix: use starts_with(?, cmd) — PostgreSQL 11+ native function with clean
semantics (no escape hazard, no wildcard ambiguity, both sides treated as
literals). The escaper + % suffix are removed as no longer needed.

Additionally, internal/worker/trigger_matcher_bash_test.go seeded
commands with the narrative prefix 'Executed: ' (e.g. 'Executed: git
push --force'), which does not reflect how the Bash tool records commands
in production (raw command, no prefix). The test could never pass with
a correct prefix matcher because the seed shape mismatched the matcher
shape. Fix: strip 'Executed: ' from all 5 seed call sites.

Verified (fresh pgvector/pgvector:pg17 with truncated observations):
  go test -count=1 -run '^TestHandleMemoryTriggers_' ./internal/worker/
  — 6 subtests all PASS. TestHandleMemoryTriggers_BashCommandPrefixMatchesTop3Warnings
  now correctly returns the 3 bugfix+pitfall prefix matches (excluding
  the discovery-typed row and the non-prefix force-with-lease row).

* fix(gorm): migration 090 epoch ms + scope column + safe rollback + store hardening

Addresses 6 review findings across 3 files:

migrations.go (3 CRITICAL, 1 MAJOR, 1 MAJOR):
  - TO_TIMESTAMP(created_at_epoch) was interpreting milliseconds as
    seconds, producing year-56247 dates. Divided by 1000.0 in all
    three INSERT sites (credentials, behavioral_rules, memories).
  - credentials INSERT now copies 'scope' column via
    COALESCE(NULLIF(scope,''),'project'); schema has the column,
    legacy observations carry it, previous SELECT dropped it silently.
  - Migration 090 rollback previously TRUNCATEd the 3 tables, which
    would wipe data created AFTER the forward migration. Replaced
    with an explicit error instructing pg_restore from the 2026-04-17
    snapshot per rollback-procedure.md §3b.

observation_store.go (LOW):
  - Documentation note: starts_with() is PostgreSQL 9.1+ (gemini
    reviewer claimed 17+ — verified against release notes).
    engram targets PG 17 in all live deployments; no portability
    concern for current scope.

credential_store.go (MEDIUM, LOW, MEDIUM):
  - Create: capture a single 'now' up-front so CreatedAt==UpdatedAt
    deterministically; copy the secret bytes into a defensive slice
    so subsequent zeroize/reuse by the caller cannot alias the row.
  - Delete: added 'deleted_at IS NULL' WHERE predicate so soft-deleted
    rows remain soft-deleted (was possible to hard-delete a tombstone
    and return success, instead of gorm.ErrRecordNotFound).
  - Preserve hard-delete for the rotation case (UNIQUE(project,key)
    constraint + documented design note).

Verified: go build + go vet clean; TestCredentialDecryptRoundTripAfterMigration
+ TestCredentialStore_* all green on pgvector/pgvector:pg17.

* fix(stores): break pointer aliasing in BehavioralRules.Create

Copy rule.Project into row.Project by value so the returned model
does not share a backing string with the caller's variable. If the
caller mutates its original input after Create returns, the stored
model should not change — matches the API contract implied by
'Create returns the persisted row'.

* fix(handlers): memory limit cap, error disclosure, global-scope credential guards

Addresses 3 MAJOR + 2 MEDIUM review findings in HTTP handlers:

handlers_memories.go:
  - handleListMemories now caps 'limit' at 500 and rejects
    non-positive values with 400. Prevents an adversarial or
    misconfigured client from triggering an unbounded SELECT and
    multi-MB JSON response.
  - 500 responses no longer include err.Error() verbatim — both
    store + request failures now emit stable messages; full error
    stays in the server log (information-leak mitigation).

handlers_vault.go:
  - handleStoreCredential rejects scope='global' with 400 until a
    schema migration admits project-less credentials (credentials.project
    is NOT NULL); previously the write would either fail the NOT NULL
    constraint or create a row that no READ path could retrieve (LIST,
    GET, DELETE all require project).
  - handleGetCredential and handleDeleteCredential now require project
    to be non-empty; returning 400 instead of silently returning [] /
    ErrRecordNotFound for empty-project requests.
  - 500 responses no longer leak err.Error() — stable client message,
    full error in log.

* fix(mcp,test,ci): guard store_rule/list_rules on nil, sync F-1 to migration 090 schema, gitignore cleanup

Addresses 2 MEDIUM + 1 MAJOR review findings + infra hygiene:

internal/mcp/server.go (MEDIUM):
  - store_rule and list_rules MCP tools are now only registered when
    s.behavioralRulesStore != nil. Previously tools/list always
    advertised them even though the handlers returned 503 at dispatch
    time — clients would see the tool, invoke it, and get an error.
    Same conditional pattern as versionedDocumentStore elsewhere in
    this file.

internal/grpcserver/credential_migration_test.go (MAJOR — F-1 sync):
  - Test SQL updated to match migration 090's corrected form: scope
    column added to the INSERT list, TO_TIMESTAMP divided by 1000.0.
    Without this sync the HARD GATE would pass against stale SQL while
    production would use the fixed SQL — defeating the purpose of the
    gate. The test still copy-maintains the SQL verbatim; a drift
    detection mechanism is out of scope for this PR and noted for
    future work.
  - Seed timestamps now use UnixMilli() to match observations schema
    convention (created_at_epoch is milliseconds per models.go).

.gitignore:
  - Deduplicated Go cache patterns (collapsed hyphen/underscore
    variants into one), moved .cache/ into the dedicated Go cache
    section, removed stray duplicates.

Verified: all target tests green on pgvector/pgvector:pg17 after these
commits — TestCredentialDecryptRoundTripAfterMigration (1 top + 8 sub),
TestSyncProjectState_* (5), TestHandleMemoryTriggers_Bash* (3),
TestCredentialStore_* (9 incl sub-tests), behavioral_rules + memory
store suites.

* fix(handlers-vault): remove err.Error() from all 500 responses

Addresses CodeRabbit MINOR finding PRRT_kwDORY-ek8577YWO on re-review of PR #181:
'Don't disclose error details in 500-response' (handlers_vault.go:62).

Initial fix only covered some handlers; this commit extends the same
mitigation to the remaining five 500 sites in handlers_vault.go:

- handleListCredentials line 62: 'list credentials: ' + err.Error() → 'failed to list credentials'
- handleGetCredential line 111: 'vault not available: ' + err.Error() → 'vault not available'
- handleGetCredential line 137: 'decrypt credential: ' + err.Error() → 'failed to decrypt credential'
- handleStoreCredential line 207: 'vault not available: ' + err.Error() → 'vault not available'
- handleStoreCredential line 214: 'encrypt credential: ' + err.Error() → 'failed to encrypt credential'
- handleStoreCredential line 229: 'store credential: ' + err.Error() → 'failed to store credential'

All sites now call log.Error().Err(err).Msg(...) first so server-side
debugging still has full context. The 400 (invalid JSON body) site keeps
err.Error() in the body — 400 is a client-parse error and the detail
helps the client diagnose their own malformed request; this is intentional
and not flagged by the reviewer.

Verified: go build + go vet clean.

---------

Co-authored-by: Kirill Turanskiy <thebtf@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant