
tech debt: legacy agent-less flow fallback silently masks unowned pipelines from agent system features #2063

@chubes4


Summary

AIStep::resolveAgentIdFromJobSnapshot() returns 0 when both agent_slug and agent_id are empty on a job snapshot. Downstream, AIStep falls back to the flow's user_id and runs the job anyway, with the comment block at AIStep.php:380-405 explicitly documenting this as the intended back-compat behavior for "legacy / agent-less flows":

// Legacy / agent-less flows: fall back to the flow's user_id.
if ( $owner_id <= 0 && $user_id > 0 ) {
    $owner_id = $user_id;
}

Combined with PermissionHelper's blanket bypass on the action_scheduler_run_queue hook (so jobs running under no WP user still pass capability checks), this means a pipeline or flow with agent_id = NULL executes normally — scrapes run, AI calls fire, posts publish, taxonomies update — but with NONE of the agent-system features attached:

  • Bundle export silently skips it (agent export <slug> only walks rows owned by that agent).
  • Per-agent memory / SOUL.md / MEMORY.md context never loads.
  • Per-agent capability ceilings are not enforced; the job runs with the flow user_id's full admin grant via the queue bypass.
  • Audit trail records "anonymous job" rather than "agent X ran this".
  • agent diff and agent install reconciliation can't see the rows.

The result: ownership drift is invisible. The system keeps working, looks healthy in dashboards, but a growing fraction of the operation gradually escapes the agent envelope without anything ever surfacing it.

How we discovered it

While planning a bundle-based workflow for the events platform we ran:

SELECT agent_id, COUNT(*) FROM datamachine_pipelines GROUP BY agent_id;
SELECT agent_id, COUNT(*) FROM datamachine_flows GROUP BY agent_id;

Result on the events subsite (blog 7):

  • 132 of 192 pipelines have agent_id = NULL (69%)
  • 417 of 599 flows have agent_id = NULL (70%)

Pipeline IDs 3–72 are all owned by extra-chill-bot; IDs 73–205 are all NULL. Same boundary on flows (3–293 owned, 294–725 NULL). This is a clean cutover where agent assignment stopped happening at some point and nothing caught it.

Production wasn't degraded — those 70% of pipelines have been running fine under the legacy fallback for months. The operator just didn't know they were running outside the agent system.

wp datamachine pipeline reassign --where-null --to-agent=events-bot --cascade-flows fixes the data, and we ran that today. The technical debt is the silent drift mechanism, not the immediate data state.

Root concerns

  1. The fallback path is too quiet. When agent_id resolves to 0 in AIStep::resolveAgentIdFromJobSnapshot(), nothing is logged, no admin notice fires, no metric increments. The system has no idea this is happening at scale.

  2. The action_scheduler_run_queue blanket bypass in PermissionHelper::can() is necessary for Action Scheduler workers to function (they run under no WP user), but it papers over the agent-less case — every ability call that should normally enforce a per-agent capability ceiling instead passes through, hiding the absence of agent context from the consumer.

  3. agent_id is nullable on datamachine_pipelines and datamachine_flows with no enforcement at creation time. New pipelines created via the REST API, the legacy CLI commands, or direct ability calls can land with NULL by default depending on the code path. There's no "agent attribution required" gate.

  4. agent diff / agent install / agent export have no surface for "rows belonging to this site that no agent owns". An operator who runs agent installed sees only the bundle-backed agents, has no signal that anything is unowned, and can go indefinitely without noticing.

Proposed mitigations (multiple shippable layers)

Layer 1 — observability (small, high-leverage)

Add a structured do_action( 'datamachine_log', 'warning', ... ) inside AIStep::resolveAgentIdFromJobSnapshot() when it returns 0 because both fields were empty. Tag with the flow_id, pipeline_id, and job_id. Same for any other site in the codebase that hits the legacy fallback. After the warning exists, an operator running wp datamachine logs --level=warning (or grepping logs) immediately sees the volume of unowned execution.
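The warning described above could look roughly like the following. This is a minimal sketch, not the actual AIStep code: the function name `dm_sketch_resolve_agent_id`, the snapshot key names, and the argument order passed to the `datamachine_log` action (level, message, context array) are all assumptions made for illustration. The `do_action()` stub exists only so the sketch runs outside WordPress.

```php
<?php
// Stub so the sketch runs outside WordPress. Inside WordPress the real
// do_action() is already defined and this block is skipped.
if ( ! function_exists( 'do_action' ) ) {
    function do_action( string $hook, ...$args ): void {
        $GLOBALS['dm_sketch_log'][] = array( 'hook' => $hook, 'args' => $args );
    }
}

/**
 * Hypothetical stand-in for the empty-fields branch of
 * AIStep::resolveAgentIdFromJobSnapshot(). Snapshot key names are assumed.
 */
function dm_sketch_resolve_agent_id( array $snapshot ): int {
    $agent_slug = trim( (string) ( $snapshot['agent_slug'] ?? '' ) );
    $agent_id   = (int) ( $snapshot['agent_id'] ?? 0 );

    if ( '' === $agent_slug && $agent_id <= 0 ) {
        // Assumed argument order: level, message, context array.
        do_action(
            'datamachine_log',
            'warning',
            'Job snapshot has no agent; falling back to the flow user_id.',
            array(
                'flow_id'     => (int) ( $snapshot['flow_id'] ?? 0 ),
                'pipeline_id' => (int) ( $snapshot['pipeline_id'] ?? 0 ),
                'job_id'      => (int) ( $snapshot['job_id'] ?? 0 ),
            )
        );
        return 0;
    }

    // Slug-to-ID lookup is out of scope here; the sketch returns the numeric ID as given.
    return $agent_id;
}
```

Emitting the warning at the point of resolution, rather than at the later user_id fallback, keeps the log entry tied to the exact job that lacked attribution.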

Add two new CLI commands, wp datamachine pipeline orphans and wp datamachine flow orphans, that count and list rows with agent_id = NULL. They can reuse the --where-null filter that already exists on the reassign surface but is not currently exposed as a query-only view.

Layer 2 — surfacing in operator-facing tools

Add an "Unowned pipelines/flows" section to wp datamachine system health output. Today the command runs other diagnostics; adding ownership coverage as a check would surface the drift on every health run.
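One way such a check could summarize coverage, as a sketch only: the function name, the input row shape (the result of a GROUP BY agent_id query), and the returned keys are hypothetical and not part of the existing health command. The sample data reuses the pipeline counts reported above (132 NULL out of 192); agent ID 1 is a placeholder for the owning agent.

```php
<?php
/**
 * Summarize ownership coverage from GROUP BY agent_id rows.
 * Each row is array( 'agent_id' => int|null, 'count' => int ).
 */
function dm_sketch_ownership_coverage( array $rows ): array {
    $total   = 0;
    $unowned = 0;

    foreach ( $rows as $row ) {
        $count  = (int) $row['count'];
        $total += $count;
        if ( null === $row['agent_id'] ) {
            $unowned += $count;
        }
    }

    return array(
        'total'           => $total,
        'unowned'         => $unowned,
        'percent_unowned' => $total > 0 ? round( $unowned / $total * 100, 1 ) : 0.0,
        'status'          => 0 === $unowned ? 'ok' : 'warning',
    );
}

// Pipeline counts from the events subsite; agent ID 1 is a placeholder.
$pipelines = dm_sketch_ownership_coverage( array(
    array( 'agent_id' => 1, 'count' => 60 ),
    array( 'agent_id' => null, 'count' => 132 ),
) );
```

A health check built this way would report a warning status whenever the unowned count is nonzero, which is the signal the current tooling lacks.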

Add the unowned count to whatever weekly/daily operator digest each site emits (e.g. for events: extrachill-events#79's qualify digest could include a one-line "N unowned pipelines on this site" warning when nonzero).

Layer 3 — prevent new drift at creation time

Enforce non-null agent_id at row insertion in datamachine_pipelines and datamachine_flows. Make every code path that creates a pipeline or flow either (a) accept an agent_id argument and require it, or (b) resolve to a default via a single helper datamachine_resolve_creating_agent() that returns the active agent for the current context (datamachine agent active), or fails loudly if no agent is set.
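A rough sketch of the resolver's decision order, under stated assumptions: the proposal names the helper datamachine_resolve_creating_agent() but does not define its signature, so the sketch below uses a differently named function with two hypothetical parameters (an explicit agent ID from the caller, and the active agent ID for the current context, however that ends up being looked up).

```php
<?php
/**
 * Hypothetical resolver: an explicit agent wins, then the active agent,
 * otherwise creation fails loudly instead of storing NULL.
 */
function dm_sketch_resolve_creating_agent( ?int $explicit_agent_id, ?int $active_agent_id ): int {
    if ( null !== $explicit_agent_id && $explicit_agent_id > 0 ) {
        return $explicit_agent_id;
    }

    if ( null !== $active_agent_id && $active_agent_id > 0 ) {
        return $active_agent_id;
    }

    throw new RuntimeException(
        'Cannot create pipeline or flow: no agent was passed and no active agent is set.'
    );
}
```

Throwing here is the "fails loudly" behavior: a creation path that cannot name an agent surfaces as an error at insert time rather than as a NULL row discovered months later.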

Possible escape hatch for system tasks / migrations: an explicit DATAMACHINE_AGENTLESS_INSERT constant that callers must define to opt out, so the absence of agent attribution becomes a deliberate, greppable choice rather than a silent default.
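The escape hatch could be enforced with a guard along these lines. This is an illustrative sketch: only the constant name DATAMACHINE_AGENTLESS_INSERT comes from the proposal; the guard function, its name, and the row shape are assumptions.

```php
<?php
/**
 * Hypothetical pre-insert guard: rows need an agent_id unless the caller
 * has deliberately defined DATAMACHINE_AGENTLESS_INSERT as truthy.
 */
function dm_sketch_prepare_insert( array $row ): array {
    $agent_id = (int) ( $row['agent_id'] ?? 0 );

    if ( $agent_id > 0 ) {
        return $row;
    }

    if ( defined( 'DATAMACHINE_AGENTLESS_INSERT' ) && DATAMACHINE_AGENTLESS_INSERT ) {
        // Explicit opt-out: store NULL on purpose.
        $row['agent_id'] = null;
        return $row;
    }

    throw new InvalidArgumentException(
        'agent_id is required; define DATAMACHINE_AGENTLESS_INSERT to opt out.'
    );
}
```

Because the opt-out is a named constant, a search for DATAMACHINE_AGENTLESS_INSERT lists every code path that creates unowned rows on purpose.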

Layer 4 — migrate the back-compat path to a real concept

The current fallback assumes "agent-less = legacy = grandfather it in." But until Layer 3 lands, new rows keep arriving unowned and the fallback keeps running them. After Layer 3, the fallback is dead code for any row created post-migration.

Either:

  • Option A: remove the fallback entirely once Layer 3 is enforced and a one-time migration has assigned every existing row.
  • Option B: formalize the agent-less case as an explicit system-task or anonymous pseudo-agent with capped permissions, so consumers always have SOMETHING to resolve against (no special-case path in AIStep, no quiet user_id fallback).

Option B is the cleaner architectural answer: there is always an agent, even if it's the system one. Removes the special case.
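To make the "there is always an agent" idea concrete, here is a small sketch. Everything in it is hypothetical: the 'system' slug, the function names, the capability names, and the ceilings map are placeholders, not existing Data Machine APIs.

```php
<?php
// Placeholder slug for the system pseudo-agent.
const DM_SKETCH_SYSTEM_AGENT_SLUG = 'system';

/** Always returns an agent slug; empty snapshots resolve to the system agent. */
function dm_sketch_resolve_agent_slug( array $snapshot ): string {
    $slug = trim( (string) ( $snapshot['agent_slug'] ?? '' ) );
    return '' !== $slug ? $slug : DM_SKETCH_SYSTEM_AGENT_SLUG;
}

/** Checks a capability against the ceiling configured for the given agent. */
function dm_sketch_agent_can( string $agent_slug, string $capability, array $ceilings ): bool {
    return in_array( $capability, $ceilings[ $agent_slug ] ?? array(), true );
}

// Example ceilings: the system agent is capped well below a real agent.
$ceilings = array(
    'system'     => array( 'read' ),
    'events-bot' => array( 'read', 'publish_posts' ),
);

$slug = dm_sketch_resolve_agent_slug( array( 'flow_id' => 294 ) );
```

With this shape, an unattributed job resolves to the capped system agent, so the capability check runs through the same path as any other agent and the user_id fallback has nothing left to do.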

Why this matters beyond extrachill events

This same pattern probably exists on every Data Machine install that's been running since before agents became mandatory. Each operator is running some unknown percentage of their pipelines outside the agent envelope. The bundle workflow, per-agent memory, granular permissions — none of these features can ever be 100% on for any install until this is addressed.

The fix doesn't have to be invasive. Layers 1 and 2 alone (logging + health check + digest line) would surface the drift everywhere and let operators decide what to do about it. Layers 3 and 4 are bigger projects that can land separately.

Out of scope

  • Migrating existing unowned rows. Operators can run wp datamachine pipeline reassign --where-null --to-agent=<slug> --cascade-flows today; the tooling is there. This issue is about preventing the drift, not cleaning up existing instances.
  • Changing the action_scheduler_run_queue blanket bypass in PermissionHelper. That bypass is correct for Action Scheduler workers per se; the issue is that it masks agent-less execution, which Layers 1 and 3 address more directly.
  • A formal "anonymous agent" entity in the agents table. That's part of Layer 4 if we take option B; can be its own issue once Layer 1 and 2 are in.

Related

  • wp datamachine pipeline reassign --where-null --to-agent=<slug> --cascade-flows — the existing escape hatch the operator runs today to clean up state.
  • wp datamachine agent installed — the command that today shows nothing for an operator with 70% unowned pipelines, contributing to the invisibility of the problem.
  • Discovered while planning bundle-based scaling for extrachill-events.
