Skip to content

feat: add Pi Coding Agent rollout seed source#514

Merged
johnnygreco merged 2 commits intomainfrom
johnny/feature/513-pi-coding-agent-rollout
Apr 9, 2026
Merged

feat: add Pi Coding Agent rollout seed source#514
johnnygreco merged 2 commits intomainfrom
johnny/feature/513-pi-coding-agent-rollout

Conversation

@johnnygreco
Copy link
Copy Markdown
Contributor

📋 Summary

Add support for ingesting Pi Coding Agent session artifacts as an agent rollout seed source. Pi sessions are tree-structured JSONL files stored at ~/.pi/agent/sessions/; the handler resolves the active conversation path by walking from the last entry back to the root via id/parentId links, then normalizes all message and entry types into the shared rollout schema.

🔗 Related Issue

Closes #513

🔄 Changes

✨ Added

  • pi_coding_agent.py — Format handler with tree-structured session parsing, active-path resolution, and normalization of all Pi message roles (user, assistant, toolResult, bashExecution, custom, compactionSummary, branchSummary) and entry-level types (model_change, compaction, branch_summary, custom_message, thinking_level_change)
  • test_pi_coding_agent.py — 17 tests covering realistic session structure, parallel tool calls, branch resolution, bashExecution normalization, entry-level types, error cases, and edge cases
  • PI_CODING_AGENT enum value, default path (~/.pi/agent/sessions), and format defaults in seed_source.py
  • normalize_message_content shared utility in utils.py
  • Pi Coding Agent tab and column in agent-rollout-ingestion.md docs

🔧 Changed

  • hermes_agent.py — Replaced local _normalize_message_content with shared normalize_message_content from utils.py (identical logic was duplicated)
  • registry.py — Registered PiCodingAgentRolloutFormatHandler

🧪 Testing

  • make check-all-fix passes (lint + format)
  • 17 unit tests added — all pass
  • Full agent rollout test suite passes (42 tests)
  • Battle-tested against 8 real Pi session files with line-by-line content integrity audit
  • End-to-end DataDesigner preview verified (seed → expression column → LLM column)
  • E2E tests added/updated — N/A, follows existing handler pattern which has no e2e tests

✅ Checklist

  • Follows commit message conventions
  • Commits are signed off (DCO)
  • License headers added via make update-license-headers
  • Architecture docs updated — N/A, no architectural changes

Add support for ingesting Pi Coding Agent session artifacts as an agent
rollout seed source. Pi sessions are tree-structured JSONL files; the
handler resolves the active conversation path by walking from the last
entry back to the root via id/parentId links.

Key points:
- Tree-structured sessions with automatic active-path resolution
- Entry-level types: model_change, compaction, branch_summary,
  custom_message, thinking_level_change
- Message roles: user, assistant (inline ToolCall/ThinkingContent/
  TextContent blocks), toolResult, bashExecution (synthesized as
  tool-call pairs), custom, compactionSummary, branchSummary
- Extract shared normalize_message_content to utils.py (was duplicated
  in Hermes handler)
@johnnygreco johnnygreco requested a review from a team as a code owner April 8, 2026 19:13
@github-actions
Copy link
Copy Markdown
Contributor

github-actions bot commented Apr 8, 2026

Docs preview: https://b77fb86c.dd-docs-preview.pages.dev

Notebook tutorials are placeholder-only in previews.

@github-actions
Copy link
Copy Markdown
Contributor

github-actions bot commented Apr 8, 2026

PR Review: #514 — feat: add Pi Coding Agent rollout seed source

Summary

This PR adds support for ingesting Pi Coding Agent session artifacts as an agent rollout seed source. Pi sessions are tree-structured JSONL files; the handler resolves the active conversation path by walking from the last entry back to the root via id/parentId links, then normalizes all message and entry types into the shared rollout schema.

Scope: 7 files changed (+1345 / -42), 2 new files (handler + tests), 5 modified files (config, registry, utils, hermes handler, docs).

CI Status: Lint, license headers, all E2E tests, and most unit test suites pass. A few CI jobs are still pending at review time (coverage check, some test matrix entries).

Findings

Positive Observations

  1. Follows established patterns closely. The handler structure (is_handled_file, parse_file, internal normalization functions) mirrors the existing Hermes, Claude Code, and Codex handlers. The class/registration pattern is consistent with the registry.

  2. Good refactoring of shared utility. The _normalize_message_content function duplicated between hermes_agent.py and the new handler was correctly extracted into utils.py as normalize_message_content. The Hermes handler was updated to use the shared version. This is a clean deduplication.

  3. Comprehensive test coverage. 17 tests covering: happy-path with realistic session structure, parallel tool calls, branch resolution, bash execution normalization, excluded bash executions, entry-level types (compaction, branch_summary, custom_message, model_change, thinking_level_change), compactionSummary/branchSummary message roles, bare string content, forked sessions with parentSession, error cases (missing session header, empty files), and file-type filtering.

  4. Robust tree-walking algorithm. _resolve_active_path handles the tree structure correctly — walks from the last entry backward via parentId, includes cycle detection via visited set, and reverses to chronological order.

  5. Documentation updated consistently. The agent-rollout-ingestion docs add a Pi Coding Agent tab with example code and update the normalized field mapping table with correct Pi-specific values.

Potential Issues

  1. _detect_branches gives false positives on linear sessions (low severity). The function checks if any parentId is referenced by more than one child. This works for detecting tree branches, but it also means if there's ever a data anomaly where two entries accidentally share a parentId in a non-branching session, it would incorrectly report has_branches: True. This is acceptable since it's purely metadata and doesn't affect normalization logic.

  2. _resolve_active_path assumes last entry is on the active branch (low severity). The algorithm always starts from entries[-1] (the last entry in file order). This is a reasonable assumption for append-only JSONL sessions — the most recent write is the active tip. If Pi ever reorders entries within a file, this would break. The assumption is documented in the docstring and matches the PR description, so this is fine as-is.

  3. Timestamps collected from all entries, not just active path (info). In parse_pi_session, started_at/ended_at are derived from timestamps of all entries (including abandoned branches), while messages only includes the active path. This means timestamp range could be wider than the actual conversation. This is a deliberate design choice — it represents the session's full time span, which is arguably more useful for analytics. Worth noting but not a bug.

  4. No should_warn_unhandled_file override (negligible). The handler inherits the default should_warn_unhandled_file which warns for all unhandled files. Since Pi sessions live in ~/.pi/agent/sessions with *.jsonl pattern, and the handler accepts all .jsonl files, there shouldn't be stray non-JSONL files to warn about. No action needed.

  5. normalize_message_role is not called for Pi messages (by design). Unlike Hermes which passes raw roles through normalize_message_role, the Pi handler maps roles explicitly in _normalize_pi_message (user → user, assistant → assistant, toolResult → tool, etc.) and returns [] for unknown roles. This is correct because Pi role names (bashExecution, toolResult, compactionSummary, etc.) don't map through the shared normalizer. The handler correctly calls build_message with already-normalized roles.

  6. thinking_level_change entries are silently skipped (by design). These entries contribute to entry_types metadata but don't produce messages. The entry_type != "message" check at line ~324 of the handler catches them after model_change, compaction, branch_summary, and custom_message are handled. This is the correct behavior — thinking level changes are configuration state, not conversation content.

Nits (Non-blocking)

  1. Inconsistent content access for user messages. User messages pass raw_message.get("content") through normalize_message_content, which passes lists through unchanged. This means block-list content like [{"type": "text", "text": "..."}] is preserved as-is, which is correct and consistent with how build_message handles content blocks. No issue here.

  2. Test file is 854 lines. This is on the larger side but each test is well-scoped and tests a distinct behavior. The test helpers (_make_session_header, _make_entry) reduce boilerplate effectively. Acceptable.

Verdict

Approve. This is a clean, well-structured addition that follows established patterns in the codebase. The handler correctly normalizes Pi's tree-structured sessions into the shared rollout schema. The refactoring of normalize_message_content into shared utils is a welcome deduplication. Test coverage is thorough with 17 tests covering realistic scenarios and edge cases. Documentation is updated consistently. No blocking issues found.

The few observations noted above (timestamp scope, branch detection edge cases) are informational and don't require changes.

@greptile-apps
Copy link
Copy Markdown
Contributor

greptile-apps bot commented Apr 8, 2026

Greptile Summary

This PR adds a PiCodingAgentRolloutFormatHandler that ingests Pi Coding Agent JSONL session files as a new rollout seed source format. It implements tree-structured active-path resolution (walking from the last entry back to root via parentId links), normalizes all Pi message roles into the shared schema, and extracts a shared normalize_message_content utility from Hermes to avoid duplication.

The implementation is clean and correctly follows the established handler pattern. All message types, entry-level types, branch detection, and edge cases (empty files, excluded bash executions, forked sessions) are covered by 17 targeted unit tests.

Confidence Score: 5/5

Safe to merge — clean, well-tested implementation with no correctness issues found.

No P0 or P1 findings. The active-path resolution, message normalization, branch detection, and metadata extraction are all logically correct. The implementation follows established handler patterns, deduplication of models/stop-reasons works correctly, and 17 tests cover realistic session structures including edge cases. The shared utility extraction from Hermes is a pure refactor with identical behaviour.

No files require special attention.

Vulnerabilities

No security concerns identified.

Important Files Changed

Filename Overview
packages/data-designer-engine/src/data_designer/engine/resources/agent_rollout/pi_coding_agent.py New handler: active-path resolution via parentId walk, full message-role normalization, bashExecution synthetic tool-call pairs, branch detection — logic is correct and well-structured.
packages/data-designer-engine/tests/engine/resources/agent_rollout/test_pi_coding_agent.py 17 tests covering happy path, parallel tool calls, branch resolution, excluded bash executions, all entry/message types, and error cases — comprehensive coverage.
packages/data-designer-engine/src/data_designer/engine/resources/agent_rollout/utils.py Added shared normalize_message_content utility extracted from Hermes — identical logic, no behavioral change.
packages/data-designer-engine/src/data_designer/engine/resources/agent_rollout/hermes_agent.py Replaced local _normalize_message_content with the shared utility from utils.py — pure refactor, identical behaviour.
packages/data-designer-config/src/data_designer/config/seed_source.py Added PI_CODING_AGENT enum value, default path function, format defaults tuple, and updated field descriptions — straightforward additions consistent with existing pattern.
packages/data-designer-engine/src/data_designer/engine/resources/agent_rollout/registry.py Registered PiCodingAgentRolloutFormatHandler alongside existing handlers — minimal, correct change.
docs/concepts/agent-rollout-ingestion.md Added Pi Coding Agent tab, expanded the normalized-field table with a new column, and updated the Notes section — accurate and consistent with the implementation.

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[JSONL file] --> B[load_jsonl_rows]
    B --> C{First row type == 'session'?}
    C -- No --> ERR[AgentRolloutSeedParseError]
    C -- Yes --> D[Extract session header\nsession_id, cwd, version, timestamp]
    D --> E[entries = rows 1..end]
    E --> F[_resolve_active_path\nwalk entries-last back to root via parentId]
    F --> G[active_entries in chronological order]
    G --> H{entry_type?}
    H -- model_change --> I[Track models_used]
    H -- compaction / branch_summary --> J[Emit system message with summary]
    H -- custom_message display=true --> K[Emit system message]
    H -- message --> L{role?}
    L -- user --> M[build_message role=user]
    L -- assistant --> N[_normalize_pi_assistant_message\ntext + thinking + toolCalls]
    L -- toolResult --> O[build_message role=tool]
    L -- bashExecution --> P[_normalize_pi_bash_execution\nassistant tool-call + tool-result pair]
    L -- custom display=true --> Q[build_message role=system]
    L -- compactionSummary / branchSummary --> R[build_message role=system with summary]
    I & J & K & M & N & O & P & Q & R --> S[messages list]
    E --> T[_detect_branches\nparentId seen twice?]
    S & T & D --> U[NormalizedAgentRolloutRecord]
Loading

Reviews (2): Last reviewed commit: "Merge branch 'main' into johnny/feature/..." | Re-trigger Greptile

@johnnygreco johnnygreco requested a review from eric-tramel April 9, 2026 14:58
Copy link
Copy Markdown
Contributor

@eric-tramel eric-tramel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nothing stood out for me, thanks @johnnygreco !

@johnnygreco johnnygreco merged commit fdd5ebb into main Apr 9, 2026
48 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

feat: add Pi Coding Agent rollout seed source

2 participants