feat: implement new_session:true support for multi-turn task session …#330
feat: implement new_session:true support for multi-turn task session …#330HaotianChen616 wants to merge 1 commit intopinchbench:mainfrom
Conversation
Code Review SummaryStatus: No Issues Found | Recommendation: Merge The new commits revert the Files Reviewed (5 files changed in new commits)
Reviewed by claude-4.6-sonnet-20260217 · 172,090 tokens |
|
👋 Hi @HaotianChen616! I'm @olearycrew's OpenClaw bot doing a triage pass. Heads up — this PR now has merge conflicts with main, likely from the batch of task PRs that were merged yesterday (#315, #319, #320, #322, and others all touched Would you be able to rebase on main when you get a chance? The Kilo Code Review still shows no issues and recommends merge, so once conflicts are resolved this should be good to go. Thanks for implementing the |
Of course — thanks for the heads up! I'll rebase on main to resolve the merge conflicts and follow up once that's done. |
…isolation - Implement new_session: true in lib_agent.py: when a session entry has new_session: true, archive the current transcript, clean up the agent's session state, and generate a new session_id so the agent starts with no conversation history (workspace files are preserved) - Add _archive_transcript() helper for per-session transcript preservation before session resets - Merge all archived session transcripts after all sessions complete so the grading engine can inspect the full conversation history - Document sessions and new_session fields in TASK_TEMPLATE.md with a dedicated Multi-Session Tasks section and updated author checklist - Add task_iterative_code_refine: a 3-session iterative refinement task demonstrating multi-turn conversation and new_session isolation - Add test_multi_session.py with 11 tests covering frontmatter parsing, new_session flag detection, transcript archiving, and integration
765c841 to
ac6734e
Compare
I've rebased on |
|
@olearycrew Rebased on main and resolved the merge conflicts. Should be good to go now — CI and code review both look clean. Let me know if anything else is needed. |
feat: implement full multi-turn user prompt support with session isolation
Summary
PinchBench currently has partial multi-session support — the
sessionsfield in task frontmatter is processed to send sequential prompts, but thenew_session: trueflag (already defined intask_second_brain.md) is completely ignored by the Python codebase. This means tasks that need to test cross-session memory or fresh-context behavior cannot work correctly.This PR implements full multi-turn user prompt support by:
Implementing
new_session: truesession isolation inlib_agent.py— when a session entry hasnew_session: true, the agent's session state is cleaned up and a newsession_idis generated, simulating a user returning after closing the agent. The workspace (and any files created) is preserved across sessions.Adding transcript archiving per session — before starting a new session, the current session's transcript is archived. After all sessions complete, transcripts are merged so the grading engine can evaluate the full conversation history.
Documenting multi-session tasks in
TASK_TEMPLATE.md— adds a complete "Multi-Session Tasks" section with field descriptions, usage guidelines, and YAML examples.Adding two new multi-turn tasks:
task_iterative_code_refine.md— tests iterative code refinement across 3 sessions, with the final session usingnew_session: trueto verify the agent can work from file-based context alonetask_session_chain_analysis.md— tests 4-turn structured code analysis (single-session multi-turn) where the agent reads TypeScript source files, produces JSON chain maps, designs minimal changes, extracts code evidence, and writes a delivery summaryAdding tests —
test_multi_session.pycovers frontmatter parsing,new_sessionflag handling, transcript archiving, and task loading.Changes
scripts/lib_agent.py_archive_transcript()helper to save per-session transcripts before session resetsexecute_openclaw_task()multi-session loop to:current_session_idseparately (changes whennew_session: true)cleanup_agent_sessions()and generate a new session ID whennew_session: truenew_sessionwas usedtasks/TASK_TEMPLATE.mdsessionsandnew_sessionYAML examples to the frontmatter templatetasks/task_iterative_code_refine.md(new)calculator.pyandreview.txttasks/task_session_chain_analysis.md(new)new_session): structured ingest → minimal design → evidence extraction → delivery summarysession.ts,session-store.ts,delivery.ts,agent-command.ts) copied totasks/assets/session_chain/tasks/assets/session_chain/(new)session.ts— OpenClaw session resolution logic (resolveSession,resolveSessionKeyForRequest)session-store.ts— Session store update logic (updateSessionStoreAfterAgentRun)delivery.ts— Agent command delivery logic (deliverAgentCommandResult)agent-command.ts— Agent command execution pipeline (prepareAgentCommandExecution,agentCommandInternal)tasks/manifest.yamltask_iterative_code_refineandtask_session_chain_analysisto the task listtests/test_multi_session.py(new)TestMultiSessionFrontmatterParsing— sessions list, new_session flag, string entriesTestNewSessionHandling— archive-before-cleanup behavior verificationTestArchiveTranscript— transcript archiving with/without transcript pathTestMultiSessionTaskLoading— integration test verifying task_second_brain and task_iterative_code_refine load correctlyMotivation
Real-world AI coding agents are used in multi-turn conversations. Users:
PinchBench should test all of these scenarios. The existing
task_second_brain.mdwas designed to test cross-session memory, but it couldn't work correctly becausenew_session: truewas never implemented.Testing
Backward Compatibility
new_session: No change in behavior (all prompts share one session, same as before)new_session: true: Now correctly isolates sessions. Previously,new_sessionwas silently ignored, so this is a bug fix that makestask_second_brain.mdwork as originally intended.