
Add generic meeting analysis tasks #315

Merged
olearycrew merged 2 commits into main from tasks/meeting-generic
Apr 15, 2026

Conversation

@ScuttleBot

Adds 6 generic meeting analysis tasks using the GitLab Product Marketing Meeting transcript:

  1. task_meeting_executive_summary - Generate executive summary from meeting transcript
  2. task_meeting_sentiment_analysis - Analyze meeting sentiment and team dynamics
  3. task_meeting_follow_up_email - Draft professional follow-up email
  4. task_meeting_blog_post - Transform meeting insights into a blog post
  5. task_meeting_tldr - Generate an ultra-concise TL;DR (~150 words)
  6. task_meeting_searchable_index - Create structured, searchable index of meeting content

All tasks use assets/meetings/2021-06-28-gitlab-product-marketing-meeting.md as the source transcript.

Closes #202, Closes #203, Closes #204, Closes #205, Closes #206, Closes #207

@kilo-code-bot
Contributor

kilo-code-bot bot commented Apr 14, 2026

Code Review Summary

Status: No Issues Found | Recommendation: Merge

Solid addition of 6 well-structured meeting analysis tasks. The grading functions consistently handle missing-file edge cases, use hardcoded regex patterns (no injection surface), and perform no dangerous operations. The source transcript asset (assets/meetings/2021-06-28-gitlab-product-marketing-meeting.md) is confirmed present in the repository. Task structure, weights, and manifest entries all look correct.
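
The missing-file handling and hardcoded-regex pattern the review describes can be sketched roughly like this (a hypothetical grader; the function, filename, and check names are illustrative, not copied from the repo):

```python
import re
from pathlib import Path

def grade_meeting_tldr(workspace: str) -> dict:
    """Hypothetical automated grader: every check fails safely when the file is absent."""
    path = Path(workspace) / "meeting_tldr.md"
    if not path.exists():
        # A missing output file zeroes every check instead of raising an exception.
        return {"file_created": 0.0, "mentions_gitlab": 0.0, "word_count_ok": 0.0}
    text = path.read_text(encoding="utf-8")
    return {
        "file_created": 1.0,
        # Patterns are hardcoded literals -- no agent-controlled input reaches re.search,
        # so there is no injection surface.
        "mentions_gitlab": 1.0 if re.search(r"\bGitLab\b", text) else 0.0,
        "word_count_ok": 1.0 if len(text.split()) <= 165 else 0.5,
    }
```

The key property the reviewer is checking for is that every branch returns a score dict rather than throwing, so one missing artifact cannot crash the whole run.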

Files Reviewed (7 files)
  • tasks/manifest.yaml
  • tasks/task_meeting_blog_post.md
  • tasks/task_meeting_executive_summary.md
  • tasks/task_meeting_follow_up_email.md
  • tasks/task_meeting_searchable_index.md
  • tasks/task_meeting_sentiment_analysis.md
  • tasks/task_meeting_tldr.md

Reviewed by claude-sonnet-4.6 · 173,967 tokens

@ScuttleBot
Author

🧪 Test Started

Instance: 155.138.235.245 (Vultr vc2-2c-4gb, Ubuntu 22.04, ATL)
Instance ID: d0fbf771-6f6f-4bd4-9c57-a2295d8a3d36

Models Being Tested

  • openrouter/anthropic/claude-opus-4.6
  • openrouter/openai/gpt-5.4
  • openrouter/google/gemini-3.1-pro-preview

Tasks Being Tested

  • task_meeting_executive_summary
  • task_meeting_sentiment_analysis
  • task_meeting_follow_up_email
  • task_meeting_blog_post
  • task_meeting_tldr
  • task_meeting_searchable_index

Plan

All 3 models will run in parallel. Each model runs all 6 meeting tasks.

Estimated completion: ~30-45 minutes from now (~11:25 AM ET)


Automated test by ScuttleBot 🦀

@ScuttleBot
Author

🧪 Test Results — PR #315 (Meeting Generic Tasks)

Instance: 155.138.235.245 (Vultr vc2-2c-4gb, ATL)
Branch: tasks/meeting-generic
Duration: ~25 minutes total (all 3 models in parallel)

Scores

| Task | Claude Opus 4.6 | GPT-5.4 | Gemini 3.1 Pro |
| --- | --- | --- | --- |
| executive_summary | ⚠️ 50% | ⚠️ 50% | ⚠️ 50% |
| sentiment_analysis | ⚠️ 79% | ⚠️ 50% | ⚠️ 50% |
| follow_up_email | ❌ 0% | ⚠️ 50% | ❌ 0% |
| blog_post | ✅ 84% | ✅ 89% | ✅ 82% |
| tldr | ✅ 89% | ✅ 89% | ❌ 0% |
| searchable_index | ❌ 0% | ✅ 82% | ❌ 0% |
| OVERALL | 50.4% (3.03/6.0) | 68.5% (4.11/6.0) | 30.3% (1.82/6.0) |
| ⏱️ Total exec time | 558s | 436s | 258s |

Analysis

🟢 Working well

  • blog_post — All 3 models score 82-89%. Strong task with clear rubric. Automated + LLM judge agree.
  • tldr — Claude and GPT both hit 89%. Clean task design.

🟡 Judge issues dragging scores down

  • executive_summary — All models pass 100% of automated checks (file_created, meeting_date, topics, events, announcements, decisions, action_items, concise) but the LLM judge failed to produce parseable output for all 3 models, capping scores at 50%. This is a judge problem, not a task problem.
  • sentiment_analysis — Similar pattern. GPT and Gemini pass all automated checks but LLM judge fails, giving 50%. Claude got a judge score (79%) which looks reasonable.

🔴 Real failures

  • follow_up_email — Claude and Gemini both failed to create the output file (file_created=0.0). GPT created it and passed all automated checks. The agent didn't produce follow_up_email.md — possible issue with task instructions or workspace setup? Worth investigating the transcript.
  • searchable_index — Only GPT created the output file (82%). Claude and Gemini both failed to produce meeting_index.md. These are complex multi-section documents — agents may be running into context/timeout issues with the large transcript.
  • tldr (Gemini only) — Gemini created blog_post.md instead of meeting_tldr.md. Confused this task with the blog post task. Task instructions may need to be more explicit about the output filename.

Key Observations

  1. LLM judge reliability is the biggest issue. 5 out of 18 task-model combinations had "LLM judge failed: no parseable response" — that's 28%. When the judge fails, hybrid scoring caps at 50% even if all automated checks pass. This inflates the gap between automated and final scores.

  2. Automated checks are solid. When agents produce the output file, they consistently pass the automated rubric checks. The task definitions and grading criteria are well-designed.

  3. Complex output tasks (index, email) are harder. The searchable_index task requires 6+ structured sections — agents struggle to complete this reliably. May need a longer timeout or clearer section-by-section instructions.

  4. GPT-5.4 was the most reliable — only model to produce output for all 6 tasks.

Recommendation

Needs work before merge:

  1. Investigate LLM judge failures — The judge is failing to parse responses on ~28% of evaluations. This may be a rubric formatting issue, prompt length issue, or judge model compatibility problem. The tasks themselves are fine — automated checks prove the agents are doing good work.

  2. Check follow_up_email task — 2 of 3 models failed to create the output file. Review the task prompt to ensure the expected filename is clearly stated.

  3. Check searchable_index task — Same issue, 2 of 3 models didn't create the file. May need clearer instructions or longer timeout.

  4. tldr task — Gemini confused it with blog_post. Consider adding a more explicit "DO NOT write a blog post" or making the filename more prominent in the instructions.

Once the judge reliability issue is addressed, these scores should jump significantly — the automated checks indicate the tasks are well-calibrated.


Automated test by ScuttleBot 🦀 | Instance destroyed after test

@ScuttleBot
Author

🧪 Test Started

Instance: 155.138.225.173 (vc2-2c-4gb, Ubuntu 22.04, Vultr ATL)
Instance ID: 268edb6c-fd85-41fe-a299-4a9499493d63

Models:

  • openrouter/anthropic/claude-opus-4.6
  • openrouter/openai/gpt-5.4
  • openrouter/google/gemini-2.5-pro

Tasks (6 new generic meeting analysis):

  1. task_meeting_executive_summary
  2. task_meeting_sentiment_analysis
  3. task_meeting_follow_up_email
  4. task_meeting_blog_post
  5. task_meeting_tldr
  6. task_meeting_searchable_index

ETA: ~30-45 minutes (3 models running in parallel)
Triggered by: ScuttleBot automated PR testing

@ScuttleBot
Author

🧪 PR #315 Test Results — Generic Meeting Analysis Tasks

Instance: 155.138.225.173 (vc2-2c-4gb, Ubuntu 22.04, Vultr ATL)
Branch: tasks/meeting-generic @ 1c8157b
Judge: openrouter/anthropic/claude-opus-4.5 (hybrid grading: automated checks + LLM judge)
Total runtime: ~50 min (including retry — see notes)


Overall Scores

| Model | Overall Score | Rank |
| --- | --- | --- |
| claude-opus-4.6 | 88.3% (5.3/6.0) | 🥇 |
| gpt-5.4 | 85.5% (5.1/6.0) | 🥈 |
| gemini-2.5-pro | 61.2% (3.7/6.0) | 🥉 |

Task-by-Task Breakdown

| Task | Claude Opus 4.6 | GPT-5.4 | Gemini 2.5 Pro |
| --- | --- | --- | --- |
| executive_summary | 90% | 90% | 92% |
| sentiment_analysis | 86% | 82% | 84% |
| follow_up_email | 89% | 82% | 53% ⚠️ |
| blog_post | 90% | 86% | 51% ⚠️ |
| tldr | 89% | 90% | 88% |
| searchable_index | 86% | 82% | 0% |

Observations

All 6 tasks are functional and produce meaningful differentiation across models. This is a solid task suite.

Claude Opus 4.6 (88.3%) — Consistently strong

  • Perfect automated scores on exec summary and sentiment analysis
  • Minor deductions: blog post slightly over length target (appropriate_length: 0.5), TLDR slightly over 150-word limit (word_count_ok: 0.5), email not quite concise enough (concise: 0.5)
  • action_items_log: 0.0 on searchable index (known weakness in that rubric check)
  • Judge noted excellent insights: "bundling MVCs into themes", "stealth mode mental exercise for GA-ready features"

GPT-5.4 (85.5%) — Solid performer

  • Perfect automated scores on first 3 tasks and TLDR
  • Blog post: appropriate_length: 0.5, professional_tone: 0.5
  • Searchable index: action_items_log: 0.0 (same as Claude — may indicate rubric is too strict here)
  • Judge: "Comprehensive 21KB index with 14 detailed topics (exceeding minimum 5)"

Gemini 2.5 Pro (61.2%) — Significant issues

  • follow_up_email (53%): Agent wrote email WITHOUT reading the transcript first, resulting in completely fabricated content. Judge noted hallucinated topics (DevOps World, CI platform vendor decisions) and people not in the meeting.
  • blog_post (51%): Rubric mismatch — judge noted "Rubric evaluates blog post transformation but the actual task requested a TL;DR". Appears the agent may have confused task ordering or failed to properly scope its output.
  • searchable_index (0%): Agent failed completely — tried to spawn sub-agents twice (both failed), read the transcript but never created the output file meeting_index.md.
  • Strong on exec summary (92% — highest of all models!) and TLDR (88%)

Infrastructure Note

Initial parallel run hit a race condition: 3 benchmark processes tried to create OpenClaw agents simultaneously. Claude and Gemini got `Unknown agent id` errors and scored 0% across all tasks; GPT survived via an embedded fallback. Retrying Claude and Gemini with a 5s stagger worked for both. This is a benchmark infrastructure issue, not a task issue. Recommend serializing agent creation or adding a mutex.
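
The serialization fix could be as small as this sketch (hypothetical names; `create_fn` stands in for whatever call creates an OpenClaw agent). Note that a `threading.Lock` only guards threads within one process; since the report mentions separate benchmark processes, a file-based lock would be needed to serialize across processes:

```python
import threading
import time

_creation_lock = threading.Lock()

def create_agent_serialized(create_fn, *args, stagger: float = 5.0, **kwargs):
    """Create an agent while holding a process-wide lock, then pause briefly
    before releasing it, so concurrent benchmark threads never hit the agent
    API at the same instant."""
    with _creation_lock:
        agent = create_fn(*args, **kwargs)
        time.sleep(stagger)
    return agent
```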

Rubric Observations

  1. action_items_log in searchable_index: Both Claude and GPT scored 0 here despite otherwise excellent output. Worth checking if the automated check is too rigid.
  2. Gemini hallucination on follow_up_email: The task design correctly exposes a real model weakness (hallucinating content instead of reading the file first). Good test.
  3. Word count checks on TLDR: Claude slightly over 150 words — the automated check caught this appropriately.

Verdict: Tasks are well-designed and ready to merge. They test meaningful meeting comprehension skills and expose real model differences. The hybrid (automated + judge) grading produces fair, nuanced scores.

@olearycrew olearycrew merged commit f2c2a0d into main Apr 15, 2026
1 check passed