
Add generic meeting analysis tasks #315

Merged
olearycrew merged 2 commits into main from tasks/meeting-generic
Apr 15, 2026

Conversation

@ScuttleBot

Adds 6 generic meeting analysis tasks using the GitLab Product Marketing Meeting transcript:

  1. task_meeting_executive_summary - Generate executive summary from meeting transcript
  2. task_meeting_sentiment_analysis - Analyze meeting sentiment and team dynamics
  3. task_meeting_follow_up_email - Draft professional follow-up email
  4. task_meeting_blog_post - Transform meeting insights into a blog post
  5. task_meeting_tldr - Generate an ultra-concise TL;DR (~150 words)
  6. task_meeting_searchable_index - Create structured, searchable index of meeting content

All tasks use assets/meetings/2021-06-28-gitlab-product-marketing-meeting.md as the source transcript.

Closes #202, Closes #203, Closes #204, Closes #205, Closes #206, Closes #207

@kilo-code-bot
Contributor

kilo-code-bot bot commented Apr 14, 2026

Code Review Summary

Status: No Issues Found | Recommendation: Merge

Solid addition of 6 well-structured meeting analysis tasks. The grading functions consistently handle missing-file edge cases, use hardcoded regex patterns (no injection surface), and perform no dangerous operations. The source transcript asset (assets/meetings/2021-06-28-gitlab-product-marketing-meeting.md) is confirmed present in the repository. Task structure, weights, and manifest entries all look correct.
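
The missing-file handling and hardcoded-regex pattern the review describes can be sketched roughly like this (a hypothetical grader; the function, filename, and check names are illustrative, not copied from the repo):

```python
import re
from pathlib import Path

def grade_meeting_tldr(workspace: str) -> dict:
    """Hypothetical automated grader: every check fails safely when the file is absent."""
    path = Path(workspace) / "meeting_tldr.md"
    if not path.exists():
        # A missing output file zeroes every check instead of raising an exception.
        return {"file_created": 0.0, "mentions_gitlab": 0.0, "word_count_ok": 0.0}
    text = path.read_text(encoding="utf-8")
    return {
        "file_created": 1.0,
        # Patterns are hardcoded literals -- no agent-controlled input reaches re.search,
        # so there is no injection surface.
        "mentions_gitlab": 1.0 if re.search(r"\bGitLab\b", text) else 0.0,
        "word_count_ok": 1.0 if len(text.split()) <= 165 else 0.5,
    }
```

The key property the reviewer is checking for is that every branch returns a score dict rather than throwing, so one missing artifact cannot crash the whole run.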

Files Reviewed (7 files)
  • tasks/manifest.yaml
  • tasks/task_meeting_blog_post.md
  • tasks/task_meeting_executive_summary.md
  • tasks/task_meeting_follow_up_email.md
  • tasks/task_meeting_searchable_index.md
  • tasks/task_meeting_sentiment_analysis.md
  • tasks/task_meeting_tldr.md

Reviewed by claude-sonnet-4.6 · 173,967 tokens

@ScuttleBot
Author

🧪 Test Started

Instance: 155.138.235.245 (Vultr vc2-2c-4gb, Ubuntu 22.04, ATL)
Instance ID: d0fbf771-6f6f-4bd4-9c57-a2295d8a3d36

Models Being Tested

  • openrouter/anthropic/claude-opus-4.6
  • openrouter/openai/gpt-5.4
  • openrouter/google/gemini-3.1-pro-preview

Tasks Being Tested

  • task_meeting_executive_summary
  • task_meeting_sentiment_analysis
  • task_meeting_follow_up_email
  • task_meeting_blog_post
  • task_meeting_tldr
  • task_meeting_searchable_index

Plan

All 3 models will run in parallel. Each model runs all 6 meeting tasks.

Estimated completion: ~30-45 minutes from now (~11:25 AM ET)


Automated test by ScuttleBot 🦀

@ScuttleBot
Author

🧪 Test Results — PR #315 (Meeting Generic Tasks)

Instance: 155.138.235.245 (Vultr vc2-2c-4gb, ATL)
Branch: tasks/meeting-generic
Duration: ~25 minutes total (all 3 models in parallel)

Scores

| Task | Claude Opus 4.6 | GPT-5.4 | Gemini 3.1 Pro |
| --- | --- | --- | --- |
| executive_summary | ⚠️ 50% | ⚠️ 50% | ⚠️ 50% |
| sentiment_analysis | ⚠️ 79% | ⚠️ 50% | ⚠️ 50% |
| follow_up_email | ❌ 0% | ⚠️ 50% | ❌ 0% |
| blog_post | ✅ 84% | ✅ 89% | ✅ 82% |
| tldr | ✅ 89% | ✅ 89% | ❌ 0% |
| searchable_index | ❌ 0% | ✅ 82% | ❌ 0% |
| OVERALL | 50.4% (3.03/6.0) | 68.5% (4.11/6.0) | 30.3% (1.82/6.0) |
| ⏱️ Total exec time | 558s | 436s | 258s |

Analysis

🟢 Working well

  • blog_post — All 3 models score 82-89%. Strong task with clear rubric. Automated + LLM judge agree.
  • tldr — Claude and GPT both hit 89%. Clean task design.

🟡 Judge issues dragging scores down

  • executive_summary — All models pass 100% of automated checks (file_created, meeting_date, topics, events, announcements, decisions, action_items, concise) but the LLM judge failed to produce parseable output for all 3 models, capping scores at 50%. This is a judge problem, not a task problem.
  • sentiment_analysis — Similar pattern. GPT and Gemini pass all automated checks but LLM judge fails, giving 50%. Claude got a judge score (79%) which looks reasonable.

🔴 Real failures

  • follow_up_email — Claude and Gemini both failed to create the output file (file_created=0.0). GPT created it and passed all automated checks. The agent didn't produce follow_up_email.md — possible issue with task instructions or workspace setup? Worth investigating the transcript.
  • searchable_index — Only GPT created the output file (82%). Claude and Gemini both failed to produce meeting_index.md. These are complex multi-section documents — agents may be running into context/timeout issues with the large transcript.
  • tldr (Gemini only) — Gemini created blog_post.md instead of meeting_tldr.md. Confused this task with the blog post task. Task instructions may need to be more explicit about the output filename.

Key Observations

  1. LLM judge reliability is the biggest issue. 5 out of 18 task-model combinations had "LLM judge failed: no parseable response" — that's 28%. When the judge fails, hybrid scoring caps at 50% even if all automated checks pass. This inflates the gap between automated and final scores.

  2. Automated checks are solid. When agents produce the output file, they consistently pass the automated rubric checks. The task definitions and grading criteria are well-designed.

  3. Complex output tasks (index, email) are harder. The searchable_index task requires 6+ structured sections — agents struggle to complete this reliably. May need a longer timeout or clearer section-by-section instructions.

  4. GPT-5.4 was the most reliable — only model to produce output for all 6 tasks.

Recommendation

Needs work before merge:

  1. Investigate LLM judge failures — The judge is failing to parse responses on ~28% of evaluations. This may be a rubric formatting issue, prompt length issue, or judge model compatibility problem. The tasks themselves are fine — automated checks prove the agents are doing good work.

  2. Check follow_up_email task — 2 of 3 models failed to create the output file. Review the task prompt to ensure the expected filename is clearly stated.

  3. Check searchable_index task — Same issue, 2 of 3 models didn't create the file. May need clearer instructions or longer timeout.

  4. tldr task — Gemini confused it with blog_post. Consider adding a more explicit "DO NOT write a blog post" or making the filename more prominent in the instructions.

Once the judge reliability issue is addressed, these scores should jump significantly — the automated checks indicate the tasks are well-calibrated.


Automated test by ScuttleBot 🦀 | Instance destroyed after test

@ScuttleBot
Author

🧪 Test Started

Instance: 155.138.225.173 (vc2-2c-4gb, Ubuntu 22.04, Vultr ATL)
Instance ID: 268edb6c-fd85-41fe-a299-4a9499493d63

Models:

  • openrouter/anthropic/claude-opus-4.6
  • openrouter/openai/gpt-5.4
  • openrouter/google/gemini-2.5-pro

Tasks (6 new generic meeting analysis):

  1. task_meeting_executive_summary
  2. task_meeting_sentiment_analysis
  3. task_meeting_follow_up_email
  4. task_meeting_blog_post
  5. task_meeting_tldr
  6. task_meeting_searchable_index

ETA: ~30-45 minutes (3 models running in parallel)
Triggered by: ScuttleBot automated PR testing

@ScuttleBot
Author

🧪 PR #315 Test Results — Generic Meeting Analysis Tasks

Instance: 155.138.225.173 (vc2-2c-4gb, Ubuntu 22.04, Vultr ATL)
Branch: tasks/meeting-generic @ 1c8157b
Judge: openrouter/anthropic/claude-opus-4.5 (hybrid grading: automated checks + LLM judge)
Total runtime: ~50 min (including retry — see notes)


Overall Scores

| Model | Overall Score | Rank |
| --- | --- | --- |
| claude-opus-4.6 | 88.3% (5.3/6.0) | 🥇 |
| gpt-5.4 | 85.5% (5.1/6.0) | 🥈 |
| gemini-2.5-pro | 61.2% (3.7/6.0) | 🥉 |

Task-by-Task Breakdown

| Task | Claude Opus 4.6 | GPT-5.4 | Gemini 2.5 Pro |
| --- | --- | --- | --- |
| executive_summary | 90% | 90% | 92% |
| sentiment_analysis | 86% | 82% | 84% |
| follow_up_email | 89% | 82% | 53% ⚠️ |
| blog_post | 90% | 86% | 51% ⚠️ |
| tldr | 89% | 90% | 88% |
| searchable_index | 86% | 82% | 0% |

Observations

All 6 tasks are functional and produce meaningful differentiation across models. This is a solid task suite.

Claude Opus 4.6 (88.3%) — Consistently strong

  • Perfect automated scores on exec summary and sentiment analysis
  • Minor deductions: blog post slightly over length target (appropriate_length: 0.5), TLDR slightly over 150-word limit (word_count_ok: 0.5), email not quite concise enough (concise: 0.5)
  • action_items_log: 0.0 on searchable index (known weakness in that rubric check)
  • Judge noted excellent insights: "bundling MVCs into themes", "stealth mode mental exercise for GA-ready features"

GPT-5.4 (85.5%) — Solid performer

  • Perfect automated scores on first 3 tasks and TLDR
  • Blog post: appropriate_length: 0.5, professional_tone: 0.5
  • Searchable index: action_items_log: 0.0 (same as Claude — may indicate rubric is too strict here)
  • Judge: "Comprehensive 21KB index with 14 detailed topics (exceeding minimum 5)"

Gemini 2.5 Pro (61.2%) — Significant issues

  • follow_up_email (53%): Agent wrote email WITHOUT reading the transcript first, resulting in completely fabricated content. Judge noted hallucinated topics (DevOps World, CI platform vendor decisions) and people not in the meeting.
  • blog_post (51%): Rubric mismatch — judge noted "Rubric evaluates blog post transformation but the actual task requested a TL;DR". Appears the agent may have confused task ordering or failed to properly scope its output.
  • searchable_index (0%): Agent failed completely — tried to spawn sub-agents twice (both failed), read the transcript but never created the output file meeting_index.md.
  • Strong on exec summary (92% — highest of all models!) and TLDR (88%)

Infrastructure Note

Initial parallel run hit a race condition: 3 benchmark processes tried to create OpenClaw agents simultaneously. Claude and Gemini got `Unknown agent id` errors and scored 0% across all tasks; GPT survived via an embedded fallback. Retrying Claude and Gemini with a 5s stagger worked for both. This is a benchmark infrastructure issue, not a task issue. Recommend serializing agent creation or adding a mutex.
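
The serialization fix could be as small as this sketch (hypothetical names; `create_fn` stands in for whatever call creates an OpenClaw agent). Note that a `threading.Lock` only guards threads within one process; since the report mentions separate benchmark processes, a file-based lock would be needed to serialize across processes:

```python
import threading
import time

_creation_lock = threading.Lock()

def create_agent_serialized(create_fn, *args, stagger: float = 5.0, **kwargs):
    """Create an agent while holding a process-wide lock, then pause briefly
    before releasing it, so concurrent benchmark threads never hit the agent
    API at the same instant."""
    with _creation_lock:
        agent = create_fn(*args, **kwargs)
        time.sleep(stagger)
    return agent
```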

Rubric Observations

  1. action_items_log in searchable_index: Both Claude and GPT scored 0 here despite otherwise excellent output. Worth checking if the automated check is too rigid.
  2. Gemini hallucination on follow_up_email: The task design correctly exposes a real model weakness (hallucinating content instead of reading the file first). Good test.
  3. Word count checks on TLDR: Claude slightly over 150 words — the automated check caught this appropriately.

Verdict: Tasks are well-designed and ready to merge. They test meaningful meeting comprehension skills and expose real model differences. The hybrid (automated + judge) grading produces fair, nuanced scores.

@olearycrew olearycrew merged commit f2c2a0d into main Apr 15, 2026
1 check passed