Conversation
…banned title/description patterns
- Fix extractHighlights() to reject metadata field labels (Committee:, Filed by:, Published:)
- Remove BANNED ": {Topic} in Focus" suffix from generateDynamicTitle()
- Add sanitizeAlternativeHeadline() to strip boilerplate from Schema.org alternativeHeadline
- Add generateSeoDescription() for proper meta description without banned patterns
- Add countWords() for accurate Schema.org wordCount (was using length/5)
- Add speakable property to Schema.org NewsArticle structured data
- Add dateCreated to Schema.org structured data
- Improve all 14-language title templates in generators.ts - remove generic "This Week"/"Battle Lines" phrases
- Add 2 new banned patterns to shared.ts BANNED_PATTERNS array
- Update SHARED_PROMPT_PATTERNS.md with new banned patterns for AI workflows
Agent-Logs-Url: https://github.com/Hack23/riksdagsmonitor/sessions/568e92a9-076f-4d3e-ad26-025340de26d0
Co-authored-by: pethers <1726836+pethers@users.noreply.github.com>
…de review

Agent-Logs-Url: https://github.com/Hack23/riksdagsmonitor/sessions/568e92a9-076f-4d3e-ad26-025340de26d0
Co-authored-by: pethers <1726836+pethers@users.noreply.github.com>

…titles/descriptions

Agent-Logs-Url: https://github.com/Hack23/riksdagsmonitor/sessions/8e049658-aedb-4cdc-9e35-b48c1a80e47a
Co-authored-by: pethers <1726836+pethers@users.noreply.github.com>

…ractHighlights/extractDominantTheme, replace all subtitle templates with AI stubs

Agent-Logs-Url: https://github.com/Hack23/riksdagsmonitor/sessions/8e049658-aedb-4cdc-9e35-b48c1a80e47a
Co-authored-by: pethers <1726836+pethers@users.noreply.github.com>

… test assertion

Agent-Logs-Url: https://github.com/Hack23/riksdagsmonitor/sessions/8e049658-aedb-4cdc-9e35-b48c1a80e47a
Co-authored-by: pethers <1726836+pethers@users.noreply.github.com>
🏷️ Automatic Labeling Summary

This PR has been automatically labeled based on the files changed and PR metadata.

Applied labels: documentation, workflow, ci-cd, testing, refactor, size-l, news, agentic-workflow
🔍 Lighthouse Performance Audit
📥 Download full Lighthouse report
Budget Compliance: Performance budgets enforced
Pull request overview
This PR deprecates code-generated article titles/descriptions and shifts responsibility for all title/SEO metadata generation to the AI agent, using synthesis-driven workflow guidance and placeholder stubs in generators.
Changes:
- Replaces dynamic/heuristic title generation (regex highlights + theme extraction) with a v5.0 stub and updates generator subtitle templates to AI-attribution stubs (removing doc-count interpolation).
- Updates the article HTML template to enforce SERP-length descriptions and improve Schema.org metadata fields.
- Updates methodology docs and workflow prompts to mandate “analysis → synthesis → title/SEO” sequencing, and adjusts tests to validate the new stub behavior.
Reviewed changes
Copilot reviewed 18 out of 18 changed files in this pull request and generated 7 comments.
| File | Description |
|---|---|
| `tests/dynamic-title.test.ts` | Rewrites unit/integration tests to validate stubbed title generation and to scan generators for disallowed subtitle interpolation. |
| `scripts/news-types/weekly-review/generator.ts` | Replaces doc-count subtitle templates with AI-attribution stub subtitles; keeps API compatibility. |
| `scripts/news-types/monthly-review.ts` | Replaces doc-count subtitle templates with AI-attribution stub subtitles; keeps API compatibility. |
| `scripts/news-types/month-ahead.ts` | Replaces event-count subtitle templates with AI-attribution stub subtitles; keeps API compatibility. |
| `scripts/news-types/breaking-news.ts` | Replaces breaking-news subtitles with AI-attribution stub subtitles across languages. |
| `scripts/generate-news-enhanced/helpers.ts` | Removes regex-based highlight/theme extraction and converts `generateDynamicTitle()` into a deprecated v5.0 stub. |
| `scripts/generate-news-enhanced/generators.ts` | Replaces generator subtitle templates with AI-attribution stubs (removing `${*.length}` patterns) and simplifies titles. |
| `scripts/data-transformers/content-generators/shared.ts` | Extends banned-pattern detection to catch additional template artifacts (e.g., "in Focus" suffixes). |
| `scripts/article-template/template.ts` | Adds SEO/structured-data helpers (altHeadline sanitization, meta description truncation, wordCount calculation, speakable) and wires them into HTML + JSON-LD. |
| `analysis/templates/synthesis-summary.md` | Adds mandatory "AI-Recommended Article Metadata" fields to make synthesis the single source of truth for titles/descriptions. |
| `analysis/methodologies/political-style-guide.md` | Adds v5.0 title/SEO standards and banned title patterns; bumps methodology version metadata. |
| `analysis/methodologies/ai-driven-analysis-guide.md` | Adds v5.0 "absolute ban" on code-generated titles and a mandatory analysis-driven decision protocol. |
| `.github/workflows/SHARED_PROMPT_PATTERNS.md` | Adds an explicit "Analysis→Title Pipeline" protocol and expands banned title/description examples. |
| `.github/workflows/news-propositions.md` | Updates Step 3b to require synthesis-first title/SEO generation and all-language metadata updates. |
| `.github/workflows/news-motions.md` | Updates Step 3b to require synthesis-first title/SEO generation and all-language metadata updates. |
| `.github/workflows/news-interpellations.md` | Updates Step 3c to require synthesis-first title/SEO generation and all-language metadata updates. |
| `.github/workflows/news-evening-analysis.md` | Updates Step 3b to require cross-type synthesis-first title/SEO generation and all-language metadata updates. |
| `.github/workflows/news-committee-reports.md` | Updates Step 3b to require synthesis-first title/SEO generation and all-language metadata updates. |
```ts
const HEADLINE_BANNED_PATTERNS: readonly RegExp[] = [
  /Political intelligence briefing on [A-Za-z:]+\s+and\s+[A-Za-z:]+\s*[—–-]\s*\d+ parliamentary documents analyzed/i,
  /In-depth analysis of [A-Za-z:]+\s+based on \d+ parliamentary documents/i,
```
HEADLINE_BANNED_PATTERNS (and the generateSeoDescription safety-net checks below) are too restrictive to catch the real leaked boilerplate seen in existing pages, e.g. "Political intelligence briefing on Filed by: and Published: — 10 parliamentary documents analyzed". The current patterns assume single-word field labels (A-Za-z / \w) and therefore won’t match "Filed by:" / "Committee:" cases; broaden these regexes to allow multi-word field labels (and ideally Unicode letters) so the sanitizer actually strips the known template artifacts.
Suggested change:

```diff
-const HEADLINE_BANNED_PATTERNS: readonly RegExp[] = [
-  /Political intelligence briefing on [A-Za-z:]+\s+and\s+[A-Za-z:]+\s*[—–-]\s*\d+ parliamentary documents analyzed/i,
-  /In-depth analysis of [A-Za-z:]+\s+based on \d+ parliamentary documents/i,
+const HEADLINE_FIELD_LABEL_WORD = String.raw`[\p{L}\p{M}]+`;
+const HEADLINE_FIELD_LABEL = String.raw`${HEADLINE_FIELD_LABEL_WORD}(?:[\s-]+${HEADLINE_FIELD_LABEL_WORD})*\s*:`;
+const HEADLINE_BANNED_PATTERNS: readonly RegExp[] = [
+  new RegExp(
+    String.raw`Political intelligence briefing on ${HEADLINE_FIELD_LABEL}\s+and\s+${HEADLINE_FIELD_LABEL}\s*[—–-]\s*\d+ parliamentary documents analyzed`,
+    'iu',
+  ),
+  new RegExp(
+    String.raw`In-depth analysis of ${HEADLINE_FIELD_LABEL}\s+based on \d+ parliamentary documents`,
+    'iu',
+  ),
```
```ts
function sanitizeAlternativeHeadline(subtitle: string, maxLen: number = 110): string {
  let clean = subtitle;
  for (const bp of HEADLINE_BANNED_PATTERNS) {
    if (bp.test(clean)) {
      clean = clean.replace(bp, '').trim();
    }
  }
  // If cleaning emptied the string, return a safe fallback
  if (clean.length < MIN_HEADLINE_LENGTH) {
    clean = subtitle.substring(0, maxLen);
  }
  return clean.substring(0, maxLen);
```
sanitizeAlternativeHeadline() falls back to the original subtitle when the banned-pattern removal leaves < MIN_HEADLINE_LENGTH. That re-introduces the exact boilerplate you’re trying to strip (so Schema.org alternativeHeadline can still contain banned template text). Prefer a fallback that cannot contain the banned patterns (e.g., use the article title, or a known-safe AI stub), and consider trimming leftover punctuation after replacements.
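A minimal sketch of the safer fallback described here, assuming the article title is available at the call site and is already validated against the banned patterns; the extra `articleTitle` parameter and the `MIN_HEADLINE_LENGTH` value are assumptions, not the repo's actual signature:

```typescript
const MIN_HEADLINE_LENGTH = 10; // assumed threshold from the surrounding module

function sanitizeAlternativeHeadline(
  subtitle: string,
  articleTitle: string, // hypothetical parameter: a known-safe fallback string
  bannedPatterns: readonly RegExp[],
  maxLen = 110,
): string {
  let clean = subtitle;
  for (const bp of bannedPatterns) {
    clean = clean.replace(bp, '').trim();
  }
  // Trim punctuation/dashes the replacement may strand at the edges.
  clean = clean.replace(/^[\s:;,—–-]+|[\s:;,—–-]+$/gu, '');
  // Fall back to the article title, which cannot contain the banned
  // boilerplate, instead of re-introducing the original subtitle.
  if (clean.length < MIN_HEADLINE_LENGTH) {
    clean = articleTitle;
  }
  return clean.substring(0, maxLen);
}
```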
```ts
/**
 * Calculate accurate word count from HTML content by stripping tags first.
 */
function countWords(html: string): number {
  const text = html.replace(/<[^>]+>/g, ' ').replace(/\s+/g, ' ').trim();
```
countWords() will return misleadingly low counts for languages that don’t use spaces between words (notably ja/zh/ko in this repo). Since wordCount is emitted into Schema.org, consider using Intl.Segmenter (Node 25 supports it) for per-language word segmentation, or add a fallback heuristic (e.g., character-based estimate) when the stripped text contains few/no spaces.
Suggested change:

```diff
 const text = html.replace(/<[^>]+>/g, ' ').replace(/\s+/g, ' ').trim();
+if (text.length === 0) {
+  return 0;
+}
+const segmenter = new Intl.Segmenter('und', { granularity: 'word' });
+const wordLikeCount = Array.from(segmenter.segment(text)).filter(segment => segment.isWordLike).length;
+if (wordLikeCount > 0) {
+  return wordLikeCount;
+}
+const hasWhitespace = /\s/u.test(text);
+const cjkCharacters = text.match(/[\p{Script=Han}\p{Script=Hiragana}\p{Script=Katakana}\p{Script=Hangul}]/gu);
+if (!hasWhitespace && cjkCharacters !== null) {
+  return cjkCharacters.length;
+}
```
```ts
{ label: 'analysisOfNDocuments: "Analysis of N documents covering…"', pattern: /Analysis of \d+ documents covering/i },
{ label: 'policySignificanceGeneric: "Requires committee review and chamber debate…"', pattern: /Requires committee review and chamber debate/i },
{ label: 'topicInFocusSuffix: "…: {Topic} in Focus"', pattern: /:\s+\w[\w\s]*\bin Focus\b/i },
{ label: 'briefingOnFieldLabels: "Political intelligence briefing on {Field}: and {Field}:"', pattern: /Political intelligence briefing on \w+:\s+and\s+\w+:/i },
```
The new banned-pattern detector for "Political intelligence briefing on {Field}: and {Field}:" won’t match the actual leaked boilerplate (e.g. "Filed by:" / "Committee:"), because the regex uses \w+ which stops at spaces. Update the pattern to allow multi-word field labels (and ideally Unicode letters) so detectBannedPatterns() reliably flags these known template artifacts.
Suggested change:

```diff
-{ label: 'briefingOnFieldLabels: "Political intelligence briefing on {Field}: and {Field}:"', pattern: /Political intelligence briefing on \w+:\s+and\s+\w+:/i },
+{ label: 'briefingOnFieldLabels: "Political intelligence briefing on {Field}: and {Field}:"', pattern: /Political intelligence briefing on [\p{L}\p{N}][\p{L}\p{N}\s,&/()-]*:\s+and\s+[\p{L}\p{N}][\p{L}\p{N}\s,&/()-]*:/iu },
```
```ts
for (const file of generatorFiles) {
  const content = fs.readFileSync(file, 'utf-8');
  const subtitleLines = content.match(/subtitle:\s*`[^`]+`/g) ?? [];
  const aiStubs = subtitleLines.filter(s => s.includes('AI-generat') || s.includes('AI-genererad') || s.includes('AI-genereret') || s.includes('AI-generert') || s.includes('tekoäly') || s.includes('KI-generierte') || s.includes('AI-gegenereerde') || s.includes('الذكاء الاصطناعي') || s.includes('בינה מלאכותית') || s.includes('AI生成') || s.includes('AI 생성') || s.includes('générée par IA') || s.includes('generado por IA'));
  // At least some subtitles should have the AI attribution marker
  if (subtitleLines.length > 0) {
    expect(aiStubs.length, `${file} should have AI attribution in subtitle stubs`).toBeGreaterThan(0);
```
The "subtitle templates contain AI attribution stub" assertion only checks that at least one subtitle template in a generator file contains an AI marker. The PR description/intent says subtitles are stubs specifically so the agent can reliably detect & overwrite them, so this test should enforce that every discovered subtitle: template literal in these generators includes the attribution marker (or otherwise explicitly whitelist exceptions).
Suggested change:

```diff
-for (const file of generatorFiles) {
-  const content = fs.readFileSync(file, 'utf-8');
-  const subtitleLines = content.match(/subtitle:\s*`[^`]+`/g) ?? [];
-  const aiStubs = subtitleLines.filter(s => s.includes('AI-generat') || s.includes('AI-genererad') || s.includes('AI-genereret') || s.includes('AI-generert') || s.includes('tekoäly') || s.includes('KI-generierte') || s.includes('AI-gegenereerde') || s.includes('الذكاء الاصطناعي') || s.includes('בינה מלאכותית') || s.includes('AI生成') || s.includes('AI 생성') || s.includes('générée par IA') || s.includes('generado por IA'));
-  // At least some subtitles should have the AI attribution marker
-  if (subtitleLines.length > 0) {
-    expect(aiStubs.length, `${file} should have AI attribution in subtitle stubs`).toBeGreaterThan(0);
+const aiAttributionMarkers = [
+  'AI-generat',
+  'AI-genererad',
+  'AI-genereret',
+  'AI-generert',
+  'tekoäly',
+  'KI-generierte',
+  'AI-gegenereerde',
+  'الذكاء الاصطناعي',
+  'בינה מלאכותית',
+  'AI生成',
+  'AI 생성',
+  'générée par IA',
+  'generado por IA',
+];
+for (const file of generatorFiles) {
+  const content = fs.readFileSync(file, 'utf-8');
+  const subtitleLines = content.match(/subtitle:\s*`[^`]+`/g) ?? [];
+  const aiStubs = subtitleLines.filter(subtitleLine =>
+    aiAttributionMarkers.some(marker => subtitleLine.includes(marker)),
+  );
+  // Every discovered subtitle template must be an AI-attribution stub so the
+  // agent can reliably detect and replace it. Files without subtitle templates
+  // are ignored by this assertion.
+  if (subtitleLines.length > 0) {
+    expect(
+      aiStubs.length,
+      `${file} should include AI attribution in every subtitle stub`,
+    ).toBe(subtitleLines.length);
```
```ts
  subtitle: `Ugentlig gennemgang — AI-genereret politisk analyse fra det svenske parlament`
},
no: {
  title: `Ukentlig gjennomgang: Stortinget i retrospekt`,
```
Norwegian (no) subtitle now explicitly says the analysis is from the Swedish parliament, but the corresponding title still says "Stortinget" (Norway’s parliament). This is internally inconsistent and misleading in the rendered metadata; adjust the no title (or subtitle) to refer to Sweden’s Riksdag consistently.
Suggested change:

```diff
-  title: `Ukentlig gjennomgang: Stortinget i retrospekt`,
+  title: `Ukentlig gjennomgang: Riksdagen i retrospekt`,
```
```ts
  subtitle: `Månedsgennemgang — AI-genereret politisk analyse fra det svenske parlament`
},
no: {
  title: `Månedsgjennomgang: Stortinget i perspektiv`,
```
Norwegian (no) subtitle now explicitly says the analysis is from the Swedish parliament, but the corresponding title still says "Stortinget" (Norway’s parliament). This mismatch was introduced by the subtitle change and will surface in page titles/SEO; update the no title (or subtitle) to consistently reference Sweden’s Riksdag.
Suggested change:

```diff
-  title: `Månedsgjennomgang: Stortinget i perspektiv`,
+  title: `Månedsgjennomgang: Riksdagen i perspektiv`,
```
Article titles, meta descriptions, and Schema.org structured data were generated by TypeScript code (regex extraction from `<strong>` tags, keyword matching, document-count interpolation) rather than by AI analysis of actual political content. This produced generic, repetitive titles like `"Committee Reports: Defense in Focus"` instead of newsworthy headlines derived from synthesis findings.

**Code changes**
- `generateDynamicTitle()` — gutted to a stub returning base title + AI-attribution marker. The AI agent overwrites this during agentic workflows after reading `synthesis-summary.md`.
- `extractHighlights()`, `extractDominantTheme()` — removed entirely. Regex scanning HTML for `<strong>` tags is not political intelligence.
- Removed `${docs.length} documents` interpolation; added static AI-attribution stubs.
- `generateSeoDescription()` — simplified to SERP length enforcement only.

**Methodology & workflow prompt updates**
- Title formula: `[Active Verb] + [Actor/Institution] + [Policy Action]`

**Tests**
- `dynamic-title.test.ts` rewritten for stub behavior — validates no `.length` interpolation in subtitles across all discovered generators, verifies AI-attribution markers present.