
Add working link checker workflow using lychee#705

Merged
richarddushime merged 22 commits into master from
fix/link-checker-workflow
Mar 19, 2026


@LukasWallrich
Contributor

Summary

  • Replaces the broken filiph/linkcheck workflow (currently disabled as link-check.yaml_OLD) with lychee, a fast Rust-based link checker
  • Crawls the live https://forrt.org site weekly (Mondays 01:30 UTC) and on manual dispatch
  • Creates a GitHub issue with label link-check listing all broken links found
  • Adds .lychee.toml config with exclusions for common false positives (LinkedIn, Twitter/X, doi.org, web.archive.org, etc.)

What changed

  • .github/workflows/link-check.yaml — new workflow using lychee/lychee-action@v2
  • .lychee.toml — link checker config (reusable locally with lychee --config .lychee.toml https://forrt.org)

Key design decisions

  • Crawls the live site (not source files) since many pages are dynamically generated by Hugo
  • Does not fail the workflow — only creates an issue when broken links are found
  • Uses default GITHUB_TOKEN, no custom secrets needed
  • Limits concurrency to 8 requests to avoid overwhelming the server

Test plan

  • Trigger manually via Actions tab → "Link Checker" → "Run workflow"
  • Verify issue is created with broken link report (if any)
  • Confirm excluded domains (LinkedIn, doi.org, etc.) don't appear as false positives

🤖 Generated with Claude Code

Richard Dushime and others added 4 commits March 12, 2026 01:13
Replace the broken filiph/linkcheck workflow with lychee, which crawls
the live forrt.org site weekly and creates a GitHub issue listing any
broken links found. Includes .lychee.toml config with exclusions for
common false positives (LinkedIn, Twitter/X, doi.org, web.archive.org).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@LukasWallrich LukasWallrich requested a review from a team as a code owner March 18, 2026 16:17
@github-actions
Contributor

github-actions bot commented Mar 18, 2026

⚠️ Image files/references in png/jpg format detected

Note that we generally rely on webp format for this webpage, so please consider converting these images to WebP format and updating references accordingly.

References to image files:

  • content/educators-corner/022-repro-metrics-forrt-irise/index.md: app](reprometrics_app.png

LukasWallrich and others added 2 commits March 18, 2026 16:18
@github-actions
Contributor

github-actions bot commented Mar 18, 2026

📝 Spell Check Results

Found 1 potential spelling issue(s) when checking 24 changed file(s):

📄 content/educators-corner/004-Teaching-why-how-replication/index.md

Line Issue
94 pre-selected ==> preselected

ℹ️ How to address these issues:

  1. Fix the typo: If it's a genuine typo, please correct it.
  2. Add to whitelist: If it's a valid word (e.g., a name, technical term), add it to .codespell-ignore.txt
  3. False positive: If this is a false positive, please report it in the PR comments.

🤖 This check was performed by codespell

LukasWallrich and others added 2 commits March 18, 2026 16:22
Lychee doesn't support recursive crawling, so fetch all page URLs
from forrt.org/sitemap.xml and check links on every page.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@LukasWallrich
Contributor Author

LukasWallrich commented Mar 18, 2026

Staging Deployment Status

This PR has been successfully deployed to staging as part of an aggregated deployment.

Deployed at: 2026-03-19 14:22:07 UTC
Staging URL: https://staging.forrt.org

The staging site shows the combined state of all compatible open PRs.

LukasWallrich and others added 6 commits March 18, 2026 16:27
Replace grep -oP (Perl regex) with grep -o + sed for broader
shell compatibility.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
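
A minimal sketch of what the `grep -o` + `sed` replacement could look like when pulling page URLs out of the sitemap. This is not the workflow's actual code; the function name `extract_sitemap_urls` is hypothetical, and the sketch assumes the sitemap lists each page in a `<loc>` element.

```shell
# Hypothetical helper: read sitemap XML on stdin, print one URL per line.
# Uses only grep -o and sed, so it does not rely on grep -P (Perl regex),
# which some grep builds do not provide.
extract_sitemap_urls() {
  grep -o '<loc>[^<]*</loc>' | sed -e 's/^<loc>//' -e 's/<\/loc>$//'
}
```

It could be fed from the live sitemap, for example with `curl -s https://forrt.org/sitemap.xml | extract_sitemap_urls`.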
Download the latest deploy artifact instead of crawling the live site.
Lychee scans the local HTML files and checks every link it finds,
both internal and external. This catches broken outbound links that
the sitemap-only approach missed.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Remove email addresses from author fields in educators-corner posts
  (Sarah von Grebmer, Rachel Heyard)
- Fix YAML syntax in Berit Barthelmes author profile (stray 'Name' prefix)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Internal links resolve to remote fetches via --base-url, causing
thousands of false 404s for assets. Exclude forrt.org since those
are already local files. Also exclude Sage, T&F, APA which block
automated requests with 403s.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Academic publishers (Sage, T&F, APA, etc.) return 403 for all
automated requests — valid and invalid URLs alike. Accept 403
as non-broken so these links are still checked but don't produce
false positives.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
LukasWallrich and others added 3 commits March 18, 2026 16:54
- Convert 488 publisher-specific DOI URLs to canonical https://doi.org/
  format across 11 content files (glossary excluded as auto-generated)
- Strip session-specific casa_token query params from all URLs
- Remove doi.org from lychee exclusion list (it returns proper 404s
  for invalid DOIs, unlike publishers that block all bot requests)
- Add workflow step to flag remaining publisher DOI URLs in the
  link checker issue report

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
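
The two URL clean-ups in this commit (stripping `casa_token` and rewriting publisher DOI URLs) can be sketched as a single `sed` filter. This is an illustration under assumptions, not the script used in the PR: the function name `normalize_doi_urls` is hypothetical, and the rewrite only covers URLs where the DOI is visible after a `/doi/` path segment (optionally followed by `abs/`, `full/` or `pdf/`).

```shell
# Hypothetical helper: read URLs on stdin, print cleaned URLs on stdout.
normalize_doi_urls() {
  # 1. drop a casa_token query parameter, keeping any other parameters
  # 2. remove a dangling "?" or "&" left at the end of the URL
  # 3. rewrite https://<publisher>/doi/[abs|full|pdf/]<DOI> to https://doi.org/<DOI>
  sed -E \
    -e 's/([?&])casa_token=[^&#]*&?/\1/' \
    -e 's/[?&]$//' \
    -e 's@^https?://[^/]+/doi/(abs/|full/|pdf/)?(10\.[0-9]{4,9}/[^?#]+).*$@https://doi.org/\2@'
}
```

URLs that do not embed a DOI in this shape pass through unchanged apart from the `casa_token` removal, so they would still need a manual lookup.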
Flag any direct publisher URL (not just those with visible DOIs) so
contributors know to look up and use the doi.org format. Added
ScienceDirect, JSTOR, LWW, and Royal Society to the pattern.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Remove 403 from accepted status codes so they appear in lychee output,
then post-process to move them into a collapsed <details> block. This
keeps the main report focused on actionable errors while still surfacing
bot-blocked URLs for reference.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
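
A rough sketch of the post-processing idea: keep actionable errors at the top and move the 403 entries into a collapsed `<details>` block. The function name `collapse_403` is hypothetical, and the sketch assumes each report line carries its status code in square brackets (for example `[403]`); the real lychee output format may differ.

```shell
# Hypothetical helper: read report lines on stdin; print non-403 lines first,
# then the 403 lines wrapped in a collapsed <details> block.
collapse_403() {
  report=$(cat)
  # Lines that are not 403s stay in the main report ("|| true" tolerates no matches).
  printf '%s\n' "$report" | grep -v '\[403\]' || true
  # Collect the 403 lines; the variable is empty if there are none.
  blocked=$(printf '%s\n' "$report" | grep '\[403\]' || true)
  if [ -n "$blocked" ]; then
    printf '<details>\n<summary>Bot-blocked URLs (403)</summary>\n\n%s\n\n</details>\n' "$blocked"
  fi
}
```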
@LukasWallrich
Contributor Author

@richarddushime Could you review and merge this when you get a chance? The merge is needed to make publisher URLs more checkable — we've converted ~490 publisher-specific DOI URLs to doi.org format, and the link checker now flags any remaining ones in the weekly report. Until this is merged, the workflow checks the old build artifact which still has the publisher URLs.

LukasWallrich and others added 5 commits March 18, 2026 17:12
Lychee reports the same broken URL once per page it appears on,
making the issue body exceed GitHub's 65KB limit. Post-process to
show each broken URL only once, with shortened output. Also drops
per-page headers in favour of a flat deduplicated list.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
GitHub limits issue bodies to 65KB. Cap 403 and publisher URL lists
at 100 entries each with a count of remaining items.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
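
Capping a list and reporting how many entries were left out can be sketched with `awk`. The function name `cap_list` and the wording of the trailing note are assumptions for illustration, not the workflow's actual output.

```shell
# Hypothetical helper: print at most $1 lines from stdin,
# followed by a note counting the lines that were cut.
cap_list() {
  awk -v limit="$1" '
    NR <= limit { print }
    END { if (NR > limit) printf "... and %d more\n", NR - limit }
  '
}
```

Under these assumptions, the 403 and publisher lists would each be piped through `cap_list 100` before being written into the issue body.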
- Track which page(s) each broken URL appears on so they can be found
- Keep publisher URL section open (not collapsed) as last section
- 403s still collapsed and capped at 100

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
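
One way to show each broken URL once while still recording where it occurs is to group pages by URL. The sketch below assumes a tab-separated input of `URL<TAB>page`, which is an invented intermediate format; `group_pages_by_url` is a hypothetical name.

```shell
# Hypothetical helper: read "URL<TAB>page" lines on stdin and print each URL
# once, in first-seen order, followed by the pages it was found on.
group_pages_by_url() {
  awk -F '\t' '
    !($1 in pages) { order[++n] = $1; pages[$1] = $2; next }
    { pages[$1] = pages[$1] ", " $2 }
    END {
      for (i = 1; i <= n; i++)
        printf "%s (found on: %s)\n", order[i], pages[order[i]]
    }
  '
}
```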
The full grep line content from reversals.md made the issue body
exceed 65KB. Use grep -o to extract just the URL, with file:line
prefix, and deduplicate.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
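
A sketch of the `grep -o` approach described above: print only the matched URL with a `file:line` prefix, then deduplicate. The host list in `PUBLISHER_PATTERN` is an assumption based on the publishers named in this PR, not the pattern used in the workflow, and `flag_publisher_urls` is a hypothetical name.

```shell
# Assumed pattern: URLs whose host ends in one of these publisher domains.
PUBLISHER_PATTERN='https?://[^/ ]*(sagepub\.com|tandfonline\.com|apa\.org|sciencedirect\.com|jstor\.org|lww\.com|royalsocietypublishing\.org)[^] )>"]*'

# Hypothetical helper: list publisher URLs in the given files as
# file:line:URL, one entry per distinct match.
flag_publisher_urls() {
  # -H file name, -n line number, -o matched text only, -E extended regex
  grep -HnoE "$PUBLISHER_PATTERN" "$@" | sort -u
}
```

Because only the matched URL is printed, a very long source line contributes a few dozen bytes per link instead of its full content.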
Contributor

@richarddushime richarddushime left a comment

LGTM 👍
but this will create conflicts with PR #699.
I'll merge this now and fix the conflict later.

@richarddushime richarddushime merged commit b47e64c into master Mar 19, 2026
5 checks passed
@richarddushime richarddushime deleted the fix/link-checker-workflow branch March 19, 2026 14:31