Add working link checker workflow using lychee #705
Conversation
Replace the broken filiph/linkcheck workflow with lychee, which crawls the live forrt.org site weekly and creates a GitHub issue listing any broken links found. Includes .lychee.toml config with exclusions for common false positives (LinkedIn, Twitter/X, doi.org, web.archive.org). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
📝 Spell Check Results

Found 1 potential spelling issue(s) when checking 24 changed file(s):
| Line | Issue |
|---|---|
| 94 | pre-selected ==> preselected |
ℹ️ How to address these issues:
- Fix the typo: If it's a genuine typo, please correct it.
- Add to whitelist: If it's a valid word (e.g., a name, technical term), add it to `.codespell-ignore.txt`
- False positive: If this is a false positive, please report it in the PR comments.
🤖 This check was performed by codespell
Lychee doesn't support recursive crawling, so fetch all page URLs from forrt.org/sitemap.xml and check links on every page. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
✅ Staging Deployment Status

This PR has been successfully deployed to staging as part of an aggregated deployment. Deployed at: 2026-03-19 14:22:07 UTC. The staging site shows the combined state of all compatible open PRs.
Replace grep -oP (Perl regex) with grep -o + sed for broader shell compatibility. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
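The portability fix can be sketched as follows. The inline sitemap is sample data standing in for https://forrt.org/sitemap.xml (which the workflow fetches over the network), and the pipeline is a sketch of the approach, not the workflow's exact code.

```shell
#!/bin/sh
# Sample sitemap standing in for https://forrt.org/sitemap.xml.
cat > sitemap.xml <<'EOF'
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://forrt.org/</loc></url>
  <url><loc>https://forrt.org/glossary/</loc></url>
</urlset>
EOF

# GNU-only version:  grep -oP '(?<=<loc>).*?(?=</loc>)' sitemap.xml
# Portable version:  grep -o keeps the whole <loc>...</loc> element,
#                    then sed strips the surrounding tags.
grep -o '<loc>[^<]*</loc>' sitemap.xml | sed -e 's|<loc>||' -e 's|</loc>||'
```

This prints the two page URLs, one per line; `grep -oP` is GNU-specific, while `-o` plus `sed` also works with BSD grep on macOS runners.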
Download the latest deploy artifact instead of crawling the live site. Lychee scans the local HTML files and checks every link it finds, both internal and external. This catches broken outbound links that the sitemap-only approach missed. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Remove email addresses from author fields in educators-corner posts (Sarah von Grebmer, Rachel Heyard) - Fix YAML syntax in Berit Barthelmes author profile (stray 'Name' prefix) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Internal links resolve to remote fetches via --base-url, causing thousands of false 404s for assets. Exclude forrt.org since those are already local files. Also exclude Sage, T&F, APA which block automated requests with 403s. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Academic publishers (Sage, T&F, APA, etc.) return 403 for all automated requests — valid and invalid URLs alike. Accept 403 as non-broken so these links are still checked but don't produce false positives. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
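In lychee's config this amounts to a small fragment like the sketch below; the key names (`accept`, `exclude`) follow lychee's documented `lychee.toml` format and should be verified against the project's actual config file.

```toml
# .lychee.toml (sketch; verify key names against lychee's example config)

# Publishers such as Sage, T&F, and APA return 403 to every automated
# request, so treat 403 as non-broken rather than excluding those domains.
accept = [200, 403]

# Internal links are served from the local deploy artifact, so skip
# remote fetches of forrt.org itself.
exclude = ["^https?://forrt\\.org"]
```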
- Convert 488 publisher-specific DOI URLs to canonical https://doi.org/ format across 11 content files (glossary excluded as auto-generated) - Strip session-specific casa_token query params from all URLs - Remove doi.org from lychee exclusion list (it returns proper 404s for invalid DOIs, unlike publishers that block all bot requests) - Add workflow step to flag remaining publisher DOI URLs in the link checker issue report Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Flag any direct publisher URL (not just those with visible DOIs) so contributors know to look up and use the doi.org format. Added ScienceDirect, JSTOR, LWW, and Royal Society to the pattern. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
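A flagging step of this kind can be sketched as below, run over a sample file; the domain list and regex are illustrative assumptions covering the publishers named in this thread, not the workflow's exact pattern.

```shell
#!/bin/sh
# Sample content file; the sagepub URL should be flagged, the doi.org one not.
cat > sample.md <<'EOF'
See https://journals.sagepub.com/doi/10.1177/0956797614524581 for details.
Canonical form: https://doi.org/10.1177/0956797614524581
EOF

# Flag direct publisher URLs so contributors replace them with doi.org links.
grep -oE 'https?://[a-z.]*(sagepub\.com|tandfonline\.com|apa\.org|sciencedirect\.com|jstor\.org|lww\.com|royalsocietypublishing\.org)[^ ]*' sample.md
```

Only the direct publisher URL is printed; the canonical doi.org link passes through unflagged.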
Remove 403 from accepted status codes so they appear in lychee output, then post-process to move them into a collapsed <details> block. This keeps the main report focused on actionable errors while still surfacing bot-blocked URLs for reference. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
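The post-processing can be sketched like this; the `[status] URL` report lines are a simplified stand-in for lychee's real output format, assumed here for illustration.

```shell
#!/bin/sh
# Simplified stand-in for the deduplicated lychee report.
cat > report.txt <<'EOF'
[404] https://example.com/gone
[403] https://journals.sagepub.com/doi/10.1177/123
[500] https://example.com/error
EOF

# Main report: actionable errors only (everything except 403s).
grep -v '^\[403\]' report.txt > issue.md

# Collapsed section: bot-blocked 403s kept for reference.
{
  echo '<details><summary>Links returning 403 (likely bot-blocked)</summary>'
  echo
  grep '^\[403\]' report.txt
  echo '</details>'
} >> issue.md

cat issue.md
```

GitHub renders the `<details>` block collapsed, so the 404/500 lines stay front and centre in the issue body.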
@richarddushime Could you review and merge this when you get a chance? The merge is needed to make publisher URLs more checkable: we've converted ~490 publisher-specific DOI URLs to the canonical https://doi.org/ format.
Lychee reports the same broken URL once per page it appears on, making the issue body exceed GitHub's 65KB limit. Post-process to show each broken URL only once, with shortened output. Also moves per-page headers out in favour of a flat deduplicated list. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
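The deduplication step amounts to extracting the broken URL from each per-page result and collapsing repeats; the input format below is an illustrative assumption, not lychee's exact output.

```shell
#!/bin/sh
# Illustrative per-page results: one broken URL reported on three pages.
cat > raw.txt <<'EOF'
https://forrt.org/about/ -> [404] https://example.com/gone
https://forrt.org/glossary/ -> [404] https://example.com/gone
https://forrt.org/posts/ -> [404] https://example.com/gone
https://forrt.org/about/ -> [404] https://example.com/missing
EOF

# Keep only the broken URL (last field), then deduplicate:
# four result lines collapse to two unique broken URLs.
awk '{print $NF}' raw.txt | sort -u
```

Each broken URL now appears once regardless of how many pages link to it, which is what keeps the issue body under the size limit.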
GitHub limits issue bodies to 65KB. Cap 403 and publisher URL lists at 100 entries each with a count of remaining items. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
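The capping logic is simple head-plus-remainder; sketched below with a cap of 3 instead of the workflow's 100 so the sample stays small.

```shell
#!/bin/sh
# Seven sample URLs standing in for a long 403 or publisher-URL section.
seq 1 7 | sed 's|^|https://example.com/page-|' > urls.txt

CAP=3   # the workflow uses 100
total=$(wc -l < urls.txt)
head -n "$CAP" urls.txt
if [ "$total" -gt "$CAP" ]; then
  echo "... and $((total - CAP)) more"
fi
```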
- Track which page(s) each broken URL appears on so they can be found - Keep publisher URL section open (not collapsed) as last section - 403s still collapsed and capped at 100 Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The full grep line content from reversals.md made the issue body exceed 65KB. Use grep -o to extract just the URL, with file:line prefix, and deduplicate. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
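With `grep -oHn`, each match is printed as `file:line:URL`, so the issue body carries a pointer back to the source without quoting the full table row; the sample file and the URL regex here are illustrative.

```shell
#!/bin/sh
# Sample file with a long line containing a repeated URL.
cat > reversals.md <<'EOF'
A long table row with https://example.com/a and https://example.com/a again
Another row with https://example.com/b here
EOF

# -o prints only the match, -H the filename, -n the line number;
# sort -u drops the URL repeated within line 1.
grep -oHn 'https://[^ ]*' reversals.md | sort -u
```

Three matches collapse to two `file:line:URL` entries, one per unique URL.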
richarddushime
left a comment
LGTM 👍
but this will create conflicts with PR #699
merging this, then I will fix the conflict later
Summary
- Replaces the broken `filiph/linkcheck` workflow (currently disabled as `link-check.yaml_OLD`) with lychee, a fast Rust-based link checker
- Crawls the https://forrt.org site weekly (Mondays 01:30 UTC) and on manual dispatch
- Opens a GitHub issue labelled `link-check` listing all broken links found
- Adds a `.lychee.toml` config with exclusions for common false positives (LinkedIn, Twitter/X, doi.org, web.archive.org, etc.)

What changed

- `.github/workflows/link-check.yaml`: new workflow using `lychee/lychee-action@v2`
- `.lychee.toml`: link checker config (reusable locally with `lychee --config .lychee.toml https://forrt.org`)

Key design decisions

- Uses the built-in `GITHUB_TOKEN`; no custom secrets needed

Test plan
🤖 Generated with Claude Code