Stabilize wiki retry handling and release-time wiki pointer reconciliation #309

@coisa

Problem

The wiki automation currently has two connected reliability gaps.

First, Retry Transient Workflow Failures can fail instead of helping when the failed run comes from wiki maintenance. On 2026-04-30, retry run 25150628268 failed while inspecting Maintain Wiki run 25150613235 and raised:

Failed to download logs for job maintenance / Publish Wiki Master: 401 Unauthorized

That leaves transient wiki failures unretried and turns the safety net itself into another failing workflow.

Second, wiki publication is still vulnerable to pointer races around release merges. A pull request can race with wiki pointer publication and still merge, which can leave the .github/wiki pointer behind. The release publication path is the authoritative moment for the released state, so it is where the final .github/wiki pointer should be guaranteed to match that state.

Current Behavior

  • .github/workflows/retry-transient-failures.yml throws when it cannot download logs for a failed job exposed through a reusable workflow boundary, such as maintenance / Publish Wiki Master from Maintain Wiki.
  • Because the retry workflow itself fails, maintainers get no rerun decision and no useful summary for that failure.
  • Wiki publication can still leave a stale .github/wiki pointer behind when preview/publication activity and merge timing overlap near a release.

Expected Behavior

  • The transient retry workflow should treat unreadable child-job logs as a handled condition and finish with a deterministic summary instead of crashing.
  • Wiki maintenance failures should still be retried when logs are available and every failed job matches the transient GitHub-side signatures.
  • Release publication should reassert the final wiki pointer from the authoritative released state so concurrency with merged pull requests cannot leave the wiki pointer stale.

Failure Surface

  • .github/workflows/retry-transient-failures.yml
  • .github/workflows/wiki-maintenance-entry.yml
  • .github/workflows/wiki-maintenance.yml
  • The merged release publication path in .github/workflows/changelog.yml

Proposal

Make the transient retry workflow robust when GitHub refuses job-log downloads for reusable-workflow jobs, and add a release-side wiki reconciliation step that guarantees the final pointer is revalidated or republished from the authoritative release state.

Implementation Strategy

  • Isolate the retry workflow's failed-job inspection so it can distinguish inspectable failed jobs from jobs whose logs are unavailable because of GitHub API limitations or nested workflow boundaries.
  • Keep retry decisions deterministic: when a failed job cannot be inspected, emit an explicit summary status instead of throwing.
  • Preserve the existing transient-signature matching for inspectable jobs.
  • Add a bounded release-publication safety net that reruns or revalidates wiki pointer publication against the final release state on main, so the release path can correct any stale pointer left behind by concurrent pull request merges.
  • Add or update focused coverage for reusable-workflow retry handling and for release-side wiki pointer reconciliation.
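
The retry decision described in the strategy above could be kept as a small pure function. The sketch below is one possible shape, not the existing implementation: `Decision`, `decide`, and the two signature strings are placeholders, and the real signature list lives in the workflow configuration. It also assumes that an uninspectable job takes precedence over signature matching, which is a design choice the issue does not settle.

```python
from enum import Enum
from typing import Mapping, Optional, Sequence


class Decision(Enum):
    RETRY = "retried"
    SKIP_NO_MATCH = "skipped: no transient signature matched"
    SKIP_UNINSPECTABLE = "skipped: failed-job logs were not inspectable"


# Placeholder signatures for illustration only.
TRANSIENT_SIGNATURES: Sequence[str] = (
    "The operation was canceled",
    "503 Service Unavailable",
)


def decide(
    failed_job_logs: Mapping[str, Optional[str]],
    signatures: Sequence[str] = TRANSIENT_SIGNATURES,
) -> Decision:
    """Map {job name: log text or None if unreadable} to a retry decision."""
    if not failed_job_logs:
        return Decision.SKIP_NO_MATCH  # nothing failed, nothing to retry
    logs = list(failed_job_logs.values())
    # Assumption: one unreadable log is enough to refuse an automatic retry.
    if any(log is None for log in logs):
        return Decision.SKIP_UNINSPECTABLE
    # Retry only when every failed job matches at least one signature.
    all_transient = all(
        any(signature in log for signature in signatures) for log in logs
    )
    return Decision.RETRY if all_transient else Decision.SKIP_NO_MATCH
```

Because the function takes plain data and returns an enum, the three outcomes can be covered by unit tests without calling the GitHub API.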

Non-goals

  • Redesigning the entire wiki workflow architecture.
  • Moving wiki automation into another repository.
  • Broad refactors to unrelated reports, Pages, or release automation.

Acceptance Criteria

Functional Criteria

  • Retry Transient Workflow Failures no longer fails when a failed job belongs to a reusable workflow and GitHub does not allow direct log download for that child job.
  • The retry workflow emits a deterministic summary explaining whether a run was retried, skipped because no transient signature matched, or skipped because failed-job logs were not inspectable.
  • Maintain Wiki and Maintain Wiki Publication failures are still retried when logs are available and every failed job matches the configured transient GitHub-side signatures.
  • Merged release publication reasserts the final .github/wiki pointer from the authoritative released state so concurrent preview/publication activity cannot leave a stale pointer behind.
  • Re-running the release-side wiki reconciliation is idempotent and stays within the existing bounded publish flow.
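
To make the idempotent and bounded requirements concrete, a release-side reconciliation step might look roughly like the following Python sketch. The function name, its parameters, and the default of three attempts are assumptions for illustration. Reading and publishing the pointer are injected callables, so the sketch says nothing about how the real workflow talks to git or GitHub.

```python
from typing import Callable


def reconcile_pointer(
    read_pointer: Callable[[], str],
    expected_sha: str,
    publish: Callable[[str], None],
    max_attempts: int = 3,
) -> int:
    """Bring the wiki pointer to expected_sha; return how many publishes ran."""
    publishes = 0
    for _ in range(max_attempts):
        if read_pointer() == expected_sha:
            return publishes  # already correct: no further side effects
        publish(expected_sha)
        publishes += 1
    # Final check after the last publish, in case a concurrent write won again.
    if read_pointer() == expected_sha:
        return publishes
    raise RuntimeError(
        f"wiki pointer still stale after {max_attempts} attempts"
    )
```

Running it a second time against an already-correct pointer performs zero publishes, which is the idempotency property the criterion asks for. The attempt cap keeps a persistent race from looping forever.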

Regression Criteria

  • Add or update coverage for reusable-workflow failed-job handling in the transient retry logic.
  • Add or update coverage for the release-side wiki reconciliation path, including a stale-pointer or concurrent-merge scenario.

Architectural / Isolation Criteria

  • MUST: The core logic MUST be isolated into dedicated classes or services instead of living inside command or controller entrypoints.
  • MUST: Responsibilities MUST be separated across input resolution, domain logic, processing or transformation, and output rendering when the change is non-trivial.
  • MUST: The command or controller layer MUST act only as an orchestrator.
  • MUST: The implementation MUST avoid tight coupling between core behavior and CLI or framework-specific I/O.
  • MUST: The design MUST allow future extraction or reuse with minimal changes.
  • MUST: The solution MUST remain extensible without requiring major refactoring for adjacent use cases.
  • MUST: Data gathering or transformation MUST be isolated from filesystem writes or publishing steps.
  • MUST: Generated output ordering and formatting MUST remain deterministic across runs.
  • MUST: Re-running the workflow MUST be idempotent or clearly bounded in its side effects.
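
One way to satisfy the deterministic-output and isolation criteria together is to keep summary rendering as a pure function that returns a string, and leave the actual write to the orchestrating entrypoint. This is a sketch under assumed names (`render_summary` and its parameters are hypothetical):

```python
from typing import Mapping


def render_summary(
    run_id: int, decision: str, job_reasons: Mapping[str, str]
) -> str:
    """Render a stable, human-readable summary for one inspected run."""
    lines = [f"Run {run_id}: {decision}"]
    # Sort by job name so the output does not depend on API response order.
    for name in sorted(job_reasons):
        lines.append(f"- {name}: {job_reasons[name]}")
    return "\n".join(lines) + "\n"
```

The orchestrator would then append the returned text to the file named by `GITHUB_STEP_SUMMARY`, keeping filesystem writes out of the rendering logic.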

Metadata

Status: Merged