Skip to content

Improve deploy-time error classification and bounded recovery #142

@ewega

Description

@ewega

Problem

gh devlake deploy local and gh devlake deploy azure already recover a few known failure modes, but deployment resilience is still uneven.

Today the CLI can:

  • auto-run az login when Azure auth is missing
  • start a stopped MySQL server during Azure deploy
  • purge a soft-deleted Key Vault before retrying Azure deployment
  • save partial Azure state so cleanup still works after mid-flight failure
  • print a friendlier message for some local Docker port conflicts

But local deploy recovery is still too narrow for real-world edge cases. In particular, Docker bind failures like:

  • ports are not available
  • address already in use
  • failed programming external connectivity

can bypass the current friendly path, dump raw compose errors, and force the user to manually inspect and retry. The CLI also lacks a shared deploy-time error taxonomy and a consistent bounded retry story across local and Azure flows.

Proposed Solution

Add a deterministic deploy recovery layer for known failure classes. This issue is intentionally non-agentic and does not depend on Copilot SDK.

The CLI should:

  1. classify known deploy-time failures
  2. apply only safe, explicit repairs
  3. retry at most once
  4. print exactly what it detected, what repair it applied, and what happened next

Scope

Local deploy

  • Expand Docker bind/port conflict detection to catch additional variants, including:
    • ports are not available
    • address already in use
    • failed programming external connectivity
  • Keep arbitrary custom ports out of scope for now.
  • Instead, support only the already-known alternate local port bundle:
    • backend: 8085
    • grafana: 3004
    • config UI: 4004
  • For official and fork local deploy flows, allow a bounded fallback from the default bundle (8080/3002/4000) to the alternate bundle (8085/3004/4004), then retry once.
  • Preserve the current custom flow as a manual/user-controlled path.
  • Improve remediation output when the CLI can identify the conflicting container or compose file.

Azure deploy

  • Formalize the existing bounded-recovery behavior under a clearer error-classification model.
  • Keep existing safe repairs first-class and visible in output:
    • missing Azure auth -> az login
    • stopped MySQL -> start and continue
    • soft-deleted Key Vault -> purge and continue
    • migration still warming up -> bounded wait with clear status text
  • Do not add silent infinite retries or broad catch-all retry loops.

Architecture Notes

  • Reuse the repo's existing assumption that local discovery/start/status know two well-known local port bundles (8080/3002/4000 and 8085/3004/4004) instead of inventing arbitrary port support.
  • Keep repair actions as deterministic Go code in the CLI.
  • Prefer a shared helper/classifier over scattering more substring matching across commands.
  • Ensure follow-on commands (status, start, discovery-driven configure flows) continue to work against whichever healthy endpoint actually comes up.

Likely Files

File Change
cmd/deploy_local.go Broaden local error classification, bounded port fallback, retry/output flow
cmd/deploy_azure.go Normalize bounded-recovery reporting and retry behavior
cmd/start.go Keep companion URL handling aligned with the fallback port bundle
cmd/status.go Ensure status output reflects the actual healthy local endpoint after fallback
internal/devlake/discovery.go Keep discovery aligned with supported local port bundles
docs/deploy.md Document deterministic recovery behavior and manual fallback steps
docs/status.md / README.md Update local endpoint and troubleshooting guidance if behavior changes

Acceptance Criteria

  • Local deploy recognizes the common Docker bind-error variants listed above and surfaces a friendly remediation path.
  • When the default local port bundle is unavailable, official/fork flows can recover via the alternate well-known bundle (8085/3004/4004) with a single bounded retry.
  • custom local deploy remains manual; no arbitrary port-management feature is introduced.
  • Recovery output clearly states: detected failure, chosen repair, retry attempt, and final outcome.
  • Azure deploy keeps existing safe repairs but reports them through a clearer bounded-recovery flow.
  • There are no unbounded or silent retry loops.
  • go build ./..., go test ./..., and go vet ./... pass.
  • Docs/help text are updated where behavior or guidance changes.

Dependencies

Blocks: #143 — preview agentic self-healing deployment workflow

Target Version

v0.4.1 — near-term deployment resilience and UX improvement that belongs with the post-v0.4.0 hardening line, even though it also supports the later AI operations roadmap.

References

  • cmd/deploy_local.go — current compose-up path and narrow port-conflict handling
  • cmd/deploy_azure.go — existing bounded Azure repairs already in place
  • cmd/start.go — current well-known local companion URL mapping
  • cmd/status.go — current endpoint discovery/status reporting
  • internal/devlake/discovery.go — current local endpoint discovery logic

Metadata

Metadata

Assignees

Labels

enhancementNew feature or request

Type

No type

Projects

No projects

Relationships

None yet

Development

No branches or pull requests

Issue actions