Skip to content

TopoLVM needs startup buffer to wait on dns pod startup#104

Merged
ggiguash merged 4 commits intomicroshift-io:mainfrom
copejon:buffer-topolvm-startup
Oct 30, 2025
Merged

TopoLVM needs startup buffer to wait on dns pod startup#104
ggiguash merged 4 commits intomicroshift-io:mainfrom
copejon:buffer-topolvm-startup

Conversation

@copejon
Copy link
Copy Markdown
Contributor

@copejon copejon commented Oct 23, 2025

Resolves #94

Topolvm pods need a startup grace period to allow the dns pod to become ready. Otherwise, topolvm pods fall into a backoff loop very early and can take up to half an hour to stabilize.

Summary by CodeRabbit

  • Chores
    • Tuned health checks and added startup probes to improve TopoLVM component startup resilience and failure detection.
    • Extended readiness and liveness probe timings for the controller to reduce false restarts during initialization.
    • Added startup probe for node components to better handle longer initialization.
    • Fixed deployment replica count to 1 for predictable resource usage.

@copejon copejon requested a review from a team as a code owner October 23, 2025 20:58
@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented Oct 23, 2025

Walkthrough

Patch TopoLVM manifests: adjust controller liveness/readiness probe timing and add startupProbe; add startupProbe to topolvm-node DaemonSet; ensure Deployment replicas remain set to 1 and include new 03-topolvm.yaml in generated kustomization resources.

Changes

Cohort / File(s) Change Summary
Manifest post-processing script
src/topolvm/generate_manifests.sh
Add generation of 03-topolvm.yaml to kustomization resources and apply probe patches: set topolvm-controller readinessProbe.timeoutSeconds=3, readinessProbe.failureThreshold=3, readinessProbe.periodSeconds=60, livenessProbe.failureThreshold=3; add startupProbe to controller (HTTP GET /healthz, periodSeconds=60, timeoutSeconds=3, failureThreshold=3). Preserve replicas patch (replicas=1).
TopoLVM manifest asset
src/topolvm/assets/03-topolvm.yaml
Add startupProbe to topolvm-node DaemonSet containers (HTTP GET /healthz on port healthz, failureThreshold=60, periodSeconds=2, timeoutSeconds=3) and add startupProbe for controller as above; align probe parameters across containers.

Sequence Diagram(s)

sequenceDiagram
  participant KubeAPI as Kubernetes API
  participant ControllerPod as topolvm-controller Pod
  participant NodePod as topolvm-node Pod
  participant Kubelet as Kubelet

  rect rgb(235,245,255)
    KubeAPI->>ControllerPod: Create Pod (Deployment replicas=1)
    Kubelet->>ControllerPod: startupProbe GET /healthz (period=60s, timeout=3s, failures=3)
    alt startup succeeds
      Kubelet->>ControllerPod: readinessProbe GET /healthz (period=60s, timeout=3s, failures=3)
      Kubelet->>ControllerPod: livenessProbe (failureThreshold=3)
    else startup fails
      Kubelet->>KubeAPI: Restart Pod
    end
  end

  rect rgb(245,255,235)
    KubeAPI->>NodePod: Create Pod (DaemonSet)
    Kubelet->>NodePod: startupProbe GET /healthz (period=2s, timeout=3s, failures=60)
    alt startup succeeds
      Kubelet->>NodePod: readiness/liveness probes (existing)
    else startup fails
      Kubelet->>KubeAPI: Restart Pod
    end
  end
Loading

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

  • Verify probe parameter correctness (periodSeconds, timeoutSeconds, failureThreshold) for controller and node.
  • Confirm HTTP path /healthz and port name healthz exist and are correct in containers.
  • Ensure generate_manifests.sh correctly inserts 03-topolvm.yaml into kustomization and preserves other patches (replicas=1).

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)
Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. You can run @coderabbitai generate docstrings to improve docstring coverage.
✅ Passed checks (4 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title Check ✅ Passed The PR title "TopoLVM needs startup buffer to wait on dns pod startup" accurately reflects the main objective of the changeset. The changes add startup probes and extend readiness/liveness probe configurations across both the topolvm-controller Deployment and topolvm-node DaemonSet, specifically implementing a startup buffer mechanism to allow DNS pod initialization before TopoLVM begins. The title is concise, specific, and directly communicates the primary change without vagueness.
Linked Issues Check ✅ Passed The changes directly address the requirement from issue #94 to prevent intermittent CI failures by ensuring TopoLVM waits for dependent services like DNS before starting. The PR implements startup probes with appropriate failure thresholds (60 for topolvm-node, 3 for topolvm-controller) and extended periodSeconds (60 for controller, 2 for node) to provide the necessary grace period for DNS initialization. The modifications to both generate_manifests.sh and the base manifest (03-topolvm.yaml) implement the startup buffer mechanism needed to avoid prolonged backoff loops and achieve reliable stabilization.
Out of Scope Changes Check ✅ Passed All changes in the PR are narrowly scoped to the startup buffer requirement. Modifications to generate_manifests.sh add patches for startup and probe configurations, while changes to 03-topolvm.yaml implement the corresponding probe enhancements in the base manifest. No unrelated refactoring, style changes, dependency updates, or other out-of-scope modifications are present. The changes also align with the reviewer's feedback to apply modifications via the manifest generation script rather than manual updates.
✨ Finishing touches
  • 📝 Generate docstrings
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment

📜 Recent review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 8be40e4 and 17c45d8.

📒 Files selected for processing (2)
  • src/topolvm/assets/03-topolvm.yaml (2 hunks)
  • src/topolvm/generate_manifests.sh (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • src/topolvm/generate_manifests.sh
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (7)
  • GitHub Check: fedora-bootc
  • GitHub Check: centos10-bootc
  • GitHub Check: ubuntu-rpm2deb
  • GitHub Check: centos9-bootc
  • GitHub Check: isolated-network-ovnk
  • GitHub Check: isolated-network-kindnet
  • GitHub Check: quick-start-and-clean
🔇 Additional comments (1)
src/topolvm/assets/03-topolvm.yaml (1)

868-874: Clarify whether this manifest should be manually edited or regenerated from a script.

The file contains source comments indicating generation from Helm templates (# Source: topolvm/templates/...). The PR comment from @ggiguash explicitly requests that startup buffer changes be applied to src/topolvm/generate_manifests.sh instead of this manifest directly, to avoid losing changes on the next manifest update.

If 03-topolvm.yaml is a derived artifact, the changes should live in the generation script, not here. If it's permanent, confirm that and document the approach.

If these changes belong in the generation script, please verify that src/topolvm/generate_manifests.sh contains corresponding patches for:

  • topolvm-node DaemonSet: startupProbe with 60 failures × 2s intervals
  • topolvm-controller Deployment: startupProbe with 3 failures × 60s intervals, plus readiness/liveness probe adjustments

Also applies to: 989-1016


Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

🧹 Nitpick comments (3)
src/topolvm/assets/03-topolvm.yaml (3)

989-1002: Consider adding startupProbe to topolvm-controller.

The topolvm-controller now has an enhanced livenessProbe and readinessProbe, but lacks a startupProbe. If it also depends on DNS during startup (like topolvm-node), it may benefit from a similar startup grace period to prevent early backoff loops.


767-774: lvmd container: Consider adding startupProbe for consistency.

The lvmd container still uses the older pattern (initialDelaySeconds) without a startupProbe. If it shares DNS dependencies with topolvm-node, adding a startupProbe would improve consistency and resilience during early pod initialization.


885-891: csi-registrar: Old probe pattern retained.

The csi-registrar container still uses initialDelaySeconds: 10 without a startupProbe. If this is intentional (e.g., csi-registrar has no DNS dependency), document the rationale. Otherwise, align it with the modern probe pattern for consistency.

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 72ba79a and b6dc853.

📒 Files selected for processing (1)
  • src/topolvm/assets/03-topolvm.yaml (2 hunks)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (7)
  • GitHub Check: quick-start-and-clean
  • GitHub Check: ubuntu-rpm2deb
  • GitHub Check: fedora-bootc
  • GitHub Check: isolated-network-kindnet
  • GitHub Check: isolated-network-ovnk
  • GitHub Check: centos10-bootc
  • GitHub Check: centos9-bootc
🔇 Additional comments (2)
src/topolvm/assets/03-topolvm.yaml (2)

850-856: Good startup grace period for DNS initialization.

The startupProbe configuration provides a reasonable 120-second window (60 failures × 2-second period) to allow DNS pod readiness. The httpGet probe on /healthz is consistent with the livenessProbe endpoint.


843-849: Verify endpoint path change for livenessProbe.

The livenessProbe now explicitly uses /healthz path and adds failureThreshold: 3. Confirm this endpoint exists and behaves as expected during normal container operation (not just startup).

@ggiguash
Copy link
Copy Markdown
Contributor

@copejon , we have generate_manifests.sh script for TopoLVM manifest generation.
Could we add these changes to the script instead of manually making them? Otherwise, they'll get lost on the next update.

@ggiguash ggiguash marked this pull request as draft October 25, 2025 08:02
@ggiguash ggiguash changed the title topolvm needs startup buffer to wait on dns pod startup TopoLVM needs startup buffer to wait on dns pod startup Oct 25, 2025
@ggiguash ggiguash self-assigned this Oct 28, 2025
@copejon copejon force-pushed the buffer-topolvm-startup branch from b6dc853 to 150c847 Compare October 29, 2025 15:15
@copejon copejon marked this pull request as ready for review October 29, 2025 15:33
Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
src/topolvm/generate_manifests.sh (1)

73-89: Remove duplicate kustomization.yaml generation.

Lines 73-80 and 82-89 generate identical content to the same file; the second write is redundant. Remove the duplicate block.

  cat >"${ASSETS_DIR}/kustomization.yaml" <<'EOF'
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - 01-namespace.yaml
  - 02-cert-manager.yaml
  - 03-topolvm.yaml
EOF
-
-  cat >"${ASSETS_DIR}/kustomization.yaml" <<'EOF'
-apiVersion: kustomize.config.k8s.io/v1beta1
-kind: Kustomization
-resources:
-  - 01-namespace.yaml
-  - 02-cert-manager.yaml
-  - 03-topolvm.yaml
-EOF
📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between b6dc853 and 150c847.

📒 Files selected for processing (1)
  • src/topolvm/generate_manifests.sh (1 hunks)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (6)
  • GitHub Check: isolated-network-kindnet
  • GitHub Check: isolated-network-ovnk
  • GitHub Check: centos10-bootc
  • GitHub Check: centos9-bootc
  • GitHub Check: ubuntu-rpm2deb
  • GitHub Check: fedora-bootc
🔇 Additional comments (1)
src/topolvm/generate_manifests.sh (1)

40-55: Verify yq syntax produces correct manifests.

The topolvm-controller patch uses chained pipe assignments; confirm the generated manifest reflects all probe configuration changes as expected.

Comment thread src/topolvm/generate_manifests.sh Outdated
@copejon copejon force-pushed the buffer-topolvm-startup branch from 150c847 to 0f5f022 Compare October 29, 2025 17:01
@copejon copejon force-pushed the buffer-topolvm-startup branch from 0f5f022 to 8be40e4 Compare October 29, 2025 17:02
Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
src/topolvm/generate_manifests.sh (1)

70-87: Duplicate kustomization.yaml generation.

Lines 71–78 and 80–87 generate identical kustomization.yaml files; the second write overwrites the first. Remove the duplicate (lines 80–87).

Apply this diff:

   cat >"${ASSETS_DIR}/kustomization.yaml" <<'EOF'
 apiVersion: kustomize.config.k8s.io/v1beta1
 kind: Kustomization
 resources:
   - 01-namespace.yaml
   - 02-cert-manager.yaml
   - 03-topolvm.yaml
 EOF
-
-  cat >"${ASSETS_DIR}/kustomization.yaml" <<'EOF'
-apiVersion: kustomize.config.k8s.io/v1beta1
-kind: Kustomization
-resources:
-  - 01-namespace.yaml
-  - 02-cert-manager.yaml
-  - 03-topolvm.yaml
-EOF
📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 150c847 and 8be40e4.

📒 Files selected for processing (1)
  • src/topolvm/generate_manifests.sh (1 hunks)
🔇 Additional comments (2)
src/topolvm/generate_manifests.sh (2)

41-55: Well-structured probe configuration for DNS initialization delay.

The startup and readiness probe tuning (failureThreshold=3, periodSeconds=60) provides a ~180-second window for topolvm-controller to initialize, addressing the PR objective of preventing DNS timing issues. The with() syntax correctly targets the specific container and applies all mutations in one operation.


57-68: Verify DaemonSet startup probe settings align with controller probe logic.

The topolvm-node DaemonSet startupProbe uses a longer total window (failureThreshold=60 × periodSeconds=2 ≈ 120 seconds) compared to the controller. Confirm this asymmetry is intentional (e.g., node workloads may start faster) or if both should use the same retry window.

Comment thread src/topolvm/generate_manifests.sh
@ggiguash
Copy link
Copy Markdown
Contributor

The failure in the ubuntu job is not related to the current changes. I'm approving and mering the PR.

@ggiguash ggiguash merged commit 85d28d3 into microshift-io:main Oct 30, 2025
9 of 10 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Fix occasional CI failures due to TopoLVM not initializing properly

2 participants