Skip to content

Conversation

@rh-amarin
Copy link
Collaborator

@rh-amarin rh-amarin commented Jan 20, 2026

Proposal of changes to API resource conditions and adapter status report

Summary by CodeRabbit

  • Documentation
    • Added a comprehensive Sentinel Pulses design doc describing root API conditions (Available, Ready), mandatory adapter conditions (Applied, Available, Health), generation-aware status reconciliation, deprecation of legacy phase semantics in favor of condition-style reporting, rules to discard Unknown statuses, async reporting and retry behaviors, handling mixed/in-progress generations, sample condition expressions, and a roadmap of next actions.

✏️ Tip: You can customize this high-level summary in your review settings.

@coderabbitai
Copy link
Contributor

coderabbitai bot commented Jan 20, 2026

Walkthrough

Adds a new design document "sentinel-pulses.md" that defines a revised resource status and reporting model. It specifies root conditions (Available, Ready), requires adapters to report three conditions (Applied, Available, Health) with generation-aware fields (observed_generation, observed_time), defines sentinel publication rules (TTL-driven pulses), reconciler semantics for generation mismatches and Unknown statuses, and lists API/sentinel/adapter workflow changes and next actions for implementation. The change is documentation-only; no code or public API modifications are included.

Sequence Diagram(s)

sequenceDiagram
    autonumber
    participant Spec as "Cluster / NodePool Spec"
    participant Sentinel as "Sentinel (pulse generator)"
    participant Maestro as "Maestro API"
    participant Adapter as "Adapter"
    participant K8s as "Kubernetes / External APIs"

    Spec->>Sentinel: spec change or TTL expiry triggers pulse
    Sentinel->>Maestro: POST /pulses (pulse + generation)
    Maestro->>Adapter: notify adapter of pulse (includes generation)
    Adapter->>K8s: reconcile resources (create/update/idempotent ops)
    K8s-->>Adapter: resource status / job events
    Adapter->>Maestro: POST /clusters/{id}/statuses (conditions + observed_generation)
    Maestro->>Maestro: reconcile conditions vs spec.generation / observed_generation
    Maestro->>Spec: update root status (Available / Ready) and persisted adapter status
Loading

Estimated code review effort

🎯 1 (Trivial) | ⏱️ ~2 minutes

Possibly related PRs

Suggested reviewers

  • ciaranRoche
🚥 Pre-merge checks | ✅ 3
✅ Passed checks (3 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title 'HYPERFLEET-386 - doc: sentinel pulses' is directly related to the changeset, which adds documentation for a sentinel pulses design. It clearly identifies the ticket and the primary change (documentation for sentinel pulses feature).
Docstring Coverage ✅ Passed No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 4

🤖 Fix all issues with AI agents
In `@hyperfleet/docs/sentinel-pulses.md`:
- Around line 47-48: The document contains a heading "### Example for adapter
task using k8s resources" that jumps from a higher level (MD001); fix by
changing this h3 to an h2 (replace "###" with "##") or insert the appropriate
missing h2 above it to preserve correct heading hierarchy; locate the heading
text "Example for adapter task using k8s resources" in sentinel-pulses.md and
adjust the markdown level so it follows an h2 properly.
- Line 14: Change the sentence "The adapters performs some actions..." to proper
subject-verb agreement and clearer phrasing; replace it with "Adapters perform
some actions in their 'resources' phase to query state and then report to the
HyperFleet API with at least three mandatory conditions." Update the text in the
document where that sentence appears (look for the exact phrase "The adapters
performs some actions in their \"resources\" phase") to apply this correction.
- Around line 123-131: The fenced code block containing the pseudo-code for
Available/Ready expressions is missing a language tag; update the
triple-backtick fence that precedes the expressions (the block starting with
available=self.items.all(... and ending with generationN)) to include a language
identifier such as text (e.g., change ``` to ```text) so the markdown linter
MD040 is satisfied and the pseudo-code is treated as plain text.
- Around line 20-27: Make the bullet list under the "resources" phase into full
sentences and parallel structure: replace fragments with a subject like "These
actions" or "They" and rewrite each bullet to be a complete sentence (e.g.,
"These actions should be idempotent." and "These actions are calls to APIs with
a payload, such as the k8s API to create objects (CRDs or Jobs), the Maestro
API, or other APIs."). Ensure sub-bullets follow the same sentence structure and
punctuation for consistency.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🤖 Fix all issues with AI agents
In `@hyperfleet/docs/sentinel-pulses.md`:
- Around line 94-98: The "Another alternative would be to make reporting from
adapter configurable with some sort of post-condition" bullet is mis‑indented;
adjust its indentation so it matches the other sub-bullets under the numbered
item (the lines starting with "- `Applied=True`" and "- `Available=Unknown`") so
it is at the same list level as those dashes and renders as a peer sub-bullet of
the "1. Adapter starts a new adapter task" item.
- Line 63: The markdown heading "#### K8s jobs" triggers a level jump (MD001);
either promote that heading to "### K8s jobs" or add the missing parent "###"
heading immediately above it so the heading levels are sequential; update the
heading text "K8s jobs" accordingly in sentinel-pulses.md to fix the lint error
and ensure proper document structure.
♻️ Duplicate comments (2)
hyperfleet/docs/sentinel-pulses.md (2)

36-36: Grammar: “adapters performs” → “adapters perform”.
Same issue flagged previously; please apply the prior suggestion.


119-125: Add a language tag to the fenced block (MD040).
Specify a language (e.g., text) for the pseudo‑code.

✏️ Suggested edit
-```
+```text
 available=self.items.all(i, 
   i.observed_generation == self.items[0].observed_generation && 
   i.conditions.exists(c, c.type == 'Available' && c.status == 'True')
 )
 ready= available && self.items.all(i, i.observed_generation == generationN)
</details>

</blockquote></details>

</blockquote></details>

<!-- This is an auto-generated comment by CodeRabbit for review status -->

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 4

🤖 Fix all issues with AI agents
In `@hyperfleet/docs/sentinel-pulses.md`:
- Line 148: Remove the extra closing double quote in the sentence containing
"Available"" so it reads "Available" (i.e., replace the token "Available"" with
"Available"); update the sentence where the status example appears to ensure
only a single closing quote is present.
- Line 14: Replace the unhyphenated compound modifier "root level conditions"
with "root-level conditions" in the sentinel-pulses.md content; search for the
exact phrase "root level conditions" (used as a modifier before a noun) and
update it to the hyphenated form "root-level conditions" to follow standard
compound adjective style.
- Line 76: Replace the phrase "Since we can not recreate jobs, if we allow it
finishing with failure, the adapter will report `Available=False` always" with
the corrected wording using "cannot": "Since we cannot recreate jobs, if we
allow it finishing with failure, the adapter will report `Available=False`
always" so the document uses the preferred "cannot" form (edit the markdown
sentence that currently starts "Since we can not recreate jobs").
- Line 81: Replace the incorrect verb form "should be manage" with the correct
passive past participle "should be managed" in the sentence that reads "Applying
a CRD or manifest (like namespace manifest) should be manage by the same
lifecycle management as for jobs." so it reads "Applying a CRD or manifest (like
namespace manifest) should be managed by the same lifecycle management as for
jobs."
♻️ Duplicate comments (4)
hyperfleet/docs/sentinel-pulses.md (4)

36-36: Fix subject-verb agreement.

"The adapters performs" should be "adapters perform" (subject-verb agreement) and "in order to query" can be simplified to "to query".

✏️ Proposed fix
-The adapters performs some actions in their "resources" phase in order to query the state of the resource and then reports to the HyperFleet API with at least 3 mandatory conditions:
+Adapters perform some actions in their "resources" phase to query the state of the resource and then report to the HyperFleet API with at least 3 mandatory conditions:

63-63: Fix heading level jump.

This h4 heading jumps from h2, violating markdown hierarchy. Promote to h3 or add a missing h3 parent heading above it.

✏️ Proposed fix
-#### K8s jobs
+### K8s jobs

99-99: Fix list indentation.

The "Another alternative..." bullet is mis-indented. It should be indented 6 spaces to align with its peer sub-bullets.

✏️ Proposed fix
-    - Another alternative would be to make reporting from adapter configurable with some sort of post-condition
+      - Another alternative would be to make reporting from adapter configurable with some sort of post-condition

121-127: Add language tag to code block.

The fenced code block should specify a language identifier for proper rendering and linting compliance.

✏️ Proposed fix
-```
+```text
 available=self.items.all(i, 
   i.observed_generation == self.items[0].observed_generation && 
   i.conditions.exists(c, c.type == 'Available' && c.status == 'True')
 )
 ready= available && self.items.all(i, i.observed_generation == generationN)
 ```

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Fix all issues with AI agents
In `@hyperfleet/docs/sentinel-pulses.md`:
- Around line 137-151: Edit the rules section so the contradictory phrase
"Available only transitions to True" is rewritten as two clear rules: "Available
transitions to True when ..." and "Available transitions to False when ...";
update the corresponding bullets that describe conditions (e.g., the bullet
about all adapters at same generation and the bullet about adapters reporting
older observed_generation) to be split and labeled under the new "transitions to
True" and "transitions to False" headings, and remove the word "only" to avoid
implying impossibility of False transitions.
♻️ Duplicate comments (8)
hyperfleet/docs/sentinel-pulses.md (8)

14-14: Hyphenate compound modifier.
“root level conditions” → “root-level conditions”.


37-37: Fix subject–verb agreement and tighten phrasing.
“The adapters performs … in order to” → “Adapters perform … to”.


64-64: Heading level jump (MD001).
Promote “#### K8s jobs” to ### or add the missing parent heading.


77-77: Use “cannot” instead of “can not.”


82-82: Fix verb form.
“should be manage” → “should be managed”.


97-101: Align list indentation (MD005).
The “Another alternative…” bullet is mis‑indented.


122-128: Add a language tag to the fenced block (MD040).
Use ```text for the pseudo‑code.


149-149: Fix double quote typo.
“Available""” → “Available”.

Copy link
Contributor

@ciaranRoche ciaranRoche left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Couple of points :

  • Moving away from phase we introduce top level available/ready conditions. These differ to adapter conditions. Can we get a clear definition of these conditions both top level object and adapter level. (Might need to update some other docs to ensure they are aligned)
  • With the pattern proposed, if the adapter reports available unknown on retries, does this mean last_updated_time is not updated and sentinel will pulse every 10s, is there a risk of event storms?
  • Similarly when discarding, unknown will the last known status remain?

@rh-amarin rh-amarin force-pushed the sentinel-pulses branch 2 times, most recently from 464cc74 to de0d132 Compare January 20, 2026 15:58
Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

🤖 Fix all issues with AI agents
In `@hyperfleet/docs/sentinel-pulses.md`:
- Line 174: Replace the ungrammatical phrase "conditions what will be always
present" with the corrected phrase "conditions that will always be present" in
the sentence at the noted location (look for the exact original phrase to find
the spot) so the line reads "...Create tests for new conditions that will always
be present."
- Line 3: Replace the incorrect phrase "how API and adapter behaves" with "how
the API and adapters behave" in the document text (locate the sentence
containing "how API and adapter behaves"); ensure the definite article "the" is
added before "API", "adapter" is pluralized to "adapters", and the verb changes
from "behaves" to "behave" so subject-verb agreement and clarity are corrected.
- Around line 76-78: Update the "We will create k8s jobs that will retry
indefinitely until success" paragraph to explicitly state this is a known MVP
limitation and not an oversight: acknowledge the indefinite retry/absence of TTL
deletion as a temporary trade-off, reference the existing "lifecycle management"
and the phrase "errors or TTL exceeded" already mentioned elsewhere in the doc,
and add a short note that full lifecycle management (including TTL-based cleanup
and proper job deletion) is planned for the post-MVP implementation to mitigate
resource exhaustion and change the adapter behavior.
♻️ Duplicate comments (8)
hyperfleet/docs/sentinel-pulses.md (8)

77-77: Use "cannot" instead of "can not".

"cannot" is the standard form in formal writing.

✏️ Proposed fix
-  - Since we can not recreate jobs, if we allow it finishing with failure, the adapter will report `Available=False` always
+  - Since we cannot recreate jobs, if we allow it finishing with failure, the adapter will report `Available=False` always

82-82: Fix passive voice verb form.

"should be manage" should be "should be managed" (past participle required for passive voice).

✏️ Proposed fix
-Applying a CRD or manifest (like namespace manifest) should be manage by the same lifecycle management as for jobs.
+Applying a CRD or manifest (like namespace manifest) should be managed by the same lifecycle management as for jobs.

14-14: Hyphenate compound modifier.

"root level conditions" should be "root-level conditions" when used as a compound modifier before a noun.

✏️ Proposed fix
-- Introduce root level conditions on API resources (clusters, nodepools)
+- Introduce root-level conditions on API resources (clusters, nodepools)

100-100: Fix list indentation.

The bullet "Another alternative would be to make reporting..." is mis-indented. It should align with the other sub-bullets under item 1.

✏️ Proposed fix
-    - Another alternative would be to make reporting from adapter configurable with some sort of post-condition
+      - Another alternative would be to make reporting from adapter configurable with some sort of post-condition

37-37: Fix subject-verb agreement.

"The adapters performs" should be "Adapters perform" for correct subject-verb agreement.

✏️ Proposed fix
-The adapters performs some actions in their "resources" phase in order to query the state of the resource and then reports to the HyperFleet API with at least 3 mandatory conditions:
+Adapters perform some actions in their "resources" phase to query the state of the resource and then report to the HyperFleet API with at least 3 mandatory conditions:

64-64: Fix heading level jump.

The heading "#### K8s jobs" jumps from h2 to h4, skipping h3. Either promote this to h3 or add a missing h3 parent heading above it.

✏️ Proposed fix
-#### K8s jobs
+### K8s jobs

144-155: Clarify Available transition rules (remove contradiction).

Rule 4 states "Available only transitions to True" but rule 5 describes transitions to False. This phrasing is contradictory in a specification. Rewrite as separate, clear rules: "Available transitions to True when..." and "Available transitions to False when...".

✏️ Proposed fix
-1. Available only transitions to True
+1. Available transitions to True when:
     - If all the adapters are at the same generation report Available=True
     - Could be that the adapter's reports observed_generation < generation
       - This can happen when updating quickly a spec
       - API will be getting "older" adapter statuses first
-      - It can be "Available"" for that generation, but needs reconciliation
+      - It can be "Available" for that generation, but needs reconciliation
       - Eventually it will get newer responses
-1. Available transitions to False
+1. Available transitions to False when:
     - If any adapter of the `observed_generation` in the condition reports `Available=False`
     - But not for adapters reporting `Available=false` at other `observed_generation`
     - This keeps `Available=true` for `observed_generation` while `Ready=False`
       - Meaning, that the last known generation is still marked available

122-128: Add language tag to fenced code block.

Specify a language identifier (e.g., text) for the pseudo-code expression to satisfy markdown linting.

✏️ Proposed fix
-```
+```text
 available=self.items.all(i, 
   i.observed_generation == self.items[0].observed_generation && 
   i.conditions.exists(c, c.type == 'Available' && c.status == 'True')
 )
 ready= available && self.items.all(i, i.observed_generation == generationN)
 ```

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 4

🤖 Fix all issues with AI agents
In `@hyperfleet/docs/sentinel-pulses.md`:
- Around line 140-143: Update the doc to clarify that adapter-reported
"Available=Unknown" is a transient, in-progress signal that the API will ignore
when computing the root Available condition (it should not change the root
condition or update the adapter's last_updated_time), but the API may persist
those Unknown reports to a status log/tracing table for debugging; explicitly
state how adapters should communicate in-progress work (e.g., via "Unknown" for
transient state plus separate progress/events or a dedicated "InProgress" field)
and note that discarding means it is not used in health aggregation while still
being available for traceability.
- Around line 194-217: This section duplicates the API behavior rules for
computing `Available` and `Ready` conditions; remove the repeated bulleted rules
under the "Compute `Available` and `Ready` conditions" block and replace them
with a short cross-reference to the earlier ruleset (the existing section that
defines API acceptance of observed_generation, discarding `Available=Unknown`,
Available/Ready transition semantics). Keep the unique phrasing "Compute
`Available` and `Ready` conditions" as the anchor and ensure the reference
points readers to the original rules so maintenance is centralized and the
semantics for `Available` and `Ready` remain consistent.
- Around line 135-159: The numbered list in the sentinel-pulses rules is using
"1." for every item which confuses readers; update the list so the five distinct
rules are numbered sequentially 1–5 and ensure each rule’s heading matches the
content (1: "API should only accept conditions for same or increased
observed_generation", 2: "API should discard conditions with Available=Unknown",
3: "Available transitions to True", 4: "Available transitions to False", 5:
"Ready transitions to False"); keep the existing bullets/text under each rule
intact but adjust numbering and any follow-up sub-bullets so the structure is
clear and linear.
- Line 261: Replace the advice to set backoffLimit: 1000000 with guidance to use
Kubernetes Pod Failure Policy and related settings: recommend configuring Pod
Failure Policy (with restartPolicy: Never) to ignore transient platform failures
and only count application errors, add activeDeadlineSeconds to bound total job
runtime, and for indexed jobs prefer backoffLimitPerIndex; update examples to
follow repo patterns (use backoffLimit: 3 as in architecture-summary.md and
maestro-adapter-integration-strategy.md) instead of suggesting an extremely high
backoffLimit.
🧹 Nitpick comments (5)
hyperfleet/docs/sentinel-pulses.md (5)

14-20: Clarify the structure of the Available condition.

The proposal mentions Available contains an "additional observed_generation property" (line 16), but later sections describe conditions having observed_generation as a standard field for adapter-reported conditions. It's unclear whether:

  • Available is a root-level condition with a special structure
  • All conditions have observed_generation
  • Only adapter conditions vs. root conditions differ

Consider adding a brief explanation or forward reference to clarify the data model, e.g., "Root conditions like Available aggregate adapter-reported conditions, each of which includes observed_generation."


18-18: Replace informal phrasing.

"Kind of a" is too casual for a technical design document.

✏️ Proposed fix
-    - Kind of a "last known good configuration"
+    - Represents the "last known good configuration"

29-29: Use consistent spelling.

The document title uses "behavior" (American spelling) but line 29 uses "behaviour" (British spelling). Standardize throughout the document.

✏️ Proposed fix
-Sentinel behaviour changes slightly, now watches for the resources `Ready` condition
+Sentinel behavior changes slightly, now watches for the resources `Ready` condition

34-35: Justify TTL threshold values.

The TTL values (10sec for Ready==False vs 30min for Ready==True) appear arbitrary. Consider adding a brief rationale for these specific thresholds or noting they are initial values subject to tuning.


265-265: Fix typo.

"retying" should be "retrying".

✏️ Proposed fix
-    - Jobs for adapter tasks always finish with success or keep retying
+    - Jobs for adapter tasks always finish with success or keep retrying

Comment on lines +140 to +143
1. API should discard conditions with `Available=Unknown`
- It can still store in some status log table/file for tracing
- We can think of optimizing by having a post-condition that discards adapter report
- It also doesn't update `last_updated_time` for the adapter
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🛠️ Refactor suggestion | 🟠 Major

Clarify the handling of Available=Unknown.

There's an apparent contradiction:

  • Line 50 states adapters will return Available=Unknown for asynchronous work
  • Lines 140-143 state the API should discard conditions with Available=Unknown

If the API discards these reports, then:

  1. How does the adapter communicate in-progress state?
  2. What's the value of reporting Unknown if it's always discarded?
  3. Should the document clarify that Unknown is a transient state that doesn't update the root Available condition?

Consider adding a note explaining that adapter-reported Unknown status is discarded from the root condition calculation but may be useful for debugging/tracing purposes.

🤖 Prompt for AI Agents
In `@hyperfleet/docs/sentinel-pulses.md` around lines 140 - 143, Update the doc to
clarify that adapter-reported "Available=Unknown" is a transient, in-progress
signal that the API will ignore when computing the root Available condition (it
should not change the root condition or update the adapter's last_updated_time),
but the API may persist those Unknown reports to a status log/tracing table for
debugging; explicitly state how adapters should communicate in-progress work
(e.g., via "Unknown" for transient state plus separate progress/events or a
dedicated "InProgress" field) and note that discarding means it is not used in
health aggregation while still being available for traceability.

Detail:
- Change adapter task jobs to retry indefinitely
- Set job `restartPolicy: OnFailure`
- Set `backoffLimit` to a high number like 1000000
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Search for existing Job manifests with backoffLimit
fd -e yaml -e yml . | xargs rg -l "kind: Job" 2>/dev/null | head -10 | xargs rg "backoffLimit" -A 2 -B 2 2>/dev/null

Repository: openshift-hyperfleet/architecture

Length of output: 2717


🌐 Web query:

Kubernetes backoffLimit practical limits and best practices

💡 Result:

Summary (practical limits + best practices)

Key facts

  • backoffLimit is a retry count (default 6). When the count is reached the Job is marked Failed. [1][4]
  • Pod recreations use an exponential backoff (10s, 20s, 40s, …) capped at ~6 minutes. You cannot configure that delay directly. [1][8]
  • For Indexed Jobs you can use backoffLimitPerIndex (per-index retries) and maxFailedIndexes to avoid one bad index consuming the global budget. (GA in recent releases). [1][2]
  • Pod Failure Policy lets you classify failures (Ignore, Count, FailJob, FailIndex) so only relevant failures count toward backoffLimit; requires restartPolicy: Never. [5][6]

Practical limits to choose

  • Small values (1–3): good for short, idempotent tasks or CronJob runs to fail fast and avoid wasted resources. [7]
  • Default (6): reasonable for transient infra blips or flaky external dependencies. [1][4]
  • Large values (>20 or MaxInt): typically not recommended unless you have strong reason and monitoring; long retry budgets can hide systemic failures and waste cluster resources. Use backoffLimitPerIndex instead for indexed workloads. [1][2]

Best practices

  • Prefer restartPolicy: Never when debugging Jobs or when using Pod Failure Policy; otherwise OnFailure works but can complicate counting. [1][5]
  • Combine backoffLimit with activeDeadlineSeconds to cap total run time regardless of retries. activeDeadlineSeconds takes precedence. [3]
  • Use Pod Failure Policy to ignore transient platform-caused failures (spot preemptions, disruptions) and to fail immediately on non-retriable exit codes. [5][6]
  • For large-scale or indexed Jobs use backoffLimitPerIndex + maxFailedIndexes to isolate failing indexes and avoid starving others. [2]
  • Emit logs and preserve failed Pod logs (centralized logging or set restartPolicy appropriately) so you can debug after retries/termination. [1]
  • Monitor Job failure rates and alert on repeated retries (don’t rely on high backoffLimit to mask problems). Treat high retry counts as a signal to fix the underlying error. [4]

Concrete example patterns

  • CronJob that should fail fast: backoffLimit: 1–3 + activeDeadlineSeconds small + good logging. [7][3]
  • Large parallel indexed processing: completionMode: Indexed + backoffLimitPerIndex: 1–2 + maxFailedIndexes to allow progress while bounding bad indexes. [2]
  • Spot/ephemeral nodes: PodFailurePolicy rules to Ignore disruption-related failures + moderate backoffLimit. [5]

Sources
[1] Kubernetes Job docs (backoffLimit behavior, exponential delay).
[2] Kubernetes blog: Backoff limit per index (feature and recommended usage).
[3] Stack Overflow explanation of backoffLimit vs activeDeadlineSeconds.
[4] Baeldung: practical guidance on backoffLimit (defaults and resource implications).
[5] Kubernetes blog/docs: Pod Failure Policy (rules, actions, restartPolicy requirements).
[6] Kubernetes blog: Pod Failure Policy GA guidance.
[7] Community CronJob best-practices (recommendations for small backoffLimit on CronJobs).

If you want, I can: (a) suggest specific backoffLimit/activeDeadlineSeconds values for your workload, or (b) produce example Job manifests illustrating Pod Failure Policy and backoffLimitPerIndex.


Use Pod Failure Policy instead of high backoffLimit for indefinite retries.

A backoffLimit: 1000000 contradicts Kubernetes best practices. Kubernetes documentation explicitly states that large backoffLimit values are "typically not recommended" as they can hide systemic failures and waste cluster resources. The exponential backoff is also capped at ~6 minutes regardless of the limit.

For indefinite retries of transient failures:

  • Use Pod Failure Policy with restartPolicy: Never to classify failures (ignore transient platform failures like spot preemptions; count only application errors)
  • Set activeDeadlineSeconds to cap total job runtime independent of retry count
  • For indexed workloads, use backoffLimitPerIndex to isolate failing indexes

Current repo patterns (architecture-summary.md, maestro-adapter-integration-strategy.md) use backoffLimit: 3, which is more aligned with Kubernetes defaults.

🤖 Prompt for AI Agents
In `@hyperfleet/docs/sentinel-pulses.md` at line 261, Replace the advice to set
backoffLimit: 1000000 with guidance to use Kubernetes Pod Failure Policy and
related settings: recommend configuring Pod Failure Policy (with restartPolicy:
Never) to ignore transient platform failures and only count application errors,
add activeDeadlineSeconds to bound total job runtime, and for indexed jobs
prefer backoffLimitPerIndex; update examples to follow repo patterns (use
backoffLimit: 3 as in architecture-summary.md and
maestro-adapter-integration-strategy.md) instead of suggesting an extremely high
backoffLimit.


The actions performed in the "resources" phase make calls to an API:

- k8s to create k8s objects like CRDs or Job
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
- k8s to create k8s objects like CRDs or Job
- k8s to create k8s objects like CRs or Job

@xueli181114
Copy link
Contributor

/lgtm

1 similar comment
@yasun1
Copy link
Contributor

yasun1 commented Jan 22, 2026

/lgtm

Sentinel will publish messages whenever:

- User changes the cluster/nodepool spec
- Time from adapter `last_report_time` has exceeded some TTL
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you mean observed_time (Take landing-zone adapter as an example)?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The adapter reports with observed_time but is stored in the API as last_report_time, so from the point of view of the Sentinel, it knows nothing about this observed_time

Not the most intuitive thing....

@rh-amarin
Copy link
Collaborator Author

@ciaranRoche

  • With the pattern proposed, if the adapter reports available unknown on retries, does this mean last_updated_time is not updated and sentinel will pulse every 10s, is there a risk of event storms?
  • last_updated_time will not be updated because of an Unknown status report
  • Before, we were reporting False and that was already doing the 10s retry, so this change doesn't change the behavior
  • Our design has had always the "inefficency" of spamming with events while waiting for the last adapter to report True
    • Adapters that already report True keep receiving messages and repeating themselves responding True
    • This can be acceptable if the number of clusters in Ready==False state is low
    • Or we could introduce some backoff in Sentinel, to reduce stress in the future
    • Or change the broadcast of events from the Sentinel to a more selective one (more complex)
  • Similarly when discarding, unknown will the last known status remain?
  • Yes, the last known status remain
  • There is really little value on reporting Unknown
    • That is why I thought on post-conditions to be able to skip report on adapters
    • IMO is just debugging info, or some keep-alive signal that we are not really using today

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🤖 Fix all issues with AI agents
In `@hyperfleet/docs/sentinel-pulses.md`:
- Line 59: Fix the grammar in the list item that currently reads "Does not
exists" in the sentinel-pulses.md content: locate the list item string "Does not
exists" and change it to "Does not exist" so the verb agrees with the subject.
- Around line 262-265: Replace the ambiguous phrase "This makes retries 10x2
until 360seconds" with a clear timing description (e.g., "Retries use
exponential backoff, doubling the delay each attempt up to a maximum delay of
360 seconds") or remove the timing claim if not essential, and correct the typo
in the Acceptance criteria by changing "retying" to "retrying"; update the
sentence under "Acceptance criteria" to read "Jobs for adapter tasks always
finish with success or keep retrying."

Copy link
Contributor

@ciaranRoche ciaranRoche left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm

@ciaranRoche ciaranRoche merged commit 6d7f085 into openshift-hyperfleet:main Jan 22, 2026
1 check passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants