docs: fatal codes, re-init, and retry policy#1818

Merged
toddbaert merged 4 commits into main from
docs/provider-spec-updates
Jan 9, 2026

Conversation

@toddbaert
Member

This PR specifies some provider behavior, specifically around stream health, gRPC retry policy, and FATAL codes.

Specifically, it:

@toddbaert toddbaert requested review from a team as code owners October 30, 2025 16:23
@netlify

netlify Bot commented Oct 30, 2025

Deploy Preview for polite-licorice-3db33c ready!

Name Link
🔨 Latest commit e075313
🔍 Latest deploy log https://app.netlify.com/projects/polite-licorice-3db33c/deploys/696146f2ab838d0008e8a3bf
😎 Deploy Preview https://deploy-preview-1818--polite-licorice-3db33c.netlify.app

@dosubot dosubot Bot added the size:L This PR changes 100-499 lines, ignoring generated files. label Oct 30, 2025
@gemini-code-assist
Contributor

Summary of Changes

Hello @toddbaert, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly enhances the robustness and predictability of flagd provider behavior by formalizing how providers handle stream health, gRPC retry mechanisms, and fatal error conditions. It introduces a standardized retry policy for transient network issues and a critical new feature to recognize and react to non-transient (fatal) gRPC errors, preventing endless reconnection loops. Additionally, the documentation now explicitly outlines provider re-initialization capabilities, ensuring clearer and more reliable provider operations.

Highlights

  • Standardized Retry Policy: A clear gRPC retry policy has been published, to be adopted by all flagd providers, specifically for UNAVAILABLE and UNKNOWN status codes.
  • Fatal Error Handling: A new mechanism has been introduced to mark certain gRPC status codes as FATAL, which will cause providers to cease reconnection attempts for non-transient errors.
  • Provider Re-initialization: The state diagram has been updated to explicitly clarify that flagd providers should support re-initialization, provided they are not in a FATAL state.

Comment on lines +69 to +74
STALE --> NOT_READY: shutdown
ERROR --> READY: reconnected
ERROR --> [*]: shutdown
ERROR --> NOT_READY: shutdown
ERROR --> [*]: Error code == PROVIDER_FATAL

note right of STALE
note left of STALE
Member Author

@toddbaert toddbaert Oct 30, 2025

old:

Image

new:

Image

The main difference is that we make it clear that transitions are possible from a non-fatal ERROR back to NOT_READY. Many implementations already support this, but not all.
I think it makes sense to specify this so we can be consistent.
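
The transitions quoted in the diff excerpt above can be sketched as a small lookup table. This is an illustration only: the `State` enum, the event names (`shutdown`, `reconnected`, `fatal_code`), and the explicit `FATAL` member standing in for the diagram's terminal `[*]` state are assumptions made for the example, not names taken from the spec.

```python
from enum import Enum


class State(Enum):
    NOT_READY = "NOT_READY"
    READY = "READY"
    STALE = "STALE"
    ERROR = "ERROR"
    FATAL = "FATAL"  # assumption: stands in for the diagram's terminal [*] state


# Only the transitions quoted in the diff excerpt are listed here.
# Event names are hypothetical labels chosen for this sketch.
TRANSITIONS = {
    (State.STALE, "shutdown"): State.NOT_READY,
    (State.ERROR, "reconnected"): State.READY,
    (State.ERROR, "shutdown"): State.NOT_READY,  # allows re-initialization
    (State.ERROR, "fatal_code"): State.FATAL,
}


def next_state(state: State, event: str) -> State:
    """Return the next state; FATAL is terminal and unknown events are ignored."""
    if state is State.FATAL:
        return State.FATAL
    return TRANSITIONS.get((state, event), state)
```

With this table, a provider in a non-fatal `ERROR` state that is shut down returns to `NOT_READY` and can be initialized again, while a provider that has reached `FATAL` stays there.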

Contributor

I had the impression that PROVIDER_FATAL can only happen during initialization, where the error can be surfaced and handled by the caller.

With the current proposal, PROVIDER_FATAL can be a result of a failing sync. As a user, it seems that I'll get the default value and an error. Am I supposed to handle this error and exit the program?

Member Author

@toddbaert toddbaert Dec 8, 2025

@tangenti I need to make some updates to reflect the discussion here.

We decided the best path forward is to provide an option to enumerate the status codes that a user considers FATAL. If one of those codes is received, whether on the initial connection or later, the program can exit (or build a new provider). We believed this was the best trade-off between usability and complexity, and it's easy to understand: select what you want to consider FATAL, and take the action you want when those codes are received. By marking a code as FATAL, you are telling the provider that this code represents a non-transient error state.

I will make the related updates.
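
A minimal sketch of that idea, assuming status codes are compared by their gRPC names: the function `handle_stream_error` and the `on_fatal` callback are hypothetical and only illustrate the decision, not any provider's actual API.

```python
from typing import Callable, Iterable


def handle_stream_error(
    status_code: str,
    fatal_status_codes: Iterable[str],
    on_fatal: Callable[[str], None],
) -> bool:
    """Return True if the provider should keep trying to reconnect.

    Codes listed in fatal_status_codes are treated as non-transient:
    the callback is invoked and reconnection stops.
    """
    if status_code in set(fatal_status_codes):
        on_fatal(status_code)  # e.g. exit the program or build a new provider
        return False
    return True


# Usage: the caller decides which codes are fatal and what to do about them.
fatal_seen = []
keep_retrying = handle_stream_error(
    "UNAUTHENTICATED", ["UNAUTHENTICATED", "PERMISSION_DENIED"], fatal_seen.append
)
```

With an empty list, no code is ever treated as fatal and the provider keeps reconnecting.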

Member Author

I've included this.

Member Author

Another reason to go this way is it saves us from keeping track of even more state.

Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request updates the provider specification to clarify behavior around stream health, gRPC retry policies, and fatal error codes. The changes include updating the state diagram, defining a gRPC retry policy, and introducing the concept of fatal status codes that stop reconnection attempts. The documentation is clearer as a result. I've found a few issues: an invalid JSON example for the retry policy, an inconsistency in the number of retries described, and a minor stylistic point.

Comment thread docs/reference/specifications/providers.md Outdated
Comment thread docs/reference/specifications/providers.md Outdated
Comment thread docs/reference/specifications/providers.md Outdated
While the provider is in state `STALE` the provider resolves values from its cache or stored flag set rules, depending on its resolver mode.
When the time since the last disconnect first exceeds `retryGracePeriod`, the provider emits `ERROR`.
The provider attempts to reconnect indefinitely, with a maximum interval of `retryBackoffMaxMs`.
```json
Member Author

This is standard retryPolicy, accepted in this JSON format by most gRPC implementations.
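
For readers unfamiliar with the format, here is a hedged sketch of how such a policy is typically assembled as a gRPC service config. The helper name `build_service_config`, the service name passed to it, and the attempt and backoff values are assumptions for illustration; only the retryable codes `UNAVAILABLE` and `UNKNOWN` come from this PR's summary.

```python
import json


def build_service_config(service: str, max_backoff_s: int = 5) -> str:
    """Build a gRPC service config JSON string with a retryPolicy.

    The numeric values below are placeholders, not the values from the spec.
    """
    config = {
        "methodConfig": [
            {
                # Apply the policy to every method of the given service.
                "name": [{"service": service}],
                "retryPolicy": {
                    "maxAttempts": 3,
                    "initialBackoff": "1s",
                    "maxBackoff": f"{max_backoff_s}s",
                    "backoffMultiplier": 2.0,
                    "retryableStatusCodes": ["UNAVAILABLE", "UNKNOWN"],
                },
            }
        ]
    }
    return json.dumps(config)


# "example.Service" is a placeholder service name.
service_config_json = build_service_config("example.Service")
```

In the Python gRPC library, a string like this can be supplied through the `grpc.service_config` channel option when the channel is created; other gRPC implementations accept the same JSON through their own configuration hooks.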

| offlineFlagSourcePath | FLAGD_OFFLINE_FLAG_SOURCE_PATH | offline, file-based flag definitions, overrides host/port/targetUri | string | null | file |
| offlinePollIntervalMs | FLAGD_OFFLINE_POLL_MS | poll interval for reading offlineFlagSourcePath | int | 5000 | file |
| contextEnricher | - | sync-metadata to evaluation context mapping function | function | identity function | in-process |
| fatalStatusCodes | - | a list of gRPC status codes, which will cause streams to give up and put the provider in a PROVIDER_FATAL state | array | [] | rpc & in-process |
Member Author

This is the only new option - the other changes are just whitespace.

Comment thread docs/reference/specifications/providers.md Outdated
Comment thread docs/reference/specifications/providers.md
Comment thread docs/reference/specifications/providers.md Outdated
@toddbaert
Member Author

@aepfli @alexandraoberaigner made changes from your feedback, plz re-review.

@sonarqubecloud

sonarqubecloud Bot commented Dec 1, 2025

Signed-off-by: Todd Baert <todd.baert@dynatrace.com>
@toddbaert toddbaert force-pushed the docs/provider-spec-updates branch from 7cb3e07 to 5760050 Compare January 9, 2026 18:08
Signed-off-by: Todd Baert <todd.baert@dynatrace.com>
Signed-off-by: Todd Baert <todd.baert@dynatrace.com>
Signed-off-by: Todd Baert <todd.baert@dynatrace.com>
@sonarqubecloud

sonarqubecloud Bot commented Jan 9, 2026

@toddbaert toddbaert merged commit ace1a7c into main Jan 9, 2026
19 checks passed

Labels

size:L This PR changes 100-499 lines, ignoring generated files.

5 participants