-
Notifications
You must be signed in to change notification settings - Fork 292
WIP: gRPC Best practices. #196
Changes from all commits
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
| Original file line number | Diff line number | Diff line change | |||||||||||||||||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| @@ -0,0 +1,105 @@ | |||||||||||||||||||||||||||
| # Evolving gRPC APIs — Best Practices | |||||||||||||||||||||||||||
|
|
|||||||||||||||||||||||||||
| This document covers rules and best practices for evolving gRPC APIs to allow | |||||||||||||||||||||||||||
| safe cross-version usage. | |||||||||||||||||||||||||||
|
|
|||||||||||||||||||||||||||
| ## High Level Guidance | |||||||||||||||||||||||||||
|
|
|||||||||||||||||||||||||||
| The most important takeaway from this document should be that changing protocol | |||||||||||||||||||||||||||
| buffers used across software releases requires careful planning and analysis of | |||||||||||||||||||||||||||
| behavior both in future releases and past releases. Every change to a protobuf | |||||||||||||||||||||||||||
| defintion should include an analysis of how future releases will cope with the | |||||||||||||||||||||||||||
| field if it is not present as well as how past releases will behave given that | |||||||||||||||||||||||||||
| they will be blind to the new field. | |||||||||||||||||||||||||||
|
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. We currently require two additional pullapprove acks for any protocol buffer changes: Also the agent is implicitly guaranteed to be upgraded before any other component using the protocol since the In terms of Kata, I think we should document that we require the analysis you mention to be added as a comment by those who are acking the protocol buffer changes. As such, it might be useful to expand this doc to include advice specific for Kata after the general gRPC advice section. |
|||||||||||||||||||||||||||
|
|
|||||||||||||||||||||||||||
| ## Rules | |||||||||||||||||||||||||||
|
|
|||||||||||||||||||||||||||
| ### R1. Never, _EVER_ Re-use Protcol Buffer Field Numbers | |||||||||||||||||||||||||||
|
Member
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Typo => Protocol |
|||||||||||||||||||||||||||
|
|
|||||||||||||||||||||||||||
| Every protocol buffer field has an identifying number. When | |||||||||||||||||||||||||||
| [encoded](https://developers.google.com/protocol-buffers/docs/encoding) on the | |||||||||||||||||||||||||||
| wire, this number and the field's type are the only information that is actually | |||||||||||||||||||||||||||
| passed. The result is that it is _never_ safe to re-use a field number. Protocol | |||||||||||||||||||||||||||
| buffers provides the `reserved` keyword allowing field numbers to be marked as | |||||||||||||||||||||||||||
|
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. buffers provide the ... |
|||||||||||||||||||||||||||
| reserved and prevent them from being re-used. Any field number that has been | |||||||||||||||||||||||||||
| used as part of publicly available code (not _just_ releases; this rule applies | |||||||||||||||||||||||||||
| to anything checked into an official repo) should be considered "spent" and | |||||||||||||||||||||||||||
| marked as reserved if the field is later removed. | |||||||||||||||||||||||||||
|
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I wonder if we could contrive an additional automated check in https://github.com/kata-containers/tests/blob/master/.ci/static-checks.sh#L111 to look at the diff for all
Member
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I love automated check @jodh-intel 😆 |
|||||||||||||||||||||||||||
|
|
|||||||||||||||||||||||||||
| ```proto | |||||||||||||||||||||||||||
| reserved 1, 3; | |||||||||||||||||||||||||||
|
|
|||||||||||||||||||||||||||
| string name = 2; | |||||||||||||||||||||||||||
|
|
|||||||||||||||||||||||||||
| int32 size = 4; // <-- The first available field number is 4. | |||||||||||||||||||||||||||
| ``` | |||||||||||||||||||||||||||
|
|
|||||||||||||||||||||||||||
| ### R2. Avoid Introducing and Enable a Feature in the same Release | |||||||||||||||||||||||||||
|
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Enabling a Feature in the Same Release |
|||||||||||||||||||||||||||
|
|
|||||||||||||||||||||||||||
| Maintaining downgrade compatibility when launching a new release requires that | |||||||||||||||||||||||||||
| the release not include any features not supported by the previous release. In | |||||||||||||||||||||||||||
| the strictest sense this means that new features cannot be "enabled" until one | |||||||||||||||||||||||||||
|
Member
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. strictest sense, <- (comma) |
|||||||||||||||||||||||||||
| release after they are deemed "working". As a concrete example, when Google | |||||||||||||||||||||||||||
| Compute Engine introduced multi-queue networking we did so one release later | |||||||||||||||||||||||||||
|
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. (tm) ? |
|||||||||||||||||||||||||||
| than when it was "fully supported" in GCE production. This allowed us to safely | |||||||||||||||||||||||||||
|
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. release after it was "fully... |
|||||||||||||||||||||||||||
| roll-back the release when a bug elsewhere was encountered while not bricking | |||||||||||||||||||||||||||
| VMs that had booted after multi-queue had been enabled. | |||||||||||||||||||||||||||
|
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. This sounds like good advice but is going to be difficult to ensure. I don't think any of the gRPC features have corresponding runtime configuration file options so we'd be faced with having to update the |
|||||||||||||||||||||||||||
|
|
|||||||||||||||||||||||||||
| TODO: Find a concrete example from Kata rather than from an external source. | |||||||||||||||||||||||||||
|
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. In case it helps - I think there is some non-backwards compat change in one of the components between the current head, and the v1.1.0 tags. If I roll the runtime back to 1.1.0 for instance I can not launch docker containers. I've not figured out where the break was though - if anybody knows (and I suspect an agent/proxy relationship maybe), please speak up... |
|||||||||||||||||||||||||||
|
|
|||||||||||||||||||||||||||
| ### R3. Enums are Problematic | |||||||||||||||||||||||||||
|
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. This needs to be made into a rule: "Do Not Use Enums Unless There Is No Other Alternative," or similar.
Member
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I agree. I think we need a strict description to forbid developer from doing this, or the programmer will tend to ignore this warning (warnings means OK to lots of developers 😄 ) There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I've not looked at gRPC enums (and I'm presuming this is only relating to gRPC enums) - but is there a certain class of enums we are talking about here. If one had a linear/sequential set of enums that only ever grew on the end for instance, is that problem free? |
|||||||||||||||||||||||||||
|
|
|||||||||||||||||||||||||||
| Related to R2, enum types are problematic as adding additional values in a new | |||||||||||||||||||||||||||
| releases can create undefined behavior in prior releases. | |||||||||||||||||||||||||||
|
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
Member
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. typo: in a new release |
|||||||||||||||||||||||||||
|
|
|||||||||||||||||||||||||||
| ### R4. Rarely Change Field Types | |||||||||||||||||||||||||||
|
|
|||||||||||||||||||||||||||
| Related to the R1, while not always strictly unsafe it is usually unwise to | |||||||||||||||||||||||||||
| change the type of an existing field. In most cases it is better to inroduce a | |||||||||||||||||||||||||||
|
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. s/inroduce/introduce/ |
|||||||||||||||||||||||||||
| new field with the new type and slowly phase out the old field once it has left | |||||||||||||||||||||||||||
| the backward compatibility window. | |||||||||||||||||||||||||||
|
|
|||||||||||||||||||||||||||
| Exceptions to this rule can include converting enum types to int32 or int64, | |||||||||||||||||||||||||||
| provided no values other than the original enum definition are used | |||||||||||||||||||||||||||
|
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. are used. |
|||||||||||||||||||||||||||
|
|
|||||||||||||||||||||||||||
| ## Guidelines | |||||||||||||||||||||||||||
|
|
|||||||||||||||||||||||||||
| ### GL1. Have Clearly Defined Compatibility Windows | |||||||||||||||||||||||||||
|
|
|||||||||||||||||||||||||||
| TODO: Change this guideline to encompass Kata's agreed-upon compatibiltiy | |||||||||||||||||||||||||||
| policy. | |||||||||||||||||||||||||||
|
|
|||||||||||||||||||||||||||
| It is critical to have a clear policy on what upgrade and downgrade paths are | |||||||||||||||||||||||||||
| supported. One policy that has been generally successful for other large | |||||||||||||||||||||||||||
|
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. We're going to need a careful testing matrix for this too. And these compatibility tests are going to have to block a release. /cc @chavafg.
Member
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Kata agent works as grpc server, kata-runtime works as grpc client, so I think an example matrix can be:
Do I understand it right? @jon |
|||||||||||||||||||||||||||
| projects (Google Compute Engine, in particular) it to support unbounded forward | |||||||||||||||||||||||||||
| upgrades and single-release downgrades. | |||||||||||||||||||||||||||
|
|
|||||||||||||||||||||||||||
| ### GL2. The Scope of Compatibility is Only Releases that may Mix | |||||||||||||||||||||||||||
|
|
|||||||||||||||||||||||||||
| Many compatibility problems can been addressed be avoiding mixing components. | |||||||||||||||||||||||||||
|
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. s/be avoiding mixing components/by avoiding mixing component versions/ ? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. can be addressed |
|||||||||||||||||||||||||||
| For example, if each Kata release includes both the agent and the runtime, the | |||||||||||||||||||||||||||
| agent can be | |||||||||||||||||||||||||||
|
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Incomplete sentence :) There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. can be ????? |
|||||||||||||||||||||||||||
|
|
|||||||||||||||||||||||||||
| ### GL3. Redundancy is a Useful Tool | |||||||||||||||||||||||||||
|
|
|||||||||||||||||||||||||||
| While it can be aesthetically displeasing, redundantly encoding similar (or | |||||||||||||||||||||||||||
| identical) information can allow slow migration from one behavior | |||||||||||||||||||||||||||
|
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. And another. I'm rather intrigued to read the rest of this sentence though :) There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. behavior to another.
Member
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. It must be stripped when copy 😄 |
|||||||||||||||||||||||||||
|
|
|||||||||||||||||||||||||||
| ### GL4. Changes to Defaults Should Straddle Compatibility Boundaries | |||||||||||||||||||||||||||
|
|
|||||||||||||||||||||||||||
| As per R2 above, new fields cannot generally be introduced and used in the same | |||||||||||||||||||||||||||
| release. More broadly, new behaviors (typically new features) should not become | |||||||||||||||||||||||||||
| the default until the _oldest_ release not supporting that behavior is no longer | |||||||||||||||||||||||||||
| in the downgrade compatibiltiy window. | |||||||||||||||||||||||||||
|
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Typo. |
|||||||||||||||||||||||||||
|
|
|||||||||||||||||||||||||||
| ### GL5. On-the-Wire Versioning is Rarely Useful | |||||||||||||||||||||||||||
|
|
|||||||||||||||||||||||||||
| Very few protocol buffers at Google include version information as part of the | |||||||||||||||||||||||||||
| wire format (either explicitly or implicitly). For example, the Google Compute | |||||||||||||||||||||||||||
| Engine API is still on "version 1" despite dozens of calls and arguments being | |||||||||||||||||||||||||||
| added since its inception. This is possible because new features are rolled out | |||||||||||||||||||||||||||
| as "additions" to existing APIs, and by following R2 above. | |||||||||||||||||||||||||||
|
Member
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. As kata-agent will keeps running after kata-runtime is upgraded, I think we need a mechanism to let kata-runtime know which field isn't supported by this version of kata-agent, and kata-runtime should downgrade it's grpc message version to appropriate version for better adapt to the agent. so question is, without on wire versioning support, how could kata-runtime do this?
Member
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. One way I think is, we give the on-disk storage config of a POD a version number, and this version number will represent the kata-agent version, kata-runtime will send appropriate message according to the on-disk version(which equals to agent version). What do you think, do you have better practice to share? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I think that @jon is suggesting we should not perform any version checks. A possible solution which was mentioned briefly yesterday is that we could introduce a new Alternatively, rather than re-execing the agent, we could apparently squirt bytes down the wire via gRPC from the runtime to the agent so that the agent could read the byte stream and create a new agent instance out of it (write the bytes to disk and re-exec?). I think @jon mentioned that would be slower though? I'd vote for the re-exec option as it's proven and used by init systems such as systemd and upstart. One issue with both approaches is being able to rewrite the agent binary. This might not be possible or desirable. We could share the agent into all pods images via 9p and then the runtime would only need to update the (host) agent binary file and ask all the running agent instances to re-exec to use it. Again, performance might take a hit here due to 9p. Another concern is of course security - arbitrarily rewriting files isn't ideal. We'd need to think carefully about checksums, security tokens, gpg signatures, etc.
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. How long it takes to get a new agent binary into the VM, and how, I'm not so worried about. This will be a very rare sequence (it's not like something we would be doing every minute, or even daily - this is probably a once every 3 month or more type scenario I would think). There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. For Upstart, we used a pipe - see https://wiki.ubuntu.com/FoundationsTeam/Specs/QuantalUpstartStatefulReexec#State-Passing for the logic.
Member
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. This sound achievable, also as @jodh-intel mentioned: This is also the biggest concern. Thanks @grahamwhaley @jodh-intel for explanation ! |
|||||||||||||||||||||||||||
|
|
|||||||||||||||||||||||||||
| While it can seem attractive to add explicit versions to the protocol this | |||||||||||||||||||||||||||
| typically requires complex version conversion infrastructure that must be | |||||||||||||||||||||||||||
| maintained (and tested) alongside the actual code. | |||||||||||||||||||||||||||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Most of our md docs have a toc at the top or just below the first level title - consider if it is worth adding one to this doc as well.