This repository was archived by the owner on May 12, 2021. It is now read-only.

105 changes: 105 additions & 0 deletions best-practices/grpc-apis.md
# Evolving gRPC APIs — Best Practices
> Most of our md docs have a ToC at the top or just below the first-level title - consider whether it is worth adding one to this doc as well.


This document covers rules and best practices for evolving gRPC APIs to allow
safe cross-version usage.

## High Level Guidance

The most important takeaway from this document should be that changing protocol
buffers used across software releases requires careful planning and analysis of
behavior in both future and past releases. Every change to a protobuf
definition should include an analysis of how future releases will cope with the
field if it is not present, as well as how past releases will behave given that
they will be blind to the new field.
> We currently require two additional pullapprove acks for any protocol buffer changes:
>
> Also, the agent is implicitly guaranteed to be upgraded before any other component using the protocol, since the *.proto files live in the agent repo and other repos need to re-vendor at a later stage to obtain details of the new protocol version.
>
> In terms of Kata, I think we should document that we require the analysis you mention to be added as a comment by those who are acking the protocol buffer changes.
>
> As such, it might be useful to expand this doc to include advice specific to Kata after the general gRPC advice section.


## Rules

### R1. Never, _EVER_ Re-use Protocol Buffer Field Numbers

Every protocol buffer field has an identifying number. When
[encoded](https://developers.google.com/protocol-buffers/docs/encoding) on the
wire, this number and the field's type are the only information that is actually
passed. The result is that it is _never_ safe to re-use a field number. Protocol
buffers provide the `reserved` keyword, allowing field numbers to be marked as
reserved and preventing them from being re-used. Any field number that has been
used as part of publicly available code (not _just_ releases; this rule applies
to anything checked into an official repo) should be considered "spent" and
marked as reserved if the field is later removed.
> I wonder if we could contrive an additional automated check in https://github.com/kata-containers/tests/blob/master/.ci/static-checks.sh#L111 to look at the diff for all *.proto changes and ensure that any removed fields have a corresponding `reserved N` addition?

> I love automated checks @jodh-intel 😆

```proto
reserved 1, 3;

string name = 2;

int32 size = 4; // <-- The first available field number is 4.
```

### R2. Avoid Introducing and Enabling a Feature in the Same Release

Maintaining downgrade compatibility when launching a new release requires that
the release not include any features not supported by the previous release. In
the strictest sense, this means that new features cannot be "enabled" until one
release after they are deemed "working". As a concrete example, when Google
Compute Engine introduced multi-queue networking, we did so one release later
than when it was "fully supported" in GCE production. This allowed us to safely
roll back the release when a bug elsewhere was encountered, while not bricking
VMs that had booted after multi-queue had been enabled.
> This sounds like good advice but is going to be difficult to ensure. I don't think any of the gRPC features have corresponding runtime configuration file options, so we'd be faced with having to update the *.proto files, re-vendor into the other components, and make the feature a NOP in the agent for "a release". We also need to decide on (and document) what we mean by "a release" here - presumably a major or minor x.Y.z release.


TODO: Find a concrete example from Kata rather than from an external source.
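As a sketch of the pattern (the message and field names below are hypothetical, not taken from the Kata protocol), a feature field can be introduced in one release but only acted upon in the next:

```proto
// Hypothetical illustration - introduce in release N, enable in N+1.
message CreateContainerRequest {
  string container_id = 1;

  // Added in release N. Release-N agents parse but deliberately ignore it;
  // only release N+1 acts on it, so rolling back to release N stays safe
  // even for workloads started while the feature was on.
  bool enable_multi_queue = 2;
}
```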
> In case it helps - I think there is some non-backwards-compat change in one of the components between the current head and the v1.1.0 tags. If I roll the runtime back to 1.1.0, for instance, I cannot launch docker containers. I've not figured out where the break was though - if anybody knows (and I suspect an agent/proxy relationship maybe), please speak up...


### R3. Enums are Problematic
> This needs to be made into a rule: "Do Not Use Enums Unless There Is No Other Alternative," or similar.

> I agree. I think we need a strict description to forbid developers from doing this, or programmers will tend to ignore this warning (warnings mean "OK" to lots of developers 😄).

> +1. We could have an automated check to forbid enums too, of course. Thoughts @sboeuf, @bergwolf?

> I've not looked at gRPC enums (and I'm presuming this is only relating to gRPC enums) - but is there a certain class of enums we are talking about here? If one had a linear/sequential set of enums that only ever grew at the end, for instance, is that problem-free?

Related to R2, enum types are problematic, as adding additional values in a new
release can create undefined behavior in prior releases.
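To illustrate the hazard (with a hypothetical enum, not one from the Kata protocol): a value added in a newer release is unknown to older binaries. In proto3, enums are "open" and the raw integer is preserved on decode, but application code switching on the enum falls into an unhandled case:

```proto
// Hypothetical illustration of the enum-evolution hazard.
enum Transport {
  TRANSPORT_UNSPECIFIED = 0; // always reserve 0 for an explicit "unknown"
  TRANSPORT_SERIAL = 1;
  TRANSPORT_VSOCK = 2; // added in release N+1; release-N code has no case for it
}
```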
> - s/can/will/ ?
> - Ugh. Maybe we should convert all enums to ints or strings :) I guess that at least for strings that would not be as efficient and would place a burden on the agent rather than the gRPC library though.

### R4. Rarely Change Field Types

Related to R1, while not always strictly unsafe, it is usually unwise to
change the type of an existing field. In most cases it is better to introduce a
new field with the new type and slowly phase out the old field once it has left
the backward compatibility window.

Exceptions to this rule can include converting enum types to int32 or int64,
provided no values other than those in the original enum definition are used.
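A sketch of the recommended pattern (hypothetical message and field names), widening a size field by adding a new field rather than changing the old one in place:

```proto
// Hypothetical illustration - migrate int32 -> int64 by addition.
message Volume {
  // Old field; writers keep setting it for peers still inside the
  // compatibility window.
  int32 size = 1 [deprecated = true];

  // New field; readers prefer it when set. Once no supported release
  // still reads size, remove it and mark field number 1 reserved (R1).
  int64 size_bytes = 2;
}
```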


## Guidelines

### GL1. Have Clearly Defined Compatibility Windows

TODO: Change this guideline to encompass Kata's agreed-upon compatibility
policy.

It is critical to have a clear policy on what upgrade and downgrade paths are
supported. One policy that has been generally successful for other large
projects (Google Compute Engine, in particular) is to support unbounded forward
upgrades and single-release downgrades.

> We're going to need a careful testing matrix for this too. And these compatibility tests are going to have to block a release.
>
> /cc @chavafg.

> Kata agent works as the gRPC server and kata-runtime works as the gRPC client, so I think an example matrix (where × marks unsupported combinations) can be:
>
> |          | Runtime v1 | Runtime v2 | Runtime v3 | Runtime v4 |
> |----------|------------|------------|------------|------------|
> | Agent v1 |            |            |            |            |
> | Agent v2 |            |            |            |            |
> | Agent v3 | ×          |            |            |            |
> | Agent v4 | ×          | ×          |            |            |
>
> Do I understand it right? @jon

### GL2. The Scope of Compatibility is Only Releases That May Mix

Many compatibility problems can be addressed by avoiding mixing component
versions. For example, if each Kata release includes both the agent and the
runtime, the agent can be assumed to always match the runtime's release,
removing the need for cross-version compatibility between the two.

### GL3. Redundancy is a Useful Tool

While it can be aesthetically displeasing, redundantly encoding similar (or
identical) information can allow slow migration from one behavior to another.
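For instance (hypothetical message and field names), during a migration both the old and the new encoding can be written side by side; old readers use the field they know, while new readers prefer the replacement:

```proto
// Hypothetical illustration of redundant encoding during a migration.
message NetworkInterface {
  // Legacy representation, still populated for older peers.
  int32 driver_code = 1 [deprecated = true];

  // Replacement representation; older peers simply ignore it.
  string driver_name = 2;
}
```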



### GL4. Changes to Defaults Should Straddle Compatibility Boundaries

As per R2 above, new fields cannot generally be introduced and used in the same
release. More broadly, new behaviors (typically new features) should not become
the default until the _oldest_ release not supporting that behavior is no longer
in the downgrade compatibility window.

### GL5. On-the-Wire Versioning is Rarely Useful

Very few protocol buffers at Google include version information as part of the
wire format (either explicitly or implicitly). For example, the Google Compute
Engine API is still on "version 1" despite dozens of calls and arguments being
added since its inception. This is possible because new features are rolled out
as "additions" to existing APIs, and by following R2 above.
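Sketched in proto terms (with hypothetical RPC names, not the actual GCE or Kata services), evolution happens by adding new fields and new methods that older peers never call, while the service itself stays at "v1":

```proto
// Hypothetical illustration - no version information on the wire.
service AgentService {
  // Present since v1.
  rpc CreateContainer(CreateContainerRequest) returns (CreateContainerResponse);

  // Added later. An old client never calls it; an old server that lacks it
  // returns UNIMPLEMENTED, which the caller can treat as "feature absent".
  rpc ResizeVolume(ResizeVolumeRequest) returns (ResizeVolumeResponse);
}
```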
> As kata-agent keeps running after kata-runtime is upgraded, I think we need a mechanism to let kata-runtime know which fields aren't supported by this version of kata-agent, and kata-runtime should downgrade its gRPC message version to an appropriate version to better adapt to the agent. So the question is, without on-wire versioning support, how could kata-runtime do this?

> One way I think is, we give the on-disk storage config of a POD a version number, and this version number will represent the kata-agent version; kata-runtime will send appropriate messages according to the on-disk version (which equals the agent version). What do you think - do you have a better practice to share?

> I think that @jon is suggesting we should not perform any version checks.
>
> A possible solution which was mentioned briefly yesterday is that we could introduce a new RestartAgent gRPC service. When we can guarantee that all agent versions support that service (and assuming that adding new services to the protocol will not stop RestartAgent from working), when the runtime is upgraded, it can call RestartAgent for running pods. That will cause the agent to re-exec itself, reloading a newer binary version (which supports the newer gRPC protocol version). Once done, the runtime is guaranteed that it can use the full (new) gRPC protocol version when talking to the agent. This would potentially require the agent to pass state between the original instance and the newly-exec'd instance, but that should be achievable.
>
> Alternatively, rather than re-execing the agent, we could apparently squirt bytes down the wire via gRPC from the runtime to the agent so that the agent could read the byte stream and create a new agent instance out of it (write the bytes to disk and re-exec?). I think @jon mentioned that would be slower though?
>
> I'd vote for the re-exec option as it's proven and used by init systems such as systemd and upstart.
>
> One issue with both approaches is being able to rewrite the agent binary. This might not be possible or desirable. We could share the agent into all pod images via 9p, and then the runtime would only need to update the (host) agent binary file and ask all the running agent instances to re-exec to use it. Again, performance might take a hit here due to 9p.
>
> Another concern is of course security - arbitrarily rewriting files isn't ideal. We'd need to think carefully about checksums, security tokens, gpg signatures, etc.

> How long it takes to get a new agent binary into the VM, and how, I'm not so worried about. This will be a very rare sequence (it's not something we would be doing every minute, or even daily - this is probably a once-every-three-months-or-more type scenario, I would think). How we hand over the state from the old agent to the new will, I think, be an interesting challenge - but, as you say, it should be achievable.

> This sounds achievable. Also, as @jodh-intel mentioned:
>
> > Another concern is of course security - arbitrarily rewriting files isn't ideal. We'd need to think carefully about checksums, security tokens, gpg signatures, etc.
>
> This is also the biggest concern. Thanks @grahamwhaley @jodh-intel for the explanation!


While it can seem attractive to add explicit versions to the protocol, this
typically requires complex version conversion infrastructure that must be
maintained (and tested) alongside the actual code.