HTTP/2 support in Knative
PR #2539 introduced the basic ability to use a Knative Service with HTTP/2. There have been numerous discussions on how to "properly" support HTTP/2 (and other stream-based protocols) in Knative. This document focuses on the different aspects of HTTP/2 only and how we could implement it to the benefit of our users. Some of this might also be applicable to other protocols such as WebSockets or gRPC.
Why HTTP/2?
The official spec outlines the key differences to HTTP/1.x as follows: HTTP/2
- is binary, instead of textual
- is fully multiplexed, instead of ordered and blocking
- can therefore use one connection for parallelism
- uses header compression to reduce overhead
- allows servers to “push” responses proactively into client caches
The spec further explains why single-connection parallelism is superior:
In the past, browsers have used multiple TCP connections to issue parallel requests. However, there are limits to this; if too many connections are used, it’s both counter-productive (TCP congestion control is effectively negated, leading to congestion events that hurt performance and the network), and it’s fundamentally unfair (because browsers are taking more than their share of network resources).
At the same time, the large number of requests means a lot of duplicated data “on the wire”.
Both of these factors means that HTTP/1.1 requests have a lot of overhead associated with them; if too many requests are made, it hurts performance.
For the purpose of this document, I'll divide these properties into two buckets:
- Wire-protocol properties: These include the binary nature of the protocol itself (1) and the compression of the headers (4).
- Connection properties: These include the multiplexing (2/3) and server "push" (5) properties.
What do we need to do to take advantage of these properties?
Taking advantage of the wire-protocol properties is relatively simple. Given that our routing layers and applications correctly support HTTP/2 end-to-end, we get these for "free". This should already have been done and tested with #2539.
Supporting the connection properties, though, is a different beast. Single-connection parallelism in particular could be a deal-breaker for autoscaling and routing in Knative. I therefore propose different "modes" of HTTP/2 support, which let the user provide some additional information so Knative can decide how to properly handle incoming HTTP/2 connections.
HTTP/2 end-to-end
To support server "push" and fully take advantage of HTTP/2's parallelism properties, we need to support HTTP/2 end-to-end. That means we need to route a connection to the user application as-is. Since we want to take advantage of the per-connection multiplexing, we need to allow a parallelism of greater than one per connection. This then means that we have no opportunity to reroute any of these requests once a pod becomes overloaded. Once a connection is routed to a pod, it sticks, and all requests sent over it go to that pod, no matter what.
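For context, end-to-end support implies the user container itself terminates HTTP/2, and since in-cluster traffic is typically cleartext, that means h2c. A minimal Go sketch of such a server (the port is illustrative, not Knative's actual wiring):

```go
package main

import (
	"fmt"
	"log"
	"net/http"

	"golang.org/x/net/http2"
	"golang.org/x/net/http2/h2c"
)

func main() {
	handler := http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		// r.Proto reports "HTTP/2.0" for requests that arrived as h2c streams.
		fmt.Fprintf(w, "served over %s\n", r.Proto)
	})

	// h2c.NewHandler accepts cleartext HTTP/2 ("h2c") connections, which is
	// how HTTP/2 usually travels between pods (no TLS inside the cluster).
	log.Fatal(http.ListenAndServe(":8080", h2c.NewHandler(handler, &http2.Server{})))
}
```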
If containerConcurrency is set to 0 (allowing infinite parallelism), this is not really an issue, as we have no defined limit on how many concurrent requests we can handle and thus don't need to enforce one. Vertical scalability could become crucial on this path, as one connection could potentially overload a pod and we cannot reroute individual requests on the connection to relieve that pod.
If containerConcurrency is set to > 0 (allowing only a set amount of parallelism), things get a little more tricky. The HTTP/2 spec defines a SETTINGS_MAX_CONCURRENT_STREAMS setting to control the maximum number of active concurrent streams on one connection. As long as we allow only one HTTP/2 connection per pod, this should work well to indicate the maximum allowed concurrency per client connection. If we stick to one connection per pod, autoscaling would naturally scale to one pod per connection and leave all request/stream-based concurrency to the pod itself. The SETTINGS_MAX_CONCURRENT_STREAMS would be sent by the queue-proxy.
If a single connection from a single client does not saturate a pod though, we are left with unused capacity. Once we allow multiple HTTP/2 connections per pod, we'll have to deal with sizing each of them properly relative to each other. Autoscaling in this case also needs to account for total active streams across pods.
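Going back to the single-connection case: Go's http2.Server exposes this setting directly via MaxConcurrentStreams, so the queue-proxy could advertise containerConcurrency on the connections it accepts. A rough sketch; the port and the hardcoded concurrency value are placeholders, not Knative's actual wiring:

```go
package main

import (
	"log"
	"net/http"

	"golang.org/x/net/http2"
	"golang.org/x/net/http2/h2c"
)

// Placeholder: in Knative this would come from the Revision's
// containerConcurrency field rather than a constant.
const containerConcurrency uint32 = 10

func main() {
	forward := http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		// ... proxy the stream on to the user container ...
		w.WriteHeader(http.StatusOK)
	})

	// MaxConcurrentStreams is advertised to the client as
	// SETTINGS_MAX_CONCURRENT_STREAMS, capping parallelism per connection.
	h2s := &http2.Server{MaxConcurrentStreams: containerConcurrency}
	log.Fatal(http.ListenAndServe(":8012", h2c.NewHandler(forward, h2s)))
}
```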
HTTP/2 to gateway
An alternative approach is to support HTTP/2 only until we reach the gateway of a service. The gateway then demultiplexes the connections and sends HTTP/1.1 requests to the application pods themselves. Scaling granularity is not an issue in this case, and we need no changes to the user pods at all (which includes the queue-proxy). This approach takes advantage of HTTP/2's reduced overhead until the user's requests reach the gateway. The "last mile" then has the usual overhead of HTTP/1.1, which is hopefully less crucial on an in-cluster network than for user requests coming from somewhere on the planet (although multi-region HA services could potentially see the same overhead).
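To make the conversion concrete, here is a rough Go sketch of what the gateway-side demultiplexing amounts to: accept multiplexed h2c streams and let a standard reverse proxy re-issue each one as an HTTP/1.1 request. The target address is made up for illustration:

```go
package main

import (
	"log"
	"net/http"
	"net/http/httputil"
	"net/url"

	"golang.org/x/net/http2"
	"golang.org/x/net/http2/h2c"
)

func main() {
	// Made-up in-cluster address of an application pod.
	target, err := url.Parse("http://app-pod.default.svc:8080")
	if err != nil {
		log.Fatal(err)
	}

	// The default transport speaks HTTP/1.1 to a plain http:// target, so
	// every incoming HTTP/2 stream leaves as a separate HTTP/1.1 request.
	proxy := httputil.NewSingleHostReverseProxy(target)

	// Accept multiplexed cleartext HTTP/2 from the upstream routing layers.
	log.Fatal(http.ListenAndServe(":8080", h2c.NewHandler(proxy, &http2.Server{})))
}
```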
Proposal
Given the different cases laid out above, I feel it's hard to infer which kind of HTTP/2 support a user wants for her application. If anything, we can try to infer decent defaults based on the containerConcurrency setting. We should always allow the user to override this default, though.
Based on the above, I see at least 3 modes:
- Manual: We route HTTP/2 through to the application and it needs to handle everything accordingly itself (status quo).
- Convert: Converts to HTTP/1.1 at the gateway. Routing and loadbalancing logic stays intact.
- Single: Allows only a single HTTP/2 connection per pod, which is properly sized in concurrency for the allowed containerConcurrency. Trying to resize connections etc. seems error-prone and "hard to guess right" to me. We could maybe implement a Multiple mode later?
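To illustrate how defaulting and overriding could interact, here is a hypothetical sketch. The annotation key, the mode names, and the defaulting rule are all inventions for this example, not existing Knative API:

```go
package modes

// Hypothetical annotation key; not part of the Knative API today.
const http2ModeAnnotation = "features.knative.dev/http2-mode"

type HTTP2Mode string

const (
	ModeManual  HTTP2Mode = "manual"  // route HTTP/2 through untouched
	ModeConvert HTTP2Mode = "convert" // demultiplex to HTTP/1.1 at the gateway
	ModeSingle  HTTP2Mode = "single"  // one properly sized connection per pod
)

// http2ModeFor infers a default mode from containerConcurrency but lets an
// explicit annotation win. The inference rule below is a strawman.
func http2ModeFor(containerConcurrency int64, annotations map[string]string) HTTP2Mode {
	if m, ok := annotations[http2ModeAnnotation]; ok {
		return HTTP2Mode(m)
	}
	if containerConcurrency == 0 {
		// Unlimited concurrency: passing connections through is unproblematic.
		return ModeManual
	}
	// Bounded concurrency: converting keeps routing and autoscaling intact.
	return ModeConvert
}
```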