H2 encode flow control #1328
Conversation
| onDecoderFilterAboveWriteBufferHighWatermark() { | ||
| ENVOY_STREAM_LOG(debug, "Read-disabling downstream stream due to filter callbacks.", parent_); | ||
| parent_.response_encoder_->getStream().readDisable(true); | ||
| parent_.connection_manager_.stats_.named_.downstream_flow_control_paused_reading_total_.inc(); |
@mattklein123 Do you think it's also worth adding the source-cluster of the flow control backup or is that overkill?
We should have this in upstream stats, so I think this is overkill (see my comment about router stats). Router should be hitting the cluster stats in this case IMO.
It seems to make sense to have separate stream/connection limits, given you have many streams per connection. There's a natural place for them in https://github.com/lyft/envoy/blob/master/source/common/json/config_schemas.cc#L267 (for H2 at least - in v2 of the API we have places for all these settings in https://github.com/lyft/envoy-api/blob/master/api/protocol.proto, including a |
| virtual void encodeTrailers(HeaderMapPtr&& trailers) PURE; | ||
| /** | ||
| * Called when a decoder filter goes over its high watermark. |
Is it the filter going over the high watermark or the underlying buffers for its stream/connection? This is a distinction that is interesting because at some point the filter itself will have a notion of low/high watermark when managing its own buffers.
Filter, codec, and connection for the router filter.
Updating to (not upstream yet, still doing h2 config changes)
- Called when the buffer for a decoder filter or any buffers the filter sends data to go over
- their high watermark.
- In the case of a filter such as the router filter, which spills into multiple buffers (codec,
- connection etc.) this may be called multiple times. Any such filter is responsible for calling
- the Low Watermark callbacks an equal number of times as the respective buffers are drained.
| void onBelowWriteBufferLowWatermark() override {} | ||
| // Pass watermark events from the connection on to the codec which will pass it to the underlying | ||
| // streams. | ||
| void onAboveWriteBufferHighWatermark() override { codec_->onAboveWriteBufferHighWatermark(); } |
I haven't gone over the full PR yet, but one question I have is how do the connection and stream watermark event edge triggers interact? For example, if connections goes above high watermark, then stream goes above high watermark, are we guaranteed to only ever get one onAboveWriteBufferHighWatermark for the stream's registered callbacks, or will we see two? I recall this was an issue I had to write some funky logic to handle when experimenting with this a while back.
The connection manager has to handle callbacks from multiple parties since each filter can go over watermark, so I figured it was better to handle all the deduping in one place rather than do it both here and over there. That's all managed by the overrun_buffers() counters.
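The counter-based de-duplication described here could be sketched roughly as follows (hypothetical names, not Envoy's actual classes): several sources may each report going over/under watermark, but the pause/resume actions should fire only on the 0 -> 1 and 1 -> 0 transitions of the counter.

```cpp
#include <cassert>
#include <functional>
#include <utility>

// Illustrative sketch of de-duplicating watermark edge triggers with a
// counter. Multiple sources (stream buffer, connection buffer, filters) may
// each report going over/under their watermark; the actual pause/resume
// should only happen on the first overrun and the last drain.
class WatermarkDeduper {
public:
  WatermarkDeduper(std::function<void()> on_first_overrun,
                   std::function<void()> on_last_drain)
      : on_first_overrun_(std::move(on_first_overrun)),
        on_last_drain_(std::move(on_last_drain)) {}

  void onAboveHighWatermark() {
    if (++buffers_overrun_ == 1) {
      on_first_overrun_(); // Only the first source going over triggers a pause.
    }
  }

  void onBelowLowWatermark() {
    assert(buffers_overrun_ > 0);
    if (--buffers_overrun_ == 0) {
      on_last_drain_(); // Only when every source has drained do we resume.
    }
  }

  bool buffersOverrun() const { return buffers_overrun_ > 0; }

private:
  int buffers_overrun_{0};
  std::function<void()> on_first_overrun_;
  std::function<void()> on_last_drain_;
};
```

With this shape, the answer to the question above falls out naturally: even if both the connection and the stream go over watermark, the registered callback fires exactly once.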
@htuch I'm not convinced http2_settings is the right place - everything in there is an actual HTTP/2 setting, which gets sent to the peer. Unless you're saying we should use the stream window size, though that adds a constraint that the data you're willing to buffer locally is also what you want to allow to be on the network. I can put it there temporarily and have it be the odd "setting" out, or I can add a TODO to use per_stream_buffer_limit_byte when we move over to v2. Preference?
I don't think there's a big issue with the semantics of that JSON object today; it's basically a disposable object, since it will be retired soon. I think plumbing in the |
| per_stream_buffer_limit | ||
| *(optional, integer)* A soft limit on the number of bytes envoy will buffer per-stream in the | ||
| codec buffers. Once the buffer reaches this point, Watermark callbacks will fire to stop the |
Nit: s/Watermark/watermark/, s/envoy/Envoy/
| /** | ||
| * Enable/disable further data from this stream. | ||
| * Cessation of data may not be immediate. For example, for HTTP/2 this may stop further flow | ||
| * control window updates which will result in the peer eventually stopping sending data. |
For a later implementation, have you thought about using distance from watermark as a way to size the H2 stream flow control window?
It would avoid the double buffer, and given the proximity between buffer and data source I think it'd be a nice clean thing to do, but it would make this one buffer + data source combination special and unusual, so I'm not convinced it'd be worth doing unless I get to do the full flow control redesign we've talked about in person where everything seamlessly avoids the double buffering.
everything seamlessly avoids the double buffering
I assume you mean by rejiggering everything so that encode* returns the number of bytes accepted, etc.? That's certainly doable, though buffering will still need to happen, since whoever fails to write will need to buffer locally until they can write. This pushes more complexity out to the leaves. Not sure it's worth it, but I guess we can see how things go with this approach.
| * Called when the buffer for a decoder filter or any buffers the filter sends data to go over | ||
| * their high watermark. | ||
| * | ||
| * In the case of a filter such as the router filter, which spills into multiple buffers (codec, |
Can you share this comment with the other variants below? I'm thinking we might want to take the design doc and digest it down to a Markdown file explaining what is actually implemented, and example sequences of going above and below to convey the intuition of when callbacks are fired etc. Maybe you can figure out a better way to convey the intuition in comment form, up to you.
Having read more of the PR, I think a MD file explaining the connection + stream level design stuff, and then real examples of callback sequences on different events/overflows, would be super helpful. Looking through the PR locally, it looks very solid, but I'm having a hard time connecting all the dots staring at the code; some worked examples (or diagrams, if you prefer) would be great.
| int ConnectionImpl::onData(int32_t stream_id, const uint8_t* data, size_t len) { | ||
| getStream(stream_id)->pending_recv_data_.add(data, len); | ||
| StreamImpl* stream = getStream(stream_id); | ||
| stream->pending_recv_data_.add(data, len); |
Does the H2 stream flow control window save us from infinite buffering in pending_recv_data_ here? If so, maybe add a comment.
| // I don't fully understand. | ||
| static const uint64_t MAX_HEADER_SIZE = 63 * 1024; | ||
| bool buffers_overrun() { return buffers_overrun_ > 0; } |
Nit: bool buffers_overrun() const {
mattklein123
left a comment
Overall solid as usual. This is great. I have a bunch of high level comments to get started with. I haven't looked through the tests yet.
| no_cluster, Counter, Total requests in which the target cluster did not exist and resulted in a 404 | ||
| rq_redirect, Counter, Total requests that resulted in a redirect response | ||
| rq_total, Counter, Total routed requests | ||
| flow_control_paused_downstream_reads_total, Counter, Total times requests backed up enough to pause reading from downstream. |
Is this useful at router level? We have http_conn_man stats, and upstream stats, so maybe seems superfluous?
I thought as we added various other filters we might want to track the source of the reads getting paused. If you think it's overkill I'm happy to remove it :-)
IMO for now it's probably overkill, but up to you. But we definitely need to track the cluster that we paused. Are we doing that with this change? I didn't see it.
We had cluster counters for when we stopped reading from upstream but not when upstream stopped reads from downstream.
Adding them now. I think this does mean we'll have somewhat redundant TCP and HTTP stats, where the cluster-level "downstream backing up" stats match the "upstream stopped reading" stats and vice versa, but it'll make the H2 case easier to understand for sure.
| * Fires when a stream, or the connection the stream is sending to, goes over its high watermark. | ||
| */ | ||
| virtual void onAboveWriteBufferHighWatermark() PURE; | ||
| /** |
| void addCallbacks(StreamCallbacks& callbacks) override { addCallbacks_(callbacks); } | ||
| void removeCallbacks(StreamCallbacks& callbacks) override { removeCallbacks_(callbacks); } | ||
| void resetStream(StreamResetReason reason) override; | ||
| void readDisable(bool /*disable*/) override {} |
nit: rest of code just deletes param name to deal with unused parameter.
| void onResetStream(Http::StreamResetReason reason) override; | ||
| void onAboveWriteBufferHighWatermark() override { | ||
| // Have the connection manager disable reads on the downstream stream. | ||
| parent_.config_.stats_.flow_control_paused_downstream_reads_total_.inc(); |
Commented elsewhere, but IMO I would remove router stats here, and do upstream cluster stats.
| StreamImpl* stream = getStream(stream_id); | ||
| stream->pending_recv_data_.add(data, len); | ||
| if (!stream->buffers_overrun()) { | ||
| nghttp2_session_consume(session_, stream_id, len); |
Q: The current auto window handling code does the "standard" h2 behavior of only sending a window update when we fall below max size / 2 of the window (to avoid excess window updates). Does this code path still do the same thing? I suspect it might, but it would be nice to check and add a comment.
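For reference, the "standard" heuristic being asked about can be sketched like this (an illustrative toy model with made-up names, not the nghttp2/Envoy code): the receiver defers WINDOW_UPDATE frames until at least half of the initial window has been consumed, to avoid a flood of tiny updates.

```cpp
#include <cassert>
#include <cstdint>

// Toy model of the deferred WINDOW_UPDATE heuristic: accumulate consumed
// bytes and only emit an update once half the initial window is consumed.
class RecvWindow {
public:
  explicit RecvWindow(uint32_t initial) : initial_(initial) {}

  // Returns the WINDOW_UPDATE increment to send now, or 0 to defer.
  uint32_t consume(uint32_t bytes) {
    unacked_ += bytes;
    if (unacked_ >= initial_ / 2) {
      uint32_t increment = unacked_;
      unacked_ = 0; // Everything outstanding is acknowledged at once.
      return increment;
    }
    return 0;
  }

private:
  uint32_t initial_;
  uint32_t unacked_{0};
};
```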
| } | ||
| connection_.dispatcher().deferredDelete(stream->removeFromList(active_streams_)); | ||
| nghttp2_session_consume(session_, stream_id, stream->unconsumed_bytes_); |
Out of curiosity was this required to get tests passing? I would think nghttp2 would actually do this accounting at the session level if a stream is destroyed.
I would have thought so too. I had a random moment of "oh but how awful would it be to debug things it doesn't?" and added a regression test just in case. (Http2CodecImplFlowControlTest.EarlyResetRestoresWindow) which fails without this consume.
You don't need to update stream windows on stream close, but the stream window influences the connection window and that isn't updated. My first reaction was that was terrible. My second reaction was that if a stream is "closed" as far as nghttp2 is concerned (all data is sent, fin is received) but the consumer has the data buffered, it really ought not update the connection level window until the consumer says to. So yeah, necessary by design :-/
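A toy model of this accounting (made-up names, not nghttp2's real bookkeeping) shows why the explicit consume on stream close matters: bytes received but never consumed keep counting against the shared connection-level window even after the stream itself is gone.

```cpp
#include <cassert>
#include <cstdint>

// Toy model of the shared connection-level receive window: receiving data
// shrinks it; only an explicit consume() restores it. Stream-level windows
// die with the stream, but the connection window does not self-heal.
struct ConnWindow {
  int32_t available{65535};
  void receive(int32_t bytes) { available -= bytes; }
  void consume(int32_t bytes) { available += bytes; }
};
```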
OK thanks for looking into this. I might add a comment here for the next person that comes along and wonders about this.
| ConnectionCallbackImpl(HttpActiveHealthCheckSession& parent) : parent_(parent) {} | ||
| // Network::ConnectionCallbacks | ||
| void onEvent(uint32_t events) override { parent_.onEvent(events); } | ||
| void onAboveWriteBufferHighWatermark() override {} |
General comment: I noticed in latest coverage report that there now a bunch of missing lines for watermark callbacks in some network filters (and soon potentially http filters) where we don't do anything. I realize it's kind of "cheap" coverage but it would be nice to see if we can get everything covered if you see any good way to do it.
Is this for the no-op functions? I was pretty careful on coverage to cover actual lines of code.
For no-ops, do we want to have the tests all call onUnusedfunction(), or are coverage annotations to disregard them OK?
"coverage annotations to disregard" don't currently work until we switch to lcov. You can bug @htuch about that. 😉
In this case though, I'm not sure we should disregard as these functions will likely get called. I feel like unit tests on filters, etc. should probably call the functions when they can just to future proof anything strange happening, but not a big deal.
| ENVOY_CONN_LOG(debug, "Stream {} disabled: disable {}, unconsumed_bytes {}", parent_.connection_, | ||
| stream_id_, disable, unconsumed_bytes_); | ||
| if (disable) { | ||
| ++buffers_overrun_; |
nit: I would rename variable to read_disable_count_ to match what is going on here. (or maybe rename readDisable() to something else).
| } | ||
| } | ||
| void ConnectionImpl::StreamImpl::pendingRecvBufferHighWatermark() { |
Is it worth having temporary asserts for all 4 of these functions, like we did in the other PR, with a TODO to clean them up after we get more prod experience? I would be in favor of that personally. This is all super scary.
Given the work being factored into the watermark classes I don't think it gains us much but I'm happy to err on the side of caution. Added locally, upstreamed soon :-)
…me cleanup accidentally done in the same merge.
| @@ -0,0 +1,103 @@ | |||
| ### Overview | |||
@htuch obviously fairly rough but I'd appreciate high level comments - is this the sort of detailed walkthrough you were looking for?
Thanks this is very useful. nit: it would be easier to read if using foo for functions and possibly using some numeric lists/bullet lists in subsections.
Yep, this is what I was after. Some more suggestions on how to polish:
- Explain explicitly the connection between network and stream level handling (and how this relates to plain TCP or H1 network level handling).
- Any diagrams that are reasonable (i.e. don't take ages to put together) would help. I have in my mind the perfect way to efficiently communicate this entire story, an animation showing what happens on backup, the series of firing callbacks and how the buffers fill and drain, but this would take a bit of time to put together, so I don't expect that from this PR (maybe for a future blog post however..).
- More explanation of who the principal actors are. Kind of like telling a story and providing some character background. Leading into the callback description, it would be good to know why the HTTP connection manager and the router play the role they do, i.e. that they are the unique bridges across certain kinds of events and subsystem divides in Envoy. Same goes for the codecs.
- Cover corner cases explicitly, such as the combination of stream and connection level watermark firing.
- Run spell checker and fix any grammar nits.
@alyssawilk Can you merge to resolve conflicts?
htuch
left a comment
This documentation is fantastic, thanks! PR is basically ready to ship IMHO.
| virtual void onBelowWriteBufferLowWatermark() PURE; | ||
| /** | ||
| * Called when the connection goes from over its high watermark to under its low watermark. |
Are these comments the opposite way around (or maybe I'm standing on my head today ;) )?
hasty push
No, why would I ever do something as silly as that?
| void ConnectionImpl::StreamImpl::pendingRecvBufferHighWatermark() { | ||
| ENVOY_CONN_LOG(debug, "recv buffer over limit ", parent_.connection_); | ||
| ASSERT(pending_receive_buffer_high_watermark_called_ == false); |
I think for booleans we don't need equality with the value, just something like !pending_receive...
| ## HTTP/2 codec recv buffer | ||
| Given the HTTP/2 `Envoy::Http::Http2::ConnectionImpl::StreamImpl::pending_recv_data_` is processed immediately | ||
| there's no real need for buffer limits, but for consistency and to future-proof the implementation, |
Yeah, I wondered about this. Seems to make it more complicated than needed for now IMHO, but it's done already, so I don't mind I guess.
| private: | ||
| virtual ConnectionCallbacks& callbacks() PURE; | ||
| virtual Http::ConnectionCallbacks& callbacks() PURE; |
Nit: Why did this namespace prefixing become necessary in this PR?
Earlier version had connection callbacks here as well. Reverted.
| @@ -0,0 +1,182 @@ | |||
| ### Overview | |||
| Flow control in envoy is done by having limits on each buffer, and watermark callbacks. When a | |||
Super minor nit: I really prefer to have s/envoy/Envoy/g. It's not really important, but it takes me to my calm place ;)
|  | ||
| For HTTP/2, when filters, streams, or connections back up, the end result is readDisable(true) |
Nit: prefer code quotations, readDisable(true) as done elsewhere.
| being called on the source stream. This results in the stream ceasing to consume window, and so | ||
| not sending further flow control window updates to the peer. This will result in the peer | ||
| eventually stopping sending data when the available window is consumed (or nghttp2 closing the | ||
| connection if the peer violates the flow control limit). When readDisable(false) is called, any |
| Note that readDisable() on a stream may be called by multiple entities. It is called when any | ||
| filter buffers too much, when the stream backs up and has too much data buffered, or the | ||
| connection has too much data buffered. Because of this, readDisable() maintains a count of | ||
| the number of times it has been called to both enable and disable the stream, and resumes reads when
| each caller has called the equivalent low watermark callback. For example, if | ||
| the TCP window upstream fills up and results in the network buffer backing up, | ||
| all the streams associated with that connection will readDisable(true) their | ||
| downstream data sources. While the HTTP/2 flow control window fills up an |
| individual stream may use all of the window available and call a second | ||
| readDisable() on its downstream data source. When the upstream TCP socket drains,
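The counted enable/disable semantics this doc section describes could be sketched like so (hypothetical class, not Envoy's StreamImpl): every disable must be balanced by a matching enable before reads resume.

```cpp
#include <cassert>

// Minimal sketch of counted readDisable(): a stream may be disabled by
// several parties (connection backup, its own H2 window filling, filter
// buffers), and reads resume only once every disable has been balanced
// by a matching enable.
class CountedStream {
public:
  void readDisable(bool disable) {
    if (disable) {
      ++disable_count_;
    } else {
      assert(disable_count_ > 0); // Enables must pair with prior disables.
      --disable_count_;
    }
  }

  bool readsEnabled() const { return disable_count_ == 0; }

private:
  int disable_count_{0};
};
```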
| static const uint32_t MAX_INITIAL_CONNECTION_WINDOW_SIZE = (1U << 31) - 1; | ||
| // By default, do not enforce stream buffer limits beyond the connection level limits. | ||
| static const uint32_t DEFAULT_PER_STREAM_BUFFER_LIMIT = 0; |
This is somewhat related to our offline convos, but I'm wondering if by default we should set stream limits (since defaults are huge, maybe stream limit could default to 128MiB or less vs. 256MiB)? Also, from a config perspective, what is the reasoning behind differentiating between per_stream_buffer_limit_ and initial_stream_window_size_? (I will look more through but figured I would quickly as this question since I noticed it).
Connection limits default to 1M - I'd prefer to have stream limits equal or smaller if that works for you.
Current defaults IIRC are 1MiB for TCP level buffering, but 256MiB for both connection and stream HTTP/2 windows. We can tweak defaults in follow up (or here), but my main question is actually why to have the stream window default be different from stream buffer limit. Seems like they could be the same?
I just looked at the code more, and correct me if I am wrong, but the default OOB behavior will be:
- 1MiB connection buffer limit, when breached, fires callbacks and pauses all streams.
- No default codec level buffer limit. If stream/connection windows back up, nothing will be paused and we can use unlimited memory. (In case TCP is open, but other side is not ACKing window updates).
IMO we should set buffer limit by default, and perhaps merge stream buffer/window settings into single setting?
While I'd prefer something smaller in the long run, 256M is strictly better than leaving it infinite :-)
I think lowering should be done in a follow-up anyway. I think there will be complexities we'll want to mull on - @htuch suggests defaulting filter limits to stream limits, which would make this a breaking change for non-streaming filters.
+1 on both turning on as well as tuning defaults. I guess my question is still why we need two different settings for stream window size and buffer size. If there are good reasons for that can we add some comments somewhere?
Oh, you wanted to remove the option entirely and just go with window size? Hm... I could imagine a use case where a user wanted to keep the stream limits smaller to avoid one stream hogging full connection bandwidth but was willing to buffer locally to spool upstream immediately as window became available. I can comment up the reason they're somewhat distinct use cases, or just remove it until/if someone asks for it back.
I guess my preference would be to remove the extra configuration pivot right now, but don't feel that strongly. If you want to keep it I would just add some comments on why it exists.
htuch
left a comment
This is probably going to be an issue in v2, where we have independent per stream buffer limits and initial window sizes. I kind of preferred it how it was before, but it's fine for a first PR as long as we have some idea of how this will evolve for v2.
| * Called when the underlying Network::Connection goes over its high watermark. | ||
| */ | ||
| virtual void onAboveWriteBufferHighWatermark() PURE; | ||
| virtual void onUnderlyingConnectionAboveWriteBufferHighWatermark() PURE; |
I'd probably find onNetworkConnections... more descriptive, but potatoes potatahs..
| request_encoder_->encodeHeaders(request_headers, false); | ||
| Buffer::OwnedImpl body(std::string(1024 * 1024, 'a')); | ||
| EXPECT_CALL(client_stream_callbacks, onAboveWriteBufferHighWatermark()).Times(AnyNumber()); | ||
| ; |
Ignore the last set of comments except the boilerplate; I see you've already had a long thread on this.
| { | ||
| "virtual_hosts": [ | ||
| { | ||
| "name": "redirect", |
Not unique to this PR, but I feel there is a fair bit of boilerplate in these JSON files that is actually unused. A good example is the redirect stuff. In a PR I'm working on right now, I made an effort to cut this out; I suggest giving this JSON a quick pass and seeing what can be removed.
| eventually stopping sending data when the available window is consumed (or nghttp2 closing the | ||
| connection if the peer violates the flow control limit). When `readDisable(false)` is called, any | ||
| outstanding unconsumed data is immediately consumed, which results in resuming window updates to the | ||
| peer and the resumption of data. |
Might be worth pointing out that we only do H2 flow control on stream level and not the connection level for the purposes of buffer bounding, which is implied here but a useful thing to call out explicitly.
| * When `Envoy::Network::ConnectionImpl::write_buffer_` has too much data it calls | ||
| `Network::ConnectionCallbacks::onAboveWriteBufferHighWatermark()`. | ||
| * When `Envoy::Http::CodecClient` receives `onAboveWriteBufferHighWatermark()` it |
Does the doc need updating for the underlying buffer name changes?
Can you also file a PR to remove https://github.com/lyft/envoy-api/blob/master/api/protocol.proto#L14? Thanks.
@alyssawilk docs build is still failing; otherwise LGTM.
| window. Currently, this has the same minimum/maximum/default as :ref:`initial_stream_window_size | ||
| <config_http_conn_man_http2_settings_initial_stream_window_size>`. | ||
| per_stream_buffer_limit_bytes |
This doesn't exist anymore right? Should this text get merged into initial_stream_window_size?
mattklein123
left a comment
This is awesome! Great work.
| Because the various buffers in the HTTP/2 stack are fairly complicated, each path from a buffer | ||
| going over the watermark limit to disabling data from the data source is documented separately. | ||
|  |
Is there a source vector drawing that was used to generate this image?
I think that the color of some of the buffers in the diagram may need to be updated. I think that based on how L4 buffering works today the ConnectionImpl buffer is bounded by the configured connection watermark and should be green. Some other changes are also in the pipeline.
This comes from https://docs.google.com/document/d/1EB3ybx3yTndp158c4AdQ4nutksZA9lL-BQvixhPnb_4/edit#bookmark=id.umrpbw846kb. I'll share the doc with you and you can copy from there if you want.
#150 (for h2 flow control)
H2 flow control for the encode (backend to client) path, specifically enforcing the existing connection limits and applying those same limits to the codec send/recv buffers.
- If the connection buffer overflows, the CodecClient notifies the Codec.
- If the Codec send buffer overflows (or the connection overflows), the codec calls stream callbacks.
- The router listens for stream callbacks and passes the backup to the connection manager via StreamDecoderFilterCallbacks.
- The connection manager enables/disables reads on the downstream stream.
Connection buffer limits are already configurable. This adds configurable stream limits via the h2 setting structure.
TBD, the client-to-backend path, filter buffers, and HTTP1.
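Pulling the steps above together, the callback chain might be sketched like this (simplified, hypothetical interfaces; the real Envoy types are richer):

```cpp
#include <cassert>
#include <functional>
#include <vector>

// Sketch of the chain: connection buffer overflow -> codec -> per-stream
// callbacks -> router -> connection manager read-disables the downstream
// stream. Names are illustrative, not Envoy's actual interfaces.
struct StreamCallbacks {
  std::function<void()> on_above_high_watermark;
};

struct Codec {
  std::vector<StreamCallbacks*> streams;
  // A connection-level backup fans out to every stream on the connection.
  void onAboveWriteBufferHighWatermark() {
    for (auto* s : streams) {
      s->on_above_high_watermark();
    }
  }
};

struct ConnectionManager {
  bool downstream_reads_enabled{true};
  void readDisableDownstream() { downstream_reads_enabled = false; }
};
```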