Overload: Reset expensive streams using byte accounting by KBaichoo · Pull Request #17702 · envoyproxy/envoy

KBaichoo · 2021-08-12T20:38:54Z

Commit Message: Overload Manager action to reset expensive streams using byte accounting
Additional Description:
Risk Level: Medium
Testing: unit and integration tests
Docs Changes: included
Release Notes: included
Platform Specific Features: NA
Optional Runtime guard: N/A (guarded by configuration of the overload manager)
Related Issue #15791

Signed-off-by: Kevin Baichoo <kbaichoo@google.com>

KBaichoo · 2021-08-12T20:39:44Z

/assign @yanavlasov

KBaichoo · 2021-08-12T20:40:03Z

cc @alyssawilk

KBaichoo · 2021-08-13T18:44:04Z

/retest

repokitteh-read-only · 2021-08-13T18:44:08Z

Retrying Azure Pipelines:
Retried failed jobs in: envoy-presubmit

🐱

Caused by: a #17702 (comment) was created by @KBaichoo.

see: more, trace.

KBaichoo · 2021-08-13T21:25:17Z

/retest

repokitteh-read-only · 2021-08-13T21:25:20Z

Retrying Azure Pipelines:
Retried failed jobs in: envoy-presubmit

🐱

Caused by: a #17702 (comment) was created by @KBaichoo.

see: more, trace.

alyssawilk

Nice addition! Some thoughts below :-)

alyssawilk · 2021-08-16T13:47:17Z

+                     testing::Bool()),
+    protocolTestParamsAndBoolToString);
+
+TEST_P(Http2OverloadManagerIntegrationTest, ResetsExpensiveStreamsWhenOverloaded) {


can we make sure to have an integration test where upstream buffers take too much space and another where downstream buffers do?

KBaichoo · 2021-08-19T00:58:30Z

Good idea. This original test had the upstream buffers hold data (sending to upstream was blocked). Added a test for the response path where the data will be in the downstream buffers to the client.

KBaichoo

Thanks for the review @alyssawilk

KBaichoo · 2021-08-19T13:33:58Z

CI is failing on unrelated fuzz tests, which are fixed in #17767.

alyssawilk

Looking great! Here's the next round of comments :-)

alyssawilk · 2021-08-19T18:45:35Z

+:ref:`buffer_factory_config
+<envoy_v3_api_field_config.overload.v3.OverloadManager.buffer_factory_config>`.
+If the `minimum_threshold_for_tracking` isn't configured, Envoy *won't* track
+per stream allocated bytes which is needed for this action to work.


can we simply reject the config if we only include half the config needed?

KBaichoo · 2021-08-19T20:30:01Z

Will do. Good idea.

alyssawilk · 2021-08-19T18:48:00Z

                                                    Http::StreamResetHandler& reset_handler);
  ~BufferMemoryAccountImpl() override {
+    // The buffer_memory_allocated_ should always be zero on destruction, even if we
+    // triggered a reset of the downstream. This is because the dtor only will


super nitty optional (since it passed spellcheck =P): dtor -> destructor?

alyssawilk · 2021-08-19T19:02:22Z

+  // Wait for the proxy to notice and take action for the overload.
+  if (streamBufferAccounting()) {
+    test_server_->waitForCounterGe("http.config_test.downstream_rq_rx_reset", 2);
+    EXPECT_TRUE(medium_request_response->waitForReset());


can we check reset reason or response code details here to make sure the accounting works?

KBaichoo · 2021-08-19T20:29:30Z

For reset reason: we get Http::StreamResetReason::RemoteReset from the endpoint (client). I think that's because it's the catch-all in ConnectionImpl::onStreamClose (envoy/source/common/http/http2/codec_impl.cc, line 1082 in 744b7bf):

reason = StreamResetReason::RemoteReset;

We could add additional plumbing to better tune this, e.g. have the RST_STREAM error code sent to downstream be INTERNAL_ERROR (0x2) as per https://www.rfc-editor.org/rfc/rfc7540.html#section-7 and have the client (e.g. in onStreamClose) check for this to guess the reason. That said, there might be multiple reasons an internal error gets sent: "INTERNAL_ERROR RST -> overload" might be one mapping, but there might be more.

WDYT?

if the client in this test gets remote reset I think that's fine - internal error might be a bit more accurate but I don't mind the catchall.
The real request here is that we have a way to debug what's going on in production if we see a bunch of streams being reset and want to verify the cause. Normally my catchall is "check the stream response code details", but I think with the way we reset this through the stream interface we don't have a way to send details or stick them directly in the stream info. Alternatively, are there stats which were added somewhere? I'm game for stats or details, but I think we should have a way to test that Envoy is triggering the path that (reading the test) I believe it's triggering, since that lets SRE do the same verification live :-)

Added an explicit stat for the number of streams the action called resetStream on.

KBaichoo

Will push these changes soon, thanks for the feedback.

KBaichoo · 2021-08-19T20:30:50Z

/wait

KBaichoo · 2021-08-23T12:47:02Z

/retest

repokitteh-read-only · 2021-08-23T12:47:08Z

Retrying Azure Pipelines:
Retried failed jobs in: envoy-presubmit

🐱

Caused by: a #17702 (comment) was created by @KBaichoo.

see: more, trace.

alyssawilk

This is looking great! One more nit offhand, but I think it's ready for a non-googler pass (especially as I'm out until Tuesday :-) )

alyssawilk · 2021-08-26T17:16:18Z

    - Envoy will reduce the waiting period for a configured set of timeouts. See
      :ref:`below <config_overload_manager_reducing_timeouts>` for details on configuration.

+  * - envoy.overload_actions.reset_streams


how about a specific name here as I suspect there are or will be other reasons we reset streams. reset_high_memory_stream or expensive_streams or something more terse but still specific?

Done, it's now envoy.overload_actions.reset_high_memory_stream.

snowp

Thanks for working on this! Took a pass at the docs and implementation for now, looks pretty good!

snowp · 2021-08-27T13:13:50Z

+Reset Streams
+^^^^^^^^^^^^^^^^^


This seems off, the squiggles should line up with the header

snowp · 2021-08-27T13:14:33Z

+
+.. warning::
+
+   Reset Streams only currently works with HTTP2.


nit: I'd say Resetting streams via an overload action currently only works with HTTP2.

snowp · 2021-08-27T13:14:51Z

+configured via :ref:`buffer_factory_config
+<envoy_v3_api_field_config.overload.v3.OverloadManager.buffer_factory_config>`.
+
+As an example, here is partial Overload Manager configuration with minimum


a partial

snowp · 2021-08-27T13:15:04Z

+<envoy_v3_api_field_config.overload.v3.OverloadManager.buffer_factory_config>`.
+
+As an example, here is partial Overload Manager configuration with minimum
+threshold for tracking and a single overload action entry that enables reset


that resets streams

snowp · 2021-08-27T13:23:56Z

+there's something seriously wrong e.g. the existence of streams using :math:`>=
+128 * minimum_threshold_for_tracking`.


Same, should probably say "streams using X amount of heap" or something to that effect

snowp · 2021-08-27T13:26:09Z

  ASSERT(current_class != new_class, "Expected the current_class and new_class to be different");

-  if (current_class == -1 && new_class >= 0) {
+  if (!current_class.has_value() && new_class >= 0u) {


should this be new_class.has_value()?

Good point, it's actually redundant since current_class != new_class.

E.g. if there is no current class, then the new class must have a value. If there is no new class, then the current class must have a value.
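To make the redundancy argument concrete, here is a minimal sketch of the optional-based class transition. It is not the PR's implementation: `Tracker`, `AccountId` and `updateClass` are hypothetical names, and the bucket sets are a simplified stand-in for the factory's real bookkeeping. The only thing it relies on from the thread is the precondition that the two classes differ, which guarantees at least one of them holds a value.

```cpp
#include <array>
#include <cassert>
#include <cstdint>
#include <optional>
#include <unordered_set>

// Assumed to match the 8 memory classes discussed in the docs under review.
constexpr uint32_t kNumMemoryClasses = 8;

// Hypothetical stand-in for a pointer to a per-stream memory account.
using AccountId = int;

struct Tracker {
  std::array<std::unordered_set<AccountId>, kNumMemoryClasses> buckets;

  // Moves an account between buckets. std::nullopt means "not tracked".
  void updateClass(AccountId id, std::optional<uint32_t> current_class,
                   std::optional<uint32_t> new_class) {
    // Precondition from the thread: the classes differ, so they cannot both
    // be empty and no extra has_value() check on the other side is needed.
    assert(current_class != new_class);
    if (current_class.has_value()) {
      buckets[*current_class].erase(id);
    }
    if (new_class.has_value()) {
      buckets[*new_class].insert(id);
    }
  }
};
```

With this shape the three transitions (untracked to tracked, tracked to a different bucket, tracked to untracked) fall out of two independent checks.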

snowp · 2021-08-27T13:27:53Z

+      std::floor(pressure * BufferMemoryAccountImpl::NUM_MEMORY_CLASSES_) + 1, 8);
+  uint32_t bucket_idx = BufferMemoryAccountImpl::NUM_MEMORY_CLASSES_ - buckets_to_clear;
+
+  ENVOY_LOG_MISC(warn, "resetting streams in buckets >= {}", bucket_idx);


Should this class have its own logger instead of just using misc?

I think this is consistent with other Overload Manager actions, such as scaled timers.
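For reference, the arithmetic in the hunk above can be sketched in isolation. This is a hedged reconstruction from the visible lines only: the function names `bucketsToClear` and `firstBucketToReset` are made up for illustration, the `std::min` wrapper is inferred from the trailing `, 8)` in the diff, and the sketch does not model when the overload action actually invokes the reset.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstdint>

// Assumed to match BufferMemoryAccountImpl::NUM_MEMORY_CLASSES_ (8 buckets).
constexpr uint32_t kNumMemoryClasses = 8;

// Number of buckets to reset, counted down from the most expensive bucket,
// for an overload pressure in [0, 1].
uint32_t bucketsToClear(float pressure) {
  assert(pressure >= 0.0f && pressure <= 1.0f);
  const uint32_t scaled =
      static_cast<uint32_t>(std::floor(pressure * kNumMemoryClasses)) + 1;
  // At pressure == 1.0 the +1 would overshoot, so clamp to the bucket count.
  return std::min(scaled, kNumMemoryClasses);
}

// Lowest bucket index that is reset; every bucket >= this index is cleared.
uint32_t firstBucketToReset(float pressure) {
  return kNumMemoryClasses - bucketsToClear(pressure);
}
```

Under these assumptions a pressure of 0.5 clears buckets 3 through 7, and a pressure of 1.0 clears all eight.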

snowp · 2021-08-27T13:28:40Z

+uint64_t WatermarkBufferFactory::resetAccountsGivenPressure(float pressure) {
+  ASSERT(pressure >= 0.0 && pressure <= 1.0, "Provided pressure is out of range [0, 1].");
+
+  // Compute buckets to clear


nit: end comments with proper punctuation, here and elsewhere

snowp · 2021-08-27T13:32:29Z

 namespace Server {
+namespace {
+// TODO(kbaichoo): refactor into a utility.
+Stats::Counter& getCounter(Stats::Scope& scope, absl::string_view name_of_stat) {


@jmarantz Do we have something like this already, or is there a nicer way to do this?

There is no such helper because this pattern takes a global symbol-table lock, so we don't really want to make it look easy :)

Instead, names that are known at compile time should be saved as StatNames in the factory object. This pattern happens a lot. A StatNamePool is the easiest way to create a handful of known-at-compile-time names.

In this case it looks like ResetStreamsCount is known at compile-time so we can make a StatName for that in the ProdWorkerFactory constructor.

If there's a case where the name is not known at compile-time, that's a candidate for a DynamicStatName, which uses more bytes, but can be constructed without a symbol table lock.

See https://github.com/envoyproxy/envoy/blob/main/source/docs/stats.md for more details.

sgtm, will leverage the standard pattern

snowp

Thanks for the docs cleanup, much easier to follow! A couple more comments and then I think we're good to go

snowp · 2021-08-31T15:38:07Z

-(e.g. streams using memory within a power of two range). There are 8 buckets,
-with the last bucket capturing all of the streams using :math:`>= 128 *
-minimum_threshold_for_tracking`.  In this example the `minimum_threshold_for_tracking` is 1MB.
+By setting the `minimum_account_to_track_power_of_two` to `20`, we will only track


Could you spell out how 20 translates to 1MB? I assume it's due to 2^20 but would be helpful to be explicit in the docs

snowp · 2021-08-31T15:38:47Z

-minimum_threshold_for_tracking`.  In this example the `minimum_threshold_for_tracking` is 1MB.
+By setting the `minimum_account_to_track_power_of_two` to `20`, we will only track
+streams using >= 1MB worth of allocated memory in buffers. Streams using >= 1MB
+will be classified into 8 power of two sized buckets. For this example, the


Is it 8 regardless of the value of minimum_account_to_track_power_of_two? Not clear from the docs
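A small sketch may help with both questions in this part of the thread. Setting the power to 20 means the threshold is 2^20 = 1,048,576 bytes (1MB), and the bucket count stays at 8 whatever the power is, because the last bucket absorbs everything at or above 128 times the threshold. The function name `memoryClass` and the shift-based classification are assumptions made for illustration; the docs under review only state the bucket boundaries.

```cpp
#include <cassert>
#include <cstdint>
#include <optional>

// Assumed fixed bucket count, independent of the configured power of two.
constexpr uint32_t kNumMemoryClasses = 8;

// Returns the bucket for a stream holding `bytes` of buffered data, or
// std::nullopt when the stream is below the tracking threshold.
std::optional<uint32_t> memoryClass(uint64_t bytes, uint32_t min_power_of_two) {
  assert(min_power_of_two < 64);
  // Equivalent to bytes / 2^min_power_of_two, e.g. bytes / 1MB for 20.
  uint64_t scaled = bytes >> min_power_of_two;
  if (scaled == 0) {
    return std::nullopt;  // Under the threshold: not tracked at all.
  }
  // Each doubling moves the stream up one bucket; the last bucket is open
  // ended and captures everything >= 128 * threshold.
  uint32_t bucket = 0;
  while (scaled > 1 && bucket < kNumMemoryClasses - 1) {
    scaled >>= 1;
    ++bucket;
  }
  return bucket;
}
```

With a power of 20, a 1MB stream lands in bucket 0, a 2MB stream in bucket 1, and anything from 128MB upward in bucket 7.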

snowp · 2021-08-31T15:41:26Z

+The above configuration also configures the overload manager to reset our tracked
+streams based on heap usage as a trigger. When the heap usage is less than 85%,
+no streams will be reset.  When heap usage is at or above 85%, we start to
+reset certain buckets. When the heap usage is at 95% all streams using >= 1MB memory


Instead of "certain buckets" I would say something like "reset buckets according to the strategy described below"
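The 85% and 95% figures in the doc text can be read as the two ends of a scaled trigger. The sketch below assumes plain linear interpolation between them; the function name `scaledPressure` is hypothetical, and the handling of a value exactly at the lower threshold is an assumption rather than something the docs spell out.

```cpp
#include <cassert>

// Maps a resource usage fraction (e.g. heap usage) onto a pressure in [0, 1].
// Assumption: linear scaling between the two thresholds.
double scaledPressure(double usage, double scaling_threshold,
                      double saturation_threshold) {
  assert(scaling_threshold < saturation_threshold);
  if (usage <= scaling_threshold) {
    return 0.0;  // Below the lower threshold: action inactive.
  }
  if (usage >= saturation_threshold) {
    return 1.0;  // At or above the upper threshold: fully saturated.
  }
  return (usage - scaling_threshold) /
         (saturation_threshold - scaling_threshold);
}
```

For example, with thresholds of 0.85 and 0.95, a heap usage of 90% maps to a pressure of roughly 0.5, which is then what the bucket-selection strategy would consume.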

KBaichoo

Thanks for the review @snowp

KBaichoo · 2021-08-31T23:28:36Z

/retest

repokitteh-read-only · 2021-08-31T23:28:39Z

Retrying Azure Pipelines:
Retried failed jobs in: envoy-presubmit

🐱

Caused by: a #17702 (comment) was created by @KBaichoo.

see: more, trace.

snowp

LGTM, thanks!

KBaichoo added 9 commits August 12, 2021 13:14

WIP: Overload Manager action.

71a5d51

Signed-off-by: Kevin Baichoo <kbaichoo@google.com>

Add Proto for Reset Action.

31ec6d2

Signed-off-by: Kevin Baichoo <kbaichoo@google.com>

Buffer resetAllStreamsInBuckets, improve buffer interface to use absl::optional instead of specific sentinel.

85863fa

Signed-off-by: Kevin Baichoo <kbaichoo@google.com>

ResetStreamAdapter, Overload Manager integration and WorkerImpl hook into resetStreamsInBucket using adapter.

bc30b20

Signed-off-by: Kevin Baichoo <kbaichoo@google.com>

WIP: integration test -- DO NOT SUBMIT -- clean up by removing Overload Manager unneeded bits.

1ec29a7

Signed-off-by: Kevin Baichoo <kbaichoo@google.com>

Remove reset_stream_adapter.

dddd095

Signed-off-by: Kevin Baichoo <kbaichoo@google.com>

Add documentation, clean up integration test.

80a159d

Signed-off-by: Kevin Baichoo <kbaichoo@google.com>

Update tests.

fb51e1e

Signed-off-by: Kevin Baichoo <kbaichoo@google.com>

Remove flag, use the configuration of minimum tracking threshold instead. Update documentation.

340ebc2

Signed-off-by: Kevin Baichoo <kbaichoo@google.com>

repokitteh-read-only Bot assigned yanavlasov Aug 12, 2021

KBaichoo added 2 commits August 13, 2021 14:05

Tracked Watermark Buffer test needs tracking enabled.

3130c44

Signed-off-by: Kevin Baichoo <kbaichoo@google.com>

Clangtidy.

9d9d464

Signed-off-by: Kevin Baichoo <kbaichoo@google.com>

alyssawilk reviewed Aug 16, 2021

alyssawilk self-assigned this Aug 16, 2021

alyssawilk added the waiting label Aug 17, 2021

Added test, updated doc, added TODO that will follow up in next PR.

e00c0ed

Signed-off-by: Kevin Baichoo <kbaichoo@google.com>

repokitteh-read-only Bot removed the waiting label Aug 19, 2021

KBaichoo commented Aug 19, 2021

Merge remote-tracking branch 'upstream/main' into overload-rst-stream-rework

6e10eee

Signed-off-by: Kevin Baichoo <kbaichoo@google.com>

alyssawilk reviewed Aug 19, 2021

KBaichoo commented Aug 19, 2021

repokitteh-read-only Bot added the waiting label Aug 19, 2021

Have ResetStream require buffer_factory_config, update documentation.

d11e391

Signed-off-by: Kevin Baichoo <kbaichoo@google.com>

repokitteh-read-only Bot removed the waiting label Aug 20, 2021

alyssawilk added the waiting label Aug 25, 2021

Add stat for number streams reset.

f461e51

Signed-off-by: Kevin Baichoo <kbaichoo@google.com>

repokitteh-read-only Bot removed the waiting label Aug 25, 2021

Merge remote-tracking branch 'upstream/main' into overload-rst-stream-rework

063b315

Signed-off-by: Kevin Baichoo <kbaichoo@google.com>

alyssawilk assigned snowp Aug 26, 2021

alyssawilk reviewed Aug 26, 2021

More descriptive overload action name.

58fc13f

Signed-off-by: Kevin Baichoo <kbaichoo@google.com>

snowp suggested changes Aug 27, 2021

snowp added the waiting label Aug 27, 2021

Updated docs.

bdce94b

Signed-off-by: Kevin Baichoo <kbaichoo@google.com>

repokitteh-read-only Bot removed the waiting label Aug 27, 2021

KBaichoo added 2 commits August 30, 2021 13:51

Use StatName pool in the worker factory, share StatName with workers

28ed3a0

Signed-off-by: Kevin Baichoo <kbaichoo@google.com>

Save reference to underlying stat counter instead of StatNames.

3318336

Signed-off-by: Kevin Baichoo <kbaichoo@google.com>

snowp suggested changes Aug 31, 2021

snowp added the waiting label Aug 31, 2021

Additional doc improvements.

7ecf8f3

Signed-off-by: Kevin Baichoo <kbaichoo@google.com>

repokitteh-read-only Bot removed the waiting label Aug 31, 2021

KBaichoo commented Aug 31, 2021

snowp approved these changes Sep 1, 2021

snowp merged commit 7bf466c into envoyproxy:main Sep 1, 2021
