Skip to content

runtime: do not share runtime snapshot between worker threads#18601

Closed
lkpdn wants to merge 1 commit intoenvoyproxy:mainfrom
lkpdn:eliminate-contention-on-runtime-feature-lookup
Closed

runtime: do not share runtime snapshot between worker threads#18601
lkpdn wants to merge 1 commit intoenvoyproxy:mainfrom
lkpdn:eliminate-contention-on-runtime-feature-lookup

Conversation

@lkpdn
Copy link
Copy Markdown

@lkpdn lkpdn commented Oct 13, 2021

Commit Message:
Currently, every worker thread accesses the shared_ptr of the runtime snapshot. When the data plane frequently needs to do runtime feature checking under heavy load, the resource contention becomes a scaling bottleneck. The created snapshot does not need to be shared in that way. This patch eliminates the scaling bottleneck.

An alternative way to resolve the contention may be to change LoaderImpl::threadsafeSnapshot so that it never touch refcount of the original shared_ptr to a new snapshot, and returns a plain reference instead of SnapshotConstSharedPtr. (#18485 (comment) )

This is a fllow-up PR of #18485.

Ref: #18600
Additional Description: n/a
Risk Level:
Testing:
Docs Changes:
Release Notes:
Platform Specific Features:
[Optional Runtime guard:]
[Optional Fixes #Issue]
[Optional Fixes commit #PR or SHA]
[Optional Deprecated:]
[Optional API Considerations:]

Currently, every worker thread accesses the shared_ptr of the runtime
snapshot. When the data plane frequently needs to do runtime feature
checking under heavy load, the resource contention becomes a scaling
bottleneck. The created snapshot does not need to be shared in that
way. This patch eliminates the scaling bottleneck.

Signed-off-by: Koichiro Den <den@valinux.co.jp>
@repokitteh-read-only
Copy link
Copy Markdown

Hi @lkpdn, welcome and thank you for your contribution.

We will try to review your Pull Request as quickly as possible.

In the meantime, please take a look at the contribution guidelines if you have not done so already.

🐱

Caused by: #18601 was opened by lkpdn.

see: more, trace.

@lkpdn
Copy link
Copy Markdown
Author

lkpdn commented Oct 13, 2021

To see why this PR was factored out from #18485, please refer to the original thread #18485 (comment).

Copy link
Copy Markdown
Member

@rojkov rojkov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! Added a couple of comments.

static void parseEntryFractionalPercentValue(Entry& entry);

const std::vector<OverrideLayerConstPtr> layers_;
const std::shared_ptr<std::vector<OverrideLayerConstPtr>> layers_;
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
const std::shared_ptr<std::vector<OverrideLayerConstPtr>> layers_;
const std::shared_ptr<const std::vector<OverrideLayerConstPtr>> layers_;

Also since values_ do not mutate outside the constructor it would be nice to make them const too and to set them in the initializer list (with a private static method).

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for pointing this out!

std::shared_ptr<SnapshotImpl> ptr = createNewSnapshot();
tls_->set([ptr](Event::Dispatcher&) -> ThreadLocal::ThreadLocalObjectSharedPtr {
return std::static_pointer_cast<ThreadLocal::ThreadLocalObject>(ptr);
return std::make_shared<SnapshotImpl>(ptr.get());
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So, now instead of one shared snapshot we create and keep a copy of values_ for every worker to avoid contention on incref'ing a shared_ptr. This should work, but seems suboptimal memory-wise and adds latency upon creating a new snapshot.

Could you also try @mattklein123 's suggestion to return a reference to the content of the TLS slot, not a copy of shared_ptr, and check if it works? After looking at the flamegraph (btw, thanks! it helps a lot) I think this could be achieved by changing Loader's interface:

-  virtual SnapshotConstSharedPtr threadsafeSnapshot() PURE;
+  virtual const Snapshot& threadsafeSnapshot() PURE;

and its implementation accordingly. Exactly like you described in the original PR.

This blog post explains the RCU-like semantics.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah ^ is what I was suggesting we try just to make sure we understand the problem. I think many/all callsites never store the shared_ptr and just reference the snapshot. This might be a very nice perf win across the board.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@rojkov @mattklein123 Thank you so much! I'll try and check the resulting performance in detail later.

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems the PR lost tracking?

@github-actions
Copy link
Copy Markdown

This pull request has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed in 7 days if no further activity occurs. Please feel free to give a status update now, ping for review, or re-open when it's ready. Thank you for your contributions!

@github-actions github-actions Bot added the stale stalebot believes this issue/PR has not been touched recently label Nov 14, 2021
@github-actions
Copy link
Copy Markdown

This pull request has been automatically closed because it has not had activity in the last 37 days. Please feel free to give a status update now, ping for review, or re-open when it's ready. Thank you for your contributions!

@github-actions github-actions Bot closed this Nov 21, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

stale stalebot believes this issue/PR has not been touched recently waiting

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants