vhds: add support for VHDS in a static route-configuration#43852

Open
adisuissa wants to merge 8 commits into envoyproxy:main from adisuissa:vhds_on_static_route

@adisuissa
Contributor

Commit Message: vhds: add support for VHDS in a static route-configuration
Additional Description:
Prior to this work, VHDS did not work when used in a static RouteConfiguration (one that is not fetched via RDS).
This PR enables this feature.
It is inspired by the code from source/common/router/rds_impl.cc.

Risk Level: low - things that were configured and did not work will now work as expected.
Testing: Added unit and integration tests.
Docs Changes: N/A
Release Notes: Added.
Platform Specific Features: N/A

Signed-off-by: Adi Suissa-Peleg <adip@google.com>
Signed-off-by: Adi Suissa-Peleg <adip@google.com>
@repokitteh-read-only

As a reminder, PRs marked as draft will not be automatically assigned reviewers,
or be handled by maintainer-oncall triage.

Please mark your PR as ready when you want it to be reviewed!

🐱

Caused by: #43852 was opened by adisuissa.

see: more, trace.

@adisuissa adisuissa marked this pull request as ready for review March 9, 2026 19:49
@adisuissa
Contributor Author

Requires senior maintainer review, assigning Yan.
/assign @yanavlasov

Member

@botengyao botengyao left a comment

LGTM modulo two questions. Most of the logic is similar to rds_impl.

/wait

// Emulate a config-update information gathering using a dynamic RouteConfigurationReceiver.
config_update_info_ = std::make_unique<RouteConfigUpdateReceiverImpl>(
route_config_provider_manager.protoTraits(), factory_context);
config_update_info_->onRdsUpdate(config, "");

Member

Question: if a static route config with VHDS is used, does this change the validate_clusters behavior between static and RDS? https://www.envoyproxy.io/docs/envoy/latest/api-v3/config/route/v3/route.proto.html

Contributor Author

Sorry, I don't think I fully understand the question.
The behavior should be as configured, that is, if validate_clusters is true, then the clusters are validated, and if not, they are not validated.
This validate_clusters should define the behavior of the route-configuration, regardless of whether vhds is configured or not.

Member

@botengyao botengyao Mar 10, 2026

Yes, it is configurable. I was looking at the difference between static and RDS (with AI's help), and I meant the change in default behavior: by emulating the static config as an RDS one, will the default change? E.g., static+VHDS will now not validate clusters if the field is unset.

Contributor Author

I thought that RDS+VHDS will only use the configured knob. Can you point me to where the default value is different?

Member

false /* not validate unknown cluster */);

Specifically the line above, which is hard-coded.

And here

// is selected at runtime. This setting defaults to true if the route table
// is statically defined via the :ref:`route_config
// <envoy_v3_api_field_extensions.filters.network.http_connection_manager.v3.HttpConnectionManager.route_config>`
// option. This setting default to false if the route table is loaded dynamically via the
You have better context here and I am just confused; I wonder whether we need to call this out.
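
The default-resolution rule discussed in this comment can be sketched as follows. This is only an illustration under stated assumptions, not Envoy's implementation: `RouteSource` and `effectiveValidateClusters` are hypothetical names, and the defaults (true for a static route table, false for a dynamically loaded one) are taken from the proto comment quoted above.

```cpp
#include <cassert>
#include <optional>

// Hypothetical enum, for illustration only: how the route table was delivered.
enum class RouteSource { Static, Rds };

// Resolves the effective validate_clusters value. An explicitly configured
// value always wins. When the field is unset, the default depends on the
// source, per the quoted proto comment: static -> true, dynamic -> false.
inline bool effectiveValidateClusters(std::optional<bool> configured, RouteSource source) {
  if (configured.has_value()) {
    return *configured;
  }
  return source == RouteSource::Static;
}

// Usage sketch: the unset case is where static and RDS diverge.
inline void exampleUsage() {
  assert(effectiveValidateClusters(std::nullopt, RouteSource::Static));
  assert(!effectiveValidateClusters(std::nullopt, RouteSource::Rds));
  assert(effectiveValidateClusters(true, RouteSource::Rds));
}
```

Under these assumptions, the concern raised in this thread is that a static route config processed through an RDS-style receiver with a hard-coded `false` would take the dynamic branch when the field is unset.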

Contributor Author

thanks for the reference.
I believe this was added in #20577, and I don't have full context on why it was allowed in the first place. @JuniorHsu do you have any context on why there's no cluster validation?

Contributor

I have a hunch that my previous patch didn't take virtual hosts into account. To be honest, I'm unsure whether the route level will eventually check.

validate_clusters is meant to prevent an SR drop for unknown clusters. We do use virtual hosts, but I'm unsure whether we have an SR drop; at least I haven't heard such reports from others.

}
// If no VHDS is configured, immediately notify the callback that the
// virtual-host doesn't exist.
auto current_cb = route_config_updated_cb.lock();

Member

should we put the weak_ptr lock inside the post?

Contributor Author

I think current_cb is used in both the if statement, and inside the dispatched lambda.
Maybe I'm missing something in your comment, so feel free to clarify.

Member

Even though there is no VHDS, the pointer is upgraded to a shared_ptr, which means the posted lambda can keep the callback alive even if the stream/filter is destroyed before the dispatcher runs it. This seems different from the existing RDS/VHDS pattern:

thread_local_dispatcher.post([route_config_updated_cb] {
  if (auto cb = route_config_updated_cb.lock()) {
    (*cb)(false);
  }
});
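
The lifetime difference described in this comment can be reproduced in isolation. The sketch below is not Envoy code: `FakeDispatcher`, `postWeak`, `postStrong`, and `invocationsAfterOwnerReset` are hypothetical names, and the dispatcher is a plain task queue standing in for a worker's event loop. It compares capturing the `weak_ptr` and locking inside the task against locking first and capturing the resulting `shared_ptr`.

```cpp
#include <cassert>
#include <functional>
#include <memory>
#include <utility>
#include <vector>

// Hypothetical stand-in for a dispatcher: post() queues a task, run() drains the queue.
class FakeDispatcher {
public:
  void post(std::function<void()> task) { tasks_.push_back(std::move(task)); }
  void run() {
    // Move the queue out first so a task could safely post further tasks.
    std::vector<std::function<void()>> pending;
    pending.swap(tasks_);
    for (auto& task : pending) {
      task();
    }
  }

private:
  std::vector<std::function<void()>> tasks_;
};

using Callback = std::function<void(bool)>;

// Weak capture: the task locks only when it runs, so a callback that was
// released in the meantime is skipped.
inline void postWeak(FakeDispatcher& dispatcher, std::weak_ptr<Callback> weak_cb) {
  dispatcher.post([weak_cb] {
    if (auto cb = weak_cb.lock()) {
      (*cb)(false);
    }
  });
}

// Strong capture: locking before posting stores a shared_ptr in the task,
// which keeps the callback alive until the task has run.
inline void postStrong(FakeDispatcher& dispatcher, std::weak_ptr<Callback> weak_cb) {
  auto cb = weak_cb.lock();
  if (cb == nullptr) {
    return;
  }
  dispatcher.post([cb] { (*cb)(false); });
}

// Counts callback invocations when the owner releases the callback after the
// task is posted but before the dispatcher runs it.
inline int invocationsAfterOwnerReset(bool strong) {
  FakeDispatcher dispatcher;
  int calls = 0;
  auto owner = std::make_shared<Callback>([&calls](bool) { ++calls; });
  if (strong) {
    postStrong(dispatcher, owner);
  } else {
    postWeak(dispatcher, owner);
  }
  owner.reset(); // Models the owner (e.g. a filter) being destroyed.
  dispatcher.run();
  return calls;
}

inline void exampleUsage() {
  assert(invocationsAfterOwnerReset(/*strong=*/false) == 0);
  assert(invocationsAfterOwnerReset(/*strong=*/true) == 1);
}
```

With the weak capture the callback is never invoked after its owner releases it; with the strong capture it still runs once. Which behavior is desirable depends on what the callback touches after its owner is gone, which is the question this thread is debating.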

Contributor Author

When I looked at the code I tried to avoid a case where the callback is removed while it is posted to the worker thread. Specifically I don't think that posting an element that will then be out of scope is a good idea.

Can you point me to the lines of code that you mentioned in your comment?
I may be able to give a better answer if I see exactly what you are comparing against.

Member

https://github.com/envoyproxy/envoy/pull/43852/changes#diff-0b8eafdf25d4ec6552ec090d37226126154d88e2f753efa9987775895690f4b7R121-R126 and

std::weak_ptr<Http::RouteConfigUpdatedCallback> current_cb(it->cb_);
it->thread_local_dispatcher_.post([current_cb, host_exists] {
  if (auto cb = current_cb.lock()) {
    (*cb)(host_exists);
  }
});

Here are the pointers. This callback can come from the on_demand filter: it was a shared_ptr before, a weak_ptr is passed to the provider, and the shared_ptr is reset when the on-demand filter is destroyed.

Contributor Author

Thanks for the reference from the original code which allows me to better understand your question.

Regarding the approach, I thought that acquiring the weak_ptr lock turns it into a shared pointer, which keeps the object alive longer (otherwise there would be a race when the on-demand filter is destroyed).

I think this was observed in #9784 and the fix in #11341 is trying to fix the issue by resetting the pointer on destroy, but I'm not sure if that's the right way to check it.
I'll add a similar test to 'VhdsOnDemandUpdateHttpConnectionCloses' that was added in #11341 and run this with tsan.
I'm willing to change this, I just think that the original fix might not be safe.

Signed-off-by: Adi Suissa-Peleg <adip@google.com>
- source/common/grpc/google_grpc_creds_impl.cc
- source/server/drain_manager_impl.cc
- source/common/router/rds_impl.cc
- source/common/router/static_route_provider_impl.cc

Contributor Author

I've added a TODO to clean the usage up, and I'll remove it in a near-future PR (requires conversion of c'tors to create functions that I think should be done in a separate PR to avoid a large refactor as part of this PR).
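
The constructor-to-create-function conversion mentioned here can be sketched generically. This is an assumption-laden illustration rather than the planned refactor: `ProviderSketch` and its non-empty-name check are hypothetical, and `std::variant` is used only to keep the sketch dependency-free (Envoy code commonly returns `absl::StatusOr` from such functions).

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <variant>

// Hypothetical provider type, for illustration only.
class ProviderSketch {
public:
  // Holds either the constructed provider or an error message.
  using CreateResult = std::variant<std::unique_ptr<ProviderSketch>, std::string>;

  // Validation happens here and is reported as a value, so the private
  // constructor never needs to throw.
  static CreateResult create(const std::string& route_config_name) {
    if (route_config_name.empty()) {
      return std::string("route config name must not be empty");
    }
    return std::unique_ptr<ProviderSketch>(new ProviderSketch(route_config_name));
  }

  const std::string& name() const { return name_; }

private:
  explicit ProviderSketch(const std::string& name) : name_(name) {}
  const std::string name_;
};

inline void exampleUsage() {
  auto ok = ProviderSketch::create("static_routes");
  assert(std::holds_alternative<std::unique_ptr<ProviderSketch>>(ok));
  auto failed = ProviderSketch::create("");
  assert(std::holds_alternative<std::string>(failed));
}
```

Callers inspect the result before use, so a bad configuration surfaces as a returned error instead of an exception thrown from a constructor.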

@adisuissa
Contributor Author

I'm taking a step back here.
I tried taking the on-demand-VHDS integration tests and parameterizing them to work with both dynamic RDS and static routes. I think there are a few gaps that need to be fixed before we can say that VHDS over static routes works well.
Converting to draft.

@adisuissa adisuissa marked this pull request as draft March 10, 2026 18:21
Signed-off-by: Adi Suissa-Peleg <adip@google.com>
Signed-off-by: Adi Suissa-Peleg <adip@google.com>
Signed-off-by: Adi Suissa-Peleg <adip@google.com>
Signed-off-by: Adi Suissa-Peleg <adip@google.com>
@adisuissa adisuissa marked this pull request as ready for review March 27, 2026 16:04
Signed-off-by: Adi Suissa-Peleg <adip@google.com>
@tonya11en
Member

@yanavlasov PTAL

@yanavlasov
Contributor

/gemini review

Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request implements Virtual Host Discovery Service (VHDS) support for static route configurations, allowing them to handle on-demand subscriptions and updates. The implementation introduces a VhdsContext within StaticRouteConfigProviderImpl to manage thread-local configurations and subscription logic. The changes also include new unit tests and the enablement of static route types in VHDS integration tests. Review feedback identifies a thread-safety issue in requestVirtualHostsUpdate where configuration data is accessed across threads and suggests refactoring the constructor into a factory method to improve error handling and avoid exceptions.

Comment on lines +141 to +142
auto alias = VhdsSubscription::domainNameToAlias(
config_update_info_->protobufConfigurationCast().name(), for_domain);

Contributor

high

Accessing config_update_info_ on a worker thread is not thread-safe because it is modified on the main thread during onConfigUpdate. Since the route configuration name is static for this provider, it should be captured during construction and stored as a member of VhdsContext to avoid this race condition.

  auto alias = VhdsSubscription::domainNameToAlias(route_config_name_, for_domain);

Contributor

I think this code is safe, since protobufConfigurationCast() is immutable. However, it would be much safer if const correctness were reflected in the type of config_update_info_, by having it point to a const object. This would make it clear that the object can be used across threads. Maybe file an issue to track this if a quick follow-up is not possible?

std::weak_ptr<Http::RouteConfigUpdatedCallback> cb_;
};

RouteConfigUpdatePtr config_update_info_;

Contributor

high

To fix the thread-safety issue in requestVirtualHostsUpdate, store the route configuration name as a member variable.

Suggested change
- RouteConfigUpdatePtr config_update_info_;
+ const std::string route_config_name_;
+ RouteConfigUpdatePtr config_update_info_;

Contributor

Alternatively make it const, so it is clear that the pointer itself is immutable.

@envoyproxy envoyproxy deleted a comment from gemini-code-assist Bot Apr 10, 2026
route_config_provider_manager_(route_config_provider_manager) {}
route_config_provider_manager_(route_config_provider_manager) {
if (config.has_vhds()) {
vhds_context_ = std::make_unique<VhdsContext>(config, factory_context, *this,

Contributor

I think this can be moved into the initializer list.
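
One way to express this suggestion is a conditional expression in the member initializer list. The sketch below uses hypothetical stand-in types (`ConfigSketch`, `VhdsStateSketch`, `StaticProviderSketch`); the real `VhdsContext` constructor takes more arguments, including `*this`, which this sketch does not model.

```cpp
#include <cassert>
#include <memory>

// Hypothetical stand-ins, for illustration only.
struct ConfigSketch {
  bool has_vhds;
};
struct VhdsStateSketch {};

class StaticProviderSketch {
public:
  // The optional member is built (or left null) directly in the initializer
  // list, which also allows the member itself to be declared const.
  explicit StaticProviderSketch(const ConfigSketch& config)
      : vhds_state_(config.has_vhds ? std::make_unique<VhdsStateSketch>() : nullptr) {}

  bool hasVhds() const { return vhds_state_ != nullptr; }

private:
  const std::unique_ptr<VhdsStateSketch> vhds_state_;
};

inline void exampleUsage() {
  assert(StaticProviderSketch(ConfigSketch{true}).hasVhds());
  assert(!StaticProviderSketch(ConfigSketch{false}).hasVhds());
}
```

The conditional compiles because `nullptr` converts implicitly to `std::unique_ptr`, so both branches yield the same type.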

}
// If no VHDS is configured, immediately notify the callback that the
// virtual-host doesn't exist.
thread_local_dispatcher.post([current_cb = route_config_updated_cb] {

Contributor

Does this need to be dispatched via the event loop?

Rds::RouteConfigProviderManager& route_config_provider_manager)
: factory_context_(factory_context), tls_(factory_context.threadLocal()) {
// Emulate a config-update information gathering using a dynamic RouteConfigurationReceiver.
config_update_info_ = std::make_unique<RouteConfigUpdateReceiverImpl>(

Contributor

To make thread safety clear here, I suggest making both the pointer and the pointee const. Gemini is correct to point out that this is a weak area, even though I think it is safe right now.
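
The "const pointer to const pointee" idiom suggested here can be sketched in isolation. `RouteConfigInfoSketch` and `VhdsContextSketch` are hypothetical, simplified stand-ins; whether the real `config_update_info_` can be made fully const depends on how the PR updates it, which this sketch does not model.

```cpp
#include <cassert>
#include <memory>
#include <string>

// Hypothetical, simplified stand-in for the data read across threads.
struct RouteConfigInfoSketch {
  std::string name;
};

class VhdsContextSketch {
public:
  explicit VhdsContextSketch(const std::string& name)
      : config_info_(std::make_unique<const RouteConfigInfoSketch>(RouteConfigInfoSketch{name})) {}

  // Read-only access: nothing reachable through config_info_ can change after
  // construction, so concurrent readers do not race with a writer.
  const std::string& routeConfigName() const { return config_info_->name; }

private:
  // const pointer (cannot be reseated) to a const object (cannot be mutated
  // through this pointer).
  const std::unique_ptr<const RouteConfigInfoSketch> config_info_;
};

inline void exampleUsage() {
  VhdsContextSketch context("static_routes");
  assert(context.routeConfigName() == "static_routes");
}
```

With this declaration, an accidental `config_info_.reset(...)` or `config_info_->name = ...` fails to compile, so the immutability that the review relies on is enforced by the type rather than by convention.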

@yanavlasov
Contributor

/wait
