Add per shard metric for redis proxy#10

Closed
gavin-jeong wants to merge 13 commits into release/v1.35.6-sendbird-custom from support_per_shard_redis_metric

Conversation

@gavin-jeong

Commit Message:
Additional Description:
Risk Level:
Testing:
Docs Changes:
Release Notes:
Platform Specific Features:
[Optional Runtime guard:]
[Optional Fixes #Issue]
[Optional Fixes commit #PR or SHA]
[Optional Deprecated:]
[Optional API Considerations:]

dlunch and others added 11 commits November 12, 2025 17:06
Apply consistent code formatting in trace test files.
Signed-off-by: Chanhun Jeong <keyolk@gmail.com>
This commit introduces QUIC/HTTP3 keylog functionality in Envoy, enabling generation of NSS Key Log Format files for Wireshark and other debugging tools.

- Keylog callback registration in OnNewSslCtx()
- Implementation of EnvoyQuicProofSource::setupQuicKeylogCallback() and quicKeylogCallback()
- TLS context–based keylog configuration with per–filter chain caching and thread safety
- Address filtering via local/remote IP lists
- Fallback to SSLKEYLOGFILE environment variable for compatibility with existing workflows
- QuicKeylogBridge integration with Envoy’s existing TLS keylog infrastructure
- RawBufferSocket fallback fix in QuicServerTransportSocketFactory::createDownstreamTransportSocket()
- Comprehensive unit tests including edge cases

Signed-off-by: Chanhun Jeong <keyolk@gmail.com>
Protect all async callbacks from accessing deallocated cluster members
during destruction by adding is_destroying_ atomic flag checks.

Affected callbacks:
- ClusterRefreshManager callbacks
- DNS resolution callbacks
- Connection event callbacks
- Timer callbacks
- Redis client response callbacks (onResponse, onFailure, onUnexpectedResponse)
- Hostname resolution callbacks

The race condition occurred when callbacks were already queued in the
event loop when cluster destruction began, causing use-after-free access
to parent cluster members like info_, redis_discovery_session_, and
resolve_timer_.

All callbacks now check is_destroying_ with memory_order_acquire before
accessing any parent members, ensuring safe termination during destruction.

Fixes segfaults that occurred when removing Redis service entries.

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
…ter destruction

Problem: Segmentation faults occur when accessing member pointers in async
callbacks during Redis cluster destruction, even with is_destroying_ flag
checks. This happens because there's a race window between checking the
flag and accessing the pointers.

Solution: Add defensive null checks for all pointer accesses that could
become invalid during destruction:

1. ClusterInfo pointer (info_):
   - Add null checks before all configUpdateStats() calls
   - Use safe access pattern for name() in log statements
   - Locations: startResolveRedis(), updateDnsStats(), DNS callbacks,
     onResponse(), onUnexpectedResponse(), onFailure()

2. DNS Resolver pointer (dns_resolver_):
   - Add null checks in startResolveDns()
   - Add checks in resolveClusterHostnames() and resolveReplicas()
   - Prevents crashes when DNS resolution is initiated during teardown

3. Timer pointer (resolve_timer_):
   - Add null checks before enableTimer() calls
   - Locations: finishClusterHostnameResolution(), onResponse(),
     onUnexpectedResponse(), onFailure()

4. Consistency fix:
   - Line 714: Changed parent_.info() to parent_.info_ to match
     null-checked pattern used elsewhere

The pattern applied throughout:
1. Check is_destroying_ flag with memory_order_acquire
2. Verify each pointer is non-null before dereferencing
3. This dual-check handles the race window safely

This prevents use-after-free crashes during Redis cluster teardown when
async callbacks execute after partial destruction has begun.
The previous fix with null checks still had a race condition window between
checking the pointer and using it. Even with the null check, the shared_ptr
could be reset to null by another thread between the check and use.

Solution: Make local copies of shared_ptr before use. This ensures the
pointer remains valid throughout its usage in the current scope.

Changes:
1. startResolveRedis(): Copy info_ to local variable before use
2. updateDnsStats(): Use local copy of info_
3. DNS callbacks: Use local copy for stats updates
4. onResponse(), onUnexpectedResponse(), onFailure(): Use local copies
5. client_factory_.create(): Check and use local copy of info_

The pattern applied:
  auto info = parent_.info_;  // Make local copy (ref count++)
  if (!info) {                // Check if null
    return;
  }
  info->method();             // Safe to use - won't become null

This prevents the crash at line 376 where info_ was becoming null
between the check and the access, even with memory_order_acquire.
…ing timer callbacks

The 5% crash rate was caused by timer callbacks executing after the
RedisDiscoverySession was destroyed. Even though we checked is_destroying_,
there was a race where:

1. Timer callback fires and enters the lambda
2. Destructor runs and deletes the session (unique_ptr reset)
3. Callback tries to access parent_.is_destroying_ → CRASH (use-after-free)

Solution:
- Move timer creation from constructor to initialize() method
- Capture shared_from_this() in timer lambda instead of raw 'this'
- Call initialize() after RedisDiscoverySession construction completes

This ensures the session object stays alive as long as any timer callback
is queued or executing, preventing the use-after-free.

Pattern changed from:
  resolve_timer_ = dispatcher_.createTimer([this]() { ... });

To:
  auto self = shared_from_this();
  resolve_timer_ = dispatcher_.createTimer([self]() { ... });

This should eliminate the remaining 5% crash rate during Redis cluster
destruction.
…nt reference

CRITICAL FIX: The previous approach had a fatal flaw - callbacks with
shared_from_this() kept the session alive, but the session holds a
reference to the parent RedisCluster. When the parent was destroyed,
accessing parent_.is_destroying_ became use-after-free.

The race condition:
1. Timer callback fires with shared_ptr<Session> (session kept alive)
2. RedisCluster destructor runs and completes
3. Callback tries to check parent_.is_destroying_
4. CRASH - parent object destroyed, reference is dangling

Solution:
- Add parent_destroyed_ atomic flag IN THE SESSION
- Parent sets this flag BEFORE destroying session
- Callbacks check session-owned flag, never access parent directly
- Also simplify all safety checks into helper methods

This is the correct fix for the 5% crash rate when removing Redis services.
@gavin-jeong force-pushed the support_per_shard_redis_metric branch from aa5bf0b to 25c305c on November 26, 2025 23:20
@gavin-jeong force-pushed the support_per_shard_redis_metric branch from 4be1cfb to 42fee0e on November 27, 2025 23:23
@gavin-jeong force-pushed the support_per_shard_redis_metric branch from 42fee0e to f1acdf5 on November 28, 2025 02:31
Base automatically changed from update_to_1_35_6 to release/v1.35.6-sendbird-custom January 6, 2026 00:50
bellatoris pushed a commit that referenced this pull request Jan 15, 2026
…voyproxy#42554)

## Description

Today, when a filesystem watch callback returns a non-OK status or
throws an exception, the error gets propagated to `FileEventImpl` which
uses `THROW_IF_NOT_OK`.

Since there's no exception handler in the `libevent` loop, this causes
`std::terminate` to be called, which crashes Envoy.

**Stack Trace:**
```
Dec 11 00:11:26 dbletE9433T node-envoy[234999]: [2025-12-11 00:11:26.119][234999][warning][misc] [source/common/protobuf/message_validator_impl.cc:23] Deprecated field: type envoy.config.core.v3.HeaderValueOption Using deprecated option 'envoy.config.core.v3.HeaderValueOption.append' from file base.proto. This configuration will be removed from Envoy soon. Please see https://www.envoyproxy.io/docs/envoy/latest/version_history/version_history for details. If continued use of this field is absolutely necessary, see https://www.envoyproxy.io/docs/envoy/latest/configuration/operations/runtime#using-runtime-overrides-for-deprecated-features for how to apply a temporary and highly discouraged override.
Dec 11 00:11:26 dbletE9433T node-envoy[234999]: [2025-12-11 00:11:26.120][234999][info][upstream] [source/common/listener_manager/lds_api.cc:109] lds: add/update listener '0_listener'
Dec 11 00:11:26 dbletE9433T node-envoy[234999]: [2025-12-11 00:11:26.123][234999][info][upstream] [source/common/listener_manager/lds_api.cc:109] lds: add/update listener '1_listener'
Dec 11 00:11:26 dbletE9433T node-envoy[234999]: [2025-12-11 00:11:26.126][234999][info][upstream] [source/common/listener_manager/lds_api.cc:109] lds: add/update listener '2_listener'
Dec 11 00:11:26 dbletE9433T node-envoy[234999]: [2025-12-11 00:11:26.127][234999][info][upstream] [source/common/listener_manager/lds_api.cc:109] lds: add/update listener '3_listener'
Dec 11 00:11:26 dbletE9433T node-envoy[234999]: [2025-12-11 00:11:26.128][234999][info][upstream] [source/common/listener_manager/lds_api.cc:109] lds: add/update listener '4_listener'
Dec 11 00:11:26 dbletE9433T node-envoy[234999]: [2025-12-11 00:11:26.130][234999][info][upstream] [source/common/listener_manager/lds_api.cc:109] lds: add/update listener '5_listener'
Dec 11 00:11:26 dbletE9433T node-envoy[234999]: [2025-12-11 00:11:26.132][234999][info][upstream] [source/common/listener_manager/lds_api.cc:109] lds: add/update listener '6_listener'
Dec 11 00:11:26 dbletE9433T node-envoy[234999]: [2025-12-11 00:11:26.134][234999][info][upstream] [source/common/listener_manager/lds_api.cc:109] lds: add/update listener 'mtls_untrusted_regional_transparent_tunnel_listener'
Dec 11 00:11:26 dbletE9433T node-envoy[234999]: [2025-12-11 00:11:26.135][234999][info][upstream] [source/common/listener_manager/lds_api.cc:109] lds: add/update listener 'mtls_app_trusted_regional_transparent_tunnel_listener'
Dec 11 00:11:30 dbletE9433T node-envoy[234999]: [2025-12-11 00:11:30.097][234999][critical][main] [source/exe/terminate_handler.cc:36] std::terminate called! Uncaught unknown exception, see trace.
Dec 11 00:11:30 dbletE9433T node-envoy[234999]: [2025-12-11 00:11:30.097][234999][critical][backtrace] [./source/server/backtrace.h:113] Backtrace (use tools/stack_decode.py to get line numbers):
Dec 11 00:11:30 dbletE9433T node-envoy[234999]: [2025-12-11 00:11:30.097][234999][critical][backtrace] [./source/server/backtrace.h:114] Envoy version: 5eaabe0bbaad4612cb85473cd151039d8f1a2760/1.34.2-dev/Clean/RELEASE/BoringSSL
Dec 11 00:11:30 dbletE9433T node-envoy[234999]: [2025-12-11 00:11:30.097][234999][critical][backtrace] [./source/server/backtrace.h:116] Address mapping: 558d8afcc000-558d8ee2f000 /usr/local/bin/envoy
Dec 11 00:11:30 dbletE9433T node-envoy[234999]: [2025-12-11 00:11:30.100][234999][critical][backtrace] [./source/server/backtrace.h:123] #0: [0x558d8da5784f]
Dec 11 00:11:30 dbletE9433T node-envoy[234999]: [2025-12-11 00:11:30.102][234999][critical][backtrace] [./source/server/backtrace.h:123] #1: [0x558d8edd8673]
Dec 11 00:11:30 dbletE9433T node-envoy[234999]: [2025-12-11 00:11:30.104][234999][critical][backtrace] [./source/server/backtrace.h:123] #2: [0x558d8e3b120b]
Dec 11 00:11:30 dbletE9433T node-envoy[234999]: [2025-12-11 00:11:30.106][234999][critical][backtrace] [./source/server/backtrace.h:121] #3: Envoy::Filesystem::WatcherImpl::onInotifyEvent() [0x558d8e3990c3]
Dec 11 00:11:30 dbletE9433T node-envoy[234999]: [2025-12-11 00:11:30.108][234999][critical][backtrace] [./source/server/backtrace.h:123] #4: [0x558d8e3998d2]
Dec 11 00:11:30 dbletE9433T node-envoy[234999]: [2025-12-11 00:11:30.109][234999][critical][backtrace] [./source/server/backtrace.h:123] #5: [0x558d8e393de6]
Dec 11 00:11:30 dbletE9433T node-envoy[234999]: [2025-12-11 00:11:30.111][234999][critical][backtrace] [./source/server/backtrace.h:121] #6: Envoy::Event::FileEventImpl::mergeInjectedEventsAndRunCb() [0x558d8e394eb5]
Dec 11 00:11:30 dbletE9433T node-envoy[234999]: [2025-12-11 00:11:30.113][234999][critical][backtrace] [./source/server/backtrace.h:123] #7: [0x558d8e710823]
Dec 11 00:11:30 dbletE9433T node-envoy[234999]: [2025-12-11 00:11:30.115][234999][critical][backtrace] [./source/server/backtrace.h:121] #8: event_base_loop [0x558d8e70d4a1]
Dec 11 00:11:30 dbletE9433T node-envoy[234999]: [2025-12-11 00:11:30.117][234999][critical][backtrace] [./source/server/backtrace.h:121] #9: Envoy::Server::InstanceBase::run() [0x558d8daa2b99]
Dec 11 00:11:30 dbletE9433T node-envoy[234999]: [2025-12-11 00:11:30.119][234999][critical][backtrace] [./source/server/backtrace.h:121] #10: Envoy::MainCommonBase::run() [0x558d8da4327a]
Dec 11 00:11:30 dbletE9433T node-envoy[234999]: [2025-12-11 00:11:30.121][234999][critical][backtrace] [./source/server/backtrace.h:121] #11: Envoy::MainCommon::main() [0x558d8da44234]
Dec 11 00:11:30 dbletE9433T node-envoy[234999]: [2025-12-11 00:11:30.123][234999][critical][backtrace] [./source/server/backtrace.h:121] #12: main [0x558d8afcc11c]
Dec 11 00:11:30 dbletE9433T node-envoy[234999]: [2025-12-11 00:11:30.123][234999][critical][backtrace] [./source/server/backtrace.h:123] #13: [0x7f1d54073efb]
Dec 11 00:11:30 dbletE9433T node-envoy[234999]: [2025-12-11 00:11:30.123][234999][critical][backtrace] [./source/server/backtrace.h:121] #14: __libc_start_main [0x7f1d54073fbb]
Dec 11 00:11:30 dbletE9433T node-envoy[234999]: [2025-12-11 00:11:30.124][234999][critical][backtrace] [./source/server/backtrace.h:121] #15: _start [0x558d8afcc02e]
Dec 11 00:11:30 dbletE9433T node-envoy[234999]: [2025-12-11 00:11:30.124][234999][critical][backtrace] [./source/server/backtrace.h:129] Caught Aborted, suspect faulting address 0x395f7
Dec 11 00:11:30 dbletE9433T node-envoy[234999]: [2025-12-11 00:11:30.124][234999][critical][backtrace] [./source/server/backtrace.h:113] Backtrace (use tools/stack_decode.py to get line numbers):
Dec 11 00:11:30 dbletE9433T node-envoy[234999]: [2025-12-11 00:11:30.124][234999][critical][backtrace] [./source/server/backtrace.h:114] Envoy version: 5eaabe0bbaad4612cb85473cd151039d8f1a2760/1.34.2-dev/Clean/RELEASE/BoringSSL
Dec 11 00:11:30 dbletE9433T node-envoy[234999]: [2025-12-11 00:11:30.124][234999][critical][backtrace] [./source/server/backtrace.h:116] Address mapping: 558d8afcc000-558d8ee2f000 /usr/local/bin/envoy
Dec 11 00:11:30 dbletE9433T node-envoy[234999]: [2025-12-11 00:11:30.124][234999][critical][backtrace] [./source/server/backtrace.h:123] #0: [0x7f1d54089c90]
Dec 11 00:11:30 dbletE9433T node-envoy[234999]: [2025-12-11 00:11:30.124][234999][critical][backtrace] [./source/server/backtrace.h:121] #1: gsignal [0x7f1d54089bde]
Dec 11 00:11:30 dbletE9433T node-envoy[234999]: [2025-12-11 00:11:30.124][234999][critical][backtrace] [./source/server/backtrace.h:121] #2: abort [0x7f1d54072832]
Dec 11 00:11:30 dbletE9433T node-envoy[234999]: [2025-12-11 00:11:30.126][234999][critical][backtrace] [./source/server/backtrace.h:123] #3: [0x558d8da5785c]
Dec 11 00:11:30 dbletE9433T node-envoy[234999]: [2025-12-11 00:11:30.128][234999][critical][backtrace] [./source/server/backtrace.h:123] #4: [0x558d8edd8673]
Dec 11 00:11:30 dbletE9433T node-envoy[234999]: [2025-12-11 00:11:30.129][234999][critical][backtrace] [./source/server/backtrace.h:123] #5: [0x558d8e3b120b]
Dec 11 00:11:30 dbletE9433T node-envoy[234999]: [2025-12-11 00:11:30.129][234999][critical][backtrace] [./source/server/backtrace.h:121] #6: Envoy::Filesystem::WatcherImpl::onInotifyEvent() [0x558d8e3990c3]
Dec 11 00:11:30 dbletE9433T node-envoy[234999]: [2025-12-11 00:11:30.131][234999][critical][backtrace] [./source/server/backtrace.h:123] #7: [0x558d8e3998d2]
Dec 11 00:11:30 dbletE9433T node-envoy[234999]: [2025-12-11 00:11:30.133][234999][critical][backtrace] [./source/server/backtrace.h:123] #8: [0x558d8e393de6]
Dec 11 00:11:30 dbletE9433T node-envoy[234999]: [2025-12-11 00:11:30.133][234999][critical][backtrace] [./source/server/backtrace.h:121] #9: Envoy::Event::FileEventImpl::mergeInjectedEventsAndRunCb() [0x558d8e394eb5]
Dec 11 00:11:30 dbletE9433T node-envoy[234999]: [2025-12-11 00:11:30.135][234999][critical][backtrace] [./source/server/backtrace.h:123] #10: [0x558d8e710823]
Dec 11 00:11:30 dbletE9433T node-envoy[234999]: [2025-12-11 00:11:30.135][234999][critical][backtrace] [./source/server/backtrace.h:121] #11: event_base_loop [0x558d8e70d4a1]
Dec 11 00:11:30 dbletE9433T node-envoy[234999]: [2025-12-11 00:11:30.135][234999][critical][backtrace] [./source/server/backtrace.h:121] #12: Envoy::Server::InstanceBase::run() [0x558d8daa2b99]
Dec 11 00:11:30 dbletE9433T node-envoy[234999]: [2025-12-11 00:11:30.135][234999][critical][backtrace] [./source/server/backtrace.h:121] #13: Envoy::MainCommonBase::run() [0x558d8da4327a]
Dec 11 00:11:30 dbletE9433T node-envoy[234999]: [2025-12-11 00:11:30.135][234999][critical][backtrace] [./source/server/backtrace.h:121] #14: Envoy::MainCommon::main() [0x558d8da44234]
Dec 11 00:11:30 dbletE9433T node-envoy[234999]: [2025-12-11 00:11:30.135][234999][critical][backtrace] [./source/server/backtrace.h:121] #15: main [0x558d8afcc11c]
Dec 11 00:11:30 dbletE9433T node-envoy[234999]: [2025-12-11 00:11:30.135][234999][critical][backtrace] [./source/server/backtrace.h:123] #16: [0x7f1d54073efb]
Dec 11 00:11:30 dbletE9433T node-envoy[234999]: [2025-12-11 00:11:30.135][234999][critical][backtrace] [./source/server/backtrace.h:121] #17: __libc_start_main [0x7f1d54073fbb]
Dec 11 00:11:30 dbletE9433T node-envoy[234999]: [2025-12-11 00:11:30.135][234999][critical][backtrace] [./source/server/backtrace.h:121] #18: _start [0x558d8afcc02e]
```

In this change, we make the `inotify` and `kqueue` watchers handle
callback errors gracefully: exceptions are caught with
`TRY_ASSERT_MAIN_THREAD`, errors are logged instead of propagated, and
`OkStatus` is always returned to the event loop.

---

**Commit Message:** filesystem: Fix crash when watch callback returns
error or throws
**Additional Description:** Make `inotify` and `kqueue` watchers handle
callback errors gracefully.
**Risk Level:** Low
**Testing:** CI
**Docs Changes:** N/A
**Release Notes:** N/A

---------

Signed-off-by: Rohit Agrawal <rohit.agrawal@salesforce.com>
Signed-off-by: Rohit Agrawal <rohit.agrawal@databricks.com>

6 participants