tls: expose detailed certificate validation failure reasons in access logs#42420
tls: expose detailed certificate validation failure reasons in access logs#42420ggreenway merged 7 commits intoenvoyproxy:mainfrom prashanthjos:main
Conversation
… logs Previously, when TLS certificate validation failed, Envoy's access logs only showed the generic "CERTIFICATE_VERIFY_FAILED" error code without actionable details about why the validation failed. This made debugging TLS issues difficult, especially in production environments where debug logging is not feasible. This change propagates the detailed certificate validation error (e.g., "certificate has expired", "unable to get local issuer certificate", "SAN matcher failed") from the DefaultCertValidator through SslExtendedSocketInfo to the transport failure reason string that appears in access logs. The detailed error is appended to the existing failure_reason_ when SSL_R_CERTIFICATE_VERIFY_FAILED is encountered, resulting in access log output like: TLS_error:|268435581:SSL:CERTIFICATE_VERIFY_FAILED:verify cert failed: X509_verify_cert: certificate verification error at depth 0: certificate has expired:TLS_error_end This information is exposed via the existing access log formatters: - %UPSTREAM_TRANSPORT_FAILURE_REASON% (for upstream/cluster connections) - %DOWNSTREAM_TRANSPORT_FAILURE_REASON% (for downstream/listener mTLS) Signed-off-by: Prashanth Josyula <prashanth.chaitanya@salesforce.com>
… logs Previously, when TLS certificate validation failed, Envoy's access logs only showed the generic "CERTIFICATE_VERIFY_FAILED" error code without actionable details about why the validation failed. This made debugging TLS issues difficult, especially in production environments where debug logging is not feasible. This change propagates the detailed certificate validation error (e.g., "certificate has expired", "unable to get local issuer certificate", "SAN matcher failed") from the DefaultCertValidator through SslExtendedSocketInfo to the transport failure reason string that appears in access logs. The detailed error is appended to the existing failure_reason_ when SSL_R_CERTIFICATE_VERIFY_FAILED is encountered, resulting in access log output like: TLS_error:|268435581:SSL:CERTIFICATE_VERIFY_FAILED:verify cert failed: X509_verify_cert: certificate verification error at depth 0: certificate has expired:TLS_error_end This information is exposed via the existing access log formatters: - %UPSTREAM_TRANSPORT_FAILURE_REASON% (for upstream/cluster connections) - %DOWNSTREAM_TRANSPORT_FAILURE_REASON% (for downstream/listener mTLS) Signed-off-by: Prashanth Josyula <prashanth.chaitanya@salesforce.com>
|
/retest |
…gs only showed the generic "CERTIFICATE_VERIFY_FAILED" error code without actionable details about why the validation failed. This made debugging TLS issues difficult, especially in production environments where debug logging is not feasible. This change propagates the detailed certificate validation error (e.g., "certificate has expired", "unable to get local issuer certificate", "SAN matcher failed") from the DefaultCertValidator through SslExtendedSocketInfo to the transport failure reason string that appears in access logs. The detailed error is appended to the existing failure_reason_ when SSL_R_CERTIFICATE_VERIFY_FAILED is encountered, resulting in access log output like: TLS_error:|268435581:SSL:CERTIFICATE_VERIFY_FAILED:verify cert failed: X509_verify_cert: certificate verification error at depth 0: certificate has expired:TLS_error_end This information is exposed via the existing access log formatters: - %UPSTREAM_TRANSPORT_FAILURE_REASON% (for upstream/cluster connections) - %DOWNSTREAM_TRANSPORT_FAILURE_REASON% (for downstream/listener mTLS) Signed-off-by: Prashanth Josyula <prashanth.chaitanya@salesforce.com>
|
/retest |
|
@ggreenway @wbpcode can you please take a look when you get sometime. |
ggreenway
left a comment
There was a problem hiding this comment.
This is a really nice improvement. Thanks!
/wait
| while (uint64_t err = ERR_get_error()) { | ||
| if (ERR_GET_LIB(err) == ERR_LIB_SSL) { | ||
| if (ERR_GET_REASON(err) == SSL_R_PEER_DID_NOT_RETURN_A_CERTIFICATE) { | ||
| ctx_->stats().fail_verify_no_cert_.inc(); |
There was a problem hiding this comment.
Is there anything in the details text right now if the peer didn't provide a cert? If not, should we add something here in this case?
| EXPECT_THAT(error_details, testing::HasSubstr("verify cert failed")); | ||
| EXPECT_THAT(error_details, testing::HasSubstr("SAN matcher")); |
There was a problem hiding this comment.
What is the entire string of error_details here? Can this match either the full string or a longer substring?
…gs only showed the generic "CERTIFICATE_VERIFY_FAILED" error code without actionable details about why the validation failed. This made debugging TLS issues difficult, especially in production environments where debug logging is not feasible. This change propagates the detailed certificate validation error (e.g., "certificate has expired", "unable to get local issuer certificate","SAN matcher failed") from the DefaultCertValidator through SslExtendedSocketInfo to the transport failure reason string that appears in access logs. The detailed error is appended to the existing failure_reason_ when SSL_R_CERTIFICATE_VERIFY_FAILED is encountered, resulting in access log output like: TLS_error:|268435581:SSL:CERTIFICATE_VERIFY_FAILED:verify cert failed: X509_verify_cert: certificate verification error at depth 0: certificate has expired:TLS_error_end This information is exposed via the existing access log formatters: - %UPSTREAM_TRANSPORT_FAILURE_REASON% (for upstream/cluster connections) - %DOWNSTREAM_TRANSPORT_FAILURE_REASON% (for downstream/listener mTLS) Signed-off-by: Prashanth Josyula <prashanth.chaitanya@salesforce.com>
|
/retest |
|
@ggreenway please have a relook I have addressed your comments |
| } else if (saw_no_client_cert) { | ||
| absl::StrAppend(&failure_reason_, ":peer did not provide required client certificate"); | ||
| } | ||
| } |
There was a problem hiding this comment.
Can this be simplified as below? Anything we miss trying to remove the repetitive if check
// Append detailed error info for certificate-related failures.
if (saw_cert_verify_failed || saw_no_client_cert) {
auto* extended_socket_info = reinterpret_cast<Envoy::Ssl::SslExtendedSocketInfo*>(
SSL_get_ex_data(rawSsl(), ContextImpl::sslExtendedSocketInfoIndex()));
if (saw_no_client_cert) {
absl::StrAppend(&failure_reason_, ":peer did not provide required client certificate");
} else if (extended_socket_info != nullptr) {
absl::string_view cert_validation_error = extended_socket_info->certificateValidationError();
if (!cert_validation_error.empty()) {
absl::StrAppend(&failure_reason_, ":", cert_validation_error);
}
}
}
There was a problem hiding this comment.
or even
if (saw_no_client_cert) {
absl::StrAppend(&failure_reason_, ":peer did not provide required client certificate");
} else if (saw_cert_verify_failed) {
// Append detailed error info for certificate-related failures.
auto* extended_socket_info = reinterpret_cast<Envoy::Ssl::SslExtendedSocketInfo*>(
SSL_get_ex_data(rawSsl(), ContextImpl::sslExtendedSocketInfoIndex()));
if (extended_socket_info != nullptr) {
absl::string_view cert_validation_error = extended_socket_info->certificateValidationError();
if (!cert_validation_error.empty()) {
absl::StrAppend(&failure_reason_, ":", cert_validation_error);
}
}
}
wbpcode
left a comment
There was a problem hiding this comment.
Although I think it would be better to merge the setCertificateValidationError and setCertificateValidationStatus. But current change also fine to me.
|
Defer this to @ggreenway for further review/approval. And hello, @ggreenway, I see you left a require change, but I didn't see any new comment from you. Do I missed something? |
|
@wbpcode that's weird; github seems to have discarded the comment I left in that review. |
| absl::StrAppend(&failure_reason_, ":peer did not provide required client certificate"); | ||
| } | ||
| } else if (saw_no_client_cert) { | ||
| absl::StrAppend(&failure_reason_, ":peer did not provide required client certificate"); |
There was a problem hiding this comment.
Please refactor the logic so there's only one case that appends this error. Maybe something roughly like this:
bool added_reason = false;
if (saw_cert_verify_failed) {
// checks for null
StrAppend(failure_reason_, "the text");
added_reason = true;
}
if (!added_reason && saw_no_client_cert) {
StrAppend(......);
}
… logs Previously, when TLS certificate validation failed, Envoy's access logs only showed the generic "CERTIFICATE_VERIFY_FAILED" error code without actionable details about why the validation failed. This made debugging TLS issues difficult, especially in production environments where debug logging is not feasible. This change propagates the detailed certificate validation error (e.g., "certificate has expired", "unable to get local issuer certificate", "SAN matcher failed") from the DefaultCertValidator through SslExtendedSocketInfo to the transport failure reason string that appears in access logs. The detailed error is appended to the existing failure_reason_ when SSL_R_CERTIFICATE_VERIFY_FAILED is encountered, resulting in access log output like: TLS_error:|268435581:SSL:CERTIFICATE_VERIFY_FAILED:verify cert failed: X509_verify_cert: certificate verification error at depth 0: certificate has expired:TLS_error_end This information is exposed via the existing access log formatters: - %UPSTREAM_TRANSPORT_FAILURE_REASON% (for upstream/cluster connections) - %DOWNSTREAM_TRANSPORT_FAILURE_REASON% (for downstream/listener mTLS) Signed-off-by: Prashanth Josyula <prashanth.16@gmail.com>
ggreenway
left a comment
There was a problem hiding this comment.
Please add a release note changelogs/current.yaml
/wait
Signed-off-by: Prashanth Josyula <prashanth.16@gmail.com>
Head branch was pushed to by a user without write access
|
There's a merge conflict on the changelog. Please merge main into your PR and resolve. /wait |
Signed-off-by: Prashanth Josyula <prashanth.16@gmail.com>
|
/retest |
… logs (envoyproxy#42420) Propagate detailed certificate validation errors (e.g., "certificate has expired", "unable to get local issuer certificate", "SAN matcher failed") from DefaultCertValidator through SslExtendedSocketInfo to the transport failure reason string in access logs. Previously, when TLS certificate validation failed, access logs only showed the generic CERTIFICATE_VERIFY_FAILED error without actionable details. This made debugging TLS issues difficult in production where debug logging is not feasible. The detailed error is now visible in %UPSTREAM_TRANSPORT_FAILURE_REASON% and %DOWNSTREAM_TRANSPORT_FAILURE_REASON% access log formatters. Fixes envoyproxy#42266 Signed-off-by: Prashanth Josyula <prashanth.chaitanya@salesforce.com> Signed-off-by: MayorFaj <mayorfaj@gmail.com>
… logs (envoyproxy#42420) Propagate detailed certificate validation errors (e.g., "certificate has expired", "unable to get local issuer certificate", "SAN matcher failed") from DefaultCertValidator through SslExtendedSocketInfo to the transport failure reason string in access logs. Previously, when TLS certificate validation failed, access logs only showed the generic CERTIFICATE_VERIFY_FAILED error without actionable details. This made debugging TLS issues difficult in production where debug logging is not feasible. The detailed error is now visible in %UPSTREAM_TRANSPORT_FAILURE_REASON% and %DOWNSTREAM_TRANSPORT_FAILURE_REASON% access log formatters. Fixes envoyproxy#42266 Signed-off-by: Prashanth Josyula <prashanth.chaitanya@salesforce.com> Signed-off-by: Gustavo <grnmeira@gmail.com>
Commit Message:
Propagate detailed certificate validation errors (e.g., "certificate has expired", "unable to get local issuer certificate", "SAN matcher failed") from DefaultCertValidator through SslExtendedSocketInfo to the transport failure reason string in access logs.
Additional Description:
Previously, when TLS certificate validation failed, access logs only showed the generic CERTIFICATE_VERIFY_FAILED error without actionable details. This made debugging TLS issues difficult in production where debug logging is not feasible. This change adds two new methods to SslExtendedSocketInfo (setCertificateValidationError/certificateValidationError) and wires them through:
ContextImpl::customVerifyCallback() stores error_details from ValidationResults
ValidateResultCallbackImpl::onCertValidationResult() stores error_details for async validation
SslSocket::drainErrorQueue() appends the detailed error when SSL_R_CERTIFICATE_VERIFY_FAILED is encountered
The detailed error is now visible in %UPSTREAM_TRANSPORT_FAILURE_REASON% and %DOWNSTREAM_TRANSPORT_FAILURE_REASON% access log formatters.
Risk Level: Low
Testing:
Added unit tests in default_validator_test.cc: TestCertificateValidationErrorDetailsForSanFailure, TestSslExtendedSocketInfoCertValidationError, TestEmptyCertChainErrorDetails
Added unit tests in handshaker_test.cc: SslExtendedSocketInfoCertValidationError, ValidateResultCallbackStoresErrorDetails
All existing TLS tests pass
Docs Changes: Uses existing documented access log formatters
Release Notes: Added detailed certificate validation failure reasons to %UPSTREAM_TRANSPORT_FAILURE_REASON% and %DOWNSTREAM_TRANSPORT_FAILURE_REASON% access log fields. When TLS certificate validation fails, the access log now includes specific error details such as "certificate has expired" or "unable to get local issuer certificate" instead of only the generic CERTIFICATE_VERIFY_FAILED code.
Fixes #42266