HDDS-13397. Fix liveness probe failures for httpfs and s3g pods #8775
peterxcli
left a comment
Thanks for working on the fix.
I guess this is related to https://issues.apache.org/jira/browse/HDDS-13399?
Yes, thank you for pointing this out.

I think HDDS-13397 is for the Kubernetes examples, and HDDS-13399 is for the Helm chart.
Yes, the description has been modified.

I understand that using a TCP socket for the health check is more of a workaround. I'm definitely open to better solutions if anyone has suggestions.

@roach231428 Thanks for the patch. My suggestion is to keep the HTTP GET probe but change the port to the admin port. A GET probe should give higher confidence that the HTTP server has started and can serve requests. I haven't tested it yet, though.
Thanks for the suggestion. I've tested it with the admin port, and the HTTP GET probe works well. The change has been submitted accordingly.
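For readers following the thread, an HTTP GET liveness probe pointed at the admin port could look roughly like the fragment below. This is only a sketch, not the submitted change: port 19878 is, as far as I know, the default S3 Gateway web admin port (`ozone.s3g.webadmin.http-address`), and the path and timing values are assumptions chosen for illustration.

```yaml
# Sketch of an s3g container liveness probe using HTTP GET on the admin port.
# Assumptions: admin port 19878 (believed to be the Ozone default),
# path "/" and the timing values are illustrative, not taken from the patch.
livenessProbe:
  httpGet:
    path: /
    port: 19878
  initialDelaySeconds: 30
  periodSeconds: 10
```

Compared with a TCP check, this only passes once the HTTP server actually answers a request, which is the extra confidence the suggestion was after.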
ivandika3
left a comment
Thanks for the update. LGTM +1.
peterxcli
left a comment
Thanks @roach231428! LGTM.
dombizita
left a comment
Thanks for fixing this @roach231428!
Thanks @roach231428 for the patch and for reporting the problem, and @dombizita, @ivandika3, @peterxcli for the review.
Fixes #8659
What changes were proposed in this pull request?
HDDS-13397. Fix liveness probe failures for httpfs and s3g pods in Kubernetes deployment
This PR addresses persistent restart issues for the httpfs-0 and s3g-0 pods in the Apache Ozone 2.0.0 Kubernetes deployment, caused by liveness probe failures.

Changes proposed:

Fix httpfs pod impersonation failure:
Added impersonation entries to core-site.xml via the config map.
This resolves the impersonation authorization error "User: hadoop is not allowed to impersonate hadoop", which previously caused HTTP 500 errors from the httpfs liveness endpoint.

Fix s3g pod liveness probe misconfiguration:
Replaced the HTTP-based livenessProbe with a tcpSocket check in s3g-statefulset.yaml.
The previous HTTP probe expected an unauthenticated / endpoint, but s3g requires AWS V4 signed requests, leading to a 403 error.
Switching to a TCP socket probe on port 9878 avoids invalid HTTP probing and allows the container to start successfully.

These changes resolve the liveness failures and allow both pods to reach and maintain a healthy state.
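The exact core-site.xml entries are not reproduced in this description. As a rough illustration only, the error message quoted above is the one Hadoop raises when proxy-user settings are missing for the `hadoop` user, and a config map fix for it would typically set the standard `hadoop.proxyuser.<user>.hosts` and `hadoop.proxyuser.<user>.groups` properties. The config map name, the `CORE-SITE.XML_` key prefix, and the wildcard values below are assumptions, not the entries from the patch.

```yaml
# Sketch only: hypothetical config map enabling impersonation for user "hadoop".
# The name, key prefix, and "*" values are assumptions for illustration.
apiVersion: v1
kind: ConfigMap
metadata:
  name: config
data:
  CORE-SITE.XML_hadoop.proxyuser.hadoop.hosts: "*"
  CORE-SITE.XML_hadoop.proxyuser.hadoop.groups: "*"
```

Wildcards are the most permissive choice and may be too broad outside an example deployment.

The TCP check described for s3g could be sketched as follows. Port 9878 comes from the description; the timing values are assumptions. Note that the review thread later settled on an HTTP GET probe against the admin port instead.

```yaml
# Sketch of a tcpSocket liveness probe for the s3g container.
# Port 9878 is from the PR description; timing values are illustrative.
livenessProbe:
  tcpSocket:
    port: 9878
  initialDelaySeconds: 30
  periodSeconds: 10
```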
What is the link to the Apache JIRA
[HDDS-13397] S3 and HttpFS gateways are failing on deployment from Kubernetes resource examples
How was this patch tested?
Applied changes to a test Kubernetes cluster running Ozone 2.0.0
Verified that:
httpfs-0 passes liveness probe with updated impersonation configs
s3g-0 no longer restarts and remains in Running state with TCP liveness check
Confirmed via kubectl describe pod and container logs that restart loops ceased and liveness checks succeeded