HDDS-13397. Fix liveness probe failures for httpfs and s3g pods #8775

roach231428 · 2025-07-10T07:52:25Z

What changes were proposed in this pull request?

HDDS-13397. Fix liveness probe failures for httpfs and s3g pods in Kubernetes deployment

This PR addresses persistent restart issues for the httpfs-0 and s3g-0 pods in the Apache Ozone 2.0.0 Kubernetes deployment, caused by liveness probe failures.

Changes proposed:

Fix httpfs pod impersonation failure:
- Added the following entries to core-site.xml via the config map:
```
CORE-SITE.XML_hadoop.proxyuser.hadoop.groups: '*'
CORE-SITE.XML_hadoop.proxyuser.hadoop.hosts: '*'
```
- This resolves the impersonation authorization error "User: hadoop is not allowed to impersonate hadoop", which previously caused HTTP 500 errors from the httpfs liveness endpoint.

Fix s3g pod liveness probe misconfiguration:
- Replaced the HTTP-based livenessProbe with a tcpSocket check in s3g-statefulset.yaml.
- The previous HTTP probe sent an unauthenticated request to the `/` endpoint, but s3g requires AWS V4 signed requests, so the probe received a 403 error:
```
<Code>InvalidRequest</Code>
<Message>Error creating s3 auth info...</Message>
```
- Switching to a TCP socket probe on port 9878 avoids invalid HTTP probing and allows the container to start successfully.
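
For reference, the TCP socket probe described above could look roughly like the fragment below. This is a minimal sketch of the container-level `livenessProbe` block, not the exact diff: only the probe type and port 9878 come from the description, and timing fields such as `initialDelaySeconds` are left out rather than guessed.

```yaml
# Sketch only: fragment of the s3g container spec in s3g-statefulset.yaml.
# The surrounding container definition is omitted.
livenessProbe:
  tcpSocket:
    port: 9878   # S3 gateway port; the check passes once the port accepts connections
```

A TCP check only confirms that the port is open, so it gives weaker assurance than an HTTP check that the gateway can actually serve requests.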

These changes resolve the liveness failures and allow both pods to reach and maintain a healthy state.

What is the link to the Apache JIRA

[HDDS-13397] S3 and HttpFS gateways are failing on deployment from Kubernetes resource examples

How was this patch tested?

Applied the changes to a test Kubernetes cluster running Ozone 2.0.0 and verified that:
- httpfs-0 passes its liveness probe with the updated impersonation configs
- s3g-0 no longer restarts and remains in the Running state with the TCP liveness check

Confirmed via `kubectl describe pod` and container logs that the restart loops ceased and the liveness checks succeeded.

peterxcli

Thanks for working on the fix.
I guess this is related to https://issues.apache.org/jira/browse/HDDS-13399?

roach231428 · 2025-07-10T08:03:19Z

> Thanks for working on the fix. I guess this is related to https://issues.apache.org/jira/browse/HDDS-13399?

Yes, thank you for pointing this out.

adoroszlai · 2025-07-10T08:04:31Z

I think HDDS-13397 is for the Kubernetes examples, and HDDS-13399 for the Helm chart.

roach231428 · 2025-07-10T08:15:44Z

> I think HDDS-13397 is for the Kubernetes examples, and HDDS-13399 for the Helm chart.

Yes, the description has been modified.

roach231428 · 2025-07-10T08:17:33Z

I understand that using a TCP socket for the health check is more of a workaround. I'm definitely open to better solutions if anyone has suggestions.

ivandika3 · 2025-07-10T08:38:15Z

@roach231428 Thanks for the patch.

My suggestion is to keep the HTTP GET probe, but change the port to the admin port. A GET probe should give higher confidence that the HTTP server has started and can serve requests. I haven't tested it yet, though.

    livenessProbe:
      httpGet:
        path: /
        port: 19878

roach231428 · 2025-07-10T09:03:30Z

> @roach231428 Thanks for the patch.
>
> My suggestion is to keep the HTTP GET probe, but change the port to the admin port. A GET probe should give higher confidence that the HTTP server has started and can serve requests. I haven't tested it yet, though.
>
>     livenessProbe:
>       httpGet:
>         path: /
>         port: 19878

Thanks for the suggestion. I've tested it with the admin port and the HTTP GET probe works well. The change has been submitted accordingly.

ivandika3

Thanks for the update. LGTM +1.

peterxcli

Thanks @roach231428! LGTM.

dombizita

Thanks for fixing this @roach231428!

adoroszlai · 2025-07-10T13:34:01Z

Thanks @roach231428 for the patch and reporting the problem, @dombizita, @ivandika3, @peterxcli for the review.

Fix: fix the error on healthy check in k8s

1882f73

peterxcli reviewed Jul 10, 2025

adoroszlai changed the title ~~HDDS-?????. Fix liveness probe failures for httpfs and s3g pods in Kubernetes deployment~~ HDDS-13397. Fix liveness probe failures for httpfs and s3g pods Jul 10, 2025

roach231428 marked this pull request as ready for review July 10, 2025 08:03

Change the live probe of s3g from TCP to HTTP

cc8cbfc

ivandika3 approved these changes Jul 10, 2025

ivandika3 requested a review from dombizita July 10, 2025 09:16

peterxcli approved these changes Jul 10, 2025

dombizita approved these changes Jul 10, 2025

adoroszlai merged commit 1a2ab3d into apache:master Jul 10, 2025
13 checks passed

jojochuang pushed a commit to jojochuang/ozone that referenced this pull request Jul 31, 2025

HDDS-13397. Fix liveness probe failures for httpfs and s3g pods (apac…

d632229

…he#8775)
