@roach231428
Contributor

@roach231428 roach231428 commented Jul 10, 2025

Fixes #8659

What changes were proposed in this pull request?

HDDS-13397. Fix liveness probe failures for httpfs and s3g pods in Kubernetes deployment

This PR addresses persistent restart issues for the httpfs-0 and s3g-0 pods in the Apache Ozone 2.0.0 Kubernetes deployment, caused by liveness probe failures.

Changes proposed:

  1. Fix httpfs pod impersonation failure:

    • Added the following entries to core-site.xml via the config map:

      CORE-SITE.XML_hadoop.proxyuser.hadoop.groups: '*'
      CORE-SITE.XML_hadoop.proxyuser.hadoop.hosts: '*'
      
    • This resolves the impersonation authorization error:
      "User: hadoop is not allowed to impersonate hadoop" which previously caused HTTP 500 errors from the httpfs liveness endpoint.

  2. Fix s3g pod liveness probe misconfiguration:

    • Replaced the HTTP-based livenessProbe with a tcpSocket check in s3g-statefulset.yaml.

    • The previous HTTP probe expected an unauthenticated / endpoint, but s3g requires AWS V4 signed requests, leading to a 403 error:

      <Code>InvalidRequest</Code>
      <Message>Error creating s3 auth info...</Message>
      
    • Switching to a TCP socket probe on port 9878 avoids invalid HTTP probing and allows the container to start successfully.
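
A minimal sketch of what such a TCP socket probe could look like in s3g-statefulset.yaml. Only the probe type and port 9878 come from the description above; the timing fields are illustrative assumptions, not values taken from the patch.

```yaml
# Sketch only: a TCP liveness probe for the s3g container.
livenessProbe:
  tcpSocket:
    port: 9878               # s3g port named in the description above
  initialDelaySeconds: 30    # assumed value, tune for your cluster
  periodSeconds: 10          # assumed value
```

A TCP probe only confirms that the port accepts connections; it does not check that the gateway can actually serve an HTTP request.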

These changes resolve the liveness failures and allow both pods to reach and maintain a healthy state.
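
For the first change, the two proxy-user entries would sit in the `data` section of the ConfigMap that feeds the pods. The fragment below is a sketch under assumptions: the ConfigMap name `config` is a guess at the name used by the Kubernetes examples, and only the two `CORE-SITE.XML_` keys come from this PR.

```yaml
# Sketch only: the ConfigMap name is an assumption, not taken from the patch.
apiVersion: v1
kind: ConfigMap
metadata:
  name: config
data:
  # Allow the "hadoop" user to impersonate members of any group...
  CORE-SITE.XML_hadoop.proxyuser.hadoop.groups: '*'
  # ...for requests coming from any host.
  CORE-SITE.XML_hadoop.proxyuser.hadoop.hosts: '*'
```

The `'*'` wildcards are permissive. That is reasonable for example manifests, but a production deployment would probably want to narrow both values.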


What is the link to the Apache JIRA

[HDDS-13397] S3 and HttpFS gateways are failing on deployment from Kubernetes resource examples


How was this patch tested?

  • Applied changes to a test Kubernetes cluster running Ozone 2.0.0

  • Verified that:

    • httpfs-0 passes liveness probe with updated impersonation configs
    • s3g-0 no longer restarts and remains in Running state with TCP liveness check
  • Confirmed via kubectl describe pod and container logs that restart loops ceased and liveness checks succeeded

Member

@peterxcli peterxcli left a comment

Thanks for working on the fix.
I guess this is related to https://issues.apache.org/jira/browse/HDDS-13399?

@adoroszlai changed the title from "HDDS-?????. Fix liveness probe failures for httpfs and s3g pods in Kubernetes deployment" to "HDDS-13397. Fix liveness probe failures for httpfs and s3g pods" on Jul 10, 2025
@roach231428
Contributor Author

> Thanks for working on the fix. I guess this is related to https://issues.apache.org/jira/browse/HDDS-13399?

Yes, thank you for pointing this out.

@roach231428 roach231428 marked this pull request as ready for review July 10, 2025 08:03
@adoroszlai
Contributor

I think HDDS-13397 is for the Kubernetes examples, and HDDS-13399 for the Helm chart.

@roach231428
Contributor Author

> I think HDDS-13397 is for the Kubernetes examples, and HDDS-13399 for the Helm chart.

Yes, the description has been modified.

@roach231428
Contributor Author

I understand that using a TCP socket for the health check is more of a workaround. I'm definitely open to better solutions if anyone has suggestions.

@ivandika3
Contributor

ivandika3 commented Jul 10, 2025

@roach231428 Thanks for the patch.

My suggestion is to keep the HTTP GET probe, but change the port to the admin port. A GET probe should give higher confidence that the HTTP server has started and can serve some requests. I haven't tested it yet though.

    livenessProbe:
      httpGet:
        path: /
        port: 19878

@roach231428
Contributor Author

> @roach231428 Thanks for the patch.
>
> My suggestion is to keep the HTTP GET probe, but change the port to the admin port. A GET probe should give higher confidence that the HTTP server has started and can serve some requests. I haven't tested it yet though.
>
>     livenessProbe:
>       httpGet:
>         path: /
>         port: 19878

Thanks for the suggestion. I’ve tested it with the admin port and the HTTP GET probe works well. The change has been submitted accordingly.
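
For context, a hedged sketch of how an HTTP GET probe on the admin port might sit inside the s3g container spec. Only the probe path and ports 9878 and 19878 come from this thread; the container name, port names, and timing values are illustrative assumptions and may differ from the merged manifest.

```yaml
# Sketch only: names and timings are assumptions, not the merged manifest.
containers:
  - name: s3g
    ports:
      - containerPort: 9878
        name: rest           # hypothetical port name
      - containerPort: 19878
        name: admin          # hypothetical port name
    livenessProbe:
      httpGet:
        path: /
        port: 19878          # admin port, which does not need signed S3 requests
      initialDelaySeconds: 30  # assumed value
```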

Contributor

@ivandika3 ivandika3 left a comment

Thanks for the update. LGTM +1.

@ivandika3 ivandika3 requested a review from dombizita July 10, 2025 09:16
Member

@peterxcli peterxcli left a comment

Thanks @roach231428! LGTM.

Contributor

@dombizita dombizita left a comment

Thanks for fixing this @roach231428!

@adoroszlai adoroszlai merged commit 1a2ab3d into apache:master Jul 10, 2025
13 checks passed
@adoroszlai
Contributor

adoroszlai commented Jul 10, 2025

Thanks @roach231428 for the patch and reporting the problem, @dombizita, @ivandika3, @peterxcli for the review.

jojochuang pushed a commit to jojochuang/ozone that referenced this pull request Jul 31, 2025