YARN-11919 segfault with get_user_info #8177
base: trunk
Conversation
💔 -1 overall
This message was automatically generated.
I retriggered pre-submit now that we have the build fix in #8166 committed. https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-8177/2/
cnauroth left a comment:
@edwardcapriolo , this is interesting. Thanks for reporting it. Do you know if this is because sysconf(_SC_GETPW_R_SIZE_MAX) doesn't give a defined max buffer size on Alpine? If so, then I guess every other OS we've tried in the past has it defined and we didn't know.
@cnauroth I am not much of a C expert. As you know, it is not Java-like: instead of a stack trace, you get a segfault. I did try to get a debugger on it, but building the code with debug symbols and running GDB in a container got complicated, and I am quickly out of my depth. My thought is that the math we do to compute the buffer size isn't in accordance with the documentation I found; if you compare the two pieces of code you can see why. It could also be some other issue. The container is Alpine but my host is Fedora, and I don't know what happens to the sysconf call in that case.
💔 -1 overall
This message was automatically generated.
Is there any way to get more details out of this?
[INFO] --- hadoops:3.5.0-SNAPSHOT:cmake-test (test-container-executor) @ hadoop-yarn-server-nodemanager ---
[INFO] -------------------------------------------------------
[INFO] C M A K E B U I L D E R T E S T
[INFO] -------------------------------------------------------
[INFO] test-container-executor: running /home/jenkins/jenkins-home/workspace/hadoop-multibranch_PR-8177/rockylinux-8/src/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/target/native/target/usr/local/bin/test-container-executor
[INFO] with extra environment variables {}
[INFO] STATUS: ERROR CODE 1 after 18 millisecond(s).
[INFO] -------------------------------------------------------
[INFO] ------------------------------------------------------------------------
OK, I have it sorted: https://man7.org/linux/man-pages/man3/getpwnam.3p.html Both examples are wrong. You need to detect the ERANGE error and grow the buffer. Yikes.
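To make that concrete, here is a minimal sketch of the grow-and-retry recipe the man page describes. The function name `lookup_user` and the doubling strategy are mine, not the patch's; the `sysconf` fallback is an assumption for systems (like Alpine's musl) where `_SC_GETPW_R_SIZE_MAX` may return -1:

```c
#include <errno.h>
#include <pwd.h>
#include <stdlib.h>
#include <sys/types.h>
#include <unistd.h>

/* Look up a user, growing the buffer on ERANGE as POSIX recommends.
 * On success returns a pointer whose string fields live in *buf_out,
 * which the caller must free. Returns NULL on error or user-not-found. */
struct passwd *lookup_user(const char *user, struct passwd *pwd,
                           char **buf_out) {
  long hint = sysconf(_SC_GETPW_R_SIZE_MAX);
  size_t bufsize = (hint > 0) ? (size_t)hint : 1024; /* fallback if undefined */
  char *buf = malloc(bufsize);
  if (buf == NULL) return NULL;
  struct passwd *result = NULL;
  int s;
  while ((s = getpwnam_r(user, pwd, buf, bufsize, &result)) == ERANGE) {
    bufsize *= 2; /* buffer too small: grow and retry */
    char *tmp = realloc(buf, bufsize);
    if (tmp == NULL) { free(buf); return NULL; }
    buf = tmp;
  }
  if (s != 0 || result == NULL) { free(buf); return NULL; }
  *buf_out = buf;
  return result;
}
```

Note that `getpwnam_r` returns 0 with `result == NULL` when the user simply does not exist, which is a different case from an error return.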
Also, the way the code is written, I'm not sure it is correct; I think it only works via a quirk. See src/main/native/container-executor/impl/container-executor.c (//struct to store the user details). The documentation is not super clear, but the passwd struct has pointers into the buffer, and some docs seem to indicate those pointers may be reused or cleared by later calls.
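The quirk can be illustrated without touching the real API. This toy (`fake_passwd` and `fake_lookup` are hypothetical stand-ins, not Hadoop code) mimics how `getpwnam_r` points the struct's fields into the caller-supplied buffer rather than copying them:

```c
#include <string.h>

/* Stand-in for struct passwd: holds a pointer, not an owned string. */
struct fake_passwd { char *pw_name; };

/* Simulates a lookup that writes the user name into the caller's buffer
 * and points the struct at it, the way getpwnam_r does. */
void fake_lookup(const char *name, struct fake_passwd *pwd, char *buf) {
  strcpy(buf, name);
  pwd->pw_name = buf; /* pointer INTO buf, not a copy */
}
```

Calling `fake_lookup` twice with the same buffer makes the first struct's `pw_name` silently change to the second user's name, which is exactly the corruption described above.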
Interestingly, what I believe I have found is that you must deep copy these objects. What I observe is that repeated calls to the method change the objects; something about the buffer sizing in the old code wasn't hitting the edge case. For confirmation I searched for "multiple calls to getpwnam_r to create an array of users", and the results matched exactly the conclusion I had come to: we need to deep copy this object, because the second call to the method alters the state of the first struct. I should have a fix in the next day or so.
Force-pushed b02531c to bdb5d65 (Compare)
Here is my final summary of the issue. IMHO, the code as it is in master amazingly works only in limited contexts. Here is why:

while ((s = getpwnam_r(user, &pwd, buf, bufsize, &result)) == ERANGE){

This is the proper way to use this function: it may return ERANGE, which means the buffer is not big enough and you need to keep retrying with a larger one. Next, the big problem: the passwd struct has pointers to buffers that can be recycled by other calls to getpwnam_r, so the global object could be corrupted by further calls.

//struct to store the user details
-struct passwd *user_detail = NULL;
+struct serialized_passwd *user_detail = NULL;

This was effectively the root error I originally observed when I tried to take this to Alpine. The implementation of passwd is sufficiently different there that it exposed the problem above. Please review, @cnauroth and other people who are skilled with C/C++. Thanks.
💔 -1 overall
This message was automatically generated.
Force-pushed bdb5d65 to f730460 (Compare)
🎊 +1 overall
This message was automatically generated.
@cnauroth A reminder on this one. I am digging through the C code quite intensely, and I believe I have found another C pointer-level issue. Because this code is the gatekeeper of secure Hadoop, we do not want dangerous undefined behavior running about.
@ericbadger @brumi1024 Could you take a look? Unfortunately, I have found you through git blame :)
//struct to store the user details
-struct passwd *user_detail = NULL;
+struct serialized_passwd *user_detail = NULL;
The passwd struct contains pointers into buffers that may already have been freed, making the behavior undefined. This struct is created to insulate us from the passwd struct and give us better control of the memory.
//function to make a deep clone of passwd to serialized_passwd
void deep_copy_passwd(const struct passwd *src, struct serialized_passwd *dest){
We duplicate the object using strdup for all string fields; the integer fields are fine to copy by value.
dest->pw_gid = src->pw_gid;
}
void free_serialized_passwd(struct serialized_passwd * passwd){
Because we allocate dynamic memory for this struct, we must free it. This gives users a simple, clean destructor-like method.
if (buf == NULL) {
    exit(EXIT_FAILURE);
}
while ((s = getpwnam_r(user, &pwd, buf, bufsize, &result)) == ERANGE){
The large flaw in the existing code: this call returns ERANGE when the buffer is too small, and this loop is the accepted recipe to continually resize it. In practice I never observed the loop run more than once.
//extern struct passwd *user_detail;
extern struct section executor_cfg;
struct serialized_passwd {
Here we include only the fields we need, limiting the scope and size of the structure.
Description of PR
The Linux container executor isn't portable to Alpine Linux due to the code that gets the passwd info for local users.
https://github.com/edwardcapriolo/hadoop/pull/new/YARN-11919
How was this patch tested?
Unit tests and direct calls to lce under specific conditions.
For code changes:
If applicable, have you updated the LICENSE, LICENSE-binary, NOTICE-binary files?
AI Tooling
AI was not used