ARROW-15892: [C++] Dataset APIs require s3:ListBucket Permissions #12701
Conversation
@sanjibansg I'm not sure it will be easy to create an automated S3 test with limited permissions, so we may simply punt on that. @westonpace What do you think?

@pitrou We have precedent for it here: arrow/python/pyarrow/tests/test_fs.py Line 242 in 64560af
Force-pushed from fc6ca39 to 2179a5f
westonpace left a comment
There are some lint errors:
/arrow/python/pyarrow/tests/test_fs.py:23:1: F401 're' imported but unused
/arrow/python/pyarrow/tests/test_fs.py:24:1: F401 'subprocess' imported but unused
/arrow/python/pyarrow/tests/test_fs.py:25:1: F401 'sys' imported but unused
/arrow/python/pyarrow/tests/test_fs.py:26:1: F401 'time' imported but unused
/arrow/python/pyarrow/tests/test_dataset.py:28:1: F401 'venv.create' imported but unused
python/pyarrow/tests/test_dataset.py
This test doesn't do any partitioning?
I don't think we can use Hive or Directory partitioning here, since we are keeping create_dir as False. However, we can use Filename partitioning. While trying that, though, I noticed there may be an issue with Filename partitioning: the values of the field being partitioned on are not retrieved correctly, and were all returned as null. I am investigating the cause; if it's a simple fix, I can maybe push it in this PR.
Hmm, if we can't do anything other than filename partitioning, then is it worth fixing this issue? @westonpace What do you think?
As for the filename partitioning bug, better to file a separate JIRA IMHO.
The original ask was for a case that didn't have any partitioning. If an S3 user has partitioning, then they shouldn't run into the original issue, because all CreateDir calls will be for bucket + path. So this flag only enables the very specific case where partitioning is not used.
That being said, I think hive & directory partitioning should still work with create_dir=False if you are using S3 or a similar filesystem that does not require directories to be created in advance. So it still has some general value for users who want to avoid creating marker directories in their S3 repository.
We could also solve this problem by modifying s3fs.cc so that it doesn't try to create the bucket if it already exists. This would add a BucketExists call to the path here.
I see. So should we add a test to check that it works anyway?
I have modified the test to use Hive partitioning with existing_data_behavior set to 'overwrite_or_ignore'. But I'm not sure why the Python CI tests are failing with the ChildProcessError.
Force-pushed from cf39a2e to cfadd60
@sanjibansg please rebase to make the builds pass
Force-pushed from c7efb30 to 03e838b
python/pyarrow/tests/util.py
Can this actually happen since we ran _ensure_minio_component_version above?
Sorry, I do not understand. Are you asking whether the FileNotFoundError will even be raised, since we ran _ensure_minio_component_version?
Indeed, that's my question.
Also I now notice you've removed the calls to _ensure_minio_component_version, is that intended? @westonpace What is your take on this?
Oh, sorry, I misunderstood this comment: #12701 (comment)
I am bringing back the _ensure_minio_component_version function, and we can maybe remove the try-except block, since the error should not be raised if _ensure_minio_component_version ran successfully.
Ok, can you explicitly skip the test when the Minio checks fail?
The test is now skipped explicitly if the check fails.
…amed as _configure_s3_limited_user
Force-pushed from 71ec6a7 to e181b9d
review: refactored to start try block from _wait_for_minio_startup
Well, my bad: it seems the
Yes, the Windows issue should be fixed now; moved that line of code to the correct place. Modified the
@github-actions crossbow submit -g python
Revision: 0d6e0fe. Submitted crossbow builds: ursacomputing/crossbow @ actions-1953
Benchmark runs are scheduled for baseline = b397d17 and contender = 3592f98. 3592f98 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
This PR adds a boolean flag which can be used to avoid creating directories with the Dataset API for filesystems like S3.