ARROW-15892: [C++] Dataset APIs require s3:ListBucket Permissions #12701

sanjibansg · 2022-03-24T05:55:26Z

This PR adds a boolean flag which can be used to avoid creating directories with the Dataset API for filesystems like s3.

Checklist

Implementation to avoid creating directories
s3 filesystem tests with limited permissions (only put object permissions)

github-actions · 2022-03-24T05:55:47Z

https://issues.apache.org/jira/browse/ARROW-15892

pitrou · 2022-03-30T13:37:36Z

@sanjibansg I'm not sure it will be easy to create an automated S3 test with limited permissions, so we may simply punt on that. @westonpace What do you think?

westonpace · 2022-03-30T13:44:56Z

@pitrou We have precedent for it here:

arrow/python/pyarrow/tests/test_fs.py

Line 242 in 64560af

_minio_limited_policy = """{

so I don't think it would be too difficult to add.
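For illustration, a put-only policy in the standard S3/MinIO JSON policy format could look like the sketch below. The contents of `_minio_limited_policy` are not shown in this thread, so the statement here is an assumption, and `allowed_actions` is a hypothetical helper added only to inspect the result.

```python
import json

# Hypothetical policy: the user may upload objects and nothing else.
put_only_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": ["s3:PutObject"],
            "Resource": ["arn:aws:s3:::*"],
        }
    ],
}

policy_json = json.dumps(put_only_policy, indent=2)


def allowed_actions(policy_text):
    """Collect every action granted by the 'Allow' statements of a policy."""
    policy = json.loads(policy_text)
    actions = set()
    for statement in policy["Statement"]:
        if statement["Effect"] == "Allow":
            actions.update(statement["Action"])
    return actions
```

A user restricted this way has neither `s3:ListBucket` nor `s3:CreateBucket`, which is the situation the new flag is meant to support.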

westonpace

There are some lint errors:

/arrow/python/pyarrow/tests/test_fs.py:23:1: F401 're' imported but unused
/arrow/python/pyarrow/tests/test_fs.py:24:1: F401 'subprocess' imported but unused
/arrow/python/pyarrow/tests/test_fs.py:25:1: F401 'sys' imported but unused
/arrow/python/pyarrow/tests/test_fs.py:26:1: F401 'time' imported but unused
/arrow/python/pyarrow/tests/test_dataset.py:28:1: F401 'venv.create' imported but unused

pitrou · 2022-04-19T15:27:46Z

python/pyarrow/tests/test_dataset.py

This test doesn't do any partitioning?

I don't think we can use Hive or Directory partitioning here, as we are keeping create_dir as False. However, we can use Filename Partitioning. While doing that, I noticed there are probably some issues with Filename Partitioning: the values of the field on which the partitioning is done are not retrieved correctly, and all of those values were returned as null. I am investigating the cause; if it's a simple fix, I can maybe push it in this PR.

Hmm, if we can't do anything other than filename partitioning, then is it worth fixing this issue? @westonpace What do you think?

As for the filename partitioning bug, better to file a separate JIRA IMHO.

The original ask was for a case that didn't have any partitioning. If an S3 user has a partitioning then they shouldn't run into the original issue because all CreateDir calls will be for bucket + path. So this flag is only to enable the very specific case where partitioning is not used.

That being said, I think hive & directory partitioning should still work with create_dir=False if you were using s3 or a similar filesystem that did not require directories to be created in advance. So I think it still has some general value if users wanted to avoid creating marker directories in their S3 repository.

We could also solve this problem by modifying s3fs.cc so that it didn't try and create the bucket if it already existed. This would add a BucketExists call to the path here.
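The behaviour discussed above can be modelled with a toy example. This is a sketch under stated assumptions, not Arrow's implementation: `PutOnlyStore` and `write_files` are hypothetical names, and the store simply refuses every directory or bucket creation while accepting uploads.

```python
class PutOnlyStore:
    """Toy object store whose credentials only allow uploading objects."""

    def __init__(self):
        self.objects = {}

    def create_dir(self, path):
        # Creating a bucket or directory needs permissions this user lacks.
        raise PermissionError(f"Access denied while creating {path!r}")

    def put_object(self, path, data):
        # Object stores accept a key without its "directories" existing first.
        self.objects[path] = data


def write_files(store, base_dir, files, create_dir=True):
    """Write each file under base_dir, optionally creating base_dir first."""
    if create_dir:
        store.create_dir(base_dir)
    for name, data in files.items():
        store.put_object(f"{base_dir}/{name}", data)


store = PutOnlyStore()

# With create_dir=False the write goes through using put permissions only.
write_files(store, "bucket/year=2022", {"part-0.parquet": b"data"}, create_dir=False)

# With the default create_dir=True the same write is rejected up front.
try:
    write_files(store, "bucket/year=2022", {"part-1.parquet": b"data"})
    default_write_failed = False
except PermissionError:
    default_write_failed = True
```

In this model a partitioned path such as `bucket/year=2022` works with `create_dir=False` because the store never needs the intermediate directories to exist.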

I see. So should we add a test to check that it works anyway?

I have modified the test to use Hive partitioning with existing_data_behavior set to 'overwrite_or_ignore'. However, I'm not sure why the Python CI tests are failing with the ChildProcessError.

kszucs · 2022-04-22T20:51:53Z

@sanjibansg please rebase to make the builds pass

pitrou · 2022-04-25T09:17:27Z

python/pyarrow/tests/util.py

Can this actually happen since we ran _ensure_minio_component_version above?

Sorry, I do not understand. Are you asking whether the FileNotFoundError can be raised at all, given that we ran _ensure_minio_component_version?

Indeed, that's my question.

Also I now notice you've removed the calls to _ensure_minio_component_version, is that intended? @westonpace What is your take on this?

Oh, sorry, I misunderstood this comment
#12701 (comment)

I am restoring the _ensure_minio_component_version function, and we can maybe remove the try-except block, since the error should not be raised if _ensure_minio_component_version ran successfully.

Ok, can you explicitly skip the test when the Minio checks fail?

Skipping the test explicitly if the check fails.

pitrou · 2022-04-25T13:55:09Z

Well, my bad: it seems the FileNotFoundError exception can still happen :-)
Also, the test needs skipping on Windows (see AppVeyor results: mc is something else on Windows)

sanjibansg · 2022-04-25T15:08:41Z

Well, my bad: it seems the FileNotFoundError exception can still happen :-) Also, the test needs skipping on Windows (see AppVeyor results: mc is something else on Windows)

Yes, the Windows issue should be fixed now; I moved that line of code to the correct place. I also modified _ensure_minio_component_version to raise a FileNotFoundError if the correct version is not found, in which case the test is skipped.
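The skip logic described here can be sketched with the standard library alone. `require_cli_or_skip` is a hypothetical helper, not the PR's `_ensure_minio_component_version`, and it raises `unittest.SkipTest` to stay dependency-free; a pytest-based suite would more likely call `pytest.skip`.

```python
import subprocess
import sys
import unittest


def require_cli_or_skip(executable, platform=sys.platform):
    """Raise unittest.SkipTest unless `executable --version` can be run."""
    if platform == "win32":
        # The thread notes that `mc` refers to a different program on Windows.
        raise unittest.SkipTest(f"{executable} check is skipped on Windows")
    try:
        subprocess.run([executable, "--version"], check=True, capture_output=True)
    except FileNotFoundError:
        # The executable is not installed or not on PATH.
        raise unittest.SkipTest(f"{executable} is not installed")
    except subprocess.CalledProcessError as exc:
        # The executable exists but the version query itself failed.
        raise unittest.SkipTest(f"{executable} --version failed: {exc}")
```

Calling the helper at the start of a test turns a missing or unusable command-line tool into a skipped test rather than an error.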

pitrou · 2022-04-25T16:05:51Z

@github-actions crossbow submit -g python

github-actions · 2022-04-25T16:10:18Z

Revision: 0d6e0fe

Submitted crossbow builds: ursacomputing/crossbow @ actions-1953

Task	Status
test-conda-python-3.10
test-conda-python-3.7
test-conda-python-3.7-hdfs-2.9.2
test-conda-python-3.7-hdfs-3.2.1
test-conda-python-3.7-kartothek-latest
test-conda-python-3.7-kartothek-master
test-conda-python-3.7-pandas-0.24
test-conda-python-3.7-pandas-latest
test-conda-python-3.7-spark-v3.1.2
test-conda-python-3.8
test-conda-python-3.8-hypothesis
test-conda-python-3.8-pandas-latest
test-conda-python-3.8-pandas-nightly
test-conda-python-3.8-spark-v3.2.0
test-conda-python-3.9
test-conda-python-3.9-dask-latest
test-conda-python-3.9-dask-master
test-conda-python-3.9-pandas-master
test-conda-python-3.9-spark-master
test-debian-11-python-3
test-fedora-35-python-3
test-ubuntu-20.04-python-3

ursabot · 2022-05-01T06:21:02Z

Benchmark runs are scheduled for baseline = b397d17 and contender = 3592f98. 3592f98 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Finished ⬇️0.0% ⬆️0.0%] ec2-t3-xlarge-us-east-2
[Failed] test-mac-arm
[Finished ⬇️0.0% ⬆️0.0%] ursa-i9-9960x
[Finished ⬇️0.63% ⬆️0.08%] ursa-thinkcentre-m75q
Buildkite builds:
[Finished] 3592f988 ec2-t3-xlarge-us-east-2
[Failed] 3592f988 test-mac-arm
[Finished] 3592f988 ursa-i9-9960x
[Finished] 3592f988 ursa-thinkcentre-m75q
[Finished] b397d17f ec2-t3-xlarge-us-east-2
[Failed] b397d17f test-mac-arm
[Finished] b397d17f ursa-i9-9960x
[Finished] b397d17f ursa-thinkcentre-m75q
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java

github-actions bot added the Component: C++ label Mar 24, 2022

sanjibansg changed the title ~~ARROW-15892: Dataset APIs require s3:ListBucket Permissions~~ ARROW-15892: [C++] Dataset APIs require s3:ListBucket Permissions Mar 24, 2022

sanjibansg force-pushed the s3-directory branch from fc6ca39 to 2179a5f Compare April 11, 2022 04:48

github-actions bot added the Component: Python label Apr 11, 2022

sanjibansg marked this pull request as ready for review April 11, 2022 22:30

westonpace self-requested a review April 12, 2022 04:10

westonpace requested changes Apr 14, 2022

pitrou reviewed Apr 14, 2022

sanjibansg requested review from pitrou and westonpace April 19, 2022 05:10

pitrou reviewed Apr 19, 2022

pitrou reviewed Apr 19, 2022

pitrou reviewed Apr 19, 2022

pitrou reviewed Apr 19, 2022

westonpace reviewed Apr 20, 2022

sanjibansg force-pushed the s3-directory branch from cf39a2e to cfadd60 Compare April 20, 2022 23:51

sanjibansg force-pushed the s3-directory branch from c7efb30 to 03e838b Compare April 22, 2022 21:00

sanjibansg requested review from pitrou and westonpace April 23, 2022 06:31

kszucs force-pushed the s3-directory branch from 03e838b to 71ec6a7 Compare April 24, 2022 18:22

pitrou requested changes Apr 25, 2022

sanjibansg added 5 commits April 25, 2022 14:56

feat: flag in dataset writer for creating dir

0c6002f

test: testing put only limited s3 policy

0b8902b

fix: PrepareDirectory for create_dir flag

3473c51

fix: lint issue for unused modules

472cac7

feat: moved limited_s3_user to util

2ff6233

sanjibansg added 13 commits April 25, 2022 14:56

feat: using c_bool instead of bool for create_dir

9f63cd5

test: test with expected failure for create_dir flag set to true

f6a01ae

docs: docstring explaining test for s3 with put only policy

21dba25

fix: python lint

a40aef0

fix: limited_s3_user used as a function and moved to util.py

3c00819

fix: avoid creating dir if already present in _configure_limited_user

587bb3e

docs: changed docstring for create_dir in FileSystemDatasetWriteOptions

fc39613

refactor: merged limited_s3_user into _configure_limited_user and renamed as _configure_s3_limited_user

1652606

refactor: stdlib imports in alphabetical order in util.py

58a0e51

docs: change docstring of create_dir in dataset.py and file_base.h

ab9a792

test: hive partitioning in test_write_dataset_s3_put_only

bd28ec1

fix: remove dir if already exist in _configure_s3_limited_user

83aecbd

fix: added flag --ignore-existing in mc mb command

e181b9d

sanjibansg force-pushed the s3-directory branch from 71ec6a7 to e181b9d Compare April 25, 2022 09:26

review: removed _ensure_minio_component_version

9ae65f3

review: refactored to start try block from _wait_for_minio_startup

sanjibansg requested a review from pitrou April 25, 2022 09:55

sanjibansg added 2 commits April 25, 2022 16:13

review: keep _ensure_minio_component_version and skip test if this fails

b1ece5f

test: skip limited user s3 test if _ensure_minio_component_version() fails

247bec2

fix: FileNotFoundError Exception and skipping in windows

0d6e0fe

pitrou approved these changes Apr 25, 2022

pitrou closed this in 3592f98 Apr 25, 2022

sanjibansg deleted the s3-directory branch April 25, 2022 17:04

asfimport mentioned this pull request Nov 23, 2022

[C++] Dataset APIs require s3:ListBucket Permissions #20147

Closed
