@sanjibansg
Contributor

@sanjibansg sanjibansg commented Mar 24, 2022

This PR adds a boolean flag that lets the Dataset API skip directory creation on filesystems such as S3.

Checklist

  • Implementation to avoid creating directories
  • s3 filesystem tests with limited permissions (only put object permissions)

@sanjibansg sanjibansg changed the title from "ARROW-15892: Dataset APIs require s3:ListBucket Permissions" to "ARROW-15892: [C++] Dataset APIs require s3:ListBucket Permissions" Mar 24, 2022
@pitrou
Member

pitrou commented Mar 30, 2022

@sanjibansg I'm not sure it will be easy to create an automated S3 test with limited permissions, so we may simply punt on that. @westonpace What do you think?

@westonpace
Member

@pitrou We have precedent for it here:

_minio_limited_policy = """{
so I don't think it would be too difficult to add.
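The policy snippet referenced above is truncated in this thread, so its actual contents are not reproduced here. Purely as an illustration of what a "put object only" policy could look like, the sketch below builds a standard IAM-style policy document that allows s3:PutObject and nothing else. The function name and bucket name are hypothetical.

```python
import json

# Hypothetical bucket name, used only for illustration.
BUCKET = "example-bucket"


def put_only_policy(bucket: str) -> str:
    """Return an IAM-style policy (as JSON) allowing only s3:PutObject.

    This is an assumed example, not the policy used in the Arrow tests.
    """
    policy = {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": ["s3:PutObject"],
                # Object-level resource: every key inside the bucket.
                "Resource": [f"arn:aws:s3:::{bucket}/*"],
            }
        ],
    }
    return json.dumps(policy, indent=2)


policy_json = put_only_policy(BUCKET)
```

A user restricted this way can upload objects but cannot list the bucket, which is the situation the PR title describes.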

@sanjibansg sanjibansg marked this pull request as ready for review April 11, 2022 22:30
@westonpace westonpace self-requested a review April 12, 2022 04:10
Member

@westonpace westonpace left a comment

There are some lint errors:

/arrow/python/pyarrow/tests/test_fs.py:23:1: F401 're' imported but unused
/arrow/python/pyarrow/tests/test_fs.py:24:1: F401 'subprocess' imported but unused
/arrow/python/pyarrow/tests/test_fs.py:25:1: F401 'sys' imported but unused
/arrow/python/pyarrow/tests/test_fs.py:26:1: F401 'time' imported but unused
/arrow/python/pyarrow/tests/test_dataset.py:28:1: F401 'venv.create' imported but unused

@sanjibansg sanjibansg requested review from pitrou and westonpace April 19, 2022 05:10
Member

This test doesn't do any partitioning?

Contributor Author

I don't think we can use Hive or Directory partitioning here, because we are keeping create_dir as False. We could use Filename partitioning instead, but while trying that I noticed what is probably a bug in Filename partitioning: the values of the partitioning field are not retrieved correctly, and all of them come back as null. I am investigating the cause; if it's a simple fix, I may push it in this PR.

Member

Hmm, if we can't do anything other than filename partitioning, then is it worth fixing this issue? @westonpace What do you think?

Member

As for the filename partitioning bug, better to file a separate JIRA IMHO.

Member

The original ask was for a case that didn't have any partitioning. If an S3 user uses partitioning, they shouldn't run into the original issue, because all CreateDir calls will be for bucket + path. So this flag is only needed to enable the very specific case where partitioning is not used.

That being said, I think hive & directory partitioning should still work with create_dir=False if you were using s3 or a similar filesystem that did not require directories to be created in advance. So I think it still has some general value if users wanted to avoid creating marker directories in their S3 repository.

We could also solve this problem by modifying s3fs.cc so that it didn't try and create the bucket if it already existed. This would add a BucketExists call to the path here.
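The alternative mentioned above (check whether the bucket exists before trying to create it) can be sketched as control flow over a toy in-memory store. Everything here is hypothetical: FakeBucketStore and create_dir are stand-ins written for this example and are not the s3fs.cc code or any real S3 client. Whether an existence check would itself be permitted under a restricted policy is a separate question that this sketch does not address.

```python
class FakeBucketStore:
    """In-memory stand-in for an object store; not a real S3 client."""

    def __init__(self, existing, can_create=False):
        self.buckets = set(existing)
        self.can_create = can_create
        self.create_calls = 0

    def bucket_exists(self, name):
        return name in self.buckets

    def create_bucket(self, name):
        self.create_calls += 1
        if not self.can_create:
            raise PermissionError(f"not allowed to create bucket {name!r}")
        self.buckets.add(name)


def create_dir(store, path):
    """Toy CreateDir: only attempt bucket creation when the bucket is missing."""
    bucket, _, key = path.strip("/").partition("/")
    if key:
        # bucket + path case: no bucket creation is attempted in this toy model.
        return
    if store.bucket_exists(bucket):
        # The bucket is already there, so skip the create call entirely.
        return
    store.create_bucket(bucket)


store = FakeBucketStore(existing={"data"}, can_create=False)
create_dir(store, "data")            # existing bucket: no create attempted
create_dir(store, "data/year=2022")  # bucket + path: no create attempted
```

With the existence check in place, a user without bucket-creation rights only hits an error when the bucket is genuinely missing.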

Member

I see. So should we add a test to check that it works anyway?

Contributor Author

I have modified the test to use Hive partitioning with existing_data_behavior set to 'overwrite_or_ignore'. However, I am not sure why the Python CI tests are failing with a ChildProcessError.

@kszucs
Member

kszucs commented Apr 22, 2022

@sanjibansg please rebase to make the builds pass

Member

Can this actually happen since we ran _ensure_minio_component_version above?

Contributor Author

Sorry, I do not understand. Are you asking whether the FileNotFoundError can be raised at all, given that we ran _ensure_minio_component_version?

Member

Indeed, that's my question.

Member

Also I now notice you've removed the calls to _ensure_minio_component_version, is that intended? @westonpace What is your take on this?

Contributor Author

Oh, sorry, I misunderstood this comment
#12701 (comment)

I am restoring the _ensure_minio_component_version function, and we can perhaps remove the try-except block, since the error should not be raised if _ensure_minio_component_version ran successfully.

Member

Ok, can you explicitly skip the test when the Minio checks fail?

Contributor Author

Skipping the test explicitly if the check fails.

review: refactored to start try block from _wait_for_minio_startup
@sanjibansg sanjibansg requested a review from pitrou April 25, 2022 09:55
@pitrou
Member

pitrou commented Apr 25, 2022

Well, my bad: it seems the FileNotFoundError exception can still happen :-)
Also, the test needs skipping on Windows (see AppVeyor results: mc is something else on Windows)

@sanjibansg
Contributor Author

> Well, my bad: it seems the FileNotFoundError exception can still happen :-)
> Also, the test needs skipping on Windows (see AppVeyor results: mc is something else on Windows)

Yes, the Windows issue should be fixed now; I moved that line of code to the correct place. I also modified _ensure_minio_component_version to raise a FileNotFoundError if the correct version is not found, in which case the test is skipped.
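The skip behaviour described above can be sketched roughly as follows. The helper names ensure_component and skip_unless_available are hypothetical and much simpler than _ensure_minio_component_version (no version comparison is done here). The sketch uses unittest.SkipTest to stay within the standard library; in a pytest suite, pytest.skip would play the same role.

```python
import shutil
import subprocess
import sys
import unittest


def ensure_component(binary, version_flag="--version"):
    """Raise FileNotFoundError unless `binary` is on PATH and runs cleanly.

    Hypothetical helper: it only checks that the binary can be invoked.
    """
    path = shutil.which(binary)
    if path is None:
        raise FileNotFoundError(f"{binary} not found on PATH")
    proc = subprocess.run([path, version_flag], capture_output=True, text=True)
    if proc.returncode != 0:
        raise FileNotFoundError(f"{binary} {version_flag} failed")
    return proc.stdout


def skip_unless_available(binary):
    """Skip the calling test on Windows or when the component check fails."""
    if sys.platform == "win32":
        raise unittest.SkipTest("component not usable on Windows")
    try:
        ensure_component(binary)
    except FileNotFoundError as exc:
        # Turn the failed check into an explicit skip instead of an error.
        raise unittest.SkipTest(str(exc))
```

A test would call skip_unless_available at its start, so that a missing or unusable tool shows up as a skip rather than a failure.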

@pitrou
Member

pitrou commented Apr 25, 2022

@github-actions crossbow submit -g python

@github-actions

Revision: 0d6e0fe

Submitted crossbow builds: ursacomputing/crossbow @ actions-1953

Task Status
test-conda-python-3.10 Github Actions
test-conda-python-3.7 Github Actions
test-conda-python-3.7-hdfs-2.9.2 Github Actions
test-conda-python-3.7-hdfs-3.2.1 Github Actions
test-conda-python-3.7-kartothek-latest Github Actions
test-conda-python-3.7-kartothek-master Github Actions
test-conda-python-3.7-pandas-0.24 Github Actions
test-conda-python-3.7-pandas-latest Github Actions
test-conda-python-3.7-spark-v3.1.2 Github Actions
test-conda-python-3.8 Github Actions
test-conda-python-3.8-hypothesis Github Actions
test-conda-python-3.8-pandas-latest Github Actions
test-conda-python-3.8-pandas-nightly Github Actions
test-conda-python-3.8-spark-v3.2.0 Github Actions
test-conda-python-3.9 Github Actions
test-conda-python-3.9-dask-latest Github Actions
test-conda-python-3.9-dask-master Github Actions
test-conda-python-3.9-pandas-master Github Actions
test-conda-python-3.9-spark-master Github Actions
test-debian-11-python-3 Azure
test-fedora-35-python-3 Azure
test-ubuntu-20.04-python-3 Azure

@pitrou pitrou closed this in 3592f98 Apr 25, 2022
@sanjibansg sanjibansg deleted the s3-directory branch April 25, 2022 17:04
@ursabot

ursabot commented May 1, 2022

Benchmark runs are scheduled for baseline = b397d17 and contender = 3592f98. 3592f98 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Finished ⬇️0.0% ⬆️0.0%] ec2-t3-xlarge-us-east-2
[Failed] test-mac-arm
[Finished ⬇️0.0% ⬆️0.0%] ursa-i9-9960x
[Finished ⬇️0.63% ⬆️0.08%] ursa-thinkcentre-m75q
Buildkite builds:
[Finished] 3592f988 ec2-t3-xlarge-us-east-2
[Failed] 3592f988 test-mac-arm
[Finished] 3592f988 ursa-i9-9960x
[Finished] 3592f988 ursa-thinkcentre-m75q
[Finished] b397d17f ec2-t3-xlarge-us-east-2
[Failed] b397d17f test-mac-arm
[Finished] b397d17f ursa-i9-9960x
[Finished] b397d17f ursa-thinkcentre-m75q
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java
