danepitkin (Member) commented Jul 28, 2023

Rationale for this change

Java datasets can implicitly create an S3 filesystem, which initializes the S3 APIs. There is currently no explicit call to shut down the S3 APIs in Java, which results in a warning message being printed at runtime:

arrow::fs::FinalizeS3 was not called even though S3 was initialized. This could lead to a segmentation fault at exit

What changes are included in this PR?

  • Add a Java runtime shutdown hook that calls EnsureS3Finalized() via JNI. This is a no-op if S3 is uninitialized or already finalized.
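
As a rough illustration of the idea (not the code in this PR), a shutdown hook with no-op semantics could be sketched as below. `FakeS3Lifecycle` and `ShutdownHookSketch` are hypothetical names, and the fake lifecycle is only a pure-Java stand-in for the native `EnsureS3Finalized()` call, which is assumed here to do nothing unless S3 was initialized and has not yet been finalized.

```java
/** Hypothetical pure-Java stand-in for the native S3 lifecycle; not the Arrow API. */
final class FakeS3Lifecycle {
    private boolean initialized = false;
    private boolean finalized = false;
    private int finalizeCount = 0;

    synchronized void initialize() {
        initialized = true;
    }

    /** Finalizes only when initialized and not yet finalized; otherwise does nothing. */
    synchronized void ensureFinalized() {
        if (!initialized || finalized) {
            return; // no-op: nothing to shut down, or already shut down
        }
        finalized = true;
        finalizeCount++;
    }

    synchronized int finalizeCount() {
        return finalizeCount;
    }
}

final class ShutdownHookSketch {
    /** Registers a JVM shutdown hook that runs the cleanup, and returns the hook thread. */
    static Thread registerFinalizer(Runnable cleanup) {
        Thread hook = new Thread(cleanup, "s3-finalizer-sketch");
        Runtime.getRuntime().addShutdownHook(hook);
        return hook;
    }

    public static void main(String[] args) {
        FakeS3Lifecycle s3 = new FakeS3Lifecycle();
        s3.initialize();
        // The JVM starts this hook thread during an orderly shutdown.
        registerFinalizer(s3::ensureFinalized);
    }
}
```

Because the cleanup is idempotent, it is harmless if something else finalizes S3 before the hook runs.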

Are these changes tested?

Yes, reproduced with:

import org.apache.arrow.dataset.file.FileFormat;
import org.apache.arrow.dataset.file.FileSystemDatasetFactory;
import org.apache.arrow.dataset.jni.NativeMemoryPool;
import org.apache.arrow.dataset.source.DatasetFactory;
import org.apache.arrow.memory.BufferAllocator;
import org.apache.arrow.memory.RootAllocator;

public class DatasetModule {
    public static void main(String[] args) {
        String uri = "s3://voltrondata-labs-datasets/nyc-taxi-tiny/year=2022/month=2/part-0.parquet";
        try (
            BufferAllocator allocator = new RootAllocator();
            DatasetFactory datasetFactory = new FileSystemDatasetFactory(allocator, NativeMemoryPool.getDefault(), FileFormat.PARQUET, uri);
        ) {
            // S3 is initialized
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}

I didn't think a unit test was worth adding. Let me know if you think otherwise. Reasoning:

  • We can't test the actual shutdown since that's a JVM thing.
  • We could test whether the hook is registered, but that would involve exposing the API and having access to the thread object registered with the hook, or using reflection to obtain it. Not worth it IMO.
  • No need to test the functionality inside the hook; it's just a wrapper around a single C++ API with no params/retval.
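
For context on the second point, here is a minimal sketch of what a registration check could look like if the hook's thread object were exposed. It relies on `Runtime.removeShutdownHook`, which returns `true` only when the given thread was previously registered. `HookRegistrationCheck` is a hypothetical helper, not part of Arrow.

```java
/** Hypothetical helper; assumes the caller already holds the hook's Thread object. */
final class HookRegistrationCheck {
    /**
     * Reports whether the thread is currently registered as a shutdown hook.
     * The check removes the hook and, if it was present, adds it back,
     * so it is not atomic with respect to concurrent registration changes.
     */
    static boolean isRegistered(Thread hook) {
        Runtime runtime = Runtime.getRuntime();
        boolean removed = runtime.removeShutdownHook(hook);
        if (removed) {
            runtime.addShutdownHook(hook); // restore the original registration
        }
        return removed;
    }
}
```

This is the kind of extra API surface the reasoning above argues is not worth exposing for such a small wrapper.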

Are there any user-facing changes?

No

danepitkin requested a review from lidavidm as a code owner July 28, 2023 20:52
github-actions commented:

⚠️ GitHub issue #36069 has been automatically assigned in GitHub to PR creator.

*/
public class FileSystemDatasetFactory extends NativeDatasetFactory {

private static final AtomicBoolean addedS3ShutdownHook = new AtomicBoolean(false);
danepitkin (Member Author) commented:

I added an atomic check to ensure we only register a single hook. Without it, we could register multiple hooks to call EnsureS3Finalized.
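
A guard of this kind might look roughly like the sketch below. Only the field name `addedS3ShutdownHook` comes from the diff above; the `OnceOnlyHook` class, the `addHookOnce` method, and the registration counter are hypothetical and exist only to make the behaviour observable.

```java
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical sketch of a register-at-most-once guard; not the code from this PR. */
final class OnceOnlyHook {
    private static final AtomicBoolean addedS3ShutdownHook = new AtomicBoolean(false);
    // Counter used only to observe how many hooks were actually registered.
    private static final AtomicInteger registrations = new AtomicInteger();

    /** Registers the cleanup as a shutdown hook on the first call only. */
    static boolean addHookOnce(Runnable cleanup) {
        // compareAndSet lets exactly one caller flip false -> true, even under contention.
        if (!addedS3ShutdownHook.compareAndSet(false, true)) {
            return false;
        }
        Runtime.getRuntime().addShutdownHook(new Thread(cleanup));
        registrations.incrementAndGet();
        return true;
    }

    static int registrations() {
        return registrations.get();
    }
}
```

An alternative that avoids the flag entirely is to register the hook from code that already runs once per process, such as a one-time library loading step.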

Member commented:

danepitkin (Member Author) commented:

Ah yes, this is a much better spot for this. Will update. The atomic bool did feel like overkill.

github-actions bot added the awaiting committer review label and removed the awaiting review label Jul 28, 2023

danepitkin (Member Author) commented:

@github-actions crossbow submit java

github-actions commented:

Revision: 6058031

Submitted crossbow builds: ursacomputing/crossbow @ actions-389a7e4e18

Task Status
java-jars Github Actions
verify-rc-source-java-linux-almalinux-8-amd64 Github Actions
verify-rc-source-java-linux-conda-latest-amd64 Github Actions
verify-rc-source-java-linux-ubuntu-20.04-amd64 Github Actions
verify-rc-source-java-linux-ubuntu-22.04-amd64 Github Actions
verify-rc-source-java-macos-amd64 Github Actions

github-actions bot added the awaiting changes label and removed the awaiting committer review label Jul 28, 2023

danepitkin (Member Author) commented:

@github-actions crossbow submit java

github-actions bot added the awaiting change review label and removed the awaiting changes label Jul 31, 2023

github-actions commented:

Revision: 556a822

Submitted crossbow builds: ursacomputing/crossbow @ actions-ebfb41f456

Task Status
java-jars Github Actions
verify-rc-source-java-linux-almalinux-8-amd64 Github Actions
verify-rc-source-java-linux-conda-latest-amd64 Github Actions
verify-rc-source-java-linux-ubuntu-20.04-amd64 Github Actions
verify-rc-source-java-linux-ubuntu-22.04-amd64 Github Actions
verify-rc-source-java-macos-amd64 Github Actions

github-actions bot added the awaiting merge label and removed the awaiting change review label Jul 31, 2023
lidavidm merged commit 3501964 into apache:main Aug 2, 2023
lidavidm removed the awaiting merge label Aug 2, 2023

febinsathar commented:

thanks

conbench-apache-arrow commented:

After merging your PR, Conbench analyzed the 5 benchmarking runs that have been run so far on merge-commit 3501964.

There were no benchmark performance regressions. 🎉

The full Conbench report has more details. It also includes information about possible false positives for unstable benchmarks that are known to sometimes produce them.

loicalleyne pushed a commit to loicalleyne/arrow that referenced this pull request Nov 13, 2023
* Closes: apache#36069

Authored-by: Dane Pitkin <dane@voltrondata.com>
Signed-off-by: David Li <li.davidm96@gmail.com>
pribor pushed a commit to GlobalWebIndex/arrow that referenced this pull request Oct 24, 2025
Development

Successfully merging this pull request may close these issues.

[Java] while using s3 FileSystemDatasetFactory getting this exception
