-
Notifications
You must be signed in to change notification settings - Fork 4k
GH-36069: [Java] Ensure S3 is finalized on shutdown #36934
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
|
| */ | ||
| public class FileSystemDatasetFactory extends NativeDatasetFactory { | ||
|
|
||
| private static final AtomicBoolean addedS3ShutdownHook = new AtomicBoolean(false); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added an atomic check to ensure we only register a single hook. Without it, we could register multiple hooks to call EnsureS3Finalized.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we just stick it in JniLoader? https://github.com/apache/arrow/blob/main/java/dataset/src/main/java/org/apache/arrow/dataset/jni/JniLoader.java
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah yes, this is a much better spot for this. Will update. The atomic bool did feel like overkill
|
@github-actions crossbow submit java |
|
Revision: 6058031 Submitted crossbow builds: ursacomputing/crossbow @ actions-389a7e4e18 |
|
@github-actions crossbow submit java |
|
Revision: 556a822 Submitted crossbow builds: ursacomputing/crossbow @ actions-ebfb41f456 |
|
thanks |
|
After merging your PR, Conbench analyzed the 5 benchmarking runs that have been run so far on merge-commit 3501964. There were no benchmark performance regressions. 🎉 The full Conbench report has more details. It also includes information about possible false positives for unstable benchmarks that are known to sometimes produce them. |
### Rationale for this change
Java datasets can implicitly create an S3 filesystem, which will initialize S3 APIs. There is currently no explicit call to shutdown S3 APIs in Java, which results in a warning message being printed at runtime:
`arrow::fs::FinalizeS3 was not called even though S3 was initialized. This could lead to a segmentation fault at exit`
### What changes are included in this PR?
* Add a Java runtime shutdown hook that calls `EnsureS3Finalized()` via JNI. This is a noop if S3 is uninitialized or already finalized.
### Are these changes tested?
Yes, reproduced with:
```
import org.apache.arrow.dataset.file.FileFormat;
import org.apache.arrow.dataset.file.FileSystemDatasetFactory;
import org.apache.arrow.dataset.jni.NativeMemoryPool;
import org.apache.arrow.dataset.source.DatasetFactory;
import org.apache.arrow.memory.BufferAllocator;
import org.apache.arrow.memory.RootAllocator;
public class DatasetModule {
public static void main(String[] args) {
String uri = "s3://voltrondata-labs-datasets/nyc-taxi-tiny/year=2022/month=2/part-0.parquet";
try (
BufferAllocator allocator = new RootAllocator();
DatasetFactory datasetFactory = new FileSystemDatasetFactory(allocator, NativeMemoryPool.getDefault(), FileFormat.PARQUET, uri);
) {
// S3 is initialized
} catch (Exception e) {
e.printStackTrace();
}
}
}
```
I didn't think a unit test was worth adding. Let me know if you think otherwise. Reasoning:
* We can't test the actual shutdown since thats a JVM thing.
* We could test to see if the hook is registered, but that involves exposing the API and having access to the thread object registered with the hook. Or using reflection to obtain it. Not worth it IMO.
* No need to test the functionality inside the hook, its just a wrapper around a single C++ API with no params/retval.
### Are there any user-facing changes?
No
* Closes: apache#36069
Authored-by: Dane Pitkin <dane@voltrondata.com>
Signed-off-by: David Li <li.davidm96@gmail.com>
### Rationale for this change
Java datasets can implicitly create an S3 filesystem, which will initialize S3 APIs. There is currently no explicit call to shutdown S3 APIs in Java, which results in a warning message being printed at runtime:
`arrow::fs::FinalizeS3 was not called even though S3 was initialized. This could lead to a segmentation fault at exit`
### What changes are included in this PR?
* Add a Java runtime shutdown hook that calls `EnsureS3Finalized()` via JNI. This is a noop if S3 is uninitialized or already finalized.
### Are these changes tested?
Yes, reproduced with:
```
import org.apache.arrow.dataset.file.FileFormat;
import org.apache.arrow.dataset.file.FileSystemDatasetFactory;
import org.apache.arrow.dataset.jni.NativeMemoryPool;
import org.apache.arrow.dataset.source.DatasetFactory;
import org.apache.arrow.memory.BufferAllocator;
import org.apache.arrow.memory.RootAllocator;
public class DatasetModule {
public static void main(String[] args) {
String uri = "s3://voltrondata-labs-datasets/nyc-taxi-tiny/year=2022/month=2/part-0.parquet";
try (
BufferAllocator allocator = new RootAllocator();
DatasetFactory datasetFactory = new FileSystemDatasetFactory(allocator, NativeMemoryPool.getDefault(), FileFormat.PARQUET, uri);
) {
// S3 is initialized
} catch (Exception e) {
e.printStackTrace();
}
}
}
```
I didn't think a unit test was worth adding. Let me know if you think otherwise. Reasoning:
* We can't test the actual shutdown since thats a JVM thing.
* We could test to see if the hook is registered, but that involves exposing the API and having access to the thread object registered with the hook. Or using reflection to obtain it. Not worth it IMO.
* No need to test the functionality inside the hook, its just a wrapper around a single C++ API with no params/retval.
### Are there any user-facing changes?
No
* Closes: apache#36069
Authored-by: Dane Pitkin <dane@voltrondata.com>
Signed-off-by: David Li <li.davidm96@gmail.com>
Rationale for this change
Java datasets can implicitly create an S3 filesystem, which will initialize S3 APIs. There is currently no explicit call to shutdown S3 APIs in Java, which results in a warning message being printed at runtime:
arrow::fs::FinalizeS3 was not called even though S3 was initialized. This could lead to a segmentation fault at exitWhat changes are included in this PR?
EnsureS3Finalized()via JNI. This is a noop if S3 is uninitialized or already finalized.Are these changes tested?
Yes, reproduced with:
I didn't think a unit test was worth adding. Let me know if you think otherwise. Reasoning:
Are there any user-facing changes?
No