
fix: add graceful shutdown and start for rest namespace adapter #5325

Merged
jackye1995 merged 8 commits into lance-format:main from jackye1995:ns-flaky
Dec 2, 2025

Contributor

@jackye1995 jackye1995 commented Nov 22, 2025

Closes #5293

The test was flaky because the server from the previous test was not properly shut down. The server in the next test then failed to start, and the client ended up talking to the previous test's server.

Changed RestAdapter to run a non-blocking start instead of a blocking serve, so that the Python and Java layers can handle start and shutdown more gracefully.
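To illustrate the difference, here is a minimal Python sketch of a non-blocking start paired with an explicit stop, using only the standard library. It is not the RestAdapter code from this PR: `BackgroundServer`, `_PingHandler`, and `ping` are hypothetical names made up for the example.

```python
import http.client
import threading
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer


class _PingHandler(BaseHTTPRequestHandler):
    """Hypothetical handler: answers every GET with a fixed body."""

    def do_GET(self):
        body = b"ok"
        self.send_response(200)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the output quiet


class BackgroundServer:
    """Hypothetical wrapper: start() returns immediately, stop() cleans up."""

    def __init__(self, host="127.0.0.1", port=0):
        # The socket is bound here, so a bind failure raises in the caller
        # instead of being lost inside a background task.
        self._httpd = ThreadingHTTPServer((host, port), _PingHandler)
        self._thread = None

    @property
    def port(self):
        return self._httpd.server_address[1]

    def start(self):
        # Non-blocking: the accept loop runs on a background thread.
        self._thread = threading.Thread(
            target=self._httpd.serve_forever, daemon=True
        )
        self._thread.start()

    def stop(self):
        if self._thread is not None:
            self._httpd.shutdown()  # blocks until serve_forever() has exited
            self._thread.join()
            self._thread = None
        self._httpd.server_close()  # release the listening socket

    def __enter__(self):
        self.start()
        return self

    def __exit__(self, *exc):
        self.stop()


def ping(port):
    """Send one GET request to the local server and return the body."""
    conn = http.client.HTTPConnection("127.0.0.1", port, timeout=5)
    try:
        conn.request("GET", "/")
        return conn.getresponse().read()
    finally:
        conn.close()
```

Used as `with BackgroundServer() as srv: ping(srv.port)`, the listening socket is released when the block exits, so a later server can reuse the port.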

jackye1995 marked this pull request as draft November 22, 2025 05:35
github-actions bot added the bug and python labels Nov 22, 2025
jackye1995 force-pushed the ns-flaky branch 2 times, most recently from 1d9bab5 to 24e1454 December 1, 2025 18:03
@jackye1995
Contributor Author

Added some logging and found the issue:

[2025-12-01T20:07:29Z INFO  lance::namespace] PyRestAdapter::new() creating backend with impl=dir, host=127.0.0.1, port=13055
[2025-12-01T20:07:29Z INFO  lance::namespace] PyRestAdapter::new() properties={"root": "/tmp/tmpwvbvbg_7"}
[2025-12-01T20:07:29Z INFO  lance::namespace] PyRestAdapter::new() calling builder.connect()
[2025-12-01T20:07:29Z INFO  lance_namespace_impls::dir::manifest] create_or_get_manifest: attempting to load manifest from /tmp/tmpwvbvbg_7/__manifest
[2025-12-01T20:07:29Z DEBUG lance::events] target="lance::dataset_events" event="loading" uri="/tmp/tmpwvbvbg_7/__manifest" target_ref=None status="error" 
[2025-12-01T20:07:29Z INFO  lance_namespace_impls::dir::manifest] create_or_get_manifest: load result for /tmp/tmpwvbvbg_7/__manifest is Err
[2025-12-01T20:07:29Z INFO  lance_namespace_impls::dir::manifest] Creating new manifest table at /tmp/tmpwvbvbg_7/__manifest
[2025-12-01T20:07:29Z DEBUG lance::events] target="lance::dataset_events" event="loading" uri="/tmp/tmpwvbvbg_7/__manifest" target_ref=None status="error" 
[2025-12-01T20:07:29Z DEBUG lance::events] target="lance::dataset_events" event="writing" uri="/tmp/tmpwvbvbg_7/__manifest" mode=Create 
[2025-12-01T20:07:29Z DEBUG lance::events] target="lance::dataset_events" event="loading" uri="/tmp/tmpwvbvbg_7/__manifest" target_ref=None status="error" 
[2025-12-01T20:07:29Z DEBUG lance::events] target="lance::file_audit" mode="create" type="manifest" path="dummy" 
[2025-12-01T20:07:29Z DEBUG lance::events] target="lance::dataset_events" event="committed" uri="/tmp/tmpwvbvbg_7/__manifest" read_version=0 committed_version=1 detached=false operation="Overwrite" 
[2025-12-01T20:07:29Z INFO  lance_namespace_impls::dir::manifest] Successfully created manifest table at /tmp/tmpwvbvbg_7/__manifest, version=1, uri=/tmp/tmpwvbvbg_7/__manifest
[2025-12-01T20:07:29Z INFO  lance::namespace] PyRestAdapter::new() backend created successfully
[2025-12-01T20:07:29Z INFO  lance::namespace] PyRestAdapter::serve() starting server on 127.0.0.1:13055
[2025-12-01T20:07:29Z INFO  lance::namespace] PyRestAdapter::serve() sleeping 500ms to wait for server startup
[2025-12-01T20:07:29Z INFO  lance::namespace] PyRestAdapter::serve() background task started, calling adapter.serve()
[2025-12-01T20:07:29Z INFO  lance_namespace_impls::rest_adapter] RestAdapter::serve() binding to 127.0.0.1:13055
[2025-12-01T20:07:29Z INFO  lance::namespace] PyRestAdapter::serve() adapter.serve() returned: false
[2025-12-01T20:07:29Z INFO  lance::namespace] PyRestAdapter::serve() done sleeping, returning
[2025-12-01T20:07:29Z DEBUG reqwest::connect] starting new connection: http://127.0.0.1:13055/
[2025-12-01T20:07:29Z INFO  lance_namespace_impls::rest_adapter] REST create_namespace: received request for id=Some(["workspace"])
[2025-12-01T20:07:29Z DEBUG lance_namespace_impls::dir::manifest] DatasetConsistencyWrapper::reload() starting for uri=/tmp/tmp1ef2t_g9/__manifest, current_version=6
[2025-12-01T20:07:29Z ERROR lance_namespace_impls::dir::manifest] DatasetConsistencyWrapper::reload() failed to get latest version for uri=/tmp/tmp1ef2t_g9/__manifest, current_version=6, error=Not found: tmp/tmp1ef2t_g9/__manifest/_versions, /runner/_work/lance/lance/rust/lance-table/src/io/commit.rs:355:23
[2025-12-01T20:07:29Z ERROR lance_namespace_impls::rest_adapter] REST create_namespace: error=IO { source: Custom { kind: Other, error: "Failed to get latest version: Not found: tmp/tmp1ef2t_g9/__manifest/_versions, /runner/_work/lance/lance/rust/lance-table/src/io/commit.rs:355:23" }, location: Location { file: "/runner/_work/lance/lance/rust/lance-namespace-impls/src/dir/manifest.rs", line: 169, column: 31 } }
python/tests/test_namespace_rest.py::TestTableOperations::test_register_table_rejects_absolute_path 
[FIXTURE] Creating RestAdapter with tmpdir=/tmp/tmpwvbvbg_7, port=13055
[FIXTURE] RestAdapter context entered, creating client
[FIXTURE] Client created, yielding
FAILED[FIXTURE] Test completed, cleaning up

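In short, the server from the previous test was still active on port 13055, so the new server failed to start (adapter.serve() returned right after the bind attempt) and the client kept calling into the old server. That is why the create_namespace request was handled against /tmp/tmp1ef2t_g9 rather than the new test's /tmp/tmpwvbvbg_7.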
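The same failure mode can be reproduced with plain sockets. This is only an illustrative sketch, not the adapter or test code: `try_listen` is a hypothetical helper, and the sockets stand in for the old and new servers.

```python
import socket


def try_listen(host, port):
    """Return a listening socket, or None if binding or listening fails."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        sock.bind((host, port))
        sock.listen()
        return sock
    except OSError:
        sock.close()
        return None


# "Previous test": a server that was never shut down.
old_server = try_listen("127.0.0.1", 0)
port = old_server.getsockname()[1]

# "New test": tries to listen on the same port and fails.
new_server = try_listen("127.0.0.1", port)

# A client that only knows the port still connects, because the old
# listener is the one that accepts the connection.
client = socket.create_connection(("127.0.0.1", port), timeout=5)
accepted, _ = old_server.accept()
accepted.sendall(b"old")
reply = client.recv(3)

for s in (client, accepted, old_server):
    s.close()
```

If the start failure is swallowed, as in the log above, the client has no way to tell that it reached the wrong server.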


codecov Bot commented Dec 1, 2025

Codecov Report

❌ Patch coverage is 82.35294% with 15 lines in your changes missing coverage. Please review.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| rust/lance-namespace-impls/src/rest_adapter.rs | 89.04% | 6 Missing and 2 partials ⚠️ |
| rust/lance-namespace-impls/src/dir/manifest.rs | 41.66% | 5 Missing and 2 partials ⚠️ |


@github-actions github-actions Bot added the java label Dec 1, 2025
jackye1995 changed the title from "fix: add graceful shutdown for rest namespace adapter" to "fix: add graceful shutdown and start for rest namespace adapter" Dec 1, 2025
@jackye1995 jackye1995 marked this pull request as ready for review December 1, 2025 21:28
Member

@westonpace westonpace left a comment

Some minor suggestions but seems fine otherwise

Comment thread java/lance-jni/src/namespace.rs Outdated
  if let Some(server_handle) = adapter.server_handle.take() {
-     server_handle.abort();
+     server_handle.shutdown();
+     std::thread::sleep(std::time::Duration::from_millis(100));

Can we add a comment about why this sleep is needed?


  // Use a random port to avoid conflicts
- port = 4000 + new Random().nextInt(10000);
+ port = 10000 + new Random().nextInt(10000);

I suspect you already know, but the canonical solution to this problem that I'm used to seeing is to use port number 0 and pass it all the way down to where the socket is opened. The OS then assigns a random ephemeral port, and you can either return that from the constructor or make it available via an accessor.

Not sure if that is easily doable or possible but figured I'd mention it.
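For reference, the port-0 approach looks roughly like this at the socket level. This is a generic Python sketch rather than the adapter's API, and `bind_ephemeral` is a hypothetical helper name.

```python
import socket


def bind_ephemeral(host="127.0.0.1"):
    """Bind to port 0 and return (listening socket, port chosen by the OS)."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.bind((host, 0))  # port 0 asks the OS to pick a free port
    sock.listen()
    # getsockname() reports the port that was actually assigned.
    return sock, sock.getsockname()[1]


# Two listeners opened at the same time get two different ports, so
# concurrently running tests cannot collide on a hard-coded number.
first, first_port = bind_ephemeral()
second, second_port = bind_ephemeral()
```

The caller reads the assigned port back after binding instead of picking one up front, which is the accessor idea mentioned above.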

Comment on lines +125 to +128
/// Gracefully shut down the server.
///
/// This signals the server to stop accepting new connections and wait for
/// existing connections to complete.

This method doesn't actually block until the shutdown is complete though, right? I feel like we should document that fact (or, ideally, find some way to block).
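One possible shape for a blocking variant, sketched in Python with a background thread standing in for the server task. Everything here is hypothetical (`Worker`, `shutdown_and_wait`, the sleep durations) and is only meant to contrast "signal and return" with "signal and wait".

```python
import threading
import time


class Worker:
    """Hypothetical stand-in for a server task with a two-step shutdown."""

    def __init__(self):
        self._stop = threading.Event()
        self._stopped = threading.Event()
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def _run(self):
        while not self._stop.is_set():
            time.sleep(0.01)  # stand-in for serving requests
        time.sleep(0.05)      # stand-in for draining in-flight requests
        self._stopped.set()

    @property
    def is_stopped(self):
        return self._stopped.is_set()

    def shutdown(self):
        """Signal only: returns immediately, the worker may still be running."""
        self._stop.set()

    def shutdown_and_wait(self, timeout=5.0):
        """Signal, then block until the worker has finished or the timeout
        expires. Returns True if the worker finished in time."""
        self._stop.set()
        return self._stopped.wait(timeout)
```

Waiting on a completion signal with a timeout avoids guessing a fixed sleep duration, and the return value tells the caller whether the shutdown actually finished.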

@jackye1995 jackye1995 requested a review from westonpace December 2, 2025 07:33

@westonpace westonpace left a comment


Sweet. Thanks for doing the ephemeral port change!

@jackye1995 jackye1995 merged commit 6381f7c into lance-format:main Dec 2, 2025
26 checks passed
jackye1995 added a commit to jackye1995/lance that referenced this pull request Jan 21, 2026
…e-format#5325)

Labels

bug (Something isn't working), java, python


Development

Successfully merging this pull request may close these issues.

flaky test: TestTableOperations.test_drop_table

2 participants