fix: prevent duplicate manifest entries from concurrent table creation #6143

Merged
westonpace merged 3 commits into lance-format:main from jmhsieh:jon/namespace-race-fix
Mar 17, 2026

@jmhsieh jmhsieh commented Mar 10, 2026

Change conflict_retries from 0 to 5 in insert_into_manifest so that cross-process races are handled correctly. When two processes concurrently insert the same object_id, the second one hits a commit version conflict. With conflict_retries > 0, MergeInsert retries by re-evaluating the full plan against the latest data, where the join detects the existing row and WhenMatched::Fail fires properly.

Previously, conflict_retries=0 meant the second operation would fail with a generic TooMuchWriteContention error. In some cases, however, both commits could succeed, creating duplicate manifest entries ("Expected exactly 1 table...found 2").

Add test with two independent ManifestNamespace instances racing on the same directory to verify no duplicates are created.
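
To make the retry behaviour concrete, here is a minimal toy model in Rust. It is not Lance's implementation: `Manifest`, `InsertError`, `try_commit`, and `insert_with_retries` are hypothetical names invented for this sketch, and it assumes a commit is rejected whenever the table version has moved since the writer took its snapshot. It models only the version-conflict path, not the case where both commits succeed.

```rust
use std::collections::BTreeSet;

/// Toy stand-in for the manifest table: a version counter plus the set of
/// object ids. Hypothetical type, not part of Lance.
#[derive(Debug, Clone, Default)]
struct Manifest {
    version: u64,
    ids: BTreeSet<String>,
}

#[derive(Debug, PartialEq)]
enum InsertError {
    /// The id was visible when the plan was evaluated
    /// (plays the role of WhenMatched::Fail).
    AlreadyExists,
    /// Retries ran out on version conflicts
    /// (plays the role of TooMuchWriteContention).
    TooMuchWriteContention,
}

impl Manifest {
    /// Assumption of this sketch: a commit lands only if nobody has
    /// committed since `read_version`.
    fn try_commit(&mut self, read_version: u64, id: &str) -> bool {
        if self.version != read_version {
            return false;
        }
        self.ids.insert(id.to_string());
        self.version += 1;
        true
    }
}

/// Insert `id`, starting from a possibly stale `snapshot` of `table`.
fn insert_with_retries(
    table: &mut Manifest,
    mut snapshot: Manifest,
    id: &str,
    conflict_retries: u32,
) -> Result<(), InsertError> {
    let mut attempt = 0;
    loop {
        // Plan evaluation: the "join" against the snapshot sees existing rows.
        if snapshot.ids.contains(id) {
            return Err(InsertError::AlreadyExists);
        }
        if table.try_commit(snapshot.version, id) {
            return Ok(());
        }
        if attempt == conflict_retries {
            return Err(InsertError::TooMuchWriteContention);
        }
        attempt += 1;
        // Retry: re-read the latest data before re-evaluating the plan.
        snapshot = table.clone();
    }
}
```

In this model, a writer holding a stale snapshot gets the generic contention error when `conflict_retries` is 0. With any retry budget it re-reads the table, sees the existing id, and reports `AlreadyExists` instead.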

Here's an example of the duplicate-entry failure that I run into occasionally:

...
  File "/home/runner/work/geneva/geneva/src/geneva/state/manager.py", line 35, in __init__
    self.table = alter_or_create_table(
  File "/home/runner/work/geneva/geneva/src/geneva/utils/schema.py", line 138, in alter_or_create_table
    return db.create_table(table_name, schema=schema, namespace=namespace)
  File "/home/runner/work/geneva/geneva/src/geneva/db.py", line 403, in create_table
    return Table(self, name, namespace=namespace, storage_options=storage_options)
  File "/home/runner/work/geneva/geneva/src/geneva/table.py", line 489, in __init__
    self._ltbl  # noqa
  File "/home/runner/.local/share/uv/python/cpython-3.10-linux-x86_64-gnu/lib/python3.10/functools.py", line 981, in __get__
    val = self.func(instance)
  File "/home/runner/work/geneva/geneva/src/geneva/table.py", line 543, in _ltbl
    tbl = inner.open_table(self.name, namespace=self._namespace)
  File "/home/runner/work/geneva/geneva/.venv/lib/python3.10/site-packages/lancedb/namespace.py", line 392, in open_table
    response = self._ns.describe_table(request)
  File "/home/runner/work/geneva/geneva/.venv/lib/python3.10/site-packages/lance/namespace.py", line 362, in describe_table
    response_dict = self._inner.describe_table(request.model_dump())
OSError: LanceError(IO): Expected exactly 1 table with id 'default$geneva_manifests', found 2, /home/runner/work/lance/lance/rust/lance-namespace-impls/src/dir/manifest.rs:642:21

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@github-actions github-actions Bot added the bug Something isn't working label Mar 10, 2026
@github-actions

Code Review

Overall: Clean, well-motivated fix with a good test. The change from conflict_retries(0) to conflict_retries(5) correctly addresses the race condition where two concurrent create_table calls could both succeed and create duplicate manifest entries.

CI

  • The format check is failing — please run cargo fmt --all and push.

Test note (minor)

The test uses tokio::join! on the same runtime, which doesn't guarantee true interleaving — both futures may execute sequentially depending on task scheduling, meaning the test could pass without actually exercising the race. This is a common limitation of single-process concurrency tests and is acceptable, but worth noting. The test still provides value as a regression guard.
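
One way to make overlap more likely than with futures polled on a single runtime is to run each racer on its own OS thread and release them together with a `std::sync::Barrier`. The sketch below is illustrative only: `create_entry` and `race_creates` are hypothetical helpers, the registry is an in-memory set rather than a `ManifestNamespace`, and a barrier still does not guarantee any particular interleaving.

```rust
use std::collections::HashSet;
use std::sync::{Arc, Barrier, Mutex};
use std::thread;

/// Hypothetical stand-in for a create_table call: atomically insert `id`,
/// returning true only for the caller that actually created the entry.
fn create_entry(registry: &Mutex<HashSet<String>>, id: &str) -> bool {
    registry.lock().unwrap().insert(id.to_string())
}

/// Start `racers` OS threads, release them together, and return
/// (number of callers that created the entry, number of entries stored).
fn race_creates(racers: usize, id: &str) -> (usize, usize) {
    let registry = Arc::new(Mutex::new(HashSet::<String>::new()));
    let barrier = Arc::new(Barrier::new(racers));

    let handles: Vec<_> = (0..racers)
        .map(|_| {
            let registry = Arc::clone(&registry);
            let barrier = Arc::clone(&barrier);
            let id = id.to_string();
            thread::spawn(move || {
                // Every thread blocks here until all racers have arrived.
                barrier.wait();
                create_entry(&registry, &id)
            })
        })
        .collect();

    let winners = handles
        .into_iter()
        .map(|handle| handle.join().unwrap())
        .filter(|&created| created)
        .count();
    let entries = registry.lock().unwrap().len();
    (winners, entries)
}
```

A regression test built this way would assert that exactly one racer wins and exactly one entry exists, for example `assert_eq!(race_creates(8, "default$t"), (1, 1))`.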

No other issues found. LGTM once the format check passes.


Automated review by Claude Code

codecov Bot commented Mar 10, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.


jmhsieh and others added 2 commits March 9, 2026 20:23

jmhsieh commented Mar 10, 2026

The DuckDB-related failures seem to have been fixed on main. Merged with latest.

@westonpace westonpace left a comment

This makes sense.

// When two processes concurrently insert the same object_id, the second one
// hits a commit conflict. With conflict_retries > 0, the retry re-evaluates
// the full MergeInsert plan against the latest data, where the join detects
// the existing row and WhenMatched::Fail fires, producing a clear error.

westonpace (Member):

Why 5 and not 1? Even if we have 10 transactions running at the same time won't transactions 2-10 all hit the conflict error before they attempt to commit again (so 1 retry is enough)?

jmhsieh (Contributor Author):

I have no deep reason for 5.

FWIW, Claude came up with a scenario in which multiple distinct tables are created concurrently: namespaces are backed by Lance tables, so commits can conflict underneath even when different tables (not just the same table) are being created.
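
A rough way to reason about the retry budget in that scenario is the worst-case model below. It is a sketch under strong assumptions, not a description of Lance's actual conflict resolution: `writers_that_succeed` is a hypothetical helper, every writer is assumed to plan against the same version, and every commit to the shared manifest table is assumed to conflict with every other, so exactly one writer lands per round.

```rust
/// Worst-case model for `writers` processes creating *distinct* tables.
/// Assumption: all writers start from the same version and any two commits
/// conflict, so one writer lands per round and the rest must retry.
/// Returns how many writers finish within `conflict_retries` retries.
fn writers_that_succeed(writers: u32, conflict_retries: u32) -> u32 {
    let mut pending = writers;
    let mut succeeded = 0;
    // Round 0 is the initial attempt; rounds 1..=conflict_retries are retries.
    for _round in 0..=conflict_retries {
        if pending == 0 {
            break;
        }
        // Exactly one pending writer wins this round.
        pending -= 1;
        succeeded += 1;
    }
    succeeded
}
```

Under these assumptions, one retry lets only two concurrent creators of distinct tables finish, while five retries cover up to six. For writers racing on the same object_id the picture is different, as noted in the question above: after the first commit lands, every other writer sees the existing row on its first retry, so a single retry is enough to surface the clear error.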

@westonpace westonpace merged commit 7b42fd5 into lance-format:main Mar 17, 2026
30 checks passed
westonpace pushed a commit that referenced this pull request Mar 17, 2026
westonpace added a commit that referenced this pull request Mar 18, 2026
## Summary

Cherry-picks bug fixes onto `release/v3.0` for the v3.0.1 patch release:

- **#6160** - fix: handle `DataType::Null` in `adjust_child_validity` to
prevent panic
- **#6187** - fix: handle nullable validity layers without def levels
- **#6143** - fix: prevent duplicate manifest entries from concurrent
table creation
- **#6212** - chore: bump lz4_flex patch versions
- **#6146** - fix: replace fetch_arrow_table with to_arrow_table

## Test plan

- CI passes on cherry-picked commits (both PRs were already merged and
tested on main)

---------

Co-authored-by: Will Jones <willjones127@gmail.com>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-authored-by: Xuanwo <github@xuanwo.io>
Co-authored-by: Jonathan Hsieh <jon@lancedb.com>
Co-authored-by: BubbleCal <bubble-cal@outlook.com>