
emilk/fix write starvation#12

Closed
andrea-reale wants to merge 699 commits into rerun from
emilk/fix-write-starvation

Conversation

@andrea-reale
Member

jackye1995 and others added 30 commits February 9, 2026 16:16
The files are generated with `make licenses`, which is currently expected
to be run manually. In the future, some automation could be added.
lance-format#5867)

closes lance-format#5682

changes:
- Treat element-level NULLs in LABEL_LIST as non-matches so
array_has_any/array_has_all return TRUE/FALSE when the list itself is
non-NULL.
- Allow nullable list literals in `LabelListQuery::to_expr` to prevent
`explain_plan()` panics.
- Add Python tests covering element-level NULLs, list-level NULLs,
NULL-literal filters and explain behavior.

---------

Co-authored-by: Will Jones <willjones127@gmail.com>
Introduce two commonly useful CodeX workflows:
1. patch a merged PR onto a specific release branch
2. fix a CI workflow that is currently breaking the main branch
`array_has_any/all(labels, [])` yields an empty label-list query;
LabelList index merge (`set_union` / `set_intersection`) `unwraps` the
first element and panics on an empty iterator.

Skip LABEL_LIST index parsing for empty lists to avoid the panic and
fall back to normal execution.
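The failure mode and the guard can be sketched in Python (illustrative only; the real code is the Rust LabelList index merge):

```python
from functools import reduce

# Illustrative sketch (not the actual Rust code): folding a set union by
# unwrapping the first element of an iterator fails when the label list is
# empty, as in `array_has_any(labels, [])`.
def set_union_unsafe(sets):
    it = iter(sets)
    acc = next(it)  # raises StopIteration on an empty iterator (the panic)
    for s in it:
        acc = acc | s
    return acc

# The fix's spirit: detect the empty case up front and fall back safely.
def set_union_safe(sets):
    return reduce(lambda a, b: a | b, sets, set())
```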
…ance-format#5913)

Fix two copy-paste typos in `OrderableScalarValue::cmp` panic messages
During IVF shuffle, we have a FileWriter per partition and each
accumulates page metadata in memory over the course of the shuffle. With
large datasets and large numbers of partitions, this memory grows over
time to dominate the memory cost of IVF shuffle.

This patch adds optional functionality to the FileWriter that serializes
page metadata to a spill file and enables it by default in the IVF
shuffler.
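The idea can be sketched with a stdlib-only Python analogue (the class name and shape are hypothetical; the real change is in the Rust FileWriter):

```python
import json
import tempfile

# Hypothetical sketch: append each page's metadata to a spill file as it is
# produced, instead of accumulating all of it in memory until the writer
# finishes. Only the final read-back materializes the records.
class SpillingPageMetadata:
    def __init__(self):
        self._spill = tempfile.TemporaryFile(mode="w+t")
        self.num_pages = 0

    def add(self, meta):
        self._spill.write(json.dumps(meta) + "\n")
        self.num_pages += 1

    def finish(self):
        self._spill.seek(0)
        return [json.loads(line) for line in self._spill]
```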
This removes a buffer in the shuffler that accumulated batches for
batched writes to temporary storage. This was configured with a public
buffer_size parameter hence the breaking change.

Previously, when we shuffled data we accumulated this many batches for
each partition in memory and then flushed them all to disk at once. This
may have been intended as an optimization in the original implementation
of the shuffler, which supported external shuffling through arbitrary
object storage. However, the shuffler was subsequently hardcoded to use
local disk (where this kind of buffering serves no benefit) and even on
remote object storage, we already have a layer of buffering in the
storage writer.

Instead of buffering batches, just write them directly to the
FileWriter. This results in much more predictable memory usage and also
faster index builds.
There are various ways to write data, and several of them are currently
failing for JSON data because it requires a conversion from logical to
physical representation. I'd like to rework this more generally, but I
want to get the current implementation working first. This fixes one of
the merge_insert paths, which was failing because the field metadata is
lost as part of the operation (so the data appears to be a normal string
column and is not converted).
This PR also adds support for `mtune=apple-a13` when building
lance-linalg for iOS, which is the earliest supported iOS architecture
at the time of writing.

This is pulled out of lance-format#5866
after the initial bug was fixed in
lance-format#5747 and lancedb updated to
Lance 2.x
)

This adds support for progress reporting on index builds, allowing
callers to determine whether a build is progressing, what stage it is
in, and in some cases how far into the stage.

Breaking: updates some index builder function signatures to include a
progress callback implementation. A noop implementation is included in
the patch.
This will ensure JSON columns are properly converted in paths like
add_columns
Add a Python API to support using FTS as a filter for vector search, or
using vector_query as a filter for FTS search. Related to
lance-format#4928.
closes lance-format#5895

`Allow|Block OR` previously assumed `NULL` rows were always included in
the `block.selected` set; when they were not, `NULL`s could be dropped
and `FALSE`/`NULL` were mixed, leading to incorrect results.

This change computes `TRUE/FALSE/NULL` explicitly for `Allow|Block OR`
and derives `NULL/selected` sets from those.
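The three-valued logic can be sketched over row-id sets (illustrative Python; the actual implementation operates on Lance's selection structures):

```python
# Each predicate partitions the row ids into TRUE / FALSE / NULL sets.
# Under SQL three-valued logic, OR is TRUE if either side is TRUE, FALSE
# only if both sides are FALSE, and NULL otherwise.
def or_three_valued(t1, f1, n1, t2, f2, n2):
    true_rows = t1 | t2
    false_rows = f1 & f2
    null_rows = (n1 | n2) - true_rows - false_rows
    return true_rows, false_rows, null_rows
```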
This allows Java to also pass in a Session shared across datasets,
similarly to Python and Rust. The Session can then be used for
engine-side caching implementations in Spark and Trino.
…subproject list (lance-format#5847)

See lance-format#5848 for more
background and vote

---------

Co-authored-by: Weston Pace <weston.pace@gmail.com>
Link: https://github.com/lance-format/lance/actions/runs/21917803271

Summary of failure:
- build-no-lock failed because rustc 1.90.0 is too old for updated
aws-smithy dependencies (requires 1.91).

Fixes applied:
- Bumped the pinned Rust toolchain to 1.91.0 to satisfy dependency MSRV
in no-lock builds.

---------

Co-authored-by: Jack Ye <yezhaoqin@gmail.com>
We lack docs for how to create and query an FTS index in Lance. This PR
adds detailed docs, based on the latest v1.0.4 `pylance` API. All code
snippets have been tested with the latest stable release and I gave it
another pass with GPT 5.2 to ensure accuracy per the latest Rust code in
[tokenizer.rs](https://github.com/lance-format/lance/blob/f77769778aa93f47572187b83ccd7b6638dc39a3/rust/lance-index/src/scalar/inverted/tokenizer.rs#L180).
1. allow backporting multiple PRs in the same run
2. use a more meaningful PR title for CI fixes
…5908)

## Summary

- Expose `enable_stable_row_ids` parameter in `LanceDataset.commit()`
and `commit_transaction()`, allowing atomic creation of datasets with
stable row IDs via the commit path.
- Thread the parameter through Python → PyO3 →
`CommitBuilder.use_stable_row_ids()`.

Closes lance-format#5906

## Test plan

- [x] `cargo check -p pylance` passes
- [x] `cargo clippy -p pylance` passes with no warnings
- [x] New test `test_commit_with_stable_row_ids` verifies that
`commit(Overwrite, enable_stable_row_ids=True)` creates a dataset with
sequential stable row IDs across append
)

`BlockList|BlockList` OR handled NULLs incorrectly: when one side was
TRUE and the other was NULL, the result could stay NULL, leading to
wrong query results.

Fix by computing FALSE rows explicitly and deriving NULL rows from
three‑valued logic.
The main use case of this PR is to allow engines like Spark and Trino to
push down an aggregate into the Lance scanner in a distributed worker when
possible. Today we technically already support `COUNT(*)` pushdown
through `scanner.count_rows()` to count the rows of each fragment in a
distributed fashion; this is a more generic version of that. My plan is
to allow an engine to pass a Substrait Aggregate expression to the
scanner in the worker to support pushing down other aggregations like
`SUM`, `MAX`, `MIN`.

Another alternative I considered is to update the `dataset.sql()` API to
accept a full Substrait plan so we can execute a plan with an aggregate,
and to update the distributed worker to run a SQL statement instead of
running the scanner. But implementing this feature in the scanner feels
more aligned with how engines implement distributed execution:
basically, whatever can be executed by a single worker in a distributed
environment (predicate pushdown, column projection, aggregate pushdown)
should be supported by the scanner.

Note that with this change we could technically remove
`create_count_plan`, since it is just a subcase of
`create_aggregate_plan`, but we are not doing that in this PR. Once we
agree on this direction, I will open a separate PR to refactor it.

---------

Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
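The partial-aggregate shape this enables can be sketched as follows (hypothetical helper names; the real path goes through Substrait expressions and the scanner):

```python
# Each distributed worker scans its fragments and emits partial aggregates;
# the coordinator merges them. COUNT(*) pushdown is the existing subcase.
def worker_partial(rows, column):
    vals = [r[column] for r in rows]
    return {"count": len(vals), "sum": sum(vals),
            "max": max(vals, default=None)}

def coordinator_merge(partials):
    maxes = [p["max"] for p in partials if p["max"] is not None]
    return {"count": sum(p["count"] for p in partials),
            "sum": sum(p["sum"] for p in partials),
            "max": max(maxes, default=None)}
```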
This PR should be rebased after
lance-format#5664

---------

Co-authored-by: majin.nathan <majin.nathan@bytedance.com>
1. use relative release note for beta releases. Example fix:
https://github.com/jackye1995/lance/releases/tag/v4.0.0-beta.5
2. ensure RC voting period is correct. Example fix:
jackye1995#73
3. implement minor release in release branch: 
- Example proper failure when trying to create minor release in release
branch when main branch is not at major version:
https://github.com/jackye1995/lance/actions/runs/21975317405/job/63485677361
- Example successful minor release in release branch:
jackye1995#80,
https://github.com/jackye1995/lance/releases/edit/v3.1.0-rc.1,
https://github.com/jackye1995/lance/releases/edit/v3.1.0
…#5915)

This changes Dataset::sample to sort its random indices. Supplying
sorted inputs to take results in a 50% reduction in peak memory
consumption. This change causes the IVF training stage of IVF-PQ index
builds to take approximately half as much memory.
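The change itself is small; a stdlib sketch of the idea (function name hypothetical):

```python
import random

# Hypothetical sketch: draw k distinct random row indices, then sort them
# before handing them to a take()-style gather. With ascending indices the
# reader can stream pages in order and release them as it goes, which is
# where the peak-memory reduction comes from.
def sorted_sample_indices(n_rows, k, seed=0):
    rng = random.Random(seed)
    indices = rng.sample(range(n_rows), k)
    indices.sort()
    return indices
```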
…format#5950)

Make sure we always only do a metadata projection to avoid scanning data
when doing a count.

Also:
1. remove `create_count_plan` since `count_rows` is now just a case of
aggregate, no longer need a dedicated query plan.
2. remove duplicated calls to `aggregate_required_columns` by storing
required columns directly at aggregate construction time.
This creates a new LocalWriter that wraps tokio::fs::File in a BufWriter
for local file writes. ObjectStore::create() now returns one of these
when working against local storage, and an ObjectWriter for remote
storage.

Prior to this commit, local writes (e.g for shuffling) went through a
local object writer implementation that required a 5MB buffer per writer
and also simulated multipart upload machinery. For local writing, this
is slower than necessary and uses a lot of memory in situations where
many writers are open at once.

This change results in a substantial memory reduction and incremental
speedup for IVF shuffle.

---------

Co-authored-by: Will Jones <willjones127@gmail.com>
…rmat#5935)

When we sample data for IVF training we go down one of two paths,
depending on whether the target column is nullable or not. For the
not-nullable path, we use dataset.sample. For the nullable path, we do a
block-level sampling of batches, then filter out all batches that
contain all-null rows, and interleave the result into the output.

The interleaving step requires two copies of the filtered batches to be
held in RAM.

This commit adds a specialization for the case of nullable fixed-length
array columns. We now pre-size an output vector, and process the input
scan as a stream. For each batch, we copy not-null values into the
output vector, and drop the batch. This saves one copy of the filtered
output and halves the memory consumption of the sampling step.
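The streaming copy can be sketched as follows (conceptual Python; the real code operates on Arrow fixed-size-list batches):

```python
# Process batches as a stream: copy not-null values into one output buffer
# and drop each batch immediately, instead of filtering all batches and
# interleaving them (which holds two copies of the filtered data in RAM).
def sample_not_null(batches, target):
    out = []  # pre-sized output vector in the real implementation
    for batch in batches:  # batch: list of Optional values
        for v in batch:
            if v is not None:
                out.append(v)
                if len(out) == target:
                    return out
    return out
```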
Xuanwo and others added 27 commits March 25, 2026 00:03
…mat#6269)

This refactors distributed vector indexing to remove the staging-root
workflow and treat worker outputs as segments written directly under
`indices/<segment_uuid>/`. Segment planning and build now operate on
segments directly, and vector indexing no longer uses
`merge_index_metadata`.

The API and docs are updated around `with_segments(...)` /
`plan_segments(...)`, and the focused distributed vector tests were
updated to cover the new workflow.

Java's API will be affected
…nce-format#6278)

Worker processes were opening the dataset without dataset_options,
silently dropping any options (storage_options, version,
index_cache_size, etc.) set by the caller. Also remove the debug print
statement.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

---------

Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
This is basically the same problem encountered in
lance-format#5929

However, the solution there (and in other indices fixed since then) has
been to prune indices during the update phase to remove old fragments.
Unfortunately, FTS indices may be too large for this pruning to make
sense (it would require a complete scan through the old index), though
I haven't verified how bad the performance would be.

This PR instead keeps track of which fragments have been invalidated. It
then uses this as a block-list at search time.
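The block-list idea can be sketched as follows (illustrative; the real hit and posting structures differ):

```python
# Keep a set of invalidated fragment ids instead of rewriting the FTS
# index; filter posting-list hits against it at search time.
def filter_hits(hits, invalidated):
    # hits: iterable of (fragment_id, row_offset, score) tuples
    return [h for h in hits if h[0] not in invalidated]
```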
…tions (lance-format#6196)

#### Index and transaction
- `create_table_index`
- `list_table_indices`
- `describe_table_index_stats`
- `describe_transaction`
- `create_table_scalar_index`
- `drop_table_index`

---------

Co-authored-by: zhangyue19921010 <zhangyue.1010@bytedance.com>
This does not hook the throttle up anywhere yet, that will come in a
future PR.

Closes lance-format#6237
Closes lance-format#6238
This change enables phrase queries to match across stop-word gaps.

Example:
For `doc="love the format"` indexed with `remove_stop_words=True`, the
index does not store the stop word the.

With this change, users can still match the document with the phrase
query `q="love the format"`. In this mode, all stop words are treated as
equivalent placeholders for phrase matching, so `q="love a format"` will
also match the same document.

This makes queries that contain stop words 3x to 10x faster, at the
cost of a small loss in accuracy.

---------

Signed-off-by: BubbleCal <bubble-cal@outlook.com>
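A minimal sketch of the matching rule (illustrative Python; the real tokenizer and postings differ):

```python
STOP_WORDS = {"the", "a", "an", "of"}

def index_with_gaps(text):
    # Stop words are not stored, but their positions remain as gaps.
    return {pos: tok for pos, tok in enumerate(text.lower().split())
            if tok not in STOP_WORDS}

def phrase_matches(indexed, query):
    q = query.lower().split()
    last = max(indexed) if indexed else -1
    for start in range(last + 2 - len(q)):
        # Stop words in the query act as equivalent placeholders: any gap
        # (or any token) at that position is accepted.
        if all(indexed.get(start + i) == tok
               for i, tok in enumerate(q) if tok not in STOP_WORDS):
            return True
    return False
```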
## Summary
- `IndicesBuilder` rejected uint8 vector columns and didn't include
"hamming" in its allowed distance types, even though the underlying Rust
`train_ivf_model` supports hamming via k-modes
- Relaxes `_normalize_column` to accept unsigned integer value types
alongside floats
- Adds "hamming" to `_normalize_distance_type`'s allowed list
- Adds `test_ivf_centroids_hamming` test with uint8 vectors


## Test plan
- [ ] `test_ivf_centroids_hamming` — end-to-end IVF training with uint8
vectors and hamming distance
- [ ] Existing `test_ivf_centroids` tests still pass (float path
unchanged)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…format#6176)

This adds a dedicated benchmark for distributed vector index
finalization and a small query-side metric to count `find_partitions`
calls. Together they give us a baseline for analyzing the current
single-node merge bottleneck and for evaluating future segmented-index
work.

As context, the new `distributed_merge_only_ivf_pq` benchmark already
shows that finalize cost grows much faster than input bytes as shard
count and partition count increase. In the local filesystem benchmark,
the mean finalize time grows from about `64 ms` at `8 shards / 256
partitions` to about `2.87 s` at `128 shards / 1024 partitions`.

---

Based on this benchmark, I noticed that our current logic performs
poorly as the number of shards increases.

<img width="2750" height="1584"
alt="2026-03-12-distributed-merge-trend-matplotlib"
src="https://github.com/user-attachments/assets/361aa371-5941-431d-964a-8ea1e2a086d4"
/>
This moves `DatasetIndexExt` and the dataset-facing index segment types
out of `lance-index` and into `lance`, so the public dataset index
management API lives in the dataset layer instead of the lower-level
index implementation crate.


Close lance-format#6221
…#6302)

This fixes the main-branch CI failure introduced after lance-format#6280 moved
`DatasetIndexExt` into `lance`. `rust/lance-namespace-impls/src/dir.rs`
still imported and referenced the old `lance_index::DatasetIndexExt`
path, which broke the Java JNI workflow when it compiled the namespace
implementation.
In the Rust SDK we have a `NamespaceError` that contains a message and
an error code. Additionally, in both the Java and Python SDKs we have
logic to parse this type (and error codes) to produce native Java
exceptions. The problem is that in `namespace-impls` we never use them.
Therefore the Rust `NamespaceError` type is never used, and the
Java/Python SDK conversion code that parses and throws native errors is
essentially dead code. In this PR, we update `namespace-impls` to
actually throw the correct error types.

Closes lance-format#6240

---------

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-authored-by: Esteban Gutierrez <esteban@lancedb.com>
…ance-format#6102)

## Summary

Implements a parallel async scanner alongside the existing blocking
`LanceScanner` to prevent thread starvation in Java query engines like
Presto and Trino.

This PR adds **AsyncScanner** - a non-blocking alternative to
`LanceScanner` that uses `CompletableFuture` for true async I/O
operations.

## Motivation

Query engines like Presto and Trino rely on non-blocking I/O to
efficiently multiplex thousands of concurrent queries on a limited
thread pool. The current `LanceScanner` blocks Java threads during Rust
I/O operations, causing thread starvation and poor performance in these
environments.

## Key Features

✅ **Non-blocking I/O**: Spawns Tokio tasks instead of blocking Java
threads
✅ **CompletableFuture API**: Native Java async patterns for seamless
integration
✅ **Persistent JNI dispatcher**: Single thread attached to JVM for
zero-overhead callbacks
✅ **Task-based architecture**: Uses task IDs instead of JNI refs to
prevent memory leaks
✅ **Full feature parity**: Supports filters, projections, vector search,
FTS, aggregates
✅ **Clean cancellation**: Proper task cleanup without resource leaks  
✅ **Parallel to existing API**: No breaking changes, `LanceScanner`
unchanged

## Architecture

The implementation uses the **"Task ID + Dispatcher" pattern**:

1. **Java manages futures**: `ConcurrentHashMap<taskId,
CompletableFuture<Long>>` maps task IDs to pending requests
2. **Rust spawns async tasks**: Returns immediately while Tokio handles
I/O in background
3. **Lock-free completion channel**: Carries `(taskId, resultPointer)`
from Tokio to dispatcher
4. **Persistent dispatcher thread**: Attaches to JVM once, completes
Java futures via cached JNI method IDs

### Components

**Rust (`java/lance-jni/src/`):**
- `dispatcher.rs` - Persistent JNI thread with cached method IDs for
callbacks
- `task_tracker.rs` - Thread-safe task registry using `RwLock<HashMap>`
- `async_scanner.rs` - AsyncScanner with Tokio task spawning and JNI
exports
- `lib.rs` - Modified to add `JNI_OnLoad` hook for dispatcher
initialization

**Java (`java/src/main/java/org/lance/ipc/`):**
- `AsyncScanner.java` - CompletableFuture-based async API with task
management

**Tests (`java/src/test/java/org/lance/`):**
- `AsyncScannerTest.java` - 6 comprehensive examples demonstrating usage
patterns

## Usage Examples

### Basic async scan
```java
ScanOptions options = new ScanOptions.Builder().batchSize(20L).build();

try (AsyncScanner scanner = AsyncScanner.create(dataset, options, allocator)) {
    CompletableFuture<ArrowReader> future = scanner.scanBatchesAsync();
    ArrowReader reader = future.get(10, TimeUnit.SECONDS);
    
    while (reader.loadNextBatch()) {
        // Process batches without blocking
    }
}
```

### Multiple concurrent scans (key benefit for Presto/Trino)
```java
List<CompletableFuture<Integer>> futures = new ArrayList<>();

for (int i = 0; i < 100; i++) {
    AsyncScanner scanner = AsyncScanner.create(dataset, options, allocator);
    futures.add(scanner.scanBatchesAsync()
        .thenApply(reader -> processInBackground(reader)));
}

// All scans run in parallel without blocking threads!
CompletableFuture.allOf(futures.toArray(new CompletableFuture[0]))
    .get(30, TimeUnit.SECONDS);
```

## Testing

```bash
cd java
./mvnw test -Dtest=AsyncScannerTest
./mvnw compile  # Full build verification
```

All tests pass ✅

## Compatibility

- ✅ No breaking changes to existing APIs
- ✅ `LanceScanner` remains unchanged
- ✅ Uses same `ScanOptions` for consistency
- ✅ Opt-in: users choose blocking or async based on their needs

## Performance Benefits

For query engines running hundreds of concurrent queries:
- **Before**: Thread pool exhaustion as threads block on I/O
- **After**: Threads immediately return, I/O happens in background

## Checklist

- [x] Rust code compiles (`cargo check`)
- [x] Java code compiles (`./mvnw compile`)
- [x] Code formatted (`./mvnw spotless:apply && cargo fmt`)
- [x] Comprehensive tests added (`AsyncScannerTest.java`)
- [x] Documentation in code examples
- [x] No breaking changes

## Related Issues

Addresses the need for non-blocking I/O in Java query engines that
integrate with LanceDB.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
…e-format#6300)

`convert_json_arrow_type` only handled scalar types
(int/float/utf8/binary), causing deserialization failures for any schema
containing list, struct, large_binary, large_utf8, fixed_size_list, map,
decimal, or date/time types.

This made `arrow_type_to_json` and `convert_json_arrow_type` asymmetric:
serialization worked for all types but deserialization rejected most of
them with "Unsupported Arrow type".

In practice this broke the DuckDB lance extension's fast
schema-from-REST path — tables with list/struct columns fell back to
opening the S3 dataset for every DESCRIBE, making SHOW ALL TABLES ~20x
slower than necessary.

Add support for: float16, large_utf8, large_binary, fixed_size_binary,
decimal32/64/128/256, date32, date64, timestamp, duration, list,
large_list, fixed_size_list, struct, and map.

Add a roundtrip test covering all supported types.

---------

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
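The symmetry requirement can be sketched with a toy recursive codec (hypothetical representation; the real code maps Arrow DataTypes to JSON):

```python
# A nested type is either a scalar name ("int64", "utf8", ...) or a
# (kind, child) pair like ("list", "int64"). Serialization already
# recursed; the fix makes deserialization recurse the same way.
def type_to_json(t):
    if isinstance(t, str):
        return {"type": t}
    kind, child = t
    return {"type": kind, "child": type_to_json(child)}

def json_to_type(j):
    if "child" not in j:
        return j["type"]
    return (j["type"], json_to_type(j["child"]))
```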
## Summary
- add `Dataset::version_id()` in Rust to return the checked-out manifest
version without building full version metadata
- route Python `dataset.version` and default checkout refs through the
new fast path
- route Java `Dataset.version()` through a new JNI version-id accessor
while keeping `getVersion()` unchanged
- extend Rust, Python, and Java tests to cover current, updated, and
historical version reads

## Testing
- `cargo test -p lance test_version_id_fast_path`
- `cargo check --manifest-path java/lance-jni/Cargo.toml`
- `cargo check --manifest-path python/Cargo.toml`
- Python pytest target not run in this environment (`pytest`
unavailable)
- Java Maven test target not run in this environment (JRE unavailable)
…at#6270)

This change makes the logical-index and physical-segment split explicit
in the user-facing index APIs without breaking existing behavior.
`describe_indices` remains the logical view, `describe_index_segments`
becomes the explicit physical-segment view, and index statistics now
expose `num_segments` / `segments` alongside the legacy fields for
compatibility.

The Rust, Python, and Java bindings now use the same model so
segment-aware callers do not need to infer semantics from raw manifest
metadata. I validated the Rust path with `cargo test -p lance
test_optimize_delta_indices -- --nocapture` and the Java path with
`./mvnw -q -Dtest=DatasetTest#testDescribeIndicesByName test`.
This updates the documented minimum voting period for core stable major
releases from 1 week to 3 days. It also updates the RC voting discussion
automation so generated vote windows stay aligned with the documented
policy.

This follows the discussion in
lance-format#6147, where feedback
converged on narrowing the proposal to core major releases only while
keeping patch and subproject stable releases unchanged.
…format#6251)

This tightens the new multi-segment vector index path added in lance-format#6220.

It enforces disjoint fragment coverage when committing a segment set,
adds regression coverage that grouped segment coverage matches the union
of its source shard coverage, and verifies that remap only touches
segments covering affected fragments.

It also adds cleanup coverage for both replaced committed segments and
stale uncommitted `_indices/<uuid>` artifacts, and documents these
contracts in the distributed indexing guide.
)

This PR builds on lance-format#6294 and exposes the remaining pieces needed to
construct non-shared centroid vector index builds.

It adds fragment-scoped IVF/PQ training in Rust and exports the same
training flow to Python, so users can train per-segment artifacts and
feed them into the existing distributed build path.
…ance-format#6312)

Closes lance-format#6311 

## What

Add an optional `storage_options` keyword argument to `IvfModel.save()`,
`IvfModel.load()`, `PqModel.save()`, and `PqModel.load()`, and forward
it to the underlying `LanceFileWriter` / `LanceFileReader`.

## Why

These methods accept cloud storage URIs (e.g. `s3://`, `gs://`) but
previously had no way to pass credentials or backend-specific
configuration. Users were forced to rely on environment variables for
authentication, which breaks down when dealing with multiple storage
backends that require different credentials.

Both `LanceFileWriter` and `LanceFileReader` already support
`storage_options` — this change simply threads the parameter through.

## Change Summary

- **`python/python/lance/indices/ivf.py`**: Add `storage_options`
parameter to `IvfModel.save()` and `IvfModel.load()`, forward to
`LanceFileWriter` / `LanceFileReader`.
- **`python/python/lance/indices/pq.py`**: Same change for
`PqModel.save()` and `PqModel.load()`.

## Usage

```python
from lance.indices.ivf import IvfModel
from lance.indices.pq import PqModel

opts = {"aws_access_key_id": "...", "aws_secret_access_key": "...", "region": "us-east-1"}

# Save
ivf_model.save("s3://bucket/ivf.lance", storage_options=opts)
pq_model.save("s3://bucket/pq.lance", storage_options=opts)

# Load
ivf_model = IvfModel.load("s3://bucket/ivf.lance", storage_options=opts)
pq_model = PqModel.load("s3://bucket/pq.lance", storage_options=opts)
```
…lance-format#6310)

## Summary

When `enable_stable_row_id` is enabled, the first `take_rows()` call
triggers a full `RowIdIndex` build. On large datasets this cold start
was extremely slow (18.9s for 968M rows, 220s for 4.26B rows).

### Root Causes

**1. O(total_rows) segment expansion in `decompose_sequence`**

The original code expanded every `U64Segment` element-by-element, even
for `Range` segments with no deletions. For a `Range(0..273711)` with no
deletions, this meant 273K iterations, deletion vector checks, temporary
allocations, and re-compression — only to produce the same Range back.
Across 18,243 fragments averaging 233K rows, this totaled **4.26 billion
iterations** with ~32 GB of temporary allocations.

**2. O(N²) fragment lookup in `load_row_id_index`**

The original code called `fragments.iter().find()` (O(N) linear search)
for each of N fragments, resulting in O(N²) comparisons. `try_join_all`
spawned all N futures at once, overwhelming the async runtime.
`get_deletion_vector()` was called unconditionally even for fragments
without deletion files.

### Solution

**Fix 1: O(1) fast path for Range segments without deletions.** When a
fragment has no deletions and its row_id sequence is a `Range`,
construct the index chunk directly without iterating.

**Fix 2: HashMap lookup + conditional deletion vector loading.** Use a
`HashMap<u32, &FileFragment>` for O(1) lookup, `buffer_unordered` for
controlled concurrency, and skip `get_deletion_vector()` when there's no
deletion file.
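Fix 2's lookup change can be sketched as follows (illustrative; names are hypothetical):

```python
# Old shape: a linear find() per fragment id -> O(N) each, O(N^2) total.
def pair_linear(fragment_ids, fragments):
    return [next(f for f in fragments if f["id"] == fid)
            for fid in fragment_ids]

# New shape: build a dict once (O(N)), then O(1) per lookup.
def pair_hashmap(fragment_ids, fragments):
    by_id = {f["id"]: f for f in fragments}
    return [by_id[fid] for fid in fragment_ids]
```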

### Results

| Dataset | Before | After | Speedup |
|---------|--------|-------|---------|
| 968M rows, 3,540 fragments | 18.9s | 150ms | **126x** |
| 4.26B rows, 18,243 fragments | 220s | 89ms | **2,471x** |

## Test plan

- [x] All existing rowids tests pass (45 tests)
- [x] Clippy clean
- [x] New test `test_large_range_segments_no_deletions` validates fast
path correctness at boundaries and performance (100 fragments × 250K
rows completes < 1s)
- [ ] Verify on real datasets with deletions to ensure slow path
correctness

🤖 Generated with [Claude Code](https://claude.com/claude-code)

---------

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Closes lance-format#6239

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
## Summary
- Add a "Conflict Handling" section to the performance guide explaining
how concurrent operation conflicts affect throughput, with common
conflict examples and a link to the transaction spec
- Add a "Fragment Reuse Index" subsection to the performance guide
describing how the FRI avoids compaction/index conflicts, with a Python
API snippet
- Add an "Impacts" section to the FRI spec covering conflict resolution
changes, index load cost, and FRI growth/cleanup

## Test plan
- [ ] Verify doc links resolve correctly (conflict resolution anchor,
FRI spec link)
- [ ] Review rendered markdown for formatting

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
@github-actions

ACTION NEEDED
Lance follows the Conventional Commits specification for release automation.

The PR title and description are used as the merge commit message. Please update your PR title and description to match the specification.

For details on the error please inspect the "PR Title Check" action.

@andrea-reale andrea-reale marked this pull request as draft March 30, 2026 07:29
