
Lance 0.25.2 but update chrono and arrow#4

Closed
emilk wants to merge 248 commits into main from emilk/update-chrono-arrow

Conversation

@emilk
Member

@emilk emilk commented Apr 1, 2025

This is lance 0.25.2 plus one commit: f936f84

This works around lance pinning chrono to an old version:

https://github.com/lancedb/lance/blob/12be3491a61dc5af6475e3e3decb625e562576ea/Cargo.toml#L91-L93

We should make a PR to lance to have them remove that pin.
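For reference, a version pin in a Cargo.toml and the way a patched fork can be substituted from a consuming workspace look roughly like this (versions and repository URL are illustrative, not taken from lance's actual manifest):

```toml
# The kind of exact-version pin this branch works around (version illustrative):
[dependencies]
chrono = "=0.4.39"

# In a consuming workspace, a patched fork can override the published crate:
[patch.crates-io]
lance = { git = "https://github.com/example/lance", branch = "emilk/update-chrono-arrow" }
```

The `=` prefix makes Cargo refuse any other version, which is what forces downstream users onto the old chrono until the pin is relaxed.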

broccoliSpicy and others added 30 commits November 26, 2024 11:40
…rmat#3126)

This PR tries to add helper function `expect_stat` and
`expect_single_stat` to make DataBlock statistics easier to use.
…sk (lance-format#3183)

Previously, `FileFragment::create` could only create one file fragment, which caused two issues in the Spark connector:
1. If the Spark task is empty, this API throws an exception since there is no data to create the fragment from.
2. If the task's data stream is very large, it generates one huge file in Lance format, which is unfriendly to Spark parallelism.

So I removed the assigned fragment id and added a new method named `FileFragment::create_fragments` to generate zero or multiple fragments.


![image](https://github.com/user-attachments/assets/54fb2497-8163-4652-9e0b-d50a88fade53)
…le columns (lance-format#3189)

fix lance-format#3188

---------

Signed-off-by: BubbleCal <bubble-cal@outlook.com>
…3197)

`.java-version` is generated automatically by
[jenv](https://www.jenv.be/). Jenv is a very popular tool used
to manage Java environments.
Support dropping datasets for Python & Java.
Support 'drop table' & 'create or replace table' for Spark.

---------

Co-authored-by: Will Jones <willjones127@gmail.com>
Co-authored-by: Jeremy Leibs <jeremy@rerun.io>
Co-authored-by: Lei Xu <eddyxu@gmail.com>
4bit PQ is 3x faster than before:
```
16000,l2,PQ=96x4,DIM=1536
                        time:   [187.17 µs 187.95 µs 188.52 µs]
                        change: [-65.789% -65.641% -65.520%] (p = 0.00 < 0.10)
                        Performance has improved.

16000,cosine,PQ=96x4,DIM=1536
                        time:   [214.16 µs 214.52 µs 214.89 µs]
                        change: [-62.748% -62.594% -62.442%] (p = 0.00 < 0.10)
                        Performance has improved.

16000,dot,PQ=96x4,DIM=1536
                        time:   [190.12 µs 191.27 µs 192.22 µs]
                        change: [-65.496% -65.303% -65.086%] (p = 0.00 < 0.10)
                        Performance has improved.
```

Posting the 8-bit PQ results here for comparison; in short, 4-bit PQ is about 2x
faster with the same index params:
```
compute_distances: 16000,l2,PQ=96,DIM=1536
                        time:   [405.11 µs 405.72 µs 406.92 µs]
                        change: [-0.2844% +0.1588% +0.6035%] (p = 0.50 > 0.10)
                        No change in performance detected.

compute_distances: 16000,cosine,PQ=96,DIM=1536
                        time:   [419.98 µs 421.05 µs 421.99 µs]
                        change: [-0.2540% +0.1098% +0.4928%] (p = 0.59 > 0.10)
                        No change in performance detected.

compute_distances: 16000,dot,PQ=96,DIM=1536
                        time:   [432.08 µs 433.63 µs 435.69 µs]
                        change: [-25.522% -25.243% -24.938%] (p = 0.00 < 0.10)
                        Performance has improved.
```

---------

Signed-off-by: BubbleCal <bubble-cal@outlook.com>
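As a rough illustration of why fewer bits help (a sketch, not lance's actual implementation): 4-bit PQ uses only 16 centroids per sub-space, so the per-sub-space distance lookup table has 16 entries (cache-friendly, SIMD-shuffle-friendly) and two codes pack into one byte.

```python
import numpy as np

def pq4_lookup_table(query, codebooks):
    # codebooks: (num_subspaces, 16, sub_dim) -- 4-bit PQ has 16 centroids per sub-space.
    # table[m, c] = squared L2 distance from the m-th query sub-vector to centroid c.
    q = query.reshape(codebooks.shape[0], 1, -1)
    return ((codebooks - q) ** 2).sum(axis=-1)

def pq4_distance(codes, table):
    # codes: (num_subspaces,) ints in 0..15; the asymmetric distance to a database
    # vector is just a sum of table lookups, one per sub-space.
    return table[np.arange(len(codes)), codes].sum()
```

A vector encoded exactly at its chosen centroids has distance zero to itself, which is a handy sanity check for any PQ implementation.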
…rmat#3200)

Empty & null lists are interesting. If you have them, your final
repetition & definition buffers will have more items than your
flattened array. This fact required a considerable reworking of how
we build and unravel rep/def buffers.

When building we record the position of the specials and then, when we
serialize into rep/def buffers, we insert these special values. When
unraveling we need to deal with the fact that certain rep/def values are
"invisible" to the current context in which we are unraveling.

In addition, we now need to start keeping track of the structure of each
layer of repetition in the page metadata. This helps us understand the
meaning behind different definition levels later when we are unraveling.

This PR adds the changes to the rep/def utilities. We still aren't
actually using repetition levels at all yet. That will come in future
PRs.
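A toy sketch of the idea (not lance's actual encoding): one extra definition level distinguishes null lists, empty lists, and present items, which is exactly why the rep/def buffers can outgrow the flattened values.

```python
def rep_def_levels(lists):
    # lists: each element is None (null list), [] (empty list), or a list of values.
    # def level: 0 = null list, 1 = empty list, 2 = present item.
    # rep level: 0 = starts a new top-level row, 1 = continues the current list.
    reps, defs, values = [], [], []
    for lst in lists:
        if lst is None:
            reps.append(0); defs.append(0)   # "invisible" to the values buffer
        elif len(lst) == 0:
            reps.append(0); defs.append(1)   # defined list, but contributes no value
        else:
            for i, v in enumerate(lst):
                reps.append(0 if i == 0 else 1)
                defs.append(2)
                values.append(v)
    return reps, defs, values
```

For `[[1, 2], None, [], [3]]` this yields five rep/def entries but only three flattened values: the null and empty lists occupy rep/def slots with no corresponding value.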
…t#3211)

There were a few issues with our null handling in scalar indices.

First, it appears I assumed earlier that `X < NULL` and `X > NULL` would
always be false. However, in `arrow-rs` the ordering considers `NULL` to
be "the smallest value" and so `X < NULL` always evaluated to true. This
required some changes to the logic in the btree and bitmap indices.

Second, the btree index was still using the v1 file format because it
relied on the page size to keep track of the index's batch size. I've
instead made the batch size a configurable property (configurable in
code, not configurable by users) and made it so that btree can use the
v2 file format.

Finally, related to the above, I changed it so we now write v2 files for
all scalar indices, even if the dataset is a v1 dataset. I think that's
a reasonable decision at this point.

The logic to fallback and read the old v1 files was already in place (I
believe @BubbleCal added it back when working on inverted index) but I
added a migration test just to be sure we weren't breaking our btree /
bitmap support.

Users with existing bitmap indices will get the new correct behavior
without any changes.
Users with existing btree indices will get some of the new correct
behavior but will need to retrain their indices to get all of the
correct behavior.

BREAKING CHANGE: Bitmap and btree indices will no longer be readable by
older versions of Lance. This is not a "backwards compatibility change"
(no APIs or code will stop working) but rather a "forwards compatibility
change" (you need to be careful in a multi-version deployment or if you
roll back)
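The comparison-semantics distinction behind the first fix can be sketched like this (illustrative, not lance's code): SQL-style three-valued comparison versus the total order arrow-rs uses for sorting.

```python
def sql_lt(x, y):
    # SQL three-valued logic: comparing anything with NULL yields NULL
    # (which filters treat as "does not match"), so X < NULL matches nothing.
    if x is None or y is None:
        return None
    return x < y

def sort_order_lt(x, y):
    # arrow-rs-style total order for sorting: NULL is "the smallest value",
    # so NULL < x is true for any non-null x, and X < NULL is always false.
    if x is None:
        return y is not None
    if y is None:
        return False
    return x < y
```

An index built on the sort order must not answer SQL range queries with it directly, which is the mismatch the btree and bitmap fixes address.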
…ormat#3194)

As discussed in [PR](lance-format#3084), I had
implemented the `_rowid` meta column only in the Java package.
When we create the version bump commit it currently updates the lock
file `Cargo.lock` to point to the new versions. I suspect it is the
`cargo ws version --no-git-commit -y --exact --force 'lance*' ${{
inputs.part }}` command that does this. However, we have two lock files,
and `python/Cargo.lock` is not updated. This PR adds a step to the
version bump to also update `python/Cargo.lock`.
Support handling Blob data in PyTorch loader
… for pylance (lance-format#3216)

Although `enable_move_stable_row_ids` is still experimental,
it still needs to be added to the pylance write_dataset interface for
experimental usage.
…mat#3208)

The repetition index is what will give us random access support when we
have list data. At a high level it stores the number of top-level rows
in each mini-block chunk. We can use this later to figure out which
chunks we need to read.

In reality things are a little more complicated because we don't mandate
that each chunk starts with a brand new row (e.g. a row can span
multiple mini-block chunks). This is useful because we eventually want
to support arbitrarily deep nested access. If we create not-so-mini
blocks in the presence of large lists then we introduce read
amplification we'd like to avoid.
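The core idea can be sketched as a prefix sum over per-chunk row starts (a hypothetical helper, not the actual on-disk data structure):

```python
import bisect

def chunk_for_row(rows_starting_per_chunk, row):
    # rows_starting_per_chunk[i] = number of top-level rows that *start* in
    # mini-block chunk i (a row may then spill into the following chunks).
    # Returns the index of the chunk where top-level row `row` begins, which
    # is the first chunk a reader must fetch for that row.
    starts = []
    total = 0
    for n in rows_starting_per_chunk:
        starts.append(total)
        total += n
    return bisect.bisect_right(starts, row) - 1
```

Note the `[3, 0, 2]` case below: chunk 1 has zero rows starting in it because a large row from chunk 0 spans it entirely, so row 3 begins in chunk 2.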
This PR tries to add packed struct encoding.

During encoding, it packs a struct with fixed-width fields, producing a
row-oriented `FixedWidthDataBlock`, then uses `ValueCompressor` to
compress it into a `MiniBlock` layout.

During decoding, it first uses `ValueDecompressor` to get the
row-oriented `FixedWidthDataBlock`, then constructs a `StructDataBlock`
for output.

lance-format#3173 lance-format#2601
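A minimal analogue using Python's stdlib `struct` module (illustrative only; lance's DataBlocks work differently): fixed-width struct fields packed row by row into one flat buffer, then recovered.

```python
import struct

ROW_FMT = "<if"  # example struct: one int32 field + one float32 field, little-endian

def pack_rows(rows):
    # Row-oriented packing: each row's fixed-width fields are laid out
    # contiguously, so every row occupies struct.calcsize(ROW_FMT) bytes.
    return b"".join(struct.pack(ROW_FMT, *row) for row in rows)

def unpack_rows(buf):
    # Walk the flat buffer in fixed-size strides to rebuild the rows.
    return list(struct.iter_unpack(ROW_FMT, buf))
```

Because every row has the same byte width, random access to row `i` is just a multiply, which is the property the packed-struct encoding exploits.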
The current main has a test failure in `test_fsl_packed_struct` because I
introduced statistics gathering for `StructDataBlock`, but not
statistics gathering for `FixedSizeListDataBlock`.

To fix this test, I can add statistics gathering (in this case, only
`Stat::MaxLength` is needed) for `FixedSizeListDataBlock`.

I think the variants of `FixedSizeList`'s child datablock can only be
either a `fixed width datablock` or another `fixed size list`?

This PR however only disables the packed struct encoding test for `fixed
size list`.

----------

after reading
https://docs.rs/arrow-array/53.3.0/src/arrow_array/array/fixed_size_list_array.rs.html#133,
it looks like the child of a `fixed size list` can be any array
type, which means that to support `MaxLength` statistics for `fixed size list
data block`, we need `MaxLength` statistics for all other
`datablock` types.
This adds support for the SQL `col BETWEEN x AND y` clause.

---------

Co-authored-by: Weston Pace <weston.pace@gmail.com>
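Semantically, `BETWEEN` is inclusive on both ends; a toy evaluation of the predicate (not the actual DataFusion/lance code):

```python
def between(values, lo, hi):
    # col BETWEEN lo AND hi  ==  lo <= col AND col <= hi (both ends inclusive),
    # a classic gotcha for anyone expecting a half-open range.
    return [lo <= v <= hi for v in values]
```
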
wjones127 and others added 24 commits March 21, 2025 17:18
This also fixes a bug where the `AddAssign` impl for u8x16 was not
saturating.

- 2x faster than before, so 4x faster than 8bit PQ
- slightly improves recall

---------

Signed-off-by: BubbleCal <bubble-cal@outlook.com>
Python 3.11 and later versions natively support tomllib, but lower
versions require third-party libraries like tomli or toml.
For example:

```
# Python 3.11+ ships tomllib in the standard library;
# older versions need the third-party tomli package (pip install tomli).
try:
    import tomllib
except ModuleNotFoundError:
    import tomli as tomllib
```
This will be extra helpful once this is merged:
lance-format#3572
…format#3591)

A breaking change was introduced in
jhpratt/deranged#18 which was not given a
semver bump. As a result we pick it up and it fails the "no lock file"
test.

This PR just avoids the try_into entirely since it doesn't seem to be
necessary (we only work on 64-bit machines so usize->u64 should be
safe).
…#3596)

this fixes:
- divide by 0 error if remapping an empty PQ storage
- 4bit PQ panic if there are less than 16 rows

---------

Signed-off-by: BubbleCal <bubble-cal@outlook.com>
…each task (lance-format#3599)

fix lance-format#3598
After this PR, the parse time decreases from 4 hours to 17 minutes.
After lance-format#3511, we discovered that we
also needed support for setting the token through environment variables,
so this sets storage options with the "google_storage_token" env
variable

---------

Co-authored-by: Alexandra Li <alexandra.li@databricks.com>
Now we drop the `__ivf_part_id` when shuffling; the corner case is
`num_partitions=1`:
1. if `num_partitions=1` then no shuffling is needed
2. the shuffler reader returns the data directly
3. so the `__ivf_part_id` is not dropped, and it gets written into the index
file as well

---------

Signed-off-by: BubbleCal <bubble-cal@outlook.com>
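The fix amounts to dropping the helper column on every path, including the `num_partitions=1` shortcut; sketched below (function shape and batch representation are assumptions, not lance's API):

```python
def read_partition(batch, num_partitions):
    # Previously the helper column was only dropped inside the shuffler; with
    # num_partitions == 1 the shuffle was skipped entirely, so the column
    # leaked into the index file. Dropping it unconditionally covers both paths.
    batch = dict(batch)
    batch.pop("__ivf_part_id", None)
    return batch
```
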
…t#3609)

Propagate the parent span context to tasks spawned by ObjectWriter's
AsyncWrite implementation. This is needed because callers might have tracing
subscribers they're using for observability over write events, and
without the span context accurately propagated, tracing events
that happen inside the spawned task are difficult to associate with the
invoking write call.
Reverts part of lance-format#3546 which added `-SNAPSHOT` to the versions. Currently
the CI build system does not publish Java artifacts on pre-releases.
There is also nothing in the build script to remove the `-SNAPSHOT`
designation from the version. As a result the publish failed.

Currently, CI requires the version specifier point to the next stable
version that will be released. This restores that so the next stable
release can succeed.
lance-format#3603)

fixes lance-format#3601
More info can be found in lance-format#3601.
For merge_insert, the partition number does not affect the memory size
required by each partition, but it does affect the memory size available
to each partition.
Limit the target partitions to 8 or the CPU core count to reduce the chance of
hitting "Resources exhausted" during merge insert.
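Assuming the intended cap is the smaller of 8 and the CPU count (my reading of the sentence above, not confirmed against the code), a sketch:

```python
import os

def merge_insert_target_partitions():
    # Fewer partitions -> each partition gets a larger share of the fixed
    # memory budget, reducing "Resources exhausted" errors during merge_insert.
    return min(8, os.cpu_count() or 1)
```
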
…coder (lance-format#3607)

`PrimitiveFieldEncoder` may generate empty `part`s and their
corresponding encoding tasks, especially when `max_page_size` is small.
This is unnecessary and can be confusing, as some empty part information
gets recorded at the end, and redundant encoding tasks are processed
needlessly. This PR fixes the issue by exiting the loop early when there
is no data to process.

Co-authored-by: LuQQiu <luqiujob@gmail.com>
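The shape of the fix, sketched on a simplified chunking loop (hypothetical code, not the actual `PrimitiveFieldEncoder`):

```python
def part_sizes(num_values, max_page_size):
    # A naive take-until-empty loop can record a trailing empty part (and a
    # redundant encoding task) when num_values is an exact multiple of
    # max_page_size. Exiting the loop before appending avoids that.
    parts = []
    remaining = num_values
    while True:
        take = min(remaining, max_page_size)
        if take == 0:
            break  # the fix: no data left, so emit no empty part
        parts.append(take)
        remaining -= take
    return parts
```
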
This PR also allows NANs to exist in the btree column
Every time I execute "make format-python" on the Python code in
the current main branch, format errors are reported for
these few files, so I want to correct the formatting here.

@westonpace @wjones127  Help review it.
* Migrates all methods of `CommitHandler` to just use
`ManifestLocation`.
* Eliminates `O(num_manifests)` IOPS from `cleanup_old_versions`, since
we no longer have to make a separate `HEAD` request to get the size of
the file.
* Eliminates `O(num_manifests)` IOPS from `list_versions()`, similar
reasons as above.
* Adds `e_tag` to `ManifestLocation`, so we can check we are loading the
expected manifest. This eliminates the possibility that we are caching
an old version of the manifest, in cases where the dataset has been
deleted and recreated to the same version number.
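The e_tag check can be sketched like this (hypothetical cache and location shapes, not the actual `ManifestLocation` API):

```python
def load_manifest(cache, location, fetch):
    # location: {"version": int, "e_tag": str}. Comparing e_tags guards
    # against a dataset that was deleted and recreated reusing the same
    # version number: the version matches, but the e_tag will not.
    cached = cache.get(location["version"])
    if cached is not None and cached["e_tag"] == location["e_tag"]:
        return cached["manifest"]          # cache hit, no extra request
    manifest = fetch(location)             # miss or stale: re-read the manifest
    cache[location["version"]] = {"e_tag": location["e_tag"], "manifest": manifest}
    return manifest
```
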
@github-actions

github-actions Bot commented Apr 1, 2025

ACTION NEEDED
Lance follows the Conventional Commits specification for release automation.

The PR title and description are used as the merge commit message. Please update your PR title and description to match the specification.

For details on the error please inspect the "PR Title Check" action.

@emilk emilk force-pushed the emilk/update-chrono-arrow branch from d10b36f to f936f84 Compare April 1, 2025 13:59
@emilk emilk changed the title Update chrono and arrow Lance 0.25.2 but update chrono and arrow Apr 1, 2025
@teh-cmc teh-cmc closed this May 16, 2025