
Lance 0.25.2 but update chrono and arrow#4

Closed
emilk wants to merge 248 commits into main from emilk/update-chrono-arrow

Conversation

@emilk
Member

@emilk emilk commented Apr 1, 2025

This is lance 0.25.2 plus one commit: f936f84

This works around lance pinning chrono to an old version:

https://github.com/lancedb/lance/blob/12be3491a61dc5af6475e3e3decb625e562576ea/Cargo.toml#L91-L93

We should make a PR to lance to have them remove that pin.
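For reference, a version pin in a Cargo.toml and the way a patched fork can be substituted from a consuming workspace look roughly like this (versions and repository URL are illustrative, not taken from lance's actual manifest):

```toml
# The kind of exact-version pin this branch works around (version illustrative):
[dependencies]
chrono = "=0.4.39"

# In a consuming workspace, a patched fork can override the published crate:
[patch.crates-io]
lance = { git = "https://github.com/example/lance", branch = "emilk/update-chrono-arrow" }
```

The `=` prefix makes Cargo refuse any other version, which is what forces downstream users onto the old chrono until the pin is relaxed.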

broccoliSpicy and others added 30 commits November 26, 2024 11:40
…rmat#3126)

This PR tries to add helper function `expect_stat` and
`expect_single_stat` to make DataBlock statistics easier to use.
…sk (lance-format#3183)

Previously, `FileFragment::create` could only create one file fragment, which caused two issues in the Spark connector:
1. If the Spark task is empty, this API throws an exception since there is no data to create the fragment from.
2. If the task's data stream is very large, it generates one huge file in Lance format, which is unfriendly to Spark parallelism.

So I removed the assigned fragment id and added a new method named `FileFragment::create_fragments` to generate zero or multiple fragments.


![image](https://github.com/user-attachments/assets/54fb2497-8163-4652-9e0b-d50a88fade53)
…le columns (lance-format#3189)

fix lance-format#3188

---------

Signed-off-by: BubbleCal <bubble-cal@outlook.com>
…3197)

`.java-version` is generated automatically by
[jenv](https://www.jenv.be/). Jenv is a very popular tool used
to manage Java environments.
Support dropping datasets for Python & Java.
Support 'drop table' & 'create or replace table' for Spark.

---------

Co-authored-by: Will Jones <willjones127@gmail.com>
Co-authored-by: Jeremy Leibs <jeremy@rerun.io>
Co-authored-by: Lei Xu <eddyxu@gmail.com>
4bit PQ is 3x faster than before:
```
16000,l2,PQ=96x4,DIM=1536
                        time:   [187.17 µs 187.95 µs 188.52 µs]
                        change: [-65.789% -65.641% -65.520%] (p = 0.00 < 0.10)
                        Performance has improved.

16000,cosine,PQ=96x4,DIM=1536
                        time:   [214.16 µs 214.52 µs 214.89 µs]
                        change: [-62.748% -62.594% -62.442%] (p = 0.00 < 0.10)
                        Performance has improved.

16000,dot,PQ=96x4,DIM=1536
                        time:   [190.12 µs 191.27 µs 192.22 µs]
                        change: [-65.496% -65.303% -65.086%] (p = 0.00 < 0.10)
                        Performance has improved.
```

Posting the 8-bit PQ results here for comparison; in short, 4-bit PQ is about 2x
faster with the same index params:
```
compute_distances: 16000,l2,PQ=96,DIM=1536
                        time:   [405.11 µs 405.72 µs 406.92 µs]
                        change: [-0.2844% +0.1588% +0.6035%] (p = 0.50 > 0.10)
                        No change in performance detected.

compute_distances: 16000,cosine,PQ=96,DIM=1536
                        time:   [419.98 µs 421.05 µs 421.99 µs]
                        change: [-0.2540% +0.1098% +0.4928%] (p = 0.59 > 0.10)
                        No change in performance detected.

compute_distances: 16000,dot,PQ=96,DIM=1536
                        time:   [432.08 µs 433.63 µs 435.69 µs]
                        change: [-25.522% -25.243% -24.938%] (p = 0.00 < 0.10)
                        Performance has improved.
```

---------

Signed-off-by: BubbleCal <bubble-cal@outlook.com>
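As a rough illustration of why fewer bits help (a sketch, not lance's actual implementation): 4-bit PQ uses only 16 centroids per sub-space, so the per-sub-space distance lookup table has 16 entries (cache-friendly, SIMD-shuffle-friendly) and two codes pack into one byte.

```python
import numpy as np

def pq4_lookup_table(query, codebooks):
    # codebooks: (num_subspaces, 16, sub_dim) -- 4-bit PQ has 16 centroids per sub-space.
    # table[m, c] = squared L2 distance from the m-th query sub-vector to centroid c.
    q = query.reshape(codebooks.shape[0], 1, -1)
    return ((codebooks - q) ** 2).sum(axis=-1)

def pq4_distance(codes, table):
    # codes: (num_subspaces,) ints in 0..15; the asymmetric distance to a database
    # vector is just a sum of table lookups, one per sub-space.
    return table[np.arange(len(codes)), codes].sum()
```

A vector encoded exactly at its chosen centroids has distance zero to itself, which is a handy sanity check for any PQ implementation.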
…rmat#3200)

Empty & null lists are interesting. If you have them, your final
repetition & definition buffers will have more items than your
flattened array. This fact required a considerable reworking of how
we build and unravel rep/def buffers.

When building we record the position of the specials and then, when we
serialize into rep/def buffers, we insert these special values. When
unraveling we need to deal with the fact that certain rep/def values are
"invisible" to the current context in which we are unraveling.

In addition, we now need to start keeping track of the structure of each
layer of repetition in the page metadata. This helps us understand the
meaning behind different definition levels later when we are unraveling.

This PR adds the changes to the rep/def utilities. We still aren't
actually using repetition levels at all yet. That will come in future
PRs.
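A toy sketch of the idea (not lance's actual encoding): one extra definition level distinguishes null lists, empty lists, and present items, which is exactly why the rep/def buffers can outgrow the flattened values.

```python
def rep_def_levels(lists):
    # lists: each element is None (null list), [] (empty list), or a list of values.
    # def level: 0 = null list, 1 = empty list, 2 = present item.
    # rep level: 0 = starts a new top-level row, 1 = continues the current list.
    reps, defs, values = [], [], []
    for lst in lists:
        if lst is None:
            reps.append(0); defs.append(0)   # "invisible" to the values buffer
        elif len(lst) == 0:
            reps.append(0); defs.append(1)   # defined list, but contributes no value
        else:
            for i, v in enumerate(lst):
                reps.append(0 if i == 0 else 1)
                defs.append(2)
                values.append(v)
    return reps, defs, values
```

For `[[1, 2], None, [], [3]]` this yields five rep/def entries but only three flattened values: the null and empty lists occupy rep/def slots with no corresponding value.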
…t#3211)

There were a few issues with our null handling in scalar indices.

First, it appears I assumed earlier that `X < NULL` and `X > NULL` would
always be false. However, in `arrow-rs` the ordering considers `NULL` to
be "the smallest value" and so `X < NULL` always evaluated to true. This
required some changes to the logic in the btree and bitmap indices.

Second, the btree index was still using the v1 file format because it
relied on the page size to keep track of the index's batch size. I've
instead made the batch size a configurable property (configurable in
code, not configurable by users) and made it so that btree can use the
v2 file format.

Finally, related to the above, I changed it so we now write v2 files for
all scalar indices, even if the dataset is a v1 dataset. I think that's
a reasonable decision at this point.

The logic to fallback and read the old v1 files was already in place (I
believe @BubbleCal added it back when working on inverted index) but I
added a migration test just to be sure we weren't breaking our btree /
bitmap support.

Users with existing bitmap indices will get the new correct behavior
without any changes.
Users with existing btree indices will get some of the new correct
behavior but will need to retrain their indices to get all of the
correct behavior.

BREAKING CHANGE: Bitmap and btree indices will no longer be readable by
older versions of Lance. This is not a "backwards compatibility change"
(no APIs or code will stop working) but rather a "forwards compatibility
change" (you need to be careful in a multi-version deployment or if you
roll back)
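The comparison-semantics distinction behind the first fix can be sketched like this (illustrative, not lance's code): SQL-style three-valued comparison versus the total order arrow-rs uses for sorting.

```python
def sql_lt(x, y):
    # SQL three-valued logic: comparing anything with NULL yields NULL
    # (which filters treat as "does not match"), so X < NULL matches nothing.
    if x is None or y is None:
        return None
    return x < y

def sort_order_lt(x, y):
    # arrow-rs-style total order for sorting: NULL is "the smallest value",
    # so NULL < x is true for any non-null x, and X < NULL is always false.
    if x is None:
        return y is not None
    if y is None:
        return False
    return x < y
```

An index built on the sort order must not answer SQL range queries with it directly, which is the mismatch the btree and bitmap fixes address.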
…ormat#3194)

As discussed in [PR](lance-format#3084), I had
implemented the `_rowid` meta column only in the Java package.
When we create the version bump commit it currently updates the lock
file `Cargo.lock` to point to the new versions. I suspect it is the
`cargo ws version --no-git-commit -y --exact --force 'lance*' ${{
inputs.part }}` command that does this. However, we have two lock files,
and `python/Cargo.lock` is not updated. This PR adds a step to the
version bump to also update `python/Cargo.lock`.
Support handling Blob data in PyTorch loader
… for pylance (lance-format#3216)

Although `enable_move_stable_row_ids` is still experimental,
it still needs to be added to the pylance write_dataset interface for
experimental usage.
…mat#3208)

The repetition index is what will give us random access support when we
have list data. At a high level it stores the number of top-level rows
in each mini-block chunk. We can use this later to figure out which
chunks we need to read.

In reality things are a little more complicated because we don't mandate
that each chunk starts with a brand new row (e.g. a row can span
multiple mini-block chunks). This is useful because we eventually want
to support arbitrarily deep nested access. If we create not-so-mini
blocks in the presence of large lists then we introduce read
amplification we'd like to avoid.
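The core idea can be sketched as a prefix sum over per-chunk row starts (a hypothetical helper, not the actual on-disk data structure):

```python
import bisect

def chunk_for_row(rows_starting_per_chunk, row):
    # rows_starting_per_chunk[i] = number of top-level rows that *start* in
    # mini-block chunk i (a row may then spill into the following chunks).
    # Returns the index of the chunk where top-level row `row` begins, which
    # is the first chunk a reader must fetch for that row.
    starts = []
    total = 0
    for n in rows_starting_per_chunk:
        starts.append(total)
        total += n
    return bisect.bisect_right(starts, row) - 1
```

Note the `[3, 0, 2]` case below: chunk 1 has zero rows starting in it because a large row from chunk 0 spans it entirely, so row 3 begins in chunk 2.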
This PR tries to add packed struct encoding.

During encoding, it packs a struct with fixed-width fields, producing a
row-oriented `FixedWidthDataBlock`, then uses `ValueCompressor` to
compress it into a `MiniBlock` layout.

During decoding, it first uses `ValueDecompressor` to get the
row-oriented `FixedWidthDataBlock`, then constructs a `StructDataBlock`
for output.

lance-format#3173 lance-format#2601
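A minimal analogue using Python's stdlib `struct` module (illustrative only; lance's DataBlocks work differently): fixed-width struct fields packed row by row into one flat buffer, then recovered.

```python
import struct

ROW_FMT = "<if"  # example struct: one int32 field + one float32 field, little-endian

def pack_rows(rows):
    # Row-oriented packing: each row's fixed-width fields are laid out
    # contiguously, so every row occupies struct.calcsize(ROW_FMT) bytes.
    return b"".join(struct.pack(ROW_FMT, *row) for row in rows)

def unpack_rows(buf):
    # Walk the flat buffer in fixed-size strides to rebuild the rows.
    return list(struct.iter_unpack(ROW_FMT, buf))
```

Because every row has the same byte width, random access to row `i` is just a multiply, which is the property the packed-struct encoding exploits.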
The current main has a test failure in `test_fsl_packed_struct` because I
introduced statistics gathering for `StructDataBlock`, but not
statistics gathering for `FixedSizeListDataBlock`.

To fix this test, I can add statistics gathering (in this case, only
`Stat::MaxLength` is needed) for `FixedSizeListDataBlock`.

I think the variants of `FixedSizeList`'s child datablock can only be
either a `fixed width datablock` or another `fixed size list`?

This PR however only disables the packed struct encoding test for `fixed
size list`.

----------

after reading
https://docs.rs/arrow-array/53.3.0/src/arrow_array/array/fixed_size_list_array.rs.html#133,
it looks like the child of a `fixed size list` can be any array
type, which means that to support `MaxLength` statistics for `fixed size list
data block`, we need `MaxLength` statistics for all other
`datablock` types.
This adds support for the SQL `col BETWEEN x AND y` clause.

---------

Co-authored-by: Weston Pace <weston.pace@gmail.com>
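Semantically, `BETWEEN` is inclusive on both ends; a toy evaluation of the predicate (not the actual DataFusion/lance code):

```python
def between(values, lo, hi):
    # col BETWEEN lo AND hi  ==  lo <= col AND col <= hi (both ends inclusive),
    # a classic gotcha for anyone expecting a half-open range.
    return [lo <= v <= hi for v in values]
```
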
wjones127 and others added 24 commits March 21, 2025 17:18
This also fixes a bug where the `AddAssign` impl for u8x16 was not
saturating.

- 2x faster than before, so 4x faster than 8bit PQ
- slightly improves recall

---------

Signed-off-by: BubbleCal <bubble-cal@outlook.com>
Python 3.11 and later versions natively support tomllib, but lower
versions require third-party libraries like tomli or toml.
For example:

```
# Python 3.11+ ships tomllib in the standard library;
# older versions need the third-party tomli package (pip install tomli).
try:
    import tomllib
except ModuleNotFoundError:
    import tomli as tomllib
```
This will be extra helpful once this is merged:
lance-format#3572
…format#3591)

A breaking change was introduced in
jhpratt/deranged#18 which was not given a
semver bump. As a result we pick it up and it fails the "no lock file"
test.

This PR just avoids the try_into entirely since it doesn't seem to be
necessary (we only work on 64-bit machines so usize->u64 should be
safe).
…#3596)

this fixes:
- divide by 0 error if remapping an empty PQ storage
- 4bit PQ panic if there are less than 16 rows

---------

Signed-off-by: BubbleCal <bubble-cal@outlook.com>
…each task (lance-format#3599)

fix lance-format#3598
After this PR, the parse time decreases from 4 hours to 17 minutes.
After lance-format#3511, we discovered that we
also needed support for setting the token through environment variables,
so this sets storage options with the "google_storage_token" env
variable

---------

Co-authored-by: Alexandra Li <alexandra.li@databricks.com>
Now we drop the `__ivf_part_id` when shuffling; the corner case is
`num_partitions=1`:
1. if `num_partitions=1` then no shuffling is needed
2. the shuffler reader returns the data directly
3. so the `__ivf_part_id` is not dropped, and it gets written into the index
file as well

---------

Signed-off-by: BubbleCal <bubble-cal@outlook.com>
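The fix amounts to dropping the helper column on every path, including the `num_partitions=1` shortcut; sketched below (function shape and batch representation are assumptions, not lance's API):

```python
def read_partition(batch, num_partitions):
    # Previously the helper column was only dropped inside the shuffler; with
    # num_partitions == 1 the shuffle was skipped entirely, so the column
    # leaked into the index file. Dropping it unconditionally covers both paths.
    batch = dict(batch)
    batch.pop("__ivf_part_id", None)
    return batch
```
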
…t#3609)

Propagate the parent span context to tasks spawned by ObjectWriter's
AsyncWrite implementation. This is needed because callers might have tracing
subscribers they're using for observability over write events, and
without the span context accurately propagated, tracing events
that happen inside the spawned task are difficult to associate with the
invoking write call.
Reverts part of lance-format#3546 which added `-SNAPSHOT` to the versions. Currently
the CI build system does not publish Java artifacts on pre-releases.
There is also nothing in the build script to remove the `-SNAPSHOT`
designation from the version. As a result the publish failed.

Currently, CI requires the version specifier point to the next stable
version that will be released. This restores that so the next stable
release can succeed.
lance-format#3603)

fixes lance-format#3601
More info can be found in lance-format#3601.
For merge_insert, the partition number does not affect the memory size
required by each partition, but it does affect the memory size available
to each partition.
Limit the target partitions to 8 or the CPU core count to reduce the chance of
hitting "Resources exhausted" during merge insert.
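Assuming the intended cap is the smaller of 8 and the CPU count (my reading of the sentence above, not confirmed against the code), a sketch:

```python
import os

def merge_insert_target_partitions():
    # Fewer partitions -> each partition gets a larger share of the fixed
    # memory budget, reducing "Resources exhausted" errors during merge_insert.
    return min(8, os.cpu_count() or 1)
```
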
…coder (lance-format#3607)

`PrimitiveFieldEncoder` may generate empty `part`s and their
corresponding encoding tasks, especially when `max_page_size` is small.
This is unnecessary and can be confusing, as some empty part information
gets recorded at the end, and redundant encoding tasks are processed
needlessly. This PR fixes the issue by exiting the loop early when there
is no data to process.

Co-authored-by: LuQQiu <luqiujob@gmail.com>
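The shape of the fix, sketched on a simplified chunking loop (hypothetical code, not the actual `PrimitiveFieldEncoder`):

```python
def part_sizes(num_values, max_page_size):
    # A naive take-until-empty loop can record a trailing empty part (and a
    # redundant encoding task) when num_values is an exact multiple of
    # max_page_size. Exiting the loop before appending avoids that.
    parts = []
    remaining = num_values
    while True:
        take = min(remaining, max_page_size)
        if take == 0:
            break  # the fix: no data left, so emit no empty part
        parts.append(take)
        remaining -= take
    return parts
```
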
This PR also allows NANs to exist in the btree column
Every time I execute "make format-python" on the Python code in
the current main branch, format errors are reported for
these few files, so I want to correct the formatting here.

@westonpace @wjones127  Help review it.
* Migrates all methods of `CommitHandler` to just use
`ManifestLocation`.
* Eliminates `O(num_manifests)` IOPS from `cleanup_old_versions`, since
we no longer have to make a separate `HEAD` request to get the size of
the file.
* Eliminates `O(num_manifests)` IOPS from `list_versions()`, similar
reasons as above.
* Adds `e_tag` to `ManifestLocation`, so we can check we are loading the
expected manifest. This eliminates the possibility that we are caching
an old version of the manifest, in cases where the dataset has been
deleted and recreated to the same version number.
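The e_tag check can be sketched like this (hypothetical cache and location shapes, not the actual `ManifestLocation` API):

```python
def load_manifest(cache, location, fetch):
    # location: {"version": int, "e_tag": str}. Comparing e_tags guards
    # against a dataset that was deleted and recreated reusing the same
    # version number: the version matches, but the e_tag will not.
    cached = cache.get(location["version"])
    if cached is not None and cached["e_tag"] == location["e_tag"]:
        return cached["manifest"]          # cache hit, no extra request
    manifest = fetch(location)             # miss or stale: re-read the manifest
    cache[location["version"]] = {"e_tag": location["e_tag"], "manifest": manifest}
    return manifest
```
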
@github-actions

github-actions Bot commented Apr 1, 2025

ACTION NEEDED
Lance follows the Conventional Commits specification for release automation.

The PR title and description are used as the merge commit message. Please update your PR title and description to match the specification.

For details on the error please inspect the "PR Title Check" action.

@emilk emilk force-pushed the emilk/update-chrono-arrow branch from d10b36f to f936f84 Compare April 1, 2025 13:59
@emilk emilk changed the title Update chrono and arrow Lance 0.25.2 but update chrono and arrow Apr 1, 2025
@teh-cmc teh-cmc closed this May 16, 2025