…rmat#3126) This PR adds the helper functions `expect_stat` and `expect_single_stat` to make DataBlock statistics easier to use.
…sk (lance-format#3183) Currently `FileFragment::create` only supports creating a single file fragment, which causes two issues in the Spark connector: 1. if the Spark task is empty, this API throws an exception since there is no data to create the fragment; 2. if the task's data stream is very large, it generates one huge file in Lance format, which is unfriendly to Spark parallelism. So I removed the assigned fragment id and added a new method named `FileFragment::create_fragments` that can generate zero or multiple fragments.
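The new behavior can be sketched in Python (a toy stand-in; the real method is the Rust `FileFragment::create_fragments`, and the chunking parameter name here is hypothetical):

```python
def create_fragments(rows, max_rows_per_fragment):
    """Split a task's row stream into zero or more fragments.

    Returns an empty list for an empty task instead of raising, and
    splits large streams so no single fragment file grows huge.
    """
    fragments = []
    current = []
    for row in rows:
        current.append(row)
        if len(current) == max_rows_per_fragment:
            fragments.append(current)
            current = []
    if current:
        fragments.append(current)
    return fragments
```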
…le columns (lance-format#3189) fix lance-format#3188 --------- Signed-off-by: BubbleCal <bubble-cal@outlook.com>
…3197) `.java-version` is generated automatically by [jenv](https://www.jenv.be/). Jenv is a very popular tool used to manage Java environments.
Support dropping datasets for Python & Java, and support 'drop table' & 'create or replace table' for Spark --------- Co-authored-by: Will Jones <willjones127@gmail.com>
Co-authored-by: Jeremy Leibs <jeremy@rerun.io> Co-authored-by: Lei Xu <eddyxu@gmail.com>
4bit PQ is 3x faster than before:
```
16000,l2,PQ=96x4,DIM=1536
time: [187.17 µs 187.95 µs 188.52 µs]
change: [-65.789% -65.641% -65.520%] (p = 0.00 < 0.10)
Performance has improved.
16000,cosine,PQ=96x4,DIM=1536
time: [214.16 µs 214.52 µs 214.89 µs]
change: [-62.748% -62.594% -62.442%] (p = 0.00 < 0.10)
Performance has improved.
16000,dot,PQ=96x4,DIM=1536
time: [190.12 µs 191.27 µs 192.22 µs]
change: [-65.496% -65.303% -65.086%] (p = 0.00 < 0.10)
Performance has improved.
```
Posting the 8-bit PQ results here for comparison; in short, 4-bit PQ is about 2x
faster with the same index params:
```
compute_distances: 16000,l2,PQ=96,DIM=1536
time: [405.11 µs 405.72 µs 406.92 µs]
change: [-0.2844% +0.1588% +0.6035%] (p = 0.50 > 0.10)
No change in performance detected.
compute_distances: 16000,cosine,PQ=96,DIM=1536
time: [419.98 µs 421.05 µs 421.99 µs]
change: [-0.2540% +0.1098% +0.4928%] (p = 0.59 > 0.10)
No change in performance detected.
compute_distances: 16000,dot,PQ=96,DIM=1536
time: [432.08 µs 433.63 µs 435.69 µs]
change: [-25.522% -25.243% -24.938%] (p = 0.00 < 0.10)
Performance has improved.
```
---------
Signed-off-by: BubbleCal <bubble-cal@outlook.com>
…rmat#3200) Empty & null lists are interesting. If you have them, then your final repetition & definition buffers will have more items than your flattened array does. This fact required a considerable reworking of how we build and unravel rep/def buffers. When building, we record the positions of the specials and then, when we serialize into rep/def buffers, we insert these special values. When unraveling, we need to deal with the fact that certain rep/def values are "invisible" to the current context in which we are unraveling. In addition, we now need to start keeping track of the structure of each layer of repetition in the page metadata. This helps us understand the meaning behind different definition levels later when we are unraveling. This PR adds the changes to the rep/def utilities. We still aren't actually using repetition levels at all yet. That will come in future PRs.
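The core observation can be shown with a toy definition-level scheme in Python (the level values and scheme here are a simplified illustration, not Lance's actual encoding): an empty or null list emits one level but contributes no item to the flattened values, so the level buffer outgrows the flattened array.

```python
NULL_LIST = 0   # the list itself is null
EMPTY_LIST = 1  # the list is present but has no items
ITEM = 2        # a real flattened item

def def_levels(lists):
    """Toy definition levels for a list<int32> column."""
    levels, flattened = [], []
    for lst in lists:
        if lst is None:
            levels.append(NULL_LIST)       # special value, no flattened item
        elif len(lst) == 0:
            levels.append(EMPTY_LIST)      # special value, no flattened item
        else:
            for v in lst:
                levels.append(ITEM)
                flattened.append(v)
    return levels, flattened
```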
…t#3211) There were a few issues with our null handling in scalar indices. First, it appears I assumed earlier that `X < NULL` and `X > NULL` would always be false. However, in `arrow-rs` the ordering considers `NULL` to be "the smallest value", and so `X < NULL` always evaluated to true. This required some changes to the logic in the btree and bitmap indices. Second, the btree index was still using the v1 file format because it relied on the page size to keep track of the index's batch size. I've instead made the batch size a configurable property (configurable in code, not configurable by users) and made it so that btree can use the v2 file format. Finally, related to the above, I changed it so we now write v2 files for all scalar indices, even if the dataset is a v1 dataset. I think that's a reasonable decision at this point. The logic to fall back and read the old v1 files was already in place (I believe @BubbleCal added it back when working on the inverted index), but I added a migration test just to be sure we weren't breaking our btree / bitmap support. Users with existing bitmap indices will get the new correct behavior without any changes. Users with existing btree indices will get some of the new correct behavior but will need to retrain their indices to get all of the correct behavior. BREAKING CHANGE: Bitmap and btree indices will no longer be readable by older versions of Lance. This is not a "backwards compatibility change" (no APIs or code will stop working) but rather a "forwards compatibility change" (you need to be careful in a multi-version deployment or if you roll back).
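The first issue can be illustrated with a small Python sketch (a simplified model, not the actual btree code): over keys stored in a total order that puts NULL first, evaluating `col < v` naively would include the null prefix, while SQL semantics require NULL comparisons to be false, so the nulls must be skipped.

```python
import bisect

def rows_less_than(sorted_keys, v):
    """Evaluate `col < v` over keys sorted with None (NULL) first.

    A raw total-order comparison treats None as the smallest value and
    would wrongly include it; SQL semantics require comparisons against
    NULL to be false, so we skip the null prefix before range-searching.
    """
    first_non_null = 0
    while first_non_null < len(sorted_keys) and sorted_keys[first_non_null] is None:
        first_non_null += 1
    non_null = sorted_keys[first_non_null:]
    return non_null[:bisect.bisect_left(non_null, v)]
```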
…ormat#3194) As discussed in [PR](lance-format#3084), I implemented the `_rowid` meta column only in the Java package.
When we create the version bump commit it currently updates the lock
file `Cargo.lock` to point to the new versions. I suspect it is the
`cargo ws version --no-git-commit -y --exact --force 'lance*' ${{
inputs.part }}` command that does this. However, we have two lock files,
and `python/Cargo.lock` is not updated. This PR adds a step to the
version bump to also update `python/Cargo.lock`.
Support handling Blob data in PyTorch loader
… for pylance (lance-format#3216) Although `enable_move_stable_row_ids` is still experimental, it still needs to be added to the pylance `write_dataset` interface for experimental usage.
…mat#3208) The repetition index is what will give us random access support when we have list data. At a high level it stores the number of top-level rows in each mini-block chunk. We can use this later to figure out which chunks we need to read. In reality things are a little more complicated because we don't mandate that each chunk starts with a brand new row (e.g. a row can span multiple mini-block chunks). This is useful because we eventually want to support arbitrarily deep nested access. If we create not-so-mini blocks in the presence of large lists then we introduce read amplification we'd like to avoid.
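A toy lookup over such an index might look like the following Python sketch (the representation is an assumption for illustration: each chunk records how many top-level rows *begin* in it, and a row spanning into later chunks leaves those chunks at 0):

```python
def chunk_for_row(rows_started_per_chunk, row):
    """Find the mini-block chunk where top-level row `row` begins.

    rows_started_per_chunk[i] = number of top-level rows starting in
    chunk i. A large row may span several chunks; the spanned chunks
    start no new rows and are recorded as 0.
    """
    started = 0
    for i, n in enumerate(rows_started_per_chunk):
        started += n
        if row < started:
            return i
    raise IndexError(row)
```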
This PR adds packed struct encoding. During encoding, it packs a struct with fixed-width fields into a row-oriented `FixedWidthDataBlock`, then uses `ValueCompressor` to compress it into a `MiniBlock` layout. During decoding, it first uses `ValueDecompressor` to get the row-oriented `FixedWidthDataBlock` back, then constructs a `StructDataBlock` for output. lance-format#3173 lance-format#2601
The current main has a test failure in `test_fsl_packed_struct` because I introduced statistics gathering for `StructDataBlock` but not for `FixedSizeListDataBlock`. To fix this test, I could add statistics gathering (in this case, only `Stat::MaxLength` is needed) for `FixedSizeListDataBlock`. I thought the child datablock of a `FixedSizeList` could only be either a fixed-width datablock or another fixed-size list; this PR therefore only disables the packed struct encoding test for fixed-size lists. ---------- After reading https://docs.rs/arrow-array/53.3.0/src/arrow_array/array/fixed_size_list_array.rs.html#133, it looks like the child of a fixed-size list can be any array type, which means that to support `MaxLength` statistics for `FixedSizeListDataBlock`, we need `MaxLength` statistics for all other datablock types.
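The row-oriented packing idea can be sketched with Python's `struct` module (a toy two-field struct of int32 and float64; the real code operates on Arrow buffers and `DataBlock`s):

```python
import struct

ROW_FMT = "<id"  # one row: little-endian int32 followed by float64

def pack_rows(xs, ys):
    """Pack two fixed-width columns row by row into one buffer,
    a toy stand-in for turning a struct of fixed-width fields into a
    row-oriented FixedWidthDataBlock."""
    return b"".join(struct.pack(ROW_FMT, x, y) for x, y in zip(xs, ys))

def unpack_rows(buf):
    """Decode the row-oriented buffer back into the two columns."""
    row = struct.calcsize(ROW_FMT)
    pairs = [struct.unpack(ROW_FMT, buf[i:i + row]) for i in range(0, len(buf), row)]
    return [p[0] for p in pairs], [p[1] for p in pairs]
```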
This adds support for the sql `col BETWEEN x AND y` clause --------- Co-authored-by: Weston Pace <weston.pace@gmail.com>
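For reference, SQL `BETWEEN` is inclusive on both bounds and never matches NULL; a minimal Python model of the predicate:

```python
def between(values, lo, hi):
    """SQL-style `col BETWEEN lo AND hi`: inclusive on both ends,
    and NULL (None) never satisfies the predicate."""
    return [v for v in values if v is not None and lo <= v <= hi]
```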
This also fixes a bug where the `AddAssign` impl for u8x16 was not saturating. It is 2x faster than before, so 4x faster than 8-bit PQ, and slightly improves recall. --------- Signed-off-by: BubbleCal <bubble-cal@outlook.com>
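Why saturation matters here, in a scalar Python sketch (the real impl is a SIMD u8x16 add): accumulating small per-subvector distances in u8 lanes can overflow, and a wrapping add makes a far vector look near, while a saturating add clamps at 255.

```python
def saturating_add_u8(a, b):
    """u8 addition clamped at 255 instead of wrapping modulo 256."""
    return min(a + b, 255)

def wrapping_add_u8(a, b):
    """The buggy behavior: overflow wraps, so a large accumulated
    distance can wrap around to a tiny value."""
    return (a + b) & 0xFF
```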
Python 3.11 and later versions natively support `tomllib`, but lower versions require a third-party library like `tomli` or `toml`. For example:
```
# Python 3.11+ (built-in)
import tomllib

# For Python <3.11:
# First install: pip install tomli
import tomli as tomllib
```
This will be extra helpful once this is merged: lance-format#3572
…format#3591) A breaking change was introduced in jhpratt/deranged#18 which was not given a semver bump. As a result we pick it up and it fails the "no lock file" test. This PR just avoids the try_into entirely since it doesn't seem to be necessary (we only work on 64-bit machines so usize->u64 should be safe).
…#3596) this fixes: - divide by 0 error if remapping an empty PQ storage - 4bit PQ panic if there are less than 16 rows --------- Signed-off-by: BubbleCal <bubble-cal@outlook.com>
…each task (lance-format#3599) Fixes lance-format#3598. After this PR, the parse time decreases from 4 hours to 17 minutes.
lance-format#3602) Support `add_columns(field | [field] | schema)`
After lance-format#3511, we discovered that we also needed support for setting the token through environment variables, so this sets storage options from the "google_storage_token" env variable --------- Co-authored-by: Alexandra Li <alexandra.li@databricks.com>
Now we drop the `__ivf_part_id` when shuffling; the corner case is `num_partitions=1`: 1. if `num_partitions=1` then no shuffling is needed; 2. the shuffle reader returns the data directly; 3. so `__ivf_part_id` is not dropped, and it gets written into the index file as well --------- Signed-off-by: BubbleCal <bubble-cal@outlook.com>
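The shape of the fix, as a toy Python sketch (batches modeled as dicts of columns; names other than `__ivf_part_id` are illustrative): drop the helper column on every output path, including the no-shuffle `num_partitions=1` path.

```python
def finalize_shuffled_batch(batch):
    """The bug: when num_partitions == 1 the shuffle was skipped and the
    reader returned data directly, so `__ivf_part_id` leaked into the
    index file. The fix drops the helper column unconditionally."""
    batch = dict(batch)
    batch.pop("__ivf_part_id", None)
    return batch
```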
…t#3609) Propagate the parent span context to tasks spawned by ObjectWriter's AsyncWrite implementation. This is needed because callers may have tracing subscribers they use for observability over write events; without the span context accurately propagated, tracing events that happen inside the spawned task are difficult to associate with the invoking write call.
Reverts part of lance-format#3546 which added `-SNAPSHOT` to the versions. Currently the CI build system does not publish Java artifacts on pre-releases. There is also nothing in the build script to remove the `-SNAPSHOT` designation from the version. As a result the publish failed. Currently, CI requires the version specifier point to the next stable version that will be released. This restores that so the next stable release can succeed.
lance-format#3603) Fixes lance-format#3601. More info can be found in lance-format#3601. For merge_insert, the partition count does not change the memory each partition requires, but it does shrink the memory available to each partition. Limit the target partitions to the smaller of 8 and the CPU core count to reduce the chance of hitting "Resources exhausted" during merge insert.
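The cap itself is a one-liner; a Python sketch of the rule (the function name is illustrative, the real setting is a DataFusion target-partitions config):

```python
import os

def merge_insert_target_partitions(cpu_cores=None):
    """Cap target partitions at min(8, CPU cores): each extra partition
    shrinks the per-partition memory budget without reducing the memory
    each partition actually needs, so fewer partitions lowers the chance
    of a 'Resources exhausted' error."""
    if cpu_cores is None:
        cpu_cores = os.cpu_count() or 1
    return min(8, cpu_cores)
```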
…coder (lance-format#3607) `PrimitiveFieldEncoder` may generate empty `part`s and their corresponding encoding tasks, especially when `max_page_size` is small. This is unnecessary and can be confusing, as some empty part information gets recorded at the end, and redundant encoding tasks are processed needlessly. This PR fixes the issue by exiting the loop early when there is no data to process. Co-authored-by: LuQQiu <luqiujob@gmail.com>
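The fix pattern, as a toy Python sketch (the real code slices Arrow data into encoding tasks; names here are illustrative): stop chunking as soon as a slice comes back empty instead of emitting empty parts and no-op tasks.

```python
def encode_parts(values, max_page_size):
    """Chunk values into encoding parts of at most max_page_size,
    exiting the loop early when there is no data left rather than
    recording empty trailing parts and queuing redundant tasks."""
    parts = []
    start = 0
    while True:
        part = values[start:start + max_page_size]
        if not part:  # no data to process: exit early
            break
        parts.append(part)
        start += max_page_size
    return parts
```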
This PR also allows NaNs to exist in the btree column
Every time I run `make format-python` on the Python code in the current main branch, format errors are reported for these few files, so I want to correct the formatting here. @westonpace @wjones127 please help review.
* Migrates all methods of `CommitHandler` to just use `ManifestLocation`. * Eliminates `O(num_manifests)` IOPS from `cleanup_old_versions`, since we no longer have to make a separate `HEAD` request to get the size of the file. * Eliminates `O(num_manifests)` IOPS from `list_versions()`, similar reasons as above. * Adds `e_tag` to `ManifestLocation`, so we can check we are loading the expected manifest. This eliminates the possibility that we are caching an old version of the manifest, in cases where the dataset has been deleted and recreated to the same version number.
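The gist of the e_tag check, as a Python sketch (field names are assumptions based on the description; the real type is the Rust `ManifestLocation`): carrying the listed size avoids a per-manifest `HEAD` request, and the e_tag lets a cache reject a stale manifest when a dataset was deleted and recreated to the same version number.

```python
from dataclasses import dataclass

@dataclass
class ManifestLocation:
    version: int
    path: str
    size: int    # taken from the listing, so no extra HEAD request
    e_tag: str   # identifies the exact object, not just the version

def cached_manifest_is_fresh(cached, current):
    """A cached manifest is reusable only if both the version and the
    e_tag still match; a recreated dataset reuses version numbers but
    gets a new e_tag."""
    return cached.version == current.version and cached.e_tag == current.e_tag
```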
chrono and arrow
This is lance 0.25.2 plus one commit: f936f84
This is to work around lance pinning `chrono` to an old version: https://github.com/lancedb/lance/blob/12be3491a61dc5af6475e3e3decb625e562576ea/Cargo.toml#L91-L93
We should make a PR to lance to have them remove that pin