2 changes: 2 additions & 0 deletions java/lance-jni/Cargo.lock
Contributor Author

@jackye1995 jackye1995 Oct 31, 2025

this is the old problem of #5044

26 changes: 26 additions & 0 deletions protos/table.proto
Contributor

Are we baking in semver as a requirement for writers? Seems unnecessary for the format to be opinionated about that?

Contributor Author

@jackye1995 jackye1995 Oct 31, 2025

It still works with an arbitrary string. But in general, recording the full semver seems beneficial: we can tell which specific version the writer is, and whether it is a main release or a specific beta version.

Contributor

Yeah, I don't disagree with recording the full semver. I'm more concerned that by providing specific prerelease and build_metadata fields we are baking semver concepts into the format. It seems like we could just have a single field, version_extra or something like that, where we put that segment of the version.

Contributor Author

@jackye1995 jackye1995 Oct 31, 2025

yeah I went back and forth with it tbh.

Originally I just had a classifier field (that was what was originally proposed on Slack). My internal debate was this: given a 1.2.3-beta.2 string, I can store beta.2 in classifier and get it by splitting the string on -. But what if in the future there is a 1.2.3+build.abcde string, which is still semver? Then that does not work.

I also thought about storing -beta.2 in the classifier, but then we need a parser to seek to the first non-version position of the string and split there. That feels like overkill, given that for the Lance library it is basically always semver. Even if we have some internal versions in the future, it probably still makes sense to follow semver and just use -internal.2, or leverage build metadata to store internal info, so we can easily infer the ordering of different versions.

So those were my thoughts in arriving at this state.

I guess having a single classifier/version_extra field would also work, let me know which you prefer.
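
To illustrate the ordering point in this comment, here is a minimal sketch of semver precedence written against the semver 2.0.0 rules, not taken from the Lance codebase. The function name `precedence_key` is hypothetical, and the sketch assumes its input is already a valid semver string.

```python
def precedence_key(version: str):
    """Sort key following semver 2.0.0 precedence (hypothetical helper).

    Assumes `version` is a valid semver string such as "1.2.3-beta.2".
    """
    # Build metadata (everything after '+') never affects precedence.
    core_and_pre = version.split("+", 1)[0]
    # The first '-' separates the core version from the prerelease.
    core, _, prerelease = core_and_pre.partition("-")
    major, minor, patch = (int(part) for part in core.split("."))
    if not prerelease:
        # A release ranks above any prerelease of the same core version.
        return (major, minor, patch, 1, ())
    identifiers = []
    for ident in prerelease.split("."):
        if ident.isascii() and ident.isdigit():
            # Numeric identifiers compare numerically and rank below
            # alphanumeric identifiers.
            identifiers.append((0, int(ident), ""))
        else:
            # Alphanumeric identifiers compare lexically in ASCII order.
            identifiers.append((1, 0, ident))
    return (major, minor, patch, 0, tuple(identifiers))


versions = ["1.2.3", "1.2.4-internal.2", "1.2.3-beta.11", "1.2.3-alpha", "1.2.3-beta.2"]
ordered = sorted(versions, key=precedence_key)
```

With this key, a prerelease such as 1.2.3-beta.2 sorts before 1.2.3, and 1.2.3+build.abcde compares equal to 1.2.3 because build metadata is ignored.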

Contributor

Yeah I was thinking you could just parse semver as str_concat(version, version_extra).

Contributor Author

@jackye1995 jackye1995 Oct 31, 2025

I think the problem is the reverse: how do you go from a version string to version and version_extra? For example, if I have 1.0-beta.1, which is not standard semver, it is not clear to me whether I need to put everything in version or split it into 1.0 and -beta.1.

The current approach provides a clear rule that (1) if it is semver, then can leverage those additional fields, (2) if it is not semver, everything is still stored in the version string.

If we want to move to just a version_extra, it looks like we would need to define something like: if the string starts with three numbers joined by two dots, those go to version and the rest goes to the extra. Does that sound good?

Contributor

> The current approach provides a clear rule that (1) if it is semver, then can leverage those additional fields, (2) if it is not semver, everything is still stored in the version string.

Okay, I guess that is fine.
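
The rule agreed on in this thread can be sketched as follows. This is an illustration only, not the PR's implementation: the names `split_writer_version` and `WriterVersionFields` are hypothetical, and the pattern is adapted from the regular expression suggested by the semver 2.0.0 specification.

```python
import re
from typing import NamedTuple, Optional

# Adapted from the regex suggested in the semver 2.0.0 spec; re.ASCII keeps
# \d limited to 0-9.
_SEMVER = re.compile(
    r"(?P<core>(?:0|[1-9]\d*)\.(?:0|[1-9]\d*)\.(?:0|[1-9]\d*))"
    r"(?:-(?P<prerelease>(?:0|[1-9]\d*|\d*[a-zA-Z-][0-9a-zA-Z-]*)"
    r"(?:\.(?:0|[1-9]\d*|\d*[a-zA-Z-][0-9a-zA-Z-]*))*))?"
    r"(?:\+(?P<build>[0-9a-zA-Z-]+(?:\.[0-9a-zA-Z-]+)*))?",
    re.ASCII,
)


class WriterVersionFields(NamedTuple):
    """Hypothetical container mirroring the three manifest fields."""

    version: str
    prerelease: Optional[str]
    build_metadata: Optional[str]


def split_writer_version(full: str) -> WriterVersionFields:
    """Split a version string according to the rule discussed above."""
    match = _SEMVER.fullmatch(full)
    if match is None:
        # (2) Not semver: everything stays in the version string.
        return WriterVersionFields(full, None, None)
    # (1) Semver: the core goes to version, the rest to the extra fields.
    return WriterVersionFields(match["core"], match["prerelease"], match["build"])
```

Under this sketch, 1.2.3-beta.2 and 1.2.3+build.abcde are split into their parts, while the non-semver 1.0-beta.1 is stored whole in version.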

@@ -58,7 +58,33 @@ message Manifest {
// that the library is semantically versioned, this is a string. However, if it
// is semantically versioned, it should be a valid semver string without any 'v'
// prefix. For example: `2.0.0`, `2.0.0-rc.1`.
//
// For forward compatibility with older readers, when writing new manifests this
// field should contain only the core version (major.minor.patch) without any
// prerelease or build metadata. The prerelease/build info should be stored in
// the separate prerelease and build_metadata fields instead.
string version = 2;
// Optional semver prerelease identifier.
//
// This field stores the prerelease portion of a semantic version separately
// from the core version number. For example, if the full version is "2.0.0-rc.1",
// the version field would contain "2.0.0" and prerelease would contain "rc.1".
//
// This separation ensures forward compatibility: older readers can parse the
// clean version field without errors, while newer readers can reconstruct the
// full semantic version by combining version, prerelease, and build_metadata.
//
// If absent, the version field is used as-is.
optional string prerelease = 3;
Contributor Author

the original proposal was to just use a single classifier string, but that makes it hard to leverage the semver parser, so I made it aligned with the semver spec

// Optional semver build metadata.
//
// This field stores the build metadata portion of a semantic version separately
// from the core version number. For example, if the full version is
// "2.0.0-rc.1+build.123", the version field would contain "2.0.0", prerelease
// would contain "rc.1", and build_metadata would contain "build.123".
//
// If absent, no build metadata is present.
optional string build_metadata = 4;
}

// The version of the writer that created this file.
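
The field comments above say a newer reader can reconstruct the full semantic version by combining version, prerelease, and build_metadata. A minimal sketch of that step, assuming the reader has the three fields as plain Python values; the function name `full_writer_version` is hypothetical, and treating an empty string the same as an absent field is an assumption of this sketch, not something the proto comments state.

```python
from typing import Optional


def full_writer_version(
    version: str,
    prerelease: Optional[str] = None,
    build_metadata: Optional[str] = None,
) -> str:
    """Rebuild the full version string from the three manifest fields.

    Hypothetical helper. None (and, by assumption, an empty string) means
    the field is absent.
    """
    full = version
    if prerelease:
        # Semver joins the prerelease to the core version with '-'.
        full += f"-{prerelease}"
    if build_metadata:
        # Semver joins build metadata with '+'.
        full += f"+{build_metadata}"
    return full
```

When both optional fields are absent the version string is returned as-is, which also covers manifests whose version field already holds a full string such as 2.0.0-rc.1.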
51 changes: 51 additions & 0 deletions python/python/tests/forward_compat/test_compat.py
@@ -6,6 +6,8 @@
# This file will be run on older versions of Lance to test that the
# current version of Lance can read the test data generated by datagen.py.

import shutil

import lance
import pyarrow as pa
import pyarrow.compute as pc
@@ -116,3 +118,52 @@ def test_list_indices_ignores_new_fts_index_version():
    indices = ds.list_indices()
    # the new index version should be ignored
    assert len(indices) == 0


@pytest.mark.forward
@pytest.mark.skipif(
    Version(lance.__version__) < Version("0.20.0"),
    reason="Version is too old to read index files stored with Lance 2.0 file format",
)
def test_write_scalar_index(tmp_path: str):
    path = get_path("scalar_index")
    # copy to tmp path to avoid modifying original
    shutil.copytree(path, tmp_path, dirs_exist_ok=True)

    ds = lance.dataset(tmp_path)
    data = pa.table(
        {
            "idx": pa.array([1000]),
            "btree": pa.array([1000]),
            "bitmap": pa.array([1000]),
            "label_list": pa.array([["label1000"]]),
            "ngram": pa.array(["word1000"]),
            "zonemap": pa.array([1000]),
            "bloomfilter": pa.array([1000]),
        }
    )
    ds.insert(data)
    # ds.optimize.optimize_indices()
    ds.optimize.compact_files()


@pytest.mark.forward
@pytest.mark.skipif(
    Version(lance.__version__) < Version("0.36.0"),
    reason="FTS token set format was introduced in 0.36.0",
)
def test_write_fts(tmp_path: str):
    path = get_path("fts_index")
    # copy to tmp path to avoid modifying original
    shutil.copytree(path, tmp_path, dirs_exist_ok=True)

    ds = lance.dataset(tmp_path)
    data = pa.table(
        {
            "idx": pa.array([1000]),
            "text": pa.array(["new document to index"]),
        }
    )
    ds.insert(data)
    # ds.optimize.optimize_indices()
    ds.optimize.compact_files()