perf: migrate to ManifestLocation, add e_tag #3592

Merged
wjones127 merged 8 commits into lance-format:main from wjones127:manifest-etag
Mar 28, 2025
Conversation

@wjones127 wjones127 commented Mar 24, 2025

  • Migrates all methods of CommitHandler to just use ManifestLocation.
    • Eliminates O(num_manifests) IOPS from cleanup_old_versions, since we no longer have to make a separate HEAD request to get the size of the file.
    • Eliminates O(num_manifests) IOPS from list_versions(), for the same reason.
  • Adds e_tag to ManifestLocation, so we can verify that we are loading the expected manifest. This eliminates the possibility that we are caching an old version of the manifest in cases where the dataset has been deleted and recreated at the same version number.
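
To illustrate the first bullet group, here is a minimal sketch (not Lance's actual code) of how a single LIST response can supply the size and e_tag for every manifest, so no per-file HEAD request is needed. `ListedObject`, `locations_from_listing`, the simplified `ManifestLocation` fields, and the `<version>.manifest` file naming are all assumptions made for illustration.

```rust
/// Hypothetical stand-in for one entry of an object-store LIST response.
#[derive(Debug, Clone)]
struct ListedObject {
    path: String,
    size: u64,
    e_tag: Option<String>,
}

/// Simplified, illustrative manifest location record (assumed fields).
#[derive(Debug, Clone, PartialEq)]
struct ManifestLocation {
    version: u64,
    path: String,
    size: Option<u64>,
    e_tag: Option<String>,
}

/// Parse the version from a path such as `_versions/3.manifest`.
/// The naming scheme here is an assumption for the sketch.
fn parse_version(path: &str) -> Option<u64> {
    let file_name = path.rsplit('/').next()?;
    file_name.strip_suffix(".manifest")?.parse().ok()
}

/// Build locations from one listing. Size and e_tag come straight from
/// the listing entries, so no extra request is issued per manifest.
fn locations_from_listing(listing: &[ListedObject]) -> Vec<ManifestLocation> {
    let mut locations: Vec<ManifestLocation> = listing
        .iter()
        .filter_map(|obj| {
            Some(ManifestLocation {
                version: parse_version(&obj.path)?,
                path: obj.path.clone(),
                size: Some(obj.size),
                e_tag: obj.e_tag.clone(),
            })
        })
        .collect();
    // Sort ascending by version so callers can take the last entry as latest.
    locations.sort_by_key(|location| location.version);
    locations
}
```

Entries whose names do not parse as a version are skipped rather than treated as errors; a real implementation may need stricter handling.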

@github-actions

ACTION NEEDED
Lance follows the Conventional Commits specification for release automation.

The PR title and description are used as the merge commit message. Please update your PR title and description to match the specification.

For details on the error please inspect the "PR Title Check" action.

@wjones127 wjones127 changed the title perf: migrate to ManifestLocation, add e_tag perf: migrate to ManifestLocation, add e_tag Mar 25, 2025
Comment thread rust/lance/src/dataset.rs
Comment on lines +5558 to +5589
#[tokio::test]
async fn test_replace_dataset() {
    let test_dir = tempdir().unwrap();
    let test_uri = test_dir.path().to_str().unwrap();

    let data = gen()
        .col("int", array::step::<Int32Type>())
        .into_batch_rows(RowCount::from(20))
        .unwrap();
    let data1 = data.slice(0, 10);
    let data2 = data.slice(10, 10);
    let mut ds = InsertBuilder::new(test_uri)
        .execute(vec![data1])
        .await
        .unwrap();

    ds.object_store().remove_dir_all(test_uri).await.unwrap();

    let ds2 = InsertBuilder::new(test_uri)
        .execute(vec![data2.clone()])
        .await
        .unwrap();

    ds.checkout_latest().await.unwrap();
    let roundtripped = ds.scan().try_into_batch().await.unwrap();
    assert_eq!(roundtripped, data2);

    ds.validate().await.unwrap();
    ds2.validate().await.unwrap();
    assert_eq!(ds.manifest.version, 1);
    assert_eq!(ds2.manifest.version, 1);
}
Contributor Author

This is the main test of interest: can we delete a dataset, recreate it, and then use checkout_latest() on an old handle to detect the recreated version?

@wjones127 wjones127 marked this pull request as ready for review March 25, 2025 23:26
codecov-commenter commented Mar 25, 2025

Codecov Report

Attention: Patch coverage is 81.06667% with 71 lines in your changes missing coverage. Please review.

Project coverage is 78.74%. Comparing base (852b155) to head (8eb55e2).
Report is 13 commits behind head on main.

Files with missing lines                               Patch %   Lines
...ust/lance-table/src/io/commit/external_manifest.rs  63.79%    19 Missing and 2 partials ⚠️
rust/lance-table/src/io/commit.rs                      83.33%    8 Missing and 8 partials ⚠️
rust/lance-io/src/object_writer.rs                     85.71%    3 Missing and 3 partials ⚠️
rust/lance/src/dataset.rs                              93.33%    2 Missing and 4 partials ⚠️
rust/lance/src/io/commit.rs                            57.14%    6 Missing ⚠️
rust/lance/src/index.rs                                37.50%    2 Missing and 3 partials ⚠️
rust/lance/src/dataset/refs.rs                         50.00%    1 Missing and 3 partials ⚠️
rust/lance/src/dataset/schema_evolution.rs             66.66%    0 Missing and 3 partials ⚠️
rust/lance/src/dataset/cleanup.rs                      75.00%    0 Missing and 2 partials ⚠️
rust/lance/src/dataset/optimize.rs                     75.00%    0 Missing and 1 partial ⚠️
... and 1 more
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #3592      +/-   ##
==========================================
+ Coverage   78.69%   78.74%   +0.04%     
==========================================
  Files         258      259       +1     
  Lines       96813    97030     +217     
  Branches    96813    97030     +217     
==========================================
+ Hits        76185    76403     +218     
+ Misses      17560    17552       -8     
- Partials     3068     3075       +7     
Flag Coverage Δ
unittests 78.74% <81.06%> (+0.04%) ⬆️

Member

@westonpace westonpace left a comment

Awesome work 👍

- object_writer.shutdown().await.unwrap();
+ let res = object_writer.shutdown().await.unwrap();
assert_eq!(res.size, 256 * 3);
}
Member

Is this test big enough to trigger multiple write parts?

// Use an ETag scheme based on that used by many popular HTTP servers
// <https://httpd.apache.org/docs/2.2/mod/core.html#fileetag>
// <https://stackoverflow.com/questions/47512043/how-etags-are-generated-and-configured>
format!("{inode:x}-{mtime:x}-{size:x}")
Member

Cool
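
For readers unfamiliar with this ETag scheme, the following is a minimal sketch of how the three components in the quoted `format!` call could be derived from local file metadata on a Unix system. The helper names `format_e_tag` and `e_tag_for`, and the choice of microseconds for `mtime`, are assumptions for illustration and are not taken from this PR.

```rust
use std::fs::Metadata;
use std::time::UNIX_EPOCH;

/// Join the components as lowercase hex, as in the quoted snippet:
/// `{inode:x}-{mtime:x}-{size:x}`.
fn format_e_tag(inode: u64, mtime: u128, size: u64) -> String {
    format!("{inode:x}-{mtime:x}-{size:x}")
}

/// Derive an ETag-like string from file metadata (Unix only).
/// The modification time is expressed in microseconds since the epoch,
/// which is an assumption made for this sketch.
#[cfg(unix)]
fn e_tag_for(metadata: &Metadata) -> std::io::Result<String> {
    use std::os::unix::fs::MetadataExt;

    let mtime = metadata
        .modified()?
        .duration_since(UNIX_EPOCH)
        .map(|elapsed| elapsed.as_micros())
        // Fall back to 0 for timestamps before the epoch.
        .unwrap_or(0);
    Ok(format_e_tag(metadata.ino(), mtime, metadata.len()))
}
```

Because the tag combines inode, modification time, and size, deleting a file and writing a new one at the same path will usually produce a different tag, although this is not a guarantee.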

_ => None,
});

let e_tag = item.get("e_tag").and_then(|attr| attr.as_s().ok().cloned());
Member

Does dynamodb need any kind of migration for adding a new column?

Contributor Author

Only if you are creating new key columns / indices. For general columns, it's a schemaless document database.
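
As a rough sketch of why no migration is needed: items written before this change simply lack the attribute, and reading it as optional yields `None`. The `Attr` enum below is a hypothetical, simplified stand-in for the AWS SDK's attribute value type (its `as_s` returns an `Option` here, unlike the `Result` implied by the quoted `.ok()` call), and `read_e_tag` is an illustrative name.

```rust
use std::collections::HashMap;

/// Hypothetical, minimal stand-in for a DynamoDB attribute value.
enum Attr {
    /// String attribute.
    S(String),
    /// Number attribute (DynamoDB transmits numbers as strings).
    N(String),
}

impl Attr {
    /// Return the inner string only if this is a string attribute.
    fn as_s(&self) -> Option<&String> {
        match self {
            Attr::S(value) => Some(value),
            _ => None,
        }
    }
}

/// Read `e_tag` from an item. A missing attribute, or one of the
/// wrong type, produces `None` instead of an error.
fn read_e_tag(item: &HashMap<String, Attr>) -> Option<String> {
    item.get("e_tag").and_then(|attr| attr.as_s().cloned())
}
```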


// On S3, the etag can change if originally was MultipartUpload and later was Copy
// https://docs.aws.amazon.com/AmazonS3/latest/API/API_Object.html#AmazonS3-Type-Object-ETag
// We generally only do MultipartUpload for > 5MB files, so we can skip this check
Member

Suggested change
- // We generally only do MultipartUpload for > 5MB files, so we can skip this check
+ // We only do MultipartUpload for > 5MB files, so we can skip this check

I hope not generally if we're going to skip the step 😄

Comment thread rust/lance/src/dataset.rs
Comment on lines +348 to +353
self.manifest.version == location.version
&& location.e_tag.as_ref().is_some_and(|e_tag| {
self.manifest_e_tag
.as_ref()
.is_some_and(|current_e_tag| e_tag == current_e_tag)
})
Member

So if this version or the previous version does not have the e_tag, then we have to fall back to the previous behavior and assume it isn't already checked out?

Contributor Author

Yes. I wanted to avoid a situation where we get into a reload loop because the e_tag keeps coming back as None.
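
The comparison discussed in this thread can be sketched in isolation as follows. The struct and method names (`DatasetHandle`, `already_checked_out`) and the reduced field sets are hypothetical. The logic mirrors the quoted condition: the versions must match and both e_tags must be present and equal.

```rust
/// Illustrative subset of a manifest location (assumed fields).
struct ManifestLocation {
    version: u64,
    e_tag: Option<String>,
}

/// Hypothetical stand-in for the state held by an open dataset handle.
struct DatasetHandle {
    version: u64,
    manifest_e_tag: Option<String>,
}

impl DatasetHandle {
    /// True only when the version matches and both sides carry the same
    /// e_tag. If either e_tag is missing, this returns false, so the
    /// caller treats the manifest as not yet checked out.
    fn already_checked_out(&self, location: &ManifestLocation) -> bool {
        self.version == location.version
            && match (&self.manifest_e_tag, &location.e_tag) {
                (Some(current), Some(latest)) => current == latest,
                _ => false,
            }
    }
}
```

With this shape, a recreated dataset at the same version number is detected because its e_tag differs from the one stored on the old handle.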

Comment on lines +183 to +184
let manifest =
read_manifest(&self.dataset.object_store, &location.path, location.size).await?;
Member

Yay

@wjones127 wjones127 merged commit 245a745 into lance-format:main Mar 28, 2025
26 of 27 checks passed