
feat: retry merge_insert when possible#3614

Merged
wjones127 merged 13 commits into lance-format:main from wjones127:merge-insert-conflict
Apr 24, 2025

Conversation

@wjones127
Contributor

@wjones127 wjones127 commented Mar 27, 2025

Part of #3397

  • Pulled the Backoff utility out into a separate struct.
    • Set the default backoff schedule to 50ms, 100ms, 200ms, 400ms (previously it started at 100ms).
  • Changed Transaction::conflicts_with() to return an enum that differentiates retryable from non-retryable conflicts.
  • Made merge_insert retry on retryable conflicts up to 10 times.
    • After 10 attempts, it now returns a TooMuchContention error.
  • Added a spill utility that allows replaying the same stream multiple times.
  • Reimplemented background_iterator so that it preserves size_hint().
  • Simplified the error message of CommitConflict so it's easier to see which operations conflicted.
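The retry-with-backoff behavior described above can be sketched roughly as follows. This is illustrative, not the actual lance-core API: the `Backoff` and `next_backoff` names and the exact schedule details are assumptions (real backoff utilities often also add jitter), but the 50ms starting point, doubling, and 10-attempt budget match the description in this PR.

```rust
use std::time::Duration;

// Hypothetical sketch of the backoff schedule: exponential doubling starting
// at 50ms (50, 100, 200, 400, ...), with a 10-attempt budget.
pub struct Backoff {
    base_ms: u64,
    attempt: u32,
    max_attempts: u32,
}

impl Backoff {
    pub fn new() -> Self {
        Self { base_ms: 50, attempt: 0, max_attempts: 10 }
    }

    /// Returns the next delay, or None once the attempt budget is exhausted
    /// (the point at which merge_insert would surface a TooMuchContention error).
    pub fn next_backoff(&mut self) -> Option<Duration> {
        if self.attempt >= self.max_attempts {
            return None;
        }
        let delay = Duration::from_millis(self.base_ms << self.attempt);
        self.attempt += 1;
        Some(delay)
    }
}

fn main() {
    let mut b = Backoff::new();
    assert_eq!(b.next_backoff(), Some(Duration::from_millis(50)));
    assert_eq!(b.next_backoff(), Some(Duration::from_millis(100)));
    assert_eq!(b.next_backoff(), Some(Duration::from_millis(200)));
    assert_eq!(b.next_backoff(), Some(Duration::from_millis(400)));
}
```

A caller would loop: attempt the commit, and on a retryable conflict sleep for `next_backoff()`, giving up with an error once it returns `None`.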

@github-actions github-actions Bot added enhancement New feature or request python labels Mar 27, 2025
@wjones127 wjones127 force-pushed the merge-insert-conflict branch 2 times, most recently from 20581e8 to aa9fe97 Compare April 21, 2025 21:52
@codecov-commenter

codecov-commenter commented Apr 21, 2025

Codecov Report

Attention: Patch coverage is 88.76128% with 137 lines in your changes missing coverage. Please review.

Project coverage is 78.57%. Comparing base (f49c049) to head (8c44511).
Report is 1 commit behind head on main.

Files with missing lines Patch % Lines
rust/lance/src/dataset/transaction.rs 84.41% 38 Missing and 3 partials ⚠️
rust/lance-datafusion/src/spill.rs 94.17% 14 Missing and 16 partials ⚠️
rust/lance/src/dataset/write.rs 73.01% 10 Missing and 7 partials ⚠️
rust/lance/src/dataset/write/merge_insert.rs 93.65% 9 Missing and 4 partials ⚠️
rust/lance-core/src/utils/backoff.rs 75.51% 12 Missing ⚠️
.../lance-datafusion/src/utils/background_iterator.rs 76.92% 12 Missing ⚠️
rust/lance-arrow/src/memory.rs 80.00% 7 Missing and 2 partials ⚠️
rust/lance/src/io/commit.rs 87.50% 3 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #3614      +/-   ##
==========================================
+ Coverage   78.50%   78.57%   +0.07%     
==========================================
  Files         268      272       +4     
  Lines      100735   101868    +1133     
  Branches   100735   101868    +1133     
==========================================
+ Hits        79078    80041     +963     
- Misses      18538    18670     +132     
- Partials     3119     3157      +38     
Flag Coverage Δ
unittests 78.57% <88.76%> (+0.07%) ⬆️

@wjones127 wjones127 changed the title feat: retry merge_insert when possible feat: retry merge_insert when possible Apr 21, 2025
@wjones127 wjones127 marked this pull request as ready for review April 21, 2025 22:46
},
}

impl std::fmt::Display for Operation {
Contributor

nit: we could get this for free from strum_macros but would need to add an explicit dependency

Comment thread rust/lance/src/dataset/write.rs Outdated
)) as SendableRecordBatchStream
})))
} else {
// TODO: allow buffering up to 100MB in memory before spilling to disk.
Contributor

Does this not do any in-memory buffering at all right now? 100MB seems small for some server-side use-cases, and we should make it configurable.

Contributor Author

It doesn't do any in-memory buffering, no.

Contributor Author

100MB seems small for some server-side use-cases

Why do you say that? What's the downside of spilling to disk for over 100MB of data? The operations will still work, and should be plenty fast.

Contributor

I mean, we may want to buffer more than 100MB in memory, and we will want some way to configure it, possibly as a global pool across multiple datasets.

Contributor Author

we may want to buffer more than 100MB in memory

I was wondering more if you think it's dangerous to ship as-is. The only times I can think where you might want to change this seem to be edge cases:

  1. You are writing 1GB of data, but the latency hit from writing to disk is meaningful. If the final destination is object storage, and your cache is an SSD or even just an HDD, I don't think this is true. It could be true if the final destination is also an SSD.
  2. You are writing 10GB of data, and you have 10GB of memory available but not 10GB of disk available. I think 99% of the time a computer will have more disk than memory available.

Are there others you have in mind?

.ok()
.expect_ok()??;

let tmp_path = tmp_dir.path().join("spill.arrows");
Contributor

Where/when does this get cleaned up? We should discuss how we will respond to disk full errors

Contributor Author

When the SpillStreamIter is dropped, the inner tempfile::TempDir is dropped, at which point the temporary directory will be deleted.
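The drop-based cleanup described here can be sketched std-only. `TempDirGuard` below is a hypothetical stand-in for `tempfile::TempDir`: a guard that deletes its directory (and any spill file inside it) when dropped.

```rust
use std::fs;
use std::path::PathBuf;

// Std-only stand-in for tempfile::TempDir: an RAII guard that deletes its
// directory when dropped, so a spill file inside it disappears as soon as
// the owning stream is dropped. All names here are illustrative.
struct TempDirGuard {
    path: PathBuf,
}

impl TempDirGuard {
    fn new(path: PathBuf) -> std::io::Result<Self> {
        fs::create_dir_all(&path)?;
        Ok(Self { path })
    }
}

impl Drop for TempDirGuard {
    fn drop(&mut self) {
        // Best-effort removal, mirroring TempDir's Drop behavior.
        let _ = fs::remove_dir_all(&self.path);
    }
}

fn main() -> std::io::Result<()> {
    let dir = std::env::temp_dir().join("spill_demo_dir");
    {
        let guard = TempDirGuard::new(dir.clone())?;
        fs::write(guard.path.join("spill.arrows"), b"batches")?;
        assert!(dir.join("spill.arrows").exists());
    } // guard dropped here; the directory and its contents are removed
    assert!(!dir.exists());
    Ok(())
}
```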

Comment thread rust/lance/src/dataset/write/merge_insert.rs Outdated
Contributor

@rpgreen rpgreen left a comment

Nice! I think we will want to avoid spilling to disk up to some threshold that we can configure.

wip

wip: utilities

wip: stream

finish spill

handle errors through spill

test

background iter preserves size_hint

test more

clippy

fix lifetime issues

add missing file

finish commit stuff

fix transaction conflict handling

lint
@wjones127 wjones127 force-pushed the merge-insert-conflict branch from bea7daa to 112ae18 Compare April 23, 2025 18:59
Member

@westonpace westonpace left a comment

A bunch of nits but overall this looks pretty good. Some nice helper utilities in here as well. Thanks!

/// * This counts the **total** size of the buffers, even if the array is a slice.
/// Round-tripped data may use less memory because of this.
#[derive(Default)]
pub struct MemoryAccumulator {
Member

What's the motivation to use this instead of get_array_memory_size? Is it because you are worried about shared buffers?

Contributor Author

Yeah, shared / sliced buffers. Worried we or the user might have something that sliced the input data into smaller batches. Naively using get_array_memory_size will double count in those cases, as SaintBacchus found while attempting this PR: #3435 (comment)
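The double-counting problem with shared buffers can be sketched std-only. All names here are hypothetical, and `Arc<Vec<u8>>` stands in for Arrow's shared buffers: summing per-array sizes naively counts a shared allocation once per reference, while deduplicating by buffer address counts it once, which is the accounting a `MemoryAccumulator`-style helper needs.

```rust
use std::collections::HashSet;
use std::sync::Arc;

// Naive accounting: each "array" reports the full size of its backing buffer,
// so two slices of one buffer are counted twice.
fn naive_size(arrays: &[Arc<Vec<u8>>]) -> usize {
    arrays.iter().map(|buf| buf.len()).sum()
}

// Deduplicated accounting: track buffer addresses so a shared allocation is
// only counted the first time it is seen.
fn deduped_size(arrays: &[Arc<Vec<u8>>]) -> usize {
    let mut seen: HashSet<*const u8> = HashSet::new();
    arrays
        .iter()
        .filter(|buf| seen.insert(buf.as_ptr()))
        .map(|buf| buf.len())
        .sum()
}

fn main() {
    let buffer = Arc::new(vec![0u8; 1024]);
    // Two batches sliced from the same input share one underlying buffer.
    let slices = vec![buffer.clone(), buffer.clone()];
    assert_eq!(naive_size(&slices), 2048); // double counts the shared buffer
    assert_eq!(deduped_size(&slices), 1024); // counts it once
}
```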

Comment on lines +73 to +74
///
/// If the spill has been dropped, an error will be returned.
Member

Dropped at all? Or dropped without finish being called? Wouldn't the writer normally call finish and then drop the write end?

Contributor Author

Dropped at all. The state (whether or not it's spilled / finished writing) is held in a channel that becomes inaccessible when dropped. We could probably change this in a future version if we want, but for now you have to keep the sender alive even after calling finish().

Member

That doesn't seem right. Though I just noticed you are using tokio::sync::watch::channel? Why watch? Won't that potentially drop data? Why not mpsc channel?

Member

Oh...I see...this is just for the status. Ok, I was confusing myself and thought you were sending batches over the channel. Now it makes sense.

Comment thread rust/lance-datafusion/src/spill.rs Outdated
Comment on lines +20 to +35
/// Start a spill of Arrow data to a temporary file. The file is an Arrow IPC
/// stream file.
///
/// Up to `memory_limit` bytes of data can be buffered in memory before a spill
/// is created. If the memory limit is never reached before [`SpillSender::finish()`]
/// is called, then the data will simply be kept in memory and no spill will be
/// created.
///
/// The [`SpillSender`] allows you to write batches to the spill.
///
/// The [`SpillReceiver`] can open a [`SendableRecordBatchStream`] that reads
/// batches from the spill. This can be opened before, during, or after batches
/// have been written to the spill.
///
/// Once [`SpillSender`] is dropped, the temporary file is deleted. This will
/// cause the [`SpillReceiver`] to return an error if it is still open.
Member

Mention in here somewhere that path is the path the data will be written to?
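The memory-limit behavior the docstring describes can be sketched std-only. This is not the actual spill implementation (which writes an Arrow IPC stream file and supports a concurrent `SpillReceiver`); byte vectors stand in for record batches, and all names are illustrative.

```rust
use std::fs::File;
use std::io::Write;
use std::path::{Path, PathBuf};

// Batches stay in memory until `memory_limit` bytes accumulate; only then is
// a file created and everything buffered so far flushed to it. If the limit
// is never reached, no spill file exists at all.
enum SpillState {
    InMemory(Vec<Vec<u8>>),
    OnDisk(File),
}

struct SpillSender {
    state: SpillState,
    bytes_buffered: usize,
    memory_limit: usize,
    path: PathBuf,
}

impl SpillSender {
    fn new(path: &Path, memory_limit: usize) -> Self {
        Self {
            state: SpillState::InMemory(Vec::new()),
            bytes_buffered: 0,
            memory_limit,
            path: path.to_path_buf(),
        }
    }

    fn write(&mut self, batch: Vec<u8>) -> std::io::Result<()> {
        self.bytes_buffered += batch.len();
        match &mut self.state {
            SpillState::InMemory(batches) => {
                batches.push(batch);
                if self.bytes_buffered > self.memory_limit {
                    // Limit exceeded: flush all buffered batches to disk.
                    let buffered = std::mem::take(batches);
                    let mut file = File::create(&self.path)?;
                    for b in &buffered {
                        file.write_all(b)?;
                    }
                    self.state = SpillState::OnDisk(file);
                }
            }
            SpillState::OnDisk(file) => file.write_all(&batch)?,
        }
        Ok(())
    }

    fn is_spilled(&self) -> bool {
        matches!(self.state, SpillState::OnDisk(_))
    }
}

fn main() -> std::io::Result<()> {
    let path = std::env::temp_dir().join("spill_demo.bin");
    let mut spill = SpillSender::new(&path, 1024);
    spill.write(vec![0u8; 512])?;
    assert!(!spill.is_spilled()); // under the limit: still buffered in memory
    spill.write(vec![0u8; 1024])?;
    assert!(spill.is_spilled()); // limit exceeded: flushed to disk
    Ok(())
}
```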

Comment on lines +360 to +372
self.state = SpillState::Spilling {
writer,
batches_written,
};
if let SpillState::Spilling {
writer,
batches_written,
} = &mut self.state
{
(writer, batches_written)
} else {
unreachable!()
}
Member

This is weird to set the enum and then immediately turn around and if let it but I can't think of a better way 😆

Comment thread rust/lance-datafusion/src/spill.rs Outdated
Comment thread rust/lance/src/dataset/transaction.rs Outdated
Comment thread rust/lance/src/dataset/transaction.rs Outdated
Comment on lines +600 to +611
                // TODO(rmeng): check that the new indices isn't on the column being replaced
-               true
+               NotCompatible
            }
            Operation::Rewrite { .. } => {
                // TODO(rmeng): check that the fragments being replaced are not part of the groups
-               true
+               NotCompatible
            }
            Operation::DataReplacement { .. } => {
                // TODO(rmeng): check cell conflicts
-               true
+               NotCompatible
            }
-           _ => true,
+           _ => NotCompatible,
Member

These are all NotCompatible because this is still kind of half-finished right? It seems DataReplacement could be retried in many cases?

Contributor Author

Yeah I figured I'll rebase #3631 on this and finish it there.

Comment thread rust/lance/src/dataset/write.rs
// this struct. When this struct is dropped, the Drop implementation of
// tempfile::TempDir will delete the temp dir.
#[allow(dead_code)] // Exists to keep the temp dir alive
tmp_dir: tempfile::TempDir,
Member

We may want this to be configurable at some point in the future.

Contributor Author

Yeah I figure once we have a DataFusion SessionContext in the Session, I can use the DataFusion DiskManager here.

Comment thread rust/lance-core/src/error.rs Outdated
Member

@westonpace westonpace left a comment

Not a big deal for this PR but I think the utilities in spill are not what I would normally expect as "spill". I think of "spill" as something that is written to and then read from. We wouldn't write batches we've already read. It's a temporary structure meant for one-time execution.

I think a more accurate description might be "temporary table". Though maybe that is a distinction without merit. We are first writing the batches to a temporary table and then playing back from that temporary table to execute the operation.

Another analogy could be SQL server's spool operator (an operator that uses a temporary table to store data and then that table will be read multiple times throughout the execution of a plan) which is used for a slightly different purpose (to share the output of a node with multiple readers) but is implemented in much the same way.

Comment on lines +140 to +145
let reader = AsyncStreamReader::open(spill_path.clone()).await?;
// Skip batches we've already read.
for _ in 0..self.batches_read {
reader.read().await?;
}
self.state = SpillReaderState::Reader { reader };
Member

Ok...so if we've read batches 0, 1, 2, 3, 4...

And then the writer decides to spill. It will still write batches 0, 1, 2, 3, 4?

This makes sense (as this will potentially need to be replayed multiple times) but it was confusing.

Contributor Author

I'll add a comment explaining that.
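The replay behavior being discussed can be sketched std-only: the spill file holds *all* batches (including ones the reader already consumed from memory before the writer spilled), so a reopened reader must fast-forward past the already-delivered ones. Plain strings stand in for record batches, and the function name is illustrative.

```rust
// After reopening a spill, skip the first `batches_read` batches so the
// stream resumes exactly where it left off; the spill intentionally contains
// everything so it can also be replayed from the start on a later retry.
fn resume_after_spill<'a>(batches_read: usize, spilled: &[&'a str]) -> Vec<&'a str> {
    spilled.iter().skip(batches_read).copied().collect()
}

fn main() {
    let spill = ["batch0", "batch1", "batch2", "batch3", "batch4"];
    // The reader consumed batches 0-2 from memory before the writer spilled.
    assert_eq!(resume_after_spill(3, &spill), vec!["batch3", "batch4"]);
}
```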

@wjones127
Contributor Author

Not a big deal for this PR but I think the utilities in spill are not what I would normally expect as "spill".

That's a fair point. I think I can rename the function create_replay_spill instead.

@wjones127 wjones127 merged commit e601072 into lance-format:main Apr 24, 2025
26 of 27 checks passed
@wjones127 wjones127 deleted the merge-insert-conflict branch April 24, 2025 17:24

Labels

enhancement New feature or request python
