feat: retry merge_insert when possible #3614
Conversation
Force-pushed 20581e8 to aa9fe97 (compare)
Codecov Report. Attention: Patch coverage is …
Additional details and impacted files@@ Coverage Diff @@
## main #3614 +/- ##
==========================================
+ Coverage 78.50% 78.57% +0.07%
==========================================
Files 268 272 +4
Lines 100735 101868 +1133
Branches 100735 101868 +1133
==========================================
+ Hits 79078 80041 +963
- Misses 18538 18670 +132
- Partials 3119 3157 +38
Flags with carried forward coverage won't be shown.
| }, | ||
| } | ||
| impl std::fmt::Display for Operation { |
nit: we could get this for free from strum_macros but would need to add an explicit dependency
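To illustrate the nit above: what strum_macros would derive for free is essentially a hand-written Display impl like the one below. The Operation variants here are a hypothetical subset for illustration, not the actual enum from the diff.

```rust
use std::fmt;

// Hypothetical subset of the Operation enum, for illustration only.
enum Operation {
    Append,
    Overwrite,
    Rewrite,
}

// Roughly what `#[derive(strum_macros::Display)]` would generate,
// written out by hand here to avoid the extra dependency.
impl fmt::Display for Operation {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        let name = match self {
            Operation::Append => "Append",
            Operation::Overwrite => "Overwrite",
            Operation::Rewrite => "Rewrite",
        };
        f.write_str(name)
    }
}

fn main() {
    println!("{}", Operation::Append); // prints "Append"
}
```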
| )) as SendableRecordBatchStream | ||
| }))) | ||
| } else { | ||
| // TODO: allow buffering up to 100MB in memory before spilling to disk. |
Does this not do any in-memory buffering at all right now? 100MB seems small for some server-side use-cases, and we should make it configurable.
It doesn't do any in-memory buffering, no.
100MB seems small for some server-side use-cases
Why do you say that? What's the downside of spilling to disk for over 100MB of data? The operations will still work, and should be plenty fast.
I mean, we may want to buffer more than 100MB in memory, and we will want some way to configure it, possibly as a global pool across multiple datasets
we may want to buffer more than 100MB in memory
I was wondering more if you think it's dangerous to ship as-is. The only times I can think of where you might want to change this seem to be edge cases:
- You are writing 1GB of data, but the latency hit from writing to disk is meaningful. If the final destination is object storage and your cache is an SSD, or even just an HDD, I don't think this is true. It could be true if the final destination is also an SSD.
- You are writing 10GB of data, and you have 10GB of memory available but not 10GB of disk available. I think 99% of the time a computer will have more disk than memory available.
Are there others you have in mind?
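The buffering policy being discussed (the TODO in the diff) can be sketched as a byte counter with a configurable threshold: batches stay in memory until the limit is crossed, then a spill starts. The Buffer type and the 100MB default below are illustrative placeholders, not the actual Lance API.

```rust
// Default limit under discussion: buffer up to 100 MB before spilling.
const DEFAULT_MEMORY_LIMIT: usize = 100 * 1024 * 1024;

struct Buffer {
    limit: usize,
    buffered_bytes: usize,
    spilled: bool,
}

impl Buffer {
    fn new(limit: usize) -> Self {
        Self { limit, buffered_bytes: 0, spilled: false }
    }

    // Returns true if this push crossed the limit and a spill should start.
    fn push(&mut self, batch_size: usize) -> bool {
        self.buffered_bytes += batch_size;
        if !self.spilled && self.buffered_bytes > self.limit {
            self.spilled = true;
            return true;
        }
        false
    }
}

fn main() {
    let mut buf = Buffer::new(DEFAULT_MEMORY_LIMIT);
    assert!(!buf.push(60 * 1024 * 1024)); // 60 MB total: stays in memory
    assert!(buf.push(60 * 1024 * 1024));  // 120 MB total: spill triggered
}
```

Making `limit` a constructor argument is what would let it be configured per server, or drawn from a global pool as suggested above.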
| .ok() | ||
| .expect_ok()??; | ||
| let tmp_path = tmp_dir.path().join("spill.arrows"); |
Where/when does this get cleaned up? We should discuss how we will respond to disk-full errors.
When the SpillStreamIter is dropped, the inner tempfile::TempDir is dropped, at which point the temporary directory will be deleted.
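The cleanup-on-drop behavior described here is RAII: tempfile::TempDir deletes its directory in its Drop impl. A minimal std-only stand-in (TempDirGuard is a toy, not the tempfile API) shows the same lifecycle:

```rust
use std::fs;
use std::path::PathBuf;

// Toy stand-in for tempfile::TempDir: when the guard is dropped, the
// directory (and the spill file inside it) is removed.
struct TempDirGuard {
    path: PathBuf,
}

impl TempDirGuard {
    fn new(path: PathBuf) -> std::io::Result<Self> {
        fs::create_dir_all(&path)?;
        Ok(Self { path })
    }
}

impl Drop for TempDirGuard {
    fn drop(&mut self) {
        // Ignore errors on cleanup, as tempfile does by default.
        let _ = fs::remove_dir_all(&self.path);
    }
}

fn main() -> std::io::Result<()> {
    let dir = std::env::temp_dir().join("spill_demo_dir");
    {
        let guard = TempDirGuard::new(dir.clone())?;
        fs::write(guard.path.join("spill.arrows"), b"data")?;
        assert!(dir.exists());
    } // guard dropped here: directory deleted
    assert!(!dir.exists());
    Ok(())
}
```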
rpgreen
left a comment
Nice! I think we will want to avoid spilling to disk up to some threshold that we can configure.
Commits: wip; wip: utilities; wip: stream; finish spill; handle errors through spill; test background iter preserves size_hint; test more; clippy; fix lifetime issues; add missing file; finish commit stuff; fix transaction conflict handling; lint
Force-pushed bea7daa to 112ae18 (compare)
westonpace
left a comment
A bunch of nits but overall this looks pretty good. Some nice helper utilities in here as well. Thanks!
| /// * This counts the **total** size of the buffers, even if the array is a slice. | ||
| /// Round-tripped data may use less memory because of this. | ||
| #[derive(Default)] | ||
| pub struct MemoryAccumulator { |
What's the motivation to use this instead of get_array_memory_size? Is it because you are worried about shared buffers?
Yeah, shared / sliced buffers. Worried we or the user might have something that sliced the input data into smaller batches. Naively using get_array_memory_size will double count in those cases, as SaintBacchus found while attempting this PR: #3435 (comment)
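The double-counting problem can be shown without Arrow at all: model each sliced batch as a handle to a shared Arc'd buffer. A naive sum over per-array sizes counts the shared buffer once per slice, while deduplicating by buffer pointer (the MemoryAccumulator-style approach) counts it once. This is an illustration of the idea, not the Lance implementation.

```rust
use std::collections::HashSet;
use std::sync::Arc;

// Naive accounting: sums each slice's buffer size independently,
// so a buffer shared by two slices is counted twice.
fn naive_total(slices: &[Arc<Vec<u8>>]) -> usize {
    slices.iter().map(|b| b.len()).sum()
}

// Deduplicated accounting: each distinct underlying buffer
// (identified by pointer) is counted exactly once.
fn dedup_total(slices: &[Arc<Vec<u8>>]) -> usize {
    let mut seen = HashSet::new();
    let mut total = 0;
    for b in slices {
        if seen.insert(Arc::as_ptr(b)) {
            total += b.len();
        }
    }
    total
}

fn main() {
    let buffer = Arc::new(vec![0u8; 1024]);
    // Two batches "sliced" from the same underlying buffer.
    let slices = vec![buffer.clone(), buffer.clone()];
    assert_eq!(naive_total(&slices), 2048); // double counted
    assert_eq!(dedup_total(&slices), 1024); // counted once
}
```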
| /// | ||
| /// If the spill has been dropped, an error will be returned. |
Dropped at all? Or dropped without finish being called? Wouldn't the writer normally call finish and then drop the write end?
Dropped at all. The state (whether or not it's spilled / finish writing) is held in a channel, that becomes inaccessible when dropped. We could probably change this in a future version if we want, but for now you have to keep the sender alive even after calling finish().
That doesn't seem right. Though I just noticed you are using tokio::sync::watch::channel? Why watch? Won't that potentially drop data? Why not mpsc channel?
Oh...I see...this is just for the status. Ok, I was confusing myself and thought you were sending batches over the channel. Now it makes sense.
| /// Start a spill of Arrow data to a temporary file. The file is an Arrow IPC | ||
| /// stream file. | ||
| /// | ||
| /// Up to `memory_limit` bytes of data can be buffered in memory before a spill | ||
| /// is created. If the memory limit is never reached before [`SpillSender::finish()`] | ||
| /// is called, then the data will simply be kept in memory and no spill will be | ||
| /// created. | ||
| /// | ||
| /// The [`SpillSender`] allows you to write batches to the spill. | ||
| /// | ||
| /// The [`SpillReceiver`] can open a [`SendableRecordBatchStream`] that reads | ||
| /// batches from the spill. This can be opened before, during, or after batches | ||
| /// have been written to the spill. | ||
| /// | ||
| /// Once [`SpillSender`] is dropped, the temporary file is deleted. This will | ||
| /// cause the [`SpillReceiver`] to return an error if it is still open. |
Mention in here somewhere that path is the path the data will be written to?
| self.state = SpillState::Spilling { | ||
| writer, | ||
| batches_written, | ||
| }; | ||
| if let SpillState::Spilling { | ||
| writer, | ||
| batches_written, | ||
| } = &mut self.state | ||
| { | ||
| (writer, batches_written) | ||
| } else { | ||
| unreachable!() | ||
| } |
This is weird to set the enum and then immediately turn around and if let it but I can't think of a better way 😆
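The pattern under discussion reproduces in miniature like this (toy State/Spill types, not the actual spill code): values are moved into the enum variant, then immediately borrowed back out with an if-let. The unreachable!() arm is safe only because the state was set one line earlier.

```rust
enum State {
    Idle,
    Spilling { batches_written: usize },
}

struct Spill {
    state: State,
}

impl Spill {
    // Move into the new state, then borrow the fields back out.
    // The `else` arm can never run: we just set the state above.
    fn start_spilling(&mut self, batches_written: usize) -> &mut usize {
        self.state = State::Spilling { batches_written };
        if let State::Spilling { batches_written } = &mut self.state {
            batches_written
        } else {
            unreachable!()
        }
    }
}

fn main() {
    let mut spill = Spill { state: State::Idle };
    let counter = spill.start_spilling(0);
    *counter += 1;
    if let State::Spilling { batches_written } = &spill.state {
        assert_eq!(*batches_written, 1);
    }
}
```

The borrow checker forces this shape: you cannot hold a mutable borrow of the old state while constructing and assigning the new one, so the re-match after assignment is the idiomatic workaround.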
| // TODO(rmeng): check that the new indices isn't on the column being replaced | ||
| true | ||
| NotCompatible | ||
| } | ||
| Operation::Rewrite { .. } => { | ||
| // TODO(rmeng): check that the fragments being replaced are not part of the groups | ||
| true | ||
| NotCompatible | ||
| } | ||
| Operation::DataReplacement { .. } => { | ||
| // TODO(rmeng): check cell conflicts | ||
| true | ||
| NotCompatible | ||
| } | ||
| _ => true, | ||
| _ => NotCompatible, |
These are all NotCompatible because this is still kind of half-finished right? It seems DataReplacement could be retried in many cases?
Yeah I figured I'll rebase #3631 on this and finish it there.
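The shape of the change in this hunk is replacing a boolean conflict check with an enum so callers can distinguish retryable from fatal conflicts. A simplified sketch (placeholder Operation variants and classifications, not the real conflict rules):

```rust
#[derive(Debug, PartialEq)]
enum ConflictResult {
    Compatible,
    Retryable,
    NotCompatible,
}

enum Operation {
    Append,
    Rewrite,
    DataReplacement,
}

// Toy classifier: which concurrent operations allow a retry.
fn conflicts_with(other: &Operation) -> ConflictResult {
    use ConflictResult::*;
    match other {
        // Placeholder: an append never touches existing rows,
        // so retrying against it can succeed.
        Operation::Append => Retryable,
        // Conservative default until finer-grained checks
        // (like the DataReplacement TODO above) are implemented.
        Operation::Rewrite | Operation::DataReplacement => NotCompatible,
    }
}

fn main() {
    assert_eq!(conflicts_with(&Operation::Append), ConflictResult::Retryable);
    assert_eq!(
        conflicts_with(&Operation::Rewrite),
        ConflictResult::NotCompatible
    );
}
```

The value of the enum over a bool is that the retry loop can match on Retryable specifically, while NotCompatible fails fast instead of burning through retry attempts.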
| // this struct. When this struct is dropped, the Drop implementation of | ||
| // tempfile::TempDir will delete the temp dir. | ||
| #[allow(dead_code)] // Exists to keep the temp dir alive | ||
| tmp_dir: tempfile::TempDir, |
We may want this to be configurable at some point in the future.
Yeah I figure once we have a DataFusion SessionContext in the Session, I can use the DataFusion DiskManager here.
westonpace
left a comment
Not a big deal for this PR but I think the utilities in spill are not what I would normally expect as "spill". I think of "spill" as something that is written to and then read from. We wouldn't write batches we've already read. It's a temporary structure meant for one-time execution.
I think a more accurate description might be "temporary table". Though maybe that is a distinction without merit. We are first writing the batches to a temporary table and then playing back from that temporary table to execute the operation.
Another analogy could be SQL server's spool operator (an operator that uses a temporary table to store data and then that table will be read multiple times throughout the execution of a plan) which is used for a slightly different purpose (to share the output of a node with multiple readers) but is implemented in much the same way.
| let reader = AsyncStreamReader::open(spill_path.clone()).await?; | ||
| // Skip batches we've already read. | ||
| for _ in 0..self.batches_read { | ||
| reader.read().await?; | ||
| } | ||
| self.state = SpillReaderState::Reader { reader }; |
Ok...so if we've read batches 0, 1, 2, 3, 4...
And then the writer decides to spill. It will still write batches 0, 1, 2, 3, 4?
This makes sense (as this will potentially need to be replayed multiple times) but it was confusing.
I'll add a comment explaining that.
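The replay logic in this hunk reduces to: the spill file always contains every batch from the start, so after switching from the in-memory path to the file, the reader re-opens the stream and skips the batches it already handed out. Modeling batches as integers (purely for illustration):

```rust
// Resume reading a replayed stream: the spill always starts from batch 0,
// so skip the `batches_read` already returned before the switch to disk.
fn resume_from<I: Iterator<Item = i32>>(stream: I, batches_read: usize) -> Vec<i32> {
    stream.skip(batches_read).collect()
}

fn main() {
    // Spill file contains all batches, even the 5 already read in memory.
    let spill = vec![0, 1, 2, 3, 4, 5, 6];
    assert_eq!(resume_from(spill.into_iter(), 5), vec![5, 6]);
}
```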
That's a fair point. I think I can rename the function.
Part of #3397

- Extracted the Backoff utility into a separate struct.
- Backoff now starts at 50ms, then 100ms, 200ms, 400ms (previously it started at 100ms).
- Changed Transaction::conflicts_with() to return an enum that differentiates retryable and non-retryable conflicts.
- Made merge_insert retry on retryable conflicts up to 10 times, after which it returns a TooMuchContention error.
- Fixed background_iterator so that it preserves size_hint().
- Improved CommitConflict so it's easier to see which operations conflicted.
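The backoff schedule described above (50ms, 100ms, 200ms, 400ms, doubling each attempt, capped at 10 retries before TooMuchContention) can be sketched as below. The real Backoff struct likely adds jitter and a delay cap; this shows only the base doubling, and the names are illustrative.

```rust
use std::time::Duration;

// Retry limit from the PR description: 10 attempts, then give up.
const MAX_RETRIES: u32 = 10;

// Base delay doubles each attempt, starting at 50ms.
fn backoff_delay(attempt: u32) -> Duration {
    Duration::from_millis(50u64 << attempt.min(MAX_RETRIES))
}

fn main() {
    let delays: Vec<u64> = (0..4)
        .map(|a| backoff_delay(a).as_millis() as u64)
        .collect();
    assert_eq!(delays, vec![50, 100, 200, 400]);
}
```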