feat: strategized plan compaction #5233
westonpace
left a comment
A few (mostly minor) suggestions
| // get all fragments by default | ||
| fn get_fragments(&self, dataset: &Dataset, _options: &CompactionOptions) -> Vec<FileFragment> { | ||
| // get_fragments should be returning fragments in sorted order (by id) | ||
| // and fragment ids should be unique | ||
| dataset.get_fragments() | ||
| } | ||
|
|
||
| // no filter by default | ||
| async fn filter_fragments( | ||
| &self, | ||
| _dataset: &Dataset, | ||
| fragments: Vec<FileFragment>, | ||
| _options: &CompactionOptions, | ||
| ) -> Result<Vec<FileFragment>> { | ||
| Ok(fragments) | ||
| } |
Do these really need to be trait methods? I think we can probably leave them out and just let individual implementations use them if they want to. It will keep the trait simpler.
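To illustrate the suggestion, here is a minimal sketch of a trait that exposes only `plan`, with fragment selection kept as an implementation detail. The `Dataset`, `Fragment`, and `CompactionPlan` types are simplified synchronous stand-ins invented for this example, not Lance's real types, and `SmallFragmentPlanner` with its `max_rows` field is hypothetical.

```rust
/// Simplified stand-in for a dataset fragment (hypothetical, not Lance's `FileFragment`).
#[derive(Clone, Debug, PartialEq)]
struct Fragment {
    id: u64,
    num_rows: usize,
}

/// Simplified stand-in for a dataset.
struct Dataset {
    fragments: Vec<Fragment>,
}

impl Dataset {
    /// Returns fragments sorted by id, the ordering the diff comments assume.
    fn get_fragments(&self) -> Vec<Fragment> {
        let mut frags = self.fragments.clone();
        frags.sort_by_key(|f| f.id);
        frags
    }
}

/// Simplified stand-in for a plan: just the ids of the fragments to rewrite.
#[derive(Debug, PartialEq)]
struct CompactionPlan {
    fragment_ids: Vec<u64>,
}

/// The trait exposes a single method; no `get_fragments` or `filter_fragments` hooks.
trait CompactionPlanner {
    fn plan(&self, dataset: &Dataset) -> CompactionPlan;
}

/// Hypothetical strategy that only compacts fragments below a row threshold.
struct SmallFragmentPlanner {
    max_rows: usize,
}

impl CompactionPlanner for SmallFragmentPlanner {
    fn plan(&self, dataset: &Dataset) -> CompactionPlan {
        // Fetching and filtering both happen inside the implementation.
        let fragment_ids = dataset
            .get_fragments()
            .into_iter()
            .filter(|f| f.num_rows < self.max_rows)
            .map(|f| f.id)
            .collect();
        CompactionPlan { fragment_ids }
    }
}
```

With this shape, a strategy that needs no filtering simply skips the `filter` step, and the trait surface stays at one method.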
| Ok(fragments) | ||
| } | ||
|
|
||
| async fn plan(&self, dataset: &Dataset, options: &CompactionOptions) -> Result<CompactionPlan>; |
added docs
| Ok(fragments) | ||
| } | ||
|
|
||
| async fn plan(&self, dataset: &Dataset, options: &CompactionOptions) -> Result<CompactionPlan>; |
Will CompactionOptions be flexible enough for all possible strategies? Should we maybe accept options as a JSON string or a Map<String, String>? This way different strategies can expose their own custom options. That would leave the API a little less defined but it would be more flexible.
Do we even need to take CompactionOptions? Maybe it should be an argument to the constructor of the individual structs. That way each could have their own arguments but also be strongly typed.
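The trade-off between the two proposals can be sketched as follows. With a `Map<String, String>`, each strategy has to parse and validate its own keys at run time; with a typed constructor, the compiler does that work. Everything here is hypothetical: `TieredOptions`, `TieredPlanner`, the key names, and the default of 4 tiers are invented for illustration and are not part of the PR.

```rust
use std::collections::HashMap;

/// Hypothetical typed options for one strategy.
#[derive(Debug, PartialEq)]
struct TieredOptions {
    target_rows_per_fragment: usize,
    max_tiers: u32,
}

/// String-map approach: every key must be looked up, parsed, and validated at run time.
fn parse_tiered_options(configs: &HashMap<String, String>) -> Result<TieredOptions, String> {
    let target_rows_per_fragment = configs
        .get("target_rows_per_fragment")
        .ok_or_else(|| "missing key: target_rows_per_fragment".to_string())?
        .parse::<usize>()
        .map_err(|e| format!("invalid target_rows_per_fragment: {e}"))?;
    let max_tiers = match configs.get("max_tiers") {
        Some(raw) => raw
            .parse::<u32>()
            .map_err(|e| format!("invalid max_tiers: {e}"))?,
        None => 4, // arbitrary default chosen for this example
    };
    Ok(TieredOptions { target_rows_per_fragment, max_tiers })
}

/// Typed-constructor approach: wrong types or missing arguments fail to compile.
struct TieredPlanner {
    options: TieredOptions,
}

impl TieredPlanner {
    fn new(options: TieredOptions) -> Self {
        Self { options }
    }
}
```

The string map keeps the trait signature uniform across strategies, but a typo in a key or value only surfaces as a run-time error.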
Will CompactionOptions be flexible enough for all possible strategies? Should we maybe accept options as a JSON string or a Map<String, String>? This way different strategies can expose their own custom options. That would leave the API a little less defined but it would be more flexible.
Hi @westonpace Thanks a lot for your review. Added Map<String, String> for flexibility.
Do we even need to take CompactionOptions? Maybe it should be a argument to the constructor of the individual structs. That way each could have their own arguments but also be strongly typed.
Hi @wjones127 Thanks a lot for your review.
I have tried several ways to remove the CompactionOptions parameter from the plan method, but none of them is perfect :( The main conflict is that when users start planning compaction with an already-built planner, they may adjust the options dynamically under certain conditions, such as
If the options are passed in when the planner is built and are modified afterwards, the planner must still see the updated options. That would require Arc + Mutex, and a plain clone would not work.
By contrast, it might be simpler and more flexible to pass in the desired options each time the plan method is called.
If options are passed in when building the planner, then after modifying the options subsequently, it must also be ensured that the options in the planner can be seen. Therefore, we need Arc + mutex and cannot use clone.
I don't understand this. The logic of validate() can live in the planner and be internal.
If I were to rewrite compact_files, I would do:
pub async fn compact_files(
    dataset: &mut Dataset,
    mut options: CompactionOptions,
    remap_options: Option<Arc<dyn IndexRemapperOptions>>, // These will be deprecated later
) -> Result<CompactionMetrics> {
    info!(target: TRACE_DATASET_EVENTS, event=DATASET_COMPACTING_EVENT, uri = &dataset.uri);
    // .validate() now happens inside of `from_options`
    let planner = DefaultCompactionPlanner::from_options(options);
    compact_files_with_planner(dataset, &planner, remap_options).await
}
changed as you suggested ~
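As a rough sketch of the `from_options` idea from this thread: the planner takes the options by value, validates them once in the constructor, and a caller who wants different options builds a new planner instead of sharing a mutable one, so no `Arc` or `Mutex` is needed. The two fields on `CompactionOptions`, the body of `validate`, and the helper `planner_with_threshold` are assumptions made for this example, not the PR's actual code.

```rust
/// Simplified stand-in for `CompactionOptions`; both fields are assumptions for this example.
#[derive(Clone, Debug, PartialEq)]
struct CompactionOptions {
    materialize_deletions: bool,
    materialize_deletions_threshold: f32,
}

impl CompactionOptions {
    /// Hypothetical normalization: a threshold of 100% or more disables materialization.
    fn validate(&mut self) {
        if self.materialize_deletions && self.materialize_deletions_threshold >= 1.0 {
            self.materialize_deletions = false;
        }
    }
}

struct DefaultCompactionPlanner {
    options: CompactionOptions,
}

impl DefaultCompactionPlanner {
    /// Takes the options by value, validates once, then owns them immutably.
    fn from_options(mut options: CompactionOptions) -> Self {
        options.validate();
        Self { options }
    }
}

/// Hypothetical helper: adjust a copy of the options and build a fresh planner from it.
fn planner_with_threshold(base: &CompactionOptions, threshold: f32) -> DefaultCompactionPlanner {
    let mut adjusted = base.clone();
    adjusted.materialize_deletions_threshold = threshold;
    DefaultCompactionPlanner::from_options(adjusted)
}
```

Because constructing a planner is cheap in this sketch, dynamic adjustment becomes "build another planner" rather than "mutate shared state".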
| /// Compacts the files in the dataset without reordering them. | ||
| /// | ||
| /// This does a few things: | ||
| /// By default, his does a few things: |
| /// By default, his does a few things: | |
| /// By default, this does a few things: |
| pub async fn compact_files( | ||
| dataset: &mut Dataset, | ||
| options: CompactionOptions, | ||
| remap_options: Option<Arc<dyn IndexRemapperOptions>>, // These will be deprecated later |
|
Hi @westonpace and @wjones127 Thanks a lot for your review. All comments are addressed. PTAL :)
wjones127
left a comment
Sorry for the delay, I accidentally left my comments pending. I'm still not sure about the design. It seems like it could be simplified further.
| // get all fragments by default | ||
| fn get_fragments(&self, dataset: &Dataset, _options: &CompactionOptions) -> Vec<FileFragment> { | ||
| // get_fragments should be returning fragments in sorted order (by id) | ||
| // and fragment ids should be unique | ||
| dataset.get_fragments() | ||
| } |
Was already commented on, but do we need this? It seems like individual implementations can just call dataset.get_fragments() and then do whatever filtering they would like.
No worries @westonpace. I really appreciate your response. All comments are addressed. PTAL~
wjones127
left a comment
Just got back from vacation, so sorry for the delay.
This is headed in the right direction. I still have some comments on the API.
|
|
||
| fn get_compaction_options(&self) -> &CompactionOptions; |
Why is this necessary? The CompactionPlan already contains the CompactionOptions, so isn't the return value of plan sufficient?
| async fn plan( | ||
| &self, | ||
| dataset: &Dataset, | ||
| configs: HashMap<String, String>, |
Why do we need configs? The point of my earlier suggestion was to get rid of string-typed configurations in favor of proper constructors. For example:
pub trait CompactionPlanner: Send + Sync {
    async fn plan(&self, dataset: &Dataset) -> Result<CompactionPlan>;
}

pub struct DefaultCompactionPlanner {
    options: CompactionOptions,
}

impl CompactionPlanner for DefaultCompactionPlanner {
    async fn plan(&self, dataset: &Dataset) -> Result<CompactionPlan> {
        let tasks = todo!();
        Ok(CompactionPlan {
            tasks,
            read_version: dataset.manifest.version,
            options: self.options.clone(),
        })
    }
}

pub struct CustomPlanner {
    config_a: i32,
    config_b: Duration,
    options: CompactionOptions,
}

impl CustomPlanner {
    // Strongly-typed constructor: no string parsing or validation needed.
    fn new(config_a: i32, config_b: Duration, options: CompactionOptions) -> Self {
        Self { config_a, config_b, options }
    }
}

impl CompactionPlanner for CustomPlanner {
    async fn plan(&self, dataset: &Dataset) -> Result<CompactionPlan> {
        let tasks = todo!("Use config_a and config_b to plan tasks");
        Ok(CompactionPlan {
            tasks,
            read_version: dataset.manifest.version,
            options: self.options.clone(),
        })
    }
}
Notice how CustomPlanner::new() takes strongly-typed parameters. We no longer have to validate string inputs, which makes the API safer and easier to use.
BTW we should probably distinguish between configurations for planning and configurations for executing. The reason we need CompactionOptions later is that they are needed for execution. But the options that will vary between planners are the planning options.
BTW we should probably distinguish between configurations for planning and configurations for executing. The reason we need CompactionOptions later is that they are needed for execution. But the options that will vary between planners are the planning options.
I quickly glanced through it, and there are quite a few parts that need to be modified. I will try to create a new PR to solve this problem :)
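One possible shape for the planning/execution split discussed above, shown as a sketch only. The planner holds both kinds of options, but the plan it returns carries just the execution options, so the executor never depends on strategy-specific settings (which also means no separate options getter is needed on the planner). All names here (`ExecutionOptions`, `AgeBasedPlanning`, `AgeBasedPlanner`, `describe_execution`) and their fields are hypothetical.

```rust
use std::time::Duration;

/// Hypothetical execution-time settings; every planner's output carries these.
#[derive(Clone, Debug, PartialEq)]
struct ExecutionOptions {
    num_threads: usize,
    batch_size: usize,
}

/// Hypothetical planning-time settings for one specific strategy.
struct AgeBasedPlanning {
    min_fragment_age: Duration,
}

/// The plan carries only what the executor needs.
#[derive(Debug)]
struct CompactionPlan {
    fragment_ids: Vec<u64>,
    execution: ExecutionOptions,
}

struct AgeBasedPlanner {
    planning: AgeBasedPlanning,
    execution: ExecutionOptions,
}

impl AgeBasedPlanner {
    /// `fragments` is a list of (id, age) pairs standing in for dataset metadata.
    fn plan(&self, fragments: &[(u64, Duration)]) -> CompactionPlan {
        let fragment_ids = fragments
            .iter()
            .filter(|(_, age)| *age >= self.planning.min_fragment_age)
            .map(|(id, _)| *id)
            .collect();
        CompactionPlan { fragment_ids, execution: self.execution.clone() }
    }
}

/// The executor sees the plan only, never the planning options.
fn describe_execution(plan: &CompactionPlan) -> String {
    format!(
        "rewrite {} fragments with {} threads",
        plan.fragment_ids.len(),
        plan.execution.num_threads
    )
}
```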
Hi @wjones127. Thanks for your response, and I hope you had a fantastic vacation! I will address the comments asap :)
Pull request overview
This PR introduces a strategy pattern for compaction planning in Lance, allowing users to implement custom compaction strategies via the new CompactionPlanner trait. The existing compaction logic has been refactored into a DefaultCompactionPlanner implementation, maintaining backward compatibility while enabling extensibility.
- Adds CompactionPlanner trait for custom compaction strategies
- Refactors existing compaction planning logic into DefaultCompactionPlanner
- Introduces compact_files_with_planner API to enable custom planner usage
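The strategy pattern summarized in this overview can be sketched in a few lines. This is a synchronous toy version under stated assumptions: the real API in the PR is async and uses Lance's own `Dataset` and `CompactionPlan` types, while the stand-in types, the `target_rows` field, the `NoopPlanner` strategy, and the `usize` return values below are invented for illustration.

```rust
/// Simplified stand-in for a dataset: one row count per fragment.
struct Dataset {
    fragment_rows: Vec<usize>,
}

#[derive(Debug, PartialEq)]
struct CompactionPlan {
    /// Indices of fragments selected for rewriting.
    tasks: Vec<usize>,
}

trait CompactionPlanner {
    fn plan(&self, dataset: &Dataset) -> CompactionPlan;
}

/// Stand-in for the default behavior: select fragments under a target size.
struct DefaultCompactionPlanner {
    target_rows: usize,
}

impl CompactionPlanner for DefaultCompactionPlanner {
    fn plan(&self, dataset: &Dataset) -> CompactionPlan {
        let tasks = dataset
            .fragment_rows
            .iter()
            .enumerate()
            .filter(|(_, rows)| **rows < self.target_rows)
            .map(|(idx, _)| idx)
            .collect();
        CompactionPlan { tasks }
    }
}

/// Hypothetical custom strategy: never compact anything.
struct NoopPlanner;

impl CompactionPlanner for NoopPlanner {
    fn plan(&self, _dataset: &Dataset) -> CompactionPlan {
        CompactionPlan { tasks: Vec::new() }
    }
}

/// Accepts any strategy; returns how many fragments it would rewrite.
fn compact_files_with_planner(dataset: &Dataset, planner: &dyn CompactionPlanner) -> usize {
    planner.plan(dataset).tasks.len()
}

/// Backward-compatible wrapper that always uses the default strategy.
fn compact_files(dataset: &Dataset, target_rows: usize) -> usize {
    compact_files_with_planner(dataset, &DefaultCompactionPlanner { target_rows })
}
```

Existing callers keep using `compact_files`, while callers with a custom strategy go through `compact_files_with_planner`.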
Reviewed changes
Copilot reviewed 1 out of 2 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| rust/lance/src/dataset/optimize.rs | Adds CompactionPlanner trait and DefaultCompactionPlanner struct; refactors plan_compaction logic into the default planner; adds compact_files_with_planner function; updates documentation; adds test for default planner |
| java/lance-jni/Cargo.lock | Updates lance-namespace-impls version from 1.0.0-beta.4 to 1.0.0-beta.7 |
Hi @wjones127, I really appreciate your help. All comments are addressed. PTAL ~
wjones127
left a comment
This looks great. Thanks for working with me on this.
Thanks @wjones127 :) As a next step, I will optimize the compaction plan strategy based on the horizontal and tiered vertical strategies we discussed before ~
Close #5186