We have a new multi-bucket feature that would benefit from the multiple base paths feature we added in shallow clone: #4765
The proposed API is something like:
...
dataset.add_base("s3://bucket/abc", name="primary")
dataset.add_base("s3://bucket3/abc", name="us-west-2")
dataset.add_base("s3://bucket3/abc", name="eu-west-2")
...
lance.write_dataset(data, dataset, mode="append", base_paths=["primary", "us-west-2"])
Or adding that as a part of dataset creation:
lance.write_dataset(data, dataset, mode="create", base_paths=[
{ "path": "s3://bucket/abc", "name": "primary" },
{ "path": "s3://bucket3/abc", "name": "us-west-2" }
])
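To make the two forms of base_paths concrete, here is a minimal sketch of how a mixed list of names and path/name entries could be normalized. It is plain Python rather than Lance code: `BasePath` and `normalize_base_paths` are hypothetical names used only for illustration, and the rule that a name cannot be re-registered with a different path is an assumption, not something decided in this proposal.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BasePath:
    """Hypothetical record for one named base path."""
    name: str
    path: str


def normalize_base_paths(entries, registered):
    """Resolve a list of names and {"path", "name"} dicts to BasePath objects.

    `registered` maps name -> BasePath for bases already added to the dataset.
    Returns (targets, newly_registered), where `targets` preserves input order.
    """
    targets = []
    new = {}
    for entry in entries:
        if isinstance(entry, str):
            # A bare name must refer to a base that is already known.
            if entry in registered:
                targets.append(registered[entry])
            elif entry in new:
                targets.append(new[entry])
            else:
                raise KeyError(f"unknown base path name: {entry!r}")
        else:
            # A dict both registers a new base and selects it as a target.
            base = BasePath(name=entry["name"], path=entry["path"])
            existing = registered.get(base.name) or new.get(base.name)
            if existing is not None and existing != base:
                # Assumption: a name maps to exactly one path.
                raise ValueError(f"base path name already in use: {base.name!r}")
            new[base.name] = base
            targets.append(base)
    return targets, new
```

With this shape, the append call above would pass only names, while the create call would pass only dicts, and both would go through the same resolution step.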
There are 2 open questions:
- It seems like we should give a deterministic name to the table's root location, since it is not recorded in the path but we still want to reference it in base_paths. In the example above, the user does not want to write data to the root location, only to the newly added bases. But there will also be cases where the user wants to write to it, and they should be able to specify something like base_paths=["root", "primary", "us-west-2"].
- Should we create a new transaction operation like UpdateBasePaths? It seems like we could also just add it to the UpdateConfig operation, but that might be overloading it.
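As a rough illustration of the first question, the sketch below models a reserved name for the root location. It is not Lance code: `BaseRegistry`, `ROOT_BASE_NAME`, and the example root URI are hypothetical, and choosing "root" as the reserved name simply follows the example above.

```python
ROOT_BASE_NAME = "root"  # assumed reserved name for the table's root location


class BaseRegistry:
    """Hypothetical in-memory model of a dataset's named base paths."""

    def __init__(self, root_uri):
        self._root_uri = root_uri
        self._bases = {}

    def add_base(self, path, name):
        # The reserved name always means the root, so users cannot claim it.
        if name == ROOT_BASE_NAME:
            raise ValueError(f"{ROOT_BASE_NAME!r} is reserved for the root location")
        if name in self._bases:
            raise ValueError(f"base path name already in use: {name!r}")
        self._bases[name] = path

    def resolve(self, names):
        """Map base names to URIs; the reserved name maps to the root URI."""
        resolved = []
        for name in names:
            if name == ROOT_BASE_NAME:
                resolved.append(self._root_uri)
            elif name in self._bases:
                resolved.append(self._bases[name])
            else:
                raise KeyError(f"unknown base path name: {name!r}")
        return resolved


# Usage: the root URI here is a made-up placeholder.
registry = BaseRegistry("s3://root-bucket/table")
registry.add_base("s3://bucket/abc", name="primary")
targets = registry.resolve(["root", "primary"])
```

Because the root is resolved by name rather than stored as an entry, omitting "root" from base_paths keeps writes off the root location, which matches the behavior in the append example.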
Curious what people think, @jaystarshot @majin1102 @wjones127