If a file was uploaded in multiple parts (multipart upload), the current s3fs copy strategy simply uses a static chunk size (5 GB). This can result in fewer API calls, but it also means the ETag of the copy can differ from the ETag of the source. Example scenario:
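import random

# `s3` (an S3FileSystem instance) and `test_bucket_name` come from the test setup.
test_file1 = test_bucket_name + "/test/multipart-upload.txt"
test_file2 = test_bucket_name + "/test/multipart-upload-copy.txt"
with s3.open(test_file1, "wb", block_size=5 * 2 ** 21) as stream:
    # Write slightly more than one block each time so the parts differ in size.
    for _ in range(5):
        stream.write(b"b" * (stream.blocksize + random.randrange(200)))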
In the code above, test_file1 is created in 5 parts, each of a different size. In that case, the ETag looks something like b3da0a2caaab0a4e4d81b91f8e80762d-5. However, if we copy this file (via managed copy), the whole object is copied in one operation, because the static block size is larger than the total size of the file. The resulting ETag looks like 96a4c244831bd2b4898f8b014d9c128a-1.
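To illustrate why the part layout changes the ETag, here is a minimal sketch of the commonly observed multipart ETag scheme: the MD5 of the concatenated per-part MD5 digests, followed by the part count. This is an assumption about S3 behavior for plain (non-KMS-encrypted) uploads, not something s3fs computes, and `multipart_etag` is a hypothetical helper name.

```python
import hashlib
from typing import Sequence


def multipart_etag(data: bytes, part_sizes: Sequence[int]) -> str:
    """Emulate the commonly observed S3 multipart ETag (hypothetical helper).

    Assumption: ETag = md5(md5(part_1) + ... + md5(part_n)).hexdigest() + "-n".
    """
    if sum(part_sizes) != len(data):
        raise ValueError("part sizes must add up to the total data length")

    digests = []
    offset = 0
    for size in part_sizes:
        # Hash each part separately, exactly as it was uploaded.
        digests.append(hashlib.md5(data[offset:offset + size]).digest())
        offset += size

    # Hash the concatenated binary digests and append the number of parts.
    combined = hashlib.md5(b"".join(digests)).hexdigest()
    return f"{combined}-{len(digests)}"


if __name__ == "__main__":
    payload = b"b" * 100
    print(multipart_etag(payload, [30, 25, 20, 15, 10]))  # suffix "-5"
    print(multipart_etag(payload, [100]))                 # suffix "-1"
```

The same bytes split into five parts and into a single part yield different hashes and different suffixes, which matches the difference between the -5 and -1 ETags above.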
DVC needs this use case temporarily (until we revisit our internals). The implementation determines the block size on the fly: each copied part gets the same size as the corresponding part of the source blob.
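The part-matching idea can be sketched roughly as follows. This is not the code in this PR: `copy_ranges` is a hypothetical helper, and it assumes the source part sizes have already been fetched (for example via a HeadObject request per part number, which is not shown). It only derives the inclusive byte range each copied part would need to cover.

```python
from typing import List, Sequence, Tuple


def copy_ranges(part_sizes: Sequence[int]) -> List[Tuple[int, str]]:
    """Map source part sizes to (part_number, byte_range) pairs.

    Hypothetical helper: each range is formatted as "bytes=first-last"
    with an inclusive last byte, the form used for ranged part copies.
    """
    ranges = []
    start = 0
    # S3 part numbers start at 1.
    for number, size in enumerate(part_sizes, start=1):
        if size <= 0:
            raise ValueError("part sizes must be positive")
        end = start + size - 1  # inclusive end offset of this part
        ranges.append((number, f"bytes={start}-{end}"))
        start = end + 1
    return ranges


if __name__ == "__main__":
    for number, byte_range in copy_ranges([10, 12, 5]):
        print(number, byte_range)
```

Copying each range as its own part keeps the part boundaries of the copy identical to the source, which is what lets the ETag match.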
I think we can extend the copy() function with a flag such as preserve_etag: bool = False and keep this functionality separate, so that behavior does not change for normal use cases.