Ability to copy big files while preserving etag #440

@isidentical

If a file was uploaded in multiple parts (a multipart upload), s3fs currently copies it using a static chunk size (5 GB). This can result in fewer API calls than the original upload used, and therefore in an ETag that differs from the source's. Example scenario:

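import random
import s3fs

# Assumes credentials are configured and test_bucket_name names an existing bucket.
s3 = s3fs.S3FileSystem()

test_file1 = test_bucket_name + "/test/multipart-upload.txt"
test_file2 = test_bucket_name + "/test/multipart-upload-copy.txt"

# Write five chunks, each larger than the 10 MiB block size by a random amount.
with s3.open(test_file1, "wb", block_size=5 * 2 ** 21) as stream:
    for _ in range(5):
        stream.write(b"b" * (stream.blocksize + random.randrange(200)))
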
In the code above, test_file1 is created in 5 parts, each of a different size. Its ETag then looks something like b3da0a2caaab0a4e4d81b91f8e80762d-5. If we copy it (via managed copy), however, the whole file is copied in a single operation, because the static block size is larger than the file's total size. The copy ends up with an ETag that looks like 96a4c244831bd2b4898f8b014d9c128a-1.
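To illustrate why the part layout matters, here is a minimal sketch of the commonly observed multipart ETag scheme: the MD5 of the concatenated per-part MD5 digests, followed by a dash and the part count. This is an assumption based on observed behavior for objects without KMS or customer-key encryption, not a documented guarantee, and multipart_etag is a hypothetical helper name.

```python
import hashlib
from typing import Iterable


def multipart_etag(parts: Iterable[bytes]) -> str:
    """Compute an ETag the way S3 is commonly observed to for multipart uploads.

    Assumption: ETag = md5(md5(part1) + md5(part2) + ...) + "-" + part count.
    """
    # Binary (not hex) MD5 digest of every part, in upload order.
    digests = [hashlib.md5(part).digest() for part in parts]
    # Hash the concatenated digests and append the number of parts.
    combined = hashlib.md5(b"".join(digests)).hexdigest()
    return f"{combined}-{len(digests)}"
```

Under this scheme the same bytes split at different boundaries hash differently, which is why a copy made with a different part size cannot reproduce the source ETag.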

DVC needs this use case temporarily (until we revisit our internals). The implementation is simple: it determines the block size on the fly, matching the size of each copied part to the size of the corresponding part of the source blob.
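A rough sketch of that idea, assuming a boto3-style S3 client rather than s3fs internals: head_object with PartNumber reports the size of a single source part (and PartsCount for multipart objects), and the resulting sizes can be turned into the byte ranges that upload_part_copy accepts as CopySourceRange. The helper names source_part_sizes and part_ranges are hypothetical, and this is not necessarily how DVC or s3fs implements it.

```python
from typing import Iterator, List, Tuple


def source_part_sizes(client, bucket: str, key: str) -> List[int]:
    """Return the size in bytes of every part of the source object.

    Assumes `client` behaves like a boto3 S3 client.
    """
    head = client.head_object(Bucket=bucket, Key=key, PartNumber=1)
    # PartsCount is absent when the object was not a multipart upload.
    count = head.get("PartsCount", 1)
    sizes = [head["ContentLength"]]
    for number in range(2, count + 1):
        part = client.head_object(Bucket=bucket, Key=key, PartNumber=number)
        sizes.append(part["ContentLength"])
    return sizes


def part_ranges(part_sizes: List[int]) -> Iterator[Tuple[int, str]]:
    """Yield (part_number, byte_range) pairs that mirror the source layout."""
    offset = 0
    for number, size in enumerate(part_sizes, start=1):
        # Byte ranges are inclusive on both ends.
        yield number, f"bytes={offset}-{offset + size - 1}"
        offset += size
```

Each yielded pair would then feed one upload_part_copy call inside a multipart upload on the destination key, so the copy has the same number and sizes of parts as the source.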

I think we can extend the copy() function with a flag like preserve_etag: bool = False and keep this functionality separate, so that behavior does not change for normal use cases.
