s3: ignore empty directories while walking files#2683
Conversation
|
Had an offline discussion with @efiop where we argued if DVC should track empty directories or not.
What do you think, @Suor , @efiop , sholud we track empty dirs? |
|
@MrOutis It is not an empty dir, it is a special object that is used that way. I think we have to support it, as someone might have those outputs as empty dirs and have logic built on the assumption that that prefix already exists. |
|
@efiop, what kind of logic? |
|
@efiop , depending on existing directories breaks reproducibility (preferred way: $ dvc init --no-scm
$ mkdir data
$ echo 'foo' > data/foo
$ dvc add data
$ echo 'bar' > data/bar
$ dvc add data/bar
ERROR: Paths for outs:
'data'('data.dvc')
'data/bar'('data/bar.dvc')
overlap. To avoid unpredictable behaviour, rerun command with non overlapping outs paths.
Having any troubles? Hit us up at https://dvc.org/support, we are always happy to help! |
|
@MrOutis The logic of something assuming that the prefix exists. For example your first stage is creating that directory that way by doing
This patch still doesn't handle |
efiop
left a comment
There was a problem hiding this comment.
The issue is not fixed yet.
@efiop , but why would the next stage expect that such prefix exist? |
|
@MrOutis I have no idea, something created that empty dir in users example, so something is using it and I won't be surprised if something else would be expecting it to exist. I would think about it as creating an empty dir in local fs, but just adapted to s3 caveats. |
|
I believe the original issue - "an empty dir" created by user making a mess - is fixed. The supposition that some later user code will rely on that "empty dir" to exist is a separate thing. I am not sure it's actually a thing, the issue might, however, come up if a user is using "a dir file" as an actual file with contents. And this can't be fixed without giving up on the dir notion for S3 and using a prefix notion instead. |
|
@Suor This PR is hiding the issue and we still don't support "empty dirs" here. We should solve it properly. |
efiop
left a comment
There was a problem hiding this comment.
So there are 2 parts to what we've discussed here in terms of the implementation.
-
collection
Same as with local directories,walk_filesshould ignore empty directories. But the top level directory shouldn't be ignored, and that can be implemented by makingisdirreturnTrueon empty directories. That wayget_checksumwill also be able to properly callget_dir_checksuminstead of callingget_file_checksumon an "empty_dir" object. -
checkout
We shouldn't create embedded empty dirs, hence why we didn't collect them in 1. But we should create a top level empty dir. For the sake of implementation simplicity, we could simply implementmakedirsfor s3 to create empty_dir objects.
|
For the record, |
|
Sorry guys to revive this, but I still don't get the logic behind this change. If we state that an s3 dir is a collection of files with common prefix ending with Or Then why do we actively creating dir files - the ones ending with If we require dir file to be present then will dvc break when someone will try to |
|
@Suor The |
Fix #2678
https://docs.aws.amazon.com/AmazonS3/latest/user-guide/using-folders.html