get: copy/download files tracked by Git#2837
Conversation
1115bfe to
f1e7b0d
Compare
There was a problem hiding this comment.
Please use os.path.join 🙂
Also, CODE is used in lots of tests, so moving it like that is dangerous. If you need some file like that in your test - just create it where needed.
There was a problem hiding this comment.
I wouldn't add any new files to this fixture, it is basically a remnant from pre-pytest times. You have 2 better choices now:
- create whatever you need in test itself,
- create a separate pytest fixture that creates it, if you need the same thing in several tests.
There was a problem hiding this comment.
No need to modify unrelated tests 🙂 Just create what you need for your tests and don't touch other tests.
|
@danihodovic Please take a look at failed tests. |
efiop
left a comment
There was a problem hiding this comment.
Thanks for the PR @danihodovic 🙏 Please see a few comments above. 🙂
There was a problem hiding this comment.
From TestDirFixture perspective, all files/dirs(DATA, DATA_DIR, FOO, BAR, etc) are regular and not tracked by dvc. You probably want to modify erepo fixture to have those files, not this global fixture.
There was a problem hiding this comment.
Why exactly do those files need to be tracked by dvc? I thought we were testing the retrieval of a normal file that is tracked by Git.
FOO, BAR are regular files, but they have an additional .dvc file which (I assume) represents the dvc format of the file. When running a breakpoint in a test I can find these files created on the filesytem. I want to test that dvc get works with files that don't have a *.dvc representation.
src = erepo.REGULAR_FILE
dst = erepo.REGULAR_FILE + "_imported"
breakpoint()
Repo.get(erepo.root_dir, src, dst)
assert os.path.exists(dst)
assert os.path.isfile(dst)
assert filecmp.cmp(src, dst, shallow=False)
Creates the following structure
$ ls /tmp/dvc-test.15186.qxfon8q6.evynmKznxaWjUTntqBTJiP
.rw-rw-r-- dani dani 16 B Sat Nov 23 21:39:30 2019 тест
drwxrwxr-x dani dani 4 KB Sat Nov 23 21:39:30 2019 lib
.rw-rw-r-- dani dani 66 B Sat Nov 23 21:39:30 2019 code.py
.rw-rw-r-- dani dani 143 B Sat Nov 23 21:39:30 2019 foo.dvc
.rw-rw-r-- dani dani 3 B Sat Nov 23 21:39:30 2019 foo
.rw-rw-r-- dani dani 4 B Sat Nov 23 21:39:30 2019 bar
.rw-rw-r-- dani dani 143 B Sat Nov 23 21:39:30 2019 bar.dvc
.rw-rw-r-- dani dani 152 B Sat Nov 23 21:39:30 2019 data_dir.dvc
drwxrwxr-x dani dani 4 KB Sat Nov 23 21:39:30 2019 data_dir
.rw-rw-r-- dani dani 6 B Sat Nov 23 21:39:30 2019 version
.rw-rw-r-- dani dani 147 B Sat Nov 23 21:39:30 2019 version.dvc
The added tests fail when the logic I've added is removed.
https://github.com/danihodovic/dvc/blob/feat/2515/dvc/repo/get.py#L39-L43
If I've still misunderstood the problem please inform me :)
There was a problem hiding this comment.
Why exactly do those files need to be tracked by dvc? I thought we were testing the retrieval of a normal file that is tracked by Git.
They don't, that is why "regular file/dir" terminology doesn't make sense here 🙂These files(FOO, BAR, etc) are dvc added in erepo fixture, and this base fixture you are modifying doesn't do any assumptions like that. That is why you should modify erepo fixture(or somewhere on top of it) to add files/dirs that are not dvc added, instead of modifying the unrelated base fixture and breaking unrelated tests. 🙂
There was a problem hiding this comment.
We should stop creating those gigantic fixtures. I suggest making them more granular so that you only create whatever you need for your tests. Now it is lots of unneeded work for most tests.
There was a problem hiding this comment.
And we can start by not adding things to erepo not to TestDirFixture.
There was a problem hiding this comment.
I agree with @Suor "One structure for all" also makes debugging harder.
There was a problem hiding this comment.
You are modifying unrelated tests, please see the comment above 🙂
There was a problem hiding this comment.
Disregarding the comment above; the changes were made because the addition of the lib directory changes the filesystem tree. If you'd like me to avoid modifying other tests I'd need to create a separate fixture which extends TestDirFixture and adds regular files.
There was a problem hiding this comment.
I guess there is a misunderstading here 🙂 I'm simply trying to say that the correct way to approach this is to modify exiting erepo fixture or build on top of it (you can even avoid using var names like FOO,BAR and use plain string filename instead, no problem) to add files/dirs that you need. That way you won't need to break and fix unrelated tests.
There was a problem hiding this comment.
So let's clarify things to avoid confusion. How exactly do you want a basic smoke test setup to look?
Here are my assumptions:
- We create a test directory into which we want to use
dvc getto retrieve and store files. The test directory should probably exist in/tmp/. Let's call this directory A. - We want a remote git project to
dvc getfiles from. Ideally this should be a local temporary directory as well to avoid going over the network for reproducibility and performance reasons. Let's call this directory B. - We want to retrieve a file called
file.txtfrom directory B. - We use
dvc getand afterwards we want to assert that the filename and contents of the newly created file in directory A is equal to the file in directory B.
Dir A Dir B
+----------+ +----------+
| | | |
| | | |
1. | +<-----------+ file.txt |
| | | |
+----------+ +----------+
+----------+ +----------+
| | | |
2. | | | |
| file.txt| | file.txt|
| | | |
+----------+ +----------+
3. assert A/file.txt == B/file.txt
The questions that remain for me before implementation:
- is it OK to use a temporary tmp directory without any git or dvc setup for directory A, from which we
dvc getfiles? - can I use the erepo fixture to create directory B and fetch the file
code.pyfrom directory B.code.pydoesn't seem to be checked into dvc (get: copy/download files tracked by Git #2837 (comment), https://github.com/iterative/dvc/blob/master/tests/conftest.py#L141-L174)
e.g
import os
import tempfile
def test_get_regular_file(erepo):
dir = tempfile.TemporaryDirectory()
dst = os.path.join(dir.name, "code_imported")
os.chdir(dir.name)
Repo.get(erepo.root_dir, erepo.CODE, dst)
assert filecmp.cmp(src, dst, shallow=False)This would still require creating a nested directory in the erepo fixture in order to test dvc get with directories. Do you want me to create the nested directory or to create a new fixture specifically for these two tests?
There was a problem hiding this comment.
There is definitely a misunderstanding here, your current tests test_get_regular_file and test_get_regular_dir are totally fine, except for the fact that you are modifying TestDirFixture for no good reason. Just stop modifying it and either create those files in place as @Suor mentioned or modify erepo(or create a fixture based on it) to have those files. I'm not against the latter case, because we will soon have to test the same stuff for dvc import and we'll need the same files anyway, but we could do that when we actually need it.
Just to absolutely clarify this, here is an UNTESTED example of how an acceptable test_get_regular_file could look like:
def test_get_regular_file(repo_dir, erepo):
src = "file"
dst = src + "_imported"
with open(os.path.join(erepo.root_dir, src), "w+") as fobj:
fobj.write("something")
erepo.scm.add([src])
erepo.scm.commit("add file")
Repo.get(erepo.root_dir, src, dst)
assert os.path.exists(dst)
assert os.path.isfile(dst)
assert filecmp.cmp(src, dst, shallow=False)
There was a problem hiding this comment.
You can use:
erepo.create(src, "something")instead of with open... I guess.
There was a problem hiding this comment.
@Suor Right, forgot that erepo here is not a Repo instance. Thanks!
|
@danihodovic Oh, looks like we forgot to explicitly mention #2780 . We need to make sure that currently fails with but it shouldn't as we're trying to |
There was a problem hiding this comment.
Not on this PR, but on this code:
- looks like
repoando.repoare the same, so we shouldn't confuse code reader by referencing it differently - hence we can use single
withclause, - we don't need to use
.pull()if we are to checkout it later from cache anyway, we should use.fetch() - if we use
.pull()then we can move an artifact, which might be faster than checkout.
There was a problem hiding this comment.
@Suor there is no cloud.fetch. cloud.pull doesn't checkout, so we are fine, even though the naming could be improved 🙂
There was a problem hiding this comment.
The cloud.fetch? Let's move it into a separate refactoring ticket (for all of those points) or can solve it right away, doesn't look too complex.
There was a problem hiding this comment.
The comment explains what are we doing but not why, at least not in any specific form. Why do we need it anyway? Why can't we sort it where we need it? Why don't we sort files?
There was a problem hiding this comment.
The tests for tree equality fail because the order is not deterministic for directories. https://github.com/iterative/dvc/blob/master/tests/func/test_tree.py#L109-L128
I can sort it only for my tests, but I'd imagine most developers writing tests to compare tree structure would assume there is a deterministic order in the first place. It's unfortunate it's not consistent.
Why don't we sort files?
There are multiple files in the test directory and they preserved the sorted order. As soon as we added more than one directory the order would change between test runs.
There was a problem hiding this comment.
So if we don't modify the big fixtures this won't be an issue? BTW, generally we shouldn't change how code works to make tests pass, we should change tests.
There was a problem hiding this comment.
💯 agree with @Suor . When we encounter that the ordering matters, we just use sets or explicitly sort in our tests, there are a few places in our tests already that do that IIRC.
There was a problem hiding this comment.
I wouldn't add any new files to this fixture, it is basically a remnant from pre-pytest times. You have 2 better choices now:
- create whatever you need in test itself,
- create a separate pytest fixture that creates it, if you need the same thing in several tests.
There was a problem hiding this comment.
So you use it exactly once. You should simply create a file in erepo and add it to git, no need to change existing or adding your fixtures.
There was a problem hiding this comment.
I'm not against of changing erepo or creating a new fixture on top because we will soon have to test the same stuff for dvc import. But it is also okay to do that when we actually need it and for now just create files in-place as noted in #2837 (comment)
There was a problem hiding this comment.
I would prefer a separate fixture even in the future.
51fffc9 to
7b42ffd
Compare
|
PTAL :) re: #2837 (comment) Running your test script works. I don't know if you want a test for this explicitly as the current tests should prove correctness because erepo doesn't have a git upstream by default. |
There was a problem hiding this comment.
How does it work if we haven't even added this new file to git? We are probably not testing it hard enough or smth.
There was a problem hiding this comment.
.create("directory/file", ...) will create a directory for you.
|
What is "PTAL"? |
pared
left a comment
There was a problem hiding this comment.
Great stuff @danihodovic!
Two (actually one) minor things.
There was a problem hiding this comment.
Also, erepo already has few things inside that can be get-ed:
for example:
in case of file: Repo.get(erepo.root_dir, erepo.FOO, "foo_imported")
in case of dir: Repo.get(erepo.root_dir, erepo.DATA_DIR, "dir_imported")
There was a problem hiding this comment.
Also probably an idea for new issue ticket: forbid the user from providing the full path in get.
There was a ticket recently from a user that uses full path to get his external local output like that, so we can't really forbid it for good 🙂
There was a problem hiding this comment.
Also, erepo already has few things inside that can be get-ed:
for example:
in case of file: Repo.get(erepo.root_dir, erepo.FOO, "foo_imported")
in case of dir: Repo.get(erepo.root_dir, erepo.DATA_DIR, "dir_imported")
Yes, but those are dvc added in erepo fixture and this PR is aiming to test geting files that are tracked by git and not dvc.
There was a problem hiding this comment.
Ahh right, sorry. Still, for a file that could be erepo.code for example.
There was a problem hiding this comment.
@pared sure, but we also need to test a dir too :)
There was a problem hiding this comment.
Same as above, src cannot be full path.
There was a problem hiding this comment.
repo_dir is unnecessary here, can be removed
There was a problem hiding this comment.
@pared it is necessary, otherwise we will pollute repo root.
There was a problem hiding this comment.
@efiop I do not agree:
erepo already uses repo_dir, so temp test dir is already created, and we are back to it thanks to os.chdir inside erepo
There was a problem hiding this comment.
@pared Unless I'm missing something, if we take a look at erepo fixture https://github.com/iterative/dvc/blob/0.71.0/tests/conftest.py#L173 we can see that it returns back to the saved dir, which is our repository root. And then down below we do Repo.get, so the file will be downloaded to our repository root, which is bad.
There was a problem hiding this comment.
@efiop I disagree:
erepousesrepo_dirfixture, which gets created first and it creates temporary test dir and moves into it in_pushderepocreatesTestDvcGitFixturewhich creates another temporary dir, but thanks to 1.
its_saved_diris actually ourrepo_dirroot.- at the end of
erepocreation,chdirmoves back torepo_dirroot
There was a problem hiding this comment.
Conclusion gentlemen?
There was a problem hiding this comment.
@danihodovic The conclusion is that @pared tried it and it works, so he is right and removing is fine :) That being said, I'm a bit surprised by that behavior of erepo, don't remember if it was meant to do that. Seems like it creates a yet another test directory, which harmless but might not be what we need. Need to finally revisit our old unittest classes and new fixtures (I feel your pain @Suor 🙂).
|
@efiop @pared @Sour - test changes in 1ca336d |
96ba415 to
0baa307
Compare
0baa307 to
811855a
Compare
| shutil.copytree(src_full_path, dst_full_path) | ||
| else: | ||
| shutil.copy2(src_full_path, dst_full_path) | ||
| except FileNotFoundError: |
There was a problem hiding this comment.
We are not really raсing against anything, so how about we
if not os.path.exists(src_full_path):
raise PathOutsideRepoError(src, repo_url)
before "if os.path.isdir()" instead of wrapping it in try&except, to make it more linear? Looks like shutil.copy2(src_full_path, dst_full_path) is the only line that could raise this exception, as isdir will return False on non-existing path. Seems like it would make it easier to grasp. I don't have a strong opinion here. What do you think?
There was a problem hiding this comment.
Like so?
def _copy_git_file(repo, src, dst, repo_url):
src_full_path = os.path.join(repo.root_dir, src)
dst_full_path = os.path.abspath(dst)
if os.path.isdir(src_full_path):
shutil.copytree(src_full_path, dst_full_path)
return
try:
shutil.copy2(src_full_path, dst_full_path)
except FileNotFoundError:
raise PathOutsideRepoError(src, repo_url)There was a problem hiding this comment.
@danihodovic Well, that would do it for me too 😄 Thanks! 🙂
efiop
left a comment
There was a problem hiding this comment.
Looks great! A few minor comments up above.
| is_git_file = output_error and not os.path.isabs(path) | ||
| is_not_cached = output and not output.use_cache |
There was a problem hiding this comment.
If it is not cached then this is also a git file, so this var names are confusing. Overall this logic is more complicated than necessary. It is simply either cached or a git managed file, so:
try:
if out and out.use_cache:
# do the pull and checkout
...
else:
# Non cached out outside repo, can't be handled
if os.path.abspath(src):
raise PathOutsideRepoError(...)
# it's git-handled and already checked out to tmp dir
# we should just copy, not a git specific operation
...
copy_dir_or_file(src_full, dst_full)
except FileNotFoundError:
raise FileMissingError(...)And again forgot about generic exception, basically:
$ dvc get http://some.repo non-existing-path
Can't find non-existing-path in some.repo neither as output
nor as git handled file/dirMessage may be different, but the idea is that we don't know whether user tried to get an out or a git handled file.
There was a problem hiding this comment.
@Suor We've agreed to not use FileMissingError as its message is not applicable here. Hence PathOutsideRepoError, which is more suitable. The current logic corresponds to what we have in Repo.open.
There was a problem hiding this comment.
And indeed output_error is not wrapped as it is in repo.open.
There was a problem hiding this comment.
Looking closer at Repo.open, it indeed has a more clear implementation than this, as it doesn't introduce is_git_file confusion.
|
For the record: @danihodovic decided to resign, so we are merging as is and will be fixing on top. |
|
@Baranowski Fixed |
Allows
dvc getto copy regular files or directories.fixes: #2515
❗ Have you followed the guidelines in the Contributing to DVC list?
📖 Check this box if this PR does not require documentation updates, or if it does and you have created a separate PR in dvc.org with such updates (or at least opened an issue about it in that repo). Please link below to your PR (or issue) in the dvc.org repo.
❌ Have you checked DeepSource, CodeClimate, and other sanity checks below? We consider their findings recommendatory and don't expect everything to be addresses. Please review them carefully and fix those that actually improve code or fix bugs.
Thank you for the contribution - we'll try to review it as soon as possible. 🙏