get: copy/download files tracked by Git by danihodovic · Pull Request #2837 · treeverse/dvc

danihodovic · 2019-11-22T20:21:53Z

Allows dvc get to copy regular files or directories.

❗ Have you followed the guidelines in the Contributing to DVC list?
📖 Check this box if this PR does not require documentation updates, or if it does and you have created a separate PR in dvc.org with such updates (or at least opened an issue about it in that repo). Please link below to your PR (or issue) in the dvc.org repo.
❌ Have you checked DeepSource, CodeClimate, and other sanity checks below? We consider their findings recommendatory and don't expect everything to be addressed. Please review them carefully and fix those that actually improve code or fix bugs.

Thank you for the contribution - we'll try to review it as soon as possible. 🙏

efiop · 2019-11-23T06:25:54Z

Please use os.path.join 🙂

Also, CODE is used in lots of tests, so moving it like that is dangerous. If you need some file like that in your test - just create it where needed.

I wouldn't add any new files to this fixture, it is basically a remnant from pre-pytest times. You have 2 better choices now:

create whatever you need in test itself,

create a separate pytest fixture that creates it, if you need the same thing in several tests.

efiop · 2019-11-23T06:27:15Z

No need to modify unrelated tests 🙂 Just create what you need for your tests and don't touch other tests.

efiop · 2019-11-23T06:29:43Z

@danihodovic Please take a look at failed tests.

efiop

Thanks for the PR @danihodovic 🙏 Please see a few comments above. 🙂

efiop · 2019-11-23T19:12:14Z

From the TestDirFixture perspective, all files/dirs (DATA, DATA_DIR, FOO, BAR, etc.) are regular and not tracked by dvc. You probably want to modify the erepo fixture to have those files, not this global fixture.

Why exactly do those files need to be tracked by dvc? I thought we were testing the retrieval of a normal file that is tracked by Git.

FOO, BAR are regular files, but they have an additional .dvc file which (I assume) represents the dvc format of the file. When running a breakpoint in a test I can find these files created on the filesystem. I want to test that dvc get works with files that don't have a *.dvc representation.

src = erepo.REGULAR_FILE
dst = erepo.REGULAR_FILE + "_imported"
breakpoint()
Repo.get(erepo.root_dir, src, dst)
assert os.path.exists(dst)
assert os.path.isfile(dst)
assert filecmp.cmp(src, dst, shallow=False)

Creates the following structure

$ ls /tmp/dvc-test.15186.qxfon8q6.evynmKznxaWjUTntqBTJiP
.rw-rw-r-- dani dani  16 B  Sat Nov 23 21:39:30 2019 тест
drwxrwxr-x dani dani   4 KB Sat Nov 23 21:39:30 2019 lib
.rw-rw-r-- dani dani  66 B  Sat Nov 23 21:39:30 2019 code.py
.rw-rw-r-- dani dani 143 B  Sat Nov 23 21:39:30 2019 foo.dvc
.rw-rw-r-- dani dani   3 B  Sat Nov 23 21:39:30 2019 foo
.rw-rw-r-- dani dani   4 B  Sat Nov 23 21:39:30 2019 bar
.rw-rw-r-- dani dani 143 B  Sat Nov 23 21:39:30 2019 bar.dvc
.rw-rw-r-- dani dani 152 B  Sat Nov 23 21:39:30 2019 data_dir.dvc
drwxrwxr-x dani dani   4 KB Sat Nov 23 21:39:30 2019 data_dir
.rw-rw-r-- dani dani   6 B  Sat Nov 23 21:39:30 2019 version
.rw-rw-r-- dani dani 147 B  Sat Nov 23 21:39:30 2019 version.dvc

The added tests fail when the logic I've added is removed.

https://github.com/danihodovic/dvc/blob/feat/2515/dvc/repo/get.py#L39-L43

If I've still misunderstood the problem please inform me :)

Why exactly do those files need to be tracked by dvc? I thought we were testing the retrieval of a normal file that is tracked by Git.

They don't, that is why "regular file/dir" terminology doesn't make sense here 🙂 These files (FOO, BAR, etc.) are dvc added in the erepo fixture, and this base fixture you are modifying doesn't make any assumptions like that. That is why you should modify the erepo fixture (or somewhere on top of it) to add files/dirs that are not dvc added, instead of modifying the unrelated base fixture and breaking unrelated tests. 🙂

We should stop creating those gigantic fixtures. I suggest making them more granular so that you only create whatever you need for your tests. Now it is lots of unneeded work for most tests.

And we can start by not adding things to erepo or to TestDirFixture.

I agree with @Suor. "One structure for all" also makes debugging harder.

efiop · 2019-11-23T19:13:59Z

You are modifying unrelated tests, please see the comment above 🙂

Disregarding the comment above; the changes were made because the addition of the lib directory changes the filesystem tree. If you'd like me to avoid modifying other tests I'd need to create a separate fixture which extends TestDirFixture and adds regular files.

I guess there is a misunderstanding here 🙂 I'm simply trying to say that the correct way to approach this is to modify the existing erepo fixture or build on top of it (you can even avoid using var names like FOO, BAR and use plain string filenames instead, no problem) to add the files/dirs that you need. That way you won't need to break and fix unrelated tests.

So let's clarify things to avoid confusion. How exactly do you want a basic smoke test setup to look?

Here are my assumptions:

We create a test directory into which we want to use dvc get to retrieve and store files. The test directory should probably exist in /tmp/. Let's call this directory A.

We want a remote git project to dvc get files from. Ideally this should be a local temporary directory as well to avoid going over the network for reproducibility and performance reasons. Let's call this directory B.

We want to retrieve a file called file.txt from directory B.

We use dvc get and afterwards we want to assert that the filename and contents of the newly created file in directory A match those of the file in directory B.

        Dir A                   Dir B
     +----------+            +----------+
     |          |            |          |
     |          |            |          |
1.   |          +<-----------+ file.txt |
     |          |            |          |
     +----------+            +----------+

     +----------+            +----------+
     |          |            |          |
2.   |          |            |          |
     | file.txt |            | file.txt |
     |          |            |          |
     +----------+            +----------+

3. assert A/file.txt == B/file.txt
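The three steps in the diagram can be sketched with the standard library alone. This only illustrates the assertion flow: `fake_get` is a hypothetical stand-in that copies a file locally, not `Repo.get`, and the directory and file names are made up for the example.

```python
import filecmp
import os
import shutil
import tempfile


def fake_get(repo_root, src, dst):
    """Hypothetical stand-in for Repo.get: copy repo_root/src to dst."""
    shutil.copy2(os.path.join(repo_root, src), dst)


def smoke_test():
    # Directory B plays the role of the source repository,
    # directory A is the empty destination.
    with tempfile.TemporaryDirectory() as dir_a, tempfile.TemporaryDirectory() as dir_b:
        with open(os.path.join(dir_b, "file.txt"), "w") as fobj:
            fobj.write("something")

        # Step 1: fetch file.txt from B into A.
        dst = os.path.join(dir_a, "file.txt")
        fake_get(dir_b, "file.txt", dst)

        # Steps 2 and 3: the file exists in A and matches the one in B.
        assert os.path.isfile(dst)
        return filecmp.cmp(os.path.join(dir_b, "file.txt"), dst, shallow=False)
```

A real test would swap `fake_get` for the actual call and make directory B a Git repository with `file.txt` committed.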

The questions that remain for me before implementation:

is it OK to use a temporary tmp directory without any git or dvc setup for directory A, from which we dvc get files?

can I use the erepo fixture to create directory B and fetch the file code.py from directory B. code.py doesn't seem to be checked into dvc (get: copy/download files tracked by Git #2837 (comment), https://github.com/iterative/dvc/blob/master/tests/conftest.py#L141-L174)

e.g

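import filecmp
import os
import tempfile

from dvc.repo import Repo


def test_get_regular_file(erepo):
    tmp_dir = tempfile.TemporaryDirectory()
    # assumes erepo.CODE is a path relative to erepo.root_dir
    src = os.path.join(erepo.root_dir, erepo.CODE)
    dst = os.path.join(tmp_dir.name, "code_imported")
    os.chdir(tmp_dir.name)
    Repo.get(erepo.root_dir, erepo.CODE, dst)
    assert filecmp.cmp(src, dst, shallow=False)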

This would still require creating a nested directory in the erepo fixture in order to test dvc get with directories. Do you want me to create the nested directory or to create a new fixture specifically for these two tests?

There is definitely a misunderstanding here, your current tests test_get_regular_file and test_get_regular_dir are totally fine, except for the fact that you are modifying TestDirFixture for no good reason. Just stop modifying it and either create those files in place as @Suor mentioned or modify erepo(or create a fixture based on it) to have those files. I'm not against the latter case, because we will soon have to test the same stuff for dvc import and we'll need the same files anyway, but we could do that when we actually need it.

Just to absolutely clarify this, here is an UNTESTED example of what an acceptable test_get_regular_file could look like:

def test_get_regular_file(repo_dir, erepo):
    src = "file"
    dst = src + "_imported"

    with open(os.path.join(erepo.root_dir, src), "w+") as fobj:
        fobj.write("something")

    erepo.scm.add([src])
    erepo.scm.commit("add file")

    Repo.get(erepo.root_dir, src, dst)

    assert os.path.exists(dst)
    assert os.path.isfile(dst)
    assert filecmp.cmp(src, dst, shallow=False)

You can use:

erepo.create(src, "something")

instead of with open... I guess.

@Suor Right, forgot that erepo here is not a Repo instance. Thanks!

efiop · 2019-11-24T12:09:25Z

@danihodovic Oh, looks like we forgot to explicitly mention #2780 . We need to make sure that dvc get for git-tracked files will work when there is no default remote setup in the erepo. Could you please take a look? 🙂 I.e.

#!/bin/bash

set -e
set -x

rm -rf mytest
mkdir mytest
cd mytest

mkdir erepo
pushd erepo
git init
dvc init
dvc run -O out 'echo foo > out'
git add .
git commit -m "init"
popd

dvc get erepo out

currently fails with

+ dvc get erepo out
ERROR: failed to get 'out' from 'erepo' - No DVC remote is specified in the target repository 'erepo': config file error: no remote specified. Setup default remote with
    dvc config core.remote <name>
or use:
    dvc pull -r <name>

but it shouldn't as we're trying to dvc get a file that is tracked by git and so it doesn't need the dvc remote to be setup. 🙂

Suor · 2019-11-24T13:35:09Z

Not on this PR, but on this code:

looks like repo and o.repo are the same, so we shouldn't confuse code reader by referencing it differently

hence we can use single with clause,

we don't need to use .pull() if we are to checkout it later from cache anyway, we should use .fetch()

if we use .pull() then we can move an artifact, which might be faster than checkout.

@Suor there is no cloud.fetch. cloud.pull doesn't checkout, so we are fine, even though the naming could be improved 🙂

So we leave this as is?

The cloud.fetch? Let's move it into a separate refactoring ticket (for all of those points), or we can solve it right away; it doesn't look too complex.

Suor · 2019-11-24T13:37:02Z

The comment explains what we are doing but not why, at least not in any specific form. Why do we need it anyway? Why can't we sort it where we need it? Why don't we sort files?

The tests for tree equality fail because the order is not deterministic for directories. https://github.com/iterative/dvc/blob/master/tests/func/test_tree.py#L109-L128

I can sort it only for my tests, but I'd imagine most developers writing tests to compare tree structure would assume there is a deterministic order in the first place. It's unfortunate it's not consistent.

Why don't we sort files?

There are multiple files in the test directory and they preserved the sorted order. As soon as we added more than one directory the order would change between test runs.

So if we don't modify the big fixtures this won't be an issue? BTW, generally we shouldn't change how code works to make tests pass, we should change tests.

💯 agree with @Suor . When we encounter that the ordering matters, we just use sets or explicitly sort in our tests, there are a few places in our tests already that do that IIRC.
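A minimal sketch of the "sort in the test" approach, using only the standard library. `walk_sorted` and `make_tree` are hypothetical helpers, not part of DVC's tree API; the first normalizes `os.walk` output so the comparison does not depend on the order in which the filesystem lists entries.

```python
import os
import tempfile


def walk_sorted(top):
    """Return os.walk results with every level sorted, relative to top."""
    result = []
    for root, dirs, files in os.walk(top):
        dirs.sort()  # in-place sort also fixes the traversal order
        result.append((os.path.relpath(root, top), list(dirs), sorted(files)))
    return result


def make_tree(top, paths):
    """Create empty files at the given relative paths under top."""
    for path in paths:
        full = os.path.join(top, path)
        os.makedirs(os.path.dirname(full), exist_ok=True)
        open(full, "w").close()


with tempfile.TemporaryDirectory() as first, tempfile.TemporaryDirectory() as second:
    # Same content, created in a different order.
    make_tree(first, ["lib/a.py", "data/x", "code.py"])
    make_tree(second, ["code.py", "data/x", "lib/a.py"])
    trees_equal = walk_sorted(first) == walk_sorted(second)
```

Keeping the normalization in the test leaves the production code's behaviour untouched, which is the point made above.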

Suor · 2019-11-24T13:39:25Z

I wouldn't add any new files to this fixture, it is basically a remnant from pre-pytest times. You have 2 better choices now:

create whatever you need in test itself,

create a separate pytest fixture that creates it, if you need the same thing in several tests.

Suor · 2019-11-24T13:50:21Z

So you use it exactly once. You should simply create a file in erepo and add it to git, no need to change existing or adding your fixtures.

I'm not against changing erepo or creating a new fixture on top, because we will soon have to test the same stuff for dvc import. But it is also okay to do that when we actually need it and for now just create files in-place as noted in #2837 (comment)

I would prefer a separate fixture even in the future.

danihodovic · 2019-11-26T16:17:20Z

PTAL :)

re: #2837 (comment)

Running your test script works. I don't know if you want a test for this explicitly as the current tests should prove correctness because erepo doesn't have a git upstream by default.

depends on: treeverse/dvc#2837

Suor · 2019-11-26T19:22:30Z

So we leave this as is?

Suor · 2019-11-26T19:30:48Z

How does it work if we haven't even added this new file to git? We are probably not testing it hard enough or smth.

Suor · 2019-11-26T19:32:36Z

.create("directory/file", ...) will create a directory for you.
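For illustration, a helper with that behaviour could look roughly like the sketch below. This is not the fixture's actual code; `create` here is a hypothetical standalone function that creates missing parent directories before writing the file.

```python
import os
import tempfile


def create(root, name, contents):
    """Write contents to root/name, creating parent directories as needed."""
    path = os.path.join(root, name)
    parent = os.path.dirname(path)
    if parent:
        os.makedirs(parent, exist_ok=True)
    with open(path, "w") as fobj:
        fobj.write(contents)
    return path


with tempfile.TemporaryDirectory() as root:
    created = create(root, os.path.join("directory", "file"), "something")
    parent_is_dir = os.path.isdir(os.path.join(root, "directory"))
    with open(created) as fobj:
        written = fobj.read()
```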

jorgeorpinel · 2019-11-26T23:44:02Z

What is "PTAL"?
Will dvc import also support Git tracked files?

jorgeorpinel

A couple small things on the command output. Please also see my questions above. Thanks

pared

Great stuff @danihodovic!
Two (actually one) minor things.

pared · 2019-11-27T14:25:55Z

This must be relative path, that is also the reason for strange behaviour mentioned by @Suor here.
Also probably an idea for new issue ticket: forbid the user from providing the full path in get.

Also, erepo already has few things inside that can be get-ed:
for example:
in case of file: Repo.get(erepo.root_dir, erepo.FOO, "foo_imported")
in case of dir: Repo.get(erepo.root_dir, erepo.DATA_DIR, "dir_imported")

Also probably an idea for new issue ticket: forbid the user from providing the full path in get.

There was a ticket recently from a user that uses full path to get his external local output like that, so we can't really forbid it for good 🙂
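A possible explanation for the strange behaviour with full paths, assuming the source path is built with `os.path.join(repo.root_dir, src)` as in the snippets in this thread: when the second argument is absolute, `os.path.join` discards the first one, so the file would be read from its original location rather than from the cloned repository. The paths below are made up, and `posixpath` is used so the result is the same on every platform.

```python
import posixpath

clone_root = "/tmp/dvc-clone"  # hypothetical temporary clone of the source repo

# A relative src is resolved inside the clone, as intended.
relative = posixpath.join(clone_root, "dir/file")

# An absolute src replaces the clone root entirely.
absolute = posixpath.join(clone_root, "/home/user/erepo/dir/file")

# An isabs check can tell the two cases apart before any copying happens.
is_relative_abs = posixpath.isabs("dir/file")
is_absolute_abs = posixpath.isabs("/home/user/erepo/dir/file")
```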

@pared

Also, erepo already has few things inside that can be get-ed:
for example:
in case of file: Repo.get(erepo.root_dir, erepo.FOO, "foo_imported")
in case of dir: Repo.get(erepo.root_dir, erepo.DATA_DIR, "dir_imported")

Yes, but those are dvc added in the erepo fixture, and this PR is aiming to test getting files that are tracked by git and not dvc.

Ahh right, sorry. Still, for a file that could be erepo.code for example.

@pared sure, but we also need to test a dir too :)

pared · 2019-11-27T14:28:36Z

Same as above, src cannot be full path.

pared · 2019-11-27T14:28:55Z

repo_dir is unnecessary here, can be removed

@pared it is necessary, otherwise we will pollute repo root.

@efiop I do not agree:
erepo already uses repo_dir, so temp test dir is already created, and we are back to it thanks to os.chdir inside erepo

@pared Unless I'm missing something, if we take a look at erepo fixture https://github.com/iterative/dvc/blob/0.71.0/tests/conftest.py#L173 we can see that it returns back to the saved dir, which is our repository root. And then down below we do Repo.get, so the file will be downloaded to our repository root, which is bad.

@efiop I disagree:

1. erepo uses the repo_dir fixture, which gets created first; it creates a temporary test dir and moves into it in _pushd.

2. erepo creates TestDvcGitFixture, which creates another temporary dir, but thanks to 1. its _saved_dir is actually our repo_dir root.

3. At the end of erepo creation, chdir moves back to the repo_dir root.

Conclusion gentlemen?

@danihodovic The conclusion is that @pared tried it and it works, so he is right and removing is fine :) That being said, I'm a bit surprised by that behavior of erepo, don't remember if it was meant to do that. Seems like it creates yet another test directory, which is harmless but might not be what we need. Need to finally revisit our old unittest classes and new fixtures (I feel your pain @Suor 🙂).
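The behaviour discussed in this sub-thread can be modelled with a small sketch. It is not the real fixture code: `TmpDirFixture` is a hypothetical class that only mimics the "save cwd, create a temp dir, move in and out" pattern, with the outer instance standing in for `repo_dir` and the inner one for `erepo`.

```python
import os
import shutil
import tempfile


class TmpDirFixture:
    """Hypothetical model of a fixture that remembers the cwd it was created in."""

    def __init__(self):
        self.saved_dir = os.getcwd()
        self.root_dir = os.path.realpath(tempfile.mkdtemp())

    def pushd(self):
        os.chdir(self.root_dir)

    def popd(self):
        os.chdir(self.saved_dir)

    def cleanup(self):
        shutil.rmtree(self.root_dir, ignore_errors=True)


start_dir = os.getcwd()

repo_dir = TmpDirFixture()  # created first, moves into its own temp dir
repo_dir.pushd()

erepo = TmpDirFixture()     # created while cwd is repo_dir's root
erepo.pushd()               # set up the external repo inside its own dir
erepo.popd()                # "chdir back" lands in repo_dir, not start_dir

cwd_after_erepo = os.path.realpath(os.getcwd())

# Tear down in reverse order and restore the original cwd.
repo_dir.popd()
erepo.cleanup()
repo_dir.cleanup()
```

In this model the inner fixture's saved directory is the outer fixture's root, so anything written to the cwd afterwards ends up in the temporary test directory rather than in the project checkout.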

danihodovic · 2019-11-28T14:24:47Z

@efiop @pared @Suor - test changes in 1ca336d
@jorgeorpinel docs changes in bb0925f

efiop · 2019-12-06T18:58:40Z

+            shutil.copytree(src_full_path, dst_full_path)
+        else:
+            shutil.copy2(src_full_path, dst_full_path)
+    except FileNotFoundError:


We are not really racing against anything, so how about we

if not os.path.exists(src_full_path):
    raise PathOutsideRepoError(src, repo_url)

before "if os.path.isdir()" instead of wrapping it in try&except, to make it more linear? Looks like shutil.copy2(src_full_path, dst_full_path) is the only line that could raise this exception, as isdir will return False on non-existing path. Seems like it would make it easier to grasp. I don't have a strong opinion here. What do you think?

Like so?

def _copy_git_file(repo, src, dst, repo_url):
    src_full_path = os.path.join(repo.root_dir, src)
    dst_full_path = os.path.abspath(dst)

    if os.path.isdir(src_full_path):
        shutil.copytree(src_full_path, dst_full_path)
        return

    try:
        shutil.copy2(src_full_path, dst_full_path)
    except FileNotFoundError:
        raise PathOutsideRepoError(src, repo_url)

@danihodovic Well, that would do it for me too 😄 Thanks! 🙂
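Both variants discussed here can be tried in isolation. The sketch below follows the "check first" suggestion; `PathOutsideRepoError` is redefined locally as a plain stand-in exception and a `repo_root` string replaces the `repo` object, so this is not the code that was merged.

```python
import os
import shutil
import tempfile


class PathOutsideRepoError(Exception):
    """Local stand-in for the DVC exception of the same name."""


def copy_git_file(repo_root, src, dst, repo_url):
    src_full_path = os.path.join(repo_root, src)
    dst_full_path = os.path.abspath(dst)

    # Fail early instead of relying on copy2 raising FileNotFoundError.
    if not os.path.exists(src_full_path):
        raise PathOutsideRepoError(src, repo_url)

    if os.path.isdir(src_full_path):
        shutil.copytree(src_full_path, dst_full_path)
    else:
        shutil.copy2(src_full_path, dst_full_path)


with tempfile.TemporaryDirectory() as repo_root, tempfile.TemporaryDirectory() as out:
    os.makedirs(os.path.join(repo_root, "lib"))
    with open(os.path.join(repo_root, "lib", "code.py"), "w") as fobj:
        fobj.write("print('hi')\n")

    # Directories are copied recursively.
    copy_git_file(repo_root, "lib", os.path.join(out, "lib_copy"), "erepo")
    copied_dir = os.path.isfile(os.path.join(out, "lib_copy", "code.py"))

    # A missing path raises the stand-in error before any copy is attempted.
    try:
        copy_git_file(repo_root, "missing", os.path.join(out, "x"), "erepo")
        raised = False
    except PathOutsideRepoError:
        raised = True
```

The early-return version with `try`/`except` around `shutil.copy2` behaves the same for these two cases; the difference is mainly in how linear the function reads.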

efiop

Looks great! A few minor comments up above.

efiop

Looks great! 🎉

Suor · 2019-12-07T10:51:55Z

+            is_git_file = output_error and not os.path.isabs(path)
+            is_not_cached = output and not output.use_cache


If it is not cached then this is also a git file, so these var names are confusing. Overall this logic is more complicated than necessary. It is simply either cached or a git managed file, so:

try:
    if out and out.use_cache:
        # do the pull and checkout
        ...
    else:
        # Non cached out outside repo, can't be handled
        if os.path.isabs(src):
            raise PathOutsideRepoError(...)

        # it's git-handled and already checked out to tmp dir
        # we should just copy, not a git specific operation
        ...
        copy_dir_or_file(src_full, dst_full)
except FileNotFoundError:
    raise FileMissingError(...)

And again forgot about generic exception, basically:

$ dvc get http://some.repo non-existing-path
Can't find non-existing-path in some.repo neither as output nor as git handled file/dir

Message may be different, but the idea is that we don't know whether user tried to get an out or a git handled file.

@Suor We've agreed to not use FileMissingError as its message is not applicable here. Hence PathOutsideRepoError, which is more suitable. The current logic corresponds to what we have in Repo.open.

Agreed on the naming.

And indeed output_error is not wrapped as it is in repo.open.

Looking closer at Repo.open, it indeed has a clearer implementation than this, as it doesn't introduce the is_git_file confusion.

efiop · 2019-12-09T16:57:48Z

For the record: @danihodovic decided to resign, so we are merging as is and will be fixing on top.

Baranowski · 2019-12-10T13:18:26Z

@efiop, should I wait with #2862 until you fix or resume working on it?

efiop · 2019-12-10T17:21:52Z

@Baranowski Fixed get, please rebase. Very sorry for the delay 🙁

danihodovic force-pushed the feat/2515 branch from 1115bfe to f1e7b0d Compare November 22, 2019 20:29

shcheklein requested review from Suor and efiop November 22, 2019 23:11

efiop reviewed Nov 23, 2019

efiop suggested changes Nov 23, 2019

efiop reviewed Nov 23, 2019

weekly-digest Bot mentioned this pull request Nov 24, 2019

Weekly Digest (17 November, 2019 - 24 November, 2019) #2841

Closed

Suor reviewed Nov 24, 2019

Suor suggested changes Nov 24, 2019

danihodovic force-pushed the feat/2515 branch from 51fffc9 to 7b42ffd Compare November 26, 2019 16:15

danihodovic added a commit to danihodovic/dvc.org that referenced this pull request Nov 26, 2019

cmd ref: add examples on downloading normal git files

e6ab43e

depends on: treeverse/dvc#2837

danihodovic mentioned this pull request Nov 26, 2019

get: add example on downloading normal git files treeverse/dvc.org#821

Merged

efiop requested a review from Suor November 26, 2019 18:38

efiop reviewed Nov 26, 2019

Comment thread dvc/command/get.py Outdated

efiop requested review from a user, jorgeorpinel and pared November 26, 2019 18:39

Suor suggested changes Nov 26, 2019

jorgeorpinel changed the title ~~get: copy regular files~~ get: copy/download files tracked by Git Nov 26, 2019

jorgeorpinel reviewed Nov 26, 2019

Comment thread dvc/command/get.py Outdated

jorgeorpinel suggested changes Nov 26, 2019

Comment thread dvc/command/get.py Outdated

Comment thread dvc/command/get.py Outdated

pared suggested changes Nov 27, 2019

efiop requested a review from jorgeorpinel November 28, 2019 13:13

danihodovic added 6 commits December 5, 2019 21:05

fixup

38bed22

Pass all tests

3c2633e

return early

20e7696

fixup

b17d1c5

fix error message for git files

9ae4ab8

fixup! fix error message for git files

7debaca

danihodovic force-pushed the feat/2515 branch 3 times, most recently from 96ba415 to 0baa307 Compare December 6, 2019 17:51

fixes

811855a

danihodovic force-pushed the feat/2515 branch from 0baa307 to 811855a Compare December 6, 2019 17:56

efiop reviewed Dec 6, 2019

Comment thread dvc/repo/get.py

efiop reviewed Dec 6, 2019

efiop suggested changes Dec 6, 2019

efiop requested a review from Suor December 6, 2019 19:00

fixes

d59bceb

efiop approved these changes Dec 6, 2019

efiop reviewed Dec 6, 2019

Comment thread tests/func/test_get.py Outdated

efiop reviewed Dec 6, 2019

Comment thread dvc/repo/get.py Outdated

fixup! fixes

29b34ba

efiop approved these changes Dec 6, 2019

Suor suggested changes Dec 7, 2019

pared approved these changes Dec 7, 2019

weekly-digest Bot mentioned this pull request Dec 8, 2019

Weekly Digest (1 December, 2019 - 8 December, 2019) #2919

Closed

efiop merged commit 26a9702 into treeverse:master Dec 9, 2019

jorgeorpinel mentioned this pull request Jan 6, 2020

DOC: dvc import can access non-DVC Git repositories treeverse/dvc.org#900

Closed

danihodovic deleted the feat/2515 branch March 27, 2023 09:37
