New option jobs for dvc import #4977
`jobs` option for `dvc import`
| "--jobs", | ||
| type=int, | ||
| help=( | ||
| "Number of jobs to run simultaneously. " |
I know this might come from some other DVC commands, but let's reconsider this message please?
Number of jobs is not very informative. Number of parallel connections? Number of download jobs?
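To make the suggestion concrete, a minimal argparse sketch of a more specific help string might look like this. The parser object and the help wording are assumptions for illustration, not the text that was merged; only `--jobs` and `type=int` come from the diff under review.

```python
import argparse

# Hypothetical stand-in for the `dvc import` sub-parser; not DVC's actual code.
parser = argparse.ArgumentParser(prog="dvc import")
parser.add_argument(
    "--jobs",
    type=int,
    help=(
        # Assumed wording: names what runs in parallel instead of just "jobs".
        "Number of parallel download jobs used when fetching data "
        "from the source repository's remote storage."
    ),
)

# Parsing a sample command line converts the value to an int.
args = parser.parse_args(["--jobs", "4"])
print(args.jobs)
```

When the flag is omitted, `args.jobs` is `None`, so downstream code can tell "not set" apart from an explicit value.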
@shcheklein Let's not do that please, it is totally out of scope for this PR.
Though I do see your point 🙁
We could just refer to dvc pull/fetch/status here to smooth this out, e.g. "Please refer to dvc fetch help for description" or something like that.
I'm not referring to push/pull or any other commands (we can create a ticket for this if needed). My comment was about this specific help message only.
@shcheklein I understand that. This one was copied over from push/pull/etc, so the questions might arise there as well. If we use Please refer to dvc fetch ... here we'll dodge the bullet 😄 While still keeping the analogy correct, since this is pretty much a pull from an external repo.
Hmm ... I guess my take on this is that it's fine to change it only here (and propagate later if needed, if it's needed at all). Also, "Please refer to dvc fetch" will force us to kinda go and see fetch, and what's there, etc. It also complicates the UX. To be honest, I haven't seen that kind of redirect in the help messages.
I agree that we can change it here, and submit another one for push/pull/etc.
In fact per #4838 import --jobs option has a special meaning? "This external tracked data might be stored in a remote DVC repository. In this situation --job which controls the parallelism level for DVC to download data from remote storage." So seems like it's not really the same as in other commands?
Please refer to dvc fetch will force us to kinda go and see fetch... I haven't see that kind of redirects in the help messages.
Unrelated, but dvc get-url -h used to refer to import-url (for the url arg. details) I think. Now only import-url has the full list of URLs.
@jorgeorpinel dvc import clones an external repo and then pulls down the data, so I think it means the same as in dvc pull.
Looks good! Please check the unit tests, looks like you need to add the new flag there.
FAILED tests/unit/command/test_imp.py::test_import_no_exec - AssertionError: ...
FAILED tests/unit/command/test_imp.py::test_import - AssertionError: Expected...
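The failures are consistent with how mock-based assertions behave: `assert_called_once_with` compares every keyword argument, so a test written before the flag existed stops matching once the command forwards `jobs`. A self-contained sketch, assuming a hypothetical `run_import` helper and argument names that stand in for the real command code:

```python
from types import SimpleNamespace
from unittest import mock


def run_import(repo, args):
    # Hypothetical command body: forwards the new option to the repo call.
    return repo.imp(args.url, args.path, out=args.out, jobs=args.jobs)


repo = mock.Mock()
args = SimpleNamespace(url="repo_url", path="src", out="dst", jobs=4)
run_import(repo, args)

# An assertion written before the flag existed no longer matches,
# because every keyword argument is part of the comparison.
try:
    repo.imp.assert_called_once_with("repo_url", "src", out="dst")
    old_assertion_passes = True
except AssertionError:
    old_assertion_passes = False

# The updated assertion includes the new keyword and succeeds.
repo.imp.assert_called_once_with("repo_url", "src", out="dst", jobs=4)
```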
There are two problems left
Now we have ways of setting
@karajan1001 Great point! For now we could just keep it as is and not include jobs into dvcfile.
👍
No, that's just Azure being flaky :( Please don't mind it.
| Stage.PARAM_ALWAYS_CHANGED, | ||
| Stage.PARAM_MD5, | ||
| Stage.PARAM_DESC, | ||
| Stage.PARAM_JOBS, |
Let's not add it, this is too specific for your particular connection and might be undesired for other users.
Ah, I see that you don't save it, you just utilize the loading this way. Hm, I guess we could keep it as is 👍 Let me see...
Yes, without this I can't create a stage with the jobs member variable. Which stage params get saved is defined in dvc.schema.py.
|
|
||
| if not self.frozen and self.is_import: | ||
| sync_import(self, dry, force) | ||
| sync_import(self, dry, force, jobs=self.jobs) |
I think you should be able to avoid setting and using self.jobs and PARAM_JOBS. Just pass jobs=... to run in dvc/repo/imp_url.py
I had considered this, but stage.run didn't have any such variable before. It relied entirely on member variables, not on run-time state. Adding a variable to it breaks this and would cause some inconsistency in the DVC repo.
Feels like there should be a config-level thing to set it. There is a ticket for it already, IIRC. So I suggest just leaving as is for now. Very good observation though!
Maybe we should abandon --jobs and make the config take effect instead?
@karajan1001 Good point! Important consistency here is fetch/pull and import. repro is a bit odd like that with imports, where it tries to run instead of trying to pull. That's something that we are trying to fix with dvc repro --pull flag right now that will become default behaviour in the future.
So ideally there should indeed be a way of telling import that it should load config options from some config section and it should indeed save it in dvcfile. Smth like
remote:
config:
from-remote: myremote # handy to not have to remember remote name in the project you are importing from
and maybe smth like (pretty rough names, but just talking straight out of my head)
$ dvc import https://.... data --config-from-remote myremote
if we can make a generalization like that, then --jobs would indeed be no longer needed. This would be a more holistic solution, no doubt about that.
| def test_import_from_bare_git_repo( | ||
| tmp_dir, make_tmp_dir, erepo_dir, local_cloud | ||
| ): | ||
| ): # pylint:disable=unused-argument |
I don't think these changes are needed. You probably were missing some dependencies for pylint or something.
It's annoying, I just don't know how to fix this pylint error. I'll try to solve it later.
@karajan1001 Does it only show up locally? Try re-installing with pip install '.[all,tests]'.
| self.def_repo[self.PARAM_REV_LOCK] = repo.get_rev() | ||
|
|
||
| _, _, cache_infos = repo.fetch_external([self.def_path]) | ||
| _, _, cache_infos = repo.fetch_external([self.def_path], jobs=jobs) |
It wasn't quite apparent how this was being used. Finally, at the 10th/11th level, I found it using repo.cloud.pull(). That's quite deep. :(
Same for me; they need refactoring.
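As a rough picture of what a `jobs` value usually ends up controlling once it reaches the download layer, here is a minimal sketch using a thread pool. It is not DVC's implementation; `fetch_all` and `fake_download` are hypothetical names.

```python
from concurrent.futures import ThreadPoolExecutor


def fetch_all(paths, download, jobs=None):
    # `jobs` caps the number of worker threads; None lets the executor
    # choose its own default.
    with ThreadPoolExecutor(max_workers=jobs) as executor:
        # map() yields results in the same order as `paths`.
        return list(executor.map(download, paths))


def fake_download(path):
    # Stand-in for a real transfer (assumption for illustration).
    return f"fetched {path}"


results = fetch_all(["a", "b", "c"], fake_download, jobs=2)
print(results)
```

Passing the value as a plain keyword argument through each layer keeps the default (`None`) meaningful all the way down.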
|
|
||
| if not self.frozen and self.is_import: | ||
| sync_import(self, dry, force) | ||
| jobs = kwargs.get("jobs", None) |
| jobs = kwargs.get("jobs", None) | |
| jobs = kwargs.get("jobs") |
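The two spellings behave identically, since `dict.get` already returns `None` when the key is missing and no default is given. A quick check, with `run` as a hypothetical stand-in for the method in the diff:

```python
def run(**kwargs):
    # Both lookups fall back to None when "jobs" was not passed.
    explicit = kwargs.get("jobs", None)
    implicit = kwargs.get("jobs")
    return explicit, implicit


no_jobs = run()
with_jobs = run(jobs=4)
print(no_jobs, with_jobs)
```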
@skshetry With this change, tests on Windows would fail, so I have reverted it.
@karajan1001 Oops, I didn't notice this before merging. I think the Windows tests were failing for an unrelated reason. We fixed them yesterday; they were failing because of gitpython.
Co-authored-by: Saugat Pachhai <suagatchhetri@outlook.com>
skshetry
left a comment
Just one suggestion, looks good to me otherwise.
https://github.com/iterative/dvc/pull/4977/files#r535817055
Thank you, @karajan1001! 🙏
Fixes #4838
❗ I have followed the Contributing to DVC checklist.
📖 If this PR requires documentation updates, I have created a separate PR (or issue, at least) in dvc.org and linked it here.
doc for dvc import --jobs dvc.org#2008
Thank you for the contribution - we'll try to review it as soon as possible. 🙏
It works on my computer now, but some more tests are still needed.