Skip to content

New option jobs for dvc import#4977

Merged
efiop merged 8 commits into
treeverse:masterfrom
karajan1001:fix4838
Dec 11, 2020
Merged

New option jobs for dvc import#4977
efiop merged 8 commits into
treeverse:masterfrom
karajan1001:fix4838

Conversation

@karajan1001
Copy link
Copy Markdown
Contributor

@karajan1001 karajan1001 commented Nov 26, 2020

Fixes #4838

Thank you for the contribution - we'll try to review it as soon as possible. 🙏

jobs3
It works on my computer now, but some more tests are still needed.

Comment thread dvc/command/imp.py
"--jobs",
type=int,
help=(
"Number of jobs to run simultaneously. "
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know this might come from some other DVC commands, but let's reconsider this message please?

Number of jobs is not very informative. Number of parallel connections? Number of download jobs?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@shcheklein Let's not do that please, it is totally out of scope for this PR.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Though I do see your point 🙁

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could just refer to dvc pull/fetch/status here, to smooth this out. E.g. Please refer to dvc fetch help for description or something like that.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not referring to push/pull or any other commands (we can create a ticket for this if needed). this message in my comment was about this specific help message only.

Copy link
Copy Markdown
Contributor

@efiop efiop Nov 26, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@shcheklein I understand that. This one was copied over from push/pull/etc, so the questions might arise there as well. If we use Please refer to dvc fetch ... here we'll dodge the bullet 😄 While still keeping the analogy correct, since this is pretty much a pull from an external repo.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmm ... I guess, my take on this it's fine to change only here (and later propagate if needed, and if it's needed at all). Also Please refer to dvc fetch will force us to kinda go and see fetch, and what's there, etc, etc. Also it complicates UX. To be honest, I haven't see that kind of redirects in the help messages.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agreed that we can change it here, and submit another one for push/pull/etc.

Copy link
Copy Markdown
Contributor

@jorgeorpinel jorgeorpinel Dec 9, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In fact per #4838 import --jobs option has a special meaning? "This external tracked data might be stored in a remote DVC repository. In this situation --job which controls the parallelism level for DVC to download data from remote storage." So seems like it's not really the same as in other commands?

Please refer to dvc fetch will force us to kinda go and see fetch... I haven't see that kind of redirects in the help messages.

Unrelated, but dvc get-url -h used to refer to import-url (for the url arg. details) I think. Now only import-url has the full list of URLs.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jorgeorpinel dvc import clones an external repo, and then pull down the data. So I think it means the same as what in dvc pull.

Copy link
Copy Markdown
Contributor

@efiop efiop left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good! Please check the unit tests, looks like you need to add the new flag there.

FAILED tests/unit/command/test_imp.py::test_import_no_exec - AssertionError: ...
FAILED tests/unit/command/test_imp.py::test_import - AssertionError: Expected...

@karajan1001 karajan1001 changed the title [WIP] New option jobs for dvc import New option jobs for dvc import Nov 29, 2020
@karajan1001 karajan1001 changed the title New option jobs for dvc import [WIP] New option jobs for dvc import Nov 29, 2020
@karajan1001
Copy link
Copy Markdown
Contributor Author

@efiop

There are two problem left

  1. jobs argument is not stored in the dvcfile.

Now we have ways of setting jobs in dvc import and dvc pull after cloning a project with dvc import.
But we can't set it in dvc repro if the imported dataset changed.

  1. help message mentioned above.

FAILED tests/func/test_api.py::test_open_external[azure]
Is this fail caused by this PR?

@efiop
Copy link
Copy Markdown
Contributor

efiop commented Nov 29, 2020

jobs argument is not stored in the dvcfile.

@karajan1001 Great point! For now we could just keep it as is and not include jobs into dvcfile. jobs kinda depends on your particular connection to the server, so it doesn't feel right to have it in dvcfile for everyone. Feels like there should be a config-level thing to set it. There is a ticket for it already, IIRC. So I suggest just leaving as is for now. Very good observation though!

help message mentioned above.

👍

Is this fail caused by this PR?

No, that's just azure being flaky :( Please don't mind it.

Comment thread dvc/stage/__init__.py Outdated
Stage.PARAM_ALWAYS_CHANGED,
Stage.PARAM_MD5,
Stage.PARAM_DESC,
Stage.PARAM_JOBS,
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's not add it, this is too specific for your particular connection and might be undesired for other users.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, I see that you don't save it, you just utilize the loading this way. Hm, I guess we could keep it as is 👍 Let me see...

Copy link
Copy Markdown
Contributor Author

@karajan1001 karajan1001 Nov 29, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, without this, I can't create a stage with the jobs member variable. Stage params' saving is in dvc.schema.py.

Comment thread dvc/stage/__init__.py Outdated

if not self.frozen and self.is_import:
sync_import(self, dry, force)
sync_import(self, dry, force, jobs=self.jobs)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you should be able to avoid setting and using self.jobs and PARAM_JOBS. Just pass jobs=... to run in dvc/repo/imp_url.py

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I had considered this before but stage.run didn't have any variable before. This represents that it all depends on member variables not on the running status. Adding a variable to it break this, and this would cause some inconsistency of dvc repo.

Feels like there should be a config-level thing to set it. There is a ticket for it already, IIRC. So I suggest just leaving as is for now. Very good observation though!

Maybe we should abandon --jobs and makes config effect?

Copy link
Copy Markdown
Contributor

@efiop efiop Nov 30, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@karajan1001 Good point! Important consistency here is fetch/pull and import. repro is a bit odd like that with imports, where it tries to run instead of trying to pull. That's something that we are trying to fix with dvc repro --pull flag right now that will become default behaviour in the future.

So ideally there should indeed be a way of telling import that it should load config options from some config section and it should indeed save it in dvcfile. Smth like

remote:
  config:
    from-remote: myremote  # handy to not have to remember remote name in the project you are importing from

and maybe smth like (pretty rough names, but just talking straight out of my head)

$ dvc import https://.... data --config-from-remote myremote

if we can make a generalization like that, then --jobs would indeed be no longer needed. This would be a more holistic solution, no doubt about that.

Comment thread tests/func/test_import.py Outdated
def test_import_from_bare_git_repo(
tmp_dir, make_tmp_dir, erepo_dir, local_cloud
):
): # pylint:disable=unused-argument
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think these changes are needed. You probably were missing some dependencies for pylint or something.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's annoying, I just don't know how to fix this pylint error. I'll try to solve it later.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@karajan1001 It only shows up locally? Try re-installing pip install '.[all,tests]'.

@karajan1001 karajan1001 changed the title [WIP] New option jobs for dvc import New option jobs for dvc import Nov 30, 2020
Comment thread dvc/dependency/repo.py
self.def_repo[self.PARAM_REV_LOCK] = repo.get_rev()

_, _, cache_infos = repo.fetch_external([self.def_path])
_, _, cache_infos = repo.fetch_external([self.def_path], jobs=jobs)
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It wasn't quite apparent how this was being used. Finally, at the 10th/11th level, I found it using repo.cloud.pull(). That's quite deep. :(

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same to me, they need refactoring.

Comment thread dvc/stage/__init__.py

if not self.frozen and self.is_import:
sync_import(self, dry, force)
jobs = kwargs.get("jobs", None)
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
jobs = kwargs.get("jobs", None)
jobs = kwargs.get("jobs")

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@skshetry . With this change tests on Windows would fail, I have reverted it.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@karajan1001 Oops, didn't notice this before merging. I think windows tests were failing for an unrelated reason. We've fixed them yesterday, they were failing because of gitpython.

Comment thread dvc/stage/params.py Outdated
Co-authored-by: Saugat Pachhai <suagatchhetri@outlook.com>
Copy link
Copy Markdown
Collaborator

@skshetry skshetry left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just one suggestion, looks good to me otherwise.

https://github.com/iterative/dvc/pull/4977/files#r535817055

@efiop
Copy link
Copy Markdown
Contributor

efiop commented Dec 11, 2020

Thank you, @karajan1001 ! 🙏

@efiop efiop merged commit 96a97ec into treeverse:master Dec 11, 2020
@karajan1001 karajan1001 deleted the fix4838 branch April 7, 2021 07:41
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

import: --jobs option

5 participants