Skip to content

Initial pkg implementation#2012

Merged
efiop merged 2 commits into
treeverse:masterfrom
efiop:pkgs
Jun 17, 2019
Merged

Initial pkg implementation#2012
efiop merged 2 commits into
treeverse:masterfrom
efiop:pkgs

Conversation

@efiop
Copy link
Copy Markdown
Contributor

@efiop efiop commented May 16, 2019

  • Have you followed the guidelines in our
    Contributing document?

  • Does your PR affect documented changes or does it add new functionality
    that should be documented? If yes, have you created a PR for
    dvc.org documenting it or at
    least opened an issue for it? If so, please add a link to it.

Docs: treeverse/dvc.org#385


Todo:

Comment thread dvc/repo/pkg/__init__.py Outdated
@efiop efiop force-pushed the pkgs branch 6 times, most recently from 76098ab to 9cd10dd Compare May 25, 2019 23:24
@Suor
Copy link
Copy Markdown
Contributor

Suor commented May 27, 2019

I think it would be better to not add knowledge about everything to config. Config should provide .get(), .set(), .update(), etc. Load and save itself but should know nothing about its keys and structure. Otherwise we are creating a superclass.

Knowledge about config contents will be ideally contained within specific code area and not slip anywhere else. E.g. remote will be able to configure itself from config and update config as needed: add or update a remote, set default one.

@efiop efiop force-pushed the pkgs branch 3 times, most recently from 328a014 to 84210ed Compare May 28, 2019 20:06
@efiop efiop changed the title [WIP] Initial pkg implementation Initial pkg implementation May 28, 2019
@efiop
Copy link
Copy Markdown
Contributor Author

efiop commented May 28, 2019

@Suor

I think it would be better to not add knowledge about everything to config. Config should provide .get(), .set(), .update(), etc. Load and save itself but should know nothing about its keys and structure. Otherwise we are creating a superclass.

There is a config like that already. It is called ConfigObj 🙂 Our config is dealing with managing particular sections(e.g. remote and pkg) within it and so has the necessary notion to properly parse them(e.g. schema, path resolutions and so on).

Knowledge about config contents will be ideally contained within specific code area and not slip anywhere else. E.g. remote will be able to configure itself from config and update config as needed: add or update a remote, set default one.

Do you mean something like this

r = Remote(repo, {"url": "https://example.com"})
r.save() # saving to config
r.set_default() # setting it in config as default remote

?

@efiop efiop changed the title Initial pkg implementation [WIP] Initial pkg implementation May 28, 2019
@efiop efiop force-pushed the pkgs branch 5 times, most recently from 02733a9 to 15b4b97 Compare May 29, 2019 02:10
@efiop efiop changed the title [WIP] Initial pkg implementation Initial pkg implementation May 30, 2019
@efiop efiop changed the title Initial pkg implementation [WIP] Initial pkg implementation May 30, 2019
@efiop efiop force-pushed the pkgs branch 2 times, most recently from 1a8e7dc to 59e89bd Compare May 30, 2019 01:34
@efiop
Copy link
Copy Markdown
Contributor Author

efiop commented Jun 6, 2019

@Suor I'm not sure how you want to handle them as the same class. They are different instances, with different fields, not sure how you want to get rid of both in favor of one. The current patch at least uses factories instead of junky def Output/Dependency functions and generalizes a lot of stuff. I'll move that logic from this PR into a separate PR, so we could discuss that there.

@efiop
Copy link
Copy Markdown
Contributor Author

efiop commented Jun 6, 2019

Moved config-related stuff to #2098 . The factory refactorings will come separately.

@Suor
Copy link
Copy Markdown
Contributor

Suor commented Jun 6, 2019

@efiop I just want hordes of code to be moved to mirror files only for really good reason.

@efiop
Copy link
Copy Markdown
Contributor Author

efiop commented Jun 6, 2019

@Suor Unless something went wrong there, moving shouldn't have deleted the history. Like git mv doesn't, even if you introduce changes on top. Inheritance logic from this PR from dependency to output makes more sense than the one that we currently have in master. Maybe you have some specific way to reorganize those on your mind? Because it seems like I don't quite follow.

@Suor
Copy link
Copy Markdown
Contributor

Suor commented Jun 7, 2019

@efiop why output subslassing dependencies is better? It's the same but the other way around. So just moving code for no reason. Even if that has some sense, which I don't get, it is still not worth the hassle as code itself is not improving, i.e. won't be easier to support in the future, and making it abstractly correct in whatever sense is low value.

@efiop efiop force-pushed the pkgs branch 2 times, most recently from 4473fee to 910f9fd Compare June 11, 2019 13:45
@efiop
Copy link
Copy Markdown
Contributor Author

efiop commented Jun 11, 2019

@Suor Sorry, forgot to reply. The refactoring does make the code better, since it removes weird if self.is_dependency calls from outputs. Let's discuss it in more details in the refactoring PRs.

In the meantime, I've pushed MVP that is detached from all the previous refactorings (which are going to be submitted separately). Todo:

  • don't push pkg imported data to remote
  • add .dvc/pkg-list.yml similar to https://github.com/sindresorhus/package-json
  • add .dvc/pkg-lock.yml similar to https://flaviocopes.com/package-lock-json/ (install should add an entry to it)
  • support importing the whole package with dvc import mypkg. This is going to be similar to specifying multiple datas within the package, e.g. dvc import mypkg data1 data2, but we will have to collect it ourselves for the whole package.
  • add dvc pkg get for simply downloading data from the url
  • For that ^ we'll need to detach dst option into a -o flag, so that we could dvc import mypkg -o dir;
  • check that when running repro on import it will ask if it needs to remove modified outputs (will be implemented by run: warn and/or prompt user when deleting an output #2027)

So currently there are three commands: dvc pkg insta/uninstall/import.

Comment thread dvc/command/pkg.py Outdated
Comment thread dvc/stage.py Outdated
Comment thread dvc/pkg.py Outdated
Comment thread dvc/pkg.py Outdated
Copy link
Copy Markdown
Contributor

@pared pared left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't dvc/repo/_ignore() flist include self.pkg.pkg_dir ?

Comment thread tests/func/test_pkg.py Outdated
Comment thread dvc/pkg.py Outdated
Comment thread tests/func/test_pkg.py Outdated
@efiop efiop mentioned this pull request Jun 13, 2019
2 tasks
Comment thread dvc/repo/__init__.py Outdated
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not related not this PR: looks like either this message should be moved down the stack or we should add continue here.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We shouldn't have continue here, because we should push/pull/etc cache for the locked stage file itself. We could move this warning to repo.graph(), but it would be too specific to use it there. Hence why we put this one here.

Comment thread dvc/repo/__init__.py Outdated
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

as far as I remember we already have a mechanism to avoid caching of an output? can we reuse it here?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@shcheklein That is an amazing idea! I didn't think about it that way. But indeed, we could treat it as --outs-no-cache. Thanks! 🙂

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Turns out it is not easy to do, since things like adding to gitignore and checking out are use_cache dependant. I've added get_used_cache method to Output to handle this more gracefully, but using --outs-no-cache is not a very good option for now.

Comment thread dvc/stage.py Outdated
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if feels like an output can decide if it needs to be cached or not

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Output needs to know if he is a part of import stage, so this is still used.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not necessarily, when we create an output we can pass a flag. It can be and probably should be decided on the Cmd/API level. I can imagine we will have a flag to cache output anyway.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will keep that in mind in #2138 (added a comment about this discussion there as well).

Comment thread dvc/scm/git/tree.py Outdated
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should we fail if does not exist?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We will fail later down the line, where we will try to use this. I'll take a second look to see if failing here would provide better UX. Thanks for the heads up!

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Adjusted the error handling down the line to handle it more gracefully.

Comment thread dvc/pkg.py Outdated
Comment thread dvc/pkg.py Outdated
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

qq - does it avoid restoring the workspace as well?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@shcheklein Not sure what you mean. Could you elaborate?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I mean --no-checkout option of the git clone

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, nope. So we do checkout to have a default version installed. If any dependency is using some particular version, it will get it. But if you've installed some package of specific version by default, then all dependencies that don't have version set explicitly, will use that default installed version.

Comment thread dvc/pkg.py Outdated
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

btw, does it make sense to disable lock and sqllite for this case? just to make sure that this easy scenario works everywhere.

Copy link
Copy Markdown
Contributor Author

@efiop efiop Jun 16, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You mean nfs and friends, right? Good point, I'll take a look if we could disable state for that purpose. Also need to disable our own .dvc/lock as well

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not trivial, created #2135 for it. Might be easier to just support nfs after all.

Copy link
Copy Markdown
Contributor

@shcheklein shcheklein left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is looks great! A few minor comments and questions.

@jorgeorpinel please, review the help messages and command description. Let's work together to make then user-friendly.

Ruslan Kuprieiev added 2 commits June 17, 2019 18:12
Signed-off-by: Ruslan Kuprieiev <ruslan@iterative.ai>
Signed-off-by: Ruslan Kuprieiev <ruslan@iterative.ai>
This was referenced Jun 17, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants