Skip to content

[WIP] cmd: diff: don't load dir cache for diff#3468

Merged
efiop merged 1 commit into
treeverse:masterfrom
pared:3392_diff_cache
Mar 16, 2020
Merged

[WIP] cmd: diff: don't load dir cache for diff#3468
efiop merged 1 commit into
treeverse:masterfrom
pared:3392_diff_cache

Conversation

@pared
Copy link
Copy Markdown
Contributor

@pared pared commented Mar 10, 2020

  • ❗ Have you followed the guidelines in the Contributing to DVC list?

  • 📖 Check this box if this PR does not require documentation updates, or if it does and you have created a separate PR in dvc.org with such updates (or at least opened an issue about it in that repo). Please link below to your PR (or issue) in the dvc.org repo.

  • ❌ Have you checked DeepSource, CodeClimate, and other sanity checks below? We consider their findings recommendatory and don't expect everything to be addressed. Please review them carefully and fix those that actually improve code or fix bugs.

Thank you for the contribution - we'll try to review it as soon as possible. 🙏

Fixes #3392

@pared pared requested review from gurobokum and skshetry March 11, 2020 09:13
Copy link
Copy Markdown
Contributor

@gurobokum gurobokum left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P.S.
It looks like pythonic way but with simple loops it reads easier

@efiop
Copy link
Copy Markdown
Contributor

efiop commented Mar 13, 2020

Sorry for a late question, but does this mean that we lose support for dirs from now on? The issue was that when one version has .dir cache file and the other one doesn't, then the output is weird, right?

@pared
Copy link
Copy Markdown
Contributor Author

pared commented Mar 13, 2020

@efiop Yes, in a way. After this change, dir out will have the same behavior as non-dir one:
it will be displayed as added/deleted/modified as a whole.

@pared
Copy link
Copy Markdown
Contributor Author

pared commented Mar 13, 2020

After discussing with @efiop, its probably the best to apply a hybrid approach:
when the cache is available, list differences, if not, treat dir as if they were not dirs.

@gurobokum gurobokum self-requested a review March 13, 2020 15:42
@pared pared force-pushed the 3392_diff_cache branch from 19ead80 to dfd829d Compare March 13, 2020 18:16
@pared pared changed the title cmd: diff: don't load dir cache for diff [WIP] cmd: diff: don't load dir cache for diff Mar 13, 2020
@shcheklein
Copy link
Copy Markdown
Contributor

@pared @efiop I have some doubts about this - hybrid approach means that users will see different results in different situations, and will be happening silently for them.

Or am I missing something about this hybrid approach? Can we highlight those dirs somehow, for example?

cc @dmpetrov .

@dmpetrov
Copy link
Copy Markdown
Contributor

@pared @efiop I have some doubts about this - hybrid approach means that users will see different results in different situations, and will be happening silently for them.

Totally agree with @shcheklein It will lead to unstable diffs.

I'd suggest ignoring all dirs content (option 3 from #3392). This option has additional advantages:

  1. No need in the cache and no need in data-remote credentials which is important for some CI/CD scenarios.
  2. Easy to implement.

@efiop
Copy link
Copy Markdown
Contributor

efiop commented Mar 13, 2020

Ok, I thought that dir cache support was desirable, but if it is not, then indeed let's just drop it. Thanks for clarifying. 🙏

@pared Sorry for this back and forth, please revert the hybrid changes.

@shcheklein
Copy link
Copy Markdown
Contributor

@efiop my 2cs - it's desirable, but it should be done along with support for dvc list and other commands that require it (what about gc for example?).

@efiop
Copy link
Copy Markdown
Contributor

efiop commented Mar 15, 2020

it's desirable, but it should be done along with support for dvc list and other commands that require it

In that case, I'm not sure why can't we start with diff here, since it is on the surface already. And then proceed with dvc list and others.

what about gc for example?

Gc will attempt to pull .dir to analyze it, unlike diff/list/etc.

@shcheklein
Copy link
Copy Markdown
Contributor

In that case, I'm not sure why can't we start with diff here, since it is on the surface already

I guess there is some misalignment on what do we think can be considered as a start? There were a few different suggestions that @pared brought - from very simple (never do for dirs) to the actual implementation (try and pull dir file). I personally fine to start with either. The only one I had concerns about is the hybrid one - when we silently behave in a different way depending on cache availability.

@dmpetrov
Copy link
Copy Markdown
Contributor

I'm not sure why can't we start with diff here

First, because here a hybrid approach is proposed which does not give a reliable result.
Second, diff without cache is a required feature #2998 (closed due to this issue). It would be great to have asap because ci/cd needs it.

A good strategy might be - implement quickly the simple strategy (and make sure it does not require cache) and then prioritize the full solution with data dir check.

@efiop
Copy link
Copy Markdown
Contributor

efiop commented Mar 16, 2020

Ok, thanks for clarifying. Sure, let's start with the simplest one first to unblock ci/cd. 👍

@pared pared force-pushed the 3392_diff_cache branch from dfd829d to 3cec9a4 Compare March 16, 2020 10:11
@efiop efiop merged commit 87338c9 into treeverse:master Mar 16, 2020
@pared pared deleted the 3392_diff_cache branch March 24, 2020 09:36
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

diff: does not behave properly when cache is not available

5 participants