introduce data:status command by skshetry · Pull Request #7663 · treeverse/dvc

skshetry · 2022-04-29T09:24:46Z

This command is hidden and still requires a lock (which is not necessary and can be monkey-patched if required).
The data is collected from repo.diff/repo.status/repo.ls and repo.scm.status and merged.
It does not yet follow the schema that we have in the proposal (it is still unclear there too). For now, it tries to imitate what VSCode extension does (to some degree, I still need confirmation 😄).

This should not be hard to maintain as it uses higher-level APIs that are unlikely to change and should provide us with something to compare.

efiop · 2022-04-29T11:28:16Z

@skshetry It still lacks major features like status granularity for files in datasets and has major performance implications. I'm worried that it might be wasteful to try integrating it right now. This is a good start for a mockup, but lets not merge it for now. If one wants to test it with vscode, it is possible to just install from this PR.

skshetry · 2022-04-29T13:58:51Z

@skshetry It still lacks major features like status granularity for files in datasets and has major performance implications.

Let me share some numbers. In one scenario, #7659 dropped the total time to run those 3 commands for the extension from 26.4s (min) to 8.4s (min).

This single unified command drops that further to 2.6s (min). Half of that time is spent on cloning dataset-registry. So a persistent clone caching and an upcoming patch by @dtrifiro (#7597) might reduce this further.

"status granularity" is not being used by VSCode right now, and is not very important for a command that is hidden.
Regarding performance implications, it's the same story with the rest of the commands. The improvements are going to be in the lower level. Again, I would not worry about that for a hidden command.

I'd say that this gives us something to play with, understand what the current state is and test on different scenarios.
As we're trying to replace 3 different commands, right now, it's not even clear what the issues are for VSCode (at least I have no idea).
This also provides a baseline for us to work with and benchmark.

I'd really prefer this to be merged, and I have tried to keep the changes to other parts of DVC very minimal so that we can easily get rid of the command altogether in the future.

efiop · 2022-04-29T14:39:10Z

@skshetry Thanks for the research! 🙏

I agree with your points, but I'm still concerned that we are merging an untested hacky code that could be easily played with/tried out through installing from PR. But if you think that this will really help the progress, I'm not going to block it.

skshetry · 2022-04-29T15:39:16Z

Seeing those numbers, I am a bit confused to be honest why we have that proposal though. Yes, it's for just one simple scenario, but why didn't we look for solving low-hanging fruits first?

The optimizations that we'll be working on are going to improve this and other commands, but those are not going to have a drastic improvement for the kind of repositories that the extension may be targeting initially.

From the proposal, it seems to me that having a unified command is more important than improving the performance. Well, performance improvements are going to be gradual and targeted to different scenarios, I don't see why it's needed on day 1.

efiop · 2022-04-30T22:15:03Z

@skshetry We don't really have one typical repository that we are aiming at (though I know vscode team is working on some examples). Performance concerns are mostly about dvc itself, they are just even more annoying when using the extension. The idea is to build it up properly (to a reasonable extent) from the day 1, so that we don't have a long trail of major architectural issues that will be affecting new extension users. Underlying data/objects stuff can be hacked to kinda work right now, but it requires a few more steps until it is of proper quality and covered by proper granular tests/benchmarks. E.g. in #7597 it is still not a matter of 3-5 seconds for a large dataset, which is not ideal for the use in the extension 🙁

introduce data:status command

1892226

skshetry added skip-changelog Skips changelog A: cli Related to the CLI labels Apr 29, 2022

skshetry requested review from dberenbaum, dtrifiro, efiop and mattseddon April 29, 2022 09:24

skshetry self-assigned this Apr 29, 2022

skshetry requested a review from a team as a code owner April 29, 2022 09:24

patch dvc.scm.clone instead of scmrepo.git.Git

89ad11c

skshetry closed this Jun 8, 2022

skshetry deleted the data-status branch June 8, 2022 16:41

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

introduce data:status command#7663

introduce data:status command#7663
skshetry wants to merge 2 commits into
treeverse:mainfrom
skshetry:data-status

skshetry commented Apr 29, 2022

Uh oh!

efiop commented Apr 29, 2022

Uh oh!

skshetry commented Apr 29, 2022 •

edited

Loading

Uh oh!

efiop commented Apr 29, 2022

Uh oh!

skshetry commented Apr 29, 2022

Uh oh!

efiop commented Apr 30, 2022

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

skshetry commented Apr 29, 2022

Uh oh!

efiop commented Apr 29, 2022

Uh oh!

skshetry commented Apr 29, 2022 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

efiop commented Apr 29, 2022

Uh oh!

skshetry commented Apr 29, 2022

Uh oh!

efiop commented Apr 30, 2022

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

skshetry commented Apr 29, 2022 •

edited

Loading