Skip to content

introduce data:status command#7663

Closed
skshetry wants to merge 2 commits into
treeverse:mainfrom
skshetry:data-status
Closed

introduce data:status command#7663
skshetry wants to merge 2 commits into
treeverse:mainfrom
skshetry:data-status

Conversation

@skshetry
Copy link
Copy Markdown
Collaborator

This command is hidden and still requires a lock (which is not necessary and can be monkey-patched if required).
The data is collected from repo.diff/repo.status/repo.ls and repo.scm.status and merged.
It does not yet follow the schema that we have in the proposal (it is still unclear there too). For now, it tries to imitate what VSCode extension does (to some degree, I still need confirmation 😄).

This should not be hard to maintain as it uses higher-level APIs that are unlikely to change and should provide us with something to compare.

@skshetry skshetry added skip-changelog Skips changelog A: cli Related to the CLI labels Apr 29, 2022
@skshetry skshetry self-assigned this Apr 29, 2022
@skshetry skshetry requested a review from a team as a code owner April 29, 2022 09:24
@efiop
Copy link
Copy Markdown
Contributor

efiop commented Apr 29, 2022

@skshetry It still lacks major features like status granularity for files in datasets and has major performance implications. I'm worried that it might be wasteful to try integrating it right now. This is a good start for a mockup, but lets not merge it for now. If one wants to test it with vscode, it is possible to just install from this PR.

@skshetry
Copy link
Copy Markdown
Collaborator Author

skshetry commented Apr 29, 2022

@skshetry It still lacks major features like status granularity for files in datasets and has major performance implications.

Let me share some numbers. In one scenario, #7659 dropped the total time to run those 3 commands for the extension from 26.4s (min) to 8.4s (min).

This single unified command drops that further to 2.6s (min). Half of that time is spent on cloning dataset-registry. So a persistent clone caching and an upcoming patch by @dtrifiro (#7597) might reduce this further.

"status granularity" is not being used by VSCode right now, and is not very important for a command that is hidden.
Regarding performance implications, it's the same story with the rest of the commands. The improvements are going to be in the lower level. Again, I would not worry about that for a hidden command.

I'd say that this gives us something to play with, understand what the current state is and test on different scenarios.
As we're trying to replace 3 different commands, right now, it's not even clear what the issues are for VSCode (at least I have no idea).
This also provides a baseline for us to work with and benchmark.

I'd really prefer this to be merged, and I have tried to keep the changes to other parts of DVC very minimal so that we can easily get rid of the command altogether in the future.

@efiop
Copy link
Copy Markdown
Contributor

efiop commented Apr 29, 2022

@skshetry Thanks for the research! 🙏

I agree with your points, but I'm still concerned that we are merging an untested hacky code that could be easily played with/tried out through installing from PR. But if you think that this will really help the progress, I'm not going to block it.

@skshetry
Copy link
Copy Markdown
Collaborator Author

Seeing those numbers, I am a bit confused to be honest why we have that proposal though. Yes, it's for just one simple scenario, but why didn't we look for solving low-hanging fruits first?

The optimizations that we'll be working on are going to improve this and other commands, but those are not going to have a drastic improvement for the kind of repositories that the extension may be targeting initially.

From the proposal, it seems to me that having a unified command is more important than improving the performance. Well, performance improvements are going to be gradual and targeted to different scenarios, I don't see why it's needed on day 1.

@efiop
Copy link
Copy Markdown
Contributor

efiop commented Apr 30, 2022

@skshetry We don't really have one typical repository that we are aiming at (though I know vscode team is working on some examples). Performance concerns are mostly about dvc itself, they are just even more annoying when using the extension. The idea is to build it up properly (to a reasonable extent) from the day 1, so that we don't have a long trail of major architectural issues that will be affecting new extension users. Underlying data/objects stuff can be hacked to kinda work right now, but it requires a few more steps until it is of proper quality and covered by proper granular tests/benchmarks. E.g. in #7597 it is still not a matter of 3-5 seconds for a large dataset, which is not ideal for the use in the extension 🙁

@skshetry skshetry closed this Jun 8, 2022
@skshetry skshetry deleted the data-status branch June 8, 2022 16:41
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

A: cli Related to the CLI skip-changelog Skips changelog

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants