introduce data:status command#7663
Conversation
|
@skshetry It still lacks major features like status granularity for files in datasets and has major performance implications. I'm worried that it might be wasteful to try integrating it right now. This is a good start for a mockup, but lets not merge it for now. If one wants to test it with vscode, it is possible to just install from this PR. |
Let me share some numbers. In one scenario, #7659 dropped the total time to run those 3 commands for the extension from 26.4s (min) to 8.4s (min). This single unified command drops that further to 2.6s (min). Half of that time is spent on cloning "status granularity" is not being used by VSCode right now, and is not very important for a command that is hidden. I'd say that this gives us something to play with, understand what the current state is and test on different scenarios. I'd really prefer this to be merged, and I have tried to keep the changes to other parts of DVC very minimal so that we can easily get rid of the command altogether in the future. |
|
@skshetry Thanks for the research! 🙏 I agree with your points, but I'm still concerned that we are merging an untested hacky code that could be easily played with/tried out through installing from PR. But if you think that this will really help the progress, I'm not going to block it. |
|
Seeing those numbers, I am a bit confused to be honest why we have that proposal though. Yes, it's for just one simple scenario, but why didn't we look for solving low-hanging fruits first? The optimizations that we'll be working on are going to improve this and other commands, but those are not going to have a drastic improvement for the kind of repositories that the extension may be targeting initially. From the proposal, it seems to me that having a unified command is more important than improving the performance. Well, performance improvements are going to be gradual and targeted to different scenarios, I don't see why it's needed on day 1. |
|
@skshetry We don't really have one typical repository that we are aiming at (though I know vscode team is working on some examples). Performance concerns are mostly about dvc itself, they are just even more annoying when using the extension. The idea is to build it up properly (to a reasonable extent) from the day 1, so that we don't have a long trail of major architectural issues that will be affecting new extension users. Underlying data/objects stuff can be hacked to kinda work right now, but it requires a few more steps until it is of proper quality and covered by proper granular tests/benchmarks. E.g. in #7597 it is still not a matter of 3-5 seconds for a large dataset, which is not ideal for the use in the extension 🙁 |
This command is hidden and still requires a lock (which is not necessary and can be monkey-patched if required).
The data is collected from
repo.diff/repo.status/repo.lsandrepo.scm.statusand merged.It does not yet follow the schema that we have in the proposal (it is still unclear there too). For now, it tries to imitate what VSCode extension does (to some degree, I still need confirmation 😄).
This should not be hard to maintain as it uses higher-level APIs that are unlikely to change and should provide us with something to compare.