remote: continue remote/cache/tree refactoring#4041
Conversation
* operates on obj.tree, so that GC can be done with both remote and cache instances
b27d57c to
aef4701
Compare
| def _remove_unpacked_dir(self, checksum): | ||
| pass | ||
|
|
||
| class CloudCache: |
There was a problem hiding this comment.
Quite strange why it is CloudCache and not just Cache? And why LocalCache is the only exception?
There was a problem hiding this comment.
Ah, right, because of the push/pull. Ok, makes sense.
There was a problem hiding this comment.
I went with CloudCache to avoid the collision with dvc.cache.Cache, but maybe this should be Cache, and the other one should be RepoCache or something along those lines.
There was a problem hiding this comment.
@pmrowla Makes sense. I'm fine with CloudCache for now.
There was a problem hiding this comment.
We can also get rid of LocalCache in the future by making push/pull/status more generic for syncing data between any Cache and any Remote, rather than having it be specific to LocalCache, but I don't think there's any pressing need for that right now.
| def get_remote(repo, **kwargs): | ||
| tree = get_cloud_tree(repo, **kwargs) | ||
| if tree.scheme == "local": | ||
| return LocalRemote(tree) | ||
| if tree.scheme == "ssh": | ||
| return SSHRemote(tree) | ||
| return Remote(tree) |
There was a problem hiding this comment.
For initializing caches, we now just do
tree = get_cloud_tree(<remote config>)
cache = CloudCache(tree)
But for remotes, we still have this helper method. It would be ideal if we could just always use Remote(tree) but LocalRemote and SSHRemote still have their own overridden batch exists methods. It may be worth investigating whether or not the custom ssh/local methods are still needed with the other performance improvements that we have now.
There was a problem hiding this comment.
Or maybe those custom methods could be moved to the trees. IIRC ssh uses some custom sftp channel handling, which might be abstracted away too. But surely could do that later. Maybe consider creating an issue for it.
There was a problem hiding this comment.
One other question I had was whether or not the indexing related code should also just go into the trees at some point instead of being DVC Remote specific
There was a problem hiding this comment.
@pmrowla Oh, that's a very good question! To me, it seems purely remote-related, as it solely relies on remote structure. But how do you see it?
There was a problem hiding this comment.
Since it's remote specific it makes sense to keep it in the Remote class for now, but I wasn't sure if indexing any kind of content (as opposed to just DVC Remotes) in a remote/cloud tree is something we might want to support in the future.
There was a problem hiding this comment.
I'm not sure there is a good way to index arbitrary data in that way, it is really tailored for our cache structure, so I really doubt that full index logic will ever make its way into the Tree. But, I'm sure that parts of it will, as many functions are actually quite generic (like walk_files replacing _list_cache).
❗ I have followed the Contributing to DVC checklist.
📖 If this PR requires documentation updates, I have created a separate PR (or issue, at least) in dvc.org and linked it here. If the CLI API is changed, I have updated tab completion scripts.
❌ I will check DeepSource, CodeClimate, and other sanity checks below. (We consider them recommendatory and don't expect everything to be addressed. Please fix things that actually improve code or fix bugs.)
Thank you for the contribution - we'll try to review it as soon as possible. 🙏
Related to #3882.
remote.Remoteandremote.Cachehelper functions with actualRemoteandCloudCacheclassesRemoteandCloudCacheconstructors take a RemoteTree parameterremote.get_cloud_treehelper method can be used to return the proper tree for a given URL