perf: remote indexes #2147
Closed
Labels
enhancement, feature request, performance, question, research
As we discussed, making DVC fast should be a high priority, since poor performance can easily drive people away. A big part of today's slowness comes from working with remotes, which almost always involves collecting file statuses, and that can be slow for bigger remotes. All of this points toward some form of indexes.
However, our remotes don't offer the luxury of atomic group writes, atomic group reads, or read-modify-write operations. We can still use the following strategy:
1. Read the current index, `1.idx`.
2. Write the updated index as `2.<uuid>.idx`.
3. Delete `1.idx`.

This way, in case of a race, we will have several index files.
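One possible reading of the strategy above, as a minimal sketch: read every index file present, write the result under a new unique name, then delete the files that were read. The `remote` dict is a stand-in for a real remote, and `update_index`, the generation-number parsing, and the plain set-union combine are all assumptions made for illustration, not existing DVC code.

```python
import uuid


def update_index(remote, mutate):
    """Apply `mutate` to the combined index and publish it under a new name.

    `remote` is a hypothetical stand-in: a dict mapping file name -> set of entries.
    """
    # Snapshot the index files visible right now (there may be several after a race).
    old_names = [name for name in remote if name.endswith(".idx")]

    merged = set()
    generation = 0
    for name in old_names:
        merged |= remote[name]  # naive union; does not handle deletes
        generation = max(generation, int(name.split(".")[0]))

    # Write the new index under a unique name so concurrent writers never collide.
    new_name = f"{generation + 1}.{uuid.uuid4().hex}.idx"
    remote[new_name] = mutate(set(merged))

    # Only now remove the files we actually read.
    for name in old_names:
        remote.pop(name, None)
    return new_name
```

Because the new file is written before the old ones are deleted, a reader that arrives mid-update sees extra index files rather than none.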
Since we will have not only adds but also deletes, we will need a smart combine procedure, like in CRDTs.
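To make the combine step concrete, here is a sketch of one CRDT-style option, a last-writer-wins map. Each entry carries a `(version, present)` record, where `present=False` is a tombstone for a delete. The record layout and the names `merge_indexes` and `live_files` are hypothetical, and how versions would be assigned is left open.

```python
def merge_indexes(*indexes):
    """Combine index dicts of the form {key: (version, present)}.

    For each key the highest record wins; on equal versions an add
    (present=True) beats a delete because (v, True) > (v, False).
    """
    merged = {}
    for index in indexes:
        for key, record in index.items():
            if key not in merged or record > merged[key]:
                merged[key] = record
    return merged


def live_files(index):
    """Return the keys that are currently present (not tombstoned)."""
    return {key for key, (_, present) in index.items() if present}
```

Taking a per-key maximum makes the merge commutative, associative, and idempotent, so the order in which racing index files are combined does not matter.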
The file format is still to be discussed, though simple JSON or gzipped JSON with a list of files may do the job.
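As a rough illustration of the gzipped JSON option, the sketch below round-trips a list of files using only the standard library. The `{"files": [...]}` layout and the function names are assumptions, not a proposed final format.

```python
import gzip
import json


def dump_index(files):
    """Serialize an iterable of file names to gzipped JSON bytes."""
    payload = json.dumps({"files": sorted(files)}).encode("utf-8")
    return gzip.compress(payload)


def load_index(blob):
    """Parse gzipped JSON bytes back into a set of file names."""
    data = json.loads(gzip.decompress(blob).decode("utf-8"))
    return set(data["files"])
```

Sorting the names before dumping keeps the output stable for the same set of files, which may help when comparing index files.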
What do you guys think? @shcheklein @dmpetrov @efiop @pared @MrOutis