remote/cache: consider de-duplication for .dir files #3791

@pmrowla

Currently, when handling directories, we need to generate the JSON .dir file containing the MD5 + relpath for every file in the directory. When we add or modify a file in that directory, we have to create a new .dir file containing the full directory contents for the new directory revision. When dealing with very large directories, this amounts to a significant amount of storage. For a dir with 1M files, even if only a single file has changed between two revisions, we currently need to generate and store two separate (nearly identical) JSON files, each containing 1M entries.
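
To make the duplication concrete, here is a rough sketch (not DVC's actual code). It assumes a .dir manifest is a JSON list of `{"md5", "relpath"}` entries, which matches the description above but may differ in detail from the real on-disk format. `build_manifest` and `shared_entries` are hypothetical helpers, and the directory is simulated in memory with 1000 files instead of 1M.

```python
import hashlib
import json


def build_manifest(files):
    """Return a .dir-style manifest: one {"md5", "relpath"} entry per file.

    `files` maps relpath -> bytes content (a hypothetical in-memory
    stand-in for a real directory on disk).
    """
    return sorted(
        (
            {"md5": hashlib.md5(data).hexdigest(), "relpath": relpath}
            for relpath, data in files.items()
        ),
        key=lambda entry: entry["relpath"],
    )


def shared_entries(old, new):
    """Count entries of `new` that appear unchanged in `old`."""
    old_set = {(e["relpath"], e["md5"]) for e in old}
    return sum((e["relpath"], e["md5"]) in old_set for e in new)


# Two revisions of a 1000-file directory that differ in a single file.
rev1_files = {f"data/{i:04d}.txt": str(i).encode() for i in range(1000)}
rev2_files = dict(rev1_files)
rev2_files["data/0042.txt"] = b"changed"

rev1 = build_manifest(rev1_files)
rev2 = build_manifest(rev2_files)

# Every entry except one is stored twice, once per revision.
duplicated = shared_entries(rev1, rev2)
rev2_bytes = len(json.dumps(rev2))
print(f"{duplicated} of {len(rev2)} entries duplicated, {rev2_bytes} bytes re-stored")
```

With one changed file, all but one entry of the second manifest is a copy of what the first manifest already stores.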

Ideally we should not be duplicating data between these directory versions. For a new directory version, we should only be storing data for the files which have changed between revisions.
Essentially, we want to store the diff between two directory trees rather than two full directory trees, but exactly how we should store that needs to be researched. One suggestion was that we look into how git versions directory trees (discord context).
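
One possible shape for "store only the diff", as a minimal sketch under the same assumption that a manifest is a list of `{"md5", "relpath"}` entries. The delta layout (`changed` / `deleted`) and the helper names `diff_manifests` and `apply_delta` are made up for illustration; nothing here is an existing DVC API, and the md5 values are placeholders.

```python
def diff_manifests(old, new):
    """Return only what changed between two manifests.

    "changed" holds added or modified entries; "deleted" holds relpaths
    that exist in `old` but not in `new`.
    """
    old_map = {e["relpath"]: e["md5"] for e in old}
    new_map = {e["relpath"]: e["md5"] for e in new}
    changed = [
        {"md5": md5, "relpath": relpath}
        for relpath, md5 in sorted(new_map.items())
        if old_map.get(relpath) != md5
    ]
    deleted = sorted(relpath for relpath in old_map if relpath not in new_map)
    return {"changed": changed, "deleted": deleted}


def apply_delta(old, delta):
    """Rebuild the full new manifest from the base manifest plus a delta."""
    merged = {e["relpath"]: e["md5"] for e in old}
    for relpath in delta["deleted"]:
        merged.pop(relpath, None)
    for entry in delta["changed"]:
        merged[entry["relpath"]] = entry["md5"]
    return [{"md5": md5, "relpath": relpath} for relpath, md5 in sorted(merged.items())]


# Placeholder manifests (md5 values are fake, for illustration only).
base = [
    {"md5": "aaa", "relpath": "a.txt"},
    {"md5": "bbb", "relpath": "b.txt"},
    {"md5": "ccc", "relpath": "c.txt"},
]
updated = [
    {"md5": "aaa", "relpath": "a.txt"},
    {"md5": "b2b", "relpath": "b.txt"},  # modified
    {"md5": "ddd", "relpath": "d.txt"},  # added; c.txt was deleted
]

delta = diff_manifests(base, updated)
restored = apply_delta(base, delta)
```

A delta like this is small, but reading a revision requires its base manifest (and potentially a chain of earlier deltas) to be available. Git's tree objects avoid that by hashing each subdirectory separately, so unchanged subtrees are shared by hash between revisions; which trade-off suits the cache is part of what needs researching.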

This should especially be considered now that we are discussing other potential changes to our cache structure.

Labels: enhancement (Enhances DVC), p2-medium (Medium priority, should be done, but less important), performance (improvement over resource / time consuming tasks), research
