allow excluding parent and including child, fixes #2314#2322
allow excluding parent and including child, fixes #2314#2322ThomasWaldmann merged 1 commit intoborgbackup:masterfrom
Conversation
enkore
left a comment
There was a problem hiding this comment.
I think we should first discuss the approach. With this change Borg would recurse into all excluded directories, which can take significant time, even if no files would be added from these.
|
@enkore yup, that's bad. see my comment in the ticket. |
|
Any conclusions on how to move forward with this? Is TW's suggestion OK, or should something else be done? His suggestion, from #2314 is: |
|
See my comments in #2314, i.e. feel free to go ahead implementing it. Which character is which can be easily changed later, if need be. |
Codecov Report
@@ Coverage Diff @@
## master #2322 +/- ##
=========================================
+ Coverage 82.91% 83% +0.09%
=========================================
Files 20 20
Lines 7556 7575 +19
Branches 1288 1288
=========================================
+ Hits 6265 6288 +23
+ Misses 930 928 -2
+ Partials 361 359 -2
Continue to review full report at Codecov.
|
|
Ok, big update -- lots of refactoring, and the "!" feature is now implemented. Sorry for so many changes, but I'm hoping you'll see their beauty ;) Here's a summary of the important changes:
(still need to add tests, but that should be trivial -- for now, I'd like it if the code could be reviewed) |
|
@edgimar could you please use rebase onto master and not merge master? also, rephrase some commit comments so they are more meaningful than "refactoring" (e.g. state what code you refactored, what the goal was, ...). Maybe create a backup, update your local master from the main repo, switch back to your branch again and then do Also: are the commit comments starting with "-" intended to be kept that way? |
|
Sorry, I do intend to clean up the repo (rebasing, collapsing commits, good commit messages), but at this point it's not probably worth looking at prior commits since most of them are just taken as "snapshots" of the work in progress and not based on atomic sets of changes. The best is probably to just look at the overall diff between the latest commit and borgbackup:master. If any further changes are needed to the code, let me know and I'll try to update it. After everything looks OK, I'll deal with rebasing / etc. |
|
Ok, here's a complete list of the changes, hopefully in sufficient detail:
|
| return | ||
|
|
||
| if stat.S_ISDIR(st.st_mode) and matcher.recurse_dir: | ||
| recurse_excluded_dir = True |
There was a problem hiding this comment.
naming: isn't this rather a state flag than a command flag?
would recursing_excluded_dir better describe what it tells?
There was a problem hiding this comment.
it doesn't matter much to me what it is named, but I chose recurse (as a command verb) because in my mind the recursing hasn't happened yet at that point, and will only happen later on, based on the value of the flag which is set here.
src/borg/helpers.py
Outdated
| elif pattern.ptype is PatternStyle: | ||
| fallback = pattern.pattern | ||
| ie_pattern = parse_inclexcl_command(patternstr, fallback=fallback) | ||
| if ie_pattern.cmd is IECommand.RootPath: |
There was a problem hiding this comment.
not sure: is "is" the right way to compare to a enum member or rather "=="?
There was a problem hiding this comment.
Either will work -- but enum values are singletons, so is works fine, and I believe it is the "right way" to compare them.
|
|
||
|
|
||
| def get_pattern_style(prefix): | ||
| class IECommand(Enum): |
There was a problem hiding this comment.
naming: how about "PatternType" or so?
IECommand sounds strange (not only because IE=Internet Explorer, or IEC = International Electrotechnical Commission, but primarily because it is not a command).
There was a problem hiding this comment.
Yes, I agree that IE is a little bit cryptic -- it could simply be InclExcl instead of IE, but I thought that made referencing the values kind of a lot to type. I don't like "PatternType" because these really are more general commands (in some cases, which indicate what should be done with a particular pattern). Also, it can lead to confusion about what is meant by "pattern type" -- elsewhere in the code this would mean, for example, whether it is a re or sh pattern, etc.
src/borg/helpers.py
Outdated
| cmd_prefix_map = { | ||
| '-': IECommand.Exclude, # False | ||
| '!': IECommand.ExcludeNoRecurse, | ||
| '+': IECommand.Include, # True |
There was a problem hiding this comment.
is the "True" (and "False" 2 lines above) worth keeping? You just got rid of it, right?
There was a problem hiding this comment.
Yes, that's old stuff that should be removed.
|
Ok, I've updated this PR with what I think should be the final changes (apart from squashing commits). All the tests pass on my machine, so can someone explain what is causing Travis to fail? |
fe0392c to
d945dd1
Compare
|
Revisions are now squashed and rebased. What remains to be done to get this merged? |
|
@edgimar |
|
@ThomasWaldmann The problems with the tests should now be fixed. I wanted to avoid using a default value for the recurse_dir arg, but it seems easier to just have one. |
This fixes the problem raised by issue borgbackup#2314 by requiring that each root subtree be fully traversed. The problem occurs when a patterns file excludes a parent directory P later in the file, but earlier in the file a subdirectory S of P is included. Because a tree is processed recursively with a depth-first search, P is processed before S is. Previously, if P was excluded, then S would not even be considered. Now, it is possible to recurse into P nonetheless, while not adding P (as a directory entry) to the archive. With this commit, a `-` in a patterns-file will allow an excluded directory to be searched for matching descendants. If the old behavior is desired, it can be achieved by using a `!` in place of the `-`. The following is a list of specific changes made by this commit: * renamed InclExclPattern named-tuple -> CmdTuple (with names 'val' and 'cmd'), since it is used more generally for commands, and not only for representing patterns. * represent commands as IECommand enum types (RootPath, PatternStyle, Include, Exclude, ExcludeNoRecurse) * archiver: Archiver.build_matcher() paths arg renamed -> include_paths to prevent confusion as to whether the list of paths are to be included or excluded. * helpers: PatternMatcher has recurse_dir attribute that is used to communicate whether an excluded dir should be recursed (used by Archiver._process()) * archiver: Archiver.build_matcher() now only returns a PatternMatcher instance, and not an include_patterns list -- this list is now created and housed within the PatternMatcher instance, and can be accessed from there. * moved operation of finding unmatched patterns from Archiver to PatternMatcher.get_unmatched_include_patterns() * added / modified some documentation of code * renamed _PATTERN_STYLES -> _PATTERN_CLASSES since "style" is ambiguous and this helps clarify that the set contains classes and not instances. * have PatternBase subclass instances store whether excluded dirs are to be recursed. Because PatternBase objs are created corresponding to each +, -, ! command it is necessary to differentiate - from ! within these objects. * add test for '!' exclusion rule (which doesn't recurse)
|
@ThomasWaldmann, @enkore - any further changes needed? |
| if not dry_run: | ||
| status = archive.process_dir(path, st) | ||
| if not recurse_excluded_dir: | ||
| status = archive.process_dir(path, st) |
There was a problem hiding this comment.
there might be a problem here:
if the code recurses into an excluded dir (searching for stuff that is included again in directories deeper below),
it does not archive the intermediate directories, so we won't have their metadata (like uid/gid/mode/...) in the archive.
at extract time, borg will automatically create missing intermediate dirs, but it can not set the correct metadata on them (not even when doing a full extract) as it does not have them in the archive.
There was a problem hiding this comment.
Hm, does it matter if the intermediate dirs are archived before or after the nested included path? It could be possible to add the intermediate dirs (at the time a nested included path is archived) to a list of paths to be archived later.
There was a problem hiding this comment.
permissions might be important to set before filling the dir.
timestamps must be set after filling the dir.
(check the code dealing with dirs elsewhere)
There was a problem hiding this comment.
guess the upper-layer dir entries should precede the lower-layer stuff in the archive.
There was a problem hiding this comment.
I don't claim to know lots about the details of borg's storage method, but I'm assuming that each backup operation stores file chunks in some manner that doesn't depend on order, but tracks the backup itself in the form of some manifest of the backed-up items. If this is the case, then it could be solved as simply as just ensuring that the manifest is sorted before storing it such that parent folders come before their children. This would work in conjunction with the idea of maintaining a list of parents to process after a matching nested folder is found. Does this sound right or doable?
There was a problem hiding this comment.
Perhaps just use a heap-queue for the manifest list, and always add item-names to it via heapq.heappush()?
enkore
left a comment
There was a problem hiding this comment.
Merge as-is. Create new issue for the directory inclusion discussed above.
|
Guess we need a similar change in 1.0-maint, right? |
|
@edgimar can you check if we want that in 1.0 also and do a backport to 1.0-maint? |
|
I don't think more work should be invested into backporting patterns stuff to 1.0-maint until the other loose ends (cf. activity in the other ticket) are tied up. It may be better to remove it from that release line anyway. |
|
-> #2463 |
This fixes the problem raised by issue #2314 by requiring that each root subtree
be fully traversed.
The problem occurs when a patterns file excludes a parent directory P later in
the file, and earlier in the file a subdirectory S of P is included. Because a
tree is processed recursively, P is processed before S is, but if P is
excluded, then S used to not even be considered, because we wouldn't recurse
into P. With this fix, we now recurse into P nonetheless, but don't add P (as
a directory entry) to the archive.