Skip to content

allow excluding parent and including child, fixes #2314#2322

Merged
ThomasWaldmann merged 1 commit intoborgbackup:masterfrom
edgimar:master
Apr 23, 2017
Merged

allow excluding parent and including child, fixes #2314#2322
ThomasWaldmann merged 1 commit intoborgbackup:masterfrom
edgimar:master

Conversation

@edgimar
Copy link
Copy Markdown
Contributor

@edgimar edgimar commented Mar 21, 2017

This fixes the problem raised by issue #2314 by requiring that each root subtree
be fully traversed.

The problem occurs when a patterns file excludes a parent directory P later in
the file, and earlier in the file a subdirectory S of P is included. Because a
tree is processed recursively, P is processed before S is, but if P is
excluded, then S used to not even be considered, because we wouldn't recurse
into P. With this fix, we now recurse into P nonetheless, but don't add P (as
a directory entry) to the archive.

Copy link
Copy Markdown
Contributor

@enkore enkore left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should first discuss the approach. With this change Borg would recurse into all excluded directories, which can take significant time, even if no files would be added from these.

@ThomasWaldmann
Copy link
Copy Markdown
Member

@enkore yup, that's bad. see my comment in the ticket.

@edgimar
Copy link
Copy Markdown
Contributor Author

edgimar commented Mar 26, 2017

Any conclusions on how to move forward with this? Is TW's suggestion OK, or should something else be done? His suggestion, from #2314 is:

+ include and recurse into
- exclude, but recurse into, searching for includes
! exclude, do not recurse into

@enkore
Copy link
Copy Markdown
Contributor

enkore commented Mar 26, 2017

See my comments in #2314, i.e. feel free to go ahead implementing it.

Which character is which can be easily changed later, if need be.

@codecov-io
Copy link
Copy Markdown

codecov-io commented Mar 31, 2017

Codecov Report

Merging #2322 into master will increase coverage by 0.09%.
The diff coverage is 92.85%.

Impacted file tree graph

@@            Coverage Diff            @@
##           master   #2322      +/-   ##
=========================================
+ Coverage   82.91%     83%   +0.09%     
=========================================
  Files          20      20              
  Lines        7556    7575      +19     
  Branches     1288    1288              
=========================================
+ Hits         6265    6288      +23     
+ Misses        930     928       -2     
+ Partials      361     359       -2
Impacted Files Coverage Δ
src/borg/archive.py 81.73% <100%> (+0.15%) ⬆️
src/borg/archiver.py 82.24% <91.3%> (+0.17%) ⬆️
src/borg/helpers.py 87.86% <92.75%> (+0.11%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 9ae4bf0...798127f. Read the comment docs.

@edgimar
Copy link
Copy Markdown
Contributor Author

edgimar commented Mar 31, 2017

Ok, big update -- lots of refactoring, and the "!" feature is now implemented. Sorry for so many changes, but I'm hoping you'll see their beauty ;)

Here's a summary of the important changes:

  • Use an enum to represent possible commands in --patterns-from files
  • rename various variables to reflect their true meaning / remove ambiguity about their datatype
  • refactor some stuff from archiver.py to PatternMatcher.add_includepaths()
  • probably other wonderful things I'm forgetting (e.g. the new functionality, the "!" command, though there aren't any other new features)

(still need to add tests, but that should be trivial -- for now, I'd like it if the code could be reviewed)

@ThomasWaldmann
Copy link
Copy Markdown
Member

ThomasWaldmann commented Apr 1, 2017

@edgimar could you please use rebase onto master and not merge master? also, rephrase some commit comments so they are more meaningful than "refactoring" (e.g. state what code you refactored, what the goal was, ...).

Maybe create a backup, update your local master from the main repo, switch back to your branch again and then do git rebase -i master.

Also: are the commit comments starting with "-" intended to be kept that way?

@edgimar
Copy link
Copy Markdown
Contributor Author

edgimar commented Apr 1, 2017

Sorry, I do intend to clean up the repo (rebasing, collapsing commits, good commit messages), but at this point it's not probably worth looking at prior commits since most of them are just taken as "snapshots" of the work in progress and not based on atomic sets of changes.

The best is probably to just look at the overall diff between the latest commit and borgbackup:master. If any further changes are needed to the code, let me know and I'll try to update it. After everything looks OK, I'll deal with rebasing / etc.

@edgimar
Copy link
Copy Markdown
Contributor Author

edgimar commented Apr 1, 2017

Ok, here's a complete list of the changes, hopefully in sufficient detail:

  • renamed InclExclPattern named-tuple -> CmdTuple (with names 'val' and 'cmd'), since it is used more generally for commands, and not only for representing patterns.
  • represent commands as IECommand enum types (RootPath, PatternStyle, Include, Exclude, ExcludeNoRecurse)
  • archiver: Archiver.build_matcher() paths arg renamed -> include_paths to prevent confusion as to whether the list of paths are to be included or excluded.
  • helpers: PatternMatcher has recurse_dir attribute that is used to communicate whether an excluded dir should be recursed (used by Archiver._process())
  • archiver: Archiver.build_matcher() now only returns a PatternMatcher instance, and not an include_patterns list -- this list is now created and housed within the PatternMatcher instance, and can be accessed from there.
  • moved operation of finding unmatched patterns from Archiver to PatternMatcher.get_unmatched_include_patterns()
  • added / modified some documentation of code
  • renamed helpers:parse_add_pattern() patterns arg -> ie_patterns to clarify that pattern type is not a subclass of PatternBase, and is rather a CmdTuple -- in fact, I should rename this further to ie_cmds (and other vars in this function similarly)...
  • renamed _PATTERN_STYLES -> _PATTERN_CLASSES since "style" is ambiguous and this helps clarify that the set contains classes and not instances.
  • have PatternBase subclass instances store whether excluded dirs are to be recursed. Because PatternBase objs are created corresponding to each +, -, ! command it is necessary to differentiate - from ! within these objects.

return

if stat.S_ISDIR(st.st_mode) and matcher.recurse_dir:
recurse_excluded_dir = True
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

naming: isn't this rather a state flag than a command flag?

would recursing_excluded_dir better describe what it tells?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it doesn't matter much to me what it is named, but I chose recurse (as a command verb) because in my mind the recursing hasn't happened yet at that point, and will only happen later on, based on the value of the flag which is set here.

elif pattern.ptype is PatternStyle:
fallback = pattern.pattern
ie_pattern = parse_inclexcl_command(patternstr, fallback=fallback)
if ie_pattern.cmd is IECommand.RootPath:
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not sure: is "is" the right way to compare to a enum member or rather "=="?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Either will work -- but enum values are singletons, so is works fine, and I believe it is the "right way" to compare them.



def get_pattern_style(prefix):
class IECommand(Enum):
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

naming: how about "PatternType" or so?

IECommand sounds strange (not only because IE=Internet Explorer, or IEC = International Electrotechnical Commission, but primarily because it is not a command).

Copy link
Copy Markdown
Contributor Author

@edgimar edgimar Apr 2, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I agree that IE is a little bit cryptic -- it could simply be InclExcl instead of IE, but I thought that made referencing the values kind of a lot to type. I don't like "PatternType" because these really are more general commands (in some cases, which indicate what should be done with a particular pattern). Also, it can lead to confusion about what is meant by "pattern type" -- elsewhere in the code this would mean, for example, whether it is a re or sh pattern, etc.

cmd_prefix_map = {
'-': IECommand.Exclude, # False
'!': IECommand.ExcludeNoRecurse,
'+': IECommand.Include, # True
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is the "True" (and "False" 2 lines above) worth keeping? You just got rid of it, right?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, that's old stuff that should be removed.

@edgimar
Copy link
Copy Markdown
Contributor Author

edgimar commented Apr 8, 2017

Ok, I've updated this PR with what I think should be the final changes (apart from squashing commits). All the tests pass on my machine, so can someone explain what is causing Travis to fail?

@edgimar edgimar force-pushed the master branch 2 times, most recently from fe0392c to d945dd1 Compare April 11, 2017 02:20
@edgimar
Copy link
Copy Markdown
Contributor Author

edgimar commented Apr 11, 2017

Revisions are now squashed and rebased. What remains to be done to get this merged?

@ThomasWaldmann
Copy link
Copy Markdown
Member

ThomasWaldmann commented Apr 11, 2017

@edgimar TypeError: __init__() missing 1 required positional argument: 'recurse_dir'

https://travis-ci.org/borgbackup/borg/jobs/220795799

@edgimar
Copy link
Copy Markdown
Contributor Author

edgimar commented Apr 12, 2017

@ThomasWaldmann The problems with the tests should now be fixed. I wanted to avoid using a default value for the recurse_dir arg, but it seems easier to just have one.

This fixes the problem raised by issue borgbackup#2314 by requiring that each root
subtree be fully traversed.

The problem occurs when a patterns file excludes a parent directory P later
in the file, but earlier in the file a subdirectory S of P is included.
Because a tree is processed recursively with a depth-first search, P is
processed before S is.  Previously, if P was excluded, then S would not even
be considered.  Now, it is possible to recurse into P nonetheless, while not
adding P (as a directory entry) to the archive.

With this commit, a `-` in a patterns-file will allow an excluded directory
to be searched for matching descendants.  If the old behavior is desired, it
can be achieved by using a `!` in place of the `-`.

The following is a list of specific changes made by this commit:

 * renamed InclExclPattern named-tuple -> CmdTuple (with names 'val' and 'cmd'), since it is used more generally for commands, and not only for representing patterns.
 * represent commands as IECommand enum types (RootPath, PatternStyle, Include, Exclude, ExcludeNoRecurse)
 * archiver: Archiver.build_matcher() paths arg renamed -> include_paths to prevent confusion as to whether the list of paths are to be included or excluded.
 * helpers: PatternMatcher has recurse_dir attribute that is used to communicate whether an excluded dir should be recursed (used by Archiver._process())
 * archiver: Archiver.build_matcher() now only returns a PatternMatcher instance, and not an include_patterns list -- this list is now created and housed within the PatternMatcher instance, and can be accessed from there.
 * moved operation of finding unmatched patterns from Archiver to PatternMatcher.get_unmatched_include_patterns()
 * added / modified some documentation of code
 * renamed _PATTERN_STYLES -> _PATTERN_CLASSES since "style" is ambiguous and this helps clarify that the set contains classes and not instances.
 * have PatternBase subclass instances store whether excluded dirs are to be recursed.  Because PatternBase objs are created corresponding to each +, -, ! command it is necessary to differentiate - from ! within these objects.
 * add test for '!' exclusion rule (which doesn't recurse)
@edgimar
Copy link
Copy Markdown
Contributor Author

edgimar commented Apr 17, 2017

@ThomasWaldmann, @enkore - any further changes needed?

if not dry_run:
status = archive.process_dir(path, st)
if not recurse_excluded_dir:
status = archive.process_dir(path, st)
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

there might be a problem here:

if the code recurses into an excluded dir (searching for stuff that is included again in directories deeper below),
it does not archive the intermediate directories, so we won't have their metadata (like uid/gid/mode/...) in the archive.

at extract time, borg will automatically create missing intermediate dirs, but it can not set the correct metadata on them (not even when doing a full extract) as it does not have them in the archive.

Copy link
Copy Markdown
Contributor Author

@edgimar edgimar Apr 19, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm, does it matter if the intermediate dirs are archived before or after the nested included path? It could be possible to add the intermediate dirs (at the time a nested included path is archived) to a list of paths to be archived later.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

permissions might be important to set before filling the dir.
timestamps must be set after filling the dir.

(check the code dealing with dirs elsewhere)

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

guess the upper-layer dir entries should precede the lower-layer stuff in the archive.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't claim to know lots about the details of borg's storage method, but I'm assuming that each backup operation stores file chunks in some manner that doesn't depend on order, but tracks the backup itself in the form of some manifest of the backed-up items. If this is the case, then it could be solved as simply as just ensuring that the manifest is sorted before storing it such that parent folders come before their children. This would work in conjunction with the idea of maintaining a list of parents to process after a matching nested folder is found. Does this sound right or doable?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps just use a heap-queue for the manifest list, and always add item-names to it via heapq.heappush()?

Copy link
Copy Markdown
Contributor

@enkore enkore left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Merge as-is. Create new issue for the directory inclusion discussed above.

@ThomasWaldmann ThomasWaldmann modified the milestones: 1.1.0b5, 1.0.11rc1 Apr 23, 2017
@ThomasWaldmann ThomasWaldmann merged commit 6f47b79 into borgbackup:master Apr 23, 2017
@ThomasWaldmann
Copy link
Copy Markdown
Member

Guess we need a similar change in 1.0-maint, right?

@ThomasWaldmann
Copy link
Copy Markdown
Member

@edgimar can you check if we want that in 1.0 also and do a backport to 1.0-maint?

@enkore
Copy link
Copy Markdown
Contributor

enkore commented Apr 28, 2017

I don't think more work should be invested into backporting patterns stuff to 1.0-maint until the other loose ends (cf. activity in the other ticket) are tied up.

It may be better to remove it from that release line anyway.

@ThomasWaldmann
Copy link
Copy Markdown
Member

-> #2463

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants