-
Notifications
You must be signed in to change notification settings - Fork 20
build: Add --exclude-from-upload and --exclude-from-download options
#370
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
13064be to
8bc1632
Compare
|
Ah, this has a bug around absolute vs. relative path handling and pattern specification. I'll address it and repush. Otherwise, though, still ready for 👀 (just not testing). |
Excluding files from upload is very handy for ad-hoc skipping of large ancillary files or previous output files in the build directory that the user wants to ignore for the remote build on AWS Batch (i.e. to start it "fresh"). Existing workarounds for the lack of an exclusion mechanism include git worktrees to obtain a clean state and moving files or directories temporarily out of the build directory. A future improvement would be adding support to also specify these patterns via a file, which can then be checked in to a build repo and shared by all users. As a broader improvement, I'd like to design away this issue by separating out the workflow-as-program (rules, code, etc.) from the workflow-as-state (config, inputs, outputs, etc.). This comes out of several other goals, but would apply to this "excluded files" need too. For example, instead of relying on the implicit state combined with the code like now in a single build directory, we'd instead explicitly pass an empty starting state separate from the workflow program. This change includes a small bug fix for --download to allow _only_ negated patterns. Resolves: <#219>
8bc1632 to
6d465f0
Compare
|
Bug squashed in repush. |
|
@lmoncla @jameshadfield @trvrb If you want to test this out before merge/release to make sure it works for you, you can temporarily install the CLI built from this PR with: Ignore the shell setup instructions printed at the end. To test, run |
jameshadfield
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good by inspection. Nice to see the implementation wasn't too far from how I thought.
| wildcards (``**``), extended globbing (``@(…)``, ``+(…)``, etc.), | ||
| and negation (``!…``). | ||
|
|
||
| Patterns should be relative to the build directory. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Potentially tangential to this PR
Reading this code my understanding is that only the build directory is uploaded, and so patterns are relative to that. How does this work when the build directory is (e.g.) ingest/ but within that the Snakefile refers back to a parent directory?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, that's a great point. The directory here that matters—the one that's uploaded and one that patterns must be relative to—is influenced by the nextstrain-pathogen.yaml file. For example, running nextstrain build ingest/ in a directory with nextstrain-pathogen.yaml will require patterns be relative to the directory containing the marker file, not the ingest/ directory, i.e. to exclude something in ingest/ would require the pattern to be ingest/x/y/z not just x/y/z.
I'll update the docs here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually, I want to leave the option docs/help as-is for now for this PR. I'm realizing there's larger documentation reworking to do to clarify terminology here as a result of #355 (something I'd postponed in that PR). Writing a clear help snippet for these pattern options will be much easier and less inconsistent/confusing once the larger terminology shift and doc updates happen. I should prioritize that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍 totally forgot about this subtlety. Is this behaviour mirrored in what directories are bind-mounted into the docker container (when running locally)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yep.
Before #355 the directory given to nextstrain build ended up as /nextstrain/build in the container and /nextstrain/build was the fixed working directory for the container process.
After #355, the /nextstrain/build mount is either the directory given to nextstrain build or, if present, the closest ancestor of the given directory which contains a nextstrain-pathogen.yaml. The working directory for the container process is the /nextstrain/build path corresponding to the directory given to nextstrain build.
|
This feature will be in the being-released-as-we-speak 8.4.0. |
Excluding files from upload is very handy for ad-hoc skipping of large ancillary files or previous output files in the build directory that the user wants to ignore for the remote build on AWS Batch (i.e. to start it "fresh"). Existing workarounds for the lack of an exclusion mechanism include git worktrees to obtain a clean state and moving files or directories temporarily out of the build directory.
A future improvement would be adding support to also specify these patterns via a file, which can then be checked in to a build repo and shared by all users.
As a broader improvement, I'd like to design away this issue by separating out the workflow-as-program (rules, code, etc.) from the workflow-as-state (config, inputs, outputs, etc.). This comes out of several other goals, but would apply to this "excluded files" need too. For example, instead of relying on the implicit state combined with the code like now in a single build directory, we'd instead explicitly pass an empty starting state separate from the workflow program.
This change includes a small bug fix for --download to allow only negated patterns.
Resolves: #219
Checklist