-
Notifications
You must be signed in to change notification settings - Fork 282
Refactor workflow to stateless run manifest architecture and SA input design coordination #3708
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: develop
Are you sure you want to change the base?
Conversation
…g in read.ensemble.output to fix I/O bottleneck
…g in read.sa.output to fix I/O bottleneck
base/workflow/R/run.write.configs.R
Outdated
| #' @param input_design DEPRECATED. Use input_design_ens and input_design_sa instead | ||
| #' @param input_design_ens Input design matrix for ensemble analysis | ||
| #' @param input_design_sa Input design matrix for sensitivity analysis |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not a fan of having this be three separate arguments. What if we kept input_design, expecting a list with entries ensemble and sensitivity and also accepting a single design object (interpreting it as being for the ensemble)?
(We discussed live that it'd be nice to encode SA and ensemble setup into the design such that run.write.configs doesn't even need to know which one it's working with, but that's still well in the future)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I 100% agree with @infotroph that we shouldn't have multiple input_design arguments. Where I disagree is that I don't think it makes sense to have ensemble and sensitivity elements. I think we should move to a place where a single call to run.write.configs runs a single input_design. If you want to run both an ensemble projection and a SA then you should call run.write.configs twice with different input_designs (one for SA one for ensemble). In other words, I think the SA/ensemble distinction should be deprecated at this point w.r.t. run.write.configs. For example, a Sobol SA is already actually executed via the "ensemble" pathway.
| invisible(settings) | ||
| return(settings) | ||
| } | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: Please don't remove terminating newlines. These days the rule is mostly just to avoid needless extra changes, but there do exist some (mostly ancient 😅 ) tools that complain if it isn't there.
base/workflow/R/run.write.configs.R
Outdated
|
|
||
| manifest.file <- file.path(settings$outdir, "runs_manifest.csv") | ||
|
|
||
| if (nrow(run_manifest_df) > 0) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it OK for the manifest to be empty? If so I'd favor writing a zero-row file so it's easy to tell that afterwards; if not OK then should we throw an error here instead of skipping?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think here the better approach would be to always write manifest;
distinguishes "workflow completed with no runs" from "workflow crashed before writing", and let the downstream consumers handles it.
base/workflow/R/run.write.configs.R
Outdated
| if (overwrite && file.exists(manifest.file)) { | ||
| unlink(manifest.file) | ||
| } | ||
|
|
||
| utils::write.table(run_manifest_df, | ||
| file = manifest.file, | ||
| sep = ",", | ||
| row.names = FALSE, | ||
| col.names = !file.exists(manifest.file), | ||
| append = file.exists(manifest.file)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
write.table removes existing files unless append=TRUE, so this unlink isn't needed
| if (overwrite && file.exists(manifest.file)) { | |
| unlink(manifest.file) | |
| } | |
| utils::write.table(run_manifest_df, | |
| file = manifest.file, | |
| sep = ",", | |
| row.names = FALSE, | |
| col.names = !file.exists(manifest.file), | |
| append = file.exists(manifest.file)) | |
| utils::write.table(run_manifest_df, | |
| file = manifest.file, | |
| sep = ",", | |
| row.names = FALSE, | |
| col.names = !file.exists(manifest.file), | |
| append = !overwrite) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
exactly, its defensive that adds complexity without benefit
base/workflow/R/run.write.configs.R
Outdated
| invisible(settings) | ||
| return(settings) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not your fault but fixing this while I see it -- invisible() was doing nothing here and usually only makes sense as a return value
| invisible(settings) | |
| return(settings) | |
| return(invisible(settings)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I should know this, but what's actually been changed in the returned settings object. I'm wondering whether the settings is still the right thing to return at this point (e.g., are we moving to a place where the manifest might be the thing a user actually wants back?)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changed: at least ensemble IDs and PFT outdirs (if they weren't there in the input settings).
Return value: My gut is settings is still the right thing to return. I'm saying this without checking, but I think at the moment every runModule.*() takes a MultiSettings and returns a possibly modified MultiSettings, so changing that would be an even bigger overhaul. But maybe this is an argument for putting (a copy of) the manifest into the returned settings?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But maybe this is an argument for putting (a copy of) the manifest into the returned settings?
If you are saying (settings$runs_manifest <- run_manifest_df), then manifest written to pecan.CONFIGS.xml via listToXml(), which would serialize every row as nested XML elements. But for for large SDA runs would increase the disk size (may be in giga byte). Slow down read.settings() / write.settings()
modules/uncertainty/R/ensemble.R
Outdated
| for (i in seq_len(nrow(ens.run.ids))) { | ||
|
|
||
| # Handle column name difference (manifest has 'run_id', legacy 'ens.run.ids' might have 'id') | ||
| run.id <- if ("run_id" %in% names(ens.run.ids)) ens.run.ids$run_id[i] else ens.run.ids$id[i] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would it work to iterate over the IDs directly instead of by index?
| for (i in seq_len(nrow(ens.run.ids))) { | |
| # Handle column name difference (manifest has 'run_id', legacy 'ens.run.ids' might have 'id') | |
| run.id <- if ("run_id" %in% names(ens.run.ids)) ens.run.ids$run_id[i] else ens.run.ids$id[i] | |
| # Handle column name difference (manifest has 'run_id', legacy 'ens.run.ids' might have 'id') | |
| run_ids <- ens.run.ids$run_id %||% ens.run.ids$id | |
| for (run_id in run_ids) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The index i is needed for output list indexing, but we could simplify by extracting run_ids once before the loop; and run_id mean the actual id's which seems confusing;
you suggested for (run_id in run_ids) but opted for the index-based approach because ensemble.output[[as.character(i)]] uses position based keys not run ids and seq_along() provides automatic increment without having to define another variable for manual idx <- idx + 1
| if (nrow(subset_df) == 1) { | ||
| run.id <- subset_df$run_id | ||
| } else if (nrow(subset_df) > 1) { | ||
| PEcAn.logger::logger.warn("Multiple runs found for", trait, quantile, "- using the last one.") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When would this happen and should it be an error instead of a warning?
modules/uncertainty/R/sensitivity.R
Outdated
| next | ||
| } | ||
|
|
||
| pass_pft <- switch(per.pft + 1, NULL, pft.name) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Weirdly phrased -- I see this was existing code but I wonder if I'm missing a reason it wasn't pass_pft <- if(isTRUE(per.pft)) pft.name else NULL, which I'd find easier to read.
| pass_pft <- switch(per.pft + 1, NULL, pft.name) | ||
|
|
||
| # Pass ALL variables at once to avoid repeated file opening. And call read.output once | ||
| out.tmp <- PEcAn.utils::read.output( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This might be for a different PR, but since this is getting converted to data frame a few lines down it might be worth benchmarking the existing approach against passing dataframe = TRUE into read.output
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i have added TODO comment; defer to future PR if complex filtering needed;
because i suspect keeping dataframe = FALSE make sense, cuz the current aggregation doesn't benefit from data.frame structure. If we add more complex post-processing then we can then benchmark.
| ens_input_design <- design_result$X | ||
| } | ||
|
|
||
| # handle SA design - check if already provided before generating |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is now turning into a fair amount of code repeated in both legs of the settings/multisettings if/else. Could we do some of this unconditionally outside the if? Or define a helper function to encapsulate it?
…nto helper function
…rocessing filtering needed
|
|
||
| # extract designs from input_design list | ||
| input_design_ens <- if (!is.null(input_design)) input_design$ensemble else NULL | ||
| input_design_sa <- if (!is.null(input_design)) input_design$sensitivity else NULL |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Per earlier comments, I'd recommend only having one input_design, not separate ens and sa designs
| #' @param sa_input_design Input design matrix for SA (internal use) | ||
| #' @param input_design Optional. Input design specification. Can be: | ||
| #' \itemize{ | ||
| #' \item A list with \code{ensemble} and/or \code{sensitivity} entries |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| #' \item A list with \code{ensemble} and/or \code{sensitivity} entries |
Description
fixes #3692
run.write.configsto generate aruns_manifest.csvfile containing run metadata (run_id, site_id, pft_name, trait, quantile, type) instead of appending to samples.Rdata.fixes #3693
read.sa.outputandread.ensemble.outputto pass the full vector of requested variables to read.output in a single call.Additionally this PR separates input design generation for SA and ensemble runs, fixing dimension mismatch errors in multisite workflows.
previously both SA and ensemble used a single
input_designmatrix, causing failures when:causes in dimension mismatch errors in input design coordination.
runModule.run.write.configsgenerates separateens_input_designandsa_input_designmatrices.run.write.configsaccepts bothinput_design_ensandinput_design_saparameters.Old
input_designparameter still works with deprecation warning.Motivation and Context
Review Time Estimate
Types of changes
Checklist: