Skip to content

Conversation

@divine7022
Copy link
Collaborator

@divine7022 divine7022 commented Dec 7, 2025

Description

fixes #3692

  • updated run.write.configs to generate a runs_manifest.csv file containing run metadata (run_id, site_id, pft_name, trait, quantile, type) instead of appending to samples.Rdata.
  • samples.Rdata is now treated as a read-only "Master parameter definition" file.
  • updated write.sa.configs and write.ensemble.configs to return manifest dataframes.

fixes #3693

  • refactored read.sa.output and read.ensemble.output to pass the full vector of requested variables to read.output in a single call.
  • eliminated the inefficient inner loop that opened/closed netcdf files for every individual variable.

Additionally this PR separates input design generation for SA and ensemble runs, fixing dimension mismatch errors in multisite workflows.
previously both SA and ensemble used a single input_design matrix, causing failures when:

  • Ensemble needed N runs (e.g. 50)
  • SA needed 1 (median) + (traits * quantiles) runs (e.g. 145)

causes in dimension mismatch errors in input design coordination.

runModule.run.write.configs generates separate ens_input_design and sa_input_design matrices.
run.write.configs accepts both input_design_ens and input_design_sa parameters.
Old input_design parameter still works with deprecation warning.

Motivation and Context

Review Time Estimate

  • Immediately
  • Within one week
  • When possible

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to change)

Checklist:

  • My change requires a change to the documentation.
  • My name is in the list of CITATION.cff
  • I agree that PEcAn Project may distribute my contribution under any or all of
    • the same license as the existing code,
    • and/or the BSD 3-clause license.
  • I have updated the CHANGELOG.md.
  • I have updated the documentation accordingly.
  • I have read the CONTRIBUTING document.
  • I have added tests to cover my changes.
  • All new and existing tests passed.

Comment on lines 15 to 17
#' @param input_design DEPRECATED. Use input_design_ens and input_design_sa instead
#' @param input_design_ens Input design matrix for ensemble analysis
#' @param input_design_sa Input design matrix for sensitivity analysis
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not a fan of having this be three separate arguments. What if we kept input_design, expecting a list with entries ensemble and sensitivity and also accepting a single design object (interpreting it as being for the ensemble)?

(We discussed live that it'd be nice to encode SA and ensemble setup into the design such that run.write.configs doesn't even need to know which one it's working with, but that's still well in the future)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I 100% agree with @infotroph that we shouldn't have multiple input_design arguments. Where I disagree is that I don't think it makes sense to have ensemble and sensitivity elements. I think we should move to a place where a single call to run.write.configs runs a single input_design. If you want to run both an ensemble projection and a SA then you should call run.write.configs twice with different input_designs (one for SA one for ensemble). In other words, I think the SA/ensemble distinction should be deprecated at this point w.r.t. run.write.configs. For example, a Sobol SA is already actually executed via the "ensemble" pathway.

invisible(settings)
return(settings)
}
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: Please don't remove terminating newlines. These days the rule is mostly just to avoid needless extra changes, but there do exist some (mostly ancient 😅 ) tools that complain if it isn't there.


manifest.file <- file.path(settings$outdir, "runs_manifest.csv")

if (nrow(run_manifest_df) > 0) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it OK for the manifest to be empty? If so I'd favor writing a zero-row file so it's easy to tell that afterwards; if not OK then should we throw an error here instead of skipping?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think here the better approach would be to always write manifest;
distinguishes "workflow completed with no runs" from "workflow crashed before writing", and let the downstream consumers handles it.

Comment on lines 236 to 245
if (overwrite && file.exists(manifest.file)) {
unlink(manifest.file)
}

utils::write.table(run_manifest_df,
file = manifest.file,
sep = ",",
row.names = FALSE,
col.names = !file.exists(manifest.file),
append = file.exists(manifest.file))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

write.table removes existing files unless append=TRUE, so this unlink isn't needed

Suggested change
if (overwrite && file.exists(manifest.file)) {
unlink(manifest.file)
}
utils::write.table(run_manifest_df,
file = manifest.file,
sep = ",",
row.names = FALSE,
col.names = !file.exists(manifest.file),
append = file.exists(manifest.file))
utils::write.table(run_manifest_df,
file = manifest.file,
sep = ",",
row.names = FALSE,
col.names = !file.exists(manifest.file),
append = !overwrite)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

exactly, its defensive that adds complexity without benefit

Comment on lines 251 to 252
invisible(settings)
return(settings)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not your fault but fixing this while I see it -- invisible() was doing nothing here and usually only makes sense as a return value

Suggested change
invisible(settings)
return(settings)
return(invisible(settings))

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I should know this, but what's actually been changed in the returned settings object. I'm wondering whether the settings is still the right thing to return at this point (e.g., are we moving to a place where the manifest might be the thing a user actually wants back?)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changed: at least ensemble IDs and PFT outdirs (if they weren't there in the input settings).

Return value: My gut is settings is still the right thing to return. I'm saying this without checking, but I think at the moment every runModule.*() takes a MultiSettings and returns a possibly modified MultiSettings, so changing that would be an even bigger overhaul. But maybe this is an argument for putting (a copy of) the manifest into the returned settings?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But maybe this is an argument for putting (a copy of) the manifest into the returned settings?

If you are saying (settings$runs_manifest <- run_manifest_df), then manifest written to pecan.CONFIGS.xml via listToXml(), which would serialize every row as nested XML elements. But for for large SDA runs would increase the disk size (may be in giga byte). Slow down read.settings() / write.settings()

Comment on lines 32 to 35
for (i in seq_len(nrow(ens.run.ids))) {

# Handle column name difference (manifest has 'run_id', legacy 'ens.run.ids' might have 'id')
run.id <- if ("run_id" %in% names(ens.run.ids)) ens.run.ids$run_id[i] else ens.run.ids$id[i]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it work to iterate over the IDs directly instead of by index?

Suggested change
for (i in seq_len(nrow(ens.run.ids))) {
# Handle column name difference (manifest has 'run_id', legacy 'ens.run.ids' might have 'id')
run.id <- if ("run_id" %in% names(ens.run.ids)) ens.run.ids$run_id[i] else ens.run.ids$id[i]
# Handle column name difference (manifest has 'run_id', legacy 'ens.run.ids' might have 'id')
run_ids <- ens.run.ids$run_id %||% ens.run.ids$id
for (run_id in run_ids) {

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The index i is needed for output list indexing, but we could simplify by extracting run_ids once before the loop; and run_id mean the actual id's which seems confusing;
you suggested for (run_id in run_ids) but opted for the index-based approach because ensemble.output[[as.character(i)]] uses position based keys not run ids and seq_along() provides automatic increment without having to define another variable for manual idx <- idx + 1

if (nrow(subset_df) == 1) {
run.id <- subset_df$run_id
} else if (nrow(subset_df) > 1) {
PEcAn.logger::logger.warn("Multiple runs found for", trait, quantile, "- using the last one.")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When would this happen and should it be an error instead of a warning?

next
}

pass_pft <- switch(per.pft + 1, NULL, pft.name)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Weirdly phrased -- I see this was existing code but I wonder if I'm missing a reason it wasn't pass_pft <- if(isTRUE(per.pft)) pft.name else NULL, which I'd find easier to read.

pass_pft <- switch(per.pft + 1, NULL, pft.name)

# Pass ALL variables at once to avoid repeated file opening. And call read.output once
out.tmp <- PEcAn.utils::read.output(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This might be for a different PR, but since this is getting converted to data frame a few lines down it might be worth benchmarking the existing approach against passing dataframe = TRUE into read.output

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i have added TODO comment; defer to future PR if complex filtering needed;
because i suspect keeping dataframe = FALSE make sense, cuz the current aggregation doesn't benefit from data.frame structure. If we add more complex post-processing then we can then benchmark.

ens_input_design <- design_result$X
}

# handle SA design - check if already provided before generating
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is now turning into a fair amount of code repeated in both legs of the settings/multisettings if/else. Could we do some of this unconditionally outside the if? Or define a helper function to encapsulate it?

@github-actions github-actions bot added the Tests label Dec 31, 2025

# extract designs from input_design list
input_design_ens <- if (!is.null(input_design)) input_design$ensemble else NULL
input_design_sa <- if (!is.null(input_design)) input_design$sensitivity else NULL
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Per earlier comments, I'd recommend only having one input_design, not separate ens and sa designs

#' @param sa_input_design Input design matrix for SA (internal use)
#' @param input_design Optional. Input design specification. Can be:
#' \itemize{
#' \item A list with \code{ensemble} and/or \code{sensitivity} entries
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
#' \item A list with \code{ensemble} and/or \code{sensitivity} entries

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

3 participants