Skip to content

Conversation

@DongchenZ
Copy link
Contributor

In this PR, I added a new logical argument gen.samples in the generate_joint_ensemble_design function. The flag gen.samples will be determined within the SDA main workflow so that the samples file will be generated only if we don't pass the ensemble.samples and we don't have the samples.Rdata file in the desired directory.
Beyond this, I also updated the change log and document to include the new continental SDA features and settings.

Description

Motivation and Context

Review Time Estimate

  • Immediately
  • Within one week
  • When possible

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to change)

Checklist:

  • My change requires a change to the documentation.
  • My name is in the list of CITATION.cff
  • I agree that PEcAn Project may distribute my contribution under any or all of
    • the same license as the existing code,
    • and/or the BSD 3-clause license.
  • I have updated the CHANGELOG.md.
  • I have updated the documentation accordingly.
  • I have read the CONTRIBUTING document.
  • I have added tests to cover my changes.
  • All new and existing tests passed.

@DongchenZ DongchenZ changed the title Smal bug fixes for the creation of samples file Small bug fixes and updates for the PR #3634. Dec 2, 2025

generate_joint_ensemble_design <- function(settings,
ensemble_size,
gen.samples = FALSE,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • style suggestion: call the arg generate_samples instead of gen.samples
  • My instinct is that the default should be TRUE -- at least in the forward run, this is intended to be the single place samples are generated, so skipping it should need explicit user input
  • What should happen when generate_samples is TRUE but samples.Rdata already exists? I'd favor skipping it rather than overwriting.
Suggested change
gen.samples = FALSE,
generate_samples = TRUE,

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed.

# get the joint input design.
# here we are looping over sites
# to make sure we are grabbing the complete input lists.
for (i in seq_along(settings)) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not following the logic here. First, why is this in a loop over sites? Second, you've added a lot of checks to make sure that there's a complete mapping between the input space and the samplingspace, but you're not returning any information to the user if that's not the case. Related, I'm not following the expectation that if the samplingspace and inputs don't match for the first site that we should just keep looping over sites -- why would the input space change by site, and if it did you'd still have the problem that the input_design would work for some sites but not others.

I'll also note that it's not obvious where the input_design is being saved

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. The for-loop over sites is because under some edge cases where some sites may not have the entire input lists, which are required in the generate_joint_ensemble_design function. For example, we should have parameters, ICs, met, and soil_physics as inputs, but some sites may not have soil_physics. Therefore, we will not have the complete inputs' index table if we use any of those sites to run the generate_joint_ensemble_design function.
  2. I don't think we need to return any information for sites that don't have any of the above inputs, except for the met inputs, for which the user will see the error in the SIPNET logfile. It's natural for some sites that are missing any of the inputs, cause the datasets that are used for creating those files are not available everywhere on Earth.
  3. The SDA will fail only if any site doesn't have the met files. Otherwise, the workflow should just work as expected. Even for some sites that don't have some inputs (soil_physics or ICs), the workflow should just pick the default value of SIPNET and not fail.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't agree. If a site doesn't have multiple soil_physics files, but then the input_design tells ensemble member i to run with soil_physics file k, then the write.configs is going to crash when it tries to access a non-existent file path. I'm also not following why we'd be doing an ensemble run where some sites were missing inputs, that seems like something that should throw an error not something we should create a workaround for. The behavior of silently reverting to the default SIPNET values is a dangerous one -- those defaults are not reliable general-purpose values but the parameter values specific to Niwot Ridge, CO. And just because you've created that workaround in SIPNET doesn't mean the same workaround exists in the write.configs for other models.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok. I just reverted all the changes to the SDA workflow. The error will be reported in the write.config function if there is any mismatch between the samplingspace and input lists.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't mean that the checks should be dropped -- I actually thought that the checks were a useful feature. It seems better to me to catch a mismatch here during the setup than after spawning 100 x 8000 runs.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed. I included code that checks for any mismatch between the sampling space and the site's input list. If there is any mismatch, the user will know which site is missing which input from the mis.match.table.

}
}
# if we generated new samples file within the `generate_joint_ensemble_design` function.
if (generate_samples) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not following the logic here either. It seems like you're counting on generate_joint_ensemble_design to generate a samples.Rdata. If that's the case, why would you this be conditional on generate_samples being TRUE? I'm not following why there would be a case where you would want to use the old ensemble.samples? And even if you did, why does the loading of that need to be determined conditionally before caling generate_joint_ensemble_design. There's a lot of unspoken/undocumented assumptions here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we discussed it before that we both think the get.parameter.samples should be executed within the generate_joint_ensemble_design function. And that's why I think we should be careful about whether we should regenerate the samples.Rdata, followed by some checks in this PR.
The logic of this change of code is as follows:

  1. Detect if we pass the ensemble.samples from outside (check if the ensemble.samples object is NULL or not).
  2. If we don't pass the ensemble.samples (ensemble.samples == NULL), we will need to detect if a samples.Rdata exists under the desired directory (settings$outdir).
  3. If we don't have the samples.Rdata in the desired directory, we will then set the generate_samples as TRUE, because we need to generate the samples.Rdata in the generate_joint_ensemble_design function.
  4. If we generate new samples.Rdata in the generate_joint_ensemble_design function, we will need to load the samples.Rdata file after running the generate_joint_ensemble_design function.

For the question I'm not following why there would be a case where you would want to use the old ensemble.samples?, I think the answer is I have to keep using the same ensemble.samples passed outside of the SDA workflow otherwise we will generate 40 different samples.Rdata files if I submit 40 jobs for the SDA.

Finally, for the question why does the loading of that need to be determined conditionally before calling generate_joint_ensemble_design, I have changed the code to make sure we only load the samples.Rdata once after running the generate_joint_ensemble_design function. And still, we need to check if we have already passed the ensemble.samples to determine if we want to load the new parameters instead of the existing object.

#'
#' @param settings A PEcAn settings object containing ensemble configuration
#' @param ensemble_size Integer specifying the number of ensemble members
#' @param generate_samples Logical: logical variable determine if we want to generate the samples.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it's unclear what this argument is intended to do as the whole point of the function is to generate input design samples -- why would this ever be false? What would that mean?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The generate_samples will be false if we don't need to generate the samples.Rdata file (e.g., we already have the samples.Rdata file or we pass the ensemble.samples outside of the SDA function).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we always set the generate_samples as TRUE, we will generate 40 different samples.Rdata files if we submit 40 separate SDA jobs to the cluster.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But that's also true of the input_design! The issue is that generate_joint_ensemble_design should only be called once, not that generate_joint_ensemble_design should be called 40 times with the same fixed parameter samples. I'm still not seeing the scenario when you'd need the proposed generate_samples argument to be false.

@dlebauer dlebauer requested a review from divine7022 December 3, 2025 04:27
Comment on lines +354 to +355
# find a site that has all registered inputs except for the parameter field.
if (all(names.sampler %in% names.site.input)) {}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Empty if, plus checks catch that names.site.input is out of scope here. Can this line be deleted?

Suggested change
# find a site that has all registered inputs except for the parameter field.
if (all(names.sampler %in% names.site.input)) {}

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants