Small bug fixes and updates for the PR #3634. #3690

DongchenZ · 2025-12-02T01:22:48Z

This PR adds a new logical argument, gen.samples, to the generate_joint_ensemble_design function. The SDA main workflow sets the flag so that the samples file is generated only when ensemble.samples is not passed in and no samples.Rdata file exists in the target directory.
I also updated the changelog and documentation to cover the new continental SDA features and settings.



infotroph · 2025-12-02T04:15:05Z

modules/uncertainty/R/generate_joint_ensemble_design.R


 generate_joint_ensemble_design <- function(settings,
                                           ensemble_size,
+                                           gen.samples = FALSE,


style suggestion: call the arg generate_samples instead of gen.samples

My instinct is that the default should be TRUE -- at least in the forward run, this is intended to be the single place samples are generated, so skipping it should need explicit user input

What should happen when generate_samples is TRUE but samples.Rdata already exists? I'd favor skipping it rather than overwriting.

Suggested change

- gen.samples = FALSE,
+ generate_samples = TRUE,
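
The behavior proposed in this comment (default to generating, but skip rather than overwrite when samples.Rdata already exists) could be sketched roughly as follows. The helper `should_write_samples` is hypothetical and is not part of PEcAn; it only illustrates the decision, assuming the file lives directly under the output directory.

```r
# Hypothetical helper (not a PEcAn function): decide whether a new
# samples.Rdata should be written, never overwriting an existing one.
should_write_samples <- function(outdir, generate_samples = TRUE) {
  samples_file <- file.path(outdir, "samples.Rdata")
  if (!generate_samples) {
    return(FALSE) # the caller explicitly opted out
  }
  if (file.exists(samples_file)) {
    message("Found existing ", samples_file, "; skipping sample generation.")
    return(FALSE)
  }
  TRUE
}

# Usage with a throwaway directory
tmp <- tempfile("outdir_")
dir.create(tmp)
first_call <- should_write_samples(tmp)      # no file yet
file.create(file.path(tmp, "samples.Rdata")) # stand-in for a real samples file
second_call <- should_write_samples(tmp)     # file now exists
opted_out <- should_write_samples(tmp, generate_samples = FALSE)
```

With this shape, skipping generation in a fresh directory requires explicit user input, which is the point of the suggested default.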


mdietze · 2025-12-02T17:07:10Z

modules/assim.sequential/R/sda.enkf_MultiSite.R

+    # get the joint input design.
+    # here we are looping over sites
+    # to make sure we are grabbing the complete input lists.
+    for (i in seq_along(settings)) {


I'm not following the logic here. First, why is this in a loop over sites? Second, you've added a lot of checks to make sure that there's a complete mapping between the input space and the samplingspace, but you're not returning any information to the user if that's not the case. Relatedly, I'm not following the expectation that if the samplingspace and inputs don't match for the first site we should just keep looping over sites. Why would the input space change by site? And if it did, you'd still have the problem that the input_design would work for some sites but not others.

I'll also note that it's not obvious where the input_design is being saved.

The for-loop over sites handles edge cases where some sites do not have the complete input list that generate_joint_ensemble_design requires. For example, we should have parameters, ICs, met, and soil_physics as inputs, but some sites may not have soil_physics. If we used one of those sites to run generate_joint_ensemble_design, we would not get the complete index table for the inputs.

I don't think we need to return any information for sites that are missing any of the above inputs, except for met inputs, in which case the user will see the error in the SIPNET logfile. It's natural for some sites to be missing some inputs, because the datasets used to create those files are not available everywhere on Earth.

The SDA will fail only if a site doesn't have the met files. Otherwise, the workflow should work as expected. Even for sites that lack some inputs (soil_physics or ICs), the workflow should just pick the SIPNET default values and not fail.

I don't agree. If a site doesn't have multiple soil_physics files, but the input_design tells ensemble member i to run with soil_physics file k, then write.configs is going to crash when it tries to access a non-existent file path. I'm also not following why we'd be doing an ensemble run where some sites were missing inputs; that seems like something that should throw an error, not something we should create a workaround for. The behavior of silently reverting to the default SIPNET values is a dangerous one: those defaults are not reliable general-purpose values but the parameter values specific to Niwot Ridge, CO. And just because you've created that workaround in SIPNET doesn't mean the same workaround exists in the write.configs for other models.

Ok. I just reverted all the changes to the SDA workflow. The error will be reported in the write.config function if there is any mismatch between the samplingspace and input lists.

I didn't mean that the checks should be dropped -- I actually thought that the checks were a useful feature. It seems better to me to catch a mismatch here during the setup than after spawning 100 x 8000 runs.

Fixed. I included code that checks for any mismatch between the sampling space and the site's input list. If there is any mismatch, the user will know which site is missing which input from the mis.match.table.
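
A setup-time check of this kind might look roughly like the sketch below. It is not the code from the PR: the function `find_input_mismatch`, the column names, and the input tags are illustrative assumptions, and the real mis.match.table may be structured differently.

```r
# Hypothetical sketch: report which site is missing which sampled input.
# `sampler_names` stands for the names in the sampling space and
# `site_inputs` for a named list of each site's available input tags.
find_input_mismatch <- function(sampler_names, site_inputs) {
  rows <- lapply(names(site_inputs), function(site_id) {
    missing <- setdiff(sampler_names, site_inputs[[site_id]])
    if (length(missing) == 0) {
      return(NULL) # this site has every sampled input
    }
    data.frame(site = site_id, missing_input = missing,
               stringsAsFactors = FALSE)
  })
  out <- do.call(rbind, rows)
  if (is.null(out)) {
    # no mismatches anywhere: return an empty table with the same columns
    out <- data.frame(site = character(0), missing_input = character(0),
                      stringsAsFactors = FALSE)
  }
  out
}

# Illustrative input tags; real settings may use different names.
sampler_names <- c("met", "soil_physics", "poolinitcond")
site_inputs <- list(
  site_A = c("met", "soil_physics", "poolinitcond"),
  site_B = c("met", "poolinitcond")
)
mis_match <- find_input_mismatch(sampler_names, site_inputs)
if (nrow(mis_match) > 0) {
  message("Sites with missing inputs: ",
          paste(mis_match$site, mis_match$missing_input,
                sep = ":", collapse = ", "))
}
```

In a real workflow the `message()` call would more likely be an error, so the mismatch stops the run during setup rather than after the ensemble members have been launched.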

mdietze · 2025-12-02T17:11:33Z

modules/assim.sequential/R/sda.enkf_MultiSite.R

+      }
+    }
+    # if we generated new samples file within the `generate_joint_ensemble_design` function.
+    if (generate_samples) {


I'm not following the logic here either. It seems like you're counting on generate_joint_ensemble_design to generate a samples.Rdata. If that's the case, why would this be conditional on generate_samples being TRUE? I'm not following why there would be a case where you would want to use the old ensemble.samples. And even if you did, why does the loading of that need to be determined conditionally before calling generate_joint_ensemble_design? There are a lot of unspoken/undocumented assumptions here.

I think we discussed before that we both think get.parameter.samples should be executed within the generate_joint_ensemble_design function. That's why I think we should be careful about whether to regenerate samples.Rdata, and why this PR adds some checks.
The logic of this code change is as follows:

1. Detect whether ensemble.samples was passed in from outside (check whether the ensemble.samples object is NULL).

2. If ensemble.samples was not passed (is.null(ensemble.samples)), check whether a samples.Rdata exists in the target directory (settings$outdir).

3. If there is no samples.Rdata in that directory, set generate_samples to TRUE, because generate_joint_ensemble_design needs to generate it.

4. If generate_joint_ensemble_design generated a new samples.Rdata, load that file after the function returns.
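
The decision sequence described above can be sketched in a few lines of R. Both helpers are hypothetical and are not the code in the PR. The sketch assumes that samples.Rdata stores an object named ensemble.samples, and it loads the file whenever no object was passed in, which is a simplification of the last step.

```r
# Hypothetical helper: should the design function generate samples.Rdata?
resolve_generate_samples <- function(ensemble.samples, outdir) {
  if (!is.null(ensemble.samples)) {
    return(FALSE) # samples were passed in from outside; reuse them
  }
  # generate only if no samples file exists in the output directory
  !file.exists(file.path(outdir, "samples.Rdata"))
}

# Hypothetical helper: return the samples to use after the design step.
load_samples_if_needed <- function(ensemble.samples, outdir) {
  if (!is.null(ensemble.samples)) {
    return(ensemble.samples)
  }
  env <- new.env()
  load(file.path(outdir, "samples.Rdata"), envir = env)
  env$ensemble.samples # assumed object name inside the file
}

# Usage with a throwaway directory and a toy samples object
outdir <- tempfile("sda_")
dir.create(outdir)
flag_fresh <- resolve_generate_samples(NULL, outdir)

# Stand-in for generate_joint_ensemble_design writing the samples file
ensemble.samples <- list(pft1 = data.frame(SLA = c(10, 12)))
save(ensemble.samples, file = file.path(outdir, "samples.Rdata"))

loaded <- load_samples_if_needed(NULL, outdir)
flag_existing <- resolve_generate_samples(NULL, outdir)
flag_passed <- resolve_generate_samples(loaded, outdir)
```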

As for the question "I'm not following why there would be a case where you would want to use the old ensemble.samples?": I have to keep using the same ensemble.samples passed in from outside the SDA workflow; otherwise we will generate 40 different samples.Rdata files if I submit 40 jobs for the SDA.

Finally, as for the question "why does the loading of that need to be determined conditionally before calling generate_joint_ensemble_design": I have changed the code so that we load samples.Rdata only once, after running generate_joint_ensemble_design. We still need to check whether ensemble.samples was already passed in, to decide whether to load the new parameters instead of using the existing object.

mdietze · 2025-12-02T17:14:12Z

modules/uncertainty/R/generate_joint_ensemble_design.R

 #'
 #' @param settings A PEcAn settings object containing ensemble configuration
 #' @param ensemble_size Integer specifying the number of ensemble members
+#' @param generate_samples Logical: logical variable determine if we want to generate the samples.


it's unclear what this argument is intended to do as the whole point of the function is to generate input design samples -- why would this ever be false? What would that mean?

generate_samples will be FALSE if we don't need to generate the samples.Rdata file (e.g., we already have the samples.Rdata file, or ensemble.samples is passed in from outside the SDA function).

If we always set generate_samples to TRUE, we will generate 40 different samples.Rdata files when we submit 40 separate SDA jobs to the cluster.

But that's also true of the input_design! The issue is that generate_joint_ensemble_design should only be called once, not that generate_joint_ensemble_design should be called 40 times with the same fixed parameter samples. I'm still not seeing the scenario when you'd need the proposed generate_samples argument to be false.
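
One way to read this point is that the design should be created a single time and then shared by every job. The sketch below illustrates that pattern with a hypothetical `get_or_create_design` wrapper and a toy generator standing in for generate_joint_ensemble_design; the file name and RDS format are assumptions, not PEcAn conventions.

```r
# Hypothetical wrapper: build the design once, then reuse the saved copy.
get_or_create_design <- function(design_file, make_design) {
  if (file.exists(design_file)) {
    return(readRDS(design_file)) # later callers reuse the stored design
  }
  design <- make_design()
  saveRDS(design, design_file)
  design
}

# Toy stand-in for the real design generator (random index columns)
make_toy_design <- function() {
  data.frame(param = sample(5), met = sample(5))
}

design_file <- tempfile(fileext = ".rds")
design_job1 <- get_or_create_design(design_file, make_toy_design)
design_job2 <- get_or_create_design(design_file, make_toy_design)
```

This only works if the first call finishes before the jobs are submitted. Calling the wrapper concurrently from 40 jobs could still race on the file, so the design would need to be built in a setup step ahead of submission.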

infotroph · 2025-12-12T09:22:25Z

modules/assim.sequential/R/sda.enkf_MultiSite.R

+    # find a site that has all registered inputs except for the parameter field.
+    if (all(names.sampler %in% names.site.input)) {}


Empty if, plus checks catch that names.site.input is out of scope here. Can this line be deleted?

Suggested change

- # find a site that has all registered inputs except for the parameter field.
- if (all(names.sampler %in% names.site.input)) {}

Dongchen Zhang added 5 commits December 1, 2025 15:31

Update computation config.

c1f6aa7

Merge branch 'SDA_data' of https://github.com/DongchenZ/pecan into SD…

655c756

…A_data

Add a flag to decide whether to generate the samples file.

1eeaeff

Update the change log for the previous PR (PecanProject#3634).

6f92baa

Update the document for the continental SDA settings.

19e8ef4

github-actions bot added Modules Documentation labels Dec 2, 2025

DongchenZ changed the title ~~Smal bug fixes for the creation of samples file~~ Small bug fixes and updates for the PR #3634. Dec 2, 2025

infotroph reviewed Dec 2, 2025

Dongchen Zhang added 2 commits December 2, 2025 10:10

Switch argument name.

0a06b57

Merge branch 'develop' of https://github.com/PecanProject/pecan into …

3a964cf

…SDA_data

mdietze requested changes Dec 2, 2025

Dongchen Zhang added 4 commits December 2, 2025 15:20

Change the logic of determining whether to load the samples.Rdata file.

cd81287

Revert changes.

3d4af80

Revert change.

06b78e0

Add the input check.

90da095

dlebauer requested a review from divine7022 December 3, 2025 04:27

mdietze approved these changes Dec 3, 2025

infotroph reviewed Dec 12, 2025
