Description
Currently run.write.configs acts as a "stateful" logger, appending run information to samples.Rdata. As we scale to hundreds or thousands of sites, this approach becomes inefficient and tightly couples the configuration step with the analysis step.
This issue proposes refactoring the downstream analysis functions (e.g. read.ensemble.output, read.sa.output, etc.) to adopt a stateless design that reads from a manifest, removing the dependency on runtime mutation of samples.Rdata.
run.write.configs generates run ids (e.g. ENS-0001-siteID) and physically saves them into the runs.samples list within samples.Rdata.
The samples.Rdata file grows linearly with the number of sites. The analysis modules depend on this file being constantly updated/mutated by the write step.
Proposed workaround:
Refactor the workflow to decouple "parameter definition" from "execution logging" by introducing a lightweight manifest file.
- samples.Rdata becomes static: treat samples.Rdata strictly as a "Master parameter definition" file (generated upstream by get.parameter.samples). It should be immutable during the run.write.configs step.
- Introduce runs_manifest.csv: instead of modifying the RData file, run.write.configs will generate a structured CSV file in the output directory. This file explicitly maps run ids to their design parameters.
Proposed structure (runs_manifest.csv):
| run_id | site_id | pft_name | trait | quantile | type |
|---|---|---|---|---|---|
| ENS-0001-siteA | siteA | NA | NA | NA | Ensemble |
| SA-median-siteA | siteA | grass | NA | 0.5 | Sensitivity |
| SA-pft_name-T1-Q1-siteA | siteA | grass | SLA | 0.158 | Sensitivity |
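
A minimal sketch of how the write step might emit this file, assuming the run metadata is already available as a data frame. The helper name `write_runs_manifest` and its arguments are hypothetical and not part of PEcAn; only base R is used.

```r
# Hypothetical helper (not an existing PEcAn function): writes the manifest
# once, without touching samples.Rdata.
write_runs_manifest <- function(runs, outdir) {
  required <- c("run_id", "site_id", "pft_name", "trait", "quantile", "type")
  missing_cols <- setdiff(required, names(runs))
  if (length(missing_cols) > 0) {
    stop("manifest is missing columns: ", paste(missing_cols, collapse = ", "))
  }
  # Each run id must identify exactly one row.
  if (anyDuplicated(runs$run_id) > 0) {
    stop("run_id values must be unique")
  }
  path <- file.path(outdir, "runs_manifest.csv")
  utils::write.csv(runs[, required], path, row.names = FALSE)
  invisible(path)
}

# Example rows copied from the proposed structure above.
runs <- data.frame(
  run_id   = c("ENS-0001-siteA", "SA-median-siteA", "SA-pft_name-T1-Q1-siteA"),
  site_id  = "siteA",
  pft_name = c(NA, "grass", "grass"),
  trait    = c(NA, NA, "SLA"),
  quantile = c(NA, 0.5, 0.158),
  type     = c("Ensemble", "Sensitivity", "Sensitivity"),
  stringsAsFactors = FALSE
)

# tempdir() stands in for the real output directory.
manifest_path <- write_runs_manifest(runs, tempdir())
```

Because the file is plain CSV, it can be inspected or filtered without loading any R objects, which is the main point of moving this information out of samples.Rdata.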
- Update downstream analysis functions (read.ensemble.output, read.sa.output, etc.) to read this CSV manifest.
Logic: look up the run_id where site_id == X and trait == Y.
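
The lookup could be sketched as below, assuming the manifest columns proposed above. `lookup_run_ids` is a hypothetical name for illustration, not an existing PEcAn function, and the inline CSV text stands in for reading runs_manifest.csv from the output directory.

```r
# Hypothetical helper: returns the run ids matching a site and, optionally,
# a trait and/or run type. Rows with an NA trait never match a trait filter.
lookup_run_ids <- function(manifest, site_id, trait = NULL, type = NULL) {
  keep <- manifest$site_id == site_id
  if (!is.null(trait)) {
    keep <- keep & !is.na(manifest$trait) & manifest$trait == trait
  }
  if (!is.null(type)) {
    keep <- keep & manifest$type == type
  }
  manifest$run_id[keep]
}

# Inline copy of the example manifest; in practice this would be
# read.csv(file.path(outdir, "runs_manifest.csv")).
manifest <- read.csv(text = "run_id,site_id,pft_name,trait,quantile,type
ENS-0001-siteA,siteA,NA,NA,NA,Ensemble
SA-median-siteA,siteA,grass,NA,0.5,Sensitivity
SA-pft_name-T1-Q1-siteA,siteA,grass,SLA,0.158,Sensitivity",
  stringsAsFactors = FALSE)

sla_runs <- lookup_run_ids(manifest, site_id = "siteA", trait = "SLA")
ens_runs <- lookup_run_ids(manifest, site_id = "siteA", type = "Ensemble")
```

With this shape, the analysis functions depend only on the manifest contents, so nothing written during the config step needs to be mutated afterwards.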