
Conversation

@divine7022 (Collaborator)

Description

Refactored extract_soil_gssurgo() to replace point-based WFS queries with a raster-based Web Coverage Service (WCS) approach, enabling accurate area-weighted sampling and eliminating spatial coverage gaps.

  • Implemented a bounding-box raster query using soilDB::mukey.wcs() to extract map unit keys with complete spatial coverage at 30 m resolution. Pixel counts provide accurate area weighting and eliminate spatial coverage gaps.
  • Used soilDB::get_SDA_property() with the "Weighted Average" aggregation method to retrieve soil properties (sand, silt, clay, organic matter, bulk density) integrated across the specified depth ranges, with component weighting, in a single batch query.
  • Integrated soilDB::fetchSDA() to obtain complete rock fragment data (fragvol_r) representing total volume across all size classes: 2-75 mm (pebbles), 75-250 mm (cobbles), 250-600 mm (stones), and >600 mm (boulders), with proper depth and component weighting applied to fragments. A minimal sketch of the overall workflow follows this list.
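
A minimal sketch of the new workflow (the bounding box, depth range, and variable names here are illustrative, not the exact code in this PR; it assumes a recent soilDB that returns a terra SpatRaster from mukey.wcs()):

# Hypothetical area of interest in WGS84 lon/lat; the real function derives
# this from the site location and radius.
aoi <- sf::st_bbox(c(xmin = -93.65, ymin = 42.00, xmax = -93.60, ymax = 42.05), crs = 4326)

# 1. 30 m raster of map unit keys: complete coverage, and pixel counts per
#    mukey give the area weights directly.
mu_raster <- soilDB::mukey.wcs(aoi = aoi, db = "gssurgo", res = 30)
mukey_counts <- table(terra::values(mu_raster))
mukeys <- as.integer(names(mukey_counts))

# 2. Depth-integrated, component-weighted properties in one batch query.
props <- soilDB::get_SDA_property(
  property = c("sandtotal_r", "silttotal_r", "claytotal_r", "om_r", "dbthirdbar_r"),
  method = "Weighted Average",
  mukeys = mukeys,
  top_depth = 0, bottom_depth = 30
)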

Motivation and Context

Fixes #3609

Review Time Estimate

  • Immediately
  • Within one week
  • When possible

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to change)

Checklist:

  • My change requires a change to the documentation.
  • My name is in the list of CITATION.cff
  • I agree that PEcAn Project may distribute my contribution under any or all of
    • the same license as the existing code,
    • and/or the BSD 3-clause license.
  • I have updated the CHANGELOG.md.
  • I have updated the documentation accordingly.
  • I have read the CONTRIBUTING document.
  • I have added tests to cover my changes.
  • All new and existing tests passed.

@infotroph changed the title from "Fix spatial sampling and improve data aggregation accuracy" to "gSSURGO: Fix spatial sampling and improve data aggregation accuracy" on Oct 10, 2025
@dlebauer (Member) left a comment:

This PR implements the features specified in #3609. Please check the comments and suggestions; once those are addressed it will be ready to merge, thanks!

# all size classes: 2-75mm (pebbles), 75-250mm (cobbles), 250-600mm (stones), and >600mm (boulders).
# plus component weighting needed for aggregation
sda_data <- tryCatch({
soilDB::fetchSDA(
Member:

Every call to fetchSDA returns all horizons from all components in the requested map units, so this does not need to be inside the loop over depths.

Also, in my local testing it seems all(c("sandtotal_r", "silttotal_r", "claytotal_r", "om_r", "dbthirdbar_r") %in% names(sda_data)) is TRUE, so it may make sense to skip the get_SDA_property call entirely and work just from this result.
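
If it helps, a quick local check along these lines (the mukey values are placeholders) confirms that the fetchSDA() result already carries those columns:

mukeys <- c(123456, 123457)  # placeholder map unit keys
where_clause <- sprintf("mukey IN %s", soilDB::format_SQL_in_statement(mukeys))
hz <- aqp::horizons(soilDB::fetchSDA(WHERE = where_clause))
all(c("sandtotal_r", "silttotal_r", "claytotal_r", "om_r", "dbthirdbar_r") %in% names(hz))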

Collaborator (Author):

I have implemented both suggestions in the recent commit: fetchSDA now retrieves all horizons once, and the loop filters by depth, which is much more efficient for this complex query.
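
Roughly the pattern now in place, as a sketch (where_clause as above; hzdept_r / hzdepb_r are the SSURGO horizon depth columns; the depth intervals shown are placeholders):

# Fetch every horizon for the requested map units once, outside the depth loop.
hz <- aqp::horizons(soilDB::fetchSDA(WHERE = where_clause))

# Example depth intervals in cm (top, bottom) -- placeholders only.
layers <- data.frame(top = c(0, 30), bottom = c(30, 100))

for (i in seq_len(nrow(layers))) {
  # Keep only horizons overlapping the current interval, then aggregate.
  layer_hz <- hz[hz$hzdept_r < layers$bottom[i] & hz$hzdepb_r > layers$top[i], ]
  # ... component- and thickness-weighted aggregation of layer_hz goes here ...
}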

soc_sd <- stats::sd(soc_mean, na.rm = TRUE)
n_depths <- length(soc_mean)

if (n_depths == 1 || is.na(soc_sd) || soc_sd == 0) {
Member:

Why NA when n_depths == 1? I would have expected that 1 layer still means one datapoint to be had

Collaborator (Author):

My thinking at the time was that with only one record the sd obviously becomes NA. But good catch! Returning NA for a single observation discards valid data unnecessarily. I have updated the code to:

if (is.na(soc_sd) || soc_sd == 0) {
  # No variability - use measured value (all ensemble members identical)
  simulated_soc <- rep(soc_mean, size)
} else {
  # Has variability - sample from gamma distribution
  shape <- (soc_mean^2) / (soc_sd^2)
  rate <- soc_mean / (soc_sd^2)
  simulated_soc <- stats::rgamma(size, shape = shape, rate = rate)
}

With a single component (n = 1), sd is NA, so all ensemble members get the measured value. With multiple distinct components (sd > 0), we sample from the distribution. Does that make sense now?
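
For what it's worth, a quick check with made-up numbers shows this shape/rate parameterization reproduces the target moments (mean = shape/rate, variance = shape/rate^2):

soc_mean <- 4.2   # illustrative values only
soc_sd   <- 0.9
shape <- (soc_mean^2) / (soc_sd^2)
rate  <- soc_mean / (soc_sd^2)
c(mean = shape / rate, sd = sqrt(shape / rate^2))   # returns 4.2 and 0.9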

}

# calling the query function sending the mapunit keys
soilprop <- gSSURGO.Query(
Member:

Does the approach used here completely replace gSSURGO.Query? If so we might be able to deprecate it now for removal in a future release -- as far as I know it is mostly (only?) used inside extract_soil_nc.

Collaborator (Author):

Yes, the new approach completely replaces gSSURGO.Query and uses fetchSDA. As far as I can tell, gSSURGO.Query was only called within extract_soil_nc.R (that call is now removed), so user impact should be minimal.
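
If we go the deprecation route, a minimal soft-deprecation shim could look like this (sketch only, not part of this PR; the argument list is elided):

gSSURGO.Query <- function(...) {
  .Deprecated(
    new = "extract_soil_gssurgo",
    msg = "gSSURGO.Query() is deprecated; extract_soil_gssurgo() now queries SDA via soilDB::fetchSDA()."
  )
  # ... keep the existing body for one release cycle ...
}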


# let's fit dirichlet for each depth level separately
simulated.soil.props<-soilprop.new.grouped %>%
split(list(soilprop.new.grouped$DepthL, soilprop.new.grouped$mukey)) %>%
Member:

@divine7022 can you explain more of the intention here? It looks like removing the split by depth switches this from fitting a texture for each depth layer to fitting one texture across the entire depth profile, which is not what we want.

Collaborator (Author):

Good question! I suspect this is the point where I got stuck on that decision.

I reverted back to splitting by both depth AND mukey.
The initial change was an experiment to increase the sample size for Dirichlet fitting and capture vertical variability.

Context for why I tried it:
The old code used get_SDA_property, which returned one pre-aggregated value per (mukey, depth) - essentially one observation per group, and with n = 1 Dirichlet parameter estimation is unstable (it needs roughly 3-5 samples for reasonable MLE convergence).
The new change uses a single fetchSDA call with component-level aggregation, which preserves multiple components per (mukey, depth).
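
For reference, the reverted grouping keeps one fit per (depth layer, map unit), roughly like this (sirt::dirichlet.mle() is shown as one possible estimator; the texture column names and the use of purrr are illustrative):

fits <- soilprop.new.grouped %>%
  split(list(soilprop.new.grouped$DepthL, soilprop.new.grouped$mukey), drop = TRUE) %>%
  purrr::map(function(g) {
    x <- as.matrix(g[, c("sand", "silt", "clay")])   # component-level fractions
    x <- x / rowSums(x)                              # rows must sum to 1
    sirt::dirichlet.mle(x)$alpha                     # unstable when nrow(x) is only 1-2
  })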

Member:

Thanks for reverting. For future work, changes that fundamentally alter a function's behavior ought to be called out in the PR description, especially if they're experimental -- I didn't notice this until I tried a local test run and wondered why it produced files with all layers identical.

@dlebauer (Member) left a comment:

This looks good to me. Thanks for all of your work on this!

@infotroph it looks like @divine7022 has addressed your suggestions and answered your questions. This seems to be a substantial improvement. Are there any errors that would prevent this from being merged?

@infotroph (Member) left a comment:

A few more comments here. I haven't finished my re-review yet, but I can already say that given the magnitude of the changes it would be very helpful to include some test cases so we can verify the function is behaving as intended.


The github-actions bot added the Tests label on Nov 29, 2025.

for (i in seq_along(depths_cm)) {
if (i == 1) {
top_depth <- 0
Member:

FYI you could eliminate this if by requiring the first depth to be 0 and then shifting the indexing. This isn't unreasonable, as it's the exact same behaviour as hist when you set a vector of breaks (i.e. there is one more break than there are bins).

Collaborator (Author):

This makes sense. So here I would go for a minimal fix, rather than adding a new value to the default depths parameter and breaking folks using the legacy workflow?

The simplest fix would be to remove the if block and do:

depths_cm <- c(0, depths * 100) 

for (i in seq_len(length(depths_cm) - 1)) {
  top_depth <- depths_cm[i]
  bottom_depth <- depths_cm[i + 1]
  # ... per-layer extraction continues as before ...
}

Member:

This is a case where it's fine to break the legacy workflow: just (a) fix the defaults and (b) find the places where the function is being called and update the arguments. But if you do attempt to prepend the vector with 0, it would be good to first check that the first element isn't already 0.
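
A small guard along those lines, as a sketch (depths is the function's existing argument, in meters given the depths * 100 conversion above):

depths_cm <- depths * 100

# Prepend the 0 cm surface boundary only if the caller didn't already supply it.
if (depths_cm[1] != 0) {
  depths_cm <- c(0, depths_cm)
}

for (i in seq_len(length(depths_cm) - 1)) {
  top_depth <- depths_cm[i]
  bottom_depth <- depths_cm[i + 1]
  # ... extract properties for the [top_depth, bottom_depth) layer ...
}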


Successfully merging this pull request may close issue #3609: Restore Area-Weighted Sampling for SSURGO.