Code repository for:
Rafique, Hamza and Muhammad, Abubakr, “A deep learning framework for farm-scale soil moisture retrievals: A case study for a data-scarce region,” IGARSS 2026 — 2026 IEEE International Geoscience and Remote Sensing Symposium.
This repo provides a Python (PyTorch) workflow to train and evaluate an LSTM-based model (with Monte‑Carlo Dropout uncertainty) for predicting soil moisture using SMAP coarse-resolution (9 km) soil moisture features.
- Overview
- Repository structure
- Data
- Installation
- Quickstart
- Configuration reference (`config.py`)
- Model notes
- Outputs
- Citation
- Contact
## Overview

At a high level, this project:
- Loads multiple site CSV files (each CSV corresponds to a sensor/site).
- Builds sliding-window sequences of length `SEQ_LEN` per site.
- Trains an LSTM to predict soil moisture (scaled to m³/m³ internally).
- Runs Monte‑Carlo Dropout at inference time to quantify predictive uncertainty.
- Saves trained model weights, scalers, and evaluation metrics/plots.
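The sliding-window step above can be sketched as follows. This is a minimal illustration, not the repo's `utils.py` code; the function name `make_sequences` is hypothetical, and `seq_len=7` mirrors the `SEQ_LEN` default from `config.py`.

```python
import numpy as np

def make_sequences(features, targets, seq_len=7):
    """Build (n_windows, seq_len, n_features) inputs and next-step targets."""
    X, y = [], []
    for i in range(len(features) - seq_len):
        X.append(features[i:i + seq_len])  # window of seq_len timesteps
        y.append(targets[i + seq_len])     # predict the following timestep
    return np.asarray(X), np.asarray(y)

feats = np.arange(20, dtype=float).reshape(-1, 1)  # one feature column, 20 days
targs = np.arange(20, dtype=float)
X, y = make_sequences(feats, targs)
print(X.shape, y.shape)  # (13, 7, 1) (13,)
```

Each site's series is windowed independently, so windows never straddle two sensors.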
Key scripts:
- `train.py`: training (standard and MC-dropout variants) + per-site metrics on the training set.
- `test.py`: evaluation on a folder of site CSVs (standard and MC-dropout variants).
- `witsms_farm_stats.py`: aggregates sensor-level metrics to farm-level metrics and makes boxplots.
- `config.py`: experiment configuration (data folders, feature selection, model mode, etc.).
- `model.py`: LSTM model definitions (standard + MC-dropout).
- `utils.py`: data loading, sequence generation, scaling, metrics, plotting, cleaning utilities.
## Repository structure

```
.
├── README.md
├── config.py
├── model.py
├── train.py
├── test.py
├── utils.py
├── witsms_farm_stats.py
├── training/
│   ├── train/      # (you provide) holdout training set CSVs
│   ├── test/       # (you provide) holdout test set CSVs
│   └── complete/   # (optional) full dataset CSVs
└── checkpoints/
    ├── holdout/    # saved models/scalers for holdout runs
    └── complete/   # saved models/scalers for complete runs
```
Notes:

- `training/` is expected to contain your CSV files (see Data).
- `checkpoints/` will be created/used automatically for the model + scaler files.
- Results are written into `results/holdout` or `results/complete` depending on configuration.
## Data

Each site CSV should contain:

- A timestamp column (default): `TimeStamp`
- One target column: `VolumetricWaterContent1`
- One or more feature columns depending on `MODE` (see `config.py`):
  - `SM_AM_9km` for AM mode
  - `SM_PM_9km` for PM mode
  - `SM_9km` for combined mode
The data layout is controlled by `SET` in `config.py`:

- If `SET = "holdout"`:
  - `training/train/` contains training CSVs
  - `training/test/` contains test CSVs
- If `SET = "complete"`:
  - `training/complete/` contains CSVs used for both training and evaluation
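A hedged sketch of loading and validating one site CSV with the columns above (illustrative only; the repo's actual loading lives in `utils.py`, and the inline CSV text here is made-up sample data):

```python
import io
import pandas as pd

DATE_COL = "TimeStamp"
TARGET_COL = "VolumetricWaterContent1"
FEATURE_COLS = ["SM_AM_9km"]  # AM mode

# Stand-in for a real site CSV file on disk.
csv_text = io.StringIO(
    "TimeStamp,VolumetricWaterContent1,SM_AM_9km\n"
    "2023-01-01,0.21,0.19\n"
    "2023-01-02,0.23,0.20\n"
)
df = pd.read_csv(csv_text, parse_dates=[DATE_COL])

# Fail early if a required column is missing.
missing = set([DATE_COL, TARGET_COL] + FEATURE_COLS) - set(df.columns)
assert not missing, f"CSV is missing columns: {missing}"
```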
## Installation

This is a Python-only project using common scientific libraries and PyTorch.
Recommended:
- Python 3.9+ (or newer)
- PyTorch (CPU or CUDA)
- NumPy, Pandas, scikit-learn, SciPy, Matplotlib
Using venv:

```shell
python -m venv .venv
# Linux / macOS
source .venv/bin/activate
# Windows (PowerShell)
# .\.venv\Scripts\Activate.ps1
```

Install dependencies (typical):

```shell
pip install numpy pandas scipy scikit-learn matplotlib joblib torch
```

If you use CUDA, install the matching PyTorch build per the official PyTorch instructions.
## Quickstart

### Configure

Edit `config.py`:

- `MODE`: `"AM"`, `"PM"`, or `"comb"`
- `SET`: `"holdout"` or `"complete"`
- `SEQ_LEN`: sequence length (default 7)
- `EPOCHS`: training epochs (default 5000)
- `MODEL_UNCERTAINTY`: `True` for the MC-dropout model, else the standard model
- `TRAIN_ON_TRUE_VALUES`:
  - `True`: train to predict soil moisture directly
  - `False`: train to predict residuals w.r.t. a persistence baseline (see notes below)
- `USE_SCALING`: standardize inputs/targets/residuals using `StandardScaler`

Also verify these are correct for your data:

- `TARGET_COL = "VolumetricWaterContent1"`
- `DATE_COL = "TimeStamp"`
- `FEATURE_COLS` are set automatically based on `MODE`
### Train

Run:

```shell
python train.py
```

What it does (default path in `train.py` main):

- Trains an MC-dropout model using `train_global_model_mc()`
- Evaluates per-site on the dataset it trained on via `evaluate_per_site_mc(...)`
- Writes a metrics CSV into `OUT_DIR`
- Saves the model and scalers into `MODEL_DIR`
### Test

Run:

```shell
python test.py
```

What it does:

- Loads the saved model + scalers from `MODEL_DIR`
- Runs either deterministic evaluation or MC-dropout evaluation (depending on `MODEL_UNCERTAINTY`)
- For MC-dropout evaluation on all sites, uses `TEST_DATA_FOLDER` (from `config.py`)
- Saves a metrics CSV into `OUT_DIR` with a name like `results/holdout/AM_metrics_per_site_mc_test.csv`
### Aggregate to farm level

If you produced a sensor-level metrics CSV from `test.py`, you can aggregate it to farm-level metrics:

```shell
python witsms_farm_stats.py
```

This script expects:

- An input metrics CSV path based on `OUT_DIR` and `MODE`, e.g. `results/holdout/AM_metrics_per_site_mc_test.csv`

It produces:

- A farm-level aggregated metrics CSV in `results/<set>/farmlevel/`
- Boxplots saved alongside the CSV.
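The aggregation step is, in spirit, a groupby over sensors sharing a farm. The sketch below is illustrative only: the column names (`site`, `rmse`, `r`) and the farm-id-from-site-prefix convention are assumptions, not the actual schema used by `witsms_farm_stats.py`.

```python
import pandas as pd

# Made-up sensor-level metrics, two sensors on farmA and one on farmB.
metrics = pd.DataFrame({
    "site": ["farmA_s1", "farmA_s2", "farmB_s1"],
    "rmse": [0.030, 0.040, 0.050],
    "r":    [0.80, 0.70, 0.60],
})

# Derive a farm id from the site name, then average metrics per farm.
metrics["farm"] = metrics["site"].str.split("_").str[0]
farm_level = metrics.groupby("farm")[["rmse", "r"]].mean().reset_index()
print(farm_level)
```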
## Configuration reference (`config.py`)

Key settings:

- Experiment selection
  - `MODE = "AM" | "PM" | "comb"`
  - `SET = "holdout" | "complete"`
- Data locations
  - `DATA_FOLDER`:
    - holdout: `training/train`
    - complete: `training/complete`
  - `TEST_DATA_FOLDER`:
    - holdout: `training/test`
    - complete: same as `DATA_FOLDER`
- Model/training
  - `HIDDEN_DIM`, `NUM_LAYERS`, `LR`, `EPOCHS`
  - `TRAIN_ON_TRUE_VALUES`: train on true values vs. residuals
  - `MODEL_UNCERTAINTY`: use the MC-dropout model or the standard model
- Scaling
  - `USE_SCALING`: uses `StandardScaler` on inputs and targets/residuals
- Device
  - `DEVICE = cuda if available else cpu`
- Output paths
  - `OUT_DIR`: results folder
  - `MODEL_DIR`: checkpoints folder
  - `MODEL_NAME`, `SCALARS_NAME`: depend on `MODE`
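Put together, a `config.py` in this shape might look like the fragment below. The setting names come from this README; the concrete values for `HIDDEN_DIM`, `NUM_LAYERS`, and `LR` are examples, not the repo's defaults, and the path derivations are a plausible sketch.

```python
MODE = "AM"                # "AM" | "PM" | "comb"
SET = "holdout"            # "holdout" | "complete"
SEQ_LEN = 7
EPOCHS = 5000
HIDDEN_DIM, NUM_LAYERS, LR = 64, 2, 1e-3  # example values
MODEL_UNCERTAINTY = True
TRAIN_ON_TRUE_VALUES = True
USE_SCALING = True

TARGET_COL = "VolumetricWaterContent1"
DATE_COL = "TimeStamp"
FEATURE_COLS = {"AM": ["SM_AM_9km"], "PM": ["SM_PM_9km"], "comb": ["SM_9km"]}[MODE]

DATA_FOLDER = "training/train" if SET == "holdout" else "training/complete"
TEST_DATA_FOLDER = "training/test" if SET == "holdout" else DATA_FOLDER
OUT_DIR = f"results/{SET}"
MODEL_DIR = f"checkpoints/{SET}"
# DEVICE is resolved in the repo via torch.cuda.is_available()
```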
## Model notes

- The standard model is `LSTM` via `make_model(...)` in `model.py`.
- The MC-dropout model is `LSTM_MC` via `make_model_mc(...)` in `model.py`.
  - During MC evaluation, the model is set to `.train()` mode to keep dropout active.
  - Multiple forward passes are taken (`n_mc`), then:
    - mean prediction: `preds_mean`
    - uncertainty proxy: `preds_std` (std across MC samples)
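The MC-dropout pass above can be sketched as follows. This is a minimal illustration under stated assumptions, not the repo's `model.py` code: `TinyLSTM_MC` and `mc_predict` are hypothetical stand-ins for `LSTM_MC` and the evaluation loop.

```python
import torch
import torch.nn as nn

class TinyLSTM_MC(nn.Module):
    """Illustrative stand-in for the repo's LSTM_MC model."""
    def __init__(self, in_dim=1, hidden=16, dropout=0.3):
        super().__init__()
        self.lstm = nn.LSTM(in_dim, hidden, batch_first=True)
        self.drop = nn.Dropout(dropout)
        self.head = nn.Linear(hidden, 1)

    def forward(self, x):
        out, _ = self.lstm(x)                      # (batch, seq, hidden)
        return self.head(self.drop(out[:, -1]))    # last timestep -> scalar

def mc_predict(model, x, n_mc=50):
    model.train()  # keep dropout ON at inference time
    with torch.no_grad():
        samples = torch.stack([model(x) for _ in range(n_mc)])  # (n_mc, batch, 1)
    # Mean across stochastic passes is the prediction; std is the uncertainty proxy.
    return samples.mean(dim=0), samples.std(dim=0)

x = torch.randn(4, 7, 1)  # batch of 4 windows, SEQ_LEN=7, 1 feature
preds_mean, preds_std = mc_predict(TinyLSTM_MC(), x)
```

Note the deliberate `model.train()` before the loop: calling `.eval()` would disable dropout and collapse all `n_mc` passes to the same output.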
The repo supports two target formulations:

- `TRAIN_ON_TRUE_VALUES = True`:
  - The model predicts soil moisture directly.
- `TRAIN_ON_TRUE_VALUES = False`:
  - A persistence-like baseline is computed: the first feature at the last timestep (`X_seq[:, -1, 0]`).
  - The model predicts the residual: `residual = y_true - baseline`
  - The final prediction is: `y_pred = clip(baseline + residual_pred, 0, 1)`

Make sure the first (or only) feature column is the intended baseline series when using residual mode.
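The residual reconstruction can be sketched as below (illustrative, not the repo's exact code; `residual_to_prediction` is a hypothetical helper). Clipping to [0, 1] keeps predictions inside the physically valid volumetric water content range.

```python
import numpy as np

def residual_to_prediction(X_seq, residual_pred):
    """Recover soil moisture from a predicted residual and a persistence baseline.

    X_seq: (batch, SEQ_LEN, n_features); the first feature is the baseline series.
    """
    baseline = X_seq[:, -1, 0]  # last timestep of the first feature
    return np.clip(baseline + residual_pred, 0.0, 1.0)

X_seq = np.zeros((2, 7, 1))
X_seq[:, -1, 0] = [0.25, 0.95]  # baselines for two windows
y_pred = residual_to_prediction(X_seq, np.array([0.05, 0.10]))
print(y_pred)  # -> [0.3, 1.0] (second value clipped from 1.05)
```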
## Outputs

Depending on run mode, you should expect:

- Model weights: `checkpoints/<set>/<MODE>_model_mc.pt` (or similar)
- Scalers: `checkpoints/<set>/<MODE>_scalars.pkl`

Metrics CSVs, for example:

- Training per-site metrics: `results/<set>/AM_metrics_per_site_train.csv` (name depends on `MODE`)
- Test per-site metrics (MC): `results/<set>/AM_metrics_per_site_mc_test.csv`

Plots are generated by the evaluation scripts depending on flags and calls, for example:

- Scatter plots (observed vs. predicted)
- Uncertainty distribution
- Uncertainty time series
- Time series with uncertainty band
## Citation

If you use this code, please cite the IGARSS 2026 paper:

```bibtex
@inproceedings{rafique2026smap2farmnet,
  title     = {A deep learning framework for farm-scale soil moisture retrievals: A case study for a data-scarce region},
  author    = {Rafique, Hamza and Muhammad, Abubakr},
  booktitle = {Proceedings of the 2026 IEEE International Geoscience and Remote Sensing Symposium (IGARSS)},
  year      = {2026},
  note      = {Code: https://github.com/LUMS-WIT/SMAP2FarmNet}
}
```

The soil moisture sensor data used for this research is owned by the Centre for Water Informatics & Technology (WIT), LUMS, Lahore.
## Contact

For questions or collaborations, please contact the authors via [LinkedIn](https://www.linkedin.com/in/hamza-rafique-ac952) or email.