Interactive Remote Sensing Image Change Analysis via Instruction-guided Difference Perception
Pei Deng, Wenqian Zhou, Hanlin Wu*
DeltaVLM introduces Remote Sensing Image Change Analysis (RSICA), a paradigm that combines change detection and visual question answering to support multi-turn, instruction-guided exploration of bi-temporal remote sensing images.
Capabilities: change captioning, binary change classification, change quantification, change localization, open-ended QA, and multi-turn dialogue.
Architecture Details
- Bi-temporal Vision Encoder (Bi-VE): EVA-ViT-g/14 with the last 2 blocks fine-tuned
- Instruction-guided Difference Perception Module (IDPM) with Cross-Semantic Relation Measuring (CSRM): filters out irrelevant bi-temporal variations
- Instruction-guided Q-Former: Aligns visual differences with user instructions
- Frozen Vicuna-7B: Language decoder for response generation
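
The IDPM entry above is about suppressing bi-temporal variations that do not reflect real change. The toy, framework-free sketch below illustrates that general idea only: per-patch differences between "before" and "after" features are kept where the two features are semantically dissimilar and zeroed where they are nearly identical. Cosine similarity and the fixed `threshold` are stand-ins chosen for illustration; this is not the CSRM formulation from the paper, and `cosine_similarity` and `gated_difference` are hypothetical names.

```python
import math
from typing import List

Vector = List[float]


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine similarity between two feature vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def gated_difference(feats_a: List[Vector], feats_b: List[Vector],
                     threshold: float = 0.9) -> List[Vector]:
    """Hypothetical stand-in for relation-based filtering (NOT the paper's CSRM).

    For each patch, compute the raw difference (after - before). If the two
    temporal features are highly similar (cosine >= threshold), treat the
    variation as irrelevant and replace the difference with zeros.
    """
    if len(feats_a) != len(feats_b):
        raise ValueError("bi-temporal feature maps must have the same number of patches")
    out: List[Vector] = []
    for fa, fb in zip(feats_a, feats_b):
        diff = [b - a for a, b in zip(fa, fb)]
        keep = cosine_similarity(fa, fb) < threshold
        out.append(diff if keep else [0.0] * len(diff))
    return out


# Patch 0 only changes in magnitude (same direction), patch 1 changes direction.
before = [[1.0, 0.0], [0.0, 1.0]]
after = [[1.1, 0.0], [1.0, 0.0]]
print(gated_difference(before, after))
```

Here patch 0 has cosine similarity 1.0 and is suppressed, while patch 1 (similarity 0.0) keeps its difference `[1.0, -1.0]`. A learned model would replace the hand-set threshold with trained parameters.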
Installation

```bash
git clone https://github.com/hanlinwu/DeltaVLM.git
cd DeltaVLM
conda create -n deltavlm python=3.10 -y
conda activate deltavlm
pip install torch==2.0.1 torchvision==0.15.2 --index-url https://download.pytorch.org/whl/cu118
pip install -r requirements.txt
pip install -e .
```

Download pretrained weights

```bash
mkdir -p pretrained
# Vicuna-7B (LLM backbone)
huggingface-cli download lmsys/vicuna-7b-v1.5 --local-dir pretrained/vicuna-7b-v1.5
# BERT (Q-Former tokenizer)
huggingface-cli download bert-base-uncased --local-dir pretrained/bert-base-uncased
# DeltaVLM checkpoint
huggingface-cli download hanlinwu/DeltaVLM --local-dir pretrained/deltavlm
```

Inference

```bash
python scripts/predict.py \
    --image_A path/to/before.png \
    --image_B path/to/after.png \
    --checkpoint pretrained/deltavlm/checkpoint_best.pth \
    --llm_model pretrained/vicuna-7b-v1.5 \
    --bert_model pretrained/bert-base-uncased
```

Expected Output
```text
Using device: cuda
Loading model from pretrained/deltavlm/checkpoint_best.pth...
Model loaded successfully!
Preprocessing images...
Prompt: Please briefly describe the changes in these two images.
Generating response...
==================================================
Generated Description:
A new building has appeared in the lower right area.
==================================================
```
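
To call inference from Python, the sketch below wraps `scripts/predict.py` with `subprocess`, using only the flags shown in the command above, and extracts the text printed between `Generated Description:` and the closing row of `=` characters. It assumes the script is run from the repository root and prints the layout shown above to stdout; `build_predict_command`, `parse_description`, and `describe_changes` are hypothetical helper names, not part of the DeltaVLM codebase.

```python
import subprocess
import sys
from typing import List, Optional


def build_predict_command(image_a: str, image_b: str,
                          checkpoint: str = "pretrained/deltavlm/checkpoint_best.pth",
                          llm_model: str = "pretrained/vicuna-7b-v1.5",
                          bert_model: str = "pretrained/bert-base-uncased") -> List[str]:
    """Build argv for scripts/predict.py using only the flags documented in this README."""
    return [
        sys.executable, "scripts/predict.py",
        "--image_A", image_a,   # "before" image
        "--image_B", image_b,   # "after" image
        "--checkpoint", checkpoint,
        "--llm_model", llm_model,
        "--bert_model", bert_model,
    ]


def parse_description(stdout: str) -> Optional[str]:
    """Return the text after 'Generated Description:' up to the next '=' separator row."""
    lines = [line.strip() for line in stdout.splitlines()]
    if "Generated Description:" not in lines:
        return None
    start = lines.index("Generated Description:") + 1
    body = []
    for line in lines[start:]:
        if line and set(line) == {"="}:  # closing separator row
            break
        body.append(line)
    text = " ".join(part for part in body if part)  # join non-empty lines into one string
    return text or None


def describe_changes(image_a: str, image_b: str) -> Optional[str]:
    """Run predict.py (assumed to be launched from the repo root) and return the description."""
    result = subprocess.run(build_predict_command(image_a, image_b),
                            capture_output=True, text=True, check=True)
    return parse_description(result.stdout)
```

Using `sys.executable` instead of a bare `python` keeps the call inside whichever environment (for example the `deltavlm` conda env) is running the wrapper.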
Dataset

Download ChangeChat-105k:

```bash
python data/download_changechat.py --output_dir ./data/changechat
```

Expected Data Structure
```text
data/changechat/
├── images/
│   ├── train/
│   ├── val/
│   └── test/
└── annotations/
    ├── train.json
    ├── val.json
    └── test.json
```
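
A quick check that the download produced the layout above can save a failed training run. This is a minimal sketch that only tests for the directories and JSON files listed in the tree; it assumes nothing about the annotation schema, and `missing_paths` is a hypothetical helper.

```python
from pathlib import Path
from typing import List

SPLITS = ("train", "val", "test")


def missing_paths(root: str = "data/changechat") -> List[str]:
    """Return entries of the expected ChangeChat layout that are missing under `root`."""
    base = Path(root)
    expected = [base / "images" / split for split in SPLITS]
    expected += [base / "annotations" / f"{split}.json" for split in SPLITS]
    missing = []
    for path in expected:
        # Image splits must be directories; annotation entries must be files.
        present = path.is_file() if path.suffix == ".json" else path.is_dir()
        if not present:
            missing.append(path.relative_to(base).as_posix())
    return missing


if __name__ == "__main__":
    problems = missing_paths()
    print("layout OK" if not problems else f"missing: {problems}")
```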
Training

```bash
# Single GPU
python scripts/train.py --cfg_path configs/train_stage2.yaml
# Multi-GPU (4 GPUs)
torchrun --nproc_per_node=4 scripts/train.py --cfg_path configs/train_stage2.yaml
```

Evaluation

```bash
python scripts/evaluate.py --cfg_path configs/evaluate.yaml
```

Metrics:

| Metric | Description |
|---|---|
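| BLEU-1/2/3/4 | Geometric mean of clipped n-gram precisions up to n = 1, 2, 3, 4, with a brevity penalty |
| METEOR | Unigram alignment with stemming and synonym matching, combining precision and recall |
| ROUGE-L | F-measure based on the longest common subsequence |
| CIDEr | Consensus-based Image Description Evaluation: TF-IDF-weighted n-gram similarity to the reference captions |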
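
The BLEU rows above are built on clipped (modified) n-gram precision. The sketch below computes that building block for one candidate caption against several references, assuming lowercase whitespace tokenization. Full BLEU-n additionally takes the geometric mean of the precisions for orders 1 through n and applies a brevity penalty, and this toy is not the scorer used by `scripts/evaluate.py`; `ngrams` and `modified_precision` are illustrative names.

```python
from collections import Counter
from typing import List, Tuple


def ngrams(tokens: List[str], n: int) -> Counter:
    """Count all contiguous n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def modified_precision(candidate: str, references: List[str], n: int) -> Tuple[int, int]:
    """Return (clipped matches, total candidate n-grams); BLEU's p_n is their ratio."""
    cand = ngrams(candidate.lower().split(), n)
    # For each n-gram, the maximum count seen in any single reference.
    max_ref: Counter = Counter()
    for ref in references:
        for gram, count in ngrams(ref.lower().split(), n).items():
            max_ref[gram] = max(max_ref[gram], count)
    # Clip each candidate n-gram count by that maximum.
    matches = sum(min(count, max_ref[gram]) for gram, count in cand.items())
    return matches, sum(cand.values())


print(modified_precision("the the the the", ["the cat"], 1))
```

For the candidate `the the the the` against the reference `the cat`, clipping limits the unigram matches to 1 out of 4, so repeating a correct word cannot inflate the score.
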

ChangeChat-105k statistics by instruction type:

| Instruction Type | Train | Test |
|---|---|---|
| Change Captioning | 34,075 | 1,929 |
| Binary Classification | 6,815 | 1,929 |
| Change Quantification | 6,815 | 1,929 |
| Change Localization | 6,815 | 1,929 |
| Open-ended QA | 26,600 | 7,527 |
| Multi-turn Dialogue | 6,815 | 1,929 |
| Total | 87,935 | 17,172 |
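
The Total row can be checked directly from the per-type counts. This short script copies the numbers from the table above (which lists no validation split) and sums each column:

```python
# Per-instruction-type sample counts (train, test), copied from the table above.
COUNTS = {
    "Change Captioning": (34_075, 1_929),
    "Binary Classification": (6_815, 1_929),
    "Change Quantification": (6_815, 1_929),
    "Change Localization": (6_815, 1_929),
    "Open-ended QA": (26_600, 7_527),
    "Multi-turn Dialogue": (6_815, 1_929),
}

train_total = sum(train for train, _ in COUNTS.values())
test_total = sum(test for _, test in COUNTS.values())

print(f"train={train_total:,} test={test_total:,} all={train_total + test_total:,}")
```

It prints `train=87,935 test=17,172 all=105,107`, matching the Total row; the combined count is consistent with the ChangeChat-105k name. The captioning training count is also exactly five times the 6,815 shared by the four other single-count types.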
If you find this work useful, please cite:

```bibtex
@article{deltavlm2024,
  title={DeltaVLM: Interactive Remote Sensing Image Change Analysis via Instruction-guided Difference Perception},
  author={Deng, Pei and Zhou, Wenqian and Wu, Hanlin},
  journal={IEEE Transactions on Geoscience and Remote Sensing},
  year={2024}
}
```

License

This project is licensed under the MIT License. See the LICENSE file for details.

