This repository was archived by the owner on Nov 1, 2025. It is now read-only.
13 changes: 13 additions & 0 deletions R-EQA.sh
@@ -0,0 +1,13 @@
## ========================= HM3D =========================
### RAG
CUDA_VISIBLE_DEVICES=6,7,0,1,2,3 python openeqa/baselines/llama_rag.py --source hm3d -m meta-llama/Llama-3.1-70B --prompt vlm_rag --captioning-model qwen

### uniform sampling
CUDA_VISIBLE_DEVICES=6,7,0,1,2,3 python openeqa/baselines/llama_uniform_sampling.py --source hm3d -m meta-llama/Llama-3.1-70B --prompt vlm_uniform_sampling --captioning-model qwen

## ========================= scannet =========================
### RAG
python openeqa/baselines/llama_rag.py --source scannet -m meta-llama/Llama-3.1-70B --prompt ferret_rag --captioning-model ferret

### uniform sampling
python openeqa/baselines/llama_uniform_sampling.py --source scannet -m meta-llama/Llama-3.1-70B --prompt ferret_uniform_sampling --captioning-model ferret
87 changes: 76 additions & 11 deletions README.md
@@ -1,11 +1,33 @@
<div align="center">
<h2>R-EQA: Retrieval-Augmented Generation for Embodied Question Answering</h2>

[**Hyobin Ong**](https://scholar.google.co.kr/citations?user=_7yFVacAAAAJ&hl=ko)<sup>1,2</sup> [**Minsu Jang**](https://zebehn.github.io/)<sup>1,2†</sup>

<sup>1</sup>UST <sup>2</sup>ETRI

†corresponding author

**Accepted as a highlight paper at the CVPR 2025 Embodied AI Workshop ✨**

[paper](source/R-EQA.pdf)

![figure](source/cvprw_poster.jpg)
</div>

This codebase can be used for experiments on the OpenEQA benchmark. Please follow the OpenEQA setup instructions below.

# openEQA setup
<details>
<summary>openEQA</summary>

# OpenEQA: Embodied Question Answering in the Era of Foundation Models

[[paper](https://open-eqa.github.io/assets/pdfs/paper.pdf)]
[[project](https://open-eqa.github.io)]
[[dataset](data)]
[[bibtex](#citing-openeqa)]

https://github.com/facebookresearch/open-eqa/assets/10211521/1de3ded4-ff51-4ffe-801d-4abf269e4320

## Abstract

@@ -15,9 +37,9 @@ We present a modern formulation of Embodied Question Answering (EQA) as the task

The OpenEQA dataset consists of 1600+ question answer pairs $(Q,A^*)$ and corresponding episode histories $H$.

The question-answer pairs are available in [data/open-eqa-v0.json](data/open-eqa-v0.json) and the episode histories can be downloaded by following the instructions [here](data).

**Preview:** A simple tool to view samples in the dataset is provided [here](viewer).

## Baselines and Automatic Evaluation

@@ -30,43 +52,86 @@ conda create -n openeqa python=3.9
conda activate openeqa
pip install -r requirements.txt
pip install -e .

```

### Running baselines

Several baselines are implemented in [openeqa/baselines](openeqa/baselines). In general, baselines are run as follows:

```bash
# set an environment variable to your personal API key for the baseline
python openeqa/baselines/<baseline>.py --dry-run # remove --dry-run to process the full benchmark

```

See [openeqa/baselines/README.md](openeqa/baselines/README.md) for more details.

### Running evaluations

Automatic evaluation is implemented with GPT-4 using the prompts found [here](prompts/mmbench.txt) and [here](prompts/mmbench-extra.txt).

```bash
# set the OPENAI_API_KEY environment variable to your personal API key
python evaluate-predictions.py <path/to/results/file.json> --dry-run # remove --dry-run to evaluate on the full benchmark

```

## License

OpenEQA is released under the [MIT License](LICENSE).

## Contributors

Arjun Majumdar*, Anurag Ajay*, Xiaohan Zhang*, Pranav Putta, Sriram Yenamandra, Mikael Henaff, Sneha Silwal, Paul Mcvay, Oleksandr Maksymets, Sergio Arnaud, Karmesh Yadav, Qiyang Li, Ben Newman, Mohit Sharma, Vincent Berges, Shiqi Zhang, Pulkit Agrawal, Yonatan Bisk, Dhruv Batra, Mrinal Kalakrishnan, Franziska Meier, Chris Paxton, Sasha Sax, Aravind Rajeswaran

## Citing OpenEQA

```tex
@inproceedings{majumdar2023openeqa,
author={Arjun Majumdar, Anurag Ajay, Xiaohan Zhang, Pranav Putta, Sriram Yenamandra, Mikael Henaff, Sneha Silwal, Paul Mcvay, Oleksandr Maksymets, Sergio Arnaud, Karmesh Yadav, Qiyang Li, Ben Newman, Mohit Sharma, Vincent Berges, Shiqi Zhang, Pulkit Agrawal, Yonatan Bisk, Dhruv Batra, Mrinal Kalakrishnan, Franziska Meier, Chris Paxton, Sasha Sax, Aravind Rajeswaran},
title={{OpenEQA: Embodied Question Answering in the Era of Foundation Models}},
booktitle={{CVPR}},
year={2024},
}

```

</details>



Once the openEQA setup is complete, including ScanNet and HM3D, you can run the pipeline as follows:

# Setup for Inference
First, generate image captions and embeddings for all frames.
(Note: adjust the arguments to your needs. This step may take a significant amount of time, since it captions every frame.)

```bash
# image captioning
python openeqa/baselines/captioning_qwen.py

# embedding
python extract_emb.py

```
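
Each embedding is stored as a pickle next to its caption file (see `extract_emb.py` below). As a minimal sketch, assuming the default output layout, a saved record can be inspected like this (the path below is a hypothetical example):

```python
import pickle
from pathlib import Path

# Hypothetical example path; point this at any *qwen.pkl produced by extract_emb.py.
pkl_path = Path("data/frames/hm3d-v0/000-sample/0000-qwen.pkl")

with pkl_path.open("rb") as f:
    record = pickle.load(f)

# Each record holds the SBERT embedding of one frame caption, the caption path,
# and the caption's Llama token count.
print(record["text_traj_path"], record["token_count"], record["embedding"].shape)
```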

# Inference
Second, embed each question from `data/open-eqa-v0.json` and compute the cosine similarity with the embeddings of the episode history.

Finally, convert the top-3 most similar captions into natural language and pass them as in-context examples in the LLM input prompt (a sketch of this retrieval step follows the commands below).

```bash
# using RAG
python openeqa/baselines/llama_rag.py

# using Uniform Sampling
python openeqa/baselines/llama_uniform_sampling.py

```
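
For reference, here is a minimal sketch of the retrieval step, assuming the SBERT model used in `extract_emb.py` (`all-MiniLM-L6-v2`); the function and variable names are illustrative, not the exact ones in `llama_rag.py`:

```python
import numpy as np
from sentence_transformers import SentenceTransformer

sbert = SentenceTransformer("all-MiniLM-L6-v2")

def top_k_captions(question, captions, caption_embeddings, k=3):
    """Return the k captions most similar to the question by cosine similarity."""
    q = sbert.encode(question)
    emb = np.asarray(caption_embeddings)
    sims = emb @ q / (np.linalg.norm(emb, axis=1) * np.linalg.norm(q) + 1e-8)
    top = np.argsort(-sims)[:k]
    return [captions[i] for i in top]

# The selected captions are then written into the LLM prompt as in-context examples.
```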

# Evaluation
Evaluation is performed using `evaluate-predictions.py` provided by openEQA.
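
The final score maps each GPT-4 evaluation score (1–5) linearly to 0–100 and averages over questions, mirroring the computation in `evaluate-predictions.py`; a minimal sketch:

```python
import numpy as np

def aggregate(scores_1_to_5):
    """Map per-question scores in [1, 5] to [0, 100] and average."""
    scores = np.clip(np.asarray(scores_1_to_5, dtype=float), 1, 5)
    return float(np.mean(100.0 * (scores - 1) / 4))

# Example: aggregate([5, 3, 1]) == 50.0
```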

# FAQ
Please feel free to contact us (ohnghb@etri.re.kr) with any questions or concerns.
12 changes: 8 additions & 4 deletions evaluate-predictions.py
@@ -78,17 +78,19 @@ def main(args: argparse.Namespace):
    assert set(dataset_question_ids) == set(results_question_ids)

    # load scores
    all_scores = []
    if args.output_path.exists():
        all_scores = json.load(args.output_path.open("r"))
        print("found {:,} existing scores".format(len(all_scores)))
    complete_question_id = [item["question_id"] for item in all_scores]

    # evaluate predictions
    for idx, item in enumerate(tqdm(results)):
        if args.dry_run and idx >= 5:
            break

        question_id = item["question_id"]
        if question_id in complete_question_id:
            continue

        item = question_id_to_item[question_id]
@@ -112,8 +114,10 @@
        all_scores.append({"question_id": question_id, "score": score})
        json.dump(all_scores, args.output_path.open("w"), indent=2)

    all_scores_converted = {item["question_id"]: item["score"] for item in all_scores}

    # calculate final score
    scores = np.array(list(all_scores_converted.values()))
    scores = 100.0 * (np.clip(scores, 1, 5) - 1) / 4
    print("final score: {:.1f}".format(np.mean(scores)))

102 changes: 102 additions & 0 deletions extract_emb.py
@@ -0,0 +1,102 @@
from transformers import AutoTokenizer
from sentence_transformers import SentenceTransformer
from pathlib import Path
from tqdm import tqdm
import os
import re
import json
import pickle
import argparse


def parse_args() -> argparse.Namespace:
    parser = argparse.ArgumentParser()
    parser.add_argument(
        '--dataset',
        type=Path,
        default='data/open-eqa-v0.json',
    )
    parser.add_argument(
        '--output_directory',
        type=Path,
        default='data/results',
    )
    parser.add_argument(
        "--frames-directory",
        type=Path,
        default="data/frames/",
        help="path to image frames (default: data/frames/)",
    )

    args = parser.parse_args()

    return args


def extract_emb(sbert, tokenizer, path, save_dir):
    """Extract a sentence embedding for the task goal from a txt trajectory file."""
    # NOTE: this helper is not called from main(); parsing_text_traj is assumed
    # to be defined elsewhere.
    if not os.path.exists(save_dir):
        os.makedirs(save_dir)

    with open(path) as file:
        text_traj = file.read()

    parsing_result = parsing_text_traj(text_traj)
    task_goal_text = parsing_result['task_goal']
    goal_embedding = sbert.encode(task_goal_text.split('Your task is to: ')[1])

    tokens = tokenizer(text_traj)['input_ids']
    token_count = len(tokens)
    encode_name = path.split('/')[-1].replace('.txt', '.pkl')

    encoding = {'text_trajectory': text_traj,
                'embedding': goal_embedding,
                'text_traj_path': path,
                'token_count': token_count}

    em_encod_path = os.path.join(save_dir, encode_name)

    with open(em_encod_path, 'wb') as pickle_file:
        pickle.dump(encoding, pickle_file)


def main(args: argparse.Namespace):
    embedding_model = 'all-MiniLM-L6-v2'
    sbert = SentenceTransformer(embedding_model)
    tokenizer = AutoTokenizer.from_pretrained('meta-llama/Llama-3.1-8B')

    dataset = json.load(args.dataset.open("r"))

    for idx, item in enumerate(tqdm(dataset)):
        # extract scene paths (currently only HM3D episodes are processed)
        if 'hm3d' in item["episode_history"]:
            folder = args.frames_directory / item["episode_history"]
            frames = sorted(folder.glob("*qwen.txt"))
            paths = [str(frames[i]) for i in range(len(frames))]

            for text_path in tqdm(paths):
                with open(text_path) as file:
                    text_traj = file.read()
                embedding = sbert.encode(text_traj)

                tokens = tokenizer(text_traj)['input_ids']
                token_count = len(tokens)
                encode_name = text_path.split('/')[-1].replace('.txt', '.pkl')

                encoding = {'embedding': embedding,
                            'text_traj_path': text_path,
                            'token_count': token_count}

                save_dir = os.path.join(folder, encode_name)

                if os.path.exists(save_dir):
                    print(f'{save_dir} already exists, skipping')
                else:
                    with open(save_dir, 'wb') as pickle_file:
                        pickle.dump(encoding, pickle_file)
                    print(f'saved: {save_dir}')


if __name__ == "__main__":
    main(parse_args())