This repository was archived by the owner on Nov 1, 2025. It is now read-only.
13 changes: 13 additions & 0 deletions R-EQA.sh
@@ -0,0 +1,13 @@
## ========================= HM3D =========================
### RAG
CUDA_VISIBLE_DEVICES=6,7,0,1,2,3 python openeqa/baselines/llama_rag.py --source hm3d -m meta-llama/Llama-3.1-70B --prompt vlm_rag --captioning-model qwen

### uniform sampling
CUDA_VISIBLE_DEVICES=6,7,0,1,2,3 python openeqa/baselines/llama_uniform_sampling.py --source hm3d -m meta-llama/Llama-3.1-70B --prompt vlm_uniform_sampling --captioning-model qwen

## ========================= scannet =========================
### RAG
python openeqa/baselines/llama_rag.py --source scannet -m meta-llama/Llama-3.1-70B --prompt ferret_rag --captioning-model ferret

### uniform sampling
python openeqa/baselines/llama_uniform_sampling.py --source scannet -m meta-llama/Llama-3.1-70B --prompt ferret_uniform_sampling --captioning-model ferret
87 changes: 76 additions & 11 deletions README.md
@@ -1,11 +1,33 @@
<div align="center">
<h2>R-EQA: Retrieval-Augmented Generation for Embodied Question Answering</h2>

[**Hyobin Ong**](https://scholar.google.co.kr/citations?user=_7yFVacAAAAJ&hl=ko)<sup>1,2</sup> [**Minsu Jang**](https://zebehn.github.io/)<sup>1,2†</sup>

<sup>1</sup>UST <sup>2</sup>ETRI

†corresponding author

**Accepted as a highlight paper at the CVPR 2025 Embodied AI Workshop ✨**

[paper](source/R-EQA.pdf)

![figure](source/cvprw_poster.jpg)
</div>

This codebase can be used for experiments on the OpenEQA benchmark. Please follow the OpenEQA setup instructions below.

# openEQA setup
<details>
<summary>openEQA</summary>

# OpenEQA: Embodied Question Answering in the Era of Foundation Models

[[paper](https://open-eqa.github.io/assets/pdfs/paper.pdf)]
[[project](https://open-eqa.github.io)]
[[dataset](data)]
[[bibtex](#citing-openeqa)]

https://github.com/facebookresearch/open-eqa/assets/10211521/1de3ded4-ff51-4ffe-801d-4abf269e4320

## Abstract

@@ -15,9 +37,9 @@ We present a modern formulation of Embodied Question Answering (EQA) as the task

The OpenEQA dataset consists of 1600+ question answer pairs $(Q,A^*)$ and corresponding episode histories $H$.

The question-answer pairs are available in [data/open-eqa-v0.json](data/open-eqa-v0.json) and the episode histories can be downloaded by following the instructions [here](data).

**Preview:** A simple tool to view samples in the dataset is provided [here](viewer).

## Baselines and Automatic Evaluation

@@ -30,43 +52,86 @@ conda create -n openeqa python=3.9
conda activate openeqa
pip install -r requirements.txt
pip install -e .

```

### Running baselines

Several baselines are implemented in [openeqa/baselines](openeqa/baselines). In general, baselines are run as follows:

```bash
# set an environment variable to your personal API key for the baseline
python openeqa/baselines/<baseline>.py --dry-run # remove --dry-run to process the full benchmark

```

See [openeqa/baselines/README.md](openeqa/baselines/README.md) for more details.

### Running evaluations

Automatic evaluation is implemented with GPT-4 using the prompts found [here](prompts/mmbench.txt) and [here](prompts/mmbench-extra.txt).

```bash
# set the OPENAI_API_KEY environment variable to your personal API key
python evaluate-predictions.py <path/to/results/file.json> --dry-run # remove --dry-run to evaluate on the full benchmark

```

## License

OpenEQA is released under the [MIT License](LICENSE).

## Contributors

Arjun Majumdar*, Anurag Ajay*, Xiaohan Zhang*, Pranav Putta, Sriram Yenamandra, Mikael Henaff, Sneha Silwal, Paul Mcvay, Oleksandr Maksymets, Sergio Arnaud, Karmesh Yadav, Qiyang Li, Ben Newman, Mohit Sharma, Vincent Berges, Shiqi Zhang, Pulkit Agrawal, Yonatan Bisk, Dhruv Batra, Mrinal Kalakrishnan, Franziska Meier, Chris Paxton, Sasha Sax, Aravind Rajeswaran

## Citing OpenEQA

```tex
@inproceedings{majumdar2023openeqa,
author={Arjun Majumdar, Anurag Ajay, Xiaohan Zhang, Pranav Putta, Sriram Yenamandra, Mikael Henaff, Sneha Silwal, Paul Mcvay, Oleksandr Maksymets, Sergio Arnaud, Karmesh Yadav, Qiyang Li, Ben Newman, Mohit Sharma, Vincent Berges, Shiqi Zhang, Pulkit Agrawal, Yonatan Bisk, Dhruv Batra, Mrinal Kalakrishnan, Franziska Meier, Chris Paxton, Sasha Sax, Aravind Rajeswaran},
title={{OpenEQA: Embodied Question Answering in the Era of Foundation Models}},
booktitle={{CVPR}},
year={2024},
}

```

</details>



Once the openEQA setup is complete, including ScanNet and HM3D, you can run the pipeline as follows:

# Setup for Inference
First, generate image captions and embeddings for all frames.
(Note: adjust the arguments to your needs. This step may take a significant amount of time, since it captions every frame.)

```bash
# image captioning
python openeqa/baselines/captioning_qwen.py

# embedding
python extract_emb.py

```
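
Each embedding is stored as a pickle next to its caption file (see `extract_emb.py` below). As a minimal sketch, assuming the default output layout, a saved record can be inspected like this (the path below is a hypothetical example):

```python
import pickle
from pathlib import Path

# Hypothetical example path; point this at any *qwen.pkl produced by extract_emb.py.
pkl_path = Path("data/frames/hm3d-v0/000-sample/0000-qwen.pkl")

with pkl_path.open("rb") as f:
    record = pickle.load(f)

# Each record holds the SBERT embedding of one frame caption, the caption path,
# and the caption's Llama token count.
print(record["text_traj_path"], record["token_count"], record["embedding"].shape)
```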

# Inference
Second, embed each question from `data/open-eqa-v0.json` and compute the cosine similarity with the embeddings of the episode history.

Finally, convert the top-3 most similar captions into natural language and pass them as in-context examples in the LLM input prompt (a sketch of this retrieval step follows the commands below).

```bash
# using RAG
python openeqa/baselines/llama_rag.py

# using Uniform Sampling
python openeqa/baselines/llama_uniform_sampling.py

```
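
For reference, here is a minimal sketch of the retrieval step, assuming the SBERT model used in `extract_emb.py` (`all-MiniLM-L6-v2`); the function and variable names are illustrative, not the exact ones in `llama_rag.py`:

```python
import numpy as np
from sentence_transformers import SentenceTransformer

sbert = SentenceTransformer("all-MiniLM-L6-v2")

def top_k_captions(question, captions, caption_embeddings, k=3):
    """Return the k captions most similar to the question by cosine similarity."""
    q = sbert.encode(question)
    emb = np.asarray(caption_embeddings)
    sims = emb @ q / (np.linalg.norm(emb, axis=1) * np.linalg.norm(q) + 1e-8)
    top = np.argsort(-sims)[:k]
    return [captions[i] for i in top]

# The selected captions are then written into the LLM prompt as in-context examples.
```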

# Evaluation
Evaluation is performed using `evaluate-predictions.py` provided by openEQA.
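
The final score maps each GPT-4 evaluation score (1–5) linearly to 0–100 and averages over questions, mirroring the computation in `evaluate-predictions.py`; a minimal sketch:

```python
import numpy as np

def aggregate(scores_1_to_5):
    """Map per-question scores in [1, 5] to [0, 100] and average."""
    scores = np.clip(np.asarray(scores_1_to_5, dtype=float), 1, 5)
    return float(np.mean(100.0 * (scores - 1) / 4))

# Example: aggregate([5, 3, 1]) == 50.0
```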

# FAQ
Please feel free to contact us (ohnghb@etri.re.kr) with any questions or concerns.
12 changes: 8 additions & 4 deletions evaluate-predictions.py
@@ -78,17 +78,19 @@ def main(args: argparse.Namespace):
    assert set(dataset_question_ids) == set(results_question_ids)

    # load scores
    all_scores = []
    if args.output_path.exists():
        all_scores = json.load(args.output_path.open("r"))
        print("found {:,} existing scores".format(len(all_scores)))
    complete_question_id = [item["question_id"] for item in all_scores]

    # evaluate predictions
    for idx, item in enumerate(tqdm(results)):
        if args.dry_run and idx >= 5:
            break

        question_id = item["question_id"]
        if question_id in complete_question_id:
            continue

        item = question_id_to_item[question_id]
@@ -112,8 +114,10 @@
        all_scores.append({"question_id": question_id, "score": score})
        json.dump(all_scores, args.output_path.open("w"), indent=2)

    all_scores_converted = {item["question_id"]: item["score"] for item in all_scores}

    # calculate final score
    scores = np.array(list(all_scores_converted.values()))
    scores = 100.0 * (np.clip(scores, 1, 5) - 1) / 4
    print("final score: {:.1f}".format(np.mean(scores)))

102 changes: 102 additions & 0 deletions extract_emb.py
@@ -0,0 +1,102 @@
from transformers import AutoTokenizer
from sentence_transformers import SentenceTransformer
from pathlib import Path
from tqdm import tqdm
import os
import re
import json
import pickle
import argparse


def parse_args() -> argparse.Namespace:
    parser = argparse.ArgumentParser()
    parser.add_argument(
        '--dataset',
        type=Path,
        default='data/open-eqa-v0.json',
    )
    parser.add_argument(
        '--output_directory',
        type=Path,
        default='data/results',
    )
    parser.add_argument(
        "--frames-directory",
        type=Path,
        default="data/frames/",
        help="path to image frames (default: data/frames/)",
    )

    args = parser.parse_args()

    return args


def extract_emb(sbert, tokenizer, path, save_dir):
    """Extract a sentence embedding for the task goal from a txt trajectory file."""
    # NOTE: this helper is not called from main(); parsing_text_traj is assumed
    # to be defined elsewhere.
    if not os.path.exists(save_dir):
        os.makedirs(save_dir)

    with open(path) as file:
        text_traj = file.read()

    parsing_result = parsing_text_traj(text_traj)
    task_goal_text = parsing_result['task_goal']
    goal_embedding = sbert.encode(task_goal_text.split('Your task is to: ')[1])

    tokens = tokenizer(text_traj)['input_ids']
    token_count = len(tokens)
    encode_name = path.split('/')[-1].replace('.txt', '.pkl')

    encoding = {'text_trajectory': text_traj,
                'embedding': goal_embedding,
                'text_traj_path': path,
                'token_count': token_count}

    em_encod_path = os.path.join(save_dir, encode_name)

    with open(em_encod_path, 'wb') as pickle_file:
        pickle.dump(encoding, pickle_file)


def main(args: argparse.Namespace):
    embedding_model = 'all-MiniLM-L6-v2'
    sbert = SentenceTransformer(embedding_model)
    tokenizer = AutoTokenizer.from_pretrained('meta-llama/Llama-3.1-8B')

    dataset = json.load(args.dataset.open("r"))

    for idx, item in enumerate(tqdm(dataset)):
        # extract scene paths (currently only HM3D episodes are processed)
        if 'hm3d' in item["episode_history"]:
            folder = args.frames_directory / item["episode_history"]
            frames = sorted(folder.glob("*qwen.txt"))
            paths = [str(frames[i]) for i in range(len(frames))]

            for text_path in tqdm(paths):
                with open(text_path) as file:
                    text_traj = file.read()
                embedding = sbert.encode(text_traj)

                tokens = tokenizer(text_traj)['input_ids']
                token_count = len(tokens)
                encode_name = text_path.split('/')[-1].replace('.txt', '.pkl')

                encoding = {'embedding': embedding,
                            'text_traj_path': text_path,
                            'token_count': token_count}

                save_dir = os.path.join(folder, encode_name)

                if os.path.exists(save_dir):
                    print(f'{save_dir} already exists, skipping')
                else:
                    with open(save_dir, 'wb') as pickle_file:
                        pickle.dump(encoding, pickle_file)
                    print(f'saved: {save_dir}')


if __name__ == "__main__":
    main(parse_args())