diff --git a/R-EQA.sh b/R-EQA.sh new file mode 100644 index 0000000..2b866f8 --- /dev/null +++ b/R-EQA.sh @@ -0,0 +1,13 @@ +## ========================= HM3D ========================= +#### RAG +CUDA_VISIBLE_DEVICES=6,7,0,1,2,3 python openeqa/baselines/llama_rag.py --source hm3d -m meta-llama/Llama-3.1-70B --prompt vlm_rag --captioning-model qwen + +### uniform sampling +CUDA_VISIBLE_DEVICES=6,7,0,1,2,3 python openeqa/baselines/llama_uniform_sampling.py --source hm3d -m meta-llama/Llama-3.1-70B --prompt vlm_uniform_sampling --captioning-model qwen + +## ========================= ScanNet ========================= +#### RAG +python openeqa/baselines/llama_rag.py --source scannet -m meta-llama/Llama-3.1-70B --prompt ferret_rag --captioning-model ferret + +### uniform sampling +python openeqa/baselines/llama_uniform_sampling.py --source scannet -m meta-llama/Llama-3.1-70B --prompt ferret_uniform_sampling --captioning-model ferret \ No newline at end of file diff --git a/README.md b/README.md index 74ef740..98a15b5 100644 --- a/README.md +++ b/README.md @@ -1,11 +1,33 @@ +
+

R-EQA: Retrieval-Augmented Generation for Embodied Question Answering

+ +[**Hyobin Ong**](https://scholar.google.co.kr/citations?user=_7yFVacAAAAJ&hl=ko)1,2 [**Minsu Jang**](https://zebehn.github.io/)1,2† + +1UST 2ETRI + +†Corresponding author + +**Accepted at the CVPR 2025 Embodied AI Workshop (Paper Highlights) ✨** + +[paper](source/R-EQA.pdf) + +![figure](source/cvprw_poster.jpg) +
+ +This code can be used to run experiments on the OpenEQA benchmark. Please follow the OpenEQA setup instructions below. + +# OpenEQA setup +
+openEQA + # OpenEQA: Embodied Question Answering in the Era of Foundation Models [[paper](https://open-eqa.github.io/assets/pdfs/paper.pdf)] -[[project](https://open-eqa.github.io)] -[[dataset](data)] -[[bibtex](#citing-openeqa)] +[[project](https://open-eqa.github.io/)] +[[dataset](data)] +[[bibtex](#citing-openeqa)] - +https://github.com/facebookresearch/open-eqa/assets/10211521/1de3ded4-ff51-4ffe-801d-4abf269e4320 ## Abstract @@ -15,9 +37,9 @@ We present a modern formulation of Embodied Question Answering (EQA) as the task The OpenEQA dataset consists of 1600+ question answer pairs $(Q,A^*)$ and corresponding episode histories $H$. -The question-answer pairs are available in [data/open-eqa-v0.json](data/open-eqa-v0.json) and the episode histories can be downloaded by following the instructions [here](data). +The question-answer pairs are available in [data/open-eqa-v0.json](data/open-eqa-v0.json) and the episode histories can be downloaded by following the instructions [here](data). -**Preview:** A simple tool to view samples in the dataset is provided [here](viewer). +**Preview:** A simple tool to view samples in the dataset is provided [here](viewer). ## Baselines and Automatic Evaluation @@ -30,31 +52,34 @@ conda create -n openeqa python=3.9 conda activate openeqa pip install -r requirements.txt pip install -e . + ``` ### Running baselines -Several baselines are implemented in [openeqa/baselines](openeqa/baselines). In general, baselines are run as follows: +Several baselines are implemented in [openeqa/baselines](openeqa/baselines). In general, baselines are run as follows: ```bash # set an environment variable to your personal API key for the baseline python openeqa/baselines/.py --dry-run # remove --dry-run to process the full benchmark + ``` -See [openeqa/baselines/README.md](openeqa/baselines/README.md) for more details. +See [openeqa/baselines/README.md](openeqa/baselines/README.md) for more details. ### Running evaluations -Automatic evaluation is implemented with GPT-4 using the prompts found [here](prompts/mmbench.txt) and [here](prompts/mmbench-extra.txt). +Automatic evaluation is implemented with GPT-4 using the prompts found [here](prompts/mmbench.txt) and [here](prompts/mmbench-extra.txt). ```bash # set the OPENAI_API_KEY environment variable to your personal API key python evaluate-predictions.py --dry-run # remove --dry-run to evaluate on the full benchmark + ``` ## License -OpenEQA is released under the [MIT License](LICENSE). +OpenEQA is released under the [MIT License](LICENSE). ## Contributors @@ -62,11 +87,51 @@ Arjun Majumdar*, Anurag Ajay*, Xiaohan Zhang*, Pranav Putta, Sriram Yenamandra, ## Citing OpenEQA -```tex +``` @inproceedings{majumdar2023openeqa, author={Arjun Majumdar, Anurag Ajay, Xiaohan Zhang, Pranav Putta, Sriram Yenamandra, Mikael Henaff, Sneha Silwal, Paul Mcvay, Oleksandr Maksymets, Sergio Arnaud, Karmesh Yadav, Qiyang Li, Ben Newman, Mohit Sharma, Vincent Berges, Shiqi Zhang, Pulkit Agrawal, Yonatan Bisk, Dhruv Batra, Mrinal Kalakrishnan, Franziska Meier, Chris Paxton, Sasha Sax, Aravind Rajeswaran}, title={{OpenEQA: Embodied Question Answering in the Era of Foundation Models}}, booktitle={{CVPR}}, year={2024}, } + ``` + +
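For orientation before the R-EQA-specific instructions below: the RAG pipeline retrieves, for each question, the frame captions whose sentence embeddings are most similar to the question embedding, and feeds the top-3 captions to the LLM as in-context examples. The snippet below is only a minimal sketch of that retrieval step, assuming the per-frame pickles written by `extract_emb.py` (keys `embedding`, `text_traj_path`, `token_count`) and the `all-MiniLM-L6-v2` sentence encoder; the helper name `retrieve_top_captions` is illustrative, and the actual implementation lives in `openeqa/baselines/llama_rag.py`.

```python
# Minimal sketch of the caption-retrieval step used for RAG (see "Inference" below).
# Assumes the per-frame pickles written by extract_emb.py, each storing
# {'embedding': ..., 'text_traj_path': ..., 'token_count': ...}.
import pickle
from pathlib import Path

import numpy as np
from sentence_transformers import SentenceTransformer


def retrieve_top_captions(question: str, episode_dir: Path, top_k: int = 3):
    sbert = SentenceTransformer("all-MiniLM-L6-v2")  # same encoder as extract_emb.py
    q_emb = sbert.encode(question)

    captions, embeddings = [], []
    for pkl_path in sorted(episode_dir.glob("*qwen.pkl")):
        with open(pkl_path, "rb") as f:
            entry = pickle.load(f)
        embeddings.append(entry["embedding"])
        # the caption text itself lives in the matching *-qwen.txt file
        captions.append(Path(entry["text_traj_path"]).read_text())

    emb = np.stack(embeddings)
    # cosine similarity between the question and every frame caption
    sims = emb @ q_emb / (np.linalg.norm(emb, axis=1) * np.linalg.norm(q_emb) + 1e-8)
    top_idx = np.argsort(sims)[::-1][:top_k]
    return [captions[i] for i in top_idx]
```

The uniform-sampling baseline (`llama_uniform_sampling.py`) skips this similarity search and builds the prompt from frames selected without retrieval.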
+ + + +Once the OpenEQA setup is complete, including ScanNet and HM3D, you can run the pipeline as follows: + +# Setup for Inference +First, generate image captions and embeddings for all frames. +(Note: the arguments should be adjusted to your needs. This step may take a significant amount of time, as it captions every frame.) + +``` +# image captioning +python openeqa/baselines/captioning_qwen.py + +# embedding +python extract_emb.py + +``` + +# Inference +Second, embed each question from `data/open-eqa-v0.json` and compute the cosine similarity with the caption embeddings of the episode history. + +Finally, convert the top-3 most similar captions into natural language and pass them as in-context examples in the LLM input prompt (see the retrieval sketch above). + +``` +# using RAG +python openeqa/baselines/llama_rag.py + +# using Uniform Sampling +python openeqa/baselines/llama_uniform_sampling.py + +``` + +# Evaluation +Evaluation is performed using `evaluate-predictions.py` provided by OpenEQA. + +# FAQ +Please feel free to contact us (ohnghb@etri.re.kr) with any questions or concerns. \ No newline at end of file diff --git a/evaluate-predictions.py b/evaluate-predictions.py index c025d57..c35c0e5 100644 --- a/evaluate-predictions.py +++ b/evaluate-predictions.py @@ -78,17 +78,19 @@ def main(args: argparse.Namespace): assert set(dataset_question_ids) == set(results_question_ids) # load scores - all_scores = {} + all_scores = [] if args.output_path.exists(): all_scores = json.load(args.output_path.open("r")) print("found {:,} existing scores".format(len(all_scores))) + complete_question_id = [item["question_id"] for item in all_scores] # evaluate predictions - for idx, question_id in enumerate(tqdm(results_question_ids)): + for idx, item in enumerate(tqdm(results)): if args.dry_run and idx >= 5: break - if question_id in all_scores: + question_id = item["question_id"] + if question_id in complete_question_id: continue item = question_id_to_item[question_id] @@ -112,8 +114,10 @@ def main(args: argparse.Namespace): - all_scores[question_id] = score + all_scores.append({"question_id": question_id, "score": score}) json.dump(all_scores, args.output_path.open("w"), indent=2) + all_scores_converted = {item["question_id"]: item["score"] for item in all_scores} + # calculate final score - scores = np.array(list(all_scores.values())) + scores = np.array(list(all_scores_converted.values())) scores = 100.0 * (np.clip(scores, 1, 5) - 1) / 4 print("final score: {:.1f}".format(np.mean(scores))) diff --git a/extract_emb.py b/extract_emb.py new file mode 100644 index 0000000..dca6e8a --- /dev/null +++ b/extract_emb.py @@ -0,0 +1,102 @@ +from transformers import AutoTokenizer +from sentence_transformers import SentenceTransformer +from pathlib import Path +from tqdm import tqdm +import os +import re +import json +import pickle +import argparse + +def parse_args() -> argparse.Namespace: + parser = argparse.ArgumentParser() + parser.add_argument( + '--dataset', + type=Path, + default='data/open-eqa-v0.json', + ) + parser.add_argument( + '--output_directory', + type=Path, + default='data/results', + ) + parser.add_argument( + "--frames-directory", + type=Path, + default="data/frames/", + help="path to image frames (default: data/frames/)", + ) + + args = parser.parse_args() + + return args + +def extract_emb(sbert, tokenizer, path, save_dir): + """Extract sentence embeddings from txt trajectory files.""" + if not os.path.exists(save_dir): + os.makedirs(save_dir) + + with open(path) as file: + text_traj = file.read() + + parsing_result = parsing_text_traj(text_traj) + task_goal_text = 
parsing_result['task_goal'] + goal_embedding = sbert.encode(task_goal_text.split('Your task is to: ')[1]) + + tokens = tokenizer(text_traj)['input_ids'] + token_count = len(tokens) + encode_name = path.split('/')[-1].replace('.txt', '.pkl') + + encoding = {'text_trajectory': text_traj, + 'embedding': goal_embedding, + 'text_traj_path': path, + 'token_count': token_count} + + em_encod_path = os.path.join(save_dir, encode_name) + + + with open(em_encod_path, 'wb') as pickle_file: + pickle.dump(encoding, pickle_file) + +def main(args: argparse.Namespace): + embedding_model='all-MiniLM-L6-v2' + sbert = SentenceTransformer(embedding_model) + tokenizer = AutoTokenizer.from_pretrained('meta-llama/Llama-3.1-8B') + + dataset = json.load(args.dataset.open("r")) + + for idx, item in enumerate(tqdm(dataset)): + # extact scene paths + if 'hm3d' in item["episode_history"]: + # pass + # else: + folder = args.frames_directory / item["episode_history"] + frames = sorted(folder.glob("*qwen.txt")) + paths = [str(frames[i]) for i in range(len(frames))] + + for text_path in tqdm(paths): + with open(text_path) as file: + text_traj = file.read() + embedding = sbert.encode(text_traj) + + tokens = tokenizer(text_traj) + token_count = len(tokens) + encode_name = text_path.split('/')[-1].replace('.txt', '.pkl') + + encoding = {'embedding' : embedding, + 'text_traj_path' : text_path, + 'token_count' : token_count} + + save_dir = os.path.join(folder, encode_name) + + if os.path.exists(save_dir): + print(f'{save_dir} is existing file') + pass + else: + with open(save_dir, 'wb') as pickle_file: + pickle.dump(encoding, pickle_file) + print(f'saved: {save_dir}') + + +if __name__=="__main__": + main(parse_args()) \ No newline at end of file diff --git a/openeqa/baselines/captioning_qwen.py b/openeqa/baselines/captioning_qwen.py new file mode 100644 index 0000000..94da36a --- /dev/null +++ b/openeqa/baselines/captioning_qwen.py @@ -0,0 +1,214 @@ +from transformers import Qwen2_5_VLForConditionalGeneration, AutoTokenizer, AutoProcessor +from qwen_vl_utils import process_vision_info + +import argparse +import json +import os +import traceback +from pathlib import Path +from typing import List, Optional +from PIL import Image +import numpy as np +import tqdm +import time, datetime + +def parse_args() -> argparse.Namespace: + parser = argparse.ArgumentParser() + parser.add_argument( + '--model_path', + type=Path, + default='data/open-eqa-v0.json', + ) + parser.add_argument( + '--model_base', + ) + parser.add_argument( + '--image_path', + type=Path, + default='data/refcoco/train2014', + ) + parser.add_argument( + '--data_path', + type=Path, + default='data/annotations/finetune_refcoco_testA.json', + ) + parser.add_argument( + '--answers_file', + type=Path, + default='refexp_result/refcoco_testA', + ) + parser.add_argument( + '--conv_mode', + type=str, + default='llava_v1', + ) + parser.add_argument( + '--num_chunks', + type=int, + default='1', + ) + parser.add_argument( + '--chunk_idx', + type=int, + default='0', + ) + parser.add_argument( + '--image_w', + type=int, + default='336', + ) + parser.add_argument( + '--image_h', + type=int, + default='336', + ) + parser.add_argument( + '--add_region_feature', + type=str, + default='True', + ) + parser.add_argument( + '--temperature', + type=float, + default='1', + ) + parser.add_argument( + '--top_p', + ) + parser.add_argument( + '--num_beams', + type=int, + default='1', + ) + parser.add_argument( + '--dataset', + type=Path, + default='data/open-eqa-v0.json', + ) + 
parser.add_argument( + '--output_directory', + type=Path, + default='data/results', + ) + parser.add_argument( + "--dry-run", + action="store_true", + help="only process the first 5 questions", + ) + parser.add_argument( + "--frames-directory", + type=Path, + default="data/frames/", + help="path image frames (default: data/frames/)", + ) + parser.add_argument( + "--num-frames", + type=int, + default=4, + help="num frames in gpt4v (default: 50)", + ) + parser.add_argument( + "--single-image", + action="store_true", + ) + args = parser.parse_args() + args.output_directory.mkdir(parents=True, exist_ok=True) + args.output_path = args.output_directory / (args.dataset.stem + "-qwen.json") + + return args + + +def ask_question(args, frame, model, processor): + messages = [ + { + "role": "user", + "content": [ + { + "type": "image", + "image": f"{frame}", + }, + {"type": "text", "text": "A chat between a curious human and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the human's questions."}, + ], + } + ] + + # Preparation for inference + text = processor.apply_chat_template( + messages, tokenize=False, add_generation_prompt=True + ) + + image_inputs, _ = process_vision_info(messages) + inputs = processor( + text=[text], + images=image_inputs, + padding=True, + return_tensors="pt", + ) + + inputs = inputs.to("cuda") + + # Inference: Generation of the output + generated_ids = model.generate(**inputs, max_new_tokens=1024) + generated_ids_trimmed = [ + out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids) + ] + output_text = processor.batch_decode( + generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False + ) + + return output_text + + +def main(args: argparse.Namespace): + # load dataset + dataset = json.load(args.dataset.open("r")) + print("found {:,} questions".format(len(dataset))) + + # default: Load the model on the available device(s) + model = Qwen2_5_VLForConditionalGeneration.from_pretrained("Qwen/Qwen2.5-VL-7B-Instruct", torch_dtype="auto", device_map="auto") + + # default processer + processor = AutoProcessor.from_pretrained("Qwen/Qwen2.5-VL-7B-Instruct") + + # load results + results = [] + if args.output_path.exists(): + results = json.load(args.output_path.open()) + print("found {:,} existing results".format(len(results))) + + start = time.time() + time.sleep(1) + + # process data + for idx, item in enumerate(tqdm.tqdm(dataset)): + if args.dry_run and idx >= 5: + break + + question = item['question'] + + # extract scene paths + folder = args.frames_directory / item["episode_history"] + frames = sorted(folder.glob("*-rgb.png")) + paths = [str(frames[i]) for i in range(len(frames))] + + + for img in tqdm.tqdm(paths): + file_path = img.split('/')[-1] + file_path = file_path.split('.')[0] + save_file = str(folder) + '/' + file_path + '-qwen.txt' + #image = Image.open(img) + + if os.path.exists(save_file): + print(f'{save_file} is existing result') + pass + else: + answer = ask_question(args, img, model, processor) + with open(save_file , 'w') as file: + if type(answer) is list: + answer = ''.join(answer) + file.write(answer) + else: + file.write(answer) + +if __name__ == "__main__": + main(parse_args()) \ No newline at end of file diff --git a/openeqa/baselines/ferret/constants.py b/openeqa/baselines/ferret/constants.py new file mode 100644 index 0000000..be8cf02 --- /dev/null +++ b/openeqa/baselines/ferret/constants.py @@ -0,0 +1,12 @@ +CONTROLLER_HEART_BEAT_EXPIRATION = 30 
+WORKER_HEART_BEAT_INTERVAL = 15 + +LOGDIR = "." + +# Model Constants +IGNORE_INDEX = -100 +IMAGE_TOKEN_INDEX = -200 +DEFAULT_IMAGE_TOKEN = "" +DEFAULT_IMAGE_PATCH_TOKEN = "" +DEFAULT_IM_START_TOKEN = "" +DEFAULT_IM_END_TOKEN = "" diff --git a/openeqa/baselines/ferret/conversation.py b/openeqa/baselines/ferret/conversation.py new file mode 100644 index 0000000..23f90a8 --- /dev/null +++ b/openeqa/baselines/ferret/conversation.py @@ -0,0 +1,275 @@ +import dataclasses +from enum import auto, Enum +from typing import List, Tuple + +VOCAB_IMAGE_W = 1000 # 224 +VOCAB_IMAGE_H = 1000 # 224 + +class SeparatorStyle(Enum): + """Different separator style.""" + SINGLE = auto() + TWO = auto() + MPT = auto() + PLAIN = auto() + LLAMA_2 = auto() + + +@dataclasses.dataclass +class Conversation: + """A class that keeps all conversation history.""" + system: str + roles: List[str] + messages: List[List[str]] + offset: int + sep_style: SeparatorStyle = SeparatorStyle.SINGLE + sep: str = "###" + sep2: str = None + version: str = "Unknown" + + skip_next: bool = False + first_round: bool = True + + + def get_prompt(self): + messages = self.messages + if len(messages) > 0 and type(messages[0][1]) is tuple: + messages = self.messages.copy() + init_role, init_msg = messages[0].copy() + init_msg = init_msg[0].replace("", "").strip() + if 'mmtag' in self.version: + messages[0] = (init_role, init_msg) + messages.insert(0, (self.roles[0], "")) + messages.insert(1, (self.roles[1], "Received.")) + else: + messages[0] = (init_role, "\n" + init_msg) + + if self.sep_style == SeparatorStyle.SINGLE: + ret = self.system + self.sep + for role, message in messages: + if message: + if type(message) is tuple: + message, _, _ = message + ret += role + ": " + message + self.sep + else: + ret += role + ":" + elif self.sep_style == SeparatorStyle.TWO: + seps = [self.sep, self.sep2] + ret = self.system + seps[0] + for i, (role, message) in enumerate(messages): + if message: + if type(message) is tuple: + message, _, _ = message + ret += role + ": " + message + seps[i % 2] + else: + ret += role + ":" + elif self.sep_style == SeparatorStyle.MPT: + ret = self.system + self.sep + for role, message in messages: + if message: + if type(message) is tuple: + message, _, _ = message + ret += role + message + self.sep + else: + ret += role + elif self.sep_style == SeparatorStyle.LLAMA_2: + wrap_sys = lambda msg: f"<>\n{msg}\n<>\n\n" + wrap_inst = lambda msg: f"[INST] {msg} [/INST]" + ret = "" + + for i, (role, message) in enumerate(messages): + if i == 0: + assert message, "first message should not be none" + assert role == self.roles[0], "first message should come from user" + if message: + if type(message) is tuple: + message, _, _ = message + if i == 0: message = wrap_sys(self.system) + message + if i % 2 == 0: + message = wrap_inst(message) + ret += self.sep + message + else: + ret += " " + message + " " + self.sep2 + else: + ret += "" + ret = ret.lstrip(self.sep) + elif self.sep_style == SeparatorStyle.PLAIN: + seps = [self.sep, self.sep2] + ret = self.system + for i, (role, message) in enumerate(messages): + if message: + if type(message) is tuple: + message, _, _ = message + ret += message + seps[i % 2] + else: + ret += "" + else: + raise ValueError(f"Invalid style: {self.sep_style}") + + return ret + + def append_message(self, role, message): + self.messages.append([role, message]) + + def get_images(self, return_pil=False): + images = [] + for i, (role, msg) in enumerate(self.messages[self.offset:]): + if i % 2 == 0: + if type(msg) is 
tuple: + import base64 + from io import BytesIO + from PIL import Image + msg, image, image_process_mode = msg + if image_process_mode == "Pad": + def expand2square(pil_img, background_color=(122, 116, 104)): + width, height = pil_img.size + if width == height: + return pil_img + elif width > height: + result = Image.new(pil_img.mode, (width, width), background_color) + result.paste(pil_img, (0, (width - height) // 2)) + return result + else: + result = Image.new(pil_img.mode, (height, height), background_color) + result.paste(pil_img, ((height - width) // 2, 0)) + return result + image = expand2square(image) + elif image_process_mode == "Crop": + pass + elif image_process_mode == "Raw+Processor": + pass + elif image_process_mode == "Resize": + image = image.resize((336, 336)) + else: + raise ValueError(f"Invalid image_process_mode: {image_process_mode}") + + if image_process_mode != "Raw+Processor": + max_hw, min_hw = max(image.size), min(image.size) + aspect_ratio = max_hw / min_hw + max_len, min_len = 800, 400 + shortest_edge = int(min(max_len / aspect_ratio, min_len, min_hw)) + longest_edge = int(shortest_edge * aspect_ratio) + W, H = image.size + if H > W: + H, W = longest_edge, shortest_edge + else: + H, W = shortest_edge, longest_edge + image = image.resize((W, H)) + print('Input Image Size:{}'.format(image.size)) + + if return_pil: + images.append(image) + else: + buffered = BytesIO() + image.save(buffered, format="PNG") + img_b64_str = base64.b64encode(buffered.getvalue()).decode() + images.append(img_b64_str) + return images + + def to_gradio_chatbot(self): + ret = [] + for i, (role, msg) in enumerate(self.messages[self.offset:]): + if i % 2 == 0: + if type(msg) is tuple: + import base64 + from io import BytesIO + msg, image, image_process_mode = msg + if image_process_mode != "Raw+Processor": + max_hw, min_hw = max(image.size), min(image.size) + aspect_ratio = max_hw / min_hw + max_len, min_len = 800, 400 + shortest_edge = int(min(max_len / aspect_ratio, min_len, min_hw)) + longest_edge = int(shortest_edge * aspect_ratio) + W, H = image.size + if H > W: + H, W = longest_edge, shortest_edge + else: + H, W = shortest_edge, longest_edge + image = image.resize((W, H)) + buffered = BytesIO() + image.save(buffered, format="JPEG") + img_b64_str = base64.b64encode(buffered.getvalue()).decode() + img_str = f'user upload image' + ret.append([img_str, None]) + msg = msg.replace('', '').strip() + if len(msg) > 0: + ret.append([msg, None]) + else: + ret.append([msg, None]) + else: + ret[-1][-1] = msg + return ret + + def copy(self): + return Conversation( + system=self.system, + roles=self.roles, + messages=[[x, y] for x, y in self.messages], + offset=self.offset, + sep_style=self.sep_style, + sep=self.sep, + sep2=self.sep2, + version=self.version) + + def dict(self): + if len(self.get_images()) > 0: + return { + "system": self.system, + "roles": self.roles, + "messages": [[x, y[0] if type(y) is tuple else y] for x, y in self.messages], + "offset": self.offset, + "sep": self.sep, + "sep2": self.sep2, + } + return { + "system": self.system, + "roles": self.roles, + "messages": self.messages, + "offset": self.offset, + "sep": self.sep, + "sep2": self.sep2, + } + + + +ferret_conv_vicuna_v1_original_system = Conversation( + system="A chat between a curious human and an artificial intelligence assistant. " + "Assistant is able to understand the visual content that the user provides, and assist the user with a variety of tasks using natural language. 
" + "In images, points are represented by coordinates [x, y]. The top-left corner is [0, 0]. The bottom-right corner is [width-1, height-1]. " + "Increasing x moves right across the image while increasing y moves down. " + "A bounding box is marked by [x1, y1, x2, y2] with the top-left and bottom-right points being [x1, y1] and [x2, y2] respectively. " + f"The image size is assumed to be ({VOCAB_IMAGE_W}, {VOCAB_IMAGE_H}), i.e., width={VOCAB_IMAGE_W}, height={VOCAB_IMAGE_H}. " + "Follow the instructions carefully. ", + roles=("USER", "ASSISTANT"), + version="v1", + messages=(), + offset=0, + sep_style=SeparatorStyle.TWO, + sep=" ", + sep2="", +) + +ferret_conv_vicuna_v1 = Conversation( + system="A chat between a human and an AI that understands visuals. " + "In images, [x, y] denotes points: top-left [0, 0], bottom-right [width-1, height-1]. " + "Increasing x moves right; y moves down. " + f"Bounding box: [x1, y1, x2, y2]. Image size: {VOCAB_IMAGE_W}x{VOCAB_IMAGE_H}. " + "Follow instructions. ", + roles=("USER", "ASSISTANT"), + version="v1", + messages=(), + offset=0, + sep_style=SeparatorStyle.TWO, + sep=" ", + sep2="", +) + + +default_conversation = ferret_conv_vicuna_v1 +conv_templates = { + "v1": ferret_conv_vicuna_v1, + "ferret_v1": ferret_conv_vicuna_v1, +} + + +if __name__ == "__main__": + print(default_conversation.get_prompt()) diff --git a/openeqa/baselines/ferret/mm_utils.py b/openeqa/baselines/ferret/mm_utils.py new file mode 100644 index 0000000..fbcecd1 --- /dev/null +++ b/openeqa/baselines/ferret/mm_utils.py @@ -0,0 +1,74 @@ +from PIL import Image +from io import BytesIO +import base64 + +import torch +from transformers import StoppingCriteria +from .constants import IMAGE_TOKEN_INDEX + + +def load_image_from_base64(image): + return Image.open(BytesIO(base64.b64decode(image))) + + +def process_images(images, image_processor, model_cfg): + return image_processor(images, return_tensors='pt')['pixel_values'] + + +def tokenizer_image_token(prompt, tokenizer, image_token_index=IMAGE_TOKEN_INDEX, return_tensors=None): + prompt_chunks = [tokenizer(chunk).input_ids for chunk in prompt.split('')] + + def insert_separator(X, sep): + return [ele for sublist in zip(X, [sep]*len(X)) for ele in sublist][:-1] + + input_ids = [] + offset = 0 + if len(prompt_chunks) > 0 and len(prompt_chunks[0]) > 0 and prompt_chunks[0][0] == tokenizer.bos_token_id: + offset = 1 + input_ids.append(prompt_chunks[0][0]) + + for x in insert_separator(prompt_chunks, [image_token_index] * (offset + 1)): + input_ids.extend(x[offset:]) + + if return_tensors is not None: + if return_tensors == 'pt': + return torch.tensor(input_ids, dtype=torch.long) + raise ValueError(f'Unsupported tensor type: {return_tensors}') + return input_ids + + +def get_model_name_from_path(model_path): + model_path = model_path.strip("/") + model_paths = model_path.split("/") + if model_paths[-1].startswith('checkpoint-') or model_paths[-1].endswith('checkpoint'): + return model_paths[-2] + "_" + model_paths[-1] + else: + return model_paths[-1] + + + + +class KeywordsStoppingCriteria(StoppingCriteria): + def __init__(self, keywords, tokenizer, input_ids): + self.keywords = keywords + self.keyword_ids = [] + for keyword in keywords: + cur_keyword_ids = tokenizer(keyword).input_ids + if len(cur_keyword_ids) > 1 and cur_keyword_ids[0] == tokenizer.bos_token_id: + cur_keyword_ids = cur_keyword_ids[1:] + self.keyword_ids.append(torch.tensor(cur_keyword_ids)) + self.tokenizer = tokenizer + self.start_len = input_ids.shape[1] + + def 
__call__(self, output_ids: torch.LongTensor, scores: torch.FloatTensor, **kwargs) -> bool: + assert output_ids.shape[0] == 1, "Only support batch size 1 (yet)" # TODO + offset = min(output_ids.shape[1] - self.start_len, 3) + self.keyword_ids = [keyword_id.to(output_ids.device) for keyword_id in self.keyword_ids] + for keyword_id in self.keyword_ids: + if output_ids[0, -keyword_id.shape[0]:] == keyword_id: + return True + outputs = self.tokenizer.batch_decode(output_ids[:, -offset:], skip_special_tokens=True)[0] + for keyword in self.keywords: + if keyword in outputs: + return True + return False diff --git a/openeqa/baselines/ferret/model/__init__.py b/openeqa/baselines/ferret/model/__init__.py new file mode 100644 index 0000000..a4d920d --- /dev/null +++ b/openeqa/baselines/ferret/model/__init__.py @@ -0,0 +1 @@ +from .language_model.ferret_llama import FERRETLlamaForCausalLM, FERRETConfig diff --git a/openeqa/baselines/ferret/model/apply_delta.py b/openeqa/baselines/ferret/model/apply_delta.py new file mode 100644 index 0000000..76a400a --- /dev/null +++ b/openeqa/baselines/ferret/model/apply_delta.py @@ -0,0 +1,70 @@ +""" +Usage: +# 7B +python3 -m ferret.model.apply_delta \ + --base ./model/vicuna-7b-v1-3 \ + --target ./model/ferret-7b-v1-3 \ + --delta ./checkpoints/ferret_ft_clipL336_vicunaV1-3-7b_3Ep_dataV16_RSamplerV2/ferret-7b-delta + +# 13B +python3 -m ferret.model.apply_delta \ + --base ./model/vicuna-13b-v1-3 \ + --target ./model/ferret-13b-v1-3 \ + --delta ./checkpoints/ferret_ft_clipL336_vicunaV1-3-13b_3Ep_dataV16_RSamplerV2/ferret-13b-delta +""" +import argparse + +import torch +from tqdm import tqdm +from transformers import AutoTokenizer, AutoModelForCausalLM +from openeqa.baselines.ferret_transform import FERRETLlamaForCausalLM + + +exclude_name_lists = ['model.mm_projector.weight', 'model.mm_projector.bias', + 'model.region_geo_sampler.agg_projector_list.0.net.0.bias', 'model.region_geo_sampler.agg_projector_list.0.net.0.weight', + 'model.region_geo_sampler.agg_projector_list.0.norm.bias', 'model.region_geo_sampler.agg_projector_list.0.norm.weight', + 'model.region_geo_sampler.agg_projector_list.1.net.0.bias', 'model.region_geo_sampler.agg_projector_list.1.net.0.weight', + 'model.region_geo_sampler.agg_projector_list.1.norm.bias', 'model.region_geo_sampler.agg_projector_list.1.norm.weight', + 'model.region_geo_sampler.diff_projector_list.0.bias', 'model.region_geo_sampler.diff_projector_list.0.weight', + 'model.region_geo_sampler.diff_projector_list.1.bias', 'model.region_geo_sampler.diff_projector_list.1.weight', + 'model.region_geo_sampler.dim_projector.bias', 'model.region_geo_sampler.dim_projector.weight', + 'model.region_geo_sampler.flatten_projector.bias', 'model.region_geo_sampler.flatten_projector.weight' + ] + + +def apply_delta(base_model_path, target_model_path, delta_path): + print("Loading base model") + base = AutoModelForCausalLM.from_pretrained( + base_model_path, torch_dtype=torch.float16, low_cpu_mem_usage=True) + + print("Loading delta") + delta = FERRETLlamaForCausalLM.from_pretrained(delta_path, torch_dtype=torch.float16, low_cpu_mem_usage=True) + delta_tokenizer = AutoTokenizer.from_pretrained(delta_path) + + print("Applying delta") + for name, param in tqdm(delta.state_dict().items(), desc="Applying delta"): + if name not in base.state_dict(): + assert name in exclude_name_lists, f'{name} not in base model' + continue + if param.data.shape == base.state_dict()[name].shape: + param.data += base.state_dict()[name] + else: + assert name in 
['model.embed_tokens.weight', 'lm_head.weight'], \ + f'{name} dimension mismatch: {param.data.shape} vs {base.state_dict()[name].shape}' + bparam = base.state_dict()[name] + param.data[:bparam.shape[0], :bparam.shape[1]] += bparam + + print("Saving target model") + delta.save_pretrained(target_model_path) + delta_tokenizer.save_pretrained(target_model_path) + + +if __name__ == "__main__": + parser = argparse.ArgumentParser() + parser.add_argument("--base-model-path", type=str, required=True) + parser.add_argument("--target-model-path", type=str, required=True) + parser.add_argument("--delta-path", type=str, required=True) + + args = parser.parse_args() + + apply_delta(args.base_model_path, args.target_model_path, args.delta_path) diff --git a/openeqa/baselines/ferret/model/builder.py b/openeqa/baselines/ferret/model/builder.py new file mode 100644 index 0000000..1b8d7f7 --- /dev/null +++ b/openeqa/baselines/ferret/model/builder.py @@ -0,0 +1,147 @@ +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + + +import os +import shutil +import pdb + +from transformers import AutoTokenizer, AutoModelForCausalLM, AutoConfig, BitsAndBytesConfig +import torch +from ferret.model import * +from ferret.constants import DEFAULT_IMAGE_PATCH_TOKEN, DEFAULT_IM_START_TOKEN, DEFAULT_IM_END_TOKEN +DEFAULT_REGION_FEA_TOKEN = "" + +def load_pretrained_model(model_path, model_base, model_name, load_8bit=False, load_4bit=False, device_map="auto"): + kwargs = {"device_map": device_map} + + if load_8bit: + kwargs['load_in_8bit'] = True + elif load_4bit: + kwargs['load_in_4bit'] = True + kwargs['quantization_config'] = BitsAndBytesConfig( + load_in_4bit=True, + bnb_4bit_compute_dtype=torch.float16, + bnb_4bit_use_double_quant=True, + bnb_4bit_quant_type='nf4' + ) + else: + kwargs['torch_dtype'] = torch.float16 + + if 'llava' in model_name.lower() or 'ferret' in model_name.lower(): + #print('1') + # Load LLaVA/FERRET model + if 'lora' in model_name.lower() and model_base is not None: + #print('2') + lora_cfg_pretrained = AutoConfig.from_pretrained(model_path) + tokenizer = AutoTokenizer.from_pretrained(model_base, use_fast=False) + print('Loading LLaVA/FERRET from base model...') + model = FERRETLlamaForCausalLM.from_pretrained(model_base, low_cpu_mem_usage=True, config=lora_cfg_pretrained, **kwargs) + token_num, tokem_dim = model.lm_head.out_features, model.lm_head.in_features + if model.lm_head.weight.shape[0] != token_num: + model.lm_head.weight = torch.nn.Parameter(torch.empty(token_num, tokem_dim, device=model.device, dtype=model.dtype)) + model.model.embed_tokens.weight = torch.nn.Parameter(torch.empty(token_num, tokem_dim, device=model.device, dtype=model.dtype)) + + print('Loading additional LLaVA/FERRET weights...') + if os.path.exists(os.path.join(model_path, 'non_lora_trainables.bin')): + non_lora_trainables = torch.load(os.path.join(model_path, 'non_lora_trainables.bin'), map_location='cpu') + else: + # this is probably from HF Hub + from huggingface_hub import hf_hub_download + def load_from_hf(repo_id, filename, 
subfolder=None): + cache_file = hf_hub_download( + repo_id=repo_id, + filename=filename, + subfolder=subfolder) + return torch.load(cache_file, map_location='cpu') + non_lora_trainables = load_from_hf(model_path, 'non_lora_trainables.bin') + non_lora_trainables = {(k[11:] if k.startswith('base_model.') else k): v for k, v in non_lora_trainables.items()} + if any(k.startswith('model.model.') for k in non_lora_trainables): + non_lora_trainables = {(k[6:] if k.startswith('model.') else k): v for k, v in non_lora_trainables.items()} + model.load_state_dict(non_lora_trainables, strict=False) + + from peft import PeftModel + print('Loading LoRA weights...') + model = PeftModel.from_pretrained(model, model_path) + print('Merging LoRA weights...') + model = model.merge_and_unload() + print('Model is loaded...') + elif model_base is not None: + #print('3') + # this may be mm projector only + print('Loading LLaVA/FERRET from base model...') + tokenizer = AutoTokenizer.from_pretrained(model_base, use_fast=False) + cfg_pretrained = AutoConfig.from_pretrained(model_path) + model = FERRETLlamaForCausalLM.from_pretrained(model_base, low_cpu_mem_usage=True, config=cfg_pretrained, **kwargs) + + mm_projector_weights = torch.load(os.path.join(model_path, 'mm_projector.bin'), map_location='cpu') + mm_projector_weights = {k: v.to(torch.float16) for k, v in mm_projector_weights.items()} + model.load_state_dict(mm_projector_weights, strict=False) + else: + ##################### default is here!! + print('4') + tokenizer = AutoTokenizer.from_pretrained(model_path, use_fast=False) + model = FERRETLlamaForCausalLM.from_pretrained(model_path, low_cpu_mem_usage=True, **kwargs) + else: + # Load language model + if model_base is not None: + #print('5') + # PEFT model + from peft import PeftModel + tokenizer = AutoTokenizer.from_pretrained(model_base, use_fast=False) + model = AutoModelForCausalLM.from_pretrained(model_base, torch_dtype=torch.float16, low_cpu_mem_usage=True, device_map="auto") + print(f"Loading LoRA weights from {model_path}") + model = PeftModel.from_pretrained(model, model_path) + print(f"Merging weights") + model = model.merge_and_unload() + print('Convert to FP16...') + model.to(torch.float16) + else: + #print('6') + use_fast = False + tokenizer = AutoTokenizer.from_pretrained(model_path, use_fast=False) + model = AutoModelForCausalLM.from_pretrained(model_path, low_cpu_mem_usage=True, **kwargs) + + image_processor = None + + if 'llava' in model_name.lower() or 'ferret' in model_name.lower(): + print('7') + mm_use_im_start_end = getattr(model.config, "mm_use_im_start_end", False) + mm_use_im_patch_token = getattr(model.config, "mm_use_im_patch_token", True) + mm_im_region_fea_token = getattr(model.config, "im_region_fea_token", None) + if mm_use_im_patch_token: + tokenizer.add_tokens([DEFAULT_IMAGE_PATCH_TOKEN], special_tokens=True) + if mm_im_region_fea_token is not None: + tokenizer.add_tokens([DEFAULT_REGION_FEA_TOKEN], special_tokens=True) + if mm_use_im_start_end: + tokenizer.add_tokens([DEFAULT_IM_START_TOKEN, DEFAULT_IM_END_TOKEN], special_tokens=True) + model.resize_token_embeddings(len(tokenizer)) + + vision_tower = model.get_vision_tower() + vision_tower_path = os.path.join(model_path, 'vision_tower') + if not vision_tower.is_loaded or os.path.exists(vision_tower_path): + if os.path.exists(vision_tower_path): + print(f'Start Loading vision tower from {vision_tower_path}') + vision_tower.load_model(vision_tower_path=vision_tower_path) + print(f'Finish Loading vision tower from 
{vision_tower_path}') + else: + vision_tower.load_model() + + vision_tower.to(device='cuda', dtype=torch.float16) + image_processor = vision_tower.image_processor + + if hasattr(model.config, "max_sequence_length"): + context_len = model.config.max_sequence_length + else: + context_len = 2048 + + return tokenizer, model, image_processor, context_len diff --git a/openeqa/baselines/ferret/model/consolidate.py b/openeqa/baselines/ferret/model/consolidate.py new file mode 100644 index 0000000..8d516bc --- /dev/null +++ b/openeqa/baselines/ferret/model/consolidate.py @@ -0,0 +1,29 @@ +""" +Usage: +python3 -m llava.model.consolidate --src ~/model_weights/llava-7b --dst ~/model_weights/llava-7b_consolidate +""" +import argparse + +import torch +from transformers import AutoTokenizer, AutoModelForCausalLM +from ferret.model import * +from ferret.model.utils import auto_upgrade + + +def consolidate_ckpt(src_path, dst_path): + print("Loading model") + auto_upgrade(src_path) + src_model = AutoModelForCausalLM.from_pretrained(src_path, torch_dtype=torch.float16, low_cpu_mem_usage=True) + src_tokenizer = AutoTokenizer.from_pretrained(src_path, use_fast=False) + src_model.save_pretrained(dst_path) + src_tokenizer.save_pretrained(dst_path) + + +if __name__ == "__main__": + parser = argparse.ArgumentParser() + parser.add_argument("--src", type=str, required=True) + parser.add_argument("--dst", type=str, required=True) + + args = parser.parse_args() + + consolidate_ckpt(args.src, args.dst) diff --git a/openeqa/baselines/ferret/model/ferret_arch.py b/openeqa/baselines/ferret/model/ferret_arch.py new file mode 100644 index 0000000..97c83fb --- /dev/null +++ b/openeqa/baselines/ferret/model/ferret_arch.py @@ -0,0 +1,687 @@ +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + + +from abc import ABC, abstractmethod + +import torch +import torch.nn as nn +import torch.nn.functional as F +import numpy as np +import math +from .multimodal_encoder.builder import build_vision_tower +import pdb + +from ferret.constants import IGNORE_INDEX, IMAGE_TOKEN_INDEX, DEFAULT_IMAGE_PATCH_TOKEN, DEFAULT_IM_START_TOKEN, DEFAULT_IM_END_TOKEN +DEFAULT_REGION_FEA_TOKEN = "" + + +def rand_sample(x, max_len): + if x.shape[0] <= max_len: + return x + else: + rand_idx = torch.randperm(x.shape[0])[:max_len] + return x[rand_idx, :] + +def rand_sample_repeat(x, max_len): + if x.shape[0] < max_len: + indices = torch.randint(0, x.shape[0], (max_len-x.shape[0],)) + # pdb.set_trace() + return torch.cat((x, x[indices]), dim=0) + elif x.shape[0] == max_len: + return x + else: + rand_idx = torch.randperm(x.shape[0])[:max_len] + return x[rand_idx, :] + +def point_sample(input, point_coords, return_dtype, **kwargs): + """ + A wrapper around :function:`torch.nn.functional.grid_sample` to support 3D point_coords tensors. + Unlike :function:`torch.nn.functional.grid_sample` it assumes `point_coords` to lie inside + [0, 1] x [0, 1] square. + + Args: + input (Tensor): A tensor of shape (N, C, H, W) that contains features map on a H x W grid. 
+ point_coords (Tensor): A tensor of shape (N, P, 2) or (N, Hgrid, Wgrid, 2) that contains + [0, 1] x [0, 1] normalized point coordinates. + + Returns: + output (Tensor): A tensor of shape (N, C, P) or (N, C, Hgrid, Wgrid) that contains + features for points in `point_coords`. The features are obtained via bilinear + interplation from `input` the same way as :function:`torch.nn.functional.grid_sample`. + """ + add_dim = False + if point_coords.dim() == 3: + add_dim = True + point_coords = point_coords.unsqueeze(2) + # output = F.grid_sample(input, 2.0 * point_coords - 1.0, **kwargs) + output = F.grid_sample(input.float(), (2.0 * point_coords - 1.0).float(), **kwargs) + output = output.to(return_dtype) + if add_dim: + output = output.squeeze(3) + return output + + +def farthest_point_sample(xyz, npoint): + """ + Input: + xyz: pointcloud data, [B, N, 2] + npoint: number of samples + Return: + centroids: sampled pointcloud index, [B, npoint] + """ + device = xyz.device + B, N, C = xyz.shape + centroids = torch.zeros(B, npoint, dtype=torch.long).to(device) + distance = torch.ones(B, N).to(device) * 1e10 + farthest = torch.randint(0, N, (B,), dtype=torch.long).to(device) + batch_indices = torch.arange(B, dtype=torch.long).to(device) + for i in range(npoint): + centroids[:, i] = farthest + centroid = xyz[batch_indices, farthest, :].view(B, 1, 2) + dist = torch.sum((xyz - centroid) ** 2, -1) + distance = torch.min(distance, dist) + farthest = torch.max(distance, -1)[1] + return centroids + + +def index_points(points, idx): + """ + Input: + points: input points data, [B, N, C] + idx: sample index data, [B, S] + Return: + new_points:, indexed points data, [B, S, C] + """ + device = points.device + B = points.shape[0] + view_shape = list(idx.shape) + view_shape[1:] = [1] * (len(view_shape) - 1) + repeat_shape = list(idx.shape) + repeat_shape[0] = 1 + batch_indices = torch.arange(B, dtype=torch.long).to(device).view(view_shape).repeat(repeat_shape) + new_points = points[batch_indices, idx, :] + return new_points + + +def square_distance(src, dst): + """ + Calculate Euclid distance between each two points. 
+ src^T * dst = xn * xm + yn * ym + zn * zm; + sum(src^2, dim=-1) = xn*xn + yn*yn + zn*zn; + sum(dst^2, dim=-1) = xm*xm + ym*ym + zm*zm; + dist = (xn - xm)^2 + (yn - ym)^2 + (zn - zm)^2 + = sum(src**2,dim=-1)+sum(dst**2,dim=-1)-2*src^T*dst + Input: + src: source points, [B, N, C] + dst: target points, [B, M, C] + Output: + dist: per-point square distance, [B, N, M] + """ + B, N, _ = src.shape + _, M, _ = dst.shape + dist = -2 * torch.matmul(src, dst.permute(0, 2, 1)) + dist += torch.sum(src ** 2, -1).view(B, N, 1) + dist += torch.sum(dst ** 2, -1).view(B, 1, M) + return dist + + +def knn_point(nsample, xyz, new_xyz): + """ + Input: + nsample: max sample number in local region + xyz: all points, [B, N, C] + new_xyz: query points, [B, S, C] + Return: + group_idx: grouped points index, [B, S, nsample] + """ + sqrdists = square_distance(new_xyz, xyz) + _, group_idx = torch.topk(sqrdists, nsample, dim=-1, largest=False, sorted=False) + return group_idx + + +class ConvReLULN1D(nn.Module): + def __init__(self, in_channels, out_channels, kernel_size=1, bias=True): + super(ConvReLULN1D, self).__init__() + self.act = nn.ReLU(inplace=True) + self.net = nn.Sequential( + nn.Conv1d(in_channels=in_channels, out_channels=out_channels, kernel_size=kernel_size, bias=bias), + self.act + ) + self.norm = nn.LayerNorm(out_channels) + + def forward(self, x): + # (B, C, N) -> (B, C_1, N) + x = self.net(x) + x = x.permute(0, 2, 1) + x = self.norm(x) + x = x.permute(0, 2, 1) + + return x + + +def normal_init(module, mean=0, std=1, bias=0): + if hasattr(module, 'weight') and module.weight is not None: + nn.init.normal_(module.weight, mean, std) + if hasattr(module, 'bias') and module.bias is not None: + nn.init.constant_(module.bias, bias) + + +class GeoRegionSampler(nn.Module): + print('GeoRegionSampler') + def __init__(self, + input_dim, + output_dim, + num_init_point, + num_sub_point, + num_neighbor, + pooler_mode='mean'): + super(GeoRegionSampler, self).__init__() + self.input_dim = input_dim + self.output_dim = output_dim + self.num_init_point = num_init_point + self.num_sub_point = num_sub_point + self.num_neighbor = num_neighbor + + self.diff_projector_list = nn.ModuleList() + self.agg_projector_list = nn.ModuleList() + self.pooler_list = nn.ModuleList() + + for ii in range(len(num_sub_point)): + self.diff_projector_list.append(nn.Linear(self.input_dim + 2, self.input_dim + 2)) + self.agg_projector_list.append(ConvReLULN1D(in_channels=2*(self.input_dim + 2), + out_channels=self.input_dim, + )) + if pooler_mode == 'mean': + self.pooler_list.append(nn.AvgPool1d(kernel_size=num_neighbor[ii])) + elif pooler_mode =='max': + self.pooler_list.append(nn.AdaptiveMaxPool1d(output_size=1)) + else: + raise NotImplementedError(f'{self.pooler_mode} is not supported.') + + self.flatten_projector = nn.Linear(self.input_dim * num_sub_point[-1], self.input_dim) + self.dim_projector = nn.Linear(self.input_dim, self.output_dim) + + self.norm_init_weights() + + # self.dtype = torch.float32 + def norm_init_weights(self): + for m in self.modules(): + if isinstance(m, nn.Conv2d): + normal_init(m, 0, 0.01) + + + def forward(self, + feature_map, + region_masks, + original_dtype, + return_dtype): + + assert len(feature_map) == len(region_masks) + + all_points = [] + all_points_fea = [] + all_points_img_ids = [] + # Sample points and their features + for img_idx, (region_feature_map_i, region_masks_list_i) in enumerate(zip(feature_map, region_masks)): + if len(region_masks_list_i) != 0: + # (w, h) + ori_image_wh = 
torch.tensor([region_masks_list_i[0].shape[0], region_masks_list_i[0].shape[1]], device=region_masks_list_i[0].device)[None,] + # list of elements of shape [num_sample_point, 2] + # pdb.set_trace() + cur_non_zero_pos = [rand_sample_repeat((m.nonzero()/ori_image_wh), self.num_init_point) for m in region_masks_list_i] + # list -> [num_mask, num_sample_point, 2] + cur_non_zero_pos = torch.stack(cur_non_zero_pos) + # [HxW, C] -> [H, W, C] -> [C, H, W] -> [N, C, H, W] + h = w = int(math.sqrt(region_feature_map_i.shape[0])) + c = region_feature_map_i.shape[-1] + dup_region_feature_map_i = region_feature_map_i.reshape(h, w, c).permute(2, 0, 1) + dup_region_feature_map_i = dup_region_feature_map_i.unsqueeze(0).repeat(cur_non_zero_pos.shape[0], 1, 1, 1) + # [num_mask, C, H, W] x [num_mask, num_sample_point, 2] -> [num_mask, C, num_sample_point] -> [num_mask, num_sample_point, C] + # F.grid_sample doesn't support BF16. Need to tranform into float32 then transform back. + dup_region_feature_map_i_ori_type = dup_region_feature_map_i.to(original_dtype) + region_feature_i = point_sample(dup_region_feature_map_i_ori_type, + cur_non_zero_pos.flip(dims=(2,)).type(original_dtype), + return_dtype, + align_corners=True, + ) + # region_feature_i = region_feature_i.to(dup_region_feature_map_i.dtype) + region_feature_i = region_feature_i.transpose(-2, -1) + + cur_img_ids = [img_idx] * len(cur_non_zero_pos) + # save to global list + all_points.append(cur_non_zero_pos) + all_points_fea.append(region_feature_i) + all_points_img_ids.extend(cur_img_ids) + + # pdb.set_trace() + # No region found, return list of None. + if len(all_points) == 0: + return [None] * len(region_masks) + + all_points = torch.cat(all_points, dim=0).to(return_dtype) # [B*num_mask, num_sample_point, 2] + all_points_fea = torch.cat(all_points_fea, dim=0) # [B*num_mask, num_sample_point, C] + all_points_img_ids = torch.tensor(all_points_img_ids, device=all_points_fea.device) + # pdb.set_trace() + assert all_points_fea.shape[:-1] == all_points_fea.shape[:-1] + + # Processing. 
+ for stage_i in range(len(self.num_sub_point)): + cur_num_sub_point = self.num_sub_point[stage_i] + cur_num_neighbor = self.num_neighbor[stage_i] + + all_points = all_points.contiguous() # xy [btach, points, xy] + fps_idx = farthest_point_sample(all_points, cur_num_sub_point).long() + + new_points = index_points(all_points, fps_idx) # [B, npoint, 2] + new_points_fea = index_points(all_points_fea, fps_idx) # [B, npoint, d] + + idx = knn_point(cur_num_neighbor, all_points, new_points) + grouped_points = index_points(all_points, idx) # [B, npoint, k, 2] + grouped_points_fea = index_points(all_points_fea, idx) # [B, npoint, k, d] + + # pdb.set_trace() + local_points_fea = torch.cat([grouped_points_fea, grouped_points],dim=-1) # [B, npoint, k, d+2] + anchor_points_fea = torch.cat([new_points_fea, new_points],dim=-1).unsqueeze(-2) + diff_points_fea = local_points_fea-anchor_points_fea + + diff_points_fea = self.diff_projector_list[stage_i](diff_points_fea) + gather_points_fea = torch.cat([diff_points_fea, anchor_points_fea.repeat(1, 1, cur_num_neighbor, 1)], dim=-1) # [B, npoint, k, 2(d+2)] + + # pdb.set_trace() + b, n, s, d = gather_points_fea.size() + gather_points_fea = gather_points_fea.permute(0, 1, 3, 2) # [B, npoint, 2(d+2), k] + gather_points_fea = gather_points_fea.reshape(-1, d, s) # [B*npoint, 2(d+2), k] + gather_points_fea = self.agg_projector_list[stage_i](gather_points_fea) # [B*npoint, d, k] + # pdb.set_trace() + batch_size, new_dim, _ = gather_points_fea.size() + gather_points_fea = self.pooler_list[stage_i](gather_points_fea).view(batch_size, new_dim) # [B*npoint, d] + # gather_points_fea = F.adaptive_max_pool1d(gather_points_fea, 1).view(batch_size, -1) # [B*npoint, d] + # pdb.set_trace() + gather_points_fea = gather_points_fea.reshape(b, n, -1) # [B, npoint, d] + # pdb.set_trace() + + all_points = new_points + all_points_fea = gather_points_fea + + # pdb.set_trace() + x = all_points_fea.flatten(1, -1) # [B, npoint x d] + x = self.flatten_projector(x) + all_region_fea = self.dim_projector(x) # [B, d] + + output_region_fea = [] + for img_idx in range(len(region_masks)): + cur_mask = all_points_img_ids == img_idx + # pdb.set_trace() + if not cur_mask.any(): + output_region_fea.append(None) + else: + output_region_fea.append(all_region_fea[cur_mask]) + + # pdb.set_trace() + return output_region_fea + + + +class FERRETMetaModel: + + def __init__(self, config): + super(FERRETMetaModel, self).__init__(config) + self.max_sample_point = 512 + + if hasattr(config, "mm_vision_tower"): + self.vision_tower = build_vision_tower(config, delay_load=True) + self.mm_projector = nn.Linear(config.mm_hidden_size, config.hidden_size) + + if hasattr(config, "region_fea_adapter"): + self.region_fea_adapter = nn.Linear(config.mm_hidden_size, config.hidden_size) + + if hasattr(config, "region_geo_sampler"): + # pdb.set_trace() + self.region_geo_sampler = GeoRegionSampler(input_dim=config.mm_hidden_size, + output_dim=config.hidden_size, + num_init_point=self.max_sample_point, + num_sub_point=[128, 32], + num_neighbor=[24, 24], + pooler_mode=config.sampler_pooler_mode + ) + + def get_vision_tower(self): + vision_tower = getattr(self, 'vision_tower', None) + if type(vision_tower) is list: + vision_tower = vision_tower[0] + return vision_tower + + def initialize_vision_modules(self, model_args, fsdp=None, add_region_feature=False, region_geo_sampler=False, sampler_pooler_mode='mean'): + vision_tower = model_args.vision_tower + mm_vision_select_layer = model_args.mm_vision_select_layer + 
mm_vision_select_feature = model_args.mm_vision_select_feature + pretrain_mm_mlp_adapter = model_args.pretrain_mm_mlp_adapter + + self.config.mm_vision_tower = vision_tower + + vision_tower = build_vision_tower(model_args) + + if fsdp is not None and len(fsdp) > 0: + self.vision_tower = [vision_tower] + else: + self.vision_tower = vision_tower + + self.config.use_mm_proj = True + self.config.mm_hidden_size = vision_tower.hidden_size + self.config.mm_vision_select_layer = mm_vision_select_layer + self.config.mm_vision_select_feature = mm_vision_select_feature + + if not hasattr(self, 'mm_projector'): + self.mm_projector = nn.Linear(self.config.mm_hidden_size, self.config.hidden_size) + + if add_region_feature: + if region_geo_sampler: + self.config.region_geo_sampler = True + self.config.sampler_pooler_mode = sampler_pooler_mode + # pdb.set_trace() + if not hasattr(self, 'region_geo_sampler'): + self.region_geo_sampler = GeoRegionSampler(input_dim=self.config.mm_hidden_size, + output_dim=self.config.hidden_size, + num_init_point=self.max_sample_point, + num_sub_point=[128, 32], + num_neighbor=[24, 24], + pooler_mode=sampler_pooler_mode + ) + else: + self.config.region_fea_adapter = True + if not hasattr(self, 'region_fea_adapter'): + self.region_fea_adapter = nn.Linear(self.config.mm_hidden_size, self.config.hidden_size) + + if pretrain_mm_mlp_adapter is not None: + mm_projector_weights = torch.load(pretrain_mm_mlp_adapter, map_location='cpu') + def get_w(weights, keyword): + return {k.split(keyword + '.')[1]: v for k, v in weights.items() if keyword in k} + + self.mm_projector.load_state_dict(get_w(mm_projector_weights, 'mm_projector')) + + +class FERRETMetaForCausalLM(ABC): + + @abstractmethod + def get_model(self): + pass + + def get_vision_tower(self): + return self.get_model().get_vision_tower() + + def encode_images(self, images, region_flag=False, region_geo_sampler=False): + image_features = self.get_model().get_vision_tower()(images) + target_dtype = self.get_model().mm_projector.weight.dtype + image_features = image_features.to(target_dtype) + + projected_image_features = self.get_model().mm_projector(image_features) + #print(f'projected_image_features: {projected_image_features}') + + if region_flag: + if region_geo_sampler: + new_region_feature_map = image_features + else: + new_region_feature_map = self.get_model().region_fea_adapter(image_features) + else: + new_region_feature_map = None + + return image_features, projected_image_features, new_region_feature_map + + def extract_region_feature(self, region_feature_map, region_masks, original_dtype, return_dtype): + all_region_features = [] + assert len(region_feature_map) == len(region_masks) + for region_feature_map_i, region_masks_list_i in zip(region_feature_map, region_masks): + if len(region_masks_list_i) == 0: + all_region_features.append(None) + else: + # (w, h) + ori_image_wh = torch.tensor([region_masks_list_i[0].shape[0], region_masks_list_i[0].shape[1]], device=region_masks_list_i[0].device)[None,] + # list of elements of shape [num_sample_point, 2] + non_zero_pos = [rand_sample((m.nonzero()/ori_image_wh), self.get_model().max_sample_point) for m in region_masks_list_i] + # [num_mask, num_sample_point(padded), 2] + non_zero_pos = nn.utils.rnn.pad_sequence(non_zero_pos, padding_value=-1, batch_first=True) + non_zero_pos_mask = ~(non_zero_pos.sum(dim=-1) < 0) + # [HxW, C] -> [H, W, C] -> [C, H, W] -> [N, C, H, W] + h = w = int(math.sqrt(region_feature_map_i.shape[0])) + c = region_feature_map_i.shape[-1] + 
dup_region_feature_map_i = region_feature_map_i.reshape(h, w, c).permute(2, 0, 1) + dup_region_feature_map_i = dup_region_feature_map_i.unsqueeze(0).repeat(non_zero_pos.shape[0], 1, 1, 1) + # [num_mask, C, H, W] x [num_mask, num_sample_point(padded), 2] -> [num_mask, C, num_sample_point(padded)] + # F.grid_sample doesn't support BF16. Need to tranform into float32 then transform back. + dup_region_feature_map_i_ori_type = dup_region_feature_map_i.to(original_dtype) + # pdb.set_trace() + region_feature_i = point_sample(dup_region_feature_map_i_ori_type, + non_zero_pos.flip(dims=(2,)).type(original_dtype), + return_dtype, + align_corners=True + ) + region_feature_i = region_feature_i.to(dup_region_feature_map_i.dtype) + # [num_mask, C] + region_feature_i = torch.stack([x[m].mean(dim=0) for x, m in zip(region_feature_i.transpose(1,2), non_zero_pos_mask)]).nan_to_num() + all_region_features.append(region_feature_i) + + return all_region_features + + + def prepare_inputs_labels_for_multimodal(self, input_ids, attention_mask, past_key_values, labels, images, region_masks): + if region_masks is not None: + region_flag = True + else: + region_flag = False + region_geo_sampler = region_flag and getattr(self.config, 'region_geo_sampler', False) + + vision_tower = self.get_vision_tower() + if vision_tower is None or images is None or input_ids.shape[1] == 1: + #print('ferret_arch prepare_inputs_labels_for_multimodal 1') + if past_key_values is not None and vision_tower is not None and images is not None and input_ids.shape[1] == 1: + attention_mask = torch.ones((attention_mask.shape[0], past_key_values[-1][-1].shape[-2] + 1), dtype=attention_mask.dtype, device=attention_mask.device) + return input_ids, attention_mask, past_key_values, None, labels + + if type(images) is list or images.ndim == 5: + ## multiple image + #print('ferret_arch prepare_inputs_labels_for_multimodal 2') + assert region_flag == False + concat_images = torch.cat([image for image in images], dim=0) + raw_image_features, image_features, region_feature_map = self.encode_images(concat_images, region_flag, region_geo_sampler) + #image_features = self.encode_images(concat_images) + split_sizes = [image.shape[0] for image in images] + image_features = torch.split(image_features, split_sizes, dim=0) + image_features = [x.flatten(0, 1) for x in image_features] + print(f'prepare_inputs_labels_for_multimodal 2 image_features: {image_features}') + else: + ## single image + #print('ferret_arch prepare_inputs_labels_for_multimodal 3') + raw_image_features, image_features, region_feature_map = self.encode_images(images, region_flag, region_geo_sampler) + + if region_flag: + if region_geo_sampler: + # pdb.set_trace() + region_features = self.get_model().region_geo_sampler(region_feature_map, region_masks, + original_dtype=raw_image_features.dtype, + return_dtype=image_features.dtype) + else: + region_features = self.extract_region_feature(region_feature_map, region_masks, + original_dtype=raw_image_features.dtype, + return_dtype=image_features.dtype) + assert len(region_features) == len(input_ids) + + new_input_embeds = [] + new_labels = [] if labels is not None else None + cur_image_idx = 0 + for batch_idx, cur_input_ids in enumerate(input_ids): + if (cur_input_ids == IMAGE_TOKEN_INDEX).sum() == 0: + # multimodal LLM, but the current sample is not multimodal + cur_input_embeds = self.get_model().embed_tokens(cur_input_ids) + cur_input_embeds = cur_input_embeds + (0. 
* self.get_model().mm_projector(vision_tower.dummy_feature)).sum() + new_input_embeds.append(cur_input_embeds) + if labels is not None: + new_labels.append(labels[batch_idx]) + cur_image_idx += 1 + continue + image_token_indices = torch.where(cur_input_ids == IMAGE_TOKEN_INDEX)[0] + cur_new_input_embeds = [] + if labels is not None: + cur_labels = labels[batch_idx] + cur_new_labels = [] + assert cur_labels.shape == cur_input_ids.shape + while image_token_indices.numel() > 0: + cur_image_features = image_features[cur_image_idx] + image_token_start = image_token_indices[0] + if region_flag: + assert (cur_input_ids[:image_token_start] == self.config.im_region_fea_token).sum() == 0 + # If not use start-end token, pt ckpt saved only has mm projector. + if getattr(self.config, 'tune_mm_mlp_adapter', False) and getattr(self.config, 'mm_use_im_start_end', False): + cur_new_input_embeds.append(self.get_model().embed_tokens(cur_input_ids[:image_token_start-1]).detach()) + cur_new_input_embeds.append(self.get_model().embed_tokens(cur_input_ids[image_token_start-1:image_token_start])) + cur_new_input_embeds.append(cur_image_features) + cur_new_input_embeds.append(self.get_model().embed_tokens(cur_input_ids[image_token_start+1:image_token_start+2])) + if labels is not None: + cur_new_labels.append(cur_labels[:image_token_start]) + cur_new_labels.append(torch.full((cur_image_features.shape[0],), IGNORE_INDEX, device=labels.device, dtype=labels.dtype)) + cur_new_labels.append(cur_labels[image_token_start:image_token_start+1]) + cur_labels = cur_labels[image_token_start+2:] + else: + cur_new_input_embeds.append(self.get_model().embed_tokens(cur_input_ids[:image_token_start])) + cur_new_input_embeds.append(cur_image_features) + if labels is not None: + cur_new_labels.append(cur_labels[:image_token_start]) + cur_new_labels.append(torch.full((cur_image_features.shape[0],), IGNORE_INDEX, device=labels.device, dtype=labels.dtype)) + cur_labels = cur_labels[image_token_start+1:] + cur_image_idx += 1 + if getattr(self.config, 'tune_mm_mlp_adapter', False) and getattr(self.config, 'mm_use_im_start_end', False): + cur_input_ids = cur_input_ids[image_token_start+2:] + else: + cur_input_ids = cur_input_ids[image_token_start+1:] + image_token_indices = torch.where(cur_input_ids == IMAGE_TOKEN_INDEX)[0] + if cur_input_ids.numel() > 0: + if getattr(self.config, 'tune_mm_mlp_adapter', False) and getattr(self.config, 'mm_use_im_start_end', False): + text_input_embeds = self.get_model().embed_tokens(cur_input_ids).detach() + else: + text_input_embeds = self.get_model().embed_tokens(cur_input_ids) + if labels is not None: + cur_new_labels.append(cur_labels) + + # Add region feature into text feature embeddings. 
+ assert batch_idx+1 == cur_image_idx + if region_flag and region_features[batch_idx] is not None: + region_embs = torch.zeros_like(text_input_embeds) + region_replace_mask = (cur_input_ids == self.config.im_region_fea_token) + # pdb.set_trace() + region_embs[region_replace_mask] = region_features[batch_idx].to(text_input_embeds.dtype) + text_input_embeds = text_input_embeds * (~region_replace_mask).to(text_input_embeds.dtype)[:, None] + region_embs + # print('region_embs[..., 0].nonzero()', region_embs[..., 0].nonzero()) + # raise NotImplementedError() + # pdb.set_trace() + else: + if hasattr(self.config, 'im_region_fea_token'): + assert (cur_input_ids == self.config.im_region_fea_token).sum() == 0 + + cur_new_input_embeds.append(text_input_embeds) + cur_new_input_embeds = [x.to(device=self.device) for x in cur_new_input_embeds] + cur_new_input_embeds = torch.cat(cur_new_input_embeds, dim=0) + new_input_embeds.append(cur_new_input_embeds) + if labels is not None: + cur_new_labels = torch.cat(cur_new_labels, dim=0) + new_labels.append(cur_new_labels) + + if any(x.shape != new_input_embeds[0].shape for x in new_input_embeds): + max_len = max(x.shape[0] for x in new_input_embeds) + + new_input_embeds_align = [] + for cur_new_embed in new_input_embeds: + cur_new_embed = torch.cat((cur_new_embed, torch.zeros((max_len - cur_new_embed.shape[0], cur_new_embed.shape[1]), dtype=cur_new_embed.dtype, device=cur_new_embed.device)), dim=0) + new_input_embeds_align.append(cur_new_embed) + new_input_embeds = torch.stack(new_input_embeds_align, dim=0) + + if labels is not None: + new_labels_align = [] + _new_labels = new_labels + for cur_new_label in new_labels: + cur_new_label = torch.cat((cur_new_label, torch.full((max_len - cur_new_label.shape[0],), IGNORE_INDEX, dtype=cur_new_label.dtype, device=cur_new_label.device)), dim=0) + new_labels_align.append(cur_new_label) + new_labels = torch.stack(new_labels_align, dim=0) + + if attention_mask is not None: + new_attention_mask = [] + for cur_attention_mask, cur_new_labels, cur_new_labels_align in zip(attention_mask, _new_labels, new_labels): + new_attn_mask_pad_left = torch.full((cur_new_labels.shape[0] - labels.shape[1],), True, dtype=attention_mask.dtype, device=attention_mask.device) + new_attn_mask_pad_right = torch.full((cur_new_labels_align.shape[0] - cur_new_labels.shape[0],), False, dtype=attention_mask.dtype, device=attention_mask.device) + cur_new_attention_mask = torch.cat((new_attn_mask_pad_left, cur_attention_mask, new_attn_mask_pad_right), dim=0) + new_attention_mask.append(cur_new_attention_mask) + attention_mask = torch.stack(new_attention_mask, dim=0) + assert attention_mask.shape == new_labels.shape + else: + new_input_embeds = torch.stack(new_input_embeds, dim=0) + if labels is not None: + new_labels = torch.stack(new_labels, dim=0) + + if attention_mask is not None: + new_attn_mask_pad_left = torch.full((attention_mask.shape[0], new_input_embeds.shape[1] - input_ids.shape[1]), True, dtype=attention_mask.dtype, device=attention_mask.device) + attention_mask = torch.cat((new_attn_mask_pad_left, attention_mask), dim=1) + assert attention_mask.shape == new_input_embeds.shape[:2] + + return None, attention_mask, past_key_values, new_input_embeds, new_labels + + def initialize_vision_tokenizer(self, model_args, tokenizer, add_region_feature=False): + if model_args.mm_use_im_patch_token: + tokenizer.add_tokens([DEFAULT_IMAGE_PATCH_TOKEN], special_tokens=True) + self.resize_token_embeddings(len(tokenizer)) + + if add_region_feature: + 
num_region_fea_tokens = tokenizer.add_tokens([DEFAULT_REGION_FEA_TOKEN], special_tokens=True) + self.config.im_region_fea_token = tokenizer.convert_tokens_to_ids([DEFAULT_REGION_FEA_TOKEN])[0] + self.resize_token_embeddings(len(tokenizer)) + + if model_args.mm_use_im_start_end: + num_new_tokens = tokenizer.add_tokens([DEFAULT_IM_START_TOKEN, DEFAULT_IM_END_TOKEN], special_tokens=True) + self.resize_token_embeddings(len(tokenizer)) + + if add_region_feature: + num_new_tokens = num_new_tokens + num_region_fea_tokens + + if num_new_tokens > 0: + input_embeddings = self.get_input_embeddings().weight.data + output_embeddings = self.get_output_embeddings().weight.data + + input_embeddings_avg = input_embeddings[:-num_new_tokens].mean( + dim=0, keepdim=True) + output_embeddings_avg = output_embeddings[:-num_new_tokens].mean( + dim=0, keepdim=True) + + input_embeddings[-num_new_tokens:] = input_embeddings_avg + output_embeddings[-num_new_tokens:] = output_embeddings_avg + + if model_args.tune_mm_mlp_adapter: + for p in self.get_input_embeddings().parameters(): + p.requires_grad = True + for p in self.get_output_embeddings().parameters(): + p.requires_grad = False + + if model_args.pretrain_mm_mlp_adapter: + mm_projector_weights = torch.load(model_args.pretrain_mm_mlp_adapter, map_location='cpu') + embed_tokens_weight = mm_projector_weights['model.embed_tokens.weight'] + if add_region_feature: + num_new_tokens = num_new_tokens - num_region_fea_tokens + assert num_new_tokens == 2 + if input_embeddings.shape == embed_tokens_weight.shape: + input_embeddings[-num_new_tokens:] = embed_tokens_weight[-num_new_tokens:] + elif embed_tokens_weight.shape[0] == num_new_tokens: + input_embeddings[-num_new_tokens:] = embed_tokens_weight + else: + raise ValueError(f"Unexpected embed_tokens_weight shape. Pretrained: {embed_tokens_weight.shape}. Current: {input_embeddings.shape}. Numer of new tokens: {num_new_tokens}.") + elif model_args.mm_use_im_patch_token: + if model_args.tune_mm_mlp_adapter: + for p in self.get_input_embeddings().parameters(): + p.requires_grad = False + for p in self.get_output_embeddings().parameters(): + p.requires_grad = False diff --git a/openeqa/baselines/ferret/model/language_model/ferret_llama.py b/openeqa/baselines/ferret/model/language_model/ferret_llama.py new file mode 100644 index 0000000..02ae019 --- /dev/null +++ b/openeqa/baselines/ferret/model/language_model/ferret_llama.py @@ -0,0 +1,139 @@ +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
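One detail of `initialize_vision_tokenizer` above that is easy to miss in the diff: the embeddings of newly added special tokens (the region-feature token and the optional image start/end tokens) are initialized to the mean of the pre-existing embedding rows rather than left at random init. A minimal standalone sketch of that step is below; the `gpt2` checkpoint and the token strings are placeholders for illustration, not what this repo actually loads.

```python
# Sketch: initialize newly added special-token embeddings to the mean of the
# existing rows, mirroring initialize_vision_tokenizer above.
# "gpt2" and the token names are stand-ins, not this repo's checkpoints.
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

num_new = tokenizer.add_tokens(["<im_start>", "<im_end>", "<region_fea>"], special_tokens=True)
model.resize_token_embeddings(len(tokenizer))

if num_new > 0:
    input_emb = model.get_input_embeddings().weight.data
    output_emb = model.get_output_embeddings().weight.data
    # average only over the rows that existed before the resize
    input_emb[-num_new:] = input_emb[:-num_new].mean(dim=0, keepdim=True)
    output_emb[-num_new:] = output_emb[:-num_new].mean(dim=0, keepdim=True)
```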
+ + +from typing import List, Optional, Tuple, Union + +import torch +import torch.nn as nn +from torch.nn import CrossEntropyLoss + +from transformers import AutoConfig, AutoModelForCausalLM, \ + LlamaConfig, LlamaModel, LlamaForCausalLM + +from transformers.modeling_outputs import CausalLMOutputWithPast + +from ..ferret_arch import FERRETMetaModel, FERRETMetaForCausalLM + + +class FERRETConfig(LlamaConfig): + model_type = "ferret" + + +class FERRETLlamaModel(FERRETMetaModel, LlamaModel): + config_class = FERRETConfig + + def __init__(self, config: LlamaConfig): + super(FERRETLlamaModel, self).__init__(config) + + +class FERRETLlamaForCausalLM(LlamaForCausalLM, FERRETMetaForCausalLM): + config_class = FERRETConfig + + def __init__(self, config): + super(LlamaForCausalLM, self).__init__(config) + self.model = FERRETLlamaModel(config) + + self.lm_head = nn.Linear(config.hidden_size, config.vocab_size, bias=False) + + # Initialize weights and apply final processing + self.post_init() + + def get_model(self): + return self.model + + def forward( + self, + input_ids: torch.LongTensor = None, + attention_mask: Optional[torch.Tensor] = None, + region_masks: Optional[List[torch.Tensor]] = None, + past_key_values: Optional[List[torch.FloatTensor]] = None, + inputs_embeds: Optional[torch.FloatTensor] = None, + labels: Optional[torch.LongTensor] = None, + use_cache: Optional[bool] = None, + output_attentions: Optional[bool] = None, + output_hidden_states: Optional[bool] = None, + images: Optional[torch.FloatTensor] = None, + return_dict: Optional[bool] = None, + ) -> Union[Tuple, CausalLMOutputWithPast]: + output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions + output_hidden_states = ( + output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states + ) + return_dict = return_dict if return_dict is not None else self.config.use_return_dict + #print(f'ferret_llama images: {images}, shape: {images.shape}') + input_ids, attention_mask, past_key_values, inputs_embeds, labels = self.prepare_inputs_labels_for_multimodal(input_ids, attention_mask, past_key_values, labels, images, region_masks=region_masks) + #print(f'attention_mask: {attention_mask}') + # decoder outputs consists of (dec_features, layer_state, dec_hidden, dec_attn) + outputs = self.model( + input_ids=input_ids, + attention_mask=attention_mask, + past_key_values=past_key_values, + inputs_embeds=inputs_embeds, + use_cache=use_cache, + output_attentions=output_attentions, + output_hidden_states=output_hidden_states, + return_dict=return_dict + ) + + hidden_states = outputs[0] + logits = self.lm_head(hidden_states) + + loss = None + if labels is not None: + # Shift so that tokens < n predict n + shift_logits = logits[..., :-1, :].contiguous() + shift_labels = labels[..., 1:].contiguous() + # Flatten the tokens + loss_fct = CrossEntropyLoss() + shift_logits = shift_logits.view(-1, self.config.vocab_size) + shift_labels = shift_labels.view(-1) + # Enable model/pipeline parallelism + shift_labels = shift_labels.to(shift_logits.device) + loss = loss_fct(shift_logits, shift_labels) + + if not return_dict: + output = (logits,) + outputs[1:] + return (loss,) + output if loss is not None else output + + return CausalLMOutputWithPast( + loss=loss, + logits=logits, + past_key_values=outputs.past_key_values, + hidden_states=outputs.hidden_states, + attentions=outputs.attentions, + ) + + def prepare_inputs_for_generation( + self, input_ids, past_key_values=None, 
attention_mask=None, inputs_embeds=None, **kwargs + ): + if past_key_values: + input_ids = input_ids[:, -1:] + + # if `inputs_embeds` are passed, we only want to use them in the 1st generation step + if inputs_embeds is not None and past_key_values is None: + model_inputs = {"inputs_embeds": inputs_embeds} + else: + model_inputs = {"input_ids": input_ids} + + model_inputs.update( + { + "past_key_values": past_key_values, + "use_cache": kwargs.get("use_cache"), + "attention_mask": attention_mask, + "images": kwargs.get("images", None), + } + ) + return model_inputs + +AutoConfig.register("ferret", FERRETConfig) +AutoModelForCausalLM.register(FERRETConfig, FERRETLlamaForCausalLM) diff --git a/openeqa/baselines/ferret/model/make_delta.py b/openeqa/baselines/ferret/model/make_delta.py new file mode 100644 index 0000000..f7781e9 --- /dev/null +++ b/openeqa/baselines/ferret/model/make_delta.py @@ -0,0 +1,74 @@ +""" +Usage: +# 7B +python3 -m ferret.model.make_delta \ + --base ./model/vicuna-7b-v1-3 \ + --target ./checkpoints/ferret_ft_clipL336_vicunaV1-3-7b_3Ep_dataV16_RSamplerV2/checkpoint-final \ + --delta ./checkpoints/ferret_ft_clipL336_vicunaV1-3-7b_3Ep_dataV16_RSamplerV2/ferret-7b-delta + +# 13B +python3 -m ferret.model.make_delta \ + --base ./model/vicuna-13b-v1-3 \ + --target ./checkpoints/ferret_ft_clipL336_vicunaV1-3-13b_3Ep_dataV16_RSamplerV2/checkpoint-final \ + --delta ./checkpoints/ferret_ft_clipL336_vicunaV1-3-13b_3Ep_dataV16_RSamplerV2/ferret-13b-delta +""" +import argparse + +import torch +from tqdm import tqdm +from transformers import AutoTokenizer, AutoModelForCausalLM +from ferret.model.utils import auto_upgrade + +# all the parameters inside the geosampler and mm projector +exclude_name_lists = ['model.mm_projector.weight', 'model.mm_projector.bias', + 'model.region_geo_sampler.agg_projector_list.0.net.0.bias', 'model.region_geo_sampler.agg_projector_list.0.net.0.weight', + 'model.region_geo_sampler.agg_projector_list.0.norm.bias', 'model.region_geo_sampler.agg_projector_list.0.norm.weight', + 'model.region_geo_sampler.agg_projector_list.1.net.0.bias', 'model.region_geo_sampler.agg_projector_list.1.net.0.weight', + 'model.region_geo_sampler.agg_projector_list.1.norm.bias', 'model.region_geo_sampler.agg_projector_list.1.norm.weight', + 'model.region_geo_sampler.diff_projector_list.0.bias', 'model.region_geo_sampler.diff_projector_list.0.weight', + 'model.region_geo_sampler.diff_projector_list.1.bias', 'model.region_geo_sampler.diff_projector_list.1.weight', + 'model.region_geo_sampler.dim_projector.bias', 'model.region_geo_sampler.dim_projector.weight', + 'model.region_geo_sampler.flatten_projector.bias', 'model.region_geo_sampler.flatten_projector.weight' + ] + + +def make_delta(base_model_path, target_model_path, delta_path, hub_repo_id): + print("Loading base model") + base = AutoModelForCausalLM.from_pretrained( + base_model_path, torch_dtype=torch.float16, low_cpu_mem_usage=True) + + print("Loading target model") + auto_upgrade(target_model_path) + target = AutoModelForCausalLM.from_pretrained(target_model_path, torch_dtype=torch.float16, low_cpu_mem_usage=True) + + print("Calculating delta") + for name, param in tqdm(target.state_dict().items(), desc="Calculating delta"): + if name not in base.state_dict(): + assert name in exclude_name_lists, f'{name} not in base model' + continue + if param.data.shape == base.state_dict()[name].shape: + param.data -= base.state_dict()[name] + else: + assert name in ['model.embed_tokens.weight', 'lm_head.weight'], f'{name} dimension 
mismatch: {param.data.shape} vs {base.state_dict()[name].shape}' + bparam = base.state_dict()[name] + param.data[:bparam.shape[0], :bparam.shape[1]] -= bparam + + print("Saving delta") + if hub_repo_id: + kwargs = {"push_to_hub": True, "repo_id": hub_repo_id} + else: + kwargs = {} + target.save_pretrained(delta_path, **kwargs) + target_tokenizer = AutoTokenizer.from_pretrained(target_model_path) + target_tokenizer.save_pretrained(delta_path, **kwargs) + + +if __name__ == "__main__": + parser = argparse.ArgumentParser() + parser.add_argument("--base-model-path", type=str, required=True) + parser.add_argument("--target-model-path", type=str, required=True) + parser.add_argument("--delta-path", type=str, required=True) + parser.add_argument("--hub-repo-id", type=str, default=None) + args = parser.parse_args() + + make_delta(args.base_model_path, args.target_model_path, args.delta_path, args.hub_repo_id) diff --git a/openeqa/baselines/ferret/model/multimodal_encoder/builder.py b/openeqa/baselines/ferret/model/multimodal_encoder/builder.py new file mode 100644 index 0000000..2b13589 --- /dev/null +++ b/openeqa/baselines/ferret/model/multimodal_encoder/builder.py @@ -0,0 +1,11 @@ +import os +from .clip_encoder import CLIPVisionTower + + +def build_vision_tower(vision_tower_cfg, **kwargs): + vision_tower = getattr(vision_tower_cfg, 'mm_vision_tower', getattr(vision_tower_cfg, 'vision_tower', None)) + is_absolute_path_exists = os.path.exists(vision_tower) + if is_absolute_path_exists or vision_tower.startswith("openai") or vision_tower.startswith("laion"): + return CLIPVisionTower(vision_tower, args=vision_tower_cfg, **kwargs) + + raise ValueError(f'Unknown vision tower: {vision_tower}') diff --git a/openeqa/baselines/ferret/model/multimodal_encoder/clip_encoder.py b/openeqa/baselines/ferret/model/multimodal_encoder/clip_encoder.py new file mode 100644 index 0000000..8f92567 --- /dev/null +++ b/openeqa/baselines/ferret/model/multimodal_encoder/clip_encoder.py @@ -0,0 +1,126 @@ +import torch +import torch.nn as nn + +from transformers import CLIPVisionModel, CLIPImageProcessor, CLIPVisionConfig +# Added for customized Processor. +import math +import numpy as np +from typing import Dict +from transformers.image_utils import PILImageResampling, ChannelDimension +from transformers.image_processing_utils import get_size_dict +from transformers.image_transforms import ( + get_resize_output_image_size, + resize, +) +from typing import List, Optional, Tuple, Union + +class CLIPImageProcessor_GIT(CLIPImageProcessor): + def resize( + self, + image: np.ndarray, + size: Dict[str, int], + resample: PILImageResampling = PILImageResampling.BICUBIC, + data_format: Optional[Union[str, ChannelDimension]] = None, + **kwargs, + ) -> np.ndarray: + """ + Resize an image. The shortest edge of the image is resized to size["shortest_edge"], with the longest edge + resized to keep the input aspect ratio. + Args: + image (`np.ndarray`): + Image to resize. + size (`Dict[str, int]`): + Size of the output image. + resample (`PILImageResampling`, *optional*, defaults to `PILImageResampling.BICUBIC`): + Resampling filter to use when resiizing the image. + data_format (`str` or `ChannelDimension`, *optional*): + The channel dimension format of the image. If not provided, it will be the same as the input image. + """ + size = get_size_dict(size, default_to_square=True, height_width_order=True) + # Hack(haoxuan): Bypass the shortest_edge detection. We hope to get a {"height": size[0], "width": size[1]}, where w=h. 
+        # if "shortest_edge" not in size:
+        #     raise ValueError(f"The `size` parameter must contain the key `shortest_edge`. Got {size.keys()}")
+        # output_size = get_resize_output_image_size(image, size=size["shortest_edge"], default_to_square=True)
+        output_size = get_resize_output_image_size(image, size=(size["height"], size["width"]), default_to_square=True)
+        return resize(image, size=output_size, resample=resample, data_format=data_format, **kwargs)
+
+
+class CLIPVisionTower(nn.Module):
+    def __init__(self, vision_tower, args, delay_load=False):
+        super().__init__()
+
+        self.is_loaded = False
+
+        self.vision_tower_name = vision_tower
+        self.select_layer = args.mm_vision_select_layer
+        self.select_feature = getattr(args, 'mm_vision_select_feature', 'patch')
+
+        if not delay_load:
+            self.load_model()
+        else:
+            self.cfg_only = CLIPVisionConfig.from_pretrained(self.vision_tower_name)
+
+    def load_model(self, vision_tower_path=None):
+        self.image_processor = CLIPImageProcessor_GIT.from_pretrained(self.vision_tower_name)
+        if vision_tower_path is not None:
+            self.vision_tower, loading_info = CLIPVisionModel.from_pretrained(vision_tower_path, output_loading_info=True)
+            print('loading_info:', loading_info)
+        else:
+            print(f'clip_encoder load_model vision_tower_path is None')
+            self.vision_tower = CLIPVisionModel.from_pretrained(self.vision_tower_name)
+        self.vision_tower.requires_grad_(False)
+
+        self.is_loaded = True
+
+    def feature_select(self, image_forward_outs):
+        image_features = image_forward_outs.hidden_states[self.select_layer]
+        if self.select_feature == 'patch':
+            image_features = image_features[:, 1:]
+        elif self.select_feature == 'cls_patch':
+            image_features = image_features
+        else:
+            raise ValueError(f'Unexpected select feature: {self.select_feature}')
+        return image_features
+
+    @torch.no_grad()
+    def forward(self, images):
+        print('clip encoder forward')
+        if type(images) is list:
+            print('clip encoder forward..... list')
+            image_features = []
+            for image in images:
+                image_forward_out = self.vision_tower(image.to(device=self.device, dtype=self.dtype).unsqueeze(0), output_hidden_states=True)
+                image_feature = self.feature_select(image_forward_out).to(image.dtype)
+                image_features.append(image_feature)
+        else:  # when images is a single batched tensor
+            print('clip encoder forward.....
tensor') + image_forward_outs = self.vision_tower(images.to(device=self.device, dtype=self.dtype), output_hidden_states=True) + image_features = self.feature_select(image_forward_outs).to(images.dtype) + return image_features + + @property + def dummy_feature(self): + return torch.zeros(1, self.hidden_size, device=self.device, dtype=self.dtype) + + @property + def dtype(self): + return self.vision_tower.dtype + + @property + def device(self): + return self.vision_tower.device + + @property + def config(self): + if self.is_loaded: + return self.vision_tower.config + else: + return self.cfg_only + + @property + def hidden_size(self): + return self.config.hidden_size + + @property + def num_patches(self): + return (self.config.image_size // self.config.patch_size) ** 2 diff --git a/openeqa/baselines/ferret/model/utils.py b/openeqa/baselines/ferret/model/utils.py new file mode 100644 index 0000000..bbdf3b2 --- /dev/null +++ b/openeqa/baselines/ferret/model/utils.py @@ -0,0 +1,20 @@ +from transformers import AutoConfig + + +def auto_upgrade(config): + cfg = AutoConfig.from_pretrained(config) + if 'llava' in config and 'llava' not in cfg.model_type: + assert cfg.model_type == 'llama' + print("You are using newer LLaVA code base, while the checkpoint of v0 is from older code base.") + print("You must upgrade the checkpoint to the new code base (this can be done automatically).") + confirm = input("Please confirm that you want to upgrade the checkpoint. [Y/N]") + if confirm.lower() in ["y", "yes"]: + print("Upgrading checkpoint...") + assert len(cfg.architectures) == 1 + setattr(cfg.__class__, "model_type", "llava") + cfg.architectures[0] = 'FERRETLlamaForCausalLM' + cfg.save_pretrained(config) + print("Checkpoint upgraded.") + else: + print("Checkpoint upgrade aborted.") + exit(1) diff --git a/openeqa/baselines/ferret/utils.py b/openeqa/baselines/ferret/utils.py new file mode 100644 index 0000000..baaaa06 --- /dev/null +++ b/openeqa/baselines/ferret/utils.py @@ -0,0 +1,126 @@ +import datetime +import logging +import logging.handlers +import os +import sys + +import requests + +from .constants import LOGDIR + +server_error_msg = "**NETWORK ERROR DUE TO HIGH TRAFFIC. PLEASE REGENERATE OR REFRESH THIS PAGE.**" +moderation_msg = "YOUR INPUT VIOLATES OUR CONTENT MODERATION GUIDELINES. PLEASE TRY AGAIN." 
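`CLIPVisionTower.feature_select` above keeps the patch tokens from one intermediate hidden layer of the CLIP encoder and, in `'patch'` mode, drops the leading CLS token. A small self-contained sketch of that selection step follows; the checkpoint name and the `-2` layer index are illustrative assumptions (the repo resolves both from the model config at load time).

```python
# Sketch of the patch-feature selection performed by CLIPVisionTower.feature_select.
# Checkpoint name and select_layer are stand-ins for values read from the config.
import torch
from PIL import Image
from transformers import CLIPImageProcessor, CLIPVisionModel

name = "openai/clip-vit-large-patch14-336"
processor = CLIPImageProcessor.from_pretrained(name)
vision_tower = CLIPVisionModel.from_pretrained(name)

image = Image.new("RGB", (336, 336))  # dummy input image
pixel_values = processor(images=image, return_tensors="pt")["pixel_values"]

with torch.no_grad():
    out = vision_tower(pixel_values, output_hidden_states=True)

select_layer = -2                         # common LLaVA-style default, assumed here
hidden = out.hidden_states[select_layer]  # [batch, 1 + num_patches, hidden]
patch_features = hidden[:, 1:]            # 'patch' mode: drop the CLS token
print(patch_features.shape)               # e.g. torch.Size([1, 576, 1024]) for ViT-L/14-336
```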
+ +handler = None + + +def build_logger(logger_name, logger_filename): + global handler + + formatter = logging.Formatter( + fmt="%(asctime)s | %(levelname)s | %(name)s | %(message)s", + datefmt="%Y-%m-%d %H:%M:%S", + ) + + # Set the format of root handlers + if not logging.getLogger().handlers: + logging.basicConfig(level=logging.INFO) + logging.getLogger().handlers[0].setFormatter(formatter) + + # Redirect stdout and stderr to loggers + stdout_logger = logging.getLogger("stdout") + stdout_logger.setLevel(logging.INFO) + sl = StreamToLogger(stdout_logger, logging.INFO) + sys.stdout = sl + + stderr_logger = logging.getLogger("stderr") + stderr_logger.setLevel(logging.ERROR) + sl = StreamToLogger(stderr_logger, logging.ERROR) + sys.stderr = sl + + # Get logger + logger = logging.getLogger(logger_name) + logger.setLevel(logging.INFO) + + # Add a file handler for all loggers + if handler is None: + os.makedirs(LOGDIR, exist_ok=True) + filename = os.path.join(LOGDIR, logger_filename) + handler = logging.handlers.TimedRotatingFileHandler( + filename, when='D', utc=True) + handler.setFormatter(formatter) + + for name, item in logging.root.manager.loggerDict.items(): + if isinstance(item, logging.Logger): + item.addHandler(handler) + + return logger + + +class StreamToLogger(object): + """ + Fake file-like stream object that redirects writes to a logger instance. + """ + def __init__(self, logger, log_level=logging.INFO): + self.terminal = sys.stdout + self.logger = logger + self.log_level = log_level + self.linebuf = '' + + def __getattr__(self, attr): + return getattr(self.terminal, attr) + + def write(self, buf): + temp_linebuf = self.linebuf + buf + self.linebuf = '' + for line in temp_linebuf.splitlines(True): + # From the io.TextIOWrapper docs: + # On output, if newline is None, any '\n' characters written + # are translated to the system default line separator. + # By default sys.stdout.write() expects '\n' newlines and then + # translates them so this is still cross platform. + if line[-1] == '\n': + self.logger.log(self.log_level, line.rstrip()) + else: + self.linebuf += line + + def flush(self): + if self.linebuf != '': + self.logger.log(self.log_level, self.linebuf.rstrip()) + self.linebuf = '' + + +def disable_torch_init(): + """ + Disable the redundant torch default initialization to accelerate model creation. + """ + import torch + setattr(torch.nn.Linear, "reset_parameters", lambda self: None) + setattr(torch.nn.LayerNorm, "reset_parameters", lambda self: None) + + +def violates_moderation(text): + """ + Check whether the text violates OpenAI moderation API. + """ + url = "https://api.openai.com/v1/moderations" + headers = {"Content-Type": "application/json", + "Authorization": "Bearer " + os.environ["OPENAI_API_KEY"]} + text = text.replace("\n", "") + data = "{" + '"input": ' + f'"{text}"' + "}" + data = data.encode("utf-8") + try: + ret = requests.post(url, headers=headers, data=data, timeout=5) + flagged = ret.json()["results"][0]["flagged"] + except requests.exceptions.RequestException as e: + flagged = False + except KeyError as e: + flagged = False + + return flagged + + +def pretty_print_semaphore(semaphore): + if semaphore is None: + return "None" + return f"Semaphore(value={semaphore._value}, locked={semaphore.locked()})" diff --git a/openeqa/baselines/llama_rag.py b/openeqa/baselines/llama_rag.py new file mode 100644 index 0000000..479a2f6 --- /dev/null +++ b/openeqa/baselines/llama_rag.py @@ -0,0 +1,253 @@ +# Copyright (c) Meta Platforms, Inc. and affiliates. 
+ +# This source code is licensed under the MIT license found in the +# LICENSE file in the root directory of this source tree. + +import logging +import argparse +import sys +import os +import json +from pathlib import Path +from typing import Optional +import time, datetime + +import pickle +import numpy as np +from numpy.linalg import norm +import tqdm + +sys.path.insert(0, './') +sys.path.insert(0, './openeqa') +from sentence_transformers import SentenceTransformer +from openeqa.utils.llama_utils import LLaMARunner, enable_full_determinism +from openeqa.utils.prompt_utils import load_prompt + +log = logging.getLogger(__name__) + + +def parse_args() -> argparse.Namespace: + parser = argparse.ArgumentParser() + parser.add_argument( + "--dataset", + type=Path, + default="data/open-eqa-v0.json", + help="path to EQA dataset (default: data/open-eqa-v0.json)", + ) + parser.add_argument( + "--source", + type=str, + required=True, + help="scannet or hm3d", + ) + parser.add_argument( + "-m", + "--model-path", + type=Path, + required=True, + help="path to weights in huggingface format", + ) + parser.add_argument( + "--model-name", + type=str, + help="model name (defaults to model path folder name)", + ) + parser.add_argument( + "--load-in-8bit", + action="store_true", + help="load model in 8bit mode (default: false)", + ) + parser.add_argument( + "--use-fast-kernels", + action="store_true", + help="use fast kernels (default: false)", + ) + parser.add_argument( + "--seed", + type=int, + default=1234, + help="gpt seed (default: 1234)", + ) + parser.add_argument( + "--temperature", + type=float, + default=0.2, + help="gpt temperature (default: 0.2)", + ) + parser.add_argument( + "--max-tokens", + type=int, + default=7000, + help="gpt maximum tokens (default: 128)", + ) + parser.add_argument( + "--output-directory", + type=Path, + default="data/results", + help="output directory (default: data/results)", + ) + parser.add_argument( + "--frames-directory", + type=Path, + default="data/frames/", + help="path image frames (default: data/frames/)", + ) + parser.add_argument( + "--force", + action="store_true", + help="continue running on API errors (default: false)", + ) + parser.add_argument( + "--dry-run", + action="store_true", + help="only process the first 5 questions", + ) + parser.add_argument( + "--ic-example-num", + type=int, + default=3, + help="using rag in-context example number", + ) + parser.add_argument( + "--prompt", + type=str, + required=True, + ) + parser.add_argument( + "--captioning-model", + type=str, + required=True, + ) + args = parser.parse_args() + enable_full_determinism(args.seed) + if args.model_name is None: + args.model_name = args.model_path.name.lower() + args.output_directory.mkdir(parents=True, exist_ok=True) + args.output_path = args.output_directory / ( + args.dataset.stem + "-{}-{}-{}-rag.json".format(args.model_name, args.source, args.prompt) + ) + return args + +def parse_output(output: str) -> str: + #start_idx = output.find("A:") + end_idx = output.find("Q") + print(f'end_idx: {end_idx}') + # if end_idx == -1: + # return output[start_idx:].replace("A:", "").strip() + answer_text = output[:end_idx].strip() + print(f'answer_text: {answer_text}') + return answer_text + +def ask_question(args, model, question: str, ic_ex_prompt: list, + max_tokens: int = 200, temperature: float = 0.2) -> Optional[str]: + prompt = load_prompt(args.prompt) + + input = prompt.format(question=question, img_1=ic_ex_prompt[0], img_2=ic_ex_prompt[1], img_3=ic_ex_prompt[2]) #top-3 + + output 
= model(input, max_new_tokens=max_tokens, temperature=temperature) + + return parse_output(output) + +def cosine_similarity(embedding1, embedding2): + """Calculate cosine similarity between two embeddings.""" + return np.dot(embedding1, embedding2) / (norm(embedding1) * norm(embedding2)) + +def retrieval(paths, question, ic_example_num): + print('retrieval') + embedding_model='all-MiniLM-L6-v2' + sbert = SentenceTransformer(embedding_model) + + em_question = sbert.encode(question) + + ic_ex_encode_list = [] + + for text_traj in paths: + with open(text_traj, 'rb') as file: + ic_ex_encoding = pickle.load(file) + + similarity = cosine_similarity(ic_ex_encoding['embedding'], em_question) + ic_ex_encoding['similarity'] = similarity + ic_ex_encode_list.append(ic_ex_encoding) + + sorted_ic_ex_encode_list = sorted(ic_ex_encode_list, key=lambda x: x['similarity'], reverse=True) + + ic_ex_files = [] + for idx, encoding in enumerate(sorted_ic_ex_encode_list): + if idx < ic_example_num: + ic_ex_files.append(encoding['text_traj_path']) + else: + pass + + return ic_ex_files + +def read_txt_file(file_path): + with open(file_path) as file: + text_description = file.read() + return text_description + +def main(args: argparse.Namespace): + # load dataset + dataset = json.load(args.dataset.open("r")) + print("found {:,} questions".format(len(dataset))) + + # load model + model = LLaMARunner( + args.model_path, + load_in_8bit=args.load_in_8bit, + use_fast_kernels=args.use_fast_kernels, + ) + + # load results + results = [] + if args.output_path.exists(): + results = json.load(args.output_path.open()) + print("found {:,} existing results".format(len(results))) + completed = [item["question_id"] for item in results] + dataset_name = [item["episode_history"] for item in dataset if args.source in item["episode_history"]] + + start = time.time() + + # process data + for idx, item in enumerate(tqdm.tqdm(dataset)): + if args.dry_run and idx >= 5: + break + + # skip completed questions + question_id = item["question_id"] + if question_id in completed: + continue # skip existing + + # Use this for experiments that require dataset splitting. For experiments on the full dataset, remove the if-statement. 
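The `retrieval` helper above embeds the question with the same SentenceTransformer that produced the cached caption embeddings, ranks the cached frames by cosine similarity, and keeps the top-k caption files as in-context examples. A compact sketch of that ranking step on toy data (the caption strings and top-k value are invented for illustration; the real inputs come from the `*-qwen.pkl` / `*-llava.pkl` caches):

```python
# Sketch of the RAG retrieval step in llama_rag.py: rank captions by cosine
# similarity to the question embedding and keep the top-k. Toy captions only.
import numpy as np
from numpy.linalg import norm
from sentence_transformers import SentenceTransformer

sbert = SentenceTransformer("all-MiniLM-L6-v2")

captions = [
    "A kitchen with a white refrigerator and a wooden table.",
    "A bedroom with a blue bed and a desk near the window.",
    "A bathroom with a bathtub and a small mirror.",
]
question = "What color is the refrigerator?"

caption_embs = sbert.encode(captions)   # shape [num_captions, 384]
question_emb = sbert.encode(question)   # shape [384]

def cosine_similarity(a, b):
    return float(np.dot(a, b) / (norm(a) * norm(b)))

scores = [cosine_similarity(e, question_emb) for e in caption_embs]
top_k = 2
best = sorted(range(len(captions)), key=lambda i: scores[i], reverse=True)[:top_k]
print([captions[i] for i in best])  # the most relevant captions become in-context examples
```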
+ if 'hm3d' in item["episode_history"]: + pass + elif args.source in item["episode_history"]: + folder = args.frames_directory / item["episode_history"] + if 'llava' in args.captioning_model: + frames = sorted(folder.glob("*-llava.pkl")) + elif 'qwen' in args.captioning_model: + frames = sorted(folder.glob("*-qwen.pkl")) + else: + frames = sorted(folder.glob("*-rgb.pkl")) + paths = [str(frames[i]) for i in range(len(frames))] + + # generate answer + question = item["question"] + ic_ex_files = retrieval(paths, question, args.ic_example_num) + + ic_examples = [read_txt_file(ic_ex_file) for ic_ex_file in ic_ex_files] + answer = ask_question(args, model=model, question=question, ic_ex_prompt=ic_examples) + + # store results + results.append({"question_id": question_id, "category": item['category'], "question": question, "answer": answer, "GT answer": item["answer"], "ic_ex_files": ic_ex_files, "ic_examples": ic_examples, "time":str(datetime.timedelta(seconds=(time.time() - start)))}) + json.dump(results, args.output_path.open("w"), indent=2) + + print(f'{idx+1}/{len(dataset_name)}') + else: + break + + # save at end (redundant) + json.dump(results, args.output_path.open("w"), indent=2) + print("saving {:,} answers".format(len(results))) + +if __name__ == "__main__": + main(parse_args()) diff --git a/openeqa/baselines/llama_uniform_sampling.py b/openeqa/baselines/llama_uniform_sampling.py new file mode 100644 index 0000000..83e6ecb --- /dev/null +++ b/openeqa/baselines/llama_uniform_sampling.py @@ -0,0 +1,241 @@ +# Copyright (c) Meta Platforms, Inc. and affiliates. + +# This source code is licensed under the MIT license found in the +# LICENSE file in the root directory of this source tree. + +import logging +import argparse +import sys +import os +import json +from pathlib import Path +from typing import Optional +import time, datetime + +import pickle +import numpy as np +from numpy.linalg import norm +import tqdm + +sys.path.insert(0, './') +sys.path.insert(0, './openeqa') +from sentence_transformers import SentenceTransformer +from openeqa.utils.llama_utils import LLaMARunner, enable_full_determinism +from openeqa.utils.prompt_utils import load_prompt + +log = logging.getLogger(__name__) + + +def parse_args() -> argparse.Namespace: + parser = argparse.ArgumentParser() + parser.add_argument( + "--dataset", + type=Path, + default="data/open-eqa-v0.json", + help="path to EQA dataset (default: data/open-eqa-v0.json)", + ) + parser.add_argument( + "--source", + type=str, + required=True, + help="scannet or hm3d", + ) + parser.add_argument( + "-m", + "--model-path", + type=Path, + required=True, + help="path to weights in huggingface format", + ) + parser.add_argument( + "--model-name", + type=str, + help="model name (defaults to model path folder name)", + ) + parser.add_argument( + "--visualization", + action="store_true", + help="embedding vector visualization", + ) + parser.add_argument( + "--load-in-8bit", + action="store_true", + help="load model in 8bit mode (default: false)", + ) + parser.add_argument( + "--use-fast-kernels", + action="store_true", + help="use fast kernels (default: false)", + ) + parser.add_argument( + "--seed", + type=int, + default=1234, + help="gpt seed (default: 1234)", + ) + parser.add_argument( + "--temperature", + type=float, + default=0.2, + help="gpt temperature (default: 0.2)", + ) + parser.add_argument( + "--max-tokens", + type=int, + default=7000, + help="gpt maximum tokens (default: 128)", + ) + parser.add_argument( + "--output-directory", + type=Path, + 
default="data/results", + help="output directory (default: data/results)", + ) + parser.add_argument( + "--frames-directory", + type=Path, + default="data/frames/", + help="path image frames (default: data/frames/)", + ) + parser.add_argument( + "--force", + action="store_true", + help="continue running on API errors (default: false)", + ) + parser.add_argument( + "--dry-run", + action="store_true", + help="only process the first 5 questions", + ) + parser.add_argument( + "--ic-example-num", + type=int, + default=10, + help="using rag in-context example number", + ) + parser.add_argument( + "--prompt", + type=str, + required=True, + ) + parser.add_argument( + "--captioning-model", + type=str, + required=True, + ) + args = parser.parse_args() + enable_full_determinism(args.seed) + if args.model_name is None: + args.model_name = args.model_path.name.lower() + args.output_directory.mkdir(parents=True, exist_ok=True) + args.output_path = args.output_directory / ( + args.dataset.stem + "-{}-{}-{}-sampling.json".format(args.model_name, args.source, args.prompt) + ) + return args + +def parse_output(output: str) -> str: + #start_idx = output.find("A:") + end_idx = output.find("Q") + print(f'end_idx: {end_idx}') + # if end_idx == -1: + # return output[start_idx:].replace("A:", "").strip() + answer_text = output[:end_idx].strip() + return answer_text + +def ask_question(args, model, question: str, ic_ex_prompt: list, + max_tokens: int = 200, temperature: float = 0.2) -> Optional[str]: + prompt = load_prompt(args.prompt) + + format_dict = {'question': question} + + for i, img in enumerate(ic_ex_prompt): + format_dict[f'img_{i+1}'] = img + + input = prompt.format(**format_dict) + output = model(input, max_new_tokens=max_tokens, temperature=temperature) + + return parse_output(output) + +def read_txt_file(file_path): + with open(file_path) as file: + text_description = file.read() + return text_description + +def main(args: argparse.Namespace): + # load dataset + dataset = json.load(args.dataset.open("r")) + print("found {:,} questions".format(len(dataset))) + + # load model + model = LLaMARunner( + args.model_path, + load_in_8bit=args.load_in_8bit, + use_fast_kernels=args.use_fast_kernels, + ) + + # load results + results = [] + + token_count_sum = 0 + if args.output_path.exists(): + results = json.load(args.output_path.open()) + print("found {:,} existing results".format(len(results))) + completed = [item["question_id"] for item in results] + dataset_name = [item["episode_history"] for item in dataset if args.source in item["episode_history"]] + print(f'len {args.source}: {len(dataset_name)}') + + start = time.time() + + # process data + for idx, item in enumerate(tqdm.tqdm(dataset)): + ic_ex_files = [] + + if args.dry_run and idx >= 5: + break + + # skip completed questions + question_id = item["question_id"] + if question_id in completed: + continue # skip existing + + # Use this for experiments that require dataset splitting. For experiments on the full dataset, remove the if-statement. 
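`ask_question` above fills a prompt template whose placeholders are numbered `img_1 … img_k` with the selected captions before calling the LLaMA runner. A toy illustration of that formatting step follows; the template text is invented, while the real templates live under `prompts/` and are selected via `--prompt`.

```python
# Toy illustration of the format_dict built by ask_question in
# llama_uniform_sampling.py. The template below is a stand-in for the
# prompt files referenced by --prompt (e.g. vlm_uniform_sampling).
template = (
    "You are answering a question about a scanned indoor scene.\n"
    "Frame captions:\n1. {img_1}\n2. {img_2}\n3. {img_3}\n"
    "Q: {question}\nA:"
)

captions = [
    "A living room with a grey sofa.",
    "A hallway with a coat rack.",
    "A kitchen with a gas stove.",
]
question = "Where is the coat rack?"

format_dict = {"question": question}
for i, caption in enumerate(captions):
    format_dict[f"img_{i + 1}"] = caption

prompt = template.format(**format_dict)
print(prompt)
```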
+ if 'hm3d' in item["episode_history"]: + pass + elif args.source in item["episode_history"]: + # extract scene paths + folder = args.frames_directory / item["episode_history"] + if 'llava' in args.captioning_model: + frames = sorted(folder.glob("*-llava.pkl")) + elif 'qwen' in args.captioning_model: + frames = sorted(folder.glob("*-qwen.pkl")) + else: + frames = sorted(folder.glob("*-rgb.pkl")) + indices = np.round(np.linspace(0, len(frames) - 1, args.ic_example_num)).astype(int) + paths = [str(frames[i]) for i in indices] + + for text_traj in paths: + with open(text_traj, 'rb') as file: + ic_ex_encoding = pickle.load(file) + ic_ex_files.append(ic_ex_encoding['text_traj_path']) + + ic_examples = [read_txt_file(ic_ex_file) for ic_ex_file in ic_ex_files] + + # generate answer + question = item["question"] + answer = ask_question(args, model=model, question=question, ic_ex_prompt=ic_examples) + + # store results + results.append({"question_id": question_id, "category": item['category'], "question": question, "answer": answer, "GT answer": item["answer"], "ic_ex_files": ic_ex_files, "ic_examples": ic_examples, "time":str(datetime.timedelta(seconds=(time.time() - start)))}) + json.dump(results, args.output_path.open("w"), indent=2) + + print(f'{idx+1}/{len(dataset_name)}') + else: + break + + # save at end (redundant) + json.dump(results, args.output_path.open("w"), indent=2) + print("saving {:,} answers".format(len(results))) + + +if __name__ == "__main__": + main(parse_args()) diff --git a/openeqa/baselines/llava/__init__.py b/openeqa/baselines/llava/__init__.py new file mode 100644 index 0000000..4d1f016 --- /dev/null +++ b/openeqa/baselines/llava/__init__.py @@ -0,0 +1 @@ +from .model import LlavaLlamaForCausalLM diff --git a/openeqa/baselines/llava/constants.py b/openeqa/baselines/llava/constants.py new file mode 100644 index 0000000..374be09 --- /dev/null +++ b/openeqa/baselines/llava/constants.py @@ -0,0 +1,13 @@ +CONTROLLER_HEART_BEAT_EXPIRATION = 30 +WORKER_HEART_BEAT_INTERVAL = 15 + +LOGDIR = "." 
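In contrast to the RAG baseline, the uniform-sampling loop above picks evenly spaced caption files with `np.linspace` before building the prompt. A minimal sketch of that index selection (the frame count is a made-up example):

```python
# Sketch of the uniform frame selection in llama_uniform_sampling.py:
# choose ic_example_num evenly spaced indices over the available frames.
import numpy as np

num_frames = 187          # e.g. number of *-qwen.pkl caption files in one episode
ic_example_num = 10       # --ic-example-num (default 10 in the script above)

indices = np.round(np.linspace(0, num_frames - 1, ic_example_num)).astype(int)
print(indices.tolist())   # [0, 21, 41, 62, 83, 103, 124, 145, 165, 186]
```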
+ +# Model Constants +IGNORE_INDEX = -100 +IMAGE_TOKEN_INDEX = -200 +DEFAULT_IMAGE_TOKEN = "" +DEFAULT_IMAGE_PATCH_TOKEN = "" +DEFAULT_IM_START_TOKEN = "" +DEFAULT_IM_END_TOKEN = "" +IMAGE_PLACEHOLDER = "" diff --git a/openeqa/baselines/llava/conversation.py b/openeqa/baselines/llava/conversation.py new file mode 100644 index 0000000..00c5686 --- /dev/null +++ b/openeqa/baselines/llava/conversation.py @@ -0,0 +1,396 @@ +import dataclasses +from enum import auto, Enum +from typing import List, Tuple +import base64 +from io import BytesIO +from PIL import Image + + +class SeparatorStyle(Enum): + """Different separator style.""" + SINGLE = auto() + TWO = auto() + MPT = auto() + PLAIN = auto() + LLAMA_2 = auto() + + +@dataclasses.dataclass +class Conversation: + """A class that keeps all conversation history.""" + system: str + roles: List[str] + messages: List[List[str]] + offset: int + sep_style: SeparatorStyle = SeparatorStyle.SINGLE + sep: str = "###" + sep2: str = None + version: str = "Unknown" + + skip_next: bool = False + + def get_prompt(self): + messages = self.messages + if len(messages) > 0 and type(messages[0][1]) is tuple: + messages = self.messages.copy() + init_role, init_msg = messages[0].copy() + init_msg = init_msg[0].replace("", "").strip() + if 'mmtag' in self.version: + messages[0] = (init_role, init_msg) + messages.insert(0, (self.roles[0], "")) + messages.insert(1, (self.roles[1], "Received.")) + else: + messages[0] = (init_role, "\n" + init_msg) + + if self.sep_style == SeparatorStyle.SINGLE: + ret = self.system + self.sep + for role, message in messages: + if message: + if type(message) is tuple: + message, _, _ = message + ret += role + ": " + message + self.sep + else: + ret += role + ":" + elif self.sep_style == SeparatorStyle.TWO: + seps = [self.sep, self.sep2] + ret = self.system + seps[0] + for i, (role, message) in enumerate(messages): + if message: + if type(message) is tuple: + message, _, _ = message + ret += role + ": " + message + seps[i % 2] + else: + ret += role + ":" + elif self.sep_style == SeparatorStyle.MPT: + ret = self.system + self.sep + for role, message in messages: + if message: + if type(message) is tuple: + message, _, _ = message + ret += role + message + self.sep + else: + ret += role + elif self.sep_style == SeparatorStyle.LLAMA_2: + wrap_sys = lambda msg: f"<>\n{msg}\n<>\n\n" if len(msg) > 0 else msg + wrap_inst = lambda msg: f"[INST] {msg} [/INST]" + ret = "" + + for i, (role, message) in enumerate(messages): + if i == 0: + assert message, "first message should not be none" + assert role == self.roles[0], "first message should come from user" + if message: + if type(message) is tuple: + message, _, _ = message + if i == 0: message = wrap_sys(self.system) + message + if i % 2 == 0: + message = wrap_inst(message) + ret += self.sep + message + else: + ret += " " + message + " " + self.sep2 + else: + ret += "" + ret = ret.lstrip(self.sep) + elif self.sep_style == SeparatorStyle.PLAIN: + seps = [self.sep, self.sep2] + ret = self.system + for i, (role, message) in enumerate(messages): + if message: + if type(message) is tuple: + message, _, _ = message + ret += message + seps[i % 2] + else: + ret += "" + else: + raise ValueError(f"Invalid style: {self.sep_style}") + + return ret + + def append_message(self, role, message): + self.messages.append([role, message]) + + def process_image(self, image, image_process_mode, return_pil=False, image_format='PNG', max_len=1344, min_len=672): + if image_process_mode == "Pad": + def expand2square(pil_img, 
background_color=(122, 116, 104)): + width, height = pil_img.size + if width == height: + return pil_img + elif width > height: + result = Image.new(pil_img.mode, (width, width), background_color) + result.paste(pil_img, (0, (width - height) // 2)) + return result + else: + result = Image.new(pil_img.mode, (height, height), background_color) + result.paste(pil_img, ((height - width) // 2, 0)) + return result + image = expand2square(image) + elif image_process_mode in ["Default", "Crop"]: + pass + elif image_process_mode == "Resize": + image = image.resize((336, 336)) + else: + raise ValueError(f"Invalid image_process_mode: {image_process_mode}") + if max(image.size) > max_len: + max_hw, min_hw = max(image.size), min(image.size) + aspect_ratio = max_hw / min_hw + shortest_edge = int(min(max_len / aspect_ratio, min_len, min_hw)) + longest_edge = int(shortest_edge * aspect_ratio) + W, H = image.size + if H > W: + H, W = longest_edge, shortest_edge + else: + H, W = shortest_edge, longest_edge + image = image.resize((W, H)) + if return_pil: + return image + else: + buffered = BytesIO() + image.save(buffered, format=image_format) + img_b64_str = base64.b64encode(buffered.getvalue()).decode() + return img_b64_str + + def get_images(self, return_pil=False): + images = [] + for i, (role, msg) in enumerate(self.messages[self.offset:]): + if i % 2 == 0: + if type(msg) is tuple: + msg, image, image_process_mode = msg + image = self.process_image(image, image_process_mode, return_pil=return_pil) + images.append(image) + return images + + def to_gradio_chatbot(self): + ret = [] + for i, (role, msg) in enumerate(self.messages[self.offset:]): + if i % 2 == 0: + if type(msg) is tuple: + msg, image, image_process_mode = msg + img_b64_str = self.process_image( + image, "Default", return_pil=False, + image_format='JPEG') + img_str = f'user upload image' + msg = img_str + msg.replace('', '').strip() + ret.append([msg, None]) + else: + ret.append([msg, None]) + else: + ret[-1][-1] = msg + return ret + + def copy(self): + return Conversation( + system=self.system, + roles=self.roles, + messages=[[x, y] for x, y in self.messages], + offset=self.offset, + sep_style=self.sep_style, + sep=self.sep, + sep2=self.sep2, + version=self.version) + + def dict(self): + if len(self.get_images()) > 0: + return { + "system": self.system, + "roles": self.roles, + "messages": [[x, y[0] if type(y) is tuple else y] for x, y in self.messages], + "offset": self.offset, + "sep": self.sep, + "sep2": self.sep2, + } + return { + "system": self.system, + "roles": self.roles, + "messages": self.messages, + "offset": self.offset, + "sep": self.sep, + "sep2": self.sep2, + } + + +conv_vicuna_v0 = Conversation( + system="A chat between a curious human and an artificial intelligence assistant. " + "The assistant gives helpful, detailed, and polite answers to the human's questions.", + roles=("Human", "Assistant"), + messages=( + ("Human", "What are the key differences between renewable and non-renewable energy sources?"), + ("Assistant", + "Renewable energy sources are those that can be replenished naturally in a relatively " + "short amount of time, such as solar, wind, hydro, geothermal, and biomass. " + "Non-renewable energy sources, on the other hand, are finite and will eventually be " + "depleted, such as coal, oil, and natural gas. Here are some key differences between " + "renewable and non-renewable energy sources:\n" + "1. 
Availability: Renewable energy sources are virtually inexhaustible, while non-renewable " + "energy sources are finite and will eventually run out.\n" + "2. Environmental impact: Renewable energy sources have a much lower environmental impact " + "than non-renewable sources, which can lead to air and water pollution, greenhouse gas emissions, " + "and other negative effects.\n" + "3. Cost: Renewable energy sources can be more expensive to initially set up, but they typically " + "have lower operational costs than non-renewable sources.\n" + "4. Reliability: Renewable energy sources are often more reliable and can be used in more remote " + "locations than non-renewable sources.\n" + "5. Flexibility: Renewable energy sources are often more flexible and can be adapted to different " + "situations and needs, while non-renewable sources are more rigid and inflexible.\n" + "6. Sustainability: Renewable energy sources are more sustainable over the long term, while " + "non-renewable sources are not, and their depletion can lead to economic and social instability.\n") + ), + offset=2, + sep_style=SeparatorStyle.SINGLE, + sep="###", +) + +conv_vicuna_v1 = Conversation( + system="A chat between a curious user and an artificial intelligence assistant. " + "The assistant gives helpful, detailed, and polite answers to the user's questions.", + roles=("USER", "ASSISTANT"), + version="v1", + messages=(), + offset=0, + sep_style=SeparatorStyle.TWO, + sep=" ", + sep2="", +) + +conv_llama_2 = Conversation( + system="""You are a helpful, respectful and honest assistant. Always answer as helpfully as possible, while being safe. Your answers should not include any harmful, unethical, racist, sexist, toxic, dangerous, or illegal content. Please ensure that your responses are socially unbiased and positive in nature. + +If a question does not make any sense, or is not factually coherent, explain why instead of answering something not correct. If you don't know the answer to a question, please don't share false information.""", + roles=("USER", "ASSISTANT"), + version="llama_v2", + messages=(), + offset=0, + sep_style=SeparatorStyle.LLAMA_2, + sep="", + sep2="", +) + +conv_llava_llama_2 = Conversation( + system="You are a helpful language and vision assistant. " + "You are able to understand the visual content that the user provides, " + "and assist the user with a variety of tasks using natural language.", + roles=("USER", "ASSISTANT"), + version="llama_v2", + messages=(), + offset=0, + sep_style=SeparatorStyle.LLAMA_2, + sep="", + sep2="", +) + +conv_mpt = Conversation( + system="""<|im_start|>system +A conversation between a user and an LLM-based AI assistant. The assistant gives helpful and honest answers.""", + roles=("<|im_start|>user\n", "<|im_start|>assistant\n"), + version="mpt", + messages=(), + offset=0, + sep_style=SeparatorStyle.MPT, + sep="<|im_end|>", +) + +conv_llava_plain = Conversation( + system="", + roles=("", ""), + messages=( + ), + offset=0, + sep_style=SeparatorStyle.PLAIN, + sep="\n", +) + +conv_llava_v0 = Conversation( + system="A chat between a curious human and an artificial intelligence assistant. " + "The assistant gives helpful, detailed, and polite answers to the human's questions.", + roles=("Human", "Assistant"), + messages=( + ), + offset=0, + sep_style=SeparatorStyle.SINGLE, + sep="###", +) + +conv_llava_v0_mmtag = Conversation( + system="A chat between a curious user and an artificial intelligence assistant. 
" + "The assistant is able to understand the visual content that the user provides, and assist the user with a variety of tasks using natural language." + "The visual content will be provided with the following format: visual content.", + roles=("Human", "Assistant"), + messages=( + ), + offset=0, + sep_style=SeparatorStyle.SINGLE, + sep="###", + version="v0_mmtag", +) + +conv_llava_v1 = Conversation( + system="A chat between a curious human and an artificial intelligence assistant. " + "The assistant gives helpful, detailed, and polite answers to the human's questions.", + roles=("USER", "ASSISTANT"), + version="v1", + messages=(), + offset=0, + sep_style=SeparatorStyle.TWO, + sep=" ", + sep2="", +) + +conv_llava_v1_mmtag = Conversation( + system="A chat between a curious user and an artificial intelligence assistant. " + "The assistant is able to understand the visual content that the user provides, and assist the user with a variety of tasks using natural language." + "The visual content will be provided with the following format: visual content.", + roles=("USER", "ASSISTANT"), + messages=(), + offset=0, + sep_style=SeparatorStyle.TWO, + sep=" ", + sep2="", + version="v1_mmtag", +) + +conv_mistral_instruct = Conversation( + system="", + roles=("USER", "ASSISTANT"), + version="llama_v2", + messages=(), + offset=0, + sep_style=SeparatorStyle.LLAMA_2, + sep="", + sep2="", +) + +conv_chatml_direct = Conversation( + system="""<|im_start|>system +Answer the questions.""", + roles=("<|im_start|>user\n", "<|im_start|>assistant\n"), + version="mpt", + messages=(), + offset=0, + sep_style=SeparatorStyle.MPT, + sep="<|im_end|>", +) + +default_conversation = conv_vicuna_v1 +conv_templates = { + "default": conv_vicuna_v0, + "v0": conv_vicuna_v0, + "v1": conv_vicuna_v1, + "vicuna_v1": conv_vicuna_v1, + "llama_2": conv_llama_2, + "mistral_instruct": conv_mistral_instruct, + "chatml_direct": conv_chatml_direct, + "mistral_direct": conv_chatml_direct, + + "plain": conv_llava_plain, + "v0_plain": conv_llava_plain, + "llava_v0": conv_llava_v0, + "v0_mmtag": conv_llava_v0_mmtag, + "llava_v1": conv_llava_v1, + "v1_mmtag": conv_llava_v1_mmtag, + "llava_llama_2": conv_llava_llama_2, + + "mpt": conv_mpt, +} + + +if __name__ == "__main__": + print(default_conversation.get_prompt()) diff --git a/openeqa/baselines/llava/mm_utils.py b/openeqa/baselines/llava/mm_utils.py new file mode 100644 index 0000000..de97345 --- /dev/null +++ b/openeqa/baselines/llava/mm_utils.py @@ -0,0 +1,247 @@ +from PIL import Image +from io import BytesIO +import base64 +import torch +import math +import ast + +from transformers import StoppingCriteria +from llava.constants import IMAGE_TOKEN_INDEX + + +def select_best_resolution(original_size, possible_resolutions): + """ + Selects the best resolution from a list of possible resolutions based on the original size. + + Args: + original_size (tuple): The original size of the image in the format (width, height). + possible_resolutions (list): A list of possible resolutions in the format [(width1, height1), (width2, height2), ...]. + + Returns: + tuple: The best fit resolution in the format (width, height). 
+ """ + original_width, original_height = original_size + best_fit = None + max_effective_resolution = 0 + min_wasted_resolution = float('inf') + + for width, height in possible_resolutions: + scale = min(width / original_width, height / original_height) + downscaled_width, downscaled_height = int(original_width * scale), int(original_height * scale) + effective_resolution = min(downscaled_width * downscaled_height, original_width * original_height) + wasted_resolution = (width * height) - effective_resolution + + if effective_resolution > max_effective_resolution or (effective_resolution == max_effective_resolution and wasted_resolution < min_wasted_resolution): + max_effective_resolution = effective_resolution + min_wasted_resolution = wasted_resolution + best_fit = (width, height) + + return best_fit + + +def resize_and_pad_image(image, target_resolution): + """ + Resize and pad an image to a target resolution while maintaining aspect ratio. + + Args: + image (PIL.Image.Image): The input image. + target_resolution (tuple): The target resolution (width, height) of the image. + + Returns: + PIL.Image.Image: The resized and padded image. + """ + original_width, original_height = image.size + target_width, target_height = target_resolution + + scale_w = target_width / original_width + scale_h = target_height / original_height + + if scale_w < scale_h: + new_width = target_width + new_height = min(math.ceil(original_height * scale_w), target_height) + else: + new_height = target_height + new_width = min(math.ceil(original_width * scale_h), target_width) + + # Resize the image + resized_image = image.resize((new_width, new_height)) + + new_image = Image.new('RGB', (target_width, target_height), (0, 0, 0)) + paste_x = (target_width - new_width) // 2 + paste_y = (target_height - new_height) // 2 + new_image.paste(resized_image, (paste_x, paste_y)) + + return new_image + + +def divide_to_patches(image, patch_size): + """ + Divides an image into patches of a specified size. + + Args: + image (PIL.Image.Image): The input image. + patch_size (int): The size of each patch. + + Returns: + list: A list of PIL.Image.Image objects representing the patches. + """ + patches = [] + width, height = image.size + for i in range(0, height, patch_size): + for j in range(0, width, patch_size): + box = (j, i, j + patch_size, i + patch_size) + patch = image.crop(box) + patches.append(patch) + + return patches + + +def get_anyres_image_grid_shape(image_size, grid_pinpoints, patch_size): + """ + Calculate the shape of the image patch grid after the preprocessing for images of any resolution. + + Args: + image_size (tuple): The size of the input image in the format (width, height). + grid_pinpoints (str): A string representation of a list of possible resolutions. + patch_size (int): The size of each image patch. + + Returns: + tuple: The shape of the image patch grid in the format (width, height). + """ + if type(grid_pinpoints) is list: + possible_resolutions = grid_pinpoints + else: + possible_resolutions = ast.literal_eval(grid_pinpoints) + width, height = select_best_resolution(image_size, possible_resolutions) + return width // patch_size, height // patch_size + + +def process_anyres_image(image, processor, grid_pinpoints): + """ + Process an image with variable resolutions. + + Args: + image (PIL.Image.Image): The input image to be processed. + processor: The image processor object. + grid_pinpoints (str): A string representation of a list of possible resolutions. 
+ + Returns: + torch.Tensor: A tensor containing the processed image patches. + """ + if type(grid_pinpoints) is list: + possible_resolutions = grid_pinpoints + else: + possible_resolutions = ast.literal_eval(grid_pinpoints) + best_resolution = select_best_resolution(image.size, possible_resolutions) + image_padded = resize_and_pad_image(image, best_resolution) + + patches = divide_to_patches(image_padded, processor.crop_size['height']) + + image_original_resize = image.resize((processor.size['shortest_edge'], processor.size['shortest_edge'])) + + image_patches = [image_original_resize] + patches + image_patches = [processor.preprocess(image_patch, return_tensors='pt')['pixel_values'][0] + for image_patch in image_patches] + return torch.stack(image_patches, dim=0) + + +def load_image_from_base64(image): + return Image.open(BytesIO(base64.b64decode(image))) + + +def expand2square(pil_img, background_color): + width, height = pil_img.size + if width == height: + return pil_img + elif width > height: + result = Image.new(pil_img.mode, (width, width), background_color) + result.paste(pil_img, (0, (width - height) // 2)) + return result + else: + result = Image.new(pil_img.mode, (height, height), background_color) + result.paste(pil_img, ((height - width) // 2, 0)) + return result + + +def process_images(images, image_processor, model_cfg): + image_aspect_ratio = getattr(model_cfg, "image_aspect_ratio", None) + new_images = [] + if image_aspect_ratio == 'pad': + for image in images: + image = expand2square(image, tuple(int(x*255) for x in image_processor.image_mean)) + image = image_processor.preprocess(image, return_tensors='pt')['pixel_values'][0] + new_images.append(image) + elif image_aspect_ratio == "anyres": + for image in images: + image = process_anyres_image(image, image_processor, model_cfg.image_grid_pinpoints) + new_images.append(image) + else: + return image_processor(images, return_tensors='pt')['pixel_values'] + if all(x.shape == new_images[0].shape for x in new_images): + new_images = torch.stack(new_images, dim=0) + return new_images + + +def tokenizer_image_token(prompt, tokenizer, image_token_index=IMAGE_TOKEN_INDEX, return_tensors=None): + prompt_chunks = [tokenizer(chunk).input_ids for chunk in prompt.split('')] + + def insert_separator(X, sep): + return [ele for sublist in zip(X, [sep]*len(X)) for ele in sublist][:-1] + + input_ids = [] + offset = 0 + if len(prompt_chunks) > 0 and len(prompt_chunks[0]) > 0 and prompt_chunks[0][0] == tokenizer.bos_token_id: + offset = 1 + input_ids.append(prompt_chunks[0][0]) + + for x in insert_separator(prompt_chunks, [image_token_index] * (offset + 1)): + input_ids.extend(x[offset:]) + + if return_tensors is not None: + if return_tensors == 'pt': + return torch.tensor(input_ids, dtype=torch.long) + raise ValueError(f'Unsupported tensor type: {return_tensors}') + return input_ids + + +def get_model_name_from_path(model_path): + model_path = model_path.strip("/") + model_paths = model_path.split("/") + if model_paths[-1].startswith('checkpoint-'): + return model_paths[-2] + "_" + model_paths[-1] + else: + return model_paths[-1] + +class KeywordsStoppingCriteria(StoppingCriteria): + def __init__(self, keywords, tokenizer, input_ids): + self.keywords = keywords + self.keyword_ids = [] + self.max_keyword_len = 0 + for keyword in keywords: + cur_keyword_ids = tokenizer(keyword).input_ids + if len(cur_keyword_ids) > 1 and cur_keyword_ids[0] == tokenizer.bos_token_id: + cur_keyword_ids = cur_keyword_ids[1:] + if len(cur_keyword_ids) > 
self.max_keyword_len: + self.max_keyword_len = len(cur_keyword_ids) + self.keyword_ids.append(torch.tensor(cur_keyword_ids)) + self.tokenizer = tokenizer + self.start_len = input_ids.shape[1] + + def call_for_batch(self, output_ids: torch.LongTensor, scores: torch.FloatTensor, **kwargs) -> bool: + offset = min(output_ids.shape[1] - self.start_len, self.max_keyword_len) + self.keyword_ids = [keyword_id.to(output_ids.device) for keyword_id in self.keyword_ids] + for keyword_id in self.keyword_ids: + truncated_output_ids = output_ids[0, -keyword_id.shape[0]:] + if torch.equal(truncated_output_ids, keyword_id): + return True + outputs = self.tokenizer.batch_decode(output_ids[:, -offset:], skip_special_tokens=True)[0] + for keyword in self.keywords: + if keyword in outputs: + return True + return False + + def __call__(self, output_ids: torch.LongTensor, scores: torch.FloatTensor, **kwargs) -> bool: + outputs = [] + for i in range(output_ids.shape[0]): + outputs.append(self.call_for_batch(output_ids[i].unsqueeze(0), scores)) + return all(outputs) diff --git a/openeqa/baselines/llava/model/__init__.py b/openeqa/baselines/llava/model/__init__.py new file mode 100644 index 0000000..dbd9178 --- /dev/null +++ b/openeqa/baselines/llava/model/__init__.py @@ -0,0 +1,6 @@ +try: + from .language_model.llava_llama import LlavaLlamaForCausalLM, LlavaConfig + from .language_model.llava_mpt import LlavaMptForCausalLM, LlavaMptConfig + from .language_model.llava_mistral import LlavaMistralForCausalLM, LlavaMistralConfig +except: + pass diff --git a/openeqa/baselines/llava/model/apply_delta.py b/openeqa/baselines/llava/model/apply_delta.py new file mode 100644 index 0000000..666dd96 --- /dev/null +++ b/openeqa/baselines/llava/model/apply_delta.py @@ -0,0 +1,48 @@ +""" +Usage: +python3 -m fastchat.model.apply_delta --base ~/model_weights/llama-7b --target ~/model_weights/vicuna-7b --delta lmsys/vicuna-7b-delta +""" +import argparse + +import torch +from tqdm import tqdm +from transformers import AutoTokenizer, AutoModelForCausalLM +from llava import LlavaLlamaForCausalLM + + +def apply_delta(base_model_path, target_model_path, delta_path): + print("Loading base model") + base = AutoModelForCausalLM.from_pretrained( + base_model_path, torch_dtype=torch.float16, low_cpu_mem_usage=True) + + print("Loading delta") + delta = LlavaLlamaForCausalLM.from_pretrained(delta_path, torch_dtype=torch.float16, low_cpu_mem_usage=True) + delta_tokenizer = AutoTokenizer.from_pretrained(delta_path) + + print("Applying delta") + for name, param in tqdm(delta.state_dict().items(), desc="Applying delta"): + if name not in base.state_dict(): + assert name in ['model.mm_projector.weight', 'model.mm_projector.bias'], f'{name} not in base model' + continue + if param.data.shape == base.state_dict()[name].shape: + param.data += base.state_dict()[name] + else: + assert name in ['model.embed_tokens.weight', 'lm_head.weight'], \ + f'{name} dimension mismatch: {param.data.shape} vs {base.state_dict()[name].shape}' + bparam = base.state_dict()[name] + param.data[:bparam.shape[0], :bparam.shape[1]] += bparam + + print("Saving target model") + delta.save_pretrained(target_model_path) + delta_tokenizer.save_pretrained(target_model_path) + + +if __name__ == "__main__": + parser = argparse.ArgumentParser() + parser.add_argument("--base-model-path", type=str, required=True) + parser.add_argument("--target-model-path", type=str, required=True) + parser.add_argument("--delta-path", type=str, required=True) + + args = parser.parse_args() + + 
apply_delta(args.base_model_path, args.target_model_path, args.delta_path) diff --git a/openeqa/baselines/llava/model/builder.py b/openeqa/baselines/llava/model/builder.py new file mode 100644 index 0000000..e3d5082 --- /dev/null +++ b/openeqa/baselines/llava/model/builder.py @@ -0,0 +1,167 @@ +# Copyright 2023 Haotian Liu +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + + +import os +import warnings +import shutil + +from transformers import AutoTokenizer, AutoModelForCausalLM, AutoConfig, BitsAndBytesConfig +import torch +from llava.model import * +from llava.constants import DEFAULT_IMAGE_PATCH_TOKEN, DEFAULT_IM_START_TOKEN, DEFAULT_IM_END_TOKEN + + +def load_pretrained_model(model_path, model_base, model_name, load_8bit=False, load_4bit=False, device_map="auto", device="cuda", use_flash_attn=False, **kwargs): + kwargs = {"device_map": device_map, **kwargs} + + if device != "cuda": + kwargs['device_map'] = {"": device} + + if load_8bit: + kwargs['load_in_8bit'] = True + elif load_4bit: + kwargs['load_in_4bit'] = True + kwargs['quantization_config'] = BitsAndBytesConfig( + load_in_4bit=True, + bnb_4bit_compute_dtype=torch.float16, + bnb_4bit_use_double_quant=True, + bnb_4bit_quant_type='nf4' + ) + else: + kwargs['torch_dtype'] = torch.float16 + + if use_flash_attn: + kwargs['attn_implementation'] = 'flash_attention_2' + + if 'llava' in model_name.lower(): + # Load LLaVA model + if 'lora' in model_name.lower() and model_base is None: + warnings.warn('There is `lora` in model name but no `model_base` is provided. If you are loading a LoRA model, please provide the `model_base` argument. 
Detailed instruction: https://github.com/haotian-liu/LLaVA#launch-a-model-worker-lora-weights-unmerged.') + if 'lora' in model_name.lower() and model_base is not None: + from llava.model.language_model.llava_llama import LlavaConfig + lora_cfg_pretrained = LlavaConfig.from_pretrained(model_path) + tokenizer = AutoTokenizer.from_pretrained(model_base, use_fast=False) + print('Loading LLaVA from base model...') + model = LlavaLlamaForCausalLM.from_pretrained(model_base, low_cpu_mem_usage=True, config=lora_cfg_pretrained, **kwargs) + token_num, tokem_dim = model.lm_head.out_features, model.lm_head.in_features + if model.lm_head.weight.shape[0] != token_num: + model.lm_head.weight = torch.nn.Parameter(torch.empty(token_num, tokem_dim, device=model.device, dtype=model.dtype)) + model.model.embed_tokens.weight = torch.nn.Parameter(torch.empty(token_num, tokem_dim, device=model.device, dtype=model.dtype)) + + print('Loading additional LLaVA weights...') + if os.path.exists(os.path.join(model_path, 'non_lora_trainables.bin')): + non_lora_trainables = torch.load(os.path.join(model_path, 'non_lora_trainables.bin'), map_location='cpu') + else: + # this is probably from HF Hub + from huggingface_hub import hf_hub_download + def load_from_hf(repo_id, filename, subfolder=None): + cache_file = hf_hub_download( + repo_id=repo_id, + filename=filename, + subfolder=subfolder) + return torch.load(cache_file, map_location='cpu') + non_lora_trainables = load_from_hf(model_path, 'non_lora_trainables.bin') + non_lora_trainables = {(k[11:] if k.startswith('base_model.') else k): v for k, v in non_lora_trainables.items()} + if any(k.startswith('model.model.') for k in non_lora_trainables): + non_lora_trainables = {(k[6:] if k.startswith('model.') else k): v for k, v in non_lora_trainables.items()} + model.load_state_dict(non_lora_trainables, strict=False) + + from peft import PeftModel + print('Loading LoRA weights...') + model = PeftModel.from_pretrained(model, model_path) + print('Merging LoRA weights...') + model = model.merge_and_unload() + print('Model is loaded...') + elif model_base is not None: + # this may be mm projector only + print('Loading LLaVA from base model...') + if 'mpt' in model_name.lower(): + if not os.path.isfile(os.path.join(model_path, 'configuration_mpt.py')): + shutil.copyfile(os.path.join(model_base, 'configuration_mpt.py'), os.path.join(model_path, 'configuration_mpt.py')) + tokenizer = AutoTokenizer.from_pretrained(model_base, use_fast=True) + cfg_pretrained = AutoConfig.from_pretrained(model_path, trust_remote_code=True) + model = LlavaMptForCausalLM.from_pretrained(model_base, low_cpu_mem_usage=True, config=cfg_pretrained, **kwargs) + else: + tokenizer = AutoTokenizer.from_pretrained(model_base, use_fast=False) + cfg_pretrained = AutoConfig.from_pretrained(model_path) + model = LlavaLlamaForCausalLM.from_pretrained(model_base, low_cpu_mem_usage=True, config=cfg_pretrained, **kwargs) + + mm_projector_weights = torch.load(os.path.join(model_path, 'mm_projector.bin'), map_location='cpu') + mm_projector_weights = {k: v.to(torch.float16) for k, v in mm_projector_weights.items()} + model.load_state_dict(mm_projector_weights, strict=False) + else: + if 'mpt' in model_name.lower(): + tokenizer = AutoTokenizer.from_pretrained(model_path, use_fast=True) + model = LlavaMptForCausalLM.from_pretrained(model_path, low_cpu_mem_usage=True, **kwargs) + elif 'mistral' in model_name.lower(): + tokenizer = AutoTokenizer.from_pretrained(model_path) + model = LlavaMistralForCausalLM.from_pretrained( + 
model_path, + low_cpu_mem_usage=True, + **kwargs + ) + else: + tokenizer = AutoTokenizer.from_pretrained(model_path, use_fast=False) + model = LlavaLlamaForCausalLM.from_pretrained( + model_path, + low_cpu_mem_usage=True, + **kwargs + ) + else: + # Load language model + if model_base is not None: + # PEFT model + from peft import PeftModel + tokenizer = AutoTokenizer.from_pretrained(model_base, use_fast=False) + model = AutoModelForCausalLM.from_pretrained(model_base, low_cpu_mem_usage=True, **kwargs) + print(f"Loading LoRA weights from {model_path}") + model = PeftModel.from_pretrained(model, model_path) + print(f"Merging weights") + model = model.merge_and_unload() + print('Convert to FP16...') + model.to(torch.float16) + else: + use_fast = False + if 'mpt' in model_name.lower(): + tokenizer = AutoTokenizer.from_pretrained(model_path, use_fast=True) + model = AutoModelForCausalLM.from_pretrained(model_path, low_cpu_mem_usage=True, trust_remote_code=True, **kwargs) + else: + tokenizer = AutoTokenizer.from_pretrained(model_path, use_fast=False) + model = AutoModelForCausalLM.from_pretrained(model_path, low_cpu_mem_usage=True, **kwargs) + + image_processor = None + + if 'llava' in model_name.lower(): + mm_use_im_start_end = getattr(model.config, "mm_use_im_start_end", False) + mm_use_im_patch_token = getattr(model.config, "mm_use_im_patch_token", True) + if mm_use_im_patch_token: + tokenizer.add_tokens([DEFAULT_IMAGE_PATCH_TOKEN], special_tokens=True) + if mm_use_im_start_end: + tokenizer.add_tokens([DEFAULT_IM_START_TOKEN, DEFAULT_IM_END_TOKEN], special_tokens=True) + model.resize_token_embeddings(len(tokenizer)) + + vision_tower = model.get_vision_tower() + if not vision_tower.is_loaded: + vision_tower.load_model(device_map=device_map) + if device_map != 'auto': + vision_tower.to(device=device_map, dtype=torch.float16) + image_processor = vision_tower.image_processor + + if hasattr(model.config, "max_sequence_length"): + context_len = model.config.max_sequence_length + else: + context_len = 2048 + + return tokenizer, model, image_processor, context_len diff --git a/openeqa/baselines/llava/model/consolidate.py b/openeqa/baselines/llava/model/consolidate.py new file mode 100644 index 0000000..1e32421 --- /dev/null +++ b/openeqa/baselines/llava/model/consolidate.py @@ -0,0 +1,29 @@ +""" +Usage: +python3 -m llava.model.consolidate --src ~/model_weights/llava-7b --dst ~/model_weights/llava-7b_consolidate +""" +import argparse + +import torch +from transformers import AutoTokenizer, AutoModelForCausalLM +from llava.model import * +from llava.model.utils import auto_upgrade + + +def consolidate_ckpt(src_path, dst_path): + print("Loading model") + auto_upgrade(src_path) + src_model = AutoModelForCausalLM.from_pretrained(src_path, torch_dtype=torch.float16, low_cpu_mem_usage=True) + src_tokenizer = AutoTokenizer.from_pretrained(src_path, use_fast=False) + src_model.save_pretrained(dst_path) + src_tokenizer.save_pretrained(dst_path) + + +if __name__ == "__main__": + parser = argparse.ArgumentParser() + parser.add_argument("--src", type=str, required=True) + parser.add_argument("--dst", type=str, required=True) + + args = parser.parse_args() + + consolidate_ckpt(args.src, args.dst) diff --git a/openeqa/baselines/llava/model/language_model/llava_llama.py b/openeqa/baselines/llava/model/language_model/llava_llama.py new file mode 100644 index 0000000..157dda4 --- /dev/null +++ b/openeqa/baselines/llava/model/language_model/llava_llama.py @@ -0,0 +1,159 @@ +# Copyright 2023 Haotian Liu +# +# Licensed 
under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + + +from typing import List, Optional, Tuple, Union + +import torch +import torch.nn as nn + +from transformers import AutoConfig, AutoModelForCausalLM, \ + LlamaConfig, LlamaModel, LlamaForCausalLM + +from transformers.modeling_outputs import CausalLMOutputWithPast +from transformers.generation.utils import GenerateOutput + +from ..llava_arch import LlavaMetaModel, LlavaMetaForCausalLM + + +class LlavaConfig(LlamaConfig): + model_type = "llava_llama" + + +class LlavaLlamaModel(LlavaMetaModel, LlamaModel): + config_class = LlavaConfig + + def __init__(self, config: LlamaConfig): + super(LlavaLlamaModel, self).__init__(config) + + +class LlavaLlamaForCausalLM(LlamaForCausalLM, LlavaMetaForCausalLM): + config_class = LlavaConfig + + def __init__(self, config): + super(LlamaForCausalLM, self).__init__(config) + self.model = LlavaLlamaModel(config) + self.pretraining_tp = config.pretraining_tp + self.vocab_size = config.vocab_size + self.lm_head = nn.Linear(config.hidden_size, config.vocab_size, bias=False) + + # Initialize weights and apply final processing + self.post_init() + + def get_model(self): + return self.model + + def forward( + self, + input_ids: torch.LongTensor = None, + attention_mask: Optional[torch.Tensor] = None, + position_ids: Optional[torch.LongTensor] = None, + past_key_values: Optional[List[torch.FloatTensor]] = None, + inputs_embeds: Optional[torch.FloatTensor] = None, + labels: Optional[torch.LongTensor] = None, + use_cache: Optional[bool] = None, + output_attentions: Optional[bool] = None, + output_hidden_states: Optional[bool] = None, + images: Optional[torch.FloatTensor] = None, + image_sizes: Optional[List[List[int]]] = None, + return_dict: Optional[bool] = None, + ) -> Union[Tuple, CausalLMOutputWithPast]: + + if inputs_embeds is None: + ( + input_ids, + position_ids, + attention_mask, + past_key_values, + inputs_embeds, + labels + ) = self.prepare_inputs_labels_for_multimodal( + input_ids, + position_ids, + attention_mask, + past_key_values, + labels, + images, + image_sizes + ) + + return super().forward( + input_ids=input_ids, + attention_mask=attention_mask, + position_ids=position_ids, + past_key_values=past_key_values, + inputs_embeds=inputs_embeds, + labels=labels, + use_cache=use_cache, + output_attentions=output_attentions, + output_hidden_states=output_hidden_states, + return_dict=return_dict + ) + + @torch.no_grad() + def generate( + self, + inputs: Optional[torch.Tensor] = None, + images: Optional[torch.Tensor] = None, + image_sizes: Optional[torch.Tensor] = None, + **kwargs, + ) -> Union[GenerateOutput, torch.LongTensor]: + position_ids = kwargs.pop("position_ids", None) + attention_mask = kwargs.pop("attention_mask", None) + if "inputs_embeds" in kwargs: + raise NotImplementedError("`inputs_embeds` is not supported") + + if images is not None: + ( + inputs, + position_ids, + attention_mask, + _, + inputs_embeds, + _ + ) = self.prepare_inputs_labels_for_multimodal( + inputs, + position_ids, + attention_mask, + None, + 
None, + images, + image_sizes=image_sizes + ) + else: + inputs_embeds = self.get_model().embed_tokens(inputs) + + return super().generate( + position_ids=position_ids, + attention_mask=attention_mask, + inputs_embeds=inputs_embeds, + **kwargs + ) + + def prepare_inputs_for_generation(self, input_ids, past_key_values=None, + inputs_embeds=None, **kwargs): + images = kwargs.pop("images", None) + image_sizes = kwargs.pop("image_sizes", None) + inputs = super().prepare_inputs_for_generation( + input_ids, past_key_values=past_key_values, inputs_embeds=inputs_embeds, **kwargs + ) + inputs.pop("cache_position") + if images is not None: + inputs['images'] = images + if image_sizes is not None: + inputs['image_sizes'] = image_sizes + return inputs + +AutoConfig.register("llava_llama", LlavaConfig) +AutoModelForCausalLM.register(LlavaConfig, LlavaLlamaForCausalLM) diff --git a/openeqa/baselines/llava/model/language_model/llava_mistral.py b/openeqa/baselines/llava/model/language_model/llava_mistral.py new file mode 100644 index 0000000..0def682 --- /dev/null +++ b/openeqa/baselines/llava/model/language_model/llava_mistral.py @@ -0,0 +1,158 @@ +# Copyright 2023 Haotian Liu +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + + +from typing import List, Optional, Tuple, Union + +import torch +import torch.nn as nn +from torch.nn import CrossEntropyLoss + +from transformers import AutoConfig, AutoModelForCausalLM, \ + MistralConfig, MistralModel, MistralForCausalLM + +from transformers.modeling_outputs import CausalLMOutputWithPast +from transformers.generation.utils import GenerateOutput + +from ..llava_arch import LlavaMetaModel, LlavaMetaForCausalLM + + +class LlavaMistralConfig(MistralConfig): + model_type = "llava_mistral" + + +class LlavaMistralModel(LlavaMetaModel, MistralModel): + config_class = LlavaMistralConfig + + def __init__(self, config: MistralConfig): + super(LlavaMistralModel, self).__init__(config) + + +class LlavaMistralForCausalLM(MistralForCausalLM, LlavaMetaForCausalLM): + config_class = LlavaMistralConfig + + def __init__(self, config): + super(MistralForCausalLM, self).__init__(config) + self.model = LlavaMistralModel(config) + + self.lm_head = nn.Linear(config.hidden_size, config.vocab_size, bias=False) + + # Initialize weights and apply final processing + self.post_init() + + def get_model(self): + return self.model + + def forward( + self, + input_ids: torch.LongTensor = None, + attention_mask: Optional[torch.Tensor] = None, + position_ids: Optional[torch.LongTensor] = None, + past_key_values: Optional[List[torch.FloatTensor]] = None, + inputs_embeds: Optional[torch.FloatTensor] = None, + labels: Optional[torch.LongTensor] = None, + use_cache: Optional[bool] = None, + output_attentions: Optional[bool] = None, + output_hidden_states: Optional[bool] = None, + images: Optional[torch.FloatTensor] = None, + image_sizes: Optional[List[List[int]]] = None, + return_dict: Optional[bool] = None, + ) -> Union[Tuple, CausalLMOutputWithPast]: + + if inputs_embeds is None: + ( + input_ids, + 
position_ids, + attention_mask, + past_key_values, + inputs_embeds, + labels + ) = self.prepare_inputs_labels_for_multimodal( + input_ids, + position_ids, + attention_mask, + past_key_values, + labels, + images, + image_sizes + ) + + return super().forward( + input_ids=input_ids, + attention_mask=attention_mask, + position_ids=position_ids, + past_key_values=past_key_values, + inputs_embeds=inputs_embeds, + labels=labels, + use_cache=use_cache, + output_attentions=output_attentions, + output_hidden_states=output_hidden_states, + return_dict=return_dict + ) + + @torch.no_grad() + def generate( + self, + inputs: Optional[torch.Tensor] = None, + images: Optional[torch.Tensor] = None, + image_sizes: Optional[torch.Tensor] = None, + **kwargs, + ) -> Union[GenerateOutput, torch.LongTensor]: + position_ids = kwargs.pop("position_ids", None) + attention_mask = kwargs.pop("attention_mask", None) + if "inputs_embeds" in kwargs: + raise NotImplementedError("`inputs_embeds` is not supported") + + if images is not None: + ( + inputs, + position_ids, + attention_mask, + _, + inputs_embeds, + _ + ) = self.prepare_inputs_labels_for_multimodal( + inputs, + position_ids, + attention_mask, + None, + None, + images, + image_sizes=image_sizes + ) + else: + inputs_embeds = self.get_model().embed_tokens(inputs) + + return super().generate( + position_ids=position_ids, + attention_mask=attention_mask, + inputs_embeds=inputs_embeds, + **kwargs + ) + + def prepare_inputs_for_generation(self, input_ids, past_key_values=None, + inputs_embeds=None, **kwargs): + images = kwargs.pop("images", None) + image_sizes = kwargs.pop("image_sizes", None) + inputs = super().prepare_inputs_for_generation( + input_ids, past_key_values=past_key_values, inputs_embeds=inputs_embeds, **kwargs + ) + if images is not None: + inputs['images'] = images + if image_sizes is not None: + inputs['image_sizes'] = image_sizes + return inputs + +AutoConfig.register("llava_mistral", LlavaMistralConfig) +AutoModelForCausalLM.register(LlavaMistralConfig, LlavaMistralForCausalLM) diff --git a/openeqa/baselines/llava/model/language_model/llava_mpt.py b/openeqa/baselines/llava/model/language_model/llava_mpt.py new file mode 100644 index 0000000..02e5237 --- /dev/null +++ b/openeqa/baselines/llava/model/language_model/llava_mpt.py @@ -0,0 +1,97 @@ +# Copyright 2023 Haotian Liu +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
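+# Module overview (descriptive comment added for readability): LLaVA variant on the MPT
+# backbone. LlavaMptModel mixes LlavaMetaModel into MptModel, and LlavaMptForCausalLM routes
+# multimodal inputs through prepare_inputs_labels_for_multimodal before delegating to
+# MptForCausalLM; the "llava_mpt" config/model pair is registered with transformers below.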
+ + +from typing import Optional, Tuple + +import torch + +from transformers import AutoConfig, AutoModelForCausalLM, \ + MptConfig, MptForCausalLM, MptModel +from llava.model.llava_arch import LlavaMetaModel, LlavaMetaForCausalLM + + +class LlavaMptConfig(MptConfig): + model_type = "llava_mpt" + + +class LlavaMptModel(LlavaMetaModel, MptModel): + config_class = LlavaMptConfig + + def __init__(self, config: MptConfig): + config.hidden_size = config.d_model + super(LlavaMptModel, self).__init__(config) + + def embed_tokens(self, x): + return self.wte(x) + + +class LlavaMptForCausalLM(MptForCausalLM, LlavaMetaForCausalLM): + config_class = LlavaMptConfig + supports_gradient_checkpointing = True + + def __init__(self, config): + super(MptForCausalLM, self).__init__(config) + + self.transformer = LlavaMptModel(config) + self.lm_head = torch.nn.Linear(config.hidden_size, config.vocab_size, bias=False) + + # Initialize weights and apply final processing + self.post_init() + + def get_model(self): + return self.transformer + + def _set_gradient_checkpointing(self, module, value=False): + if isinstance(module, LlavaMptModel): + module.gradient_checkpointing = value + + def forward( + self, + input_ids: Optional[torch.LongTensor] = None, + past_key_values: Optional[Tuple[Tuple[torch.Tensor, torch.Tensor], ...]] = None, + attention_mask: Optional[torch.Tensor] = None, + inputs_embeds: Optional[torch.Tensor] = None, + labels: Optional[torch.Tensor] = None, + use_cache: Optional[bool] = None, + output_attentions: Optional[bool] = None, + output_hidden_states: Optional[bool] = None, + return_dict: Optional[bool] = None, + images=None): + + input_ids, attention_mask, past_key_values, inputs_embeds, labels = self.prepare_inputs_labels_for_multimodal(input_ids, attention_mask, past_key_values, labels, images) + + return super().forward( + input_ids, + past_key_values=past_key_values, + attention_mask=attention_mask, + inputs_embeds=inputs_embeds, + labels=labels, + use_cache=use_cache, + output_attentions=output_attentions, + output_hidden_states=output_hidden_states, + return_dict=return_dict, + ) + + def prepare_inputs_for_generation(self, input_ids, past_key_values=None, inputs_embeds=None, **kwargs): + images = kwargs.pop("images", None) + _inputs = super().prepare_inputs_for_generation( + input_ids, past_key_values=past_key_values, inputs_embeds=inputs_embeds, **kwargs + ) + _inputs['images'] = images + return _inputs + + +AutoConfig.register("llava_mpt", LlavaMptConfig) +AutoModelForCausalLM.register(LlavaMptConfig, LlavaMptForCausalLM) diff --git a/openeqa/baselines/llava/model/llava_arch.py b/openeqa/baselines/llava/model/llava_arch.py new file mode 100644 index 0000000..d71650e --- /dev/null +++ b/openeqa/baselines/llava/model/llava_arch.py @@ -0,0 +1,368 @@ +# Copyright 2023 Haotian Liu +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
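+# Module overview (descriptive comment added for readability): core multimodal plumbing shared
+# by the LLaVA language-model variants. LlavaMetaModel builds the vision tower and mm_projector,
+# and LlavaMetaForCausalLM encodes images and splices the projected image features into the
+# token-embedding sequence via prepare_inputs_labels_for_multimodal.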
+ + +from abc import ABC, abstractmethod + +import torch +import torch.nn as nn + +from .multimodal_encoder.builder import build_vision_tower +from .multimodal_projector.builder import build_vision_projector + +from llava.constants import IGNORE_INDEX, IMAGE_TOKEN_INDEX, DEFAULT_IMAGE_PATCH_TOKEN, DEFAULT_IM_START_TOKEN, DEFAULT_IM_END_TOKEN + +from llava.mm_utils import get_anyres_image_grid_shape + + +class LlavaMetaModel: + + def __init__(self, config): + super(LlavaMetaModel, self).__init__(config) + + if hasattr(config, "mm_vision_tower"): + self.vision_tower = build_vision_tower(config, delay_load=True) + self.mm_projector = build_vision_projector(config) + + if 'unpad' in getattr(config, 'mm_patch_merge_type', ''): + self.image_newline = nn.Parameter( + torch.empty(config.hidden_size, dtype=self.dtype) + ) + + def get_vision_tower(self): + vision_tower = getattr(self, 'vision_tower', None) + if type(vision_tower) is list: + vision_tower = vision_tower[0] + return vision_tower + + def initialize_vision_modules(self, model_args, fsdp=None): + vision_tower = model_args.vision_tower + mm_vision_select_layer = model_args.mm_vision_select_layer + mm_vision_select_feature = model_args.mm_vision_select_feature + pretrain_mm_mlp_adapter = model_args.pretrain_mm_mlp_adapter + mm_patch_merge_type = model_args.mm_patch_merge_type + + self.config.mm_vision_tower = vision_tower + + if self.get_vision_tower() is None: + vision_tower = build_vision_tower(model_args) + + if fsdp is not None and len(fsdp) > 0: + self.vision_tower = [vision_tower] + else: + self.vision_tower = vision_tower + else: + if fsdp is not None and len(fsdp) > 0: + vision_tower = self.vision_tower[0] + else: + vision_tower = self.vision_tower + vision_tower.load_model() + + self.config.use_mm_proj = True + self.config.mm_projector_type = getattr(model_args, 'mm_projector_type', 'linear') + self.config.mm_hidden_size = vision_tower.hidden_size + self.config.mm_vision_select_layer = mm_vision_select_layer + self.config.mm_vision_select_feature = mm_vision_select_feature + self.config.mm_patch_merge_type = mm_patch_merge_type + + if getattr(self, 'mm_projector', None) is None: + self.mm_projector = build_vision_projector(self.config) + + if 'unpad' in mm_patch_merge_type: + embed_std = 1 / torch.sqrt(torch.tensor(self.config.hidden_size, dtype=self.dtype)) + self.image_newline = nn.Parameter( + torch.randn(self.config.hidden_size, dtype=self.dtype) * embed_std + ) + else: + # In case it is frozen by LoRA + for p in self.mm_projector.parameters(): + p.requires_grad = True + + if pretrain_mm_mlp_adapter is not None: + mm_projector_weights = torch.load(pretrain_mm_mlp_adapter, map_location='cpu') + def get_w(weights, keyword): + return {k.split(keyword + '.')[1]: v for k, v in weights.items() if keyword in k} + + self.mm_projector.load_state_dict(get_w(mm_projector_weights, 'mm_projector')) + + +def unpad_image(tensor, original_size): + """ + Unpads a PyTorch tensor of a padded and resized image. + + Args: + tensor (torch.Tensor): The image tensor, assumed to be in CxHxW format. + original_size (tuple): The original size of PIL image (width, height). + + Returns: + torch.Tensor: The unpadded image tensor. 
+ """ + original_width, original_height = original_size + current_height, current_width = tensor.shape[1:] + + original_aspect_ratio = original_width / original_height + current_aspect_ratio = current_width / current_height + + if original_aspect_ratio > current_aspect_ratio: + scale_factor = current_width / original_width + new_height = int(original_height * scale_factor) + padding = (current_height - new_height) // 2 + unpadded_tensor = tensor[:, padding:current_height - padding, :] + else: + scale_factor = current_height / original_height + new_width = int(original_width * scale_factor) + padding = (current_width - new_width) // 2 + unpadded_tensor = tensor[:, :, padding:current_width - padding] + + return unpadded_tensor + + +class LlavaMetaForCausalLM(ABC): + + @abstractmethod + def get_model(self): + pass + + def get_vision_tower(self): + return self.get_model().get_vision_tower() + + def encode_images(self, images): + image_features = self.get_model().get_vision_tower()(images) + image_features = self.get_model().mm_projector(image_features) + return image_features + + def prepare_inputs_labels_for_multimodal( + self, input_ids, position_ids, attention_mask, past_key_values, labels, + images, image_sizes=None + ): + vision_tower = self.get_vision_tower() + if vision_tower is None or images is None or input_ids.shape[1] == 1: + return input_ids, position_ids, attention_mask, past_key_values, None, labels + + if type(images) is list or images.ndim == 5: + if type(images) is list: + images = [x.unsqueeze(0) if x.ndim == 3 else x for x in images] + concat_images = torch.cat([image for image in images], dim=0) + image_features = self.encode_images(concat_images) + split_sizes = [image.shape[0] for image in images] + image_features = torch.split(image_features, split_sizes, dim=0) + mm_patch_merge_type = getattr(self.config, 'mm_patch_merge_type', 'flat') + image_aspect_ratio = getattr(self.config, 'image_aspect_ratio', 'square') + if mm_patch_merge_type == 'flat': + image_features = [x.flatten(0, 1) for x in image_features] + elif mm_patch_merge_type.startswith('spatial'): + new_image_features = [] + for image_idx, image_feature in enumerate(image_features): + if image_feature.shape[0] > 1: + base_image_feature = image_feature[0] + image_feature = image_feature[1:] + height = width = self.get_vision_tower().num_patches_per_side + assert height * width == base_image_feature.shape[0] + if image_aspect_ratio == 'anyres': + num_patch_width, num_patch_height = get_anyres_image_grid_shape(image_sizes[image_idx], self.config.image_grid_pinpoints, self.get_vision_tower().config.image_size) + image_feature = image_feature.view(num_patch_height, num_patch_width, height, width, -1) + else: + raise NotImplementedError + if 'unpad' in mm_patch_merge_type: + image_feature = image_feature.permute(4, 0, 2, 1, 3).contiguous() + image_feature = image_feature.flatten(1, 2).flatten(2, 3) + image_feature = unpad_image(image_feature, image_sizes[image_idx]) + image_feature = torch.cat(( + image_feature, + self.model.image_newline[:, None, None].expand(*image_feature.shape[:-1], 1).to(image_feature.device) + ), dim=-1) + image_feature = image_feature.flatten(1, 2).transpose(0, 1) + else: + image_feature = image_feature.permute(0, 2, 1, 3, 4).contiguous() + image_feature = image_feature.flatten(0, 3) + image_feature = torch.cat((base_image_feature, image_feature), dim=0) + else: + image_feature = image_feature[0] + if 'unpad' in mm_patch_merge_type: + image_feature = torch.cat(( + image_feature, + 
self.model.image_newline[None].to(image_feature.device) + ), dim=0) + new_image_features.append(image_feature) + image_features = new_image_features + else: + raise ValueError(f"Unexpected mm_patch_merge_type: {self.config.mm_patch_merge_type}") + else: + image_features = self.encode_images(images) + + # TODO: image start / end is not implemented here to support pretraining. + if getattr(self.config, 'tune_mm_mlp_adapter', False) and getattr(self.config, 'mm_use_im_start_end', False): + raise NotImplementedError + + # Let's just add dummy tensors if they do not exist, + # it is a headache to deal with None all the time. + # But it is not ideal, and if you have a better idea, + # please open an issue / submit a PR, thanks. + _labels = labels + _position_ids = position_ids + _attention_mask = attention_mask + if attention_mask is None: + attention_mask = torch.ones_like(input_ids, dtype=torch.bool) + else: + attention_mask = attention_mask.bool() + if position_ids is None: + position_ids = torch.arange(0, input_ids.shape[1], dtype=torch.long, device=input_ids.device) + if labels is None: + labels = torch.full_like(input_ids, IGNORE_INDEX) + + # remove the padding using attention_mask -- FIXME + _input_ids = input_ids + input_ids = [cur_input_ids[cur_attention_mask] for cur_input_ids, cur_attention_mask in zip(input_ids, attention_mask)] + labels = [cur_labels[cur_attention_mask] for cur_labels, cur_attention_mask in zip(labels, attention_mask)] + + new_input_embeds = [] + new_labels = [] + cur_image_idx = 0 + for batch_idx, cur_input_ids in enumerate(input_ids): + num_images = (cur_input_ids == IMAGE_TOKEN_INDEX).sum() + if num_images == 0: + cur_image_features = image_features[cur_image_idx] + cur_input_embeds_1 = self.get_model().embed_tokens(cur_input_ids) + cur_input_embeds = torch.cat([cur_input_embeds_1, cur_image_features[0:0]], dim=0) + new_input_embeds.append(cur_input_embeds) + new_labels.append(labels[batch_idx]) + cur_image_idx += 1 + continue + + image_token_indices = [-1] + torch.where(cur_input_ids == IMAGE_TOKEN_INDEX)[0].tolist() + [cur_input_ids.shape[0]] + cur_input_ids_noim = [] + cur_labels = labels[batch_idx] + cur_labels_noim = [] + for i in range(len(image_token_indices) - 1): + cur_input_ids_noim.append(cur_input_ids[image_token_indices[i]+1:image_token_indices[i+1]]) + cur_labels_noim.append(cur_labels[image_token_indices[i]+1:image_token_indices[i+1]]) + split_sizes = [x.shape[0] for x in cur_labels_noim] + cur_input_embeds = self.get_model().embed_tokens(torch.cat(cur_input_ids_noim)) + cur_input_embeds_no_im = torch.split(cur_input_embeds, split_sizes, dim=0) + cur_new_input_embeds = [] + cur_new_labels = [] + + for i in range(num_images + 1): + cur_new_input_embeds.append(cur_input_embeds_no_im[i]) + cur_new_labels.append(cur_labels_noim[i]) + if i < num_images: + cur_image_features = image_features[cur_image_idx] + cur_image_idx += 1 + cur_new_input_embeds.append(cur_image_features) + cur_new_labels.append(torch.full((cur_image_features.shape[0],), IGNORE_INDEX, device=cur_labels.device, dtype=cur_labels.dtype)) + + cur_new_input_embeds = [x.to(self.device) for x in cur_new_input_embeds] + + cur_new_input_embeds = torch.cat(cur_new_input_embeds) + cur_new_labels = torch.cat(cur_new_labels) + + new_input_embeds.append(cur_new_input_embeds) + new_labels.append(cur_new_labels) + + # Truncate sequences to max length as image embeddings can make the sequence longer + tokenizer_model_max_length = getattr(self.config, 'tokenizer_model_max_length', None) + if 
tokenizer_model_max_length is not None: + new_input_embeds = [x[:tokenizer_model_max_length] for x in new_input_embeds] + new_labels = [x[:tokenizer_model_max_length] for x in new_labels] + + # Combine them + max_len = max(x.shape[0] for x in new_input_embeds) + batch_size = len(new_input_embeds) + + new_input_embeds_padded = [] + new_labels_padded = torch.full((batch_size, max_len), IGNORE_INDEX, dtype=new_labels[0].dtype, device=new_labels[0].device) + attention_mask = torch.zeros((batch_size, max_len), dtype=attention_mask.dtype, device=attention_mask.device) + position_ids = torch.zeros((batch_size, max_len), dtype=position_ids.dtype, device=position_ids.device) + + for i, (cur_new_embed, cur_new_labels) in enumerate(zip(new_input_embeds, new_labels)): + cur_len = cur_new_embed.shape[0] + if getattr(self.config, 'tokenizer_padding_side', 'right') == "left": + new_input_embeds_padded.append(torch.cat(( + torch.zeros((max_len - cur_len, cur_new_embed.shape[1]), dtype=cur_new_embed.dtype, device=cur_new_embed.device), + cur_new_embed + ), dim=0)) + if cur_len > 0: + new_labels_padded[i, -cur_len:] = cur_new_labels + attention_mask[i, -cur_len:] = True + position_ids[i, -cur_len:] = torch.arange(0, cur_len, dtype=position_ids.dtype, device=position_ids.device) + else: + new_input_embeds_padded.append(torch.cat(( + cur_new_embed, + torch.zeros((max_len - cur_len, cur_new_embed.shape[1]), dtype=cur_new_embed.dtype, device=cur_new_embed.device) + ), dim=0)) + if cur_len > 0: + new_labels_padded[i, :cur_len] = cur_new_labels + attention_mask[i, :cur_len] = True + position_ids[i, :cur_len] = torch.arange(0, cur_len, dtype=position_ids.dtype, device=position_ids.device) + + new_input_embeds = torch.stack(new_input_embeds_padded, dim=0) + + if _labels is None: + new_labels = None + else: + new_labels = new_labels_padded + + if _attention_mask is None: + attention_mask = None + else: + attention_mask = attention_mask.to(dtype=_attention_mask.dtype) + + if _position_ids is None: + position_ids = None + + return None, position_ids, attention_mask, past_key_values, new_input_embeds, new_labels + + def initialize_vision_tokenizer(self, model_args, tokenizer): + if model_args.mm_use_im_patch_token: + tokenizer.add_tokens([DEFAULT_IMAGE_PATCH_TOKEN], special_tokens=True) + self.resize_token_embeddings(len(tokenizer)) + + if model_args.mm_use_im_start_end: + num_new_tokens = tokenizer.add_tokens([DEFAULT_IM_START_TOKEN, DEFAULT_IM_END_TOKEN], special_tokens=True) + self.resize_token_embeddings(len(tokenizer)) + + if num_new_tokens > 0: + input_embeddings = self.get_input_embeddings().weight.data + output_embeddings = self.get_output_embeddings().weight.data + + input_embeddings_avg = input_embeddings[:-num_new_tokens].mean( + dim=0, keepdim=True) + output_embeddings_avg = output_embeddings[:-num_new_tokens].mean( + dim=0, keepdim=True) + + input_embeddings[-num_new_tokens:] = input_embeddings_avg + output_embeddings[-num_new_tokens:] = output_embeddings_avg + + if model_args.tune_mm_mlp_adapter: + for p in self.get_input_embeddings().parameters(): + p.requires_grad = True + for p in self.get_output_embeddings().parameters(): + p.requires_grad = False + + if model_args.pretrain_mm_mlp_adapter: + mm_projector_weights = torch.load(model_args.pretrain_mm_mlp_adapter, map_location='cpu') + embed_tokens_weight = mm_projector_weights['model.embed_tokens.weight'] + assert num_new_tokens == 2 + if input_embeddings.shape == embed_tokens_weight.shape: + input_embeddings[-num_new_tokens:] = 
embed_tokens_weight[-num_new_tokens:] + elif embed_tokens_weight.shape[0] == num_new_tokens: + input_embeddings[-num_new_tokens:] = embed_tokens_weight + else: + raise ValueError(f"Unexpected embed_tokens_weight shape. Pretrained: {embed_tokens_weight.shape}. Current: {input_embeddings.shape}. Numer of new tokens: {num_new_tokens}.") + elif model_args.mm_use_im_patch_token: + if model_args.tune_mm_mlp_adapter: + for p in self.get_input_embeddings().parameters(): + p.requires_grad = False + for p in self.get_output_embeddings().parameters(): + p.requires_grad = False diff --git a/openeqa/baselines/llava/model/make_delta.py b/openeqa/baselines/llava/model/make_delta.py new file mode 100644 index 0000000..4ae55d5 --- /dev/null +++ b/openeqa/baselines/llava/model/make_delta.py @@ -0,0 +1,52 @@ +""" +Usage: +python3 -m llava.model.make_delta --base ~/model_weights/llama-7b --target ~/model_weights/llava-7b --delta ~/model_weights/llava-7b-delta --hub-repo-id liuhaotian/llava-7b-delta +""" +import argparse + +import torch +from tqdm import tqdm +from transformers import AutoTokenizer, AutoModelForCausalLM +from llava.model.utils import auto_upgrade + + +def make_delta(base_model_path, target_model_path, delta_path, hub_repo_id): + print("Loading base model") + base = AutoModelForCausalLM.from_pretrained( + base_model_path, torch_dtype=torch.float16, low_cpu_mem_usage=True) + + print("Loading target model") + auto_upgrade(target_model_path) + target = AutoModelForCausalLM.from_pretrained(target_model_path, torch_dtype=torch.float16, low_cpu_mem_usage=True) + + print("Calculating delta") + for name, param in tqdm(target.state_dict().items(), desc="Calculating delta"): + if name not in base.state_dict(): + assert name in ['model.mm_projector.weight', 'model.mm_projector.bias'], f'{name} not in base model' + continue + if param.data.shape == base.state_dict()[name].shape: + param.data -= base.state_dict()[name] + else: + assert name in ['model.embed_tokens.weight', 'lm_head.weight'], f'{name} dimension mismatch: {param.data.shape} vs {base.state_dict()[name].shape}' + bparam = base.state_dict()[name] + param.data[:bparam.shape[0], :bparam.shape[1]] -= bparam + + print("Saving delta") + if hub_repo_id: + kwargs = {"push_to_hub": True, "repo_id": hub_repo_id} + else: + kwargs = {} + target.save_pretrained(delta_path, **kwargs) + target_tokenizer = AutoTokenizer.from_pretrained(target_model_path) + target_tokenizer.save_pretrained(delta_path, **kwargs) + + +if __name__ == "__main__": + parser = argparse.ArgumentParser() + parser.add_argument("--base-model-path", type=str, required=True) + parser.add_argument("--target-model-path", type=str, required=True) + parser.add_argument("--delta-path", type=str, required=True) + parser.add_argument("--hub-repo-id", type=str, default=None) + args = parser.parse_args() + + make_delta(args.base_model_path, args.target_model_path, args.delta_path, args.hub_repo_id) diff --git a/openeqa/baselines/llava/model/multimodal_encoder/builder.py b/openeqa/baselines/llava/model/multimodal_encoder/builder.py new file mode 100644 index 0000000..29f63a2 --- /dev/null +++ b/openeqa/baselines/llava/model/multimodal_encoder/builder.py @@ -0,0 +1,15 @@ +import os +from .clip_encoder import CLIPVisionTower, CLIPVisionTowerS2 + + +def build_vision_tower(vision_tower_cfg, **kwargs): + vision_tower = getattr(vision_tower_cfg, 'mm_vision_tower', getattr(vision_tower_cfg, 'vision_tower', None)) + is_absolute_path_exists = os.path.exists(vision_tower) + use_s2 = getattr(vision_tower_cfg, 
's2', False) + if is_absolute_path_exists or vision_tower.startswith("openai") or vision_tower.startswith("laion") or "ShareGPT4V" in vision_tower: + if use_s2: + return CLIPVisionTowerS2(vision_tower, args=vision_tower_cfg, **kwargs) + else: + return CLIPVisionTower(vision_tower, args=vision_tower_cfg, **kwargs) + + raise ValueError(f'Unknown vision tower: {vision_tower}') diff --git a/openeqa/baselines/llava/model/multimodal_encoder/clip_encoder.py b/openeqa/baselines/llava/model/multimodal_encoder/clip_encoder.py new file mode 100644 index 0000000..2c81415 --- /dev/null +++ b/openeqa/baselines/llava/model/multimodal_encoder/clip_encoder.py @@ -0,0 +1,147 @@ +import torch +import torch.nn as nn + +from transformers import CLIPVisionModel, CLIPImageProcessor, CLIPVisionConfig + + +class CLIPVisionTower(nn.Module): + def __init__(self, vision_tower, args, delay_load=False): + super().__init__() + + self.is_loaded = False + + self.vision_tower_name = vision_tower + self.select_layer = args.mm_vision_select_layer + self.select_feature = getattr(args, 'mm_vision_select_feature', 'patch') + + if not delay_load: + self.load_model() + elif getattr(args, 'unfreeze_mm_vision_tower', False): + self.load_model() + else: + self.cfg_only = CLIPVisionConfig.from_pretrained(self.vision_tower_name) + + def load_model(self, device_map=None): + if self.is_loaded: + print('{} is already loaded, `load_model` called again, skipping.'.format(self.vision_tower_name)) + return + + self.image_processor = CLIPImageProcessor.from_pretrained(self.vision_tower_name) + self.vision_tower = CLIPVisionModel.from_pretrained(self.vision_tower_name, device_map=device_map) + self.vision_tower.requires_grad_(False) + + self.is_loaded = True + + def feature_select(self, image_forward_outs): + image_features = image_forward_outs.hidden_states[self.select_layer] + if self.select_feature == 'patch': + image_features = image_features[:, 1:] + elif self.select_feature == 'cls_patch': + image_features = image_features + else: + raise ValueError(f'Unexpected select feature: {self.select_feature}') + return image_features + + @torch.no_grad() + def forward(self, images): + if type(images) is list: + image_features = [] + for image in images: + image_forward_out = self.vision_tower(image.to(device=self.device, dtype=self.dtype).unsqueeze(0), output_hidden_states=True) + image_feature = self.feature_select(image_forward_out).to(image.dtype) + image_features.append(image_feature) + else: + image_forward_outs = self.vision_tower(images.to(device=self.device, dtype=self.dtype), output_hidden_states=True) + image_features = self.feature_select(image_forward_outs).to(images.dtype) + + return image_features + + @property + def dummy_feature(self): + return torch.zeros(1, self.hidden_size, device=self.device, dtype=self.dtype) + + @property + def dtype(self): + return self.vision_tower.dtype + + @property + def device(self): + return self.vision_tower.device + + @property + def config(self): + if self.is_loaded: + return self.vision_tower.config + else: + return self.cfg_only + + @property + def hidden_size(self): + return self.config.hidden_size + + @property + def num_patches_per_side(self): + return self.config.image_size // self.config.patch_size + + @property + def num_patches(self): + return (self.config.image_size // self.config.patch_size) ** 2 + + + +class CLIPVisionTowerS2(CLIPVisionTower): + def __init__(self, vision_tower, args, delay_load=False): + super().__init__(vision_tower, args, delay_load) + + self.s2_scales = getattr(args, 
's2_scales', '336,672,1008') + self.s2_scales = list(map(int, self.s2_scales.split(','))) + self.s2_scales.sort() + self.s2_split_size = self.s2_scales[0] + self.s2_image_size = self.s2_scales[-1] + + try: + from s2wrapper import forward as multiscale_forward + except ImportError: + raise ImportError('Package s2wrapper not found! Please install by running: \npip install git+https://github.com/bfshi/scaling_on_scales.git') + self.multiscale_forward = multiscale_forward + + # change resize/crop size in preprocessing to the largest image size in s2_scale + if not delay_load or getattr(args, 'unfreeze_mm_vision_tower', False): + self.image_processor.size['shortest_edge'] = self.s2_image_size + self.image_processor.crop_size['height'] = self.image_processor.crop_size['width'] = self.s2_image_size + + def load_model(self, device_map=None): + if self.is_loaded: + print('{} is already loaded, `load_model` called again, skipping.'.format(self.vision_tower_name)) + return + + self.image_processor = CLIPImageProcessor.from_pretrained(self.vision_tower_name) + self.vision_tower = CLIPVisionModel.from_pretrained(self.vision_tower_name, device_map=device_map) + self.vision_tower.requires_grad_(False) + + self.image_processor.size['shortest_edge'] = self.s2_image_size + self.image_processor.crop_size['height'] = self.image_processor.crop_size['width'] = self.s2_image_size + + self.is_loaded = True + + @torch.no_grad() + def forward_feature(self, images): + image_forward_outs = self.vision_tower(images.to(device=self.device, dtype=self.dtype), output_hidden_states=True) + image_features = self.feature_select(image_forward_outs).to(images.dtype) + return image_features + + @torch.no_grad() + def forward(self, images): + if type(images) is list: + image_features = [] + for image in images: + image_feature = self.multiscale_forward(self.forward_feature, image.unsqueeze(0), img_sizes=self.s2_scales, max_split_size=self.s2_split_size) + image_features.append(image_feature) + else: + image_features = self.multiscale_forward(self.forward_feature, images, img_sizes=self.s2_scales, max_split_size=self.s2_split_size) + + return image_features + + @property + def hidden_size(self): + return self.config.hidden_size * len(self.s2_scales) diff --git a/openeqa/baselines/llava/model/multimodal_projector/builder.py b/openeqa/baselines/llava/model/multimodal_projector/builder.py new file mode 100644 index 0000000..31cd4f4 --- /dev/null +++ b/openeqa/baselines/llava/model/multimodal_projector/builder.py @@ -0,0 +1,51 @@ +import torch +import torch.nn as nn +import re + + +class IdentityMap(nn.Module): + def __init__(self): + super().__init__() + + def forward(self, x, *args, **kwargs): + return x + + @property + def config(self): + return {"mm_projector_type": 'identity'} + + +class SimpleResBlock(nn.Module): + def __init__(self, channels): + super().__init__() + self.pre_norm = nn.LayerNorm(channels) + + self.proj = nn.Sequential( + nn.Linear(channels, channels), + nn.GELU(), + nn.Linear(channels, channels) + ) + def forward(self, x): + x = self.pre_norm(x) + return x + self.proj(x) + + +def build_vision_projector(config, delay_load=False, **kwargs): + projector_type = getattr(config, 'mm_projector_type', 'linear') + + if projector_type == 'linear': + return nn.Linear(config.mm_hidden_size, config.hidden_size) + + mlp_gelu_match = re.match(r'^mlp(\d+)x_gelu$', projector_type) + if mlp_gelu_match: + mlp_depth = int(mlp_gelu_match.group(1)) + modules = [nn.Linear(config.mm_hidden_size, config.hidden_size)] + for _ in 
range(1, mlp_depth): + modules.append(nn.GELU()) + modules.append(nn.Linear(config.hidden_size, config.hidden_size)) + return nn.Sequential(*modules) + + if projector_type == 'identity': + return IdentityMap() + + raise ValueError(f'Unknown projector type: {projector_type}') diff --git a/openeqa/baselines/llava/model/utils.py b/openeqa/baselines/llava/model/utils.py new file mode 100644 index 0000000..2563f89 --- /dev/null +++ b/openeqa/baselines/llava/model/utils.py @@ -0,0 +1,20 @@ +from transformers import AutoConfig + + +def auto_upgrade(config): + cfg = AutoConfig.from_pretrained(config) + if 'llava' in config and 'llava' not in cfg.model_type: + assert cfg.model_type == 'llama' + print("You are using newer LLaVA code base, while the checkpoint of v0 is from older code base.") + print("You must upgrade the checkpoint to the new code base (this can be done automatically).") + confirm = input("Please confirm that you want to upgrade the checkpoint. [Y/N]") + if confirm.lower() in ["y", "yes"]: + print("Upgrading checkpoint...") + assert len(cfg.architectures) == 1 + setattr(cfg.__class__, "model_type", "llava") + cfg.architectures[0] = 'LlavaLlamaForCausalLM' + cfg.save_pretrained(config) + print("Checkpoint upgraded.") + else: + print("Checkpoint upgrade aborted.") + exit(1) diff --git a/openeqa/baselines/llava/utils.py b/openeqa/baselines/llava/utils.py new file mode 100644 index 0000000..4006cf9 --- /dev/null +++ b/openeqa/baselines/llava/utils.py @@ -0,0 +1,126 @@ +import datetime +import logging +import logging.handlers +import os +import sys + +import requests + +from llava.constants import LOGDIR + +server_error_msg = "**NETWORK ERROR DUE TO HIGH TRAFFIC. PLEASE REGENERATE OR REFRESH THIS PAGE.**" +moderation_msg = "YOUR INPUT VIOLATES OUR CONTENT MODERATION GUIDELINES. PLEASE TRY AGAIN." + +handler = None + + +def build_logger(logger_name, logger_filename): + global handler + + formatter = logging.Formatter( + fmt="%(asctime)s | %(levelname)s | %(name)s | %(message)s", + datefmt="%Y-%m-%d %H:%M:%S", + ) + + # Set the format of root handlers + if not logging.getLogger().handlers: + logging.basicConfig(level=logging.INFO) + logging.getLogger().handlers[0].setFormatter(formatter) + + # Redirect stdout and stderr to loggers + stdout_logger = logging.getLogger("stdout") + stdout_logger.setLevel(logging.INFO) + sl = StreamToLogger(stdout_logger, logging.INFO) + sys.stdout = sl + + stderr_logger = logging.getLogger("stderr") + stderr_logger.setLevel(logging.ERROR) + sl = StreamToLogger(stderr_logger, logging.ERROR) + sys.stderr = sl + + # Get logger + logger = logging.getLogger(logger_name) + logger.setLevel(logging.INFO) + + # Add a file handler for all loggers + if handler is None: + os.makedirs(LOGDIR, exist_ok=True) + filename = os.path.join(LOGDIR, logger_filename) + handler = logging.handlers.TimedRotatingFileHandler( + filename, when='D', utc=True, encoding='UTF-8') + handler.setFormatter(formatter) + + for name, item in logging.root.manager.loggerDict.items(): + if isinstance(item, logging.Logger): + item.addHandler(handler) + + return logger + + +class StreamToLogger(object): + """ + Fake file-like stream object that redirects writes to a logger instance. 
+ """ + def __init__(self, logger, log_level=logging.INFO): + self.terminal = sys.stdout + self.logger = logger + self.log_level = log_level + self.linebuf = '' + + def __getattr__(self, attr): + return getattr(self.terminal, attr) + + def write(self, buf): + temp_linebuf = self.linebuf + buf + self.linebuf = '' + for line in temp_linebuf.splitlines(True): + # From the io.TextIOWrapper docs: + # On output, if newline is None, any '\n' characters written + # are translated to the system default line separator. + # By default sys.stdout.write() expects '\n' newlines and then + # translates them so this is still cross platform. + if line[-1] == '\n': + self.logger.log(self.log_level, line.rstrip()) + else: + self.linebuf += line + + def flush(self): + if self.linebuf != '': + self.logger.log(self.log_level, self.linebuf.rstrip()) + self.linebuf = '' + + +def disable_torch_init(): + """ + Disable the redundant torch default initialization to accelerate model creation. + """ + import torch + setattr(torch.nn.Linear, "reset_parameters", lambda self: None) + setattr(torch.nn.LayerNorm, "reset_parameters", lambda self: None) + + +def violates_moderation(text): + """ + Check whether the text violates OpenAI moderation API. + """ + url = "https://api.openai.com/v1/moderations" + headers = {"Content-Type": "application/json", + "Authorization": "Bearer " + os.environ["OPENAI_API_KEY"]} + text = text.replace("\n", "") + data = "{" + '"input": ' + f'"{text}"' + "}" + data = data.encode("utf-8") + try: + ret = requests.post(url, headers=headers, data=data, timeout=5) + flagged = ret.json()["results"][0]["flagged"] + except requests.exceptions.RequestException as e: + flagged = False + except KeyError as e: + flagged = False + + return flagged + + +def pretty_print_semaphore(semaphore): + if semaphore is None: + return "None" + return f"Semaphore(value={semaphore._value}, locked={semaphore.locked()})" diff --git a/prompts/ferret_rag.txt b/prompts/ferret_rag.txt new file mode 100644 index 0000000..1d2038b --- /dev/null +++ b/prompts/ferret_rag.txt @@ -0,0 +1,15 @@ +You are an intelligent question answering agent. I will ask you questions about a textualized image of indoor space and you must provide an think and answer. You must generate answer after thinking. The image description contains [object] [object's bounding box coordinates]. Q is Qnswer, I is Image, A is Answer. Answers must be short answers. + +If the question does not provide enough information to properly answer, provide an appropriate guess. + +Q: What machine is on top of the stove? +I: The image portrays a modern kitchen [140, 80, 960, 920] with a sleek design, featuring white cabinets [150, 90, 850, 400] and a dark countertop [170, 420, 890, 600]. The stove [380, 500, 620, 650] is centrally positioned, with a microwave [400, 300, 600, 450] placed directly above it. The microwave has a digital display and buttons on the right side, indicating its functionality., The backsplash consists of tiled patterns [160, 280, 850, 320], adding texture to the space. Various cooking utensils [200, 530, 370, 650] and spice containers [640, 510, 750, 600] are neatly arranged on the counter., The refrigerator [50, 120, 240, 800] is visible on the left side, blending seamlessly with the overall kitchen decor. The kitchen is well-lit, likely from both ceiling lights and natural light entering through a window [750, 100, 950, 500]. +A: The microwave + +Q: What piece of furniture is in the middle of the bedroom? 
+I: The image shows a bedroom [329, 125, 998, 950] with soft lighting and a neutral color palette, creating a cozy atmosphere. In the center of the room, there is a large bed [410, 376, 890, 728] with a neatly arranged blanket and pillows., A nightstand [245, 500, 370, 690] is positioned next to the bed, holding a lamp [278, 430, 320, 510] and a small decorative object., A window [65, 145, 290, 540] on the left side of the room allows natural light to enter, complementing the artificial lighting from the ceiling fixture. On the right side, a dresser [680, 400, 950, 690] with a mirror is visible, with various personal items placed on top. A rug [370, 750, 870, 950] covers a portion of the wooden floor, enhancing the warmth of the room. The arrangement of furniture highlights the bed as the focal point of the bedroom. +A: a bed + +Q: {question} +I: {img_1}, {img_2}, {img_3} +A: \ No newline at end of file diff --git a/prompts/ferret_uniform_sampling.txt b/prompts/ferret_uniform_sampling.txt new file mode 100644 index 0000000..2a4a6eb --- /dev/null +++ b/prompts/ferret_uniform_sampling.txt @@ -0,0 +1,15 @@ +You are an intelligent question answering agent. I will ask you questions about a textualized image of an indoor space, and you must provide an answer. The image description contains [object] [object's bounding box coordinates]. Q is Question, I is Image, A is Answer. Answers must be short answers. + +If the question does not provide enough information to properly answer, provide an appropriate guess. + +Q: What machine is on top of the stove? +I: The image portrays a modern kitchen [140, 80, 960, 920] with a sleek design, featuring white cabinets [150, 90, 850, 400] and a dark countertop [170, 420, 890, 600]. The stove [380, 500, 620, 650] is centrally positioned, with a microwave [400, 300, 600, 450] placed directly above it. The microwave has a digital display and buttons on the right side, indicating its functionality., The backsplash consists of tiled patterns [160, 280, 850, 320], adding texture to the space. Various cooking utensils [200, 530, 370, 650] and spice containers [640, 510, 750, 600] are neatly arranged on the counter., The refrigerator [50, 120, 240, 800] is visible on the left side, blending seamlessly with the overall kitchen decor. The kitchen is well-lit, likely from both ceiling lights and natural light entering through a window [750, 100, 950, 500]. +A: The microwave + +Q: What piece of furniture is in the middle of the bedroom? 
+A: a bed + +Q: {question} +I: {img_1}, {img_2}, {img_3}, {img_4}, {img_5}, {img_6}, {img_7}, {img_8}, {img_9}, {img_10} +A: \ No newline at end of file diff --git a/prompts/vlm_uniform_sampling.txt b/prompts/vlm_uniform_sampling.txt new file mode 100644 index 0000000..ad2f34d --- /dev/null +++ b/prompts/vlm_uniform_sampling.txt @@ -0,0 +1,15 @@ +You are an intelligent question answering agent. I will ask you questions about a textualized image of an indoor space, and you must provide an answer. Q is Question, I is the Image description, A is Answer. Answers must be short answers. + +If the question does not provide enough information to properly answer, provide an appropriate guess. + +Q: What object is on the coffee table? +I: A book on a coffee table, A kitchen table, A mug on a desk +A: book + +Q: What is the color of the hat hanging on the wall? +I: A gray hat hanging on the wall, A TV is near the desk., A chair is in front of desk. +A: gray + +Q: {question} +I: {img_1}, {img_2}, {img_3}, {img_4}, {img_5}, {img_6}, {img_7}, {img_8}, {img_9}, {img_10} +A: \ No newline at end of file diff --git a/prompts/vlm_rag.txt b/prompts/vlm_rag.txt new file mode 100644 index 0000000..a71e940 --- /dev/null +++ b/prompts/vlm_rag.txt @@ -0,0 +1,15 @@ +You are an intelligent question answering agent. I will ask you questions about a textualized image of an indoor space, and you must provide an answer. Q is Question, I is the Image description, A is Answer. Answers must be short answers. + +If the question does not provide enough information to properly answer, provide an appropriate guess. + +Q: What object is on the coffee table? +I: A book on a coffee table, A kitchen table, A mug on a desk +A: book + +Q: What is the color of the hat hanging on the wall? +I: A gray hat hanging on the wall, A TV is near the desk., A chair is in front of desk. +A: gray + +Q: {question} +I: {img_1}, {img_2}, {img_3} +A: \ No newline at end of file diff --git a/source/R-EQA.pdf b/source/R-EQA.pdf new file mode 100644 index 0000000..5af2cc2 Binary files /dev/null and b/source/R-EQA.pdf differ diff --git a/source/cvprw_poster.jpg b/source/cvprw_poster.jpg new file mode 100644 index 0000000..a979ffc Binary files /dev/null and b/source/cvprw_poster.jpg differ
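
For reference, the prompt files added above are plain-text templates with `{question}` and `{img_1}`…`{img_N}` placeholders (N = 3 for the RAG prompts, N = 10 for uniform sampling), so a baseline script only needs to substitute the question and the retrieved image captions before querying the LLM. Below is a minimal sketch of one way such a template could be filled; `load_prompt`, `fill_prompt`, and the example captions are hypothetical names used for illustration, not the repository's actual API.

```python
# Hypothetical sketch of filling a prompt template such as prompts/vlm_rag.txt.
# `load_prompt` and `fill_prompt` are assumed helper names, not repo code.
from pathlib import Path


def load_prompt(name: str, prompt_dir: str = "prompts") -> str:
    # Read a template, e.g. prompts/vlm_rag.txt or prompts/ferret_uniform_sampling.txt.
    return Path(prompt_dir, f"{name}.txt").read_text()


def fill_prompt(template: str, question: str, captions: list[str]) -> str:
    # The templates expose {question} plus {img_1}..{img_N} placeholders
    # (N = 3 for the RAG prompts, N = 10 for uniform sampling), so the number
    # of captions must match the template being used.
    slots = {f"img_{i}": caption for i, caption in enumerate(captions, start=1)}
    return template.format(question=question, **slots)


if __name__ == "__main__":
    template = load_prompt("vlm_rag")
    prompt = fill_prompt(
        template,
        question="What object is on the coffee table?",
        captions=["A book on a coffee table", "A kitchen table", "A mug on a desk"],
    )
    print(prompt)
```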