This repo contains the experiments and the steps necessary to reproduce the results from our paper "A Study into Investigating Temporal Robustness of LLMs".
```shell
conda create -n robustness python=3.10
conda activate robustness
conda install pytorch torchvision torchaudio pytorch-cuda=11.8 -c pytorch -c nvidia
pip install -r requirements.txt
```
Furthermore, you might want to set up Weights and Biases.
Steps to prepare datasets:
- Create a data directory (e.g., /data)
- Download Archival QA from here: https://github.com/WangJiexin/ArchivalQA
- Capitalize the "question" and "answer" column names in the first (header) line of the ArchivalQA files.
- Download Time-Sensitive QA from here (test.hard.json): https://github.com/wenhuchen/Time-Sensitive-QA/blob/main/dataset/test.hard.json and preprocess it using the script in scripts/timesensitiveqa/
- Download the fact verification dataset from here (train,val,test): https://github.com/factiverse/QuanTemp/tree/main/data/raw_data
- Move the datasets to your data directory.
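The header fix above can be scripted. A minimal sketch, assuming the ArchivalQA files are plain comma-separated CSVs with lowercase "question" and "answer" among the header columns (the exact column layout is an assumption here):

```python
def capitalize_header(text: str) -> str:
    """Capitalize each column name in the first (header) line of a CSV string,
    leaving all data rows untouched."""
    lines = text.splitlines()
    header = ",".join(col.strip().capitalize() for col in lines[0].split(","))
    return "\n".join([header] + lines[1:])

# Tiny illustrative example (not real ArchivalQA data):
sample = "question,answer\nWhen did the Berlin Wall fall?,1989"
print(capitalize_header(sample).splitlines()[0])  # → Question,Answer
```

Applying this to each downloaded ArchivalQA file (read, transform, write back) produces the header format the scripts expect.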
Now with all the source datasets available and set up, we can create our temporal robustness tests:
- For the relativization, removal, year shift, and positioning tests, use the /scripts/absolute_relative_time_refs.ipynb notebook to create the datasets.
- Event dating / ordering: run the /scripts/events/wikiyearpagedata.ipynb notebook to create the tests.
- Temporal inverse: run the /scripts/temporal_inverse/sample_temporal_reversal.ipynb notebook.
- Fact checking: run the /scripts/temporal_claims/temporal_fact_ds_creation.ipynb notebook.
We use different prompt options, which are specified in the prompts.json and system_prompts.json files. You can swap them there and pass the corresponding key as the --prompt_name argument.
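The structure of prompts.json is roughly as follows; the keys and prompt wording below are purely illustrative, not the actual file contents:

```json
{
  "qa_default": "Answer the following question concisely: {question}",
  "qa_with_date": "Today is {date}. Answer the following question: {question}"
}
```

Whichever key you pass via --prompt_name selects the corresponding template.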
The experiments use GPT-3.5 and GPT-4, which require you to register an API key with OpenAI and export it as AZURE_OPENAI_API_KEY:
```shell
export AZURE_OPENAI_API_KEY=<key>
```
Running the experiments will look like this:
```shell
python alpaca_query.py \
    --ds_path="/home/wallat/temporalrobustness/data/Event-focused Questions/temporalquestions_all.csv" \
    --model_name="$model_name" \
    --run_name="$model_name TemporalQuestions all" \
    --task="QA" \
    --prompt_path="$prompt_path" \
    --system_prompt="$system_prompt" \
    --batch_size=$batch_size
```
--ds_path points toward the individual .csv files. --model_name might be "alpaca-7b" (more options are listed in the argparse arguments of alpaca_query.py). --prompt_path should point toward a JSON file with prompts for the models; by default, it points at the prompts.json file. Lastly, --prompt_name is the key in the prompt_path file.
A full list of the commands used to test the models is available in the trob.sub slurm submission file.
Running this for all models will result in the big results table (Table 3).
Results are written to standard output, a file with the model predictions is saved (in /predictions/<model_name>/), and results are logged to Weights and Biases (if set up).
Once all tests have been run, you can use our utility notebook to aggregate and print all results in a more readable way. To do so, head over to /scripts/evaluate_test_suite_results.ipynb and set the model_base_path to your predictions folder (e.g., /predictions/<model_name>/).
Additional experiments will require additional steps. The scripts are in the /scripts/ folder.
For the time referencing experiments please have a look at /scripts/absolute_relative_time_refs.ipynb to create the datasets.
If these datasets exist, just run the standard experiment setup and pass in the corresponding (generated) datasets.
Head over to /scripts/evaluate_test_suite_results.ipynb and run it for a model. It will create a text file with the model name that contains the aggregated results in the same format as presented in the paper.
Head over to /scripts/match_dates.ipynb. The script takes one prediction file and produces a figure showing how the model predictions differ from the ground-truth years (Figure 3 in the paper).
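The core comparison can be sketched as follows. This is a simplified stand-in for the notebook, assuming predictions are free text containing at most one four-digit year (the regex and input format are assumptions):

```python
import re

def year_offsets(predictions, gold_years):
    """Signed difference (predicted year - gold year) per example.
    Predictions without a recognizable four-digit year are skipped."""
    offsets = []
    for pred, gold in zip(predictions, gold_years):
        m = re.search(r"\b(1\d{3}|20\d{2})\b", pred)
        if m:
            offsets.append(int(m.group(1)) - gold)
    return offsets

print(year_offsets(
    ["It happened in 1998.", "Around 2005", "no date given"],
    [1999, 2005, 2001],
))  # → [-1, 0]
```

A histogram of these offsets gives a picture like Figure 3: how far, and in which direction, the model's dating deviates from the ground truth.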
Paraphrasing the questions can be useful to understand whether we can trust the predictions. To reproduce the results in Section 5 ("Automatic Testing of Temporal Robustness"), head over to /scripts/automatic_tests.ipynb and follow the notebook. It covers sampling new data, getting model predictions, and then using the consistency between model predictions as an indicator of whether we can/should trust a prediction. All of this is under the real-world assumption that we do not know the ground truth.
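The consistency signal can be sketched as a simple majority-agreement score over the predictions for a question's paraphrases (a minimal illustration; the notebook's exact scoring may differ):

```python
from collections import Counter

def consistency(answers):
    """Fraction of paraphrase predictions that agree with the majority
    answer. Used as a proxy for trusting a prediction when no ground
    truth is available: high agreement suggests a stable answer."""
    if not answers:
        return 0.0
    normalized = [a.strip().lower() for a in answers]
    _, majority_count = Counter(normalized).most_common(1)[0]
    return majority_count / len(normalized)

print(consistency(["1999", "1999", "2001", "1999"]))  # → 0.75
```

A threshold on this score then separates predictions we keep from those we flag as unreliable.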
For the results of the last section (Section 6), you will need to sample new data from ArchivalQA using the script in /scripts/absolute_relative_time_refs.ipynb. After retrieving the models' predictions, you can contrast the QA performance in the different settings (e.g., no time vs. relative referencing).
You may also reproduce the QA results from the appendix. To do so, you will have to acquire the QA datasets:
- Download TemporalQuestions from here: https://www.dropbox.com/sh/fdepuisdce268za/AACtiPDaO_RwLCwhIwaET4Iba?dl=0
- Download TempLAMA from here: https://github.com/google-research/language/tree/master/language/templama
- Move them to your data directory
- Get the model predictions for it as described in "Running Experiments"