
Towards Temporal Robustness of Large Language Models

This repo contains the experiments and the steps necessary to reproduce the results from our paper "A Study into Temporal Robustness of LLMs".

Conda/Mamba Environment

conda create -n robustness python=3.10
conda activate robustness
conda install pytorch torchvision torchaudio pytorch-cuda=11.8 -c pytorch -c nvidia
pip install -r requirements.txt

Furthermore, you might want to set up Weights & Biases for experiment logging.
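If you do, a minimal login sketch (assumes an existing wandb account; the key can also be provided via the WANDB_API_KEY environment variable):

# Optional: authenticate with Weights & Biases before running experiments.
import wandb

wandb.login()  # reads WANDB_API_KEY if set, otherwise prompts for a key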

Datasets and Preprocessing

Steps to prepare datasets:

  1. Create a data directory (e.g., /data)
  2. Download Archival QA from here: https://github.com/WangJiexin/ArchivalQA
  3. Capitalize "question" and "answer" in the first line (the column header) of the ArchivalQA files; a sketch of this step follows the list.
  4. Download Time-Sensitive QA from here (test.hard.json): https://github.com/wenhuchen/Time-Sensitive-QA/blob/main/dataset/test.hard.json and preprocess it using the script in scripts/timesensitiveqa/
  5. Download the fact verification dataset from here (train,val,test): https://github.com/factiverse/QuanTemp/tree/main/data/raw_data
  6. Move the datasets to your data directory.
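For step 3, a minimal sketch of the header fix, assuming the ArchivalQA files are plain CSVs whose first line is the column header (the file path below is a placeholder):

# Capitalize "question" and "answer" in the header line of an ArchivalQA CSV.
from pathlib import Path

def capitalize_header(csv_path: str) -> None:
    path = Path(csv_path)
    lines = path.read_text(encoding="utf-8").splitlines(keepends=True)
    # Only the first line (the header) is touched; the data rows stay as-is.
    lines[0] = lines[0].replace("question", "Question").replace("answer", "Answer")
    path.write_text("".join(lines), encoding="utf-8")

capitalize_header("/data/ArchivalQA/archivalqa_test.csv")  # placeholder path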

Now with all the source datasets available and set up, we can create our temporal robustness tests:

  1. For the relativization, removal, year shift, and positioning tests, use the /scripts/absolute_relative_time_refs.ipynb notebook to create the datasets.
  2. Event dating/ordering: run the /scripts/events/wikiyearpagedata.ipynb notebook to create the tests.
  3. Temporal inverse: run the /scripts/temporal_inverse/sample_temporal_reversal.ipynb notebook.
  4. Fact checking: run the /scripts/temporal_claims/temporal_fact_ds_creation.ipynb notebook.

Prompts

We use different prompt options, which are specified in the prompts.json and system_prompts.json files. You can swap them there and pass the desired key via the --prompt_name argument.
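As a rough sketch of how the --prompt_name lookup works (the actual structure of prompts.json is defined by this repo; the flat name-to-template mapping and the "qa_default" key below are assumptions for illustration):

# Sketch: selecting a prompt template by name. The flat {name: template}
# structure and the key "qa_default" are assumptions, not the repo's format.
import json

with open("prompts.json") as f:
    prompts = json.load(f)

template = prompts["qa_default"]  # hypothetical --prompt_name value
# Assumes the template contains a {question} placeholder.
print(template.format(question="When did the Berlin Wall fall?"))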

Models

The experiments use GPT-3.5 and GPT-4, which require you to register an OpenAI API key and export it as AZURE_OPENAI_API_KEY, like so:

export AZURE_OPENAI_API_KEY=<key>
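Given the AZURE_OPENAI_API_KEY variable, requests presumably go through Azure OpenAI; a minimal client sketch, where the API version, endpoint, and deployment name are placeholders rather than values from this repo:

# Minimal Azure OpenAI client sketch; api_version, azure_endpoint, and the
# deployment name are placeholders -- substitute your own values.
import os
from openai import AzureOpenAI

client = AzureOpenAI(
    api_key=os.environ["AZURE_OPENAI_API_KEY"],
    api_version="2024-02-01",
    azure_endpoint="https://<your-resource>.openai.azure.com",
)
response = client.chat.completions.create(
    model="gpt-35-turbo",  # your Azure deployment name
    messages=[{"role": "user", "content": "In which year did the Berlin Wall fall?"}],
)
print(response.choices[0].message.content)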

Running Experiments

Running the experiments will look like this:

python alpaca_query.py \
     --ds_path="/home/wallat/temporalrobustness/data/Event-focused Questions/temporalquestions_all.csv" \
     --model_name="$model_name" \
     --run_name="$model_name TemporalQuestions all" \
     --task="QA" \
     --prompt_path="$prompt_path" \
     --system_prompt="$system_prompt" \
     --batch_size=$batch_size

--ds_path points to the individual .csv files. --model_name might be "alpaca-7b" (more options are listed in the argparse arguments of alpaca_query.py). --prompt_path should point to a JSON file with prompts for the models; by default, this is the prompts.json file. Lastly, --prompt_name is the key within the prompt_path file.

A full list of the commands used to test the models is available in the trob.sub Slurm submission file.

Running this for all models produces the main results table (Table 3).

Results

Results are written to standard output, a file with the model predictions is saved (in /predictions/<model_name>/), and everything is logged to Weights & Biases (if set up).

Once all tests have run, you can use our utility notebook to aggregate and print all results in a more readable way. To do so, head over to /scripts/evaluate_test_suite_results.ipynb and set model_base_path to your predictions folder (e.g., /predictions/<model_name>/).
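For a quick look before opening the notebook, a rough sketch of this kind of aggregation (the "prediction" and "answer" column names and the exact-match metric are assumptions; the notebook is the authoritative version):

# Rough aggregation sketch; assumes each prediction file is a CSV with
# "prediction" and "answer" columns -- check the notebook for the real format.
from pathlib import Path
import pandas as pd

model_base_path = Path("/predictions/alpaca-7b")  # hypothetical model folder
for pred_file in sorted(model_base_path.glob("*.csv")):
    df = pd.read_csv(pred_file)
    # Exact match after light normalization (simplified metric).
    acc = (df["prediction"].str.strip().str.lower()
           == df["answer"].str.strip().str.lower()).mean()
    print(f"{pred_file.name}: EM = {acc:.3f}")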

Additional Experiments

Additional experiments will require additional steps. The scripts are in the /scripts/ folder.

Time Referencing

For the time referencing experiments, please have a look at /scripts/absolute_relative_time_refs.ipynb to create the datasets. Once these datasets exist, just run the standard experiment setup and pass in the corresponding generated datasets.

Evaluate all findings from the tests

Head over to /scripts/evaluate_test_suite_results.ipynb and run it for a model. It will create a text file named after the model that contains the aggregated results in the same format as presented in the paper.

Event Dating Metrics and Figure

Head over to /scripts/match_dates.ipynb. The notebook takes one prediction file and produces a figure showing how the model predictions differ from the ground-truth years (Figure 3 in the paper).
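As a rough idea of what such a figure involves (the real code lives in the notebook; the file and column names below are assumptions for illustration):

# Sketch of a year-offset histogram; the prediction file and the
# "predicted_year"/"gold_year" columns are assumptions, not the repo's schema.
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("/predictions/alpaca-7b/event_dating.csv")  # hypothetical file
offsets = df["predicted_year"] - df["gold_year"]

plt.hist(offsets, bins=range(-20, 21))
plt.xlabel("Predicted year - ground-truth year")
plt.ylabel("Number of questions")
plt.savefig("event_dating_offsets.png")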

Automatic Test Suite

Paraphrasing the questions can be useful for understanding whether we can trust the predictions. To reproduce the results in Section 5 ("Automatic Testing of Temporal Robustness"), head over to /scripts/automatic_tests.ipynb and follow the notebook. It covers sampling new data, getting model predictions, and then using the consistency between model predictions as an indicator of whether we can/should trust a prediction. All of this is under the real-world assumption that we do not know the ground truth.
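The core of the consistency signal can be sketched as follows, assuming the predictions for all paraphrases of a question are available side by side (the normalization and majority-vote measure are simplifications):

# Sketch of the consistency idea: the more paraphrases of a question agree,
# the more we trust the prediction. Majority vote is a simplification.
from collections import Counter

def consistency(predictions: list[str]) -> float:
    """Fraction of paraphrase predictions that agree with the majority answer."""
    normalized = [p.strip().lower() for p in predictions]
    majority_count = Counter(normalized).most_common(1)[0][1]
    return majority_count / len(normalized)

# Example: three of four paraphrases agree -> consistency 0.75.
print(consistency(["1989", "1989", "1990", "1989 "]))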

Findings Evaluation Tests

For the results of the final section (Section 6), you will need to sample new data from ArchivalQA using the /scripts/absolute_relative_time_refs.ipynb notebook. After retrieving the models' predictions, you may contrast the QA performance in the different settings (e.g., no time vs. relative referencing).

(Optional) QA Experiments

You may also reproduce the QA results from the appendix. To do so, you will have to acquire the QA datasets:

  1. Download TemporalQuestions from here: https://www.dropbox.com/sh/fdepuisdce268za/AACtiPDaO_RwLCwhIwaET4Iba?dl=0
  2. Download TempLAMA from here: https://github.com/google-research/language/tree/master/language/templama
  3. Move them to your data folder
  4. Get the model predictions for it as described in "Running Experiments"

Citation
