This repo contains the experiments and the steps necessary to reproduce the results from our paper "A Study into Investigating Temporal Robustness of LLMs".
```shell
conda create -n robustness python=3.10
conda activate robustness
conda install pytorch torchvision torchaudio pytorch-cuda=11.8 -c pytorch -c nvidia
pip install -r requirements.txt
```
Furthermore, you might want to set up Weights and Biases.
Steps to prepare datasets:
- Create a data directory (e.g., /data)
- Download Archival QA from here: https://github.com/WangJiexin/ArchivalQA
- Capitalize the "question" and "answer" column names in the first (header) line of the ArchivalQA files.
- Download Time-Sensitive QA from here (test.hard.json): https://github.com/wenhuchen/Time-Sensitive-QA/blob/main/dataset/test.hard.json and preprocess it using the script in scripts/timesensitiveqa/
- Download the fact verification dataset from here (train,val,test): https://github.com/factiverse/QuanTemp/tree/main/data/raw_data
- Move the datasets to your data directory.
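The header fix above can be scripted. A minimal sketch, assuming the ArchivalQA files are plain comma-separated CSVs with lowercase "question" and "answer" among the header columns (the exact column layout is an assumption here):

```python
def capitalize_header(text: str) -> str:
    """Capitalize each column name in the first (header) line of a CSV string,
    leaving all data rows untouched."""
    lines = text.splitlines()
    header = ",".join(col.strip().capitalize() for col in lines[0].split(","))
    return "\n".join([header] + lines[1:])

# Tiny illustrative example (not real ArchivalQA data):
sample = "question,answer\nWhen did the Berlin Wall fall?,1989"
print(capitalize_header(sample).splitlines()[0])  # → Question,Answer
```

Applying this to each downloaded ArchivalQA file (read, transform, write back) produces the header format the scripts expect.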
Now with all the source datasets available and set up, we can create our temporal robustness tests:
- For the relativization, removal, year shift, and positioning tests, use the /scripts/absolute_relative_time_refs.ipynb notebook to create the datasets.
- Event dating / ordering: run the /scripts/events/wikiyearpagedata.ipynb notebook to create the tests.
- Temporal inverse: run the /scripts/temporal_inverse/sample_temporal_reversal.ipynb notebook.
- Fact checking: run the /scripts/temporal_claims/temporal_fact_ds_creation.ipynb notebook.
We use different prompt options, which are specified in the prompts.json and system_prompts.json files. You can swap them there and pass the corresponding key as the --prompt_name argument.
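The structure of prompts.json is roughly as follows; the keys and prompt wording below are purely illustrative, not the actual file contents:

```json
{
  "qa_default": "Answer the following question concisely: {question}",
  "qa_with_date": "Today is {date}. Answer the following question: {question}"
}
```

Whichever key you pass via --prompt_name selects the corresponding template.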
The experiments use GPT-3.5 and GPT-4, which require you to register an API key with OpenAI and export it as AZURE_OPENAI_API_KEY:
```shell
export AZURE_OPENAI_API_KEY=<key>
```
Running the experiments will look like this:
```shell
python alpaca_query.py \
    --ds_path="/home/wallat/temporalrobustness/data/Event-focused Questions/temporalquestions_all.csv" \
    --model_name="$model_name" \
    --run_name="$model_name TemporalQuestions all" \
    --task="QA" \
    --prompt_path="$prompt_path" \
    --system_prompt="$system_prompt" \
    --batch_size=$batch_size
```
--ds_path points toward the individual .csv files. --model_name might be "alpaca-7b" (more options are listed in the argparse arguments of alpaca_query.py). --prompt_path should point toward a JSON file with prompts for the models; by default, it points at the prompts.json file. Lastly, --prompt_name is the key in the prompt_path file.
A full list of the commands used to test the models is available in the trob.sub slurm submission file.
Running this for all models will result in the big results table (Table 3).
Results are written to standard output, a file with the model predictions is saved (in /predictions/<model_name>/), and results are logged to Weights and Biases (if set up).
Once all tests have been run, you can use our utility notebook to aggregate and print all results in a more readable way. To do so, head over to /scripts/evaluate_test_suite_results.ipynb and set the model_base_path to your predictions folder (e.g., /predictions/<model_name>/).
Additional experiments will require additional steps. The scripts are in the /scripts/ folder.
For the time referencing experiments please have a look at /scripts/absolute_relative_time_refs.ipynb to create the datasets.
If these datasets exist, just run the standard experiment setup and pass in the corresponding (generated) datasets.
Head over to /scripts/evaluate_test_suite_results.ipynb and run it for a model. It will create a text file with the model name that contains the aggregated results in the same format as presented in the paper.
Head over to /scripts/match_dates.ipynb. The script takes one prediction file and produces a figure showing how the model predictions differ from the ground-truth years (Figure 3 in the paper).
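The core comparison can be sketched as follows. This is a simplified stand-in for the notebook, assuming predictions are free text containing at most one four-digit year (the regex and input format are assumptions):

```python
import re

def year_offsets(predictions, gold_years):
    """Signed difference (predicted year - gold year) per example.
    Predictions without a recognizable four-digit year are skipped."""
    offsets = []
    for pred, gold in zip(predictions, gold_years):
        m = re.search(r"\b(1\d{3}|20\d{2})\b", pred)
        if m:
            offsets.append(int(m.group(1)) - gold)
    return offsets

print(year_offsets(
    ["It happened in 1998.", "Around 2005", "no date given"],
    [1999, 2005, 2001],
))  # → [-1, 0]
```

A histogram of these offsets gives a picture like Figure 3: how far, and in which direction, the model's dating deviates from the ground truth.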
Paraphrasing the questions can be useful to understand whether we can trust the predictions. To reproduce the results in Section 5 ("Automatic Testing of Temporal Robustness"), head over to /scripts/automatic_tests.ipynb and follow the notebook. It covers sampling new data, getting model predictions, and then using the consistency between model predictions as an indicator of whether we can/should trust a prediction. All of this is under the real-world assumption that we do not know the ground truth.
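The consistency signal can be sketched as a simple majority-agreement score over the predictions for a question's paraphrases (a minimal illustration; the notebook's exact scoring may differ):

```python
from collections import Counter

def consistency(answers):
    """Fraction of paraphrase predictions that agree with the majority
    answer. Used as a proxy for trusting a prediction when no ground
    truth is available: high agreement suggests a stable answer."""
    if not answers:
        return 0.0
    normalized = [a.strip().lower() for a in answers]
    _, majority_count = Counter(normalized).most_common(1)[0]
    return majority_count / len(normalized)

print(consistency(["1999", "1999", "2001", "1999"]))  # → 0.75
```

A threshold on this score then separates predictions we keep from those we flag as unreliable.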
For the results of the last section (Section 6), you will need to sample new data from ArchivalQA using the script in /scripts/absolute_relative_time_refs.ipynb. After retrieving the models' predictions, you can contrast the QA performance in the different settings (e.g., no time vs. relative referencing).
You may also reproduce the QA results from the appendix. To do so, you will have to acquire the QA datasets:
- Download TemporalQuestions from here: https://www.dropbox.com/sh/fdepuisdce268za/AACtiPDaO_RwLCwhIwaET4Iba?dl=0
- Download TempLAMA from here: https://github.com/google-research/language/tree/master/language/templama
- Move them to your data directory
- Get the model predictions for it as described in "Running Experiments"