How to submit to ForecastBench
👋 We’re delighted you’re considering contributing your forecasts to ForecastBench! This page details the steps required to ensure they’re successfully included in the benchmark.
Some participants have decided to share their forecasting code. Feel free to peruse their repositories, both for forecasting strategies and for code that might speed the development of your own submission.
- Contact us
- Download the Question Set at 0:00 UTC on the forecast due date
- Generate your Forecast Set
- Upload your Forecast Set by 23:59:59 UTC on the forecast due date
To participate, a team must contact forecastbench@forecastingresearch.org with the list of email addresses that should be allowed to upload their team's forecasts.
In response, they'll be provided
- a folder on a GCP Cloud Storage bucket to which they should upload their forecast set, and
- the next forecast due date (every two weeks starting 2025-03-02).
💡 When you receive the response email, check that you can log into GCP and upload a test file to your bucket so that everything goes smoothly on the due date. Feel free to reach out with any follow-up questions.
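For instance, a minimal upload test from Python might look like the sketch below. It assumes the google-cloud-storage package is installed and that you've authenticated (e.g. via `gcloud auth application-default login`); the bucket and folder names are placeholders for the ones you'll receive by email.

```python
from google.cloud import storage  # pip install google-cloud-storage

client = storage.Client()
bucket = client.bucket("your-bucket-name")  # placeholder: provided by email
blob = bucket.blob("your-folder/test.txt")  # placeholder: your team's folder
blob.upload_from_string("test upload")
print(f"Uploaded gs://{bucket.name}/{blob.name}")
```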
Ensure your code runs correctly before the forecast due date.
Follow the steps below using a previously-released question set to ensure you know how to successfully create a forecast set.
ℹ️ Question sets generated before 2025-10-26 contained combination questions, as described in the paper. To test your setup for rounds starting 2025-10-26 and later:
- read in the `"questions"` array from the old question set
- remove the combination questions before proceeding
- combination questions have an array value for the `"id"` field (see the sketch below)
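A minimal sketch of that filtering step, assuming an old question set file on disk (the filename is illustrative):

```python
import json

with open("2024-07-21-llm.json", "r", encoding="utf-8") as f:
    question_set = json.load(f)

# Combination questions have a list value in the "id" field;
# standard questions have a plain string id.
questions = [q for q in question_set["questions"] if not isinstance(q["id"], list)]
```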
ℹ️ Unfortunately, we won't always be able to respond to emails on the forecast due date, so it would be best to ensure your code works well beforehand.
At 0:00 UTC on the forecast due date, navigate to https://github.com/forecastingresearch/forecastbench-datasets/tree/main/datasets/question_sets to download the latest question set. The question set will be named like <<forecast_due_date>>-llm.json.
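If you prefer to fetch the file programmatically, here is a sketch using GitHub's raw-content URLs (the date below is illustrative):

```python
import json
import urllib.request

forecast_due_date = "2025-03-02"  # illustrative forecast due date
url = (
    "https://raw.githubusercontent.com/forecastingresearch/forecastbench-datasets/"
    f"main/datasets/question_sets/{forecast_due_date}-llm.json"
)
with urllib.request.urlopen(url) as response:
    question_set = json.load(response)
```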
The question set is of the format:
```json
{
  "forecast_due_date": {
    "description": "Date in ISO format. e.g. 2024-07-21. Required.",
    "type": "string"
  },
  "question_set": {
    "description": "The name of the file that contains the question set. e.g. 2024-07-21-llm.json. Required.",
    "type": "string"
  },
  "questions": {
    "description": "A list of questions to forecast on. Required.",
    "type": "array<object>"
  }
}
```
Each object in the `"questions"` array is of the format:
```json
{
  "id": {
    "description": "A unique identifier string given `source`. Required.",
    "type": "string"
  },
  "source": {
    "description": "Where the data comes from. e.g. 'acled'. Required.",
    "type": "string"
  },
  "question": {
    "description": "For questions sourced from 'market' sources, this is just the original question. For 'dataset' questions, this is the question presented as a Python f-string with the placeholders `{forecast_due_date}` and `{resolution_date}`. For the human survey, `{resolution_date}` may be replaced by 'the resolution date' while `{forecast_due_date}` may be replaced by 'the forecast due date' or 'today'. For LLMs, this template allows flexibility in deciding what information to insert (e.g. ISO date, date in the format of your choosing, or human replacements above). Required.",
    "type": "string"
  },
  "resolution_criteria": {
    "description": "ForecastBench resolution criteria. Specifies how forecasts will be evaluated for each question type. e.g. 'Resolves to the value calculated from the ACLED dataset once the data is published.' Required.",
    "type": "string"
  },
  "background": {
    "description": "Background information about the forecast question provided by the source, if available. Default: 'N/A'",
    "type": "string"
  },
  "market_info_open_datetime": {
    "description": "The datetime when the forecast question went on the market specified by `source`. e.g. 2022-05-02T05:00:00+00:00. Default: 'N/A'",
    "type": "string"
  },
  "market_info_close_datetime": {
    "description": "The datetime when the forecast question closes on the market specified by `source`. e.g. 2022-05-02T05:00:00+00:00. Default: 'N/A'",
    "type": "string"
  },
  "market_info_resolution_criteria": {
    "description": "The resolution criteria provided by the market specified by `source`, if available. Default: 'N/A'",
    "type": "string"
  },
  "url": {
    "description": "The URL where the resolution value is found. e.g. 'https://acleddata.com/'. Required.",
    "type": "string"
  },
  "freeze_datetime": {
    "description": "The datetime UTC when this question set was generated. This will be 10 days before the forecast due date. e.g. 2024-07-11T00:00:00+00:00. Required.",
    "type": "string"
  },
  "freeze_datetime_value": {
    "description": "The latest value of the market or comparison value the day the question was frozen. If there was an error, it may be set to 'N/A'. e.g. '0.25'. Required.",
    "type": "string"
  },
  "freeze_datetime_value_explanation": {
    "description": "Explanation of what the value specified in `freeze_datetime_value` represents. e.g. 'The market value.' Required.",
    "type": "string"
  },
  "source_intro": {
    "description": "A prompt that presents the source of this question, used in the human survey and provided here for completeness. Required.",
    "type": "string"
  },
  "resolution_dates": {
    "description": "The resolution dates for which forecasts should be provided for this forecast question. Only used for dataset questions. 'N/A' value for market questions. e.g. ['2024-01-08', '2024-01-31', '2024-03-31', '2024-06-29', '2024-12-31', '2026-12-31', '2028-12-30', '2033-12-29']. Required.",
    "type": "array<string> | string"
  }
}
```
After downloading the question set, read in the questions:
```python
import json

import pandas as pd

question_set_filename = "2024-07-21-llm.json"
with open(question_set_filename, "r", encoding="utf-8") as f:
    question_set = json.load(f)

forecast_due_date = question_set["forecast_due_date"]
question_set_name = question_set["question_set"]
df = pd.DataFrame(question_set["questions"])
assert len(df) == 500
```
You have 24 hours to generate and upload your forecasts for this question set.
A forecast set is the ensemble of all of your forecasts on a question set, and what we use to score your forecasting performance.
Your uploaded forecast set should be a JSON file named like `<<forecast_due_date>>.<<organization>>.<<N>>.json`, where:
- `forecast_due_date` is the forecast due date associated with the question set (`forecast_due_date` from the snippet above)
- `organization` is your organization's name
- `N` is the number of this forecast set; only important if you submit more than one forecast set per forecast due date

⚠️ You may submit up to 3 forecast sets per round. If you submit more than 3, we will only consider the first 3 files in alphabetical order.
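For example, a hypothetical organization named ExampleOrg submitting its first forecast set for the 2025-03-02 round would upload a file named `2025-03-02.ExampleOrg.1.json`.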
Your forecast set should contain the 5 keys defined by the following data dictionary:
```json
{
  "organization": "<<your organization>>",
  "model": "<<the model you're testing; if ensemble of models, write 'ensemble'; if submitting multiple forecasts with the same model, then differentiate them here (e.g. '(prompt 1)')>>",
  "model_organization": "<<the organization that created the model; if ensemble, this should contain the same value as `organization`.>>",
  "question_set": "<<'question_set' from the question set file.>>",
  "forecasts": [
    {}
  ]
}
```
The keys `organization` and `model` will be used to find the appropriate logo for the leaderboard. The key `model_organization` will appear directly on the leaderboard as written here.
`question_set` contains the value of `question_set` from the question set file (`question_set_name` from the snippet above).
The forecasts are contained in an array of JSON objects under the `forecasts` key. Each JSON object in the array represents a single forecast and is defined by the following data dictionary:
```json
{
  "id": {
    "description": "A unique identifier string given `source`, corresponding to the `id` from the question in the question set that's being forecast. e.g. 'd331f271'. Required.",
    "type": "string"
  },
  "source": {
    "description": "The `source` from the question in the question set that's being forecast. e.g. 'acled'. Required.",
    "type": "string"
  },
  "forecast": {
    "description": "The forecast. A float in [0,1]. e.g. 0.5. Required.",
    "type": "number"
  },
  "resolution_date": {
    "description": "The resolution date this forecast corresponds to. e.g. '2025-01-01'. `null` for market questions. Required.",
    "type": "string | null"
  },
  "reasoning": {
    "description": "The rationale underlying the forecast. e.g. ''. Optional.",
    "type": "string | null"
  }
}
```
There are two question types in the question set:
- market: questions sourced from forecasting platforms
- dataset: questions generated from time series
The number of forecasts to provide depends on the question type and is summarized in the table below.

| Question type | Forecasts |
|---|---|
| Market | 1 |
| Dataset | ≤ 8 |

The sections below explain why that number of forecasts is required. In short:
- Market question: 1 forecast of the final outcome (1 forecast)
- Dataset question: 1 forecast at each of 8 resolution dates (≤ 8 forecasts)
- NB: ≤ for dataset questions because if a series updates less frequently than weekly, we'll have 7 resolution dates for that series.
To differentiate between question types, check whether the `"source"` is a market source or a dataset source:
```python
SOURCES = {
    "market": ["infer", "manifold", "metaculus", "polymarket"],
    "dataset": ["acled", "dbnomics", "fred", "wikipedia", "yfinance"],
}

# question source masks
market_mask = df["source"].isin(SOURCES["market"])
dataset_mask = df["source"].isin(SOURCES["dataset"])

df_market = df[market_mask]
df_dataset = df[dataset_mask]

assert len(df_market) == 250
assert len(df_dataset) == 250
assert df[~market_mask & ~dataset_mask].empty
```
The forecasting questions in ForecastBench are binary questions that ask how likely it is that a given event will (or will not) occur by a specified date.
For every question in `df_market` from the snippet above, you should provide your model's forecast of the final outcome of the question.
👀 An example forecast for a market question would look like:
```json
{
  "id": "14364",
  "source": "metaculus",
  "forecast": 0.32,
  "resolution_date": null,
  "reasoning": null
}
```
For every question in `df_dataset` from the snippet above, you should provide your model's forecast of the final outcome of the question at each resolution date listed in the `"resolution_dates"` field. There are typically 8 resolution dates.
Note, however, that the number of resolution dates present (and hence the number of forecasts to provide) is determined by the frequency of the series from which the question was generated. If, for example, a series is updated less frequently than weekly, the question will have only 7 resolution dates.
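One way to expand each dataset question into one forecast per resolution date is sketched below (assuming the `df_dataset` DataFrame from the snippet above; the constant 0.5 is a placeholder for your model's actual forecast):

```python
dataset_forecasts = []
for _, row in df_dataset.iterrows():
    for resolution_date in row["resolution_dates"]:
        dataset_forecasts.append(
            {
                "id": row["id"],
                "source": row["source"],
                "forecast": 0.5,  # placeholder: your model's forecast for this date
                "resolution_date": resolution_date,
                "reasoning": None,
            }
        )
```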
👀 An example response to a standard dataset question with 8 resolution dates from the question set due on 2025-03-02 would look like:
```json
{
  "id": "WFC",
  "source": "yfinance",
  "forecast": 0.53,
  "resolution_date": "2025-03-09",
  "reasoning": null
},
{
  "id": "WFC",
  "source": "yfinance",
  "forecast": 0.55,
  "resolution_date": "2025-04-01",
  "reasoning": null
},
{
  "id": "WFC",
  "source": "yfinance",
  "forecast": 0.57,
  "resolution_date": "2025-05-31",
  "reasoning": null
},
{
  "id": "WFC",
  "source": "yfinance",
  "forecast": 0.59,
  "resolution_date": "2025-08-29",
  "reasoning": null
},
{
  "id": "WFC",
  "source": "yfinance",
  "forecast": 0.63,
  "resolution_date": "2026-03-02",
  "reasoning": null
},
{
  "id": "WFC",
  "source": "yfinance",
  "forecast": 0.67,
  "resolution_date": "2028-03-01",
  "reasoning": null
},
{
  "id": "WFC",
  "source": "yfinance",
  "forecast": 0.7,
  "resolution_date": "2030-03-01",
  "reasoning": null
},
{
  "id": "WFC",
  "source": "yfinance",
  "forecast": 0.72,
  "resolution_date": "2035-02-28",
  "reasoning": null
}
```
Upload your forecast set to your GCP bucket folder.
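To assemble and serialize the full forecast set before uploading, here is a minimal end-to-end sketch (assuming the variables from the snippets above; `ExampleOrg`, the model names, and the constant 0.5 forecasts are hypothetical placeholders):

```python
# One forecast of the final outcome per market question.
market_forecasts = [
    {
        "id": row["id"],
        "source": row["source"],
        "forecast": 0.5,  # placeholder: your model's forecast
        "resolution_date": None,
        "reasoning": None,
    }
    for _, row in df_market.iterrows()
]

# Combine with the per-resolution-date dataset forecasts from the earlier sketch.
forecast_set = {
    "organization": "ExampleOrg",  # hypothetical organization name
    "model": "example-model (prompt 1)",  # hypothetical model name
    "model_organization": "ExampleModelOrg",  # hypothetical model creator
    "question_set": question_set_name,
    "forecasts": market_forecasts + dataset_forecasts,
}

# File named <<forecast_due_date>>.<<organization>>.<<N>>.json, as described above.
with open(f"{forecast_due_date}.ExampleOrg.1.json", "w", encoding="utf-8") as f:
    json.dump(forecast_set, f, indent=2)
```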