FoodLMEval is a multi-dimensional benchmark comprising over 4,000 questions designed to evaluate the holistic food literacy of large language models across five key domains: Nutritional Literacy, Functional Health Claims, Food Safety, Cooking Procedure Integrity, and Culinary Reasoning.
Grounded in authoritative sources like the USDA and EFSA, the dataset utilizes diverse formats—including ranking and troubleshooting—to distinguish a model's genuine scientific and procedural understanding from simple pattern matching or lucky guessing.
The full dataset and auxiliary files for FoodLMEval are hosted on Figshare. Download Data Here
To run the evaluation, you must download specific files from the link above and place them into the corresponding folders in this repository. See the detailed instructions below for each aspect.
This repository is organized into five main folders, each corresponding to a distinct aspect of food literacy described in our paper.
Nutritional Literacy (`a1_nutritional_literacy/`): this aspect tests the model's ability to recall nutrient density and rank foods based on nutritional content.
Setup Instructions:
- Download the `nutrient_data` folder (contains the raw files with food names and nutrient density lists used to build the QA).
- Download `nutrients_task_qa.csv` (the final QA benchmark dataset).
- Place both the `nutrient_data` folder and `nutrients_task_qa.csv` inside `a1_nutritional_literacy/`.
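If you want to confirm the layout before running anything, a minimal check along these lines works (paths taken from the steps above):

```python
from pathlib import Path

# Expected layout for this aspect, per the setup steps above.
base = Path("a1_nutritional_literacy")
for required in (base / "nutrient_data", base / "nutrients_task_qa.csv"):
    print(f"{required}: {'found' if required.exists() else 'MISSING'}")
```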
Code Usage:
- Inference: Use `infer_qwen.py`, `infer_flan.py`, or `infer_llama.py` to generate answers.
  - Note: To switch model sizes (e.g., 8B vs 32B), modify the `MODEL_ID` variable inside the script (see the sketch after this list): `MODEL_ID = "Qwen/Qwen3-32B-AWQ"`
- Evaluation: Use `eval_qwen.py`, `eval_flan.py`, or `eval_llama.py`. These scripts contain the cleaning logic and metric calculations.
- Robustness: `shuffle_mc.ipynb` is provided for shuffling the multiple-choice options to ensure anti-bias evaluation.
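For reference, the model-size switch amounts to editing one line. The sketch below assumes the scripts load checkpoints through Hugging Face `transformers`; verify against the actual loading code in `infer_qwen.py`:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Change this identifier to switch model sizes (e.g., a smaller Qwen3 checkpoint instead of 32B).
MODEL_ID = "Qwen/Qwen3-32B-AWQ"

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, device_map="auto")
```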
Functional Health Claims (`a2_functional_health/`): evaluates the understanding of health claims based on the Food Health Claims Knowledge Graph.
Setup Instructions:
- Download `balanced_health_QA.csv` (the final QA dataset).
- Place it inside `a2_functional_health/`.
- Reference: The original KG data is available at food-claims-kg.
Code Usage:
- Generation Logic: `gpt_prompt.py` contains the logic used to create the dataset.
- Inference: Run `infer_health_qwen.py`, `infer_health_llama.py`, or `infer_health_flan.py` to generate model responses.
- Evaluation: Run `evaluate_health.py` after inference to compute metrics.
- Robustness: `shuffle_mc.py` handles MCQ shuffling.
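The shuffling step is conceptually simple: permute the options and keep track of where the correct answer lands. A minimal sketch follows; the actual column names and option format handled by `shuffle_mc.py` may differ:

```python
import random

def shuffle_options(options, correct_index, seed=None):
    """Return (shuffled_options, new_correct_index) for one MCQ item."""
    rng = random.Random(seed)
    order = list(range(len(options)))
    rng.shuffle(order)
    shuffled = [options[i] for i in order]
    return shuffled, order.index(correct_index)

# Toy example: the correct option stays tracked after shuffling.
opts, idx = shuffle_options(["Option A", "Option B", "Option C", "Option D"], correct_index=1, seed=0)
print(opts, "-> correct:", opts[idx])
```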
Food Safety (`a3_food_safety/`): assesses adherence to food safety guidelines using a dataset created via GPT generation and manual filtering.
Setup Instructions:
- Download `food_safety.jsonl` (the final QA dataset).
- Place it inside `a3_food_safety/`.
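Since `food_safety.jsonl` is in JSON Lines format (one JSON object per line), it can be inspected with nothing more than the standard library; no field names are assumed here:

```python
import json

with open("a3_food_safety/food_safety.jsonl", encoding="utf-8") as f:
    records = [json.loads(line) for line in f if line.strip()]

print(f"Loaded {len(records)} QA records")
print("Fields in the first record:", sorted(records[0]))
```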
Code Usage:
- Inference: Run `infer_flan.py`, `infer_llama.py`, or `infer_qwen.py`. Ensure the `food_safety.jsonl` file is present for these to run.
- Generation: `gpt_qa.py` contains the code used to generate the initial QA pairs before manual filtering.
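As a rough illustration of the first-stage generation: the exact prompt, model, and parsing are defined in `gpt_qa.py`, so everything below is a hedged placeholder rather than the script's actual code.

```python
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

# Hypothetical prompt; the real instructions live in gpt_qa.py.
prompt = (
    "Write one multiple-choice question about safe food handling, "
    "with four options and the letter of the correct answer."
)

response = client.chat.completions.create(
    model="gpt-4o",  # placeholder; use whatever model gpt_qa.py configures
    messages=[{"role": "user", "content": prompt}],
)
print(response.choices[0].message.content)
```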
Cooking Procedure Integrity (`a4_cooking_procedure/`): tests whether models can understand recipe logic by identifying missing ingredients or steps.
Setup Instructions:
- Download `cooking_procedures.csv` (the final benchmark dataset).
- Place it inside `a4_cooking_procedure/`.
Code Usage:
- Inference: Run `infer_flan.py`, `infer_llama.py`, or `infer_qwen.py` to generate answers.
- Evaluation: Run `evaluate_a4.py`. This script compares the LLM-generated answers against the ground truth and prints the final metrics.
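Conceptually, that comparison reduces to matching predictions against ground-truth answers. A stripped-down sketch follows; the file and column names other than `cooking_procedures.csv` are assumptions for illustration, not the script's real schema:

```python
import pandas as pd

truth = pd.read_csv("a4_cooking_procedure/cooking_procedures.csv")
preds = pd.read_csv("a4_cooking_procedure/model_predictions.csv")  # hypothetical inference output

# Assumed columns ("question_id", "answer", "prediction") for illustration only.
merged = truth.merge(preds, on="question_id")
accuracy = (
    merged["prediction"].str.strip().str.lower() == merged["answer"].str.strip().str.lower()
).mean()
print(f"Accuracy: {accuracy:.3f}")
```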
Culinary Reasoning (`a5_culinary_reasoning/`): evaluates high-level troubleshooting and culinary logic using expert community data (Reddit/Stack Exchange) filtered by DEITA and an ensemble of Gemini and GPT.
Setup Instructions: The Figshare link contains data from three processing stages. For reproduction, you primarily need the final file.
- Download `culinary_reasoning_QA.csv` (the final benchmark after the GPT and Gemini filters).
- Place it inside `a5_culinary_reasoning/`.
Optional Data Files:
- `unified_cooking_data.csv`: original raw crawled data.
- `deita_filtered_dataset.csv`: data after DEITA selection (includes quality scores).
Code Usage:
- Inference: Use `llama_reason.py`, `qwen_reason.py`, or `flan_reason.py`. Modify model names/parameters in the script as needed.
- Data Creation: `llmasjudge.py` was used for the second-stage filtering to create the final QA dataset.
- LLM-as-a-Judge: Use `eval_gpt.py` or `gemini_eval.py`. These scripts use GPT-4/Gemini to compare model answers against the human expert answers.
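The judging pattern is straightforward: show the judge model the question, the human expert answer, and the candidate answer, then ask for a score. A minimal sketch assuming the OpenAI client is below; the real rubric, judge model, and output parsing live in `eval_gpt.py` and `gemini_eval.py`:

```python
from openai import OpenAI

client = OpenAI()

def judge(question, expert_answer, model_answer):
    """Ask a judge model to rate the candidate answer against the expert answer (1-10)."""
    rubric = (
        "You are grading an answer to a cooking troubleshooting question.\n"
        f"Question: {question}\n"
        f"Human expert answer: {expert_answer}\n"
        f"Candidate answer: {model_answer}\n"
        "Rate the candidate from 1 (useless) to 10 (matches the expert). Reply with the number only."
    )
    response = client.chat.completions.create(
        model="gpt-4o",  # placeholder judge model
        messages=[{"role": "user", "content": rubric}],
    )
    return response.choices[0].message.content.strip()
```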
- [Feb 2026]: Paper submitted to the KDD 2026 Datasets and Benchmarks Track.
