FoodLMEval: A Multi-aspect Benchmark for Assessing Food Literacy in LLMs

FoodLMEval is a multi-dimensional benchmark comprising over 4,000 questions designed to evaluate the holistic food literacy of large language models across five key domains: Nutritional Literacy, Functional Health Claims, Food Safety, Cooking Procedure Integrity, and Culinary Reasoning.

Grounded in authoritative sources such as the USDA and EFSA, the dataset uses diverse question formats, including ranking and troubleshooting, to distinguish a model's genuine scientific and procedural understanding from simple pattern matching or lucky guessing.


🖼️ Framework Overview

(Figure: FoodLMEval framework overview.)


📂 Data Availability & Setup

The full dataset and auxiliary files for FoodLMEval are hosted on Figshare. Download Data Here

To run the evaluation, you must download specific files from the link above and place them into the corresponding folders in this repository. See the detailed instructions below for each aspect.


🏗️ Repository Structure & Usage Guide

This repository is organized into five main folders, each corresponding to a distinct aspect of food literacy described in our paper.

🥦 1. Nutritional Literacy (a1_nutritional_literacy)

This aspect tests whether a model can recall nutrient-density facts and rank foods by their nutritional content.

Setup Instructions:

  1. Download the nutrient_data folder (contains raw files with food names and nutrient density lists used to build the QA).
  2. Download nutrients_task_qa.csv (the final QA benchmark dataset).
  3. Place both the nutrient_data folder and nutrients_task_qa.csv inside a1_nutritional_literacy/.

Code Usage:

  • Inference: Use infer_qwen.py, infer_flan.py, or infer_llama.py to generate answers.
    • Note: To switch model sizes (e.g., 8B vs 32B), modify the MODEL_ID variable inside the script:
      MODEL_ID = "Qwen/Qwen3-32B-AWQ" 
  • Evaluation: Use eval_qwen.py, eval_flan.py, or eval_llama.py. These scripts contain the cleaning logic and metric calculations.
  • Robustness: shuffle_mc.ipynb shuffles the multiple-choice options so that evaluation is robust to answer-position bias; see the sketch after this list.
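
A minimal sketch of the option-shuffling idea referenced above. The column names (A/B/C/D/answer) and the output file name are assumptions for illustration, not the notebook's exact schema:

    import random
    import pandas as pd

    rng = random.Random(0)  # fixed seed so the shuffled benchmark is reproducible

    df = pd.read_csv("a1_nutritional_literacy/nutrients_task_qa.csv")

    def shuffle_options(row):
        # Collect the four option texts and remember which text is currently correct.
        options = [row["A"], row["B"], row["C"], row["D"]]
        correct_text = row[row["answer"]]
        rng.shuffle(options)
        # Re-derive the correct letter from the option's new position.
        new_letter = "ABCD"[options.index(correct_text)]
        return pd.Series(options + [new_letter], index=["A", "B", "C", "D", "answer"])

    df[["A", "B", "C", "D", "answer"]] = df.apply(shuffle_options, axis=1)
    df.to_csv("a1_nutritional_literacy/nutrients_task_qa_shuffled.csv", index=False)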

🏥 2. Functional Health Claims (a2_functional_health)

Evaluates understanding of functional health claims, with questions derived from the Food Health Claims Knowledge Graph.

Setup Instructions:

  1. Download balanced_health_QA.csv (the final QA dataset).
  2. Place it inside a2_functional_health/.
  3. Reference: The original KG data is available at food-claims-kg.

Code Usage:

  • Generation Logic: gpt_prompt.py contains the logic used to create the dataset.
  • Inference: Run infer_health_qwen.py, infer_health_llama.py, or infer_health_flan.py to generate model responses.
  • Evaluation: Run evaluate_health.py after inference to compute metrics (a sketch of this step follows the list).
  • Robustness: shuffle_mc.py handles MCQ shuffling.
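
A hedged sketch of the post-inference metric step, assuming the inference scripts write a CSV with the gold letter in an answer column and the raw model response in a model_output column (both column names and the file name are illustrative):

    import re
    import pandas as pd

    # Hypothetical output of the inference scripts: one row per question.
    df = pd.read_csv("a2_functional_health/health_predictions.csv")

    def extract_letter(text):
        # Take the first standalone A-D letter found in the free-form response.
        match = re.search(r"\b([A-D])\b", str(text).upper())
        return match.group(1) if match else ""

    df["pred"] = df["model_output"].apply(extract_letter)
    accuracy = (df["pred"] == df["answer"].str.strip().str.upper()).mean()
    print(f"Accuracy: {accuracy:.3f} over {len(df)} questions")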

⚠️ 3. Food Safety (a3_food_safety)

Assesses adherence to food safety guidelines using a dataset created by GPT generation followed by manual filtering.

Setup Instructions:

  1. Download food_safety.jsonl (the final QA dataset).
  2. Place it inside a3_food_safety/.

Code Usage:

  • Inference: Run infer_flan.py, infer_llama.py, or infer_qwen.py; food_safety.jsonl must be present in a3_food_safety/ for these to run (see the sketch below).
  • Generation: gpt_qa.py contains the code used to generate the initial QA pairs before manual filtering.
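
For orientation, a minimal inference sketch over food_safety.jsonl using the Hugging Face pipeline API. The checkpoint name, the "question" field, and the output file are assumptions; treat the repository's infer_*.py scripts as authoritative:

    import json
    from transformers import pipeline

    # Field names below ("question") are assumptions about the JSONL schema.
    with open("a3_food_safety/food_safety.jsonl", encoding="utf-8") as f:
        items = [json.loads(line) for line in f]

    # Any instruction-tuned checkpoint works for this sketch; the repo's scripts pin their own.
    generator = pipeline("text-generation", model="Qwen/Qwen2.5-7B-Instruct", device_map="auto")

    with open("a3_food_safety/food_safety_predictions.jsonl", "w", encoding="utf-8") as out:
        for item in items:
            prompt = f"Answer the following food safety question.\n\n{item['question']}\nAnswer:"
            full = generator(prompt, max_new_tokens=128, do_sample=False)[0]["generated_text"]
            record = {"question": item["question"], "model_output": full[len(prompt):].strip()}
            out.write(json.dumps(record, ensure_ascii=False) + "\n")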

🍳 4. Cooking Procedure Integrity (a4_cooking_procedure)

Tests whether models can understand recipe logic by identifying missing ingredients or steps.

Setup Instructions:

  1. Download cooking_procedures.csv (the final benchmark dataset).
  2. Place it inside a4_cooking_procedure/.

Code Usage:

  • Inference: Run infer_flan.py, infer_llama.py, or infer_qwen.py to generate answers.
  • Evaluation: Run evaluate_a4.py, which compares the LLM-generated answers against the ground truth and prints the final metrics (a sketch of the comparison follows).
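
A sketch of the ground-truth comparison described above, assuming a predictions CSV with answer and model_output columns (illustrative names; evaluate_a4.py may compute different metrics):

    import pandas as pd

    # Assumed columns: "answer" (ground-truth missing step/ingredient) and
    # "model_output" (LLM response); both names are illustrative.
    df = pd.read_csv("a4_cooking_procedure/cooking_predictions.csv")

    def normalize(text):
        return " ".join(str(text).lower().split())

    exact = (df["model_output"].map(normalize) == df["answer"].map(normalize)).mean()
    # Looser check: does the response mention the missing item at all?
    contains = df.apply(
        lambda r: normalize(r["answer"]) in normalize(r["model_output"]), axis=1
    ).mean()
    print(f"Exact match: {exact:.3f} | Answer contained in response: {contains:.3f}")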

🧠 5. Culinary Reasoning (a5_culinary_reasoning)

Evaluates high-level troubleshooting and culinary logic using expert community data (Reddit/StackExchange), filtered first with DEITA and then with an ensemble of Gemini and GPT judges.

Setup Instructions: The Figshare link provides the data at three processing stages. For reproduction, you primarily need the final file.

  1. Download culinary_reasoning_QA.csv (The final benchmark after GPT and Gemini filters).
  2. Place it inside a5_culinary_reasoning/.

Optional Data Files:

  • unified_cooking_data.csv: Original raw crawled data.
  • deita_filtered_dataset.csv: Data after DEITA selection (includes quality scores).

Code Usage:

  • Inference: Use llama_reason.py, qwen_reason.py, or flan_reason.py. Modify model names/parameters in the script as needed.
  • Data Creation: llmasjudge.py was used for the second-stage filtering to create the final QA dataset.
  • LLM-as-a-Judge: Use eval_gpt.py or gemini_eval.py, which prompt GPT-4/Gemini to compare model answers against the human expert answers (an illustrative judge call is sketched below).
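
An illustrative LLM-as-a-judge call using the OpenAI Python client. The rubric, scale, and prompt wording here are placeholders, not the exact prompt used in eval_gpt.py:

    from openai import OpenAI

    client = OpenAI()  # expects OPENAI_API_KEY in the environment

    JUDGE_TEMPLATE = (
        "You are grading a culinary troubleshooting answer.\n"
        "Question: {question}\n"
        "Expert answer: {reference}\n"
        "Model answer: {candidate}\n"
        "Rate how well the model answer matches the expert answer in correctness and "
        "reasoning on a 1-5 scale. Reply with a single integer."
    )

    def judge(question, reference, candidate, model="gpt-4o"):
        # Ask the judge model for a single integer score and parse it.
        response = client.chat.completions.create(
            model=model,
            temperature=0,
            messages=[{"role": "user", "content": JUDGE_TEMPLATE.format(
                question=question, reference=reference, candidate=candidate)}],
        )
        return int(response.choices[0].message.content.strip())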

📢 News

  • [Feb 2026]: Paper submitted to the KDD 2026 Datasets and Benchmarks Track.
