This repository contains the code and extended results for our paper "LLM-as-a-Judge in Entity Retrieval: Assessing Explicit and Implicit Relevance".
```
ClickLLM/
├── assets/                        # Figures and images used in documentation and results
├── data/                          # Datasets, queries, qrels, and supporting data files
├── outputs/                       # Output results from LLM judgments and reasoning
│   ├── llm_qrel/                  # LLM-generated relevance judgments
│   │   ├── dbpedia/               # Judgments for the DBpedia-Entity dataset
│   │   └── laque/                 # Judgments for the LaQuE dataset
│   └── llm_reasoning/             # LLM-generated reason assignments
└── src/                           # Source code for scripts and analysis
    ├── dbpedia_judgement.py       # Runs LLM-based relevance judgments on DBpedia-Entity
    ├── laque_judgement.py         # Runs LLM-based relevance judgments on the LaQuE dataset
    ├── laque_analysis.py          # Uses LLMs to distill why users clicked on entities deemed irrelevant into a list of atomic reasons per query-entity pair
    └── laque_analysis_assigner.py # Assigns a binary label to each query-entity pair and reason, indicating whether the LLM thinks that reason applies to the user's click
```
Figures 1 and 2 compare LLM-based relevance judgments to human annotations on DBpedia-Entity.
*Figure 1 panels: Abstract Binary | Abstract Graded | Title Binary | Title Graded*

*Figure 2 panels: Abstract Binary | Abstract Graded | Title Binary | Title Graded*
Distribution of LLM-generated reasons for user clicks on entities judged irrelevant. Prominent result bias and lexical similarity are the most frequent factors.
| Query | Entity | LLM Judgment | Human Judgment |
|---|---|---|---|
| Einstein Relativity theory | Theory of Relativity | 2 | 2 |
| Disney Orlando | Greater Orlando | 1 | 0 |
| Austin Texas | Texas | 0 | 1 |
| Guitar Classical Bach | Johann Sebastian Bach | 2 | 0 |
| Query | Clicked Entity | LLM Judgment |
|---|---|---|
| Apple Mac | Macintosh | Relevant |
| Indian History in Hindi | Hindi | Not Relevant |
| CNN News Cast Members | List of CNN Anchors | Relevant |
| When Was Color Invented | Color television | Not Relevant |
| Input | Qwen3 Binary | Qwen3 Graded | Llama4 Binary | Llama4 Graded |
|---|---|---|---|---|
| Titles | 0.3900 | 0.2733 | 0.4647 | 0.3428 |
| Titles + Abstracts | 0.4623 | 0.3042 | 0.5236 | 0.3658 |
| Input | # Agreements | Accuracy |
|---|---|---|
| Titles | 14,910 | 91.93% |
| Titles + Abstracts | 14,888 | 91.79% |
| Query | Clicked Entity | LLM Reasoning |
|---|---|---|
| X Factor USA Judges | The X Factor (UK TV series) | Name or lexical similarity, Prominent Result Bias |
| Palm Springs Florida | Palm Springs, California | Name or lexical similarity, Geographic name confusion |
| Brad Pitt Vegan | List of vegans | Category or topical association, Prominent Result Bias, Exploratory curiosity |
| John Bundy | Ted Bundy | Name or lexical similarity, Prominent Result Bias, Exploratory curiosity, Familiarity Bias |
Our pipeline consists of three stages: Relevance Judgement, Reason Nuggetization, and Label Assignment.
The script `dbpedia_judgement.py` prompts the chosen LLM (Qwen3:8b or Llama4:Scout) to perform the judgement task on DBpedia-Entity query-entity pairs.
The `use_abstract` flag determines whether the script uses entity titles or entity abstracts as input for the judgement task; a minimal sketch of this switch is shown below.
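The sketch below illustrates how such a judgement call could look with the `use_abstract` switch. It assumes a local Ollama server reachable through the `ollama` Python package and an Ollama-style model tag; the `judge_pair` helper, the abbreviated prompt wording, and the fallback behaviour are illustrative assumptions, not the repository's actual API.

```python
# Minimal sketch of a graded judgement call; the real logic lives in src/dbpedia_judgement.py.
# Assumes a local Ollama server and the `ollama` Python package (pip install ollama).
# `judge_pair`, the model tag, and the abbreviated prompt are illustrative, not the actual code.
import re
import ollama

def judge_pair(query: str, title: str, abstract: str,
               use_abstract: bool, model: str = "qwen3:8b") -> int:
    """Return a graded 0-2 relevance score for one query-entity pair."""
    entity_text = abstract if use_abstract else title   # the use_abstract switch
    prompt = (
        "Given a query and a knowledge entity, rate relevance from 0 to 2.\n"
        f"Query: {query}\nEntity: {entity_text}\n"
        'Respond only in the format "Final score: #"'
    )
    reply = ollama.chat(model=model, messages=[{"role": "user", "content": prompt}])
    match = re.search(r"Final score:\s*([0-2])", reply["message"]["content"])
    return int(match.group(1)) if match else 0           # default to 0 on malformed replies
```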
The script `laque_judgement.py` prompts the chosen LLM (Qwen3:8b or Llama4:Scout) to perform the same judgement task on LaQuE query-entity pairs.
The script `laque_analysis.py` prompts the chosen LLM (Qwen3:8b or Llama4:Scout) to analyse why users clicked on an entity even though the LLM judged it irrelevant to the query.
The LLM produces a separate list of reasons for each query-entity pair, which we then aggregate into six general distilled reasons.
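To make the aggregation concrete, the sketch below tallies how often each of the six distilled reasons occurs across the per-pair reason lists, mirroring the distribution figure above. The reason names are taken from the reasoning table; the JSON layout and file path are illustrative assumptions about how the per-pair output might be stored, not the repository's actual format.

```python
# Tally the six distilled reasons across query-entity pairs (illustrative sketch).
# Assumes one JSON object mapping "query||entity" keys to lists of distilled reason strings;
# this layout and the file path are assumptions, not the repository's actual format.
import json
from collections import Counter

DISTILLED_REASONS = [
    "Name or lexical similarity",
    "Prominent Result Bias",
    "Geographic name confusion",
    "Category or topical association",
    "Exploratory curiosity",
    "Familiarity Bias",
]

def reason_distribution(path: str) -> Counter:
    """Count how often each distilled reason was assigned across all pairs."""
    with open(path, encoding="utf-8") as f:
        reasons_per_pair = json.load(f)            # {"query||entity": ["reason", ...], ...}
    counts = Counter()
    for reasons in reasons_per_pair.values():
        counts.update(r for r in reasons if r in DISTILLED_REASONS)
    return counts

if __name__ == "__main__":
    for reason, count in reason_distribution("outputs/llm_reasoning/distilled_reasons.json").most_common():
        print(f"{reason}: {count}")
```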
Once the reasons are generated, run `laque_analysis_assigner.py` to assign a binary label (0 or 1) to each query-entity pair and reason, indicating whether the LLM thinks that reason applies to the user's click on the entity.
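A minimal sketch of that assignment step follows. The prompt wording, the `reason_applies` helper, and the Ollama-based call are illustrative assumptions; the actual implementation is `src/laque_analysis_assigner.py`.

```python
# Sketch of the binary reason assignment; see src/laque_analysis_assigner.py for the real script.
# Assumes the `ollama` Python package; the prompt wording and helper name are illustrative.
import ollama

def reason_applies(query: str, entity: str, reason: str, model: str = "qwen3:8b") -> int:
    """Return 1 if the LLM thinks `reason` explains the click on `entity` for `query`, else 0."""
    prompt = (
        f"Query: {query}\n"
        f"Clicked entity: {entity}\n"
        f"Candidate reason for the click: {reason}\n"
        "Does this reason apply to the user's click? Answer only 0 (no) or 1 (yes)."
    )
    reply = ollama.chat(model=model, messages=[{"role": "user", "content": prompt}])
    return 1 if reply["message"]["content"].strip().startswith("1") else 0
```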
For reference, the graded judgement prompt used with entity abstracts is:

```
Given a query and the abstract of a knowledge entity, you must choose one option:
0: The entity seems irrelevant to the query.
1: The entity seems relevant to the query but does not directly match it.
2: The entity seems highly relevant to the query or is an exact match.

Break down each query into these steps:
1. Consider what information the user is likely searching for with the query.
2. Measure how well the abstract matches a likely intent of the query (M), scored 0–2.
3. Assess whether the entity matches any reasonable interpretation of the query (I), scored 0–2.
4. Based on M and I, decide on a final score (O), scored 0–2.

Query: {}
Entity: {}
IMPORTANT: Your response must only be in the format of "Final score: #"
Relevant?
```








