This repository contains the code and extended results for our paper "LLM-as-a-Judge in Entity Retrieval: Assessing Explicit and Implicit Relevance".
```
ClickLLM/
├── assets/                        # Figures and images used in documentation and results
├── data/                          # Datasets, queries, qrels, and supporting data files
├── outputs/                       # Output results from LLM judgments and reasoning
│   ├── llm_qrel/                  # LLM-generated relevance judgments
│   │   ├── dbpedia/               # Judgments for the DBpedia-Entity dataset
│   │   └── laque/                 # Judgments for the LaQuE dataset
│   └── llm_reasoning/             # LLM-generated reason assignments
└── src/                           # Source code for scripts and analysis
    ├── dbpedia_judgement.py       # Runs LLM-based relevance judgments on DBpedia-Entity
    ├── laque_judgement.py         # Runs LLM-based relevance judgments on the LaQuE dataset
    ├── laque_analysis.py          # Uses LLMs to distill why users clicked on entities deemed irrelevant into a list of atomic reasons per query-entity pair
    └── laque_analysis_assigner.py # Assigns a binary label to each query-entity pair and reason, indicating whether the LLM thinks that reason applies to the user's click
```
Figures 1 and 2 compare LLM-based relevance judgments to human annotations on DBpedia-Entity.
*Figure 1 panels: Abstract Binary | Abstract Graded | Title Binary | Title Graded*

*Figure 2 panels: Abstract Binary | Abstract Graded | Title Binary | Title Graded*
Distribution of LLM-generated reasons for user clicks on entities judged irrelevant. Prominent result bias and lexical similarity are the most frequent factors.
| Query | Entity | LLM Judgment | Human Judgment |
|---|---|---|---|
| Einstein Relativity theory | Theory of Relativity | 2 | 2 |
| Disney Orlando | Greater Orlando | 1 | 0 |
| Austin Texas | Texas | 0 | 1 |
| Guitar Classical Bach | Johann Sebastian Bach | 2 | 0 |
| Query | Clicked Entity | LLM Judgment |
|---|---|---|
| Apple Mac | Macintosh | Relevant |
| Indian History in Hindi | Hindi | Not Relevant |
| CNN News Cast Members | List of CNN Anchors | Relevant |
| When Was Color Invented | Color television | Not Relevant |
| Input | Qwen3 Binary | Qwen3 Graded | Llama4 Binary | Llama4 Graded |
|---|---|---|---|---|
| Titles | 0.3900 | 0.2733 | 0.4647 | 0.3428 |
| Titles + Abstracts | 0.4623 | 0.3042 | 0.5236 | 0.3658 |
| Input | # Agreements | Accuracy |
|---|---|---|
| Titles | 14,910 | 91.93% |
| Titles + Abstracts | 14,888 | 91.79% |
| Query | Clicked Entity | LLM Reasoning |
|---|---|---|
| X Factor USA Judges | The X Factor (UK TV series) | Name or lexical similarity, Prominent Result Bias |
| Palm Springs Florida | Palm Springs, California | Name or lexical similarity, Geographic name confusion |
| Brad Pitt Vegan | List of vegans | Category or topical association, Prominent Result Bias, Exploratory curiosity |
| John Bundy | Ted Bundy | Name or lexical similarity, Prominent Result Bias, Exploratory curiosity, Familiarity Bias |
Our pipeline consists of three stages: Relevance Judgement, Reason Nuggetization, and Label Assignment.
The script `dbpedia_judgement.py` prompts the chosen LLM (Qwen3:8b or Llama4:Scout) to perform the judgement task on DBpedia-Entity query-entity pairs.
The `use_abstract` flag determines whether the script uses entity titles or entity abstracts as input for the judgement task; a minimal sketch of this switch is shown below.
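The sketch below illustrates how such a judgement call could look with the `use_abstract` switch. It assumes a local Ollama server reachable through the `ollama` Python package and an Ollama-style model tag; the `judge_pair` helper, the abbreviated prompt wording, and the fallback behaviour are illustrative assumptions, not the repository's actual API.

```python
# Minimal sketch of a graded judgement call; the real logic lives in src/dbpedia_judgement.py.
# Assumes a local Ollama server and the `ollama` Python package (pip install ollama).
# `judge_pair`, the model tag, and the abbreviated prompt are illustrative, not the actual code.
import re
import ollama

def judge_pair(query: str, title: str, abstract: str,
               use_abstract: bool, model: str = "qwen3:8b") -> int:
    """Return a graded 0-2 relevance score for one query-entity pair."""
    entity_text = abstract if use_abstract else title   # the use_abstract switch
    prompt = (
        "Given a query and a knowledge entity, rate relevance from 0 to 2.\n"
        f"Query: {query}\nEntity: {entity_text}\n"
        'Respond only in the format "Final score: #"'
    )
    reply = ollama.chat(model=model, messages=[{"role": "user", "content": prompt}])
    match = re.search(r"Final score:\s*([0-2])", reply["message"]["content"])
    return int(match.group(1)) if match else 0           # default to 0 on malformed replies
```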
The script `laque_judgement.py` prompts the chosen LLM (Qwen3:8b or Llama4:Scout) to perform the same judgement task on LaQuE query-entity pairs.
The script `laque_analysis.py` prompts the chosen LLM (Qwen3:8b or Llama4:Scout) to analyse why users clicked on an entity even though the LLM judged it irrelevant to the query.
The LLM produces a separate list of reasons for each query-entity pair, which we then aggregate into six general distilled reasons.
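To make the aggregation concrete, the sketch below tallies how often each of the six distilled reasons occurs across the per-pair reason lists, mirroring the distribution figure above. The reason names are taken from the reasoning table; the JSON layout and file path are illustrative assumptions about how the per-pair output might be stored, not the repository's actual format.

```python
# Tally the six distilled reasons across query-entity pairs (illustrative sketch).
# Assumes one JSON object mapping "query||entity" keys to lists of distilled reason strings;
# this layout and the file path are assumptions, not the repository's actual format.
import json
from collections import Counter

DISTILLED_REASONS = [
    "Name or lexical similarity",
    "Prominent Result Bias",
    "Geographic name confusion",
    "Category or topical association",
    "Exploratory curiosity",
    "Familiarity Bias",
]

def reason_distribution(path: str) -> Counter:
    """Count how often each distilled reason was assigned across all pairs."""
    with open(path, encoding="utf-8") as f:
        reasons_per_pair = json.load(f)            # {"query||entity": ["reason", ...], ...}
    counts = Counter()
    for reasons in reasons_per_pair.values():
        counts.update(r for r in reasons if r in DISTILLED_REASONS)
    return counts

if __name__ == "__main__":
    for reason, count in reason_distribution("outputs/llm_reasoning/distilled_reasons.json").most_common():
        print(f"{reason}: {count}")
```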
Once the reasons are generated, run `laque_analysis_assigner.py` to assign a binary label (0 or 1) to each query-entity pair and reason, indicating whether the LLM thinks that reason applies to the user's click on the entity.
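A minimal sketch of that assignment step follows. The prompt wording, the `reason_applies` helper, and the Ollama-based call are illustrative assumptions; the actual implementation is `src/laque_analysis_assigner.py`.

```python
# Sketch of the binary reason assignment; see src/laque_analysis_assigner.py for the real script.
# Assumes the `ollama` Python package; the prompt wording and helper name are illustrative.
import ollama

def reason_applies(query: str, entity: str, reason: str, model: str = "qwen3:8b") -> int:
    """Return 1 if the LLM thinks `reason` explains the click on `entity` for `query`, else 0."""
    prompt = (
        f"Query: {query}\n"
        f"Clicked entity: {entity}\n"
        f"Candidate reason for the click: {reason}\n"
        "Does this reason apply to the user's click? Answer only 0 (no) or 1 (yes)."
    )
    reply = ollama.chat(model=model, messages=[{"role": "user", "content": prompt}])
    return 1 if reply["message"]["content"].strip().startswith("1") else 0
```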
For reference, the graded judgement prompt used with entity abstracts is:

```
Given a query and the abstract of a knowledge entity, you must choose one option:
0: The entity seems irrelevant to the query.
1: The entity seems relevant to the query but does not directly match it.
2: The entity seems highly relevant to the query or is an exact match.

Break down each query into these steps:
1. Consider what information the user is likely searching for with the query.
2. Measure how well the abstract matches a likely intent of the query (M), scored 0–2.
3. Assess whether the entity matches any reasonable interpretation of the query (I), scored 0–2.
4. Based on M and I, decide on a final score (O), scored 0–2.

Query: {}
Entity: {}
IMPORTANT: Your response must only be in the format of "Final score: #"
Relevant?
```








