Discrepancy in the reported percentage of flawed questions in FastChat MT-Bench

Hi,
I was reading the article [Inflection-2.5: meet the world's best personal AI](https://inflection.ai/inflection-2-5), which states that `nearly 25%` of `examples in the reasoning, math, and coding categories had incorrect reference solutions or questions with flawed premises`. I compared the FastChat MT-Bench questions with the Inflection corrected MT-Bench questions and found that only 4 questions differ.

I downloaded the FastChat MT-Bench questions using the following script from [FastChat LLM Judge](https://github.com/lm-sys/FastChat/tree/d04ce6453ae016d9e03626b679c07aa1388dcbee/fastchat/llm_judge):
```bash
python3 download_mt_bench_pregenerated.py
```
I then compared them with the [corrected version of MT-Bench](https://github.com/InflectionAI/Inflection-Benchmarks/blob/main/mt_bench_inf.jsonl) using `mergely`.
The [comparison](https://editor.mergely.com/nqfnxfkM) shows only 4 changes, and those 4 changes look correct with respect to the references provided.
Could you please help me understand how `nearly 25%` of the questions were found to be flawed?
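
For anyone who wants to reproduce the comparison without `mergely`, here is a minimal sketch of a record-by-record diff. It is not the exact procedure I used. It assumes both files are JSONL with one question per line and that each record carries `question_id`, `category`, `turns`, and an optional `reference` field; those field names, the category labels, and the local path `question.jsonl` are assumptions, so adjust them if either file differs.

```python
import json


def load_questions(path):
    """Read a JSONL file into a dict keyed by question_id (assumed field name)."""
    questions = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                record = json.loads(line)
                questions[record["question_id"]] = record
    return questions


def diff_questions(original, corrected, fields=("turns", "reference")):
    """Return {question_id: [names of fields that differ]} for changed records."""
    changed = {}
    for qid, orig in original.items():
        corr = corrected.get(qid)
        if corr is None:
            changed[qid] = ["<missing in corrected file>"]
            continue
        diffs = [name for name in fields if orig.get(name) != corr.get(name)]
        if diffs:
            changed[qid] = diffs
    return changed


def changed_share(original, changed, categories):
    """Fraction of questions in the given categories that appear in `changed`."""
    in_scope = [qid for qid, q in original.items() if q.get("category") in categories]
    if not in_scope:
        return 0.0
    return sum(qid in changed for qid in in_scope) / len(in_scope)


def report(original_path, corrected_path):
    """Print which questions changed and the share for the three categories."""
    original = load_questions(original_path)
    corrected = load_questions(corrected_path)
    changed = diff_questions(original, corrected)
    for qid in sorted(changed):
        print(qid, original[qid].get("category"), changed[qid])
    # Category labels below are assumed to match the values in the files.
    share = changed_share(original, changed, {"reasoning", "math", "coding"})
    print(f"changed share in reasoning/math/coding: {share:.1%}")
```

Calling `report("question.jsonl", "mt_bench_inf.jsonl")` with the two local files lists the changed question IDs and prints the changed share for those three categories, which can then be compared directly against the `nearly 25%` figure. The comparison is an exact match on field values, so whitespace-only edits would also count as changes.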
