DOC: Scoring Evaluations Blog by jsong468 · Pull Request #1617 · microsoft/PyRIT

jsong468 · 2026-04-15T19:34:03Z

Description

This PR adds a blog documenting our scorer evaluation background, story, and process!

Tests and Documentation

N/A

rlundeen2 · 2026-04-15T21:29:15Z

+3. If we want to run a new evaluation or update what's already in our JSONL registry, we move into step 3 and run the evaluation. Depending on the type of dataset and scorer, we use `ObjectiveScorerEvaluator` for true-false metrics or `HarmScorerEvaluator` for float scoring metrics.
+4. In step 4, we store the metrics and identities in the associated JSONL registry files.
+5. In step 5, we query metrics directly via `eval_hash`. We can look up metrics for a specific configuration, print them in a readable way using `ConsoleScorerPrinter`, and, more recently, tag our best scorers in `ScorerRegistry` so users can use them directly.
+


Can you add two more sections?

Using this to see Scoring Accuracy

Point to our docs, talk about how you can see this in a scenario run, maybe an image of scenario output from pyrit_shell

How you can evaluate your own scorers

Talk about how we have these human evaluations checked in. They're expensive to do, but they can evaluate pyrit scorers against other scorers and see how they do

to Rich's comments like a section that explains how users can use it (like a Scorer Evaluations and YOU!! section)

hannahwestra25 · 2026-04-15T22:02:57Z

+
+Here is a diagram of the full end-to-end process of our scorer evaluation framework.
+
+```{mermaid}


when i hit "view file" on github this diagram doesn't load ... ? idk if it's github or the diagram but could you post a screenshot of what it's supposed to be

yeah github doesn't show the mermaid diagrams but here's what it looks like

kinda small but readers can zoom in without resolution loss

hannahwestra25 · 2026-04-15T22:04:28Z

+We landed on a concept called scorer evaluation identifiers. At its core, it is a dataclass made up of many (though not all) of the things you can change about a scorer and the underlying LLM target that could potentially affect scoring behavior. More specifically, it is constructed from a more generic `ComponentIdentifier`, which uniquely identifies PyRIT component configurations through `params` (behavioral parameters that affect output) and `children` (child identifiers for components that "have a" different PyRIT component, such as targets for LLM scorers), while stripping out attributes that are less relevant to scoring logic when calculating a unique `eval_hash`. Using the `ScorerEvaluationIdentifier` and unique `eval_hash`, we can differentiate scoring setups from one another (if they are meaningfully different) and store evaluation metrics associated with that specific identifier and hash. Below is an example of how the `ComponentIdentifier` is built for `SelfAskTrueFalseScorer`:
+
+![Building a ComponentIdentifier](2026_04_14_build_identifier.png)
+


if possible, we should include screenshots that don't have the red squiggly line

hannahwestra25 · 2026-04-15T22:06:59Z

+The `eval_hash` allows us to easily look up metrics associated with a specific scoring configuration within our JSONL-formatted scorer metrics registries:
+
+![Metrics lookup by eval_hash](2026_04_14_metrics_lookup.png)
+


could we include what the eval identifier object would look like ? like if it were printed out and maybe compare two

jbolor21 · 2026-04-16T07:46:34Z

+
+To move beyond that, we took a different approach (quite novel at the time but now much more widespread): using an LLM as a judge. Instead of pattern-matching against keywords, we could have an LLM decide whether a response was actually harmful. This led us to build out a set of LLM-powered Scorers in PyRIT, powered by system prompts and scoring rubrics designed to automate jailbreak success decisions at scale.
+
+When we started using these scorers, they seemed to work reasonably well, but they were far from perfect. Nuanced responses sometimes tripped them up: off-topic replies could be flagged as harmful, and responses that merely *sounded* dangerous but were actually benign could fool the judge. We noticed these issues through small-scale experimentation and real-world red teaming operations, but our observations were just anecdotal. This raised a fundamental question: how do we actually *measure* how well our scorers perform?


nit: both of the examples are false positives should we also have a false negative example in there?

jbolor21 · 2026-04-16T07:51:04Z

+
+## Closing Thoughts
+
+One of the primary goals of PyRIT's new scorer evaluation framework is to provide a sound foundation upon which we can run experiments that tangibly improve our automated scorers. It allows us to track scorer metrics effectively and see how different scoring configurations stack up against each other.


nit, either in beginning or this part it'd be cool to mention "trust in scorers"

jsong468 added 4 commits April 14, 2026 16:07

add blog

fba5d19

Merge branch 'main' into scoring_blog

2cf1947

small improvements

a179a46

wording improvement

60c68ad

jsong468 marked this pull request as ready for review April 15, 2026 19:34

rlundeen2 reviewed Apr 15, 2026

View reviewed changes