Skip to content

DOC: Scoring Evaluations Blog#1617

Open
jsong468 wants to merge 4 commits intomicrosoft:mainfrom
jsong468:scoring_blog
Open

DOC: Scoring Evaluations Blog#1617
jsong468 wants to merge 4 commits intomicrosoft:mainfrom
jsong468:scoring_blog

Conversation

@jsong468
Copy link
Copy Markdown
Contributor

Description

This PR adds a blog documenting our scorer evaluation background, story, and process!

Tests and Documentation

N/A

@jsong468 jsong468 marked this pull request as ready for review April 15, 2026 19:34
Comment thread doc/blog/2026_04_14.md
Comment thread doc/blog/2026_04_14.md
3. If we want to run a new evaluation or update what's already in our JSONL registry, we move into step 3 and run the evaluation. Depending on the type of dataset and scorer, we use `ObjectiveScorerEvaluator` for true-false metrics or `HarmScorerEvaluator` for float scoring metrics.
4. In step 4, we store the metrics and identities in the associated JSONL registry files.
5. In step 5, we query metrics directly via `eval_hash`. We can look up metrics for a specific configuration, print them in a readable way using `ConsoleScorerPrinter`, and, more recently, tag our best scorers in `ScorerRegistry` so users can use them directly.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you add two more sections?

Using this to see Scoring Accuracy

Point to our docs, talk about how you can see this in a scenario run, maybe an image of scenario output from pyrit_shell

How you can evaluate your own scorers

Talk about how we have these human evaluations checked in. They're expensive to do, but they can evaluate pyrit scorers against other scorers and see how they do

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • to Rich's comments like a section that explains how users can use it (like a Scorer Evaluations and YOU!! section)

Comment thread doc/blog/2026_04_14.md
Comment thread doc/blog/2026_04_14.md
Comment thread doc/blog/2026_04_14.md
Comment thread doc/blog/2026_04_14.md
Comment thread doc/blog/2026_04_14.md

Here is a diagram of the full end-to-end process of our scorer evaluation framework.

```{mermaid}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

when i hit "view file" on github this diagram doesn't load ... ? idk if it's github or the diagram but could you post a screenshot of what it's supposed to be

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah github doesn't show the mermaid diagrams but here's what it looks like

Image

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

kinda small but readers can zoom in without resolution loss

Comment thread doc/blog/2026_04_14.md
We landed on a concept called scorer evaluation identifiers. At its core, it is a dataclass made up of many (though not all) of the things you can change about a scorer and the underlying LLM target that could potentially affect scoring behavior. More specifically, it is constructed from a more generic `ComponentIdentifier`, which uniquely identifies PyRIT component configurations through `params` (behavioral parameters that affect output) and `children` (child identifiers for components that "have a" different PyRIT component, such as targets for LLM scorers), while stripping out attributes that are less relevant to scoring logic when calculating a unique `eval_hash`. Using the `ScorerEvaluationIdentifier` and unique `eval_hash`, we can differentiate scoring setups from one another (if they are meaningfully different) and store evaluation metrics associated with that specific identifier and hash. Below is an example of how the `ComponentIdentifier` is built for `SelfAskTrueFalseScorer`:

![Building a ComponentIdentifier](2026_04_14_build_identifier.png)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if possible, we should include screenshots that don't have the red squiggly line

Comment thread doc/blog/2026_04_14.md
The `eval_hash` allows us to easily look up metrics associated with a specific scoring configuration within our JSONL-formatted scorer metrics registries:

![Metrics lookup by eval_hash](2026_04_14_metrics_lookup.png)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

could we include what the eval identifier object would look like ? like if it were printed out and maybe compare two

Comment thread doc/blog/2026_04_14.md
Comment thread doc/blog/2026_04_14.md

To move beyond that, we took a different approach (quite novel at the time but now much more widespread): using an LLM as a judge. Instead of pattern-matching against keywords, we could have an LLM decide whether a response was actually harmful. This led us to build out a set of LLM-powered Scorers in PyRIT, powered by system prompts and scoring rubrics designed to automate jailbreak success decisions at scale.

When we started using these scorers, they seemed to work reasonably well, but they were far from perfect. Nuanced responses sometimes tripped them up: off-topic replies could be flagged as harmful, and responses that merely *sounded* dangerous but were actually benign could fool the judge. We noticed these issues through small-scale experimentation and real-world red teaming operations, but our observations were just anecdotal. This raised a fundamental question: how do we actually *measure* how well our scorers perform?
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: both of the examples are false positives should we also have a false negative example in there?

Comment thread doc/blog/2026_04_14.md

## Closing Thoughts

One of the primary goals of PyRIT's new scorer evaluation framework is to provide a sound foundation upon which we can run experiments that tangibly improve our automated scorers. It allows us to track scorer metrics effectively and see how different scoring configurations stack up against each other.
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit, either in beginning or this part it'd be cool to mention "trust in scorers"

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants