Conversation
| 3. If we want to run a new evaluation or update what's already in our JSONL registry, we move into step 3 and run the evaluation. Depending on the type of dataset and scorer, we use `ObjectiveScorerEvaluator` for true-false metrics or `HarmScorerEvaluator` for float scoring metrics. | ||
| 4. In step 4, we store the metrics and identities in the associated JSONL registry files. | ||
| 5. In step 5, we query metrics directly via `eval_hash`. We can look up metrics for a specific configuration, print them in a readable way using `ConsoleScorerPrinter`, and, more recently, tag our best scorers in `ScorerRegistry` so users can use them directly. | ||
|
|
There was a problem hiding this comment.
Can you add two more sections?
Using this to see Scoring Accuracy
Point to our docs, talk about how you can see this in a scenario run, maybe an image of scenario output from pyrit_shell
How you can evaluate your own scorers
Talk about how we have these human evaluations checked in. They're expensive to do, but they can evaluate pyrit scorers against other scorers and see how they do
There was a problem hiding this comment.
- to Rich's comments like a section that explains how users can use it (like a Scorer Evaluations and YOU!! section)
|
|
||
| Here is a diagram of the full end-to-end process of our scorer evaluation framework. | ||
|
|
||
| ```{mermaid} |
There was a problem hiding this comment.
when i hit "view file" on github this diagram doesn't load ... ? idk if it's github or the diagram but could you post a screenshot of what it's supposed to be
There was a problem hiding this comment.
kinda small but readers can zoom in without resolution loss
| We landed on a concept called scorer evaluation identifiers. At its core, it is a dataclass made up of many (though not all) of the things you can change about a scorer and the underlying LLM target that could potentially affect scoring behavior. More specifically, it is constructed from a more generic `ComponentIdentifier`, which uniquely identifies PyRIT component configurations through `params` (behavioral parameters that affect output) and `children` (child identifiers for components that "have a" different PyRIT component, such as targets for LLM scorers), while stripping out attributes that are less relevant to scoring logic when calculating a unique `eval_hash`. Using the `ScorerEvaluationIdentifier` and unique `eval_hash`, we can differentiate scoring setups from one another (if they are meaningfully different) and store evaluation metrics associated with that specific identifier and hash. Below is an example of how the `ComponentIdentifier` is built for `SelfAskTrueFalseScorer`: | ||
|
|
||
|  | ||
|
|
There was a problem hiding this comment.
if possible, we should include screenshots that don't have the red squiggly line
| The `eval_hash` allows us to easily look up metrics associated with a specific scoring configuration within our JSONL-formatted scorer metrics registries: | ||
|
|
||
|  | ||
|
|
There was a problem hiding this comment.
could we include what the eval identifier object would look like ? like if it were printed out and maybe compare two
|
|
||
| To move beyond that, we took a different approach (quite novel at the time but now much more widespread): using an LLM as a judge. Instead of pattern-matching against keywords, we could have an LLM decide whether a response was actually harmful. This led us to build out a set of LLM-powered Scorers in PyRIT, powered by system prompts and scoring rubrics designed to automate jailbreak success decisions at scale. | ||
|
|
||
| When we started using these scorers, they seemed to work reasonably well, but they were far from perfect. Nuanced responses sometimes tripped them up: off-topic replies could be flagged as harmful, and responses that merely *sounded* dangerous but were actually benign could fool the judge. We noticed these issues through small-scale experimentation and real-world red teaming operations, but our observations were just anecdotal. This raised a fundamental question: how do we actually *measure* how well our scorers perform? |
There was a problem hiding this comment.
nit: both of the examples are false positives should we also have a false negative example in there?
|
|
||
| ## Closing Thoughts | ||
|
|
||
| One of the primary goals of PyRIT's new scorer evaluation framework is to provide a sound foundation upon which we can run experiments that tangibly improve our automated scorers. It allows us to track scorer metrics effectively and see how different scoring configurations stack up against each other. |
There was a problem hiding this comment.
nit, either in beginning or this part it'd be cool to mention "trust in scorers"

Description
This PR adds a blog documenting our scorer evaluation background, story, and process!
Tests and Documentation
N/A