
MAINT: Add labels to attack results#1624

Open
behnam-o wants to merge 7 commits into microsoft:main from behnam-o:users/behnam/attack-result-labels

@behnam-o
Contributor

This PR adds labels to attack results and allows us to remove them from individual message pieces.

Labels are only really meaningful in the context of an attack, and since all message pieces associated with an attack share the same labels, it makes more sense to store them once on the attack result.
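To make the data-model change concrete, here is a minimal sketch, assuming simplified, hypothetical `MessagePiece` and `AttackResult` dataclasses (not PyRIT's actual classes) and string-valued labels. The hypothetical helper `lift_labels` collapses the per-piece labels into the single set that would live on the attack result, and refuses to proceed if the pieces disagree, since the PR's premise is that they never should.

```python
from dataclasses import dataclass, field


@dataclass
class MessagePiece:
    # Hypothetical, simplified stand-in for a stored message piece.
    conversation_id: str
    text: str
    labels: dict[str, str] = field(default_factory=dict)


@dataclass
class AttackResult:
    # Hypothetical, simplified stand-in for an attack result.
    conversation_id: str
    labels: dict[str, str] = field(default_factory=dict)


def lift_labels(pieces: list[MessagePiece]) -> dict[str, str]:
    """Return the one label set shared by every piece of a single attack.

    Raises ValueError if the pieces carry different labels, because the
    assumption behind moving labels to the attack result is that they match.
    """
    distinct = {tuple(sorted(p.labels.items())) for p in pieces}
    if len(distinct) > 1:
        raise ValueError("message pieces of one attack have differing labels")
    return dict(distinct.pop()) if distinct else {}


pieces = [
    MessagePiece("conv-1", "hello", {"op": "red-team-42"}),
    MessagePiece("conv-1", "reply", {"op": "red-team-42"}),
]
result = AttackResult("conv-1", labels=lift_labels(pieces))
print(result.labels)  # {'op': 'red-team-42'}
```

Storing the labels once on the result removes the duplication across pieces and makes "which labels does this attack have?" a single-field lookup.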

behnam-o changed the title from "MAINT Add labels to attack results" to "MAINT: Add labels to attack results" on Apr 16, 2026
behnam-o marked this pull request as ready for review on April 16, 2026 at 17:06
```python
last_preview = stats.last_message_preview
labels = dict(stats.labels) if stats.labels else {}

# Merge attack-result labels with conversation-level labels.
```
Contributor

Confused about this comment. Aren't attack-result labels conversation-level?
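For reference, a merge of two label sources might look like the sketch below. It reuses the None-safe copy pattern from the diff (`dict(...) if ... else {}`); the function name `merge_labels` and the rule that attack-result labels win on a key conflict are assumptions for illustration, not what the PR necessarily does.

```python
from typing import Optional


def merge_labels(
    attack_labels: Optional[dict[str, str]],
    conversation_labels: Optional[dict[str, str]],
) -> dict[str, str]:
    # Copy the conversation-level labels (None-safe, so the input is never mutated).
    merged = dict(conversation_labels) if conversation_labels else {}
    # Assumption: on a key conflict, the attack-result label takes precedence.
    if attack_labels:
        merged.update(attack_labels)
    return merged


print(merge_labels({"op": "a"}, {"op": "b", "user": "x"}))  # {'op': 'a', 'user': 'x'}
```

If labels end up living only on the attack result, the second argument is always empty and the merge reduces to a copy, which is essentially the point the question above raises.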

```
Get the SQL Azure implementation for filtering AttackResults by labels.

Matches if the labels are found on the AttackResultEntry directly
OR on an associated PromptMemoryEntry (via conversation_id).
```
Contributor

Can we get rid of PromptMemoryEntry labels?
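The matching rule in the docstring can be sketched in plain Python. This is an in-memory illustration, not the SQL Azure query; the row types are hypothetical simplifications of `AttackResultEntry` and `PromptMemoryEntry`, and "matches" is assumed to mean every requested key/value pair is present.

```python
from dataclasses import dataclass, field


@dataclass
class AttackResultEntry:
    # Hypothetical, simplified row; real entries have many more columns.
    conversation_id: str
    labels: dict[str, str] = field(default_factory=dict)


@dataclass
class PromptMemoryEntry:
    # Hypothetical, simplified row.
    conversation_id: str
    labels: dict[str, str] = field(default_factory=dict)


def has_labels(candidate: dict[str, str], wanted: dict[str, str]) -> bool:
    # Every requested key must be present with the requested value.
    return all(k in candidate and candidate[k] == v for k, v in wanted.items())


def filter_by_labels(
    results: list[AttackResultEntry],
    pieces: list[PromptMemoryEntry],
    wanted: dict[str, str],
) -> list[AttackResultEntry]:
    # Conversations in which at least one prompt entry carries the labels.
    piece_hits = {p.conversation_id for p in pieces if has_labels(p.labels, wanted)}
    # Match on the attack result itself OR via an associated prompt entry.
    return [
        r for r in results
        if has_labels(r.labels, wanted) or r.conversation_id in piece_hits
    ]


results = [
    AttackResultEntry("c1", {"op": "x"}),
    AttackResultEntry("c2"),
    AttackResultEntry("c3"),
]
pieces = [PromptMemoryEntry("c2", {"op": "x"}), PromptMemoryEntry("c3", {"op": "y"})]
matches = filter_by_labels(results, pieces, {"op": "x"})
print([r.conversation_id for r in matches])  # ['c1', 'c2']
```

Dropping `PromptMemoryEntry` labels would remove the `piece_hits` branch entirely, leaving a single check on `r.labels`.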

Contributor

I think I'd rather route everything through attack-result labels only, to simplify things, and provide a data migration path for existing databases.
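A data migration along those lines could backfill attack-result labels from the prompt entries that share their `conversation_id`. The sketch below uses an in-memory SQLite database with a hypothetical two-table schema (`attack_results`, `prompt_memory_entries`, labels stored as JSON text); the real table and column names, and the real migration tooling, may differ.

```python
import json
import sqlite3


def backfill_attack_result_labels(conn: sqlite3.Connection) -> int:
    """Copy labels from prompt entries onto attack results that have none.

    Hypothetical schema: both tables keep labels in a JSON text column and
    are joined on conversation_id. Returns the number of rows updated.
    """
    rows = conn.execute(
        "SELECT a.id, p.labels "
        "FROM attack_results AS a "
        "JOIN prompt_memory_entries AS p ON p.conversation_id = a.conversation_id "
        "WHERE a.labels IS NULL OR a.labels = '{}'"
    ).fetchall()
    # Several prompt entries can map to one attack result; union their labels.
    merged: dict[int, dict[str, str]] = {}
    for attack_id, raw in rows:
        labels = json.loads(raw) if raw else {}
        if labels:
            merged.setdefault(attack_id, {}).update(labels)
    for attack_id, labels in merged.items():
        conn.execute(
            "UPDATE attack_results SET labels = ? WHERE id = ?",
            (json.dumps(labels, sort_keys=True), attack_id),
        )
    conn.commit()
    return len(merged)


conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE attack_results (id INTEGER PRIMARY KEY, conversation_id TEXT, labels TEXT);
CREATE TABLE prompt_memory_entries (id INTEGER PRIMARY KEY, conversation_id TEXT, labels TEXT);
INSERT INTO attack_results (conversation_id, labels) VALUES ('c1', NULL), ('c2', '{"op": "keep"}');
INSERT INTO prompt_memory_entries (conversation_id, labels) VALUES
    ('c1', '{"op": "x"}'), ('c1', '{"op": "x"}'), ('c2', '{"op": "other"}');
""")
print(backfill_attack_result_labels(conn))  # 1
```

Because it only touches attack results whose labels are still empty, running it a second time is a no-op, which is a convenient property for a migration that may be retried.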

Contributor Author

I'm putting that in a separate PR (deprecating labels on message pieces and their memory entries); once that lands, this OR will be removed too.

Maybe I'm being over-cautious? Should I do the flip in one go?

2 participants