[FIX Breaking]: PrependedConversationConfig and Attack Param Consistency #1299
base: main
Conversation
    """
    if not prepended_conversation:
        return 0
    return sum(1 for msg in prepended_conversation if msg.role == "assistant")
If it's prepended, should we count simulated_assistant instead of assistant?
    For PromptChatTarget:
    - Adds prepended messages to memory with simulated_assistant role
    - All messages get new UUIDs
I am fairly sure the docstring doesn't resolve properly like this... might want to check.
    For non-chat PromptTarget:
    - If config.non_chat_target_behavior="normalize_first_turn": normalizes
      conversation to string and prepends to context.next_message
    - If config.non_chat_target_behavior="raise": raises ValueError
same concern as above, but you probably also want to put the code in backquotes
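One way the bullet list could be written so it renders: blank line before the list, and literal values in double backquotes (assuming the project builds docs with Sphinx/reST; the exact markup depends on its doc toolchain):

```python
def process_prepended_conversation():
    """Process a prepended conversation before sending it to the target.

    For non-chat ``PromptTarget``:

    - If ``config.non_chat_target_behavior="normalize_first_turn"``: normalizes
      the conversation to a string and prepends it to ``context.next_message``.
    - If ``config.non_chat_target_behavior="raise"``: raises ``ValueError``.
    """
```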
    context.memory_labels = combine_dict(existing_dict=memory_labels, new_dict=context.memory_labels)

    # Initialize conversation state
    state = ConversationState()
    logger.debug(f"Preparing conversation with ID: {conversation_id}")
    prepended_conversation = context.prepended_conversation

    # Do not proceed if no history is provided
    if not prepended_conversation:
Perhaps not to be addressed here, but something I think about a lot is the following:
Operator creates 10 turns with crescendo. Later on, they branch off manually from that conversation after 5 turns and do something else (could be manual probing or another attack). How does that work in terms of metadata (like the attack identifier, converter identifier, etc.) and memory labels?
Ideally, I would hope the metadata of each piece shows what was applied to it, but would the memory labels be a combination of the old and new ones or only the new ones? From what I'm seeing in this PR I'm assuming it's just the new ones. I think (?) that makes sense because you wouldn't want it to have a label of an old op, for example, if you prepended the conversation from there into a new op.
[wasn't necessarily for these lines, but the combine_dict reminded me of this doubt]
    if config.non_chat_target_behavior == "raise":
        raise ValueError(
            "prepended_conversation requires target to be a PromptChatTarget. "
Should this say which target?
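A hedged sketch of including the concrete target class in the error message so the operator knows which target failed (class names here are illustrative stand-ins, not the real PyRIT types):

```python
class PromptChatTarget:
    """Stand-in for the real PromptChatTarget base class."""


class SomeNonChatTarget:
    """Hypothetical non-chat target used only for illustration."""


def check_target(target):
    # Name the concrete class in the message so the error points at the
    # offending target instead of a generic "not a PromptChatTarget".
    if not isinstance(target, PromptChatTarget):
        raise ValueError(
            f"prepended_conversation requires target to be a PromptChatTarget, "
            f"but got {type(target).__name__}."
        )
```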
    last_piece = valid_messages[-1].get_piece()
    if last_piece.api_role == "assistant":
        state.last_assistant_message_scores = list(
            self._memory.get_prompt_scores(prompt_ids=[str(last_piece.original_prompt_id)])
the message may have multiple pieces with scores, right?
    class PrependedConversationConfig:
        """
        Configuration for controlling how prepended conversations are processed before
        being sent to targets.
to all targets? For adversarial_chat it's fine not to convert, right?
    # - "normalize_first_turn": Normalize the prepended conversation into a string and
    #   store it in ConversationState.normalized_prepended_context. This context will be
    #   prepended to the first message sent to the target. Uses objective_target_context_normalizer
    #   if provided, otherwise falls back to ConversationContextNormalizer.
Is that for the many-shot jailbreak case with a PromptTarget?
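For the non-chat case, a normalizer along these lines seems plausible: flatten the role/content pairs into one string that gets prepended to the first turn (a hypothetical sketch; the real ConversationContextNormalizer may format things differently):

```python
def normalize_conversation(messages):
    """Flatten a prepended conversation into a single string so it can be
    prepended to the first message sent to a non-chat PromptTarget.
    Hypothetical normalizer; role prefixes are illustrative."""
    lines = []
    for msg in messages:
        role = (
            "Assistant"
            if msg["role"] in ("assistant", "simulated_assistant")
            else "User"
        )
        lines.append(f"{role}: {msg['content']}")
    return "\n".join(lines)
```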
    return self.message_normalizer or ConversationContextNormalizer()

    @classmethod
    def default(cls) -> "PrependedConversationConfig":
if you import annotations from future you should be able to remove the double quotes.
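What the comment refers to: with postponed evaluation of annotations (PEP 563), a class can name itself in a return annotation without quoting (a minimal sketch, not the real config class):

```python
from __future__ import annotations  # must precede all other imports

from dataclasses import dataclass


@dataclass
class PrependedConversationConfig:
    @classmethod
    def default(cls) -> PrependedConversationConfig:  # no quotes needed
        return cls()
```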
same below
    @dataclass
    class PrependedConversationConfig:
all of this is rather complicated. Don't we need documentation?
    prepended_conversation=context.prepended_conversation,
    request_converters=self._request_converters,
    response_converters=self._response_converters,
    prepended_conversation_config=self._prepended_conversation_config,
prepended_conversation used to come from the context; now it's an arg here. Is that a conscious choice? From a quick glance, it seems like a better fit in the context?
    - refused_text: The text that was refused (from context.next_message if
      there's a refusal), empty string if no refusal
    - objective_score: The objective score if found, None otherwise
again, pretty sure the website won't render with this. Please check
    )

    # The first message sent should contain the next_message content with image preserved
    first_call = mock_normalizer.send_prompt_async.call_args_list[0]
should we also assert that it's going to the objective target?
    objective_target=mock_chat_target,
    attack_adversarial_config=adversarial_config,
    attack_scoring_config=scoring_config,
    max_turns=5,
do we want to make any assertions about the other turns?
    conversation = list(memory.get_conversation(conversation_id=conversation_id))

    # Should have prepended messages in memory
    assert len(conversation) >= 2, f"Expected at least 2 prepended messages, got {len(conversation)}"
at least? We know exactly how long it is (2) so why at least?
PrependedConversationConfig
Added a new PrependedConversationConfig dataclass to configure how prepended conversations are handled. Supports configurable behavior for non-chat targets (normalize_first_turn vs raise) and allows specifying which roles should have converters applied. Includes factory methods for common configurations.
Consistent next_message Handling
ConversationManager Refactoring
Test Coverage
Breaking Changes
Nothing major IMO; I think both of these are really internal.