Skip to content

Conversation

@rlundeen2
Copy link
Contributor

@rlundeen2 rlundeen2 commented Jan 2, 2026

PrependedConversationConfig

Added new PrependedConversationConfig dataclass to configure how prepended conversations are handled. Supports configurable behavior for non-chat targets (normalize_first_turn vs raise), and allows specifying which roles should have converters applied. Includes factory methods for common configurations

Consistent next_message Handling

  • Fixed RedTeamingAttack._generate_next_prompt_async to return Message instead of str, preserving multimodal content (images, audio)
  • Ensures next_message is sent as the first message to targets across all attacks
  • Multimodal content is now properly preserved end-to-end
  • Consistent prepended_conversation Handling (e.g some attacks used to extract next_message if the last prepended message was user, but others did not. Now it is never auto-extracted)

ConversationManager Refactoring

  • Refactored initialize_context_async to accept context object with cleaner parameter passing
  • Moved prepended conversation processing logic into the manager
  • Added add_prepended_conversation_to_memory_async for explicit memory operations
  • Started handling non PromptChatTargets
  • The logic for this used to be quite difficult to follow. prepended_conversation was updated multiple places. Now it should all be handled centrally.

Test Coverage

  • Added test_attack_parameter_consistency.py with cross-attack behavioral tests
  • Rewrote test_conversation_manager.py to match refactored API
  • Added comprehensive tests for PrependedConversationConfig settings
  • Fixed numerous tests affected by API changes

Breaking Changes

Nothing major IMO; I think both of these are really internal.

  • ConversationManager.initialize_context_async signature changed to accept context parameter
  • generate_next_prompt_async in RedTeamingAttack now returns Message instead of str

@rlundeen2 rlundeen2 changed the title [FIX Breaking]: Attack Parameter Consistency and ConversationManager Refactoring [FIX Breaking]: PrependedconversationConfig and Attack Param Consistency Jan 2, 2026
@romanlutz romanlutz changed the title [FIX Breaking]: PrependedconversationConfig and Attack Param Consistency [FIX Breaking]: PrependedConversationConfig and Attack Param Consistency Jan 2, 2026
"""
if not prepended_conversation:
return 0
return sum(1 for msg in prepended_conversation if msg.role == "assistant")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if it's prepended, should we could for simulated_assistant instead of assistant?

Comment on lines +284 to +286
For PromptChatTarget:
- Adds prepended messages to memory with simulated_assistant role
- All messages get new UUIDs
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am fairly sure the docstring doesn't resolve properly like this... might want to checkt

Comment on lines +288 to +291
For non-chat PromptTarget:
- If config.non_chat_target_behavior="normalize_first_turn": normalizes
conversation to string and prepends to context.next_message
- If config.non_chat_target_behavior="raise": raises ValueError
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same concern as above, but you probably also want to put the code in backquotes

Comment on lines +313 to 318
context.memory_labels = combine_dict(existing_dict=memory_labels, new_dict=context.memory_labels)

# Initialize conversation state
state = ConversationState()
logger.debug(f"Preparing conversation with ID: {conversation_id}")
prepended_conversation = context.prepended_conversation

# Do not proceed if no history is provided
if not prepended_conversation:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps not to be addressed here, but something I think about a lot is the following:

Operator creates 10 turns with crescendo. Later on, they branch off manually from that conversation after 5 turns and do something else (could be manual probing or another attack). How does that work in terms of metadata (like the attack identifier, converter identifier, etc.) and memory labels?

Ideally, I would hope the metadata of each piece shows what was applied to it, but would the memory labels be a combination of the old and new ones or only the new ones? From what I'm seeing in this PR I'm assuming it's just the new ones. I think (?) that makes sense because you wouldn't want it to have a label of an old op, for example, if you prepended the conversation from there into a new op.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[wasn't necessarily for these lines, but the combine_dict reminded me of this doubt]


if config.non_chat_target_behavior == "raise":
raise ValueError(
"prepended_conversation requires target to be a PromptChatTarget. "
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this say which target?

last_piece = valid_messages[-1].get_piece()
if last_piece.api_role == "assistant":
state.last_assistant_message_scores = list(
self._memory.get_prompt_scores(prompt_ids=[str(last_piece.original_prompt_id)])
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the message may have multiple pieces with scores, right?

class PrependedConversationConfig:
"""
Configuration for controlling how prepended conversations are processed before
being sent to targets.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

to all targets? For adversarial_chat it's fine not to convert, right?

Comment on lines +38 to +41
# - "normalize_first_turn": Normalize the prepended conversation into a string and
# store it in ConversationState.normalized_prepended_context. This context will be
# prepended to the first message sent to the target. Uses objective_target_context_normalizer
# if provided, otherwise falls back to ConversationContextNormalizer.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is that for the many-shot jailbreak case with a PromptTarget?

return self.message_normalizer or ConversationContextNormalizer()

@classmethod
def default(cls) -> "PrependedConversationConfig":
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if you import annotations from future you should be able to remove the double quotes.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same below



@dataclass
class PrependedConversationConfig:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

all of this is rather complicated. Don't we need documentation?

prepended_conversation=context.prepended_conversation,
request_converters=self._request_converters,
response_converters=self._response_converters,
prepended_conversation_config=self._prepended_conversation_config,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

prepended_conversation used to come from the context, not it's an arg here. Is that a conscious choice? From a quick glance, it sounds like a fit in the context?

Comment on lines +701 to +703
- refused_text: The text that was refused (from context.next_message if
there's a refusal), empty string if no refusal
- objective_score: The objective score if found, None otherwise
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

again, pretty sure the website won't render with this. Please check

)

# The first message sent should contain the next_message content with image preserved
first_call = mock_normalizer.send_prompt_async.call_args_list[0]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should we also assert that it's going to the objective target?

objective_target=mock_chat_target,
attack_adversarial_config=adversarial_config,
attack_scoring_config=scoring_config,
max_turns=5,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we want to make any assertions about the other turns?

conversation = list(memory.get_conversation(conversation_id=conversation_id))

# Should have prepended messages in memory
assert len(conversation) >= 2, f"Expected at least 2 prepended messages, got {len(conversation)}"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

at least? We know exactly how long it is (2) so why at least?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants