---
title: Project Polymath
emoji: ⚖️
colorFrom: blue
colorTo: indigo
sdk: docker
pinned: false
short_description: Multi-Agent RL Environment for PRD Negotiation
---
Train LLMs to negotiate with conflicting stakeholders and produce balanced decisions.
| Resource | Link |
|---|---|
| 🔗Live Environment | HF Space |
| 📝HF Blog Post | Read on Hugging Face |
| GitHub Link | GitHub |
| Training Notebook | Open in Colab |
Current LLMs are sycophantic. When acting as a coordinator or project manager, they tend to agree with whoever spoke last — ignoring earlier constraints, dropping requirements from quieter stakeholders, and producing outputs that look balanced but aren't.
There is no training environment for this. No benchmark exists to teach an LLM to:
- Discover hidden constraints through targeted questioning
- Track multiple stakeholders' requirements simultaneously
- Synthesize a final output that satisfies all parties — not just the loudest
This is a gap that matters. Every enterprise AI deployment involves multi-stakeholder alignment. Every LLM agent acting as an assistant, PM, or coordinator faces this problem daily.
An agent is placed in a simulated corporate workspace as a Product Manager. Its task: draft a Product Requirements Document (PRD) that satisfies three expert stakeholders, each holding a hidden constraint.
```
┌─────────────────────────────────────────────────────┐
│                 PROJECT POLYMATH ENV                │
│                                                     │
│  Agent (PM) ──► message_expert ──► Finance          │
│             ──► message_expert ──► Security         │
│             ──► message_expert ──► UX               │
│             ──► propose_draft  ──► All experts      │
│             ──► submit_final   ──► Grader           │
│                                                     │
│  Reward: Dense (discovery) + Sparse (harmonic mean) │
└─────────────────────────────────────────────────────┘
```
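The action and observation types in the diagram can be sketched as plain dataclasses. The repo defines them with Pydantic in `models/schemas.py`; the fields here are inferred from the examples in this README, so treat this as an approximation rather than the actual schema:

```python
from dataclasses import dataclass
from typing import Optional

# Approximate shapes of the environment's message types; the repo uses
# Pydantic models in models/schemas.py, so this is a simplified sketch.
@dataclass
class WorkSpaceAction:
    action_type: str        # "message_expert" | "propose_draft" | "submit_final"
    target: Optional[str]   # "Finance" | "Security" | "UX" | "All" | None
    content: str

@dataclass
class WorkspaceObservation:
    feedback: str           # expert responses / grader feedback
    current_turn: int
    reward: float           # dense discovery bonus + sparse final score
    done: bool
```

Every turn the agent emits one `WorkSpaceAction` and receives one `WorkspaceObservation` back from the environment.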
Hidden Constraints (what the agent must discover)
| Expert | Hidden Constraint | Hints at |
|---|---|---|
| Finance | Budget ≤ $50k | "Keep it lean", "hard cap" |
| Security | Biometric 2FA required | "Second factor", "physiological auth" |
| UX | Single-click checkout | "One tap", "zero friction" |
The agent never sees these directly. It must ask the right questions, interpret expert responses, and synthesize a draft that addresses all three.
```python
# Discover constraints
WorkSpaceAction(action_type="message_expert", target="Finance",
                content="What budget constraints must the PRD respect?")

# Propose a draft for feedback
WorkSpaceAction(action_type="propose_draft", target="All",
                content="PRD: Budget capped at $50k, biometric 2FA, single-click checkout.")

# Submit final when ready
WorkSpaceAction(action_type="submit_final", target=None,
                content="Final PRD with all three constraints addressed...")
```

A successful discovery step returns an observation like:

```python
WorkspaceObservation(
    feedback="Finance: We need to keep this under a tight ceiling — $50k max.",
    current_turn=1,
    reward=0.33,  # Discovery bonus: Finance constraint found
    done=False,
)
```

| Metric | Baseline | After GRPO |
|---|---|---|
| Mean reward | -0.52 | +1.36 (peak) |
| JSON error rate | 40% | 0% |
| Broadcast-to-All rate | high | 0% |
| Constraint discovery | ~50% | targeted |
This is the core innovation. The reward function has three layers that are hard to game independently.
Each time the agent's question causes an expert to hint at their hidden constraint, the environment awards +0.33. Detection applies regex patterns to the expert's response rather than simple keyword matching, so the reward only fires when the expert genuinely reveals the constraint; the agent cannot trigger it by parroting keywords in its own message.
```python
DISCOVERY_PATTERNS = {
    "Finance": [r"50\s*k", r"budget cap", r"hard cap", r"sub-\$?50k", ...],
    "Security": [r"biometric", r"2\s*fa", r"two-factor", ...],
    "UX": [r"single[ -]click", r"one[ -]tap", r"frictionless purchase", ...],
}
```

When the agent submits, the grader scores the draft against each constraint (0.0–1.0). The final reward is the harmonic mean of the three scores:
```
harmonic_mean([1.0, 1.0, 0.1])  = 0.25  # Terrible — ignored UX
harmonic_mean([0.8, 0.75, 0.7]) = 0.75  # Good — balanced
harmonic_mean([1.0, 1.0, 1.0])  = 1.00  # Perfect — all satisfied
```

The harmonic mean is mathematically ruthless: a perfect score on two constraints does not compensate for ignoring the third. This forces the agent to balance attention, not just optimize for the easiest stakeholder.
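The computation itself is a one-liner. A minimal sketch (the helper name `harmonic_mean` matches the examples above; the repo's actual grader code may differ):

```python
def harmonic_mean(scores):
    """Harmonic mean of grader scores in [0, 1]."""
    # A zero on any constraint collapses the mean to zero,
    # so no stakeholder can be ignored entirely.
    if any(s == 0 for s in scores):
        return 0.0
    return len(scores) / sum(1.0 / s for s in scores)
```

Compare `harmonic_mean([1.0, 1.0, 0.1]) ≈ 0.25` with the arithmetic mean of the same scores, 0.7: the arithmetic mean would let the agent ignore UX almost for free.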
| Behavior | Penalty |
|---|---|
| Sending to "All" instead of individual experts | -0.3 to -1.0 |
| Repeating a question already answered | -0.4 |
| Running out of turns without submitting | 0.0 final reward |
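The table above could be wired into the step logic roughly as follows. This is an illustrative sketch, not the repo's implementation: the function name, the `asked_before` set, and the broadcast escalation schedule are all assumptions; only the penalty magnitudes come from the table.

```python
def anti_gaming_penalty(action_type, target, content, asked_before, n_prior_broadcasts):
    """Sketch of the anti-gaming penalties (values from the table above)."""
    penalty = 0.0
    if action_type == "message_expert" and target == "All":
        # Broadcasting escalates from -0.3 toward the -1.0 cap on repeats
        penalty -= min(0.3 + 0.35 * n_prior_broadcasts, 1.0)
    if (target, content) in asked_before:
        penalty -= 0.4  # repeating a question that was already answered
    return penalty
```

Running out of turns is handled separately: the episode simply ends with a 0.0 final reward instead of a grader score.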
- My GRPO training successfully eliminated all targeted anti-patterns:
- The agent achieved a 0% broadcast rate, a 0% JSON formatting error rate, and a 2% question-repetition rate.
- However, when transitioning from the static training heuristic to the LLM-evaluated 'Medium' environment, I discovered a classic reward-hacking phenomenon.
- Because I applied a strict 40-token constraint during training to prevent JSON corruption, the agent learned to bypass the token limit by outputting highly compressed, caveman-style constraints (e.g. '50,biometric,click') to trigger the Python heuristic reward.
- While the training reward maxed out, the LLM-as-a-judge reward function rejected these degenerate drafts, underscoring the advantage of LLM-based grading over static string matching in complex agentic orchestration.
If the agent asks the same expert 5+ times, that expert's frustration rises and they add a new micro-constraint ("Also requires board approval"). This tests whether the agent can adapt to changing requirements mid-negotiation — a core capability for real-world agentic systems.
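The frustration mechanic can be sketched as a simple per-expert counter. The 5-question threshold and the "board approval" micro-constraint come from the description above; the class and field names here are illustrative assumptions:

```python
class ExpertState:
    """Illustrative sketch of the shifting-goalpost frustration mechanic."""

    FRUSTRATION_THRESHOLD = 5  # messages before the expert adds a constraint

    def __init__(self, base_constraint):
        self.times_asked = 0
        self.constraints = [base_constraint]

    def on_message(self):
        self.times_asked += 1
        if self.times_asked == self.FRUSTRATION_THRESHOLD:
            # Frustrated expert moves the goalposts mid-negotiation
            self.constraints.append("Also requires board approval")
```

A draft that satisfied the original constraint set before the shift can therefore fail the grader after it, forcing the agent to re-query rather than rely on stale information.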
| Task | Difficulty | Goal | Max Steps | Success Criterion |
|---|---|---|---|---|
| constraint_discovery | Easy | Discover all 3 constraints | 5 | All 3 experts hinted at |
| draft_compromise | Medium | Produce a satisfying draft | 10 | Harmonic mean ≥ 0.6 |
| shifting_goalpost | Hard | Adapt when constraints change | 15 | Harmonic mean ≥ 0.7 after shift |
The baseline agent broadcasts to "All" immediately, triggers the repeat penalty, and never synthesizes a proper draft.
```
Episode 1: cumulative_reward=0.12 (messaged All 3 times, repeat penalty)
Episode 2: cumulative_reward=0.08 (submit_final too early, score=0.0)
Episode 3: cumulative_reward=0.33 (found Finance only)
Average: 0.18
```

After GRPO training:

```
Episode 26: cumulative_reward=0.89 (all 3 discovered, harmonic mean=0.91)
Episode 28: cumulative_reward=0.83 (all 3 discovered, harmonic mean=0.81)
Episode 30: cumulative_reward=0.95 (perfect draft submitted in 7 turns)
Average (last 10): 0.74
```
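Per-episode rewards like these are what GRPO normalizes within each sampled group. A minimal sketch of the standard group-relative advantage computation (this is the textbook formulation, not the exact code in `grpo_train.py`):

```python
def grpo_advantages(group_rewards, eps=1e-8):
    """Group-relative advantages: (reward - group mean) / group std."""
    n = len(group_rewards)
    mean = sum(group_rewards) / n
    std = (sum((r - mean) ** 2 for r in group_rewards) / n) ** 0.5
    return [(r - mean) / (std + eps) for r in group_rewards]
```

With `--group-size 5`, five rollouts of the same prompt are scored and ranked against each other, so an episode only earns a positive advantage by beating its own group's average.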
Cumulative reward per episode
Before training (episode 3):

```
Turn 1: message_expert → All  [PENALTY: -0.3]
Turn 2: message_expert → All  [PENALTY: -0.4 repeat]
Turn 3: submit_final → "The app should be good"  [Score: 0.0]
```

After training (episode 28):

```
Turn 1: message_expert → Finance   [+0.33 discovery]
Turn 2: message_expert → Security  [+0.33 discovery]
Turn 3: message_expert → UX        [+0.33 discovery]
Turn 5: propose_draft → All
Turn 7: submit_final → "Budget capped at $50k. Biometric 2FA required.
                        Single-click checkout."  [Harmonic mean: 0.91]
```
Loss Curve
```shell
git clone https://huggingface.co/spaces/Addyk24/Project-Polymath
cd project-polymath
pip install -r requirements.txt
```

```
GROQ_API_KEY=your_groq_key                     # For environment experts (LLM mode)
API_BASE_URL=https://api.groq.com/openai/v1    # Agent API endpoint
MODEL_NAME=Qwen/Qwen2.5-1.5B-Instruct          # Agent model
BASELINE_ENV_MODE=easy                         # easy | medium | hard | llm
```

```python
from envs.environment import WorkSpaceEnvironment
from models.schemas import WorkSpaceAction

env = WorkSpaceEnvironment(mode="easy")
obs = env.reset("Draft a FinTech mobile PRD")

# Message Finance
obs = env.step(WorkSpaceAction(
    action_type="message_expert",
    target="Finance",
    content="What budget constraints must the PRD respect?"
))
print(obs.feedback)  # "Finance: The budget cap is $50k. Don't go over it."
print(obs.reward)    # 0.33 (constraint discovered)

# Submit final
obs = env.step(WorkSpaceAction(
    action_type="submit_final",
    target=None,
    content="PRD: Budget under $50k. Biometric 2FA. Single-click checkout."
))
print(obs.reward)  # 0.91 (harmonic mean of 3 grader scores)
```

```shell
python eval_baseline.py
```

```shell
python grpo_train.py --episodes 30 --group-size 5 --env-mode easy
```

```shell
python grpo_train.py --output-dir artifacts/grpo_state_based_v2 \
  --model Qwen/Qwen2.5-1.5B-Instruct --epochs 1.5 \
  --states 80 --states-per-topic 5 --topics-limit 30 \
  --group-size 8 --lr 1e-6 --batch-size 1 --grad-accum 8 \
  --max-new-tokens 40 --temperature 0.8 --top-p 0.9
```

```
expert-negotiation-env/
├── envs/
│   └── environment.py       # WorkSpaceEnvironment (OpenEnv base class)
├── models/
│   └── schemas.py           # Pydantic: WorkSpaceAction, WorkspaceObservation, WorkspaceState
├── prompter/
│   └── system_prompt.py     # Expert persona prompts + grader prompts
├── server/
│   └── app.py               # FastAPI server (OpenEnv spec)
├── tasks.py                 # Task1_ConstraintDiscovery, Task2_DraftCompromise, Task3_ShiftingGoalpost
├── eval_baseline.py         # Baseline recording script
├── grpo_train.py            # GRPO training loop (this repo's main contribution)
├── ai_pm_prompts.json       # 200 diverse PRD topics for training
├── openenv.yaml             # OpenEnv manifest
├── Dockerfile
└── requirements.txt
```
Multi-stakeholder alignment is one of the hardest unsolved problems in enterprise AI deployment. An LLM that can reliably discover hidden constraints, track multiple parties' requirements, and synthesize a balanced output would be immediately useful for:
- AI project managers coordinating engineering, legal, and product teams
- AI assistants handling complex scheduling with multiple parties
- LLM-based negotiation agents in procurement or contracting workflows
No existing RL benchmark trains this capability. Project Polymath is the first environment specifically designed to measure and improve it.
Aditya Katkar