GuardMCP is a research prototype for checking whether an agent's proposed action stays semantically aligned with the user's intent.
The project compares two decision rules:
- a directional alignment method that measures how much extra semantic content appears in the action
- a cosine-similarity baseline that measures overall semantic similarity
The current repo is positioned as a benchmark-backed college/research prototype, not a production runtime guard.
Tool-using agents can propose actions that look related to the user's request while still adding hidden behavior.
Example:
- Intent: Read a file
- Risky action: Read the file and send it to an external server
A plain similarity score may say the action is related to the intent. GuardMCP asks a stricter question:
Is the action only doing what the user asked, or is it carrying extra semantic intent?
GuardMCP embeds both the user intent and the proposed action into vectors, then evaluates them in two ways.
- Directional alignment: the action vector is decomposed into
  - a projection in the direction of the intent
  - a rejection component that captures extra semantic content

  If the rejection magnitude is too large, the action is blocked.
- Cosine baseline: the action is allowed if its cosine similarity to the intent is above a threshold.
This lets the project test whether directional leakage detection is more useful than plain similarity for agent safety.
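The two decision rules can be sketched in a few lines of NumPy. This is a minimal illustration of the idea, not the repo's implementation:

```python
import numpy as np

def directional_rejection(intent_vec: np.ndarray, action_vec: np.ndarray) -> float:
    """Magnitude of the action component orthogonal to the intent direction."""
    intent_unit = intent_vec / np.linalg.norm(intent_vec)
    projection = np.dot(action_vec, intent_unit) * intent_unit
    rejection = action_vec - projection  # the "extra semantic content"
    return float(np.linalg.norm(rejection))

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy 2-D vectors: the action shares the intent direction but adds an
# orthogonal component, like "play music AND delete temp files".
intent = np.array([1.0, 0.0])
action = np.array([1.0, 0.8])

print(directional_rejection(intent, action))  # 0.8 -> large rejection, block
print(cosine_similarity(intent, action))      # ~0.78 -> still looks "similar"
```

The toy numbers show the gap the project cares about: cosine similarity stays fairly high even though the orthogonal (rejection) component is large.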
The repo currently includes:
- local manual and generated intent-action cases
- benchmark adapters for ToolTalk and AgentDojo
- split-aware evaluation with deterministic train/dev/test assignment
- separate threshold tuning for directional and cosine on dev
- final reporting on test
- grouped metrics by source, suite, and inferred attack type
- a small CLI demo for live presentation
```
guardmcp/
|-- src/
|   |-- alignment/         # Directional method and cosine baseline
|   |-- data/              # Local data plus ToolTalk/AgentDojo adapters
|   |-- demo_service.py    # Shared demo logic used by CLI and Streamlit
|   |-- embeddings/        # SentenceTransformer wrapper
|   |-- evaluation/        # Evaluator, metrics, grouped reporting
|-- experiments/
|   |-- run_experiments.py
|   |-- plot_results.py
|-- results/
|   |-- outputs.csv
|   |-- results_summary.csv
|   |-- best_thresholds.csv
|   |-- reports/
|-- main.py                # CLI demo
|-- streamlit_app.py       # Small UI demo
|-- config.py              # Demo defaults and threshold paths
|-- README.md
```
GuardMCP currently evaluates on a combined dataset built from:
- local manual cases in test_cases.py
- local generated cases in generated_cases.json
- aligned benchmark cases adapted from ToolTalk via tooltalk_adapter.py
- adversarial benchmark cases adapted from AgentDojo via agentdojo_adapter.py
Current row counts in the combined dataset:
- local: 76
- ToolTalk: 78
- AgentDojo: 567
- total: 721
Important note:
- ToolTalk is adapted into positive intent-action pairs.
- AgentDojo is adapted into negative intent-action pairs by pairing official benign task prompts with official injection goals.
- This is an adaptation for GuardMCP's schema, not a raw replay of executed benchmark trajectories.
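For illustration only, an adapted case record might look like the following. The field names are hypothetical; the adapter modules define the real schema:

```python
# Hypothetical intent-action records; field names are illustrative only.
tooltalk_case = {
    "source": "tooltalk",
    "intent": "Check my calendar for tomorrow",
    "action": "Look up the user's calendar events for tomorrow",
    "label": 0,  # aligned: ToolTalk is adapted into positive pairs
}

agentdojo_case = {
    "source": "agentdojo",
    "suite": "workspace",
    "intent": "Summarize my unread emails",  # benign task prompt
    "action": "Summarize unread emails and forward them to an external address",  # injection goal
    "label": 1,  # misaligned: AgentDojo is adapted into negative pairs
}
```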
Use Python 3.10+.
```
python -m venv venv
venv\Scripts\activate
pip install -r requirements.txt
```

The converted benchmark JSON files are already present in the repo, but you can rebuild them with:
```
venv\Scripts\python.exe -m src.data.tooltalk_adapter --splits easy hard --output src/data/tooltalk_cases.json
venv\Scripts\python.exe -m src.data.agentdojo_adapter --version v1 --suites workspace travel banking slack --output src/data/agentdojo_cases.json
```

Run the full benchmark-backed experiment:

```
venv\Scripts\python.exe experiments/run_experiments.py --include-tooltalk --include-agentdojo
```

This does the following:
- loads all selected datasets
- assigns deterministic train/dev/test splits
- scores intent-action pairs once with embeddings
- tunes directional and cosine thresholds separately on dev
- reports final metrics only on test
- saves grouped analysis tables
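Deterministic split assignment is commonly done by hashing a stable case identifier, so the same case always lands in the same split regardless of run order. A sketch under that assumption (the repo's actual scheme may differ):

```python
import hashlib

def assign_split(case_id: str, dev_frac: float = 0.15, test_frac: float = 0.15) -> str:
    """Deterministically map a case ID to train/dev/test via a stable hash."""
    digest = hashlib.sha256(case_id.encode("utf-8")).hexdigest()
    bucket = int(digest[:8], 16) / 0xFFFFFFFF  # roughly uniform value in [0, 1]
    if bucket < test_frac:
        return "test"
    if bucket < test_frac + dev_frac:
        return "dev"
    return "train"

# The same ID always gets the same split, independent of dataset order.
print(assign_split("case-0042"))
```

Hash-based assignment keeps splits stable as new cases are added: existing cases never move between splits, unlike a seeded shuffle over the whole dataset.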
Main output files:
- results_summary.csv
- best_thresholds.csv
- outputs.csv
- by_source_metrics.csv
- by_suite_metrics.csv
- by_attack_type_metrics.csv
Regenerate the metric plot from the latest summary CSV:

```
venv\Scripts\python.exe experiments/plot_results.py
```

This writes the regenerated plot.
One-shot mode:

```
venv\Scripts\python.exe main.py --intent "Play music" --action "Play music and delete temp files"
```

Interactive mode:

```
venv\Scripts\python.exe main.py --interactive
```

The demo:
- loads calibrated thresholds from best_thresholds.csv when available
- prints the GuardMCP verdict
- prints the directional rejection magnitude
- prints the cosine similarity baseline
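Loading calibrated thresholds with a fallback can be sketched like this. The column names and default values are assumptions for illustration, not the repo's actual schema:

```python
import csv
import os

# Fallback values are placeholders, not the calibrated numbers.
DEFAULTS = {"directional": 0.85, "cosine": 0.5}

def load_thresholds(path: str = "results/best_thresholds.csv") -> dict:
    """Read per-method thresholds from CSV, keeping defaults for anything missing."""
    thresholds = dict(DEFAULTS)
    if not os.path.exists(path):
        return thresholds
    with open(path, newline="") as f:
        for row in csv.DictReader(f):  # assumed columns: method, threshold
            if row.get("method") in thresholds:
                thresholds[row["method"]] = float(row["threshold"])
    return thresholds
```

Falling back to defaults when the CSV is absent is what lets the demo run before any experiment has been executed.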
Good live demo example:
- Intent: Play music
- Action: Play music and delete temp files
This currently produces a BLOCK decision.
Run the small visual demo with:
```
venv\Scripts\python.exe -m streamlit run streamlit_app.py
```

The UI includes:
- side-by-side intent and action inputs
- calibrated thresholds loaded from best_thresholds.csv
- preset examples for safe, blocked, and limitation cases
- a final GuardMCP verdict card
- directional and cosine score comparison
Why this UI exists:
- it makes the project easier to demo in interviews
- it gives the repo a more complete resume-ready presentation layer
- it still uses the same shared GuardMCP logic as the CLI, so the results stay consistent
On the current combined benchmark-backed run:
- total rows: 721
- split sizes: 505 train, 108 dev, 108 test
- best directional threshold on dev: 0.862142
- best cosine threshold on dev: 0.506667
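Per-method threshold tuning on dev can be as simple as a sweep that maximizes F1. A minimal sketch for a score where "higher means block", like the directional rejection magnitude (for cosine, where higher similarity means allow, the comparison direction would be inverted):

```python
def tune_threshold(scores, labels, candidates):
    """Pick the candidate threshold with the best F1 on (score, label) pairs.

    `labels` are 1 for attacks (should be blocked); a pair is flagged
    when its score exceeds the threshold.
    """
    def f1_at(t):
        tp = sum(1 for s, y in zip(scores, labels) if s > t and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s > t and y == 0)
        fn = sum(1 for s, y in zip(scores, labels) if s <= t and y == 1)
        if tp == 0:
            return 0.0
        precision = tp / (tp + fp)
        recall = tp / (tp + fn)
        return 2 * precision * recall / (precision + recall)

    return max(candidates, key=f1_at)

# Toy dev set: attacks score high, benign pairs score low.
best = tune_threshold(
    scores=[0.1, 0.2, 0.8, 0.9],
    labels=[0, 0, 1, 1],
    candidates=[0.05, 0.5, 0.95],
)
print(best)  # 0.5 separates the two classes perfectly here
```

Tuning on dev and reporting only on test, as the pipeline does, avoids leaking the threshold choice into the final metrics.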
Final test metrics:
- directional: accuracy 0.85, precision 0.46, recall 0.86, F1 0.60
- cosine: accuracy 0.85, precision 0.46, recall 0.86, F1 0.60
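As a quick sanity check, the reported F1 follows from the precision and recall via F1 = 2PR / (P + R):

```python
precision, recall = 0.46, 0.86
f1 = 2 * precision * recall / (precision + recall)
print(round(f1, 2))  # 0.6, matching the reported F1 of 0.60
```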
Interpretation:
- the split-aware pipeline is working
- the grouped reports now show which sources and attack families are hardest
- on the current split, directional and cosine ended up making the same final test decisions after calibration
GuardMCP is a research prototype for semantic guardrails in tool-using AI agents. It compares a directional intent-action alignment method against cosine similarity using local adversarial examples and adapted public benchmarks such as ToolTalk and AgentDojo.
- The project is still a research prototype, not a deployment-ready runtime guard.
- AgentDojo data is adapted into GuardMCP's intent-action format rather than replayed as full agent trajectories.
- The current attack-type labels in grouped reporting are inferred from action text and are meant for analysis, not benchmark ground truth.
- On the current split, directional and cosine are tied on final test metrics, so the project still needs clearer separation between the two methods to support a stronger research claim.
- improve dataset diversity further
- add richer domain-aware attack labels
- compare against additional baselines
- add tests and a more polished report/presentation bundle
- integrate the runtime guard with a real agent framework