Scientific QA robustness evaluation pipeline for evidence-missing RAG scenarios on PeerQA, with EM/F1 reliability analysis.
Updated Mar 18, 2026 - Python
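The EM/F1 metrics mentioned above are the standard SQuAD-style answer-scoring pair: exact match after normalization, and token-level F1 overlap. A minimal sketch (function names are illustrative; the repo's actual implementation may differ):

```python
import re
import string
from collections import Counter

def normalize(text: str) -> str:
    """SQuAD-style normalization: lowercase, drop punctuation and
    articles (a/an/the), collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(pred: str, gold: str) -> int:
    """1 if normalized prediction equals normalized gold answer, else 0."""
    return int(normalize(pred) == normalize(gold))

def f1_score(pred: str, gold: str) -> float:
    """Token-level F1 between normalized prediction and gold answer."""
    pred_toks = normalize(pred).split()
    gold_toks = normalize(gold).split()
    if not pred_toks or not gold_toks:
        # Both empty counts as a match; one empty is a miss.
        return float(pred_toks == gold_toks)
    common = Counter(pred_toks) & Counter(gold_toks)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_toks)
    recall = overlap / len(gold_toks)
    return 2 * precision * recall / (precision + recall)
```

In evidence-missing RAG scenarios, a reliability analysis would typically compare these scores between questions whose supporting evidence is retrievable and questions where it has been withheld.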
RevealVLLMSafetyEval is a comprehensive pipeline for evaluating Vision-Language Models (VLMs) on their compliance with harm-related policies. It automates the creation of adversarial multi-turn datasets and the evaluation of model responses, supporting responsible AI development and red-teaming efforts.
Multi-agent deep research engine with SIA (Semantic Intelligence Architecture): thermodynamic entropy control, adversarial critique, and multi-reactor swarm orchestration.
GuardMCP - Deterministic Runtime Semantic Enforcement for Agentic Tool Execution using Directional Intent–Action Alignment