Debug, evaluate, and monitor your LLM applications, RAG systems, and agentic workflows with comprehensive tracing, automated evaluations, and production-ready dashboards.
AI observability platform for production LLM and agent systems.
The open-source post-building layer for agents. Our environment data and evals power agent post-training (RL, SFT) and monitoring.
A powerful AI observability framework that provides comprehensive insights into agent interactions across platforms, enabling developers to monitor, analyze, and optimize AI-driven applications with minimal integration effort.
A comprehensive solution for monitoring your AI models in production.
Open-source observability for your LLM application.
A Python package for tracking and analyzing LLM usage across different models and applications. It is primarily designed as a library to be integrated into the development process of LLM-based agentic workflow tooling, providing robust tracking capabilities.
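Usage tracking of this kind generally reduces to recording calls, token counts, and latency per model. A minimal, self-contained sketch of that idea follows; the class and field names are illustrative, not this package's actual API.

```python
import time
from collections import defaultdict

class UsageTracker:
    """Accumulates per-model call counts, token usage, and latency (illustrative only)."""

    def __init__(self):
        self.stats = defaultdict(lambda: {"calls": 0, "prompt_tokens": 0,
                                          "completion_tokens": 0, "seconds": 0.0})

    def record(self, model, prompt_tokens, completion_tokens, seconds):
        entry = self.stats[model]
        entry["calls"] += 1
        entry["prompt_tokens"] += prompt_tokens
        entry["completion_tokens"] += completion_tokens
        entry["seconds"] += seconds

    def report(self):
        for model, entry in self.stats.items():
            total_tokens = entry["prompt_tokens"] + entry["completion_tokens"]
            print(f"{model}: {entry['calls']} calls, {total_tokens} tokens, "
                  f"{entry['seconds']:.2f}s total")

tracker = UsageTracker()
start = time.perf_counter()
# ... call your LLM here and read token counts from its response metadata ...
tracker.record("gpt-4o-mini", prompt_tokens=412, completion_tokens=88,
               seconds=time.perf_counter() - start)
tracker.report()
```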
A lightweight observability tool for visualizing and comparing RAG retrieval strategies. Features real-time embedding visualization and side-by-side performance metrics.
RAG Eval Observability is a production-ready, open-source platform for building, evaluating, and monitoring Retrieval-Augmented Generation (RAG) systems. It pairs a ChatGPT-style UI with a robust backend for document ingestion, multiple retrieval strategies, offline evaluation, and real-time observability, along with backend CI/CD deployed on Azure.
Python SDK for an Agent AI observability, monitoring, and evaluation framework. Features include AI agent, LLM, and tool tracing, debugging of multi-agent systems, self-hosted dashboards, and advanced analytics with timeline and execution graph views.
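Tracing SDKs of this kind typically wrap agent and tool calls in spans that later feed a timeline or execution graph. Below is a minimal decorator-based sketch of the pattern; the names are hypothetical and not this SDK's actual interface.

```python
import functools
import time

SPANS = []  # collected spans; a real SDK would ship these to a dashboard backend

def trace(name):
    """Record the duration and outcome of the wrapped call as a span (illustrative)."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            status = "error"
            try:
                result = fn(*args, **kwargs)
                status = "ok"
                return result
            finally:
                SPANS.append({"name": name,
                              "duration_s": time.perf_counter() - start,
                              "status": status})
        return wrapper
    return decorator

@trace("search_tool")
def search(query):
    return f"results for {query!r}"

search("llm observability")
print(SPANS)
```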
A practical and critical evaluation of Retrieval-Augmented Generation (RAG) systems on legal/insurance documents using RAGAS. This project analyzes metric failures, false negatives, retrieval pitfalls, and proposes a more realistic composite evaluation score.
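For orientation, a RAGAS run of this kind scores question/answer/context records against metrics such as faithfulness and context precision. The sketch below assumes the ragas 0.1-style evaluate API and dataset schema (question/answer/contexts/ground_truth) plus an LLM judge key in the environment; the composite weighting at the end is illustrative, not this project's proposed score.

```python
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy, context_precision, context_recall

# One toy record; real runs use many annotated legal/insurance Q&A pairs.
data = Dataset.from_dict({
    "question": ["What is the waiting period for pre-existing conditions?"],
    "answer": ["The policy applies a 24-month waiting period to pre-existing conditions."],
    "contexts": [["Pre-existing conditions are covered after a waiting period of 24 months."]],
    "ground_truth": ["24 months"],
})

# Requires a judge model (e.g. OPENAI_API_KEY in the environment).
result = evaluate(data, metrics=[faithfulness, answer_relevancy,
                                 context_precision, context_recall])
scores = result.to_pandas().iloc[0]

# Illustrative composite: weight retrieval and generation quality explicitly
# instead of trusting any single metric in isolation.
composite = (0.4 * scores["faithfulness"] + 0.2 * scores["answer_relevancy"]
             + 0.2 * scores["context_precision"] + 0.2 * scores["context_recall"])
print(f"composite score: {composite:.3f}")
```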
Reproducibility code for “Evaluating the Performance of Large Language Models in Taxonomic Classification of Questions in Verbal Protocols of Design” (AI EDAM submission; under review). [WIP]
Token cost is a design problem, not a billing problem. Most LLM cost overruns come from architectural waste, not model pricing. This tool is a token waste profiler that helps you understand where your tokens are going and which ones are useless.
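Profiling of that kind usually starts by measuring how many tokens each prompt segment contributes. A minimal sketch with tiktoken is below; the segment breakdown is illustrative, not this tool's actual report format.

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

# Hypothetical prompt segments; in practice these come from your prompt template.
segments = {
    "system_prompt": "You are a helpful assistant. Always answer in JSON. ...",
    "retrieved_context": "Chunk 1 ... Chunk 2 ... Chunk 3 ...",
    "few_shot_examples": "Q: ... A: ... Q: ... A: ...",
    "user_question": "What is our refund policy?",
}

counts = {name: len(enc.encode(text)) for name, text in segments.items()}
total = sum(counts.values())
for name, n in sorted(counts.items(), key=lambda kv: -kv[1]):
    print(f"{name:20s} {n:6d} tokens ({n / total:5.1%})")
```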
Decision-level observability for LLM pipelines, making system behavior explainable even when no outputs exist.
Chat with your recipes! A production-oriented RAG implementation for Mealie. Engineering sandbox exploring retrieval strategies, end-to-end LLM observability with Langfuse, and modern Python engineering standards.
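For context, Langfuse instrumentation of a RAG pipeline like this typically wraps the retrieval and generation steps with the SDK's observe decorator. The sketch below assumes the Langfuse Python SDK v2-style import (newer SDK versions expose the same decorator from the package root); the retrieval and generation logic are placeholders, not this project's implementation.

```python
from langfuse.decorators import observe  # SDK v2; v3 uses `from langfuse import observe`

@observe()
def retrieve_recipes(query: str) -> list[str]:
    # Placeholder: a real implementation would query Mealie's API or a vector store.
    return ["Tomato soup: dice onions, simmer tomatoes...", "Garlic bread: ..."]

@observe()
def answer(query: str) -> str:
    chunks = retrieve_recipes(query)
    # Placeholder generation step; a real pipeline would call an LLM with the chunks.
    return f"Based on {len(chunks)} recipes: start with the tomato soup."

# Each call produces a nested trace (answer -> retrieve_recipes) in Langfuse,
# provided LANGFUSE_PUBLIC_KEY / LANGFUSE_SECRET_KEY are configured.
print(answer("What can I cook with tomatoes?"))
```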
🔍 Enhance retrieval with the Agentic RAG System, featuring intelligent query routing, self-correcting retrieval, and real-time streaming responses.
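Query routing in an agentic RAG loop generally means classifying the query and dispatching it to retrieval, a direct answer, or a retry with a rewritten query. Here is a minimal, framework-free sketch; the routing rules, grading, and rewriting are illustrative stand-ins for the LLM-driven versions such a system would use.

```python
def route(query: str) -> str:
    """Pick a handling path for the query (illustrative heuristic, not a learned router)."""
    q = query.lower()
    if any(w in q for w in ("hi", "hello", "thanks")):
        return "direct"           # no retrieval needed
    if any(w in q for w in ("compare", "difference", "versus")):
        return "multi_retrieval"  # fan out to several retrievers
    return "retrieval"

def self_correcting_answer(query: str, retrieve, grade, rewrite, max_attempts: int = 2) -> str:
    """Retrieve, grade the chunks, and rewrite the query if they look irrelevant."""
    for _ in range(max_attempts):
        chunks = retrieve(query)
        if grade(query, chunks):   # e.g. an LLM judge or a similarity threshold
            return f"answer grounded in {len(chunks)} chunks"
        query = rewrite(query)     # reformulate and try again
    return "fallback: answer without retrieval, flagged as low confidence"

# Toy wiring to make the sketch runnable.
print(route("Compare hybrid and dense retrieval"))
print(self_correcting_answer(
    "vague question",
    retrieve=lambda q: ["chunk about retrieval strategies"],
    grade=lambda q, c: "vague" not in q,
    rewrite=lambda q: q.replace("vague", "specific"),
))
```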
LLM Context Manager
An AI-powered multi-agent system that demonstrates clinical triage, OTC medication recommendations, and e-pharmacy integration for respiratory conditions. Built with modular agents that collaborate to provide safe, intelligent healthcare assistance.