Operational doctrine for practical AI systems design.
Updated Jun 16, 2026
Benchmarking Open-Ended Inference Optimization by AI Agents
Evaluation Infrastructure for AI Agents
Learn to evaluate AI products for production — 21 hands-on lessons on evals, metrics, fairness, agents, red teaming, and release decisions for working PMs.
AI agent workspace architecture, demonstrated end-to-end in Claude Code: roles library, persistent memory, hooks, scheduled agents, self-audits, loop selection, and measurement-gated self-improvement. Interactive tour, fork-ready samples.
Eval-first AI agent that triages property maintenance emails. The real work is the eval system around it: trace-driven error analysis, code graders and validated LLM-as-judge (TPR/TNR), component and end-to-end evals, a failure taxonomy, and a CI regression gate. LangGraph, FastAPI, Langfuse.
Benchmark LLM jailbreak resilience across providers with standardized tests, adversarial mode, rich analytics, and a clean Web UI.
Local side-by-side eval workbench that turns one AI product evaluation idea into eval cases, comparable evidence, and a decision-grade report.
Collection of frameworks and tools for AI evaluations, including tool use, agentic AI, MCP, and multimodal.
Rust-based CLI tool for estimating success rates when using LLM judges for evaluation.
Unofficial TypeScript starter for deterministic local contract testing around Foundry-oriented workflows with Themis.
Dali is open evidentiary infrastructure for legal AI, focused on verifiable outputs, reproducible evaluations, and evidence artifacts.
AI-powered Pull Request Review Assistant with diff parsing, structured LLM output, explainable risk scoring, and AI evals.
Open-source toolkit for assessing whether an AI workflow is ready for production: governance, RAG quality, evals, observability, human review, cost, risk, and business value.
End-to-end AI evals orchestration platform for comparing LLM outputs across providers with transcription, structured logging, human review, and Supabase-backed decision tracking.
Offline, auditable benchmark for one-shot LLM market decisions.
SQL-backed LLM observability & evals dashboard for a conversational AI assistant — model-health monitoring across quality, safety, performance, cost, and drift, with executive summary, alerting, and PDF/PPTX export. Synthetic demo data.
A lightweight workbench for dataset-driven agent and LLM evaluation.
Scenario evaluation harness for AI workflows with structured reports and traces.