ai-evals

Learn to evaluate AI products for production — 21 hands-on lessons on evals, metrics, fairness, agents, red teaming, and release decisions for working PMs.

bootcamp red-teaming rag prompt-engineering llmops ai-product-management llm-evaluation claude-code ai-pm ai-evals

Updated Jun 8, 2026

yiouli / pixie-qa

Agent skill for AI agent development

skill dev eval llm agent-skills ai-evals

Updated Apr 22, 2026
HTML

jimy-r / agent-workspace-architecture

AI agent workspace architecture, demonstrated end-to-end in Claude Code: roles library, persistent memory, hooks, scheduled agents, self-audits, loop selection, and measurement-gated self-improvement. Interactive tour, fork-ready samples.

Updated Jun 18, 2026
Python

mohsinsheikhani / property-maintenance-agent

Eval-first AI agent that triages property maintenance emails. The real work is the eval system around it: trace-driven error analysis, code graders and validated LLM-as-judge (TPR/TNR), component and end-to-end evals, a failure taxonomy, and a CI regression gate. LangGraph, FastAPI, Langfuse.

python evaluation openai ai-agents pydantic fastapi ai-engineering prompt-engineering llmops langfuse llm-evaluation langgraph llm-as-a-judge llm-observability agentic-ai context-engineering ai-evals

Updated Jun 7, 2026
Python

vibheksoni / jailbench

Benchmark LLM jailbreak resilience across providers with standardized tests, adversarial mode, rich analytics, and a clean Web UI.

Updated Aug 12, 2025
Python

summer202007 / SBS4ANY_LLM

Local side-by-side eval workbench that turns one AI product evaluation idea into eval cases, comparable evidence, and a decision-grade report.

benchmarking side-by-side macos-app ai-product product-evaluation evals llm-apps llm-evaluation ai-evals codex-skills

Updated Jun 16, 2026
JavaScript

danielrosehill / Awesome-AI-Evaluations-Tools

Collection of frameworks and tools for AI evaluations, including tool-use, agentic AI, MCP, and multimodal

evaluations evals ai-evals

Updated Jun 15, 2026
Python

udapy / rusty-llm-jury

Rust-based CLI tool for estimating success rates when using LLM judges for evaluation.

rust-lang evaluation-framework ai-evals

Updated Jun 7, 2026
Rust

vitron-ai / aip-foundry-themis-starter

Unofficial TypeScript starter for deterministic local contract testing around Foundry-oriented workflows with Themis.

react typescript schema-validation themis contract-testing osdk developer-tooling agentic-workflows ai-evals foundry-workflows

Updated Mar 28, 2026
TypeScript

yenklabs / Dali

Dali is open evidentiary infrastructure for legal AI, focused on verifiable outputs, reproducible evaluations, and evidence artifacts.

open-source benchmark oss mcp provenance reproducibility legaltech evidence legal-ai ai-evaluation open-infrastructure ai-evals deterministic-ai legal-citations defensible-ai legal-infrastructure citation-integrity evidentiary-infrastructure

Updated Jun 4, 2026
Python

JohnImril / reviewpilot-ai

AI-powered Pull Request Review Assistant with diff parsing, structured LLM output, explainable risk scoring, and AI evals.

ai nextjs openai developer-tools code-review pull-requests typescipt zod vitest ai-evals

Updated May 18, 2026
TypeScript

mihaibc / ai-production-readiness-kit

Open-source toolkit for assessing whether an AI workflow is ready for production: governance, RAG quality, evals, observability, human review, cost, risk, and business value.

ai human-in-the-loop production-ai rag engineering-leadership ai-engineering llm ai-governance llmops genai ai-readiness ai-evals

Updated Jun 18, 2026
Python

SuperfiedStudd / ai-evals-orchestration

End-to-end AI evals orchestration platform for comparing LLM outputs across providers with transcription, structured logging, human review, and Supabase-backed decision tracking.

gemini openai multi-model transcription human-in-the-loop model-comparison supabase anthropic llm-evaluation ai-evals evaluation-pipeline

Updated Mar 10, 2026
TypeScript

ishtiaqrahman / capitalbench

Offline, auditable benchmark for one-shot LLM market decisions.

finance benchmark reproducibility llm-evaluation ai-evals capitalbench

Updated Jun 19, 2026
Python

animesh01 / llm-observability-dashboard

SQL-backed LLM observability & evals dashboard for a conversational AI assistant — model-health monitoring across quality, safety, performance, cost, and drift, with executive summary, alerting, and PDF/PPTX export. Synthetic demo data.

data-pipeline model-monitoring ai-product-management llm-observability ai-evals model-health

Updated Jun 11, 2026
Python

IsaacCavallaro / agent-evals-workbench

A lightweight workbench for dataset-driven agent and LLM evaluation.

python cli regression-testing llm-evals agent-evals openai-compatible ai-evals eval-harness

Updated May 1, 2026
Python

alexcatdad / scenario-eval-harness

Scenario evaluation harness for AI workflows with structured reports and traces.

testing typescript reports llm ai-evals

Updated Jun 9, 2026
TypeScript

ai-evals

Here are 51 public repositories matching this topic...

Stunspot / stunspots-guide-to-ai-systems

aisa-group / InferenceBench

solana8800 / langeval

productfoundry101 / ai-evals-bootcamp

yiouli / pixie-qa

jimy-r / agent-workspace-architecture

mohsinsheikhani / property-maintenance-agent

vibheksoni / jailbench

summer202007 / SBS4ANY_LLM

danielrosehill / Awesome-AI-Evaluations-Tools

udapy / rusty-llm-jury

vitron-ai / aip-foundry-themis-starter

yenklabs / Dali

JohnImril / reviewpilot-ai

mihaibc / ai-production-readiness-kit

SuperfiedStudd / ai-evals-orchestration

ishtiaqrahman / capitalbench

animesh01 / llm-observability-dashboard

IsaacCavallaro / agent-evals-workbench

alexcatdad / scenario-eval-harness
