
DEV Community

# evaluation

Posts

Your Evals Are Flaky Too: Stop Trusting a Pass Rate You Can't Reproduce

Comments
5 min read
Our Quality Scores Were Precise, Useless, and Identical

Comments 1
8 min read
Evaluating LLM Output Quality In Production

Comments
10 min read
Your Agent Didn't Break, It Drifted: Detecting Slow Decay in Autonomous Systems

Comments 1
7 min read
Stop Asking 'Is GAI Here' — Ask 'At What Layer'

Comments
3 min read
Agent = Model x Harness: Your Eval Layer Is Part of the Agent, Not a Tool Beside It

Comments 1
6 min read
Hallucination Is Not a Vibe: How to Actually Detect Ungrounded Claims in Agent Output

Comments
6 min read
An LLM benchmark is only useful for as long as it's hard

Comments
10 min read
I Tested Claude Opus 4, GPT-4.1, GPT-4o, Sonnet 4, and Gemini 2.5 Pro on 10 Adversarial Scenarios. They All Broke on the Same One.

Comments
11 min read
Shadow Deployments for AI Agents: Canary Your Prompt Changes Before They Burn Production

Comments
5 min read
Evals Are Alignment Enforcement: Why Your Safety Strategy Needs Runtime Checks

Comments
5 min read
Your RAG faithfulness check is measuring copy-paste, not faithfulness

Comments 5
5 min read
Goodhart's Law Comes for Your Agent Evals: Why Your Green Dashboard Stops Meaning Anything

Comments 2
5 min read
Your Eval Suite Is Grading Fiction: Stop Inventing Test Cases and Mine Your Traces

Comments
5 min read
The First Psychiatric Evaluation of an AI Agent

Comments
1 min read