DEV Community

# evaluation

Saurav Bhattacharya

Jun 25

Your Evals Are Flaky Too: Stop Trusting a Pass Rate You Can't Reproduce

#ai #evaluation #testing #typescript

5 min read

Alex @ Vibe Agent Making

Jun 24

Our Quality Scores Were Precise, Useless, and Identical

#engineering #management #evaluation #codequality

8 min read

Jun 23

Evaluating LLM Output Quality In Production

#ai #observability #llm #evaluation

10 min read

Saurav Bhattacharya

Jun 20

Your Agent Didn't Break, It Drifted: Detecting Slow Decay in Autonomous Systems

#ai #agents #observability #evaluation

7 min read

keeper

Jun 19

Stop Asking 'Is GAI Here' — Ask 'At What Layer'

#ai #gai #framework #evaluation

3 min read

Saurav Bhattacharya

Jun 20

Agent = Model x Harness: Your Eval Layer Is Part of the Agent, Not a Tool Beside It

#ai #agents #evaluation #observability

6 min read

Saurav Bhattacharya

Jun 19

Hallucination Is Not a Vibe: How to Actually Detect Ungrounded Claims in Agent Output

#ai #agents #evaluation #observability

6 min read

Arthur

Jun 11

An LLM benchmark is only useful for as long as it's hard

#llm #evaluation #benchmarks #humaneval

10 min read

Saurav Bhattacharya

Jun 9

I Tested Claude Opus 4, GPT-4.1, GPT-4o, Sonnet 4, and Gemini 2.5 Pro on 10 Adversarial Scenarios. They All Broke on the Same One.

#ai #agents #safety #evaluation

11 min read

Saurav Bhattacharya

Jun 22

Shadow Deployments for AI Agents: Canary Your Prompt Changes Before They Burn Production

#ai #agents #evaluation #observability

5 min read

Saurav Bhattacharya

Jun 7

Evals Are Alignment Enforcement: Why Your Safety Strategy Needs Runtime Checks

#ai #security #evaluation #agents

5 min read

Het Patel

Jun 22

Your RAG faithfulness check is measuring copy-paste, not faithfulness

#rag #llm #evaluation #machinelearning

5 min read

Saurav Bhattacharya

Jun 21

Goodhart's Law Comes for Your Agent Evals: Why Your Green Dashboard Stops Meaning Anything

#ai #agents #evaluation #observability

5 min read

Saurav Bhattacharya

Jun 17

Your Eval Suite Is Grading Fiction: Stop Inventing Test Cases and Mine Your Traces

#ai #evaluation #observability #typescript

5 min read

guangda

Jun 6

The First Psychiatric Evaluation of an AI Agent

#ai #agents #psychology #evaluation

1 min read
