DEV Community
#evaluation
Posts
Your Evals Are Flaky Too: Stop Trusting a Pass Rate You Can't Reproduce
Saurav Bhattacharya
Jun 25
#ai #evaluation #testing #typescript
1 reaction
5 min read
Our Quality Scores Were Precise, Useless, and Identical
Alex @ Vibe Agent Making
Jun 24
#engineering #management #evaluation #codequality
1 reaction
1 comment
8 min read
Evaluating LLM Output Quality In Production
Nazar Boyko
Jun 23
#ai #observability #llm #evaluation
6 reactions
10 min read
Your Agent Didn't Break, It Drifted: Detecting Slow Decay in Autonomous Systems
Saurav Bhattacharya
Jun 20
#ai #agents #observability #evaluation
2 reactions
1 comment
7 min read
Stop Asking 'Is GAI Here' — Ask 'At What Layer'
keeper
Jun 19
#ai #gai #framework #evaluation
1 reaction
3 min read
Agent = Model x Harness: Your Eval Layer Is Part of the Agent, Not a Tool Beside It
Saurav Bhattacharya
Jun 20
#ai #agents #evaluation #observability
1 reaction
1 comment
6 min read
Hallucination Is Not a Vibe: How to Actually Detect Ungrounded Claims in Agent Output
Saurav Bhattacharya
Jun 19
#ai #agents #evaluation #observability
3 reactions
6 min read
An LLM benchmark is only useful for as long as it's hard
Arthur
Jun 11
#llm #evaluation #benchmarks #humaneval
2 reactions
10 min read
I Tested Claude Opus 4, GPT-4.1, GPT-4o, Sonnet 4, and Gemini 2.5 Pro on 10 Adversarial Scenarios. They All Broke on the Same One.
Saurav Bhattacharya
Jun 9
#ai #agents #safety #evaluation
2 reactions
11 min read
Shadow Deployments for AI Agents: Canary Your Prompt Changes Before They Burn Production
Saurav Bhattacharya
Jun 22
#ai #agents #evaluation #observability
2 reactions
5 min read
Evals Are Alignment Enforcement: Why Your Safety Strategy Needs Runtime Checks
Saurav Bhattacharya
Jun 7
#ai #security #evaluation #agents
1 reaction
5 min read
Your RAG faithfulness check is measuring copy-paste, not faithfulness
Het Patel
Jun 22
#rag #llm #evaluation #machinelearning
2 reactions
5 comments
5 min read
Goodhart's Law Comes for Your Agent Evals: Why Your Green Dashboard Stops Meaning Anything
Saurav Bhattacharya
Jun 21
#ai #agents #evaluation #observability
1 reaction
2 comments
5 min read
Your Eval Suite Is Grading Fiction: Stop Inventing Test Cases and Mine Your Traces
Saurav Bhattacharya
Jun 17
#ai #evaluation #observability #typescript
2 reactions
5 min read
The First Psychiatric Evaluation of an AI Agent
guangda
Jun 6
#ai #agents #psychology #evaluation
1 reaction
1 min read