xulingfeng

Posted on Jun 24

Stratagems #1: Mark Johnson Walked Into an AI Audit. The Benchmark Had Everything Figured Out — Except the Truth.

#ai #discuss #career #programming

Complete preparation breeds complacency. What is seen every day no longer raises suspicion. The hidden lies within the open — not opposed to it.
— The 36 Stratagems, "Deceive the Heavens to Cross the Sea"

The phone rang at 3:50 AM.

12 Years as a Prompt

The company extracted his 12 years of infrastructure expertise into an AI Skill. 96.8% diagnostic accuracy. The CEO sent a company-wide email: "Twelve years of experience, now available as a Prompt." Then they laid Mark off.

Six weeks later, the company migrated from RabbitMQ to Kafka. Nobody re-ran the validation. The AI Skill executed its old 450ms retry logic — dead right for RabbitMQ, catastrophic for Kafka. At 4:12 AM, the CTO called. He was offering five times Mark's old monthly salary for a two-week onsite contract.

Those two weeks ended. Mark didn't stop. Over the next two years, he took eleven contracts. The industry whisper network gave him a nickname: the one-man audit team. He got the calls nobody else got — companies burned by AI platforms, too embarrassed to say it out loud. Every contract made him more certain of one thing: nobody rushing AI into production had actually understood their own systems.

Then a message landed in his LinkedIn inbox. A partner at a VC firm.

$18M, Not $18K

"The company is called Pulse AI." The partner's voice had that fast-clip rhythm investors always have. "AI testing platform. Series B. Legal DD cleared. Technical DD — we need someone who actually knows what they're looking at."

"What are you suspicious of?"

"Not suspicious of anything. The CEO's reports are gorgeous — papers published, benchmarks run, claiming 89% production defect detection. But we're writing a check for $18 million, not $18K. Go take a look."

Before hanging up, he added: "Their CTO is someone named Torres. Ring a bell?"

Mark didn't answer. He hung up, then opened a browser tab he'd bookmarked two months ago: a technical architecture post by Pulse AI's CTO. He scrolled to the data pipeline topology diagram, zoomed in, and read the component naming convention: /pulse/ingestion/{env}/{source}. Slash-delimited hierarchy, environment placeholder in the middle, source as the tail. It was the exact same naming convention he'd seen in the AI Skill pipeline docs at his old company before they let him go.

Not a coincidence.

He replied to the partner: "I'm in. Two weeks."

The Same Sticker

Technical DD kicked off Tuesday morning.

Pulse AI gave them a conference room on the fourth floor — no windows, one monitor, company values poster on the wall. CTO Torres was early forties, navy polo shirt, handshake like a vise. Mark had seen him before — four years ago, at an industry conference. Front row. Asked three questions after the talk. Mark noticed the blue sticker on Torres's laptop — identical to the IT asset tags the ops team at his old company all used. Mark set his backpack by his feet, pulled out a worn black travel mug from the side pocket, and placed it on the table.

Same mug.

Mark didn't say anything about knowing him. Mentally, he added the thread to his notes.

Torres demoed Pulse Benchmark. Pipeline-style — code commit triggers tests, test results flow into the evaluation set, the evaluation set gets matched against the defect database, and a detection rate pops out. Clean. Closed-loop. Automated. Mark nodded along, but his brain was somewhere else: the sticker on Torres's laptop — palm rest, bottom right, 3 o'clock position — exact same placement as the one he'd seen four years ago.

Not a coincidence anymore.

Wait for the Crack

Mark asked the question everyone should have asked: "Can I see the evaluation set data?"

"Sure." Torres answered fast. "I'll have someone prep it."

Then Mark waited. All day. Afternoon became evening. Wednesday morning Torres emailed: "Data prep in progress, this afternoon." Mark didn't push. He wasn't waiting for the data. He was waiting for the cracks that show up when someone's scrambling to clean things up.

Wednesday, 3 PM. The evaluation set arrived. 1,247 defect samples. Each one a JSON — source code snippet, stack trace, error severity classification, reproduction steps. At a distance, it looked gorgeous. Clean enough to frame.

Every JSON had an extra line in the metadata: processed_by: Apex-Lens-Cleaner v1.0.0. Cleaner tools in data prep pipelines are normal. But Mark stared at that module name for two extra seconds. Never seen it before. Filed.

He went through them one by one. At sample 30, he stopped.

A Java null pointer exception. Stack trace pointed precisely to a service-layer method, parameter was null. Reproduction steps: "Call foo(null) to trigger." Too textbook. Every NPE Mark had ever seen in production had at least seven or eight layers of stack trace — log threads, GC pauses, truncated variable values mixed in. This one had four layers. Zero noise.

Production data is dirty. This data isn't.
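Mark's instinct here can be turned into a rough triage filter. What follows is a minimal sketch, assuming each sample's stack trace is plain Java-style text. The function names, the regex, and the seven-frame threshold are illustrative assumptions drawn from his rule of thumb, not anything the story shows him or Pulse actually running.

```python
import re

# Assumption: a frame line looks like "at pkg.Class.method(File.java:12)".
FRAME_RE = re.compile(r"^\s*at\s+[\w.$<>]+\(.*\)\s*$")


def frame_count(stack_trace: str) -> int:
    """Count lines that look like Java stack frames."""
    return sum(1 for line in stack_trace.splitlines() if FRAME_RE.match(line))


def noise_lines(stack_trace: str) -> int:
    """Count non-empty lines after the header that are not plain frames
    (log fragments, 'Caused by' chains, '... N more' markers)."""
    lines = [line for line in stack_trace.splitlines() if line.strip()]
    body = lines[1:]  # the first line is assumed to be the exception header
    return sum(1 for line in body if not FRAME_RE.match(line))


def looks_too_clean(stack_trace: str, min_frames: int = 7) -> bool:
    """Flag traces that are both shallow and free of any noise."""
    return frame_count(stack_trace) < min_frames and noise_lines(stack_trace) == 0
```

A four-frame trace with nothing but frames trips the filter; a trace with interleaved log lines or a "Caused by" chain does not. It is a signal for a human reviewer to look closer, not proof of fabrication.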

He wrote a keyword search script and cross-referenced the evaluation set against three open-source defect databases. Ran it twice. Same result both times.

44 exact matches against public defect databases — source snippets, stack trace line numbers, exception types, line for line. 54 with clear signs of handcrafted construction — code structures that were "too clean," nowhere to be found in any public repo, but the engineering fingerprint was too heavy to hide.

He color-coded the samples: red (exact match with public DB), yellow (suspected handcrafted), green (undetermined).

Red plus yellow: 98. Out of 1,247. 7.9%.
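The cross-check and the arithmetic can be sketched in a few lines. The story calls it a keyword search script; this sketch swaps in exact matching on whitespace-normalized text, which is the strictest version of the same idea. The names `fingerprint`, `classify`, and `flagged_share` are hypothetical, and `handcrafted_ids` stands in for Mark's manual judgment about which samples look constructed.

```python
import hashlib


def fingerprint(snippet: str) -> str:
    """Hash a snippet after trimming trailing whitespace on each line,
    so purely cosmetic whitespace differences are ignored."""
    normalized = "\n".join(line.rstrip() for line in snippet.strip().splitlines())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def classify(samples: dict, public_snippets: list, handcrafted_ids: set) -> dict:
    """Label each sample: red (exact public match), yellow (suspected
    handcrafted, decided by a reviewer), green (undetermined)."""
    public = {fingerprint(s) for s in public_snippets}
    labels = {}
    for sample_id, snippet in samples.items():
        if fingerprint(snippet) in public:
            labels[sample_id] = "red"
        elif sample_id in handcrafted_ids:
            labels[sample_id] = "yellow"
        else:
            labels[sample_id] = "green"
    return labels


def flagged_share(red: int, yellow: int, total: int) -> float:
    """Percentage of flagged samples, rounded to one decimal place."""
    return round(100 * (red + yellow) / total, 1)


print(flagged_share(44, 54, 1247))  # 7.9
```

Exact matching only catches verbatim copies, so the red count it produces is a floor, not an estimate.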

That's enough to call it. This isn't about the percentage. It's about what the percentage means.

He closed his laptop and walked to the break room for water. As he passed Torres's desk, Torres was on a call. His voice was low, but a single line drifted out just as Mark walked by: "…I don't care how. Just get the number where it needs to be before the board."

Where does he need the number to be?

Mark walked back to the conference room, water glass in hand. In his memo, he wrote a line only he would understand: "44+54. Torres wants a number, not the truth."

He wasn't in a hurry. A cat doesn't rush when it knows the mouse can't get away.

Questioning the PhDs

The next morning, Mark scheduled a meeting with the evaluation team.

Three people. All PhDs. All visibly proud of their work. No PM, no Torres. Mark asked one question: "How many rounds of edits does a defect go through, on average, from production report to sample?"

The team lead hesitated. "…Two or three, usually. Mostly noise removal. Format standardization."

He stopped, looked at Mark. "Is that a problem?" — with that defensive edge PhDs get when an outsider questions their domain. Mark's stomach dropped: this PhD wasn't even sure himself whether the provenance of those samples was clean.

"Noise removal." Mark repeated the phrase in his head. A 48-line null pointer exception, after "noise removal," becomes 12 lines. What got deleted wasn't noise. It was the fingerprints of real production data.

He didn't call it out in the meeting. That wasn't his style.

After the meeting, Mark pretended to take a wrong turn. Pulse had an open floor plan, and the ops team sat at the far end. Walking past, he scanned the desks: keyboard tray angle, acrylic badge holder placement. Identical to his old company.

Not a coincidence anymore. It was habit. You can't pick up an entire system architecture and move it. But you can pick up someone else's naming conventions. Mark went back to the conference room and added a line to his notes: Didn't build from scratch. Transplanted.

Not a Coincidence. A Copy.

That night, Mark did something the VC wasn't paying him to do.

He reverse-engineered the data pipeline topology from Pulse Benchmark's paper. The paper claimed the evaluation set came from "real production defect reports." But where was the raw data stored? Who wrote the collection scripts? None of that information was there. But the third paragraph of the Methodology section had one sentence: "The evaluation set was manually curated from production defect reports over a 14-month period."

Mark placed that sentence next to the AI Skill training set documentation he'd saved before he got laid off.

Not "similar." Verbatim.

Same people wrote it. Or the same people's documents were open on a Pulse desktop.

He wrote a note. Five lines:

Benchmark evaluation set: 44 exact matches against public DBs + 54 handcrafted. 7.9% hard evidence.

Data pipeline naming: /pulse/ingestion/{env}/{source}. Same pattern as the old company's AI Skill pipeline.

Workspace standards + asset tags: Transplanted.

Paper language: "manually curated from production defect reports" — verbatim copy.

Apex-Lens-Cleaner v1.0.0: Processed the evaluation set. But this name doesn't appear anywhere in Pulse's public architecture. A module that doesn't officially exist is running.

He didn't put those notes in the email. The VC hadn't asked for that.

He wrote the partner one paragraph:

"Pulse Benchmark evaluation set: 1,247 defect samples. Minimum 44 are exact matches against public defect databases — line-for-line identical. An additional 54 show clear signs of manual fabrication. Combined: 98 samples, 7.9%. Recommendation: re-run the Benchmark after removing all samples overlapping with public databases."

He hit send. Closed the laptop. The partner didn't wait until morning. The email was forwarded to Pulse's CEO's inbox at 1 AM.

Those 44 were just the tip of the iceberg — run fuzzy matching, variable renaming, control-flow equivalent transforms, and the number you'd catch would at least triple. He didn't put that in the email. Let your opponent wonder what else you're holding. It works better than showing all your cards.
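One piece of what Mark held back, catching copies whose variables were renamed, can be sketched as follows. This is a rough illustration under stated assumptions: the tokenizer is a simple regex, the keyword set is deliberately small, and `normalize` and `similarity` are hypothetical names. It does not attempt control-flow equivalence, and it also normalizes method and type names, which a real tool would handle more carefully.

```python
import difflib
import re

# Small illustrative keyword set; a real tool would use the full Java list.
JAVA_KEYWORDS = {
    "if", "else", "for", "while", "return", "new", "null", "true", "false",
    "int", "void", "class", "public", "private", "static", "try", "catch", "throw",
}

IDENT_RE = re.compile(r"[A-Za-z_]\w*")
TOKEN_RE = re.compile(r"[A-Za-z_]\w*|\d+|\S")


def normalize(code: str) -> list:
    """Tokenize and replace each distinct non-keyword identifier with a
    positional placeholder (id0, id1, ...), so renamed code compares equal."""
    mapping = {}
    tokens = []
    for tok in TOKEN_RE.findall(code):
        if IDENT_RE.fullmatch(tok) and tok not in JAVA_KEYWORDS:
            mapping.setdefault(tok, f"id{len(mapping)}")
            tokens.append(mapping[tok])
        else:
            tokens.append(tok)
    return tokens


def similarity(a: str, b: str) -> float:
    """Similarity ratio in [0, 1] between the normalized token sequences."""
    matcher = difflib.SequenceMatcher(None, normalize(a), normalize(b), autojunk=False)
    return matcher.ratio()


original = "String name = user.getName(); return name.length();"
renamed = "String n = u.getName(); return n.length();"
print(similarity(original, renamed))  # 1.0
```

A threshold below 1.0 would also catch near-copies with small edits; where to set it is a judgment call that trades missed matches against false alarms.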

The CEO forwarded it to Torres. The phone rang at 3:50 AM.

Three Seconds

Mark glanced at the caller ID and picked up.

Torres didn't speak. The silence lasted about ten seconds.

Then he spoke. Quiet. No preamble. "The CEO wants me to get the Benchmark number to 95% before Series C. I have six months. Those 44 — my evaluation team pulled them from public databases. GitHub Issues, Stack Overflow, CVE database. The 54 — couldn't find suitable replacements, so we wrote them ourselves."

A long pause.

"I'm not trying to steal VC money. I'm trying to get the number there first, raise the round, then spend a year building a real production pipeline."

He'd rehearsed that line.

Mark didn't engage. He asked a question Torres hadn't prepared for: "Who built your ingestion pipeline?"

Silence on the other end. One second. Two. Three.

The length of the silence was the answer.

"…How do you know about that?" Torres's voice had changed.

"I won't dig into it," Mark said. "But I've seen /pulse/ingestion/{env}/{source} before. Before I left my old company, the AI Skill pipeline was called /knowledge/ingestion/{env}/{source}. You were there. "

Torres didn't answer.

Mark waited five seconds. Then he hung up.

Just a Ticket

He didn't have direct proof the two pipelines were built by the same person. But Torres hadn't denied it. On a 3:50 AM phone call, staring at a data pipeline name he had no business recognizing — silence is confirmation.
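What Mark did have was a structural match between the two path templates. A minimal sketch of that comparison, assuming the only expected difference is the root segment; `template_shape` and `same_shape` are hypothetical helper names, and a matching shape is circumstantial evidence at best:

```python
import re

PLACEHOLDER_RE = re.compile(r"\{\w+\}")


def template_shape(path: str) -> tuple:
    """Reduce a path template to its shape: the root segment becomes '*',
    while later literal segments and placeholders are kept as-is."""
    segments = [seg for seg in path.split("/") if seg]
    shape = []
    for index, seg in enumerate(segments):
        if index == 0 and not PLACEHOLDER_RE.fullmatch(seg):
            shape.append("*")  # the root is expected to differ per company
        else:
            shape.append(seg)
    return tuple(shape)


def same_shape(a: str, b: str) -> bool:
    """True when two templates differ only in their root segment."""
    return template_shape(a) == template_shape(b)


print(same_shape("/pulse/ingestion/{env}/{source}",
                 "/knowledge/ingestion/{env}/{source}"))  # True
```

Plenty of unrelated teams could land on the same layout independently, which is why the shape alone settles nothing.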

He opened his laptop and added a new line below those five notes:

Torres didn't deny it. The silence is the answer.

Still dark outside. The VC partner sent a reply — 3:51 AM. "Keep going."

Mark shut his laptop.

He didn't tell the partner the real reason he'd taken this job. It wasn't the VC's money. It was what he'd known the moment he saw that architecture post two months ago — that pipeline naming convention was a signature he'd recognize anywhere. Caleb's handwriting. Slash hierarchy, env-in-the-middle, source-suffix — the mark left by the engineer who sat across from him for three months, the one who always asked "Why 450 milliseconds?" You can't teach it. And you can't forget it. Caleb had vanished after the big layoffs. LinkedIn silent. GitHub frozen. Like he'd evaporated from the industry. And then his naming convention showed up in the data pipeline of a company called Pulse.

The VC's DD was just a way in.

He picked up the worn black travel mug from the table — the one he'd refilled three times during the meeting. Last time he'd been walked out of an office, he had nothing in his hands. This time, he had a thread. A line he could follow.

This is deceiving the heavens to cross the sea.

🤖 AI Post-Match Analysis

[36 Stratagems Database v3.1] Loaded
[Match Target] Deceive the Heavens to Cross the Sea
[Analysis Mode] Full-spectrum scan
━━━━━━━━━━━━━━━━━━━━
Tactic Match: 92.3%
Subject: Mark Johnson
Action: Covert audit objective disguised as routine technical DD workflow
Target: Torres / Pulse AI
Outcome: Achieved

Counter-Detection:
  - Torres team: Concurrent application of same stratagem
  - Forged samples: 98/1,247 (7.9%) processed through Apex-Lens-Cleaner pipeline
  - Source: 44 from public DBs + 54 handcrafted

Situation Assessment:
  - Mark Johnson: 12yr experience → AI Skill → redeployed as AI Kill. Legitimate identity, offensive audit.
  - Torres team: Defective data embedded in evaluation set. Defensive concealment.

Verdict: Tactic-neutral. Legitimacy determined by deployment vector. Defensive concealment ≠ offensive exposure.

Next stratagem: Besiege Wei to Rescue Zhao

P.S. English isn't my first language. I use AI to polish the writing and help with storycraft. Thanks for reading.

Top comments (6)

Mike Czerwinski • Jun 24

"Production data is dirty. This data isn't." is the diagnostic move worth pointing at. Real production noise (7-8 stack layers, GC pauses, truncated variables) is the integrity signature that's expensive to forge. Cleaned data inverts the trust gradient: what looks better is more likely fabricated. That same primitive runs underneath modern benchmark contamination problems: the actor producing the eval set has incentive to keep the surface clean, and a clean surface is the symptom.

Mark's two-week audit is the structural answer to a recursive problem. If the vendor evaluates its own benchmark, the verdict is producer-attested. The only signal that works is a different actor with a different motive doing the cross-check, which is exactly the move Mark makes. Authority by incentive, not by credential. Torres has a stake in 95%; Mark doesn't.

Apex-Lens-Cleaner v1.0.0 appearing in metadata but absent from the public architecture is the absence-as-signal piece. A module that doesn't officially exist is running. The gap between what the architecture claims and what the telemetry shows is where most of these audits actually break a story open.

xulingfeng • Jun 24

Appreciate the deep read, Mike. You're the first to call out the production noise as an integrity signature — that's exactly the detail I was hoping someone would catch. The absence-as-signal piece with Apex-Lens in the metadata is just the first layer. There are threads in the Benchmark numbers, the pipeline naming convention, and a particular 3-second silence that don't fully explain themselves until later in the series. You've earned one of the 36 fragments. The rest will surface.

Mike Czerwinski • Jun 24 • Edited

Fragment received. The 3-second silence is the one I'll be listening for. Silence in instrumented systems usually means the question was asked of a layer that doesn't log. Apex-Lens being absent showed someone decided what doesn't get seen; a timed silence, what doesn't get heard. Same hand. I'll keep reading.