Evaluating AI-Generated Code Before It Ships

AI coding assistants are genuinely fast. They can scaffold a feature, write a migration, or refactor a module in seconds. The risk isn’t that the code looks wrong — it usually looks right. The risk is that it looks right, passes a quick read, and breaks in production in ways you didn’t anticipate.

This post is a practical checklist for the review step that most developers rush or skip.

The core problem: plausibility bias

Models are trained to produce text that looks correct. When you ask for a database migration, you get something with the right structure, reasonable column names, and coherent SQL syntax. But “looks like a migration” and “is a safe migration” are different things. The model doesn’t have a running database to test against, and it’s optimizing for plausible output, not verified correctness.

Your job after accepting AI-generated code is to be the adversary the model doesn’t have.

Step 1: Run the tests before reading the code

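Before you open the diff, run the test suite. If it was green before, it should still be green. If tests fail, treat the change as unreviewed: the new code is usually at fault, though occasionally a test encoded behavior you meant to change. Either way, sort that out before reading further, and if you hand the failure back to the model, check that it fixes the code rather than loosening the test.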

This sounds obvious. It isn’t obvious in practice, because the temptation is to read the code, decide it looks fine, and merge. The test suite is a harder judge than your eyes.

If there are no tests for the changed code, that’s information too. Write at least a smoke test before accepting the change, especially for any code path that touches data, auth, or external services.
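
A smoke test doesn’t need to be thorough; it only has to run the changed path once with realistic input and check one or two invariants. A minimal sketch in Python, assuming a hypothetical `apply_discount` function stands in for the AI-generated change:

```python
import unittest


def apply_discount(total_cents: int, percent: int) -> int:
    """Hypothetical AI-generated function under review (not from a real codebase)."""
    if not 0 <= percent <= 100:
        raise ValueError("percent must be between 0 and 100")
    return total_cents - (total_cents * percent) // 100


class ApplyDiscountSmokeTest(unittest.TestCase):
    def test_typical_input(self):
        # One realistic call through the happy path.
        self.assertEqual(apply_discount(10_000, 15), 8_500)

    def test_rejects_out_of_range_percent(self):
        # One check that bad input fails loudly instead of silently.
        with self.assertRaises(ValueError):
            apply_discount(10_000, 150)
```

Saved in a test file, this runs with `python -m unittest`. Two assertions are not coverage, but they turn “it looks fine” into “it ran at least once.”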

Step 2: Read the diff, not the summary

AI assistants often write better summaries of their code than the code itself deserves. Read the actual diff line by line. Specifically look for:

Hallucinated API methods. Models confidently call methods that don’t exist in the version of a library you’re using. Check every unfamiliar method against the library’s real docs or source.

// The model wrote this. Does `db.upsertMany` actually exist in your ORM?
await db.upsertMany(records, { conflictFields: ["id"] });

// Better to verify before committing, e.g. in a scratch script or REPL:
import { db } from "@/lib/db";
console.log(typeof db.upsertMany); // "undefined" means the method doesn't exist

Version mismatches. Ask the model which library version it’s targeting, then compare the answer against what is actually installed rather than taking it on trust. If it says react-query v4 but your package.json has v5, the API surface is different.
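
The installed version is something you can check mechanically. A sketch for a Python project using the standard library’s `importlib.metadata`; the helper names are illustrative, and the parsing assumes a plain numeric leading component such as `5.28.4`. For a JavaScript project, the equivalent is reading the resolved version from your lockfile or `npm ls <package>`.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def major_of(version_string: str) -> int:
    """Extract the leading major number from a version like '5.28.4'."""
    return int(version_string.split(".")[0])


def installed_major(package: str) -> Optional[int]:
    """Return the installed major version of `package`, or None if it is not installed."""
    try:
        installed = version(package)
    except PackageNotFoundError:
        return None
    return major_of(installed)


def matches_target(package: str, target_major: int) -> bool:
    """True only if the package is installed and its major version equals the target."""
    return installed_major(package) == target_major
```

If `matches_target` returns `False` for the version the model claims to be targeting, assume every unfamiliar call in the diff needs checking.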

Removed error handling. When models refactor code, they sometimes simplify away error handling that was load-bearing. Compare the before/after for any removed try/catch, missing null checks, or dropped .catch() on promises.
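
Part of that comparison can be automated. The sketch below scans a unified diff for removed lines that look like error handling. It is a rough heuristic with illustrative patterns, so expect false positives and misses; it narrows where you look and does not replace reading the diff.

```python
import re

# Rough patterns for error handling in Python and JavaScript/TypeScript.
ERROR_HANDLING = re.compile(
    r"\b(try|except|catch|finally|raise|throw)\b|is (not )?None|[!=]==? ?null"
)


def removed_error_handling(diff_text: str) -> list[str]:
    """Return removed diff lines that look like error handling."""
    flagged = []
    for line in diff_text.splitlines():
        # Removed lines start with a single "-"; "---" marks the file header.
        if line.startswith("-") and not line.startswith("---"):
            content = line[1:].strip()
            if ERROR_HANDLING.search(content):
                flagged.append(content)
    return flagged


# Made-up diff in which a refactor dropped a try/except around a network call.
diff = """\
--- a/client.py
+++ b/client.py
@@ -1,6 +1,3 @@
-    try:
-        response = fetch(url)
-    except TimeoutError:
-        return None
+    response = fetch(url)
     return response.json()
"""

print(removed_error_handling(diff))  # ['try:', 'except TimeoutError:']
```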

What burned me — Josh: I merged an AI-written change that called db.upsertMany — it read cleanly and the model’s summary was confident. The method didn’t exist in our ORM version, and the code only ran on a path our tests didn’t cover, so it passed review and broke in staging the next day. That’s exactly why “verify every unfamiliar method against the real library” is step two of this post — it went on my checklist after that incident, not before it.

Step 3: Flag the security surface

Language models are not security auditors. They miss things that a dedicated review would catch. For any AI-generated code that touches the security surface, check these explicitly:

SQL and query injection. If the model builds a query with string interpolation, that’s a red flag even if it looks minor.

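# What the model wrote: vulnerable to injection, because `user_input` goes straight into the SQL
query = f"SELECT * FROM orders WHERE status = '{user_input}'"

# What you want: a parameterized query. Placeholder syntax varies by driver;
# `%s` is the psycopg2/MySQL style, while sqlite3 uses `?`.
query = "SELECT * FROM orders WHERE status = %s"
cursor.execute(query, (user_input,))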
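
To see the difference rather than take it on faith, here is a runnable sketch using Python’s built-in `sqlite3` module with an in-memory database. The table contents and the payload are made up for illustration, and `sqlite3` uses `?` as its placeholder.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, status TEXT)")
conn.executemany(
    "INSERT INTO orders (status) VALUES (?)",
    [("shipped",), ("pending",), ("cancelled",)],
)

user_input = "x' OR '1'='1"  # classic injection payload

# Interpolated: the payload rewrites the WHERE clause and matches every row.
unsafe_rows = conn.execute(
    f"SELECT * FROM orders WHERE status = '{user_input}'"
).fetchall()

# Parameterized: the payload is compared as a literal string and matches nothing.
safe_rows = conn.execute(
    "SELECT * FROM orders WHERE status = ?", (user_input,)
).fetchall()

print(len(unsafe_rows), len(safe_rows))  # 3 0
```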

Auth check placement. Models sometimes place authorization checks after the work has already been done, or miss them entirely when adding a new route. Always verify that the auth check runs before any data access or mutation.
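
The ordering you are looking for can be sketched in a few lines. Everything here is hypothetical (the `User` type, the in-memory `ORDERS` store, the ownership rule); the point is only that the permission check sits above the mutation, with nothing that changes state in between.

```python
from dataclasses import dataclass


@dataclass
class User:
    id: int
    is_admin: bool = False


# Hypothetical in-memory store standing in for a database: order id -> owner id.
ORDERS: dict[int, int] = {101: 1, 102: 2}


def delete_order(user: User, order_id: int) -> None:
    """Delete an order. The authorization check runs before the mutation."""
    owner_id = ORDERS.get(order_id)  # lookup needed only to decide authorization
    # Missing orders and orders owned by someone else fail the same way,
    # so the error does not reveal whether an order exists.
    if not (user.is_admin or owner_id == user.id):
        raise PermissionError("not allowed to delete this order")
    ORDERS.pop(order_id, None)
```

When reviewing a generated route, find the first line that reads or writes data on the caller’s behalf and confirm the check sits above it.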

Exposed secrets and paths. Check that the model didn’t inline a hardcoded API key, internal URL, or file path as a placeholder in an example and then leave it in production code.
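
The usual replacement for an inlined key is to read it from the environment and fail loudly when it is missing. A minimal sketch; the variable name `PAYMENTS_API_KEY` is an assumption for illustration, not something from a real service.

```python
import os


def load_api_key(var_name: str = "PAYMENTS_API_KEY") -> str:
    """Read a secret from the environment and fail loudly if it is missing."""
    key = os.environ.get(var_name, "").strip()
    if not key:
        raise RuntimeError(f"environment variable {var_name} is not set")
    return key
```

Failing at startup is deliberate: a missing secret should stop the process, not fall back to a default string that happens to work in development.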

These aren’t hypothetical — they appear regularly in AI-generated code. Treating AI output as pre-audited is the fast path to a CVE.

Step 4: Ask the model to critique its own output

Before you close the review, paste the generated code back and ask:

What are the edge cases this doesn’t handle? What could go wrong in production? What assumptions is this code making about inputs?

Models are surprisingly good at finding their own bugs when explicitly prompted to look for them. The initial generation is optimistic; the critique pass is more conservative.

Here's the code you just wrote:

[paste code]

List any edge cases not handled, assumptions baked in, and anything
that could fail silently in production. Be specific.

This doesn’t replace a human review, but it’s fast and reliably surfaces at least some issues the model glossed over the first time. You’ll often get actionable answers like “this doesn’t handle the case where items is empty” or “this assumes the network call always succeeds.”
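
If you run this pass often, it can help to keep the prompt in one place so the wording doesn’t drift between reviews. A small sketch with hypothetical names; it only builds the text, and sending it to a model is left to whatever client you already use.

```python
CRITIQUE_TEMPLATE = """\
Here's the code you just wrote:

{code}

List any edge cases not handled, assumptions baked in, and anything
that could fail silently in production. Be specific.
"""


def build_critique_prompt(code: str) -> str:
    """Fill the critique template with the generated code."""
    # Braces inside `code` are safe: only the template itself is formatted.
    return CRITIQUE_TEMPLATE.format(code=code.strip())
```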

Step 5: Verify the fix matches the symptom

When you use AI to fix a bug, check that the fix addresses the root cause rather than just silencing the error. The classic pattern:

// Bug: `user.profile` is sometimes null, crashing `.name` access.

// Model's fix — symptom suppressed, root cause ignored:
const name = user.profile?.name ?? "Unknown";

// Better: understand why profile is null and fix the data invariant,
// or at least surface the null case explicitly rather than swallowing it.

Optional chaining and nullish coalescing are useful tools, but they’re also easy ways to hide bugs. If the model suggests a defensive fallback, ask yourself whether you want to know when that fallback triggers. If yes, log it or add a metric.
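
Here is one way “log it or add a metric” can look, sketched in Python. The `Counter` is a stand-in for a real metrics client, and the function and logger names are illustrative.

```python
import logging
from collections import Counter
from typing import Optional

logger = logging.getLogger("profile")
fallback_counts: Counter = Counter()  # stand-in for a real metrics client


def display_name(user_id: int, profile: Optional[dict]) -> str:
    """Return the profile name, recording every time the fallback is used."""
    if profile is None:
        # Keep the defensive default, but make the null case visible.
        fallback_counts["profile_missing"] += 1
        logger.warning("user %s has no profile; using fallback name", user_id)
        return "Unknown"
    return profile["name"]
```

The fallback still protects the user-facing path, but a nonzero `profile_missing` count now tells you the data invariant is being violated.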

Build a lightweight review checklist

A checklist isn’t bureaucracy — it’s a forcing function that keeps you honest when a diff looks good and you’re in a hurry. Here’s a minimal one:

AI Code Review Checklist
------------------------
[ ] Tests pass (run them, don't assume)
[ ] All method/API calls verified against actual library version
[ ] Error handling not removed or weakened
[ ] No SQL injection / unsanitized input in queries
[ ] Auth checks run before data access
[ ] No hardcoded secrets or internal paths
[ ] Model asked for self-critique on edge cases
[ ] Fix addresses root cause, not just suppresses symptom

Keep it short enough that you’ll actually use it. Eight items is about as long as a checklist can get before people start skipping it.

The discipline this builds

Reviewing AI code with this level of rigor has a second-order benefit: you start writing better prompts. When you know you’ll be checking for hallucinated APIs, you start specifying the exact library version. When you know you’ll be checking auth placement, you tell the model where auth checks belong in your codebase upfront.

The review step and the prompt-writing step reinforce each other. Good prompts reduce review burden; rigorous reviews surface where prompts were underspecified.

For writing the prompts themselves to be more reliable from the start, see “Prompt Engineering Patterns That Survive Production.” And when you’re building the AI-powered features that generate this code in the first place, the agent loop in “Build Your First AI Agent in TypeScript” shows how to keep the model’s actions auditable at each step.
