0degrees.ai
Production

Evaluating AI-Generated Code Before It Ships

A practical checklist for reviewing AI-written code — catching hallucinated APIs, subtle logic bugs, and security gaps before they reach production.

0degrees Team 5 min read

AI coding assistants are genuinely fast. They can scaffold a feature, write a migration, or refactor a module in seconds. The risk isn’t that the code looks wrong — it usually looks right. The risk is that it looks right, passes a quick read, and breaks in production in ways you didn’t anticipate.

This post is a practical checklist for the review step that most developers rush or skip.

The core problem: plausibility bias

Models are trained to produce text that looks correct. When you ask for a database migration, you get something with the right structure, reasonable column names, and coherent SQL syntax. But “looks like a migration” and “is a safe migration” are different things. The model doesn’t have a running database to test against, and it’s optimizing for plausible output, not verified correctness.

Your job after accepting AI-generated code is to be the adversary the model doesn’t have.

Step 1: Run the tests before reading the code

Before you open the diff, run the test suite. If it was green before, it should still be green. If tests fail, treat the change as broken until proven otherwise: pause the review and have the model fix the code first. Make sure it fixes the code rather than weakening or deleting the failing tests.

This sounds obvious. It isn’t obvious in practice, because the temptation is to read the code, decide it looks fine, and merge. The test suite is a harder judge than your eyes.

If there are no tests for the changed code, that’s information too. Write at least a smoke test before accepting the change, especially for any code path that touches data, auth, or external services.
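A smoke test can be very small. Here is a minimal sketch, assuming a hypothetical `apply_discount` function stands in for the code the model changed. The function, its name, and its rules are illustrative, not taken from any real codebase. The tests are plain functions with assertions, so a runner such as pytest will pick them up, and they also work when called directly.

```python
def apply_discount(total_cents: int, percent: int) -> int:
    """Hypothetical stand-in for the AI-generated function under review."""
    if not 0 <= percent <= 100:
        raise ValueError("percent must be between 0 and 100")
    return total_cents - (total_cents * percent) // 100


def test_happy_path():
    # The one case everybody checks.
    assert apply_discount(10_000, 25) == 7_500


def test_zero_total():
    # Boundary input: an empty cart should stay at zero.
    assert apply_discount(0, 50) == 0


def test_rejects_out_of_range_percent():
    # Bad input should fail loudly instead of producing a negative total.
    try:
        apply_discount(10_000, 150)
    except ValueError:
        return
    raise AssertionError("expected ValueError for percent > 100")
```

Three cases are enough to start: the happy path, one boundary, and one invalid input. The goal is a tripwire, not full coverage.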

Step 2: Read the diff, not the summary

AI assistants often write better summaries of their code than the code itself deserves. Read the actual diff line by line. Specifically look for:

Hallucinated API methods. Models confidently call methods that don’t exist in the version of a library you’re using. Check every unfamiliar method against the library’s real docs or source.

// The model wrote this. Does `db.upsertMany` actually exist in your ORM?
await db.upsertMany(records, { conflictFields: ["id"] });

// Better to verify before committing (run separately, e.g. in a scratch script):
import { db } from "@/lib/db";
console.log(typeof db.upsertMany); // "undefined" → method doesn't exist

Version mismatches. Ask the model which library version it’s targeting, then compare the answer with your package.json and lockfile. If it says react-query v4 but your package.json has v5, the API surface is different.
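That comparison is easy to script. The sketch below assumes simple semver ranges such as `^5.28.0` in package.json. The helper name `declared_major` and the sample manifest are made up for illustration. It reads only the declared range, so the lockfile is still the authority on what is actually installed.

```python
import json
import re
from typing import Optional


def declared_major(package_json_text: str, package: str) -> Optional[int]:
    """Return the major version declared for `package`, or None if unknown."""
    manifest = json.loads(package_json_text)
    for section in ("dependencies", "devDependencies"):
        spec = manifest.get(section, {}).get(package)
        if spec is None:
            continue
        # Skip range prefixes such as ^, ~, >= and read the first number.
        match = re.match(r"[\^~>=<v\s]*(\d+)", spec)
        return int(match.group(1)) if match else None
    return None


# Illustrative manifest; in practice, read your real package.json from disk.
manifest_text = '{"dependencies": {"@tanstack/react-query": "^5.28.0"}}'
model_claimed_major = 4  # what the model said it was targeting

installed_major = declared_major(manifest_text, "@tanstack/react-query")
if installed_major != model_claimed_major:
    print(
        f"Version mismatch: model targeted v{model_claimed_major}, "
        f"package.json declares v{installed_major}"
    )
```

Specs that are not plain version ranges, such as `latest` or a workspace reference, return `None`, which is a cue to check by hand.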

Removed error handling. When models refactor code, they sometimes simplify away error handling that was load-bearing. Compare the before/after for any removed try/catch, missing null checks, or dropped .catch() on promises.
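A rough heuristic can point your eyes at the right lines. This sketch scans a unified diff for removed lines that look like error handling. The pattern list and the function name `removed_error_handling` are assumptions you would tune for your own codebase, and a hit only means "look here", not "this is a bug".

```python
import re
from typing import List

# Illustrative patterns that suggest error handling in JS or Python code.
ERROR_HANDLING = re.compile(
    r"\b(?:try|catch|except|finally)\b"
    r"|\.catch\("
    r"|(?:==|!=)=?\s*null\b"
    r"|\bis (?:not )?None\b"
)


def removed_error_handling(diff_text: str) -> List[str]:
    """Return removed diff lines that look like error handling."""
    flagged = []
    for line in diff_text.splitlines():
        # Removed lines start with "-"; "---" marks a file header.
        if line.startswith("-") and not line.startswith("---"):
            content = line[1:]
            if ERROR_HANDLING.search(content):
                flagged.append(content.strip())
    return flagged


sample_diff = """\
--- a/api.js
+++ b/api.js
@@ -1,7 +1,3 @@
 async function load(url) {
-  try {
-    return await fetch(url);
-  } catch (err) {
-    return null;
-  }
+  return await fetch(url);
 }
"""

print(removed_error_handling(sample_diff))  # ['try {', '} catch (err) {']
```

In practice you would feed it the output of `git diff` for the change under review.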

Step 3: Flag the security surface

Language models are not security auditors. They miss things that a dedicated review would catch. For any AI-generated code that touches the security surface, check these explicitly:

SQL and query injection. If the model builds a query with string interpolation, that’s a red flag even if it looks minor.

# What the model wrote: injectable whenever `user_input` comes from an untrusted source
query = f"SELECT * FROM orders WHERE status = '{user_input}'"

# What you want: a parameterized query (placeholder syntax varies by driver)
query = "SELECT * FROM orders WHERE status = %s"
cursor.execute(query, (user_input,))
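To see the difference concretely, here is a self-contained sketch using Python’s built-in sqlite3 module, which uses `?` as its placeholder instead of `%s`. The table and the malicious input are made up for the demo.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, status TEXT)")
conn.executemany(
    "INSERT INTO orders (status) VALUES (?)",
    [("open",), ("shipped",), ("cancelled",)],
)

# A classic injection payload: closes the string and adds an always-true clause.
user_input = "open' OR '1'='1"

# Interpolated: the payload becomes part of the SQL and matches every row.
unsafe_query = f"SELECT id FROM orders WHERE status = '{user_input}'"
unsafe_rows = conn.execute(unsafe_query).fetchall()

# Parameterized: the payload is treated as a plain string and matches nothing.
safe_rows = conn.execute(
    "SELECT id FROM orders WHERE status = ?", (user_input,)
).fetchall()

print(len(unsafe_rows))  # 3
print(len(safe_rows))    # 0
```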

Auth check placement. Models sometimes place authorization checks after the work has already been done, or miss them entirely when adding a new route. Always verify that the auth check runs before any data access or mutation.
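The ordering problem is easiest to see side by side. This is a framework-free sketch. The `Forbidden` exception, the role field, and both handler names are hypothetical, and a plain dict stands in for the database.

```python
class Forbidden(Exception):
    """Raised when the caller lacks the required role."""


def delete_order_late_check(orders, user, order_id):
    removed = orders.pop(order_id)   # mutation runs first
    if user["role"] != "admin":      # check runs too late
        raise Forbidden("admin role required")
    return removed


def delete_order_early_check(orders, user, order_id):
    if user["role"] != "admin":      # check runs before any data access
        raise Forbidden("admin role required")
    return orders.pop(order_id)


viewer = {"name": "sam", "role": "viewer"}

late = {1: "order-1"}
try:
    delete_order_late_check(late, viewer, 1)
except Forbidden:
    pass
print(late)   # {} (the order is gone even though the request was rejected)

early = {1: "order-1"}
try:
    delete_order_early_check(early, viewer, 1)
except Forbidden:
    pass
print(early)  # {1: 'order-1'}
```

Both versions raise the same error, so a test that only asserts on the response would pass for either one. Assert on the state of the data as well.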

Exposed secrets and paths. Check that the model didn’t inline a hardcoded API key in an example and then leave it in production code. Do the same for internal hostnames and absolute file paths.
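A few regular expressions catch the most obvious cases. The sketch below is deliberately minimal: the three patterns and the `find_secrets` helper are illustrative assumptions, and a dedicated secret scanner will cover far more formats than this does.

```python
import re
from typing import List, Tuple

# Illustrative patterns only; real scanners ship with many more.
SECRET_PATTERNS = {
    "AWS access key ID": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
    "private key block": re.compile(r"-----BEGIN [A-Z ]*PRIVATE KEY-----"),
    "hardcoded credential": re.compile(
        r"(?i)(api[_-]?key|secret|token|password)\w*\s*[:=]\s*"
        r"['\"][^'\"]{8,}['\"]"
    ),
}


def find_secrets(source: str) -> List[Tuple[int, str]]:
    """Return (line number, pattern label) pairs for suspicious lines."""
    findings = []
    for lineno, line in enumerate(source.splitlines(), start=1):
        for label, pattern in SECRET_PATTERNS.items():
            if pattern.search(line):
                findings.append((lineno, label))
    return findings


sample = "\n".join([
    "import os",
    'API_KEY = "not-a-real-key-123456"',       # literal value: flagged
    'password = os.environ["DB_PASSWORD"]',    # read from env: not flagged
])

print(find_secrets(sample))  # [(2, 'hardcoded credential')]
```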

These aren’t hypothetical — they appear regularly in AI-generated code. Treating AI output as pre-audited is the fast path to a CVE.

Step 4: Ask the model to critique its own output

Before you close the review, paste the generated code back and ask:

What are the edge cases this doesn’t handle? What could go wrong in production? What assumptions is this code making about inputs?

Models are surprisingly good at finding their own bugs when explicitly prompted to look for them. The initial generation is optimistic; the critique pass is more conservative.

Here's the code you just wrote:

[paste code]

List any edge cases not handled, assumptions baked in, and anything
that could fail silently in production. Be specific.

This doesn’t replace a human review, but it’s fast and often surfaces issues the model glossed over the first time. You’ll often get actionable answers like “this doesn’t handle the case where items is empty” or “this assumes the network call always succeeds.”

Step 5: Verify the fix addresses the root cause

When you use AI to fix a bug, check that the fix addresses the root cause rather than just silencing the error. The classic pattern:

// Bug: `user.profile` is sometimes null, crashing `.name` access.

// Model's fix — symptom suppressed, root cause ignored:
const name = user.profile?.name ?? "Unknown";

// Better: understand why profile is null and fix the data invariant,
// or at least surface the null case explicitly rather than swallowing it.

Optional chaining and nullish coalescing are useful tools, but they’re also easy ways to hide bugs. If the model suggests a defensive fallback, ask yourself whether you want to know when that fallback triggers. If yes, log it or add a metric.
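Here is one way to make a fallback visible, sketched in Python as a rough counterpart to the JavaScript snippet above. The `display_name` helper, the logger name, and the shape of the `user` dict are all assumptions for illustration.

```python
import logging

logger = logging.getLogger("profile_fallback")


def display_name(user: dict) -> str:
    """Return the profile name, logging whenever the fallback is used."""
    profile = user.get("profile")
    if profile is None:
        # Surface the broken invariant instead of silently hiding it.
        logger.warning(
            "user %s has no profile; falling back to 'Unknown'",
            user.get("id"),
        )
        return "Unknown"
    return profile["name"]


print(display_name({"id": 8, "profile": {"name": "Ada"}}))  # Ada
print(display_name({"id": 7, "profile": None}))             # Unknown
```

The caller still gets a usable value, but every fallback now leaves a trace you can count, alert on, or trace back to the data that violated the invariant.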

Build a lightweight review checklist

A checklist isn’t bureaucracy — it’s a forcing function that keeps you honest when a diff looks good and you’re in a hurry. Here’s a minimal one:

AI Code Review Checklist
------------------------
[ ] Tests pass (run them, don't assume)
[ ] All method/API calls verified against actual library version
[ ] Error handling not removed or weakened
[ ] No SQL injection / unsanitized input in queries
[ ] Auth checks run before data access
[ ] No hardcoded secrets or internal paths
[ ] Model asked for self-critique on edge cases
[ ] Fix addresses root cause, not just suppresses symptom

Keep it short enough that you’ll actually use it. Past about eight items, checklists tend to get skipped.

The discipline this builds

Reviewing AI code with this level of rigor has a second-order benefit: you start writing better prompts. When you know you’ll be checking for hallucinated APIs, you start specifying the exact library version. When you know you’ll be checking auth placement, you tell the model where auth checks belong in your codebase upfront.

The review step and the prompt-writing step reinforce each other. Good prompts reduce review burden; rigorous reviews surface where prompts were underspecified.

To make the prompts themselves more reliable from the start, see Prompt Engineering Patterns That Survive Production. And if you’re building the AI-powered features that generate this code in the first place, the agent loop in Build Your First AI Agent in TypeScript shows how to keep the model’s actions auditable at each step.
