
How We QA a 50,000-Line SaaS with AI (And What We Caught)

By Tom Pinder · 6 min read

We build IdeaLift, a product management platform with 10+ integrations — Slack, Discord, Teams, GitHub, Jira, Linear, Zendesk, and more. The codebase is 50,000+ lines across a Next.js web app, a bot server, a browser extension, and a Jira Forge app.

Most of it was built with AI assistance. Cursor, Claude, Copilot — all three, depending on the task. We ship fast. Some weeks, we merge 30+ PRs.

That pace creates a problem: how do you QA a fast-moving, AI-generated codebase without a dedicated QA team? This is the vibe coding QA gap at scale.

We solved it by building an AI-powered QA system directly into our admin dashboard. After six months of dogfooding it, here's what we've learned — and what we're extracting into VibeProof.

The Setup

Our QA system has four pillars:

1. Structured Test Cases

Every testable behavior is a numbered test case with a consistent format:

TC-SLACK-014: Verify idea capture from Slack thread reply
─────────────────────────────────────────────────────────
Priority: High
Type: Integration
Preconditions: Slack workspace connected, channel selected

Steps:
  1. Post a message in the connected Slack channel
  2. Reply to the message in a thread
  3. Use the ⚡ shortcut on the reply
  4. Verify the idea appears in the dashboard

Expected: Idea created with source "Slack", thread context preserved
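A test case like the one above maps naturally to a structured record. Here is a minimal sketch of that shape in TypeScript; the field names and types are illustrative, not IdeaLift's actual schema:

```typescript
// Hypothetical shape of a structured test case (field names are illustrative).
type Priority = "High" | "Medium" | "Low";
type TestType = "Integration" | "UI" | "API";

interface TestCase {
  id: string;            // e.g. "TC-SLACK-014"
  title: string;
  area: string;          // maps to an integration or feature surface
  priority: Priority;
  type: TestType;
  preconditions: string[];
  steps: string[];
  expected: string;
}

const tcSlack014: TestCase = {
  id: "TC-SLACK-014",
  title: "Verify idea capture from Slack thread reply",
  area: "Slack",
  priority: "High",
  type: "Integration",
  preconditions: ["Slack workspace connected", "channel selected"],
  steps: [
    "Post a message in the connected Slack channel",
    "Reply to the message in a thread",
    "Use the shortcut on the reply",
    "Verify the idea appears in the dashboard",
  ],
  expected: 'Idea created with source "Slack", thread context preserved',
};
```

Keeping cases as data rather than prose is what makes the coverage metrics and one-click issue creation later in this post possible.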

We organize test cases by test area — Slack, Discord, Teams, GitHub, Dashboard, etc. Each area maps to an integration or feature surface. Right now we have 10 test areas covering the full product.

2. Priority-Based Coverage Targets

Not every feature deserves the same testing effort. We use a four-tier priority system:

| Tier | Areas                                   | Coverage Target | Pass Rate Target |
|------|-----------------------------------------|-----------------|------------------|
| P0   | Auth, Dashboard, Inbox, Idea Management | 100%            | 95%              |
| P1   | Slack, Discord, GitHub, Jira, Linear    | 90%             | 85%              |
| P2   | Teams, Zendesk, NLQ Search              | 70%             | 75%              |
| P3   | HubSpot, Fireflies, Extension           | 50%             | 60%              |

This means we test authentication and core features exhaustively, test primary integrations thoroughly, and test secondary features at a baseline level. The priority matrix tells us where to focus when time is tight — which is always.
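Encoded as config, the tier targets become something a dashboard can check mechanically. A minimal sketch (our own naming, not a product API):

```typescript
// Tier targets as config, plus a helper that checks an area against its tier.
// Values mirror the table above; coverage and passRate are fractions in [0, 1].
interface TierTargets {
  coverage: number;
  passRate: number;
}

const tiers: Record<string, TierTargets> = {
  P0: { coverage: 1.0, passRate: 0.95 },
  P1: { coverage: 0.9, passRate: 0.85 },
  P2: { coverage: 0.7, passRate: 0.75 },
  P3: { coverage: 0.5, passRate: 0.6 },
};

function meetsTargets(tier: string, coverage: number, passRate: number): boolean {
  const t = tiers[tier];
  return coverage >= t.coverage && passRate >= t.passRate;
}
```

For example, a P1 area at 92% coverage and 86% pass rate meets its targets, while a P0 area at 100% coverage but a 90% pass rate does not.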

3. Test Runs with Evidence Collection

A test run groups related executions into a named session — usually tied to a release or a sprint. For each test execution, we capture:

  • Status: passed, failed, blocked, skipped
  • Actual result: what actually happened vs. what should have happened
  • Screenshots: uploaded directly during execution, stored in Azure Blob Storage
  • Duration: how long each test took to execute

When a test fails, we capture the evidence in the moment — the error message, the wrong UI state, the API response. This matters because "it failed yesterday" with no proof is useless for debugging.
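An execution record that captures all of this might look like the following sketch (field names and the example blob URL are illustrative, not our actual schema):

```typescript
// Hypothetical execution record: status, evidence, and timing in one object.
type Status = "passed" | "failed" | "blocked" | "skipped";

interface Execution {
  testCaseId: string;
  status: Status;
  actualResult: string;      // what actually happened
  screenshotUrls: string[];  // e.g. links into blob storage
  durationMs: number;
  recordedAt: string;        // ISO timestamp, captured at execution time
}

function recordExecution(
  testCaseId: string,
  status: Status,
  actualResult: string,
  screenshotUrls: string[] = [],
  durationMs = 0
): Execution {
  return {
    testCaseId,
    status,
    actualResult,
    screenshotUrls,
    durationMs,
    recordedAt: new Date().toISOString(),
  };
}
```

The point of the structure is that a failure is never just a red mark; it carries its own reproduction evidence.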

4. AI Chat Assistant with RAG

This is the feature we're most excited about extracting into VibeProof. Our QA assistant uses Claude Sonnet with retrieval-augmented generation (RAG) to answer questions about how features work.

The assistant searches across three sources:

  • Knowledge base articles — hand-written guides for each integration
  • Test cases — the structured test library
  • Documentation — markdown docs checked into the repo

When a QA tester asks "how does Slack idea capture work?", the assistant pulls relevant knowledge base articles, finds related test cases, and synthesizes an answer with citations. This eliminates the "ask the developer" bottleneck that kills QA velocity.
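The real assistant uses embeddings and Claude, but the retrieve-then-cite flow can be sketched with a toy keyword scorer standing in for vector search (everything here is a simplified stand-in, not our production code):

```typescript
// Toy RAG sketch: rank documents by keyword overlap, then assemble a prompt
// that forces the model to answer from context and cite sources as [n].
interface Doc {
  source: "kb" | "testcase" | "docs";
  title: string;
  text: string;
}

function retrieve(query: string, corpus: Doc[], k = 3): Doc[] {
  const terms = query.toLowerCase().split(/\W+/).filter(Boolean);
  const score = (d: Doc) =>
    terms.filter((t) => d.text.toLowerCase().includes(t)).length;
  return [...corpus]
    .sort((a, b) => score(b) - score(a))
    .slice(0, k)
    .filter((d) => score(d) > 0);
}

function buildPrompt(query: string, hits: Doc[]): string {
  const context = hits
    .map((d, i) => `[${i + 1}] (${d.source}) ${d.title}\n${d.text}`)
    .join("\n\n");
  return `Answer using only the context below. Cite sources as [n].\n\n${context}\n\nQuestion: ${query}`;
}
```

Swapping the keyword scorer for embedding similarity changes the ranking quality, not the shape of the pipeline.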

What It Caught

After six months of running this system, here are the categories of bugs we consistently catch:

OAuth Token Expiry Edge Cases

Our integrations use OAuth tokens that expire. The AI-generated code handled the happy path (token valid → make API call), but consistently missed edge cases around expiry:

  • Token expires mid-request
  • Refresh token is also expired
  • Token refresh succeeds but the retry uses the old token
  • Two concurrent requests both try to refresh simultaneously

These bugs only surface in production, days or weeks after the initial OAuth flow. Without structured test cases specifically targeting token lifecycle, they'd go unnoticed until a customer reports "my Slack integration stopped working."
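The last bullet, two concurrent requests racing to refresh, has a standard fix: single-flight refresh, where concurrent callers share one in-flight refresh promise. A minimal sketch (our own illustration, not IdeaLift's actual token code):

```typescript
// Single-flight token refresh: concurrent callers share one refresh promise,
// so the refresh endpoint is hit once even under a burst of requests.
class TokenManager {
  private token: string | null = null;
  private expiresAt = 0;
  private refreshing: Promise<string> | null = null;

  constructor(private refreshFn: () => Promise<string>) {}

  async getToken(): Promise<string> {
    // Happy path: cached token still valid (with a safety margin in real code).
    if (this.token && Date.now() < this.expiresAt) return this.token;
    // Single-flight: only the first caller starts a refresh; the rest await it.
    if (!this.refreshing) {
      this.refreshing = this.refreshFn().then((t) => {
        this.token = t;
        this.expiresAt = Date.now() + 55 * 60 * 1000; // hypothetical 55 min TTL
        this.refreshing = null;
        return t; // callers get the *new* token, never a stale one
      });
    }
    return this.refreshing;
  }
}
```

Returning the new token from the shared promise also closes the "refresh succeeds but the retry uses the old token" bug: callers never read the stale field, they await the refresh result.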

Cross-Integration State Bugs

When a user connects Slack and Discord, ideas flow from both sources. Our AI-generated code handled each integration independently, but cross-integration scenarios revealed bugs:

  • Duplicate detection across sources (same idea submitted via Slack and Discord)
  • Notification preferences applying to one source but not another
  • Rate limiting counted per-source instead of per-workspace

These bugs are invisible to the developer who built each integration separately. They only appear when you test the interactions between features — exactly what structured test cases are designed to cover.
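The per-source rate limiting bug is a good example of how small the fix is once a test exposes it: the quota must be keyed by workspace, with the source deliberately ignored. A sketch under our own naming:

```typescript
// Rate limit keyed per workspace, not per source. The bug we caught counted
// per-source, so one workspace could exceed its quota via multiple integrations.
class WorkspaceRateLimiter {
  private counts = new Map<string, number>();

  constructor(private limit: number) {}

  allow(workspaceId: string, _source: string): boolean {
    // Key by workspace only; the source is irrelevant to the quota.
    const n = this.counts.get(workspaceId) ?? 0;
    if (n >= this.limit) return false;
    this.counts.set(workspaceId, n + 1);
    return true;
  }
}
```

A cross-integration test case then simply submits from two sources into one workspace and asserts the shared quota is enforced.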

Authorization Boundary Violations

This is the scariest category. AI-generated API routes frequently work correctly for the authenticated user but don't properly scope data to their workspace. Without test cases that specifically try to access Workspace B's data from Workspace A's session, these vulnerabilities go undetected.

We found three authorization boundary bugs in the first month of structured testing. All three had been in production for weeks. None were caught by code review.
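The test cases that catch this class of bug are simple: authenticate as Workspace A, request a Workspace B resource by id, and expect nothing back. A toy sketch of the correctly scoped lookup and the check (hypothetical data and names):

```typescript
// Authorization boundary sketch: a lookup must filter by the session's
// workspace, not just the resource id.
interface Idea {
  id: string;
  workspaceId: string;
  title: string;
}

const ideas: Idea[] = [
  { id: "i1", workspaceId: "ws-a", title: "A's idea" },
  { id: "i2", workspaceId: "ws-b", title: "B's idea" },
];

// Correct version: scoped to the caller's workspace. The buggy versions we
// found matched on id alone, returning any workspace's data.
function getIdea(sessionWorkspaceId: string, ideaId: string): Idea | null {
  return (
    ideas.find((i) => i.id === ideaId && i.workspaceId === sessionWorkspaceId) ??
    null
  );
}
```

The cross-workspace request returning null (or a 404 at the API layer) is the entire assertion; if it returns data, you have a leak.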

What We Learned

AI-Generated Code Needs AI-Generated Tests

The humans who review AI-generated code bring the same blind spots as the AI that wrote it. If the developer didn't think to add workspace scoping, the reviewer usually doesn't catch it either — because both are focused on "does this feature work?" not "does this feature leak data?"

A separate AI system that reads the code from scratch and generates test cases from the outside finds the gaps that the write-review loop misses.

Coverage Metrics Change Behavior

Before we had coverage dashboards, testing was ad hoc. "I clicked around and it works" was the standard. After adding coverage metrics with targets (P0: 100%, P1: 90%), the team started treating testing as a measurable outcome, not a checkbox.

The metric that changed behavior most: pass rate by area over time. When the Slack area dropped from 92% to 78% after a refactor, it was immediately visible and prioritized. Without the metric, customers would have found that regression before we did.

One-Click GitHub Issues Are Non-Negotiable

The distance between "I found a bug" and "there's a GitHub issue with steps to reproduce" determines whether bugs get fixed. When that distance is one click — with the test steps, expected result, actual result, and screenshots auto-populated — bugs get fixed the same day.

When the distance is "write up the bug, take screenshots, create the issue, tag the developer" — bugs sit for days. Or forever.
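Because the execution record already holds steps, expected, actual, and screenshots, building the GitHub issue is pure formatting. A sketch of the payload builder; the `{ title, body, labels }` shape matches GitHub's REST "create an issue" request body, and actually filing it is one authenticated POST to `/repos/{owner}/{repo}/issues`:

```typescript
// Auto-populate a GitHub issue from a failed test execution.
// Field names on FailedRun are our illustration, not a product schema.
interface FailedRun {
  testCaseId: string;
  title: string;
  steps: string[];
  expected: string;
  actual: string;
  screenshotUrls: string[];
}

function toGithubIssue(run: FailedRun): {
  title: string;
  body: string;
  labels: string[];
} {
  const steps = run.steps.map((s, i) => `${i + 1}. ${s}`).join("\n");
  const shots = run.screenshotUrls.map((u) => `![screenshot](${u})`).join("\n");
  return {
    title: `[${run.testCaseId}] ${run.title}`,
    body:
      `### Steps to reproduce\n${steps}\n\n` +
      `### Expected\n${run.expected}\n\n` +
      `### Actual\n${run.actual}\n\n${shots}`,
    labels: ["bug", "qa"],
  };
}
```

Everything the developer needs to reproduce the failure is in the issue body the moment it is filed, which is what makes same-day fixes realistic.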

From Internal Tool to VibeProof

The QA system we built for IdeaLift works. It catches real bugs, tracks real metrics, and fits into a real development workflow. But it's tightly coupled to our codebase — the test areas are our integrations, the knowledge base is our docs, the auth is our admin panel.

VibeProof is the extraction of this system into a standalone product that works with any codebase:

  • Connect your repo instead of hardcoded test areas
  • AI generates test cases from your code instead of manual creation
  • Universal knowledge base that learns from your codebase, not ours
  • Same evidence collection, same GitHub integration, same coverage metrics

We're building VibeProof because we proved the approach works on a complex, fast-moving, AI-generated codebase. If it works for a 50K-line SaaS with 10 integrations, it works for your app too.

Try VibeProof free — same QA system, built for your codebase.

Ready to stop shipping bugs?

VibeProof reads your codebase and writes your test cases. Start free with BYOK.
