
# Quality Gates

During modernization, legacy and modern code coexist. Changes carry more risk than in a stable codebase — a refactored service might break an integration point nobody documented, or a migrated module might subtly change behavior that downstream systems depend on.

Quality gates create checkpoints that catch these problems before they reach production.

Every pull request over 200 lines of code should include a structured header. This is not bureaucracy — it is communication. Reviewers spend less time understanding the change and more time evaluating it.

## PR Contract
**Intent:** Migrate invoice PDF generation from legacy template engine to modern renderer
**Proof:** 14 tests — 8 unit (renderer logic), 4 integration (PDF output), 2 parity (legacy vs modern output)
**Risk:** Medium — PDF layout differences could affect downstream print workflows
**Review Focus:** Parity test assertions in test/parity/invoice-pdf.test.ts
**Related PRs:** #142 (template engine extraction), #145 (font loading service)
| Field | Purpose | Example |
| --- | --- | --- |
| Intent | Why this change exists (1 sentence) | “Extract payment processing into standalone service” |
| Proof | Test count and what they verify | “12 tests: 6 unit, 4 integration, 2 regression” |
| Risk | Low/Medium/High with explanation | “High — changes database schema with rollback migration” |
| Review Focus | Where human reviewers should spend time | “Business logic in src/services/payment.ts lines 45-120” |
| Related PRs | Dependency chain or “Standalone” | “#201, #203 (must merge in order)” |

The contract helps reviewers prioritize. A “Low risk, Standalone” PR with 90% test coverage needs a lighter touch than a “High risk” one with schema changes and a dependency chain.
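
The parity tests named under Proof and Review Focus in the example above are where legacy and modern behavior get compared directly. A minimal sketch of what test/parity/invoice-pdf.test.ts might contain, assuming a Vitest-style runner and hypothetical legacyRenderInvoicePdf / renderInvoicePdf functions plus an extractText helper (none of these names come from the original PR):

```typescript
// test/parity/invoice-pdf.test.ts (sketch; module paths and helpers are illustrative)
import { describe, expect, it } from "vitest";

import { legacyRenderInvoicePdf } from "../../src/legacy/invoice-pdf"; // legacy template engine
import { renderInvoicePdf } from "../../src/rendering/invoice-pdf";    // modern renderer
import { extractText } from "../helpers/pdf-text";                     // assumed: PDF buffer -> plain text

const sampleInvoice = {
  id: "INV-1042",
  customer: "Acme GmbH",
  lines: [{ description: "Consulting", quantity: 10, unitPriceCents: 15000 }],
  totalCents: 150000,
};

describe("invoice PDF parity: legacy vs modern", () => {
  it("produces the same textual content for the same invoice", async () => {
    const legacyText = await extractText(await legacyRenderInvoicePdf(sampleInvoice));
    const modernText = await extractText(await renderInvoicePdf(sampleInvoice));
    expect(modernText).toEqual(legacyText);
  });

  it("preserves the invoice number verbatim", async () => {
    const modernText = await extractText(await renderInvoicePdf(sampleInvoice));
    expect(modernText).toContain("INV-1042");
  });
});
```

Parity tests like these are the part a human reviewer can rarely verify by reading the diff, which is why the example contract points Review Focus at them.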

PRs over 500 lines of changed code should be split into a chain. Each PR in the chain targets the previous branch, and each covers one bounded concern with its own tests.

Target size: 200-800 lines of code per PR.

| Boundary | What Goes In This PR | Example |
| --- | --- | --- |
| Type layer | Types, interfaces, schemas | PaymentIntent, InvoiceSchema, shared DTOs |
| Gateway/Client | External API integration | Stripe client, legacy SOAP adapter |
| Adapter/Boundary | Integration with existing code | Anti-corruption layer, event translators |
| Infrastructure | Migrations, models, config | Database schema changes, environment config |
| Service | Business logic (one service per PR) | PaymentProcessingService, InvoiceRenderer |
| Route/Activation | Wiring, feature flags, routing | API endpoint registration, feature toggle |

The chain follows dependency direction: types first (no dependencies), then infrastructure (depends on types), then services (depends on infrastructure), then activation (depends on everything).
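
As a concrete picture of that first link, the type-layer PR might contain nothing but shared shapes. The names come from the boundary table above; the fields themselves are illustrative, not taken from any real schema:

```typescript
// src/types/payments.ts -- type-layer PR: shared shapes only, no runtime dependencies
export type PaymentStatus = "pending" | "succeeded" | "failed";

export interface PaymentIntent {
  id: string;
  amountCents: number;
  currency: string;        // ISO 4217 code, e.g. "EUR"
  status: PaymentStatus;
}

// Shared DTO that later PRs in the chain (service, routes) will depend on.
export interface InvoiceSchema {
  invoiceId: string;
  customerId: string;
  lineItems: Array<{ description: string; quantity: number; unitPriceCents: number }>;
  issuedAt: string;        // ISO 8601 timestamp
}
```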

Contrast that with a single monolithic PR that skips the chain:

PR #200: "Migrate payment processing" (2,400 LOC)
- New types
- Database migration
- Stripe client wrapper
- Payment service
- API routes
- 38 tests mixed together

Reviewers face a wall of changes: dependencies are hard to trace, and a partial revert is nearly impossible. Split into a chain (types, then the migration, then the Stripe client, then the service, then the routes), each piece can be reviewed, merged, and reverted on its own.

Code review during modernization needs multiple layers. No single layer catches everything.

| Layer | Type | What It Catches |
| --- | --- | --- |
| 1. Static Analysis | Automated | Type errors, lint violations, security patterns, formatting |
| 2. CI/CD Pipeline | Automated | Test failures, build errors, dependency issues |
| 3. AI Semantic Review | Automated | Logic correctness, edge cases, security anti-patterns, code quality |
| 4. Human Review | Manual | Architecture fit, intent alignment, business logic, risk assessment |
| 5. Human Approval | Manual | Merge authority, deployment readiness |

**AI Catches**

- Logic correctness and edge cases
- Security patterns (injection, auth bypass, secrets in code)
- Code quality (naming, dead code, unused imports)
- Pattern compliance (project conventions)
- Common bugs (off-by-one, null dereference, race conditions)

**Humans Evaluate**

- Intent: Is this the right thing to build?
- Architecture: Does it fit the system’s direction?
- Risk: What could go wrong in production?
- Business logic: Does the domain behavior match reality?
- Test quality: Are the tests testing the right things?

AI review tools are good at finding things that are objectively wrong. Humans are needed for things that require judgment about what should be built and how it fits the bigger picture.

When reviewing, categorize findings by type and severity. This creates a shared vocabulary for the team and sets clear expectations about what blocks a merge.

| Category | Severity | Examples | Blocks Merge? |
| --- | --- | --- | --- |
| Bug | Critical / High | Logic error, race condition, unhandled edge case | Yes |
| Security | Critical / High | Auth bypass, injection, secrets exposure, SSRF | Yes |
| Architecture | Medium / High | Boundary violation, tight coupling, DDD drift | Yes (if High) |
| Performance | Medium | N+1 queries, memory leak, unbounded collection | Depends |
| Pattern | Low / Medium | Style inconsistency, convention violation | No |
| Quality | Low | Missing test, unclear name, dead code | No |

Severity threshold: Critical and High findings must be resolved before merge. Medium findings should be addressed or tracked. Low findings are suggestions.
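
The threshold is easier to enforce consistently when it is written down as code rather than convention. A sketch of one way to encode it; the type and function names are illustrative, not from any particular review tool:

```typescript
// review-findings.ts -- encodes the severity threshold from the table above
type Category = "bug" | "security" | "architecture" | "performance" | "pattern" | "quality";
type Severity = "critical" | "high" | "medium" | "low";

interface Finding {
  category: Category;   // used for labeling and reporting
  severity: Severity;   // drives the merge decision
  description: string;
}

// Critical and High block the merge; Medium is tracked; Low is a suggestion.
function blocksMerge(finding: Finding): boolean {
  return finding.severity === "critical" || finding.severity === "high";
}

function mergeAllowed(findings: Finding[]): boolean {
  return !findings.some(blocksMerge);
}

// Example: one High security finding is enough to hold the PR.
const findings: Finding[] = [
  { category: "security", severity: "high", description: "Missing auth check on admin route" },
  { category: "quality", severity: "low", description: "Dead code in invoice mapper" },
];
console.log(mergeAllowed(findings)); // false
```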

A lightweight checklist that prevents common problems from entering the repository:

| Check | Why |
| --- | --- |
| Tests pass | Regressions caught immediately |
| Build succeeds | Broken builds block the entire team |
| Lint/format clean | Consistent codebase, no noise in diffs |
| No TODO without ticket reference | TODOs without tickets are forgotten promises |
| No debug logging left behind | `console.log("here")` does not belong in production |
| No hardcoded secrets or credentials | Even in test code, use environment variables |
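
The first three checks come from the existing toolchain (test runner, compiler, linter). The last three can be caught with a small script wired into a pre-commit hook or CI step. A sketch that scans the staged diff; the ticket-reference format and the secret heuristic are assumptions, and a dedicated scanner will do the last check far better:

```typescript
// scripts/pre-merge-checks.ts -- scans the staged diff for checklist violations
import { execSync } from "node:child_process";

const diff = execSync("git diff --cached --unified=0", { encoding: "utf8" });
const addedLines = diff.split("\n").filter((l) => l.startsWith("+") && !l.startsWith("+++"));

const violations: string[] = [];

for (const line of addedLines) {
  // TODOs must reference a ticket, e.g. "TODO(PROJ-123): ..."
  if (/\bTODO\b/.test(line) && !/TODO\([A-Z]+-\d+\)/.test(line)) {
    violations.push(`TODO without ticket reference: ${line.trim()}`);
  }
  // Leftover debug logging
  if (/console\.(log|debug)\(/.test(line)) {
    violations.push(`Debug logging left behind: ${line.trim()}`);
  }
  // Very rough secret heuristic; dedicated scanners do this properly
  if (/(api[_-]?key|secret|password)\s*[:=]\s*["'][^"']+["']/i.test(line)) {
    violations.push(`Possible hardcoded secret: ${line.trim()}`);
  }
}

if (violations.length > 0) {
  console.error(violations.join("\n"));
  process.exit(1);
}
```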

To know whether the gates are working, track a few metrics:

| Metric | Target | Why |
| --- | --- | --- |
| PR size | Under 800 LOC | Smaller PRs get better reviews |
| Review turnaround | Under 24 hours | Long review queues stall modernization |
| Bug escape rate | Under 5% post-merge | Measures gate effectiveness |
| Test coverage (new code) | Over 80% | New modernization code needs strong coverage |

These are starting points. Adjust targets based on your team’s velocity and risk tolerance. The point is to measure — without metrics, “quality” is just a feeling.
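
Most of these numbers fall out of tools you already run (CI, coverage reports, the issue tracker); PR size is the easiest to check locally. A sketch that compares the current branch against main and flags anything over the 800-line target, assuming origin/main as the base branch:

```typescript
// scripts/pr-size.ts -- warns when a branch exceeds the PR size target
import { execSync } from "node:child_process";

const TARGET_LOC = 800; // from the metrics table above

// "git diff --shortstat" prints e.g. " 12 files changed, 640 insertions(+), 85 deletions(-)"
const stat = execSync("git diff --shortstat origin/main...HEAD", { encoding: "utf8" });
const insertions = Number(/(\d+) insertion/.exec(stat)?.[1] ?? 0);
const deletions = Number(/(\d+) deletion/.exec(stat)?.[1] ?? 0);
const changed = insertions + deletions;

if (changed > TARGET_LOC) {
  console.warn(`PR is ${changed} changed lines (target: under ${TARGET_LOC}). Consider splitting into a chain.`);
} else {
  console.log(`PR size OK: ${changed} changed lines.`);
}
```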