
# Quality Gates

During modernization, legacy and modern code coexist. Changes carry more risk than in a stable codebase — a refactored service might break an integration point nobody documented, or a migrated module might subtly change behavior that downstream systems depend on.

Quality gates create checkpoints that catch these problems before they reach production.

Every pull request over 200 lines of code should include a structured header. This is not bureaucracy — it is communication. Reviewers spend less time understanding the change and more time evaluating it.

## PR Contract
**Intent:** Migrate invoice PDF generation from legacy template engine to modern renderer
**Proof:** 14 tests — 8 unit (renderer logic), 4 integration (PDF output), 2 parity (legacy vs modern output)
**Risk:** Medium — PDF layout differences could affect downstream print workflows
**Review Focus:** Parity test assertions in test/parity/invoice-pdf.test.ts
**Related PRs:** #142 (template engine extraction), #145 (font loading service)
| Field | Purpose | Example |
| --- | --- | --- |
| Intent | Why this change exists (1 sentence) | “Extract payment processing into standalone service” |
| Proof | Test count and what they verify | “12 tests: 6 unit, 4 integration, 2 regression” |
| Risk | Low/Medium/High with explanation | “High — changes database schema with rollback migration” |
| Review Focus | Where human reviewers should spend time | “Business logic in src/services/payment.ts lines 45-120” |
| Related PRs | Dependency chain or “Standalone” | “#201, #203 (must merge in order)” |

The contract helps reviewers prioritize. A “Low risk, Standalone” PR with 90% test coverage needs a lighter touch than a “High risk” one with schema changes and a dependency chain.
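
The parity tests named under Proof and Review Focus in the example above are where legacy and modern behavior get compared directly. A minimal sketch of what test/parity/invoice-pdf.test.ts might contain, assuming a Vitest-style runner and hypothetical legacyRenderInvoicePdf / renderInvoicePdf functions plus an extractText helper (none of these names come from the original PR):

```typescript
// test/parity/invoice-pdf.test.ts (sketch; module paths and helpers are illustrative)
import { describe, expect, it } from "vitest";

import { legacyRenderInvoicePdf } from "../../src/legacy/invoice-pdf"; // legacy template engine
import { renderInvoicePdf } from "../../src/rendering/invoice-pdf";    // modern renderer
import { extractText } from "../helpers/pdf-text";                     // assumed: PDF buffer -> plain text

const sampleInvoice = {
  id: "INV-1042",
  customer: "Acme GmbH",
  lines: [{ description: "Consulting", quantity: 10, unitPriceCents: 15000 }],
  totalCents: 150000,
};

describe("invoice PDF parity: legacy vs modern", () => {
  it("produces the same textual content for the same invoice", async () => {
    const legacyText = await extractText(await legacyRenderInvoicePdf(sampleInvoice));
    const modernText = await extractText(await renderInvoicePdf(sampleInvoice));
    expect(modernText).toEqual(legacyText);
  });

  it("preserves the invoice number verbatim", async () => {
    const modernText = await extractText(await renderInvoicePdf(sampleInvoice));
    expect(modernText).toContain("INV-1042");
  });
});
```

Parity tests like these are the part a human reviewer can rarely verify by reading the diff, which is why the example contract points Review Focus at them.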

PRs over 500 lines of changed code should be split into a chain. Each PR in the chain targets the previous branch, and each covers one bounded concern with its own tests.

Target size: 200-800 lines of code per PR.

| Boundary | What Goes In This PR | Example |
| --- | --- | --- |
| Type layer | Types, interfaces, schemas | PaymentIntent, InvoiceSchema, shared DTOs |
| Gateway/Client | External API integration | Stripe client, legacy SOAP adapter |
| Adapter/Boundary | Integration with existing code | Anti-corruption layer, event translators |
| Infrastructure | Migrations, models, config | Database schema changes, environment config |
| Service | Business logic (one service per PR) | PaymentProcessingService, InvoiceRenderer |
| Route/Activation | Wiring, feature flags, routing | API endpoint registration, feature toggle |

The chain follows dependency direction: types first (no dependencies), then infrastructure (depends on types), then services (depends on infrastructure), then activation (depends on everything).
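
As a concrete picture of that first link, the type-layer PR might contain nothing but shared shapes. The names come from the boundary table above; the fields themselves are illustrative, not taken from any real schema:

```typescript
// src/types/payments.ts -- type-layer PR: shared shapes only, no runtime dependencies
export type PaymentStatus = "pending" | "succeeded" | "failed";

export interface PaymentIntent {
  id: string;
  amountCents: number;
  currency: string;        // ISO 4217 code, e.g. "EUR"
  status: PaymentStatus;
}

// Shared DTO that later PRs in the chain (service, routes) will depend on.
export interface InvoiceSchema {
  invoiceId: string;
  customerId: string;
  lineItems: Array<{ description: string; quantity: number; unitPriceCents: number }>;
  issuedAt: string;        // ISO 8601 timestamp
}
```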

Contrast that with a single monolithic PR that skips the chain:

PR #200: "Migrate payment processing" (2,400 LOC)
- New types
- Database migration
- Stripe client wrapper
- Payment service
- API routes
- 38 tests mixed together

Reviewers face a wall of changes: dependencies are hard to trace, and a partial revert is nearly impossible. Split into a chain (types, then the migration, then the Stripe client, then the service, then the routes), each piece can be reviewed, merged, and reverted on its own.

Code review during modernization needs multiple layers. No single layer catches everything.

| Layer | Type | What It Catches |
| --- | --- | --- |
| 1. Static Analysis | Automated | Type errors, lint violations, security patterns, formatting |
| 2. CI/CD Pipeline | Automated | Test failures, build errors, dependency issues |
| 3. AI Semantic Review | Automated | Logic correctness, edge cases, security anti-patterns, code quality |
| 4. Human Review | Manual | Architecture fit, intent alignment, business logic, risk assessment |
| 5. Human Approval | Manual | Merge authority, deployment readiness |

**AI Catches**

- Logic correctness and edge cases
- Security patterns (injection, auth bypass, secrets in code)
- Code quality (naming, dead code, unused imports)
- Pattern compliance (project conventions)
- Common bugs (off-by-one, null dereference, race conditions)

**Humans Evaluate**

- Intent: Is this the right thing to build?
- Architecture: Does it fit the system’s direction?
- Risk: What could go wrong in production?
- Business logic: Does the domain behavior match reality?
- Test quality: Are the tests testing the right things?

AI review tools are good at finding things that are objectively wrong. Humans are needed for things that require judgment about what should be built and how it fits the bigger picture.

When reviewing, categorize findings by type and severity. This creates a shared vocabulary for the team and sets clear expectations about what blocks a merge.

| Category | Severity | Examples | Blocks Merge? |
| --- | --- | --- | --- |
| Bug | Critical / High | Logic error, race condition, unhandled edge case | Yes |
| Security | Critical / High | Auth bypass, injection, secrets exposure, SSRF | Yes |
| Architecture | Medium / High | Boundary violation, tight coupling, DDD drift | Yes (if High) |
| Performance | Medium | N+1 queries, memory leak, unbounded collection | Depends |
| Pattern | Low / Medium | Style inconsistency, convention violation | No |
| Quality | Low | Missing test, unclear name, dead code | No |

Severity threshold: Critical and High findings must be resolved before merge. Medium findings should be addressed or tracked. Low findings are suggestions.
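
The threshold is easier to enforce consistently when it is written down as code rather than convention. A sketch of one way to encode it; the type and function names are illustrative, not from any particular review tool:

```typescript
// review-findings.ts -- encodes the severity threshold from the table above
type Category = "bug" | "security" | "architecture" | "performance" | "pattern" | "quality";
type Severity = "critical" | "high" | "medium" | "low";

interface Finding {
  category: Category;   // used for labeling and reporting
  severity: Severity;   // drives the merge decision
  description: string;
}

// Critical and High block the merge; Medium is tracked; Low is a suggestion.
function blocksMerge(finding: Finding): boolean {
  return finding.severity === "critical" || finding.severity === "high";
}

function mergeAllowed(findings: Finding[]): boolean {
  return !findings.some(blocksMerge);
}

// Example: one High security finding is enough to hold the PR.
const findings: Finding[] = [
  { category: "security", severity: "high", description: "Missing auth check on admin route" },
  { category: "quality", severity: "low", description: "Dead code in invoice mapper" },
];
console.log(mergeAllowed(findings)); // false
```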

A lightweight checklist that prevents common problems from entering the repository:

| Check | Why |
| --- | --- |
| Tests pass | Regressions caught immediately |
| Build succeeds | Broken builds block the entire team |
| Lint/format clean | Consistent codebase, no noise in diffs |
| No TODO without ticket reference | TODOs without tickets are forgotten promises |
| No debug logging left behind | `console.log("here")` does not belong in production |
| No hardcoded secrets or credentials | Even in test code, use environment variables |
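
The first three checks come from the existing toolchain (test runner, compiler, linter). The last three can be caught with a small script wired into a pre-commit hook or CI step. A sketch that scans the staged diff; the ticket-reference format and the secret heuristic are assumptions, and a dedicated scanner will do the last check far better:

```typescript
// scripts/pre-merge-checks.ts -- scans the staged diff for checklist violations
import { execSync } from "node:child_process";

const diff = execSync("git diff --cached --unified=0", { encoding: "utf8" });
const addedLines = diff.split("\n").filter((l) => l.startsWith("+") && !l.startsWith("+++"));

const violations: string[] = [];

for (const line of addedLines) {
  // TODOs must reference a ticket, e.g. "TODO(PROJ-123): ..."
  if (/\bTODO\b/.test(line) && !/TODO\([A-Z]+-\d+\)/.test(line)) {
    violations.push(`TODO without ticket reference: ${line.trim()}`);
  }
  // Leftover debug logging
  if (/console\.(log|debug)\(/.test(line)) {
    violations.push(`Debug logging left behind: ${line.trim()}`);
  }
  // Very rough secret heuristic; dedicated scanners do this properly
  if (/(api[_-]?key|secret|password)\s*[:=]\s*["'][^"']+["']/i.test(line)) {
    violations.push(`Possible hardcoded secret: ${line.trim()}`);
  }
}

if (violations.length > 0) {
  console.error(violations.join("\n"));
  process.exit(1);
}
```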

To know whether the gates are working, track a few metrics:

| Metric | Target | Why |
| --- | --- | --- |
| PR size | Under 800 LOC | Smaller PRs get better reviews |
| Review turnaround | Under 24 hours | Long review queues stall modernization |
| Bug escape rate | Under 5% post-merge | Measures gate effectiveness |
| Test coverage (new code) | Over 80% | New modernization code needs strong coverage |

These are starting points. Adjust targets based on your team’s velocity and risk tolerance. The point is to measure — without metrics, “quality” is just a feeling.
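
Most of these numbers fall out of tools you already run (CI, coverage reports, the issue tracker); PR size is the easiest to check locally. A sketch that compares the current branch against main and flags anything over the 800-line target, assuming origin/main as the base branch:

```typescript
// scripts/pr-size.ts -- warns when a branch exceeds the PR size target
import { execSync } from "node:child_process";

const TARGET_LOC = 800; // from the metrics table above

// "git diff --shortstat" prints e.g. " 12 files changed, 640 insertions(+), 85 deletions(-)"
const stat = execSync("git diff --shortstat origin/main...HEAD", { encoding: "utf8" });
const insertions = Number(/(\d+) insertion/.exec(stat)?.[1] ?? 0);
const deletions = Number(/(\d+) deletion/.exec(stat)?.[1] ?? 0);
const changed = insertions + deletions;

if (changed > TARGET_LOC) {
  console.warn(`PR is ${changed} changed lines (target: under ${TARGET_LOC}). Consider splitting into a chain.`);
} else {
  console.log(`PR size OK: ${changed} changed lines.`);
}
```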