Parity Testing

Parity testing answers a single question: does the new implementation produce the same outputs as the legacy system for the same inputs? This is the bridge between “we built it” and “we can ship it.” ModernizeSpec’s parity-tests.json captures the test cases, expected outputs, and confidence scores that determine when extraction is complete.

Michael Feathers introduced characterization tests in “Working Effectively with Legacy Code.” The insight is simple: don’t test what the code should do. Test what it actually does.

  1. Pick a function or workflow in the legacy system
  2. Call it with known inputs and record the output
  3. Write a test that asserts that exact output — even if the output seems wrong
  4. That test is now a characterization of the system’s actual behavior

The characterization test does not judge whether the behavior is correct. It captures reality. If the legacy system rounds tax to 2 decimal places when it should use 4, the characterization test asserts 2 decimal places. The new system must reproduce this behavior (or the team must explicitly decide to fix it and document the deviation).
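A minimal sketch of what such a test can look like, in Go, assuming the new implementation exposes a hypothetical CalculateTax function; the asserted value is whatever the legacy system actually returned, not what the math says it should be:

package taxation

import "testing"

// TestCharacterizeGSTRounding pins the behavior observed in the legacy
// system: tax is rounded to 2 decimal places even where 4 would be correct.
// The expected value was recorded by running the legacy code.
func TestCharacterizeGSTRounding(t *testing.T) {
    // CalculateTax is a hypothetical entry point in the new implementation.
    got := CalculateTax(1234.567, 0.18)

    // Legacy output, recorded verbatim: 1234.567 * 0.18 = 222.22206,
    // and the legacy system returns 222.22. Do not "fix" this here;
    // intentional deviations belong in parity-tests.json as knownDeviation entries.
    want := 222.22
    if got != want {
        t.Errorf("CalculateTax() = %v, want legacy-observed %v", got, want)
    }
}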

| Approach | Tests Against | Risk |
|---|---|---|
| Specification tests | What the system should do (requirements docs) | Requirements may be outdated, incomplete, or wrong |
| Characterization tests | What the system actually does (runtime output) | Captures bugs as “expected” behavior |

For migration, characterization tests are safer. The legacy system has been running in production — its behavior, including its bugs, is what users depend on. Changing behavior during migration introduces risk that is separate from the extraction itself.

When a characterization test captures a known bug:

  1. Document it in the test: “Legacy rounds to 2 decimal places; should be 4”
  2. Preserve the behavior in the new implementation initially
  3. Create a separate task to fix the bug after migration is proven
  4. Mark it in parity-tests.json with a knownDeviation field

The most scalable approach to parity testing is table-driven: a matrix of inputs and expected outputs, run through both implementations.

| Input | Legacy Output | New Output | Match |
|---|---|---|---|
| Invoice: 3 items, GST 18% | Total: 11,800.00, Tax: 1,800.00 | Total: 11,800.00, Tax: 1,800.00 | Pass |
| Invoice: 1 item, exempt | Total: 500.00, Tax: 0.00 | Total: 500.00, Tax: 0.00 | Pass |
| Invoice: discount + tax | Total: 9,440.00, Tax: 1,440.00 | Total: 9,440.00, Tax: 1,440.00 | Pass |
| Invoice: multi-currency | Total: 850.00 USD, Tax: 153.00 | Total: 850.00 USD, Tax: 153.00 | Pass |
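In Go, this matrix maps naturally onto a table-driven test. A minimal sketch, assuming a hypothetical CalculateInvoice function and simplified input/output shapes; every expected value is copied from the legacy capture rather than re-derived:

package taxation

import "testing"

// Illustrative shapes; the real ones come from the extracted module's API.
type item struct{ Amount float64 }
type invoiceResult struct{ Total, Tax float64 }

func TestInvoiceParity(t *testing.T) {
    cases := []struct {
        name    string
        items   []item
        taxRate float64
        want    invoiceResult // captured from the legacy system
    }{
        {"3 items, GST 18%", []item{{5000}, {3000}, {2000}}, 0.18, invoiceResult{11800.00, 1800.00}},
        {"1 item, exempt", []item{{500}}, 0, invoiceResult{500.00, 0.00}},
    }

    for _, tc := range cases {
        t.Run(tc.name, func(t *testing.T) {
            // CalculateInvoice is a hypothetical entry point in the new implementation.
            got := CalculateInvoice(tc.items, tc.taxRate)
            if got != tc.want {
                t.Errorf("got %+v, want legacy-captured %+v", got, tc.want)
            }
        })
    }
}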

Extract real inputs and outputs from the legacy system’s database or logs:

  1. Query the legacy database for completed transactions
  2. Record the input state (what was sent to the system)
  3. Record the output state (what the system produced)
  4. Use data subsetting to create a manageable fixture set

Advantage: Captures real-world scenarios including edge cases you would never think to write.

Risk: Production data contains PII, which must be anonymized before it can be stored as fixtures.
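A capture script can be small. The sketch below, in Go, assumes a Postgres-style legacy database with an invoices table that stores both the request payload and the computed result as JSON; the table, column, and environment variable names are illustrative:

package main

import (
    "database/sql"
    "fmt"
    "log"
    "os"

    _ "github.com/lib/pq" // any database/sql driver for the legacy database works
)

func main() {
    db, err := sql.Open("postgres", os.Getenv("LEGACY_DB_URL"))
    if err != nil {
        log.Fatal(err)
    }
    defer db.Close()

    // Pull a small, varied subset of completed transactions.
    rows, err := db.Query(`SELECT id, request_json, result_json
                           FROM invoices
                           WHERE status = 'completed'
                           ORDER BY id DESC LIMIT 50`)
    if err != nil {
        log.Fatal(err)
    }
    defer rows.Close()

    for rows.Next() {
        var id int
        var request, result []byte
        if err := rows.Scan(&id, &request, &result); err != nil {
            log.Fatal(err)
        }
        // Anonymize PII here, before anything leaves the production boundary.
        writeFixture(fmt.Sprintf("fixtures/input-%03d.json", id), request)
        writeFixture(fmt.Sprintf("fixtures/golden-%03d.json", id), result)
    }
    if err := rows.Err(); err != nil {
        log.Fatal(err)
    }
}

func writeFixture(path string, data []byte) {
    if err := os.WriteFile(path, data, 0o644); err != nil {
        log.Fatal(err)
    }
}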

Each row in the table becomes an entry in parity-tests.json:

{
  "id": "tax-calc-gst-18",
  "module": "taxation",
  "description": "Standard GST 18% on 3-item invoice",
  "input": {
    "items": [
      { "amount": 5000 },
      { "amount": 3000 },
      { "amount": 2000 }
    ],
    "taxRate": 0.18
  },
  "expectedOutput": {
    "subtotal": 10000.00,
    "taxAmount": 1800.00,
    "total": 11800.00
  },
  "source": "production-capture",
  "status": "passing"
}

Behavioral snapshots are a heavier-weight version of characterization tests. Instead of testing individual functions, they capture the full response of the legacy system to a realistic request.

| Artifact | How to Capture | Storage |
|---|---|---|
| API responses | Record HTTP response body, headers, status | JSON files |
| Database writes | Capture rows written after an operation | SQL or JSON fixtures |
| Computed values | Log intermediate calculations | Structured log entries |
| Side effects | Record emails sent, events emitted, files written | Event log |

Store snapshots as “golden files” — reference outputs that the new system must reproduce exactly.

fixtures/
├── tax-calculation/
│   ├── input-001.json    # Input to the function
│   ├── golden-001.json   # Expected output (captured from legacy)
│   ├── input-002.json
│   └── golden-002.json
└── gl-posting/
    ├── input-001.json
    └── golden-001.json   # Expected GL entries

The test runner:

  1. Reads each input-*.json
  2. Passes it through the new implementation
  3. Compares the output to the corresponding golden-*.json
  4. Reports exact differences (field-level diff, not just pass/fail)
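A sketch of such a runner in Go, assuming the fixture layout above; calculate stands in for the real entry point of the new implementation, and the field-level diff is a simple per-key comparison of the decoded JSON:

package parity

import (
    "encoding/json"
    "os"
    "path/filepath"
    "reflect"
    "strings"
    "testing"
)

// calculate is a stand-in for the new implementation's entry point.
func calculate(input map[string]any) map[string]any {
    panic("call the new implementation here")
}

// TestGoldenFiles feeds every input-*.json through the new implementation
// and diffs the result field by field against the matching golden file.
func TestGoldenFiles(t *testing.T) {
    inputs, err := filepath.Glob("fixtures/tax-calculation/input-*.json")
    if err != nil {
        t.Fatal(err)
    }

    for _, inputPath := range inputs {
        goldenPath := strings.Replace(inputPath, "input-", "golden-", 1)

        t.Run(filepath.Base(inputPath), func(t *testing.T) {
            var input, want map[string]any
            mustReadJSON(t, inputPath, &input)
            mustReadJSON(t, goldenPath, &want)

            got := calculate(input)

            // Report every differing field, not just a single pass/fail.
            for key, wantVal := range want {
                if gotVal, ok := got[key]; !ok || !reflect.DeepEqual(gotVal, wantVal) {
                    t.Errorf("%s: got %v, want legacy golden %v", key, gotVal, wantVal)
                }
            }
        })
    }
}

func mustReadJSON(t *testing.T, path string, v any) {
    t.Helper()
    raw, err := os.ReadFile(path)
    if err != nil {
        t.Fatal(err)
    }
    if err := json.Unmarshal(raw, v); err != nil {
        t.Fatalf("%s: %v", path, err)
    }
}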

When the new system intentionally deviates from legacy behavior (bug fixes, improvements):

  1. Document the deviation in parity-tests.json with knownDeviation
  2. Update the golden file to reflect the new expected output
  3. Record the reason for the change in version control
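A hedged example of what such an entry might look like; the exact knownDeviation shape shown here is illustrative, not a fixed ModernizeSpec schema:

{
  "id": "tax-calc-rounding-precision",
  "module": "taxation",
  "description": "Tax rounding precision on line items",
  "knownDeviation": {
    "legacyBehavior": "Rounds tax to 2 decimal places",
    "newBehavior": "Rounds tax to 4 decimal places",
    "reason": "Rounding bug fixed during migration; golden file updated"
  },
  "status": "passing"
}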

Not all parity is equal. A module with 50 passing happy-path tests but zero error-path tests inspires limited real confidence. Confidence scoring quantifies how trustworthy the parity evidence is.

| Dimension | Weight | Measurement |
|---|---|---|
| Happy path coverage | 1x | Percentage of normal workflows tested |
| Error path coverage | 2x | Percentage of error/exception paths tested |
| Edge case coverage | 2x | Boundary values, empty inputs, maximum sizes |
| Data variety | 1.5x | Diversity of test inputs (currencies, date ranges, entity types) |
| Production traffic representation | 3x | How closely test inputs match actual production usage patterns |

Error paths and production representation are weighted highest because they are where surprises emerge in production.
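One plausible aggregation, sketched in Go, is a weighted average of the per-dimension scores using the weights from the table above; the actual ModernizeSpec scoring may combine the dimensions differently:

package parity

// confidence holds per-dimension coverage scores on a 0-100 scale.
type confidence struct {
    HappyPath                float64
    ErrorPath                float64
    EdgeCases                float64
    DataVariety              float64
    ProductionRepresentation float64
}

// overall is a weighted average: error paths, edge cases, and especially
// production representation count for more than happy-path coverage.
func (c confidence) overall() float64 {
    weighted := 1.0*c.HappyPath +
        2.0*c.ErrorPath +
        2.0*c.EdgeCases +
        1.5*c.DataVariety +
        3.0*c.ProductionRepresentation
    return weighted / (1.0 + 2.0 + 2.0 + 1.5 + 3.0)
}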

| Score | Label | Meaning | Decision |
|---|---|---|---|
| 0-30 | Low | Minimal testing, major gaps | Do not proceed to shadow mode |
| 31-60 | Moderate | Core paths tested, gaps in edges | Proceed with caution, add tests |
| 61-85 | High | Comprehensive testing, few gaps | Ready for shadow mode |
| 86-100 | Very High | Exhaustive testing including production traffic replay | Ready for production cutover |

Confidence scores are recorded per module in parity-tests.json:

{
  "module": "taxation",
  "confidence": {
    "overall": 78,
    "happyPath": 95,
    "errorPath": 45,
    "edgeCases": 72,
    "dataVariety": 80,
    "productionRepresentation": 60
  }
}

This makes confidence transparent to AI agents and team leads reviewing migration progress.

Once parity is proven, the tests serve a second purpose: regression guards. Any future change to the new system that breaks an established parity test must be intentional and documented.

Capture baseline ──▶ Prove parity ──▶ Guard regressions ──▶ Retire
                                                            (when legacy is
                                                             fully decommissioned)

Parity tests are retired only after the legacy system is completely removed. Until then, they remain active as regression guards.

Run parity tests on every pull request that touches an extracted module:

  1. PR modifies code in src/taxation/ → run taxation parity tests
  2. Any failure blocks merge
  3. If a deviation is intentional, the PR must update the golden file and add a knownDeviation entry

Michael Feathers documents 24 techniques for making untestable legacy code testable. The most relevant for parity testing:

Extract Interface

Create an interface for a concrete class. The legacy class implements it. The new class implements it. Test both through the same interface.
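A sketch in Go, with illustrative names: both the legacy adapter and the new implementation satisfy the same interface, so a single parity harness can exercise either one.

package taxation

// Illustrative types; real shapes come from the module's API surface.
type Item struct{ Amount float64 }
type Result struct{ Subtotal, Tax, Total float64 }

// TaxCalculator is the extracted interface.
type TaxCalculator interface {
    Calculate(items []Item, rate float64) Result
}

// LegacyAdapter wraps the existing code path (an HTTP call, FFI bridge, or
// direct port) without changing its behavior.
type LegacyAdapter struct{}

func (LegacyAdapter) Calculate(items []Item, rate float64) Result {
    panic("delegate to the legacy system; not shown in this sketch")
}

// NewCalculator is the reimplementation being validated.
type NewCalculator struct{}

func (NewCalculator) Calculate(items []Item, rate float64) Result {
    var r Result
    for _, it := range items {
        r.Subtotal += it.Amount
    }
    r.Tax = r.Subtotal * rate
    r.Total = r.Subtotal + r.Tax
    return r
}

// A single parity check can run the same inputs through either implementation.
func runParityCase(c TaxCalculator, items []Item, rate float64) Result {
    return c.Calculate(items, rate)
}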

Sprout Method

When adding new behavior to an existing method, write the new behavior in a separate method. Test the sprouted method independently.

Wrap Method

Wrap an existing method to add behavior before/after without modifying the original. Useful for inserting recording hooks.
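A decorator-style sketch in Go, reusing the TaxCalculator interface from the Extract Interface example above: the wrapper records inputs and outputs without touching the wrapped implementation, which makes it a convenient way to harvest characterization data from the legacy adapter.

package taxation

import (
    "encoding/json"
    "log"
)

// RecordingCalculator wraps any TaxCalculator and logs every call's input
// and output so they can later be turned into fixtures.
type RecordingCalculator struct {
    Inner TaxCalculator
}

func (r RecordingCalculator) Calculate(items []Item, rate float64) Result {
    result := r.Inner.Calculate(items, rate)

    in, _ := json.Marshal(map[string]any{"items": items, "taxRate": rate})
    out, _ := json.Marshal(result)
    log.Printf("parity-capture input=%s output=%s", in, out)

    return result
}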

Parameterize Constructor

Replace hard-coded dependencies with constructor parameters. Pass test doubles during parity testing.

Subclass and Override

Create a test subclass that overrides framework-dependent methods with test-friendly implementations. Call the business logic through the subclass.

Skin and Wrap

Create a thin wrapper around a legacy class. The wrapper exposes a clean interface. The new system targets the wrapper’s interface, not the legacy class.

These techniques create seams for testing without modifying the legacy system’s behavior — which is exactly what you need when building characterization tests.

Team Zeta in the PearlThoughts internship independently achieved 100% parity on tax calculation using table-driven tests:

| Scenario | Python Output | Go Output | Match |
|---|---|---|---|
| GST 18% on single item | Tax: 1,800.00 | Tax: 1,800.00 | Pass |
| GST 18% + CESS 1% compound | Tax: 1,918.00 | Tax: 1,918.00 | Pass |
| Inclusive pricing (tax-in-price) | Net: 8,474.58 | Net: 8,474.58 | Pass |
| Multi-rate (5% + 18% items) | Tax: 1,150.00 | Tax: 1,150.00 | Pass |
| Zero-rated export | Tax: 0.00 | Tax: 0.00 | Pass |

They captured Python outputs first, then built Go implementations until every row matched. No specification documents were needed — the Python system was the specification.