
Data Subsetting

Parity testing (proving the new system behaves identically to the old) requires realistic data. Production databases are too large for developer machines, random sampling destroys referential integrity, and synthetic data misses real-world edge cases. Data subsetting solves this by extracting a minimal but complete slice of production data that preserves all relationships.

| Approach | Size | Referential Integrity | Edge Cases | Risk |
|---|---|---|---|---|
| Full production copy | 100% | Preserved | All present | PII exposure, storage cost |
| Random sampling | 1-5% | Broken | Randomly lost | Broken FK references, useless tests |
| Synthetic generation | N/A | By construction | None | Does not reflect reality |
| Seed-based subsetting | 0.1-2% | Preserved | Preserved for seed entities | Manageable with anonymization |

Subsetting gives you the smallest dataset that is still correct — every foreign key resolves, every child record has its parent, and every lookup table entry referenced by the data is included.

The core algorithm is breadth-first search starting from a “seed” entity — a specific record that represents a meaningful business scenario.

  1. Choose a seed record (e.g., a specific customer, invoice, or transaction)
  2. Add the seed to the subset
  3. Find all foreign key references from the seed record
  4. Add referenced records to the subset
  5. Repeat from step 3 for each newly added record
  6. Stop when no new records are discovered (or depth limit reached)
Seed: Invoice #1042
├──▶ Customer "Acme Corp"
│    ├──▶ Address "123 Main St"
│    └──▶ Contact "Jane Doe"
├──▶ Invoice Item × 3
│    ├──▶ Product "Widget A"
│    ├──▶ Product "Widget B"
│    └──▶ Product "Gadget C"
│         └──▶ Product Category "Electronics"
├──▶ Tax Detail × 2
│    ├──▶ Tax Rule "GST 18%"
│    └──▶ Tax Rule "CESS 1%"
├──▶ Company "PearlThoughts"
│    ├──▶ Currency "INR"
│    └──▶ Fiscal Year "2025-2026"
└──▶ GL Entry × 6
     └──▶ Account × 4
          └──▶ Account Type "Revenue"

Starting from one invoice, the traversal collected 25+ records across more than a dozen entity types. This connected subgraph is the minimum viable dataset for testing invoice processing.
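The six-step traversal above can be sketched as a breadth-first search over a foreign-key map. This is a minimal illustration: the `fk_refs` dictionary stands in for live FK lookups against the database, and the `(table, id)` record shape is an assumption, not any specific tool's API.

```python
from collections import deque

def subset_bfs(seed, fk_refs, max_depth=10):
    """Collect the connected subgraph reachable from `seed` via FK references."""
    subset = {seed}
    queue = deque([(seed, 0)])
    while queue:
        record, depth = queue.popleft()
        if depth >= max_depth:
            continue  # depth limit reached; stop expanding this branch
        for ref in fk_refs.get(record, []):
            if ref not in subset:  # terminate once no new records are discovered
                subset.add(ref)
                queue.append((ref, depth + 1))
    return subset

# Toy schema: an invoice references a customer and a line item,
# which in turn reference an address and a product.
fk_refs = {
    ("invoice", 1042): [("customer", 7), ("invoice_item", 1)],
    ("invoice_item", 1): [("product", 3)],
    ("customer", 7): [("address", 12)],
}
subset = subset_bfs(("invoice", 1042), fk_refs)
# subset contains all 5 connected records
```

In a real extractor the inner loop would issue one query per FK column; batching newly discovered IDs per table keeps the query count proportional to depth rather than to record count.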

| Strategy | When to Use | Example |
|---|---|---|
| High-connectivity seed | Maximum coverage per seed | An invoice with many line items, taxes, and payments |
| Edge case seed | Testing specific scenarios | A cancelled invoice, a partial payment, a multi-currency transaction |
| Representative seed | General regression testing | An average transaction that exercises the common code path |
| Multiple seeds | Broad coverage | One seed per major workflow (invoice, payment, credit note, journal entry) |

Pick 3-5 seeds that together cover the critical business scenarios. This typically produces a subset that is 0.1-2% of the production database but exercises 60-80% of the code paths.

The traversal depends on discovering and following foreign key relationships. In well-designed schemas, explicit FKs make this straightforward. Legacy systems rarely have well-designed schemas.

Schema-defined foreign keys — the easy case.

-- DDL tells you the relationship
ALTER TABLE invoice_item
ADD CONSTRAINT fk_invoice
FOREIGN KEY (invoice_id)
REFERENCES invoice(id);

Parse DDL or query information_schema.key_column_usage to extract all explicit FKs automatically.
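`information_schema.key_column_usage` is the MySQL/PostgreSQL route. As a self-contained illustration, the sketch below uses SQLite's equivalent, `PRAGMA foreign_key_list`, which exposes the same from-column/to-table/to-column triples; the schema is a toy assumption.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE invoice (id INTEGER PRIMARY KEY);
    CREATE TABLE invoice_item (
        id INTEGER PRIMARY KEY,
        invoice_id INTEGER REFERENCES invoice(id)
    );
""")

# Each row: (id, seq, ref_table, from_col, to_col, on_update, on_delete, match)
fks = conn.execute("PRAGMA foreign_key_list(invoice_item)").fetchall()
for _, _, ref_table, from_col, to_col, *_ in fks:
    print(f"invoice_item.{from_col} -> {ref_table}.{to_col}")
# prints: invoice_item.invoice_id -> invoice.id
```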

Define FK resolution rules in a configuration file rather than hardcoding them:

{
  "relationships": [
    {
      "from": { "table": "invoice_item", "column": "invoice_id" },
      "to": { "table": "invoice", "column": "id" },
      "type": "explicit"
    },
    {
      "from": { "table": "gl_entry", "column": "party" },
      "to": { "table": "$party_type", "column": "name" },
      "type": "polymorphic",
      "discriminator": "party_type"
    },
    {
      "from": { "table": "sales_invoice", "column": "company" },
      "to": { "table": "company", "column": "name" },
      "type": "implicit"
    }
  ]
}

This configuration is reusable across subset runs and version-controlled alongside the modernization spec.
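A traversal step might consume these rules as sketched below. The `resolve_target` helper and the dict shapes for rules and rows are illustrative assumptions; the key point is that a polymorphic rule reads its target table from the row's discriminator column rather than from the rule itself.

```python
def resolve_target(rule, row):
    """Resolve which (table, column) a row's FK value points to under one rule.

    For polymorphic rules, the target table name is stored in the row's
    discriminator column (e.g. gl_entry.party_type names the target table),
    replacing the "$party_type" placeholder from the config.
    """
    to_table = rule["to"]["table"]
    if rule["type"] == "polymorphic":
        to_table = row[rule["discriminator"]]
    return to_table, rule["to"]["column"]

rule = {
    "from": {"table": "gl_entry", "column": "party"},
    "to": {"table": "$party_type", "column": "name"},
    "type": "polymorphic",
    "discriminator": "party_type",
}
row = {"party": "Acme Corp", "party_type": "customer"}
resolve_target(rule, row)  # -> ("customer", "name")
```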

Subsetting extracts real data. Before using it in development or CI environments, anonymize personally identifiable information (PII) and sensitive business data.

| Technique | How It Works | Preserves | Use Case |
|---|---|---|---|
| Faker replacement | Replace real values with fake but realistic ones | Data type and format | Names, emails, phone numbers |
| Consistent hashing | Same input always produces the same fake output | Referential consistency | Customer names referenced from multiple tables |
| Format-preserving encryption | Encrypted value has same format as original | Format constraints (e.g., phone number length) | Regulated data with format requirements |
| Nullification | Replace with NULL or empty string | Nothing | Data that is genuinely unnecessary |
| Bucketing | Replace precise values with ranges | Approximate distribution | Salaries ($50K-$60K), ages (30-35) |

If “Jane Doe” appears as a customer name in invoice.customer_name, payment.party_name, and address.contact_name, all three must map to the same fake name. Otherwise, the subset has broken references that pass FK checks but fail business logic that compares names.

Real: "Jane Doe" → hash → deterministic seed → Faker → "Alice Chen"

invoice.customer_name: "Jane Doe" → "Alice Chen"
payment.party_name:    "Jane Doe" → "Alice Chen" (same output)
address.contact_name:  "Jane Doe" → "Alice Chen" (same output)

{
  "rules": [
    { "table": "*", "column": "email", "method": "faker", "type": "email" },
    { "table": "*", "column": "phone", "method": "faker", "type": "phone" },
    { "table": "*", "column": "*_name", "method": "consistent_hash", "seed": "project-secret" },
    { "table": "employee", "column": "salary", "method": "bucket", "size": 10000 },
    { "table": "*", "column": "password_hash", "method": "nullify" }
  ]
}
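The `consistent_hash` method can be sketched without the Faker dependency: hash the real value together with the project secret, seed a PRNG with the digest, and draw from a pool of fake names. The pool here is a four-entry illustration; a real implementation would draw from a much larger pool (or Faker itself) to avoid collisions.

```python
import hashlib
import random

FAKE_NAMES = ["Alice Chen", "Bob Singh", "Carol Diaz", "Dan Okafor"]  # illustrative pool

def consistent_fake_name(real_name: str, secret: str = "project-secret") -> str:
    """Deterministically map a real name to a fake one.

    The same input and secret always yield the same output, so references
    to "Jane Doe" in the invoice, payment, and address tables all agree.
    """
    digest = hashlib.sha256(f"{secret}:{real_name}".encode()).hexdigest()
    rng = random.Random(int(digest, 16))  # deterministic seed derived from the hash
    return rng.choice(FAKE_NAMES)

# All three columns map to the same fake name:
a = consistent_fake_name("Jane Doe")  # invoice.customer_name
b = consistent_fake_name("Jane Doe")  # payment.party_name
assert a == b
```

Keeping the secret out of version control matters: anyone holding it can replay the mapping and de-anonymize the subset by brute force over known names.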

The entire subsetting process — seeds, traversal rules, depth limits, anonymization — should be defined in configuration, not code. This makes it repeatable, auditable, and version-controlled.

{
  "version": "1.0",
  "seeds": [
    {
      "table": "sales_invoice",
      "filter": { "name": "SINV-2025-001042" },
      "label": "Standard invoice with taxes"
    },
    {
      "table": "payment_entry",
      "filter": { "name": "PE-2025-000891" },
      "label": "Multi-currency payment"
    }
  ],
  "traversal": {
    "maxDepth": 10,
    "excludeTables": ["__global_search", "activity_log", "error_log"],
    "excludeColumns": ["_liked_by", "_comments"]
  },
  "anonymization": {
    "enabled": true,
    "rules": "See anonymization config above"
  },
  "output": {
    "format": "sql",
    "splitByTable": true,
    "directory": ".agents/modernization/fixtures/"
  }
}

Developer Onboarding

New developers run the subset tool to populate their local database. They get a small, realistic, anonymized dataset without needing access to production.

CI Pipeline

CI runs subset extraction weekly from a staging environment. Parity tests run against this fresh subset. Results feed into parity-tests.json confidence scores.

Parity Testing

Each parity test references specific subset records. The test knows which seed produced the data and can reproduce the exact scenario.

Migration Dry Runs

Before migrating a bounded context, run the subset through both old and new implementations. Compare outputs record by record.
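The record-by-record comparison can be sketched as below. The natural-key field and the row dicts are illustrative assumptions; the essential move is keying both outputs by the same identifier so missing, extra, and differing records all surface.

```python
def diff_records(old_rows, new_rows, key="name"):
    """Compare legacy and modern outputs record by record, keyed by a natural key.

    Returns (key, old_record, new_record) triples for every mismatch;
    a missing record shows up as None on one side.
    """
    old = {r[key]: r for r in old_rows}
    new = {r[key]: r for r in new_rows}
    return sorted(
        (k, old.get(k), new.get(k))
        for k in old.keys() | new.keys()
        if old.get(k) != new.get(k)
    )

legacy = [{"name": "GL-001", "debit": 118.0}, {"name": "GL-002", "debit": 18.0}]
modern = [{"name": "GL-001", "debit": 118.0}, {"name": "GL-002", "debit": 17.5}]
diff_records(legacy, modern)
# one mismatch: GL-002 differs on debit
```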

To test the GL Entry engine in isolation, you need a connected subgraph starting from a single invoice:

| Step | Records Collected | Entity Types |
|---|---|---|
| Seed: Sales Invoice SINV-001042 | 1 | Sales Invoice |
| Items, taxes, payment schedule | 8 | Sales Invoice Item, Sales Taxes and Charges, Payment Schedule |
| Customer, company, cost center | 3 | Customer, Company, Cost Center |
| Accounts (revenue, tax, receivable, COGS) | 4 | Account |
| GL Entries for this invoice | 6 | GL Entry |
| Currency, fiscal year, tax rule | 3 | Currency, Fiscal Year, Tax Rule |
| Account types, parent accounts | 5 | Account (parents in hierarchy) |
| Total | 30 | 12 entity types |

30 records across 12 entity types — compared to a production database with millions of rows. This subset exercises the full GL posting workflow while fitting in a test fixture file.

Data subsetting supports parity testing, which feeds parity-tests.json:

| Subsetting Output | Spec Integration |
|---|---|
| Subset configuration | Referenced in parity-tests.json as fixtureSource |
| Fixture files | Stored in .agents/modernization/fixtures/ |
| Seed descriptions | Documented in parity test description fields |
| Coverage metrics | Feed confidence scores in parity test results |