
Data Subsetting

Parity testing (proving the new system behaves identically to the old) requires realistic data. Production databases are too large for developer machines, random sampling destroys referential integrity, and synthetic data misses real-world edge cases. Data subsetting solves this by extracting a minimal but complete slice of production data that preserves all relationships.

| Approach | Size | Referential Integrity | Edge Cases | Risk |
|---|---|---|---|---|
| Full production copy | 100% | Preserved | All present | PII exposure, storage cost |
| Random sampling | 1-5% | Broken | Randomly lost | Broken FK references, useless tests |
| Synthetic generation | N/A | By construction | None | Does not reflect reality |
| Seed-based subsetting | 0.1-2% | Preserved | Preserved for seed entities | Manageable with anonymization |

Subsetting gives you the smallest dataset that is still correct — every foreign key resolves, every child record has its parent, and every lookup table entry referenced by the data is included.

The core algorithm is breadth-first search starting from a “seed” entity — a specific record that represents a meaningful business scenario.

  1. Choose a seed record (e.g., a specific customer, invoice, or transaction)
  2. Add the seed to the subset
  3. Find all foreign key references from the seed record
  4. Add referenced records to the subset
  5. Repeat from step 3 for each newly added record
  6. Stop when no new records are discovered (or depth limit reached)
Seed: Invoice #1042
├──▶ Customer "Acme Corp"
│    ├──▶ Address "123 Main St"
│    └──▶ Contact "Jane Doe"
├──▶ Invoice Item × 3
│    ├──▶ Product "Widget A"
│    ├──▶ Product "Widget B"
│    └──▶ Product "Gadget C"
│         └──▶ Product Category "Electronics"
├──▶ Tax Detail × 2
│    ├──▶ Tax Rule "GST 18%"
│    └──▶ Tax Rule "CESS 1%"
├──▶ Company "PearlThoughts"
│    ├──▶ Currency "INR"
│    └──▶ Fiscal Year "2025-2026"
└──▶ GL Entry × 6
     └──▶ Account × 4
          └──▶ Account Type "Revenue"

Starting from one invoice, the traversal collected 25+ records across more than a dozen entity types. This connected subgraph is the minimum viable dataset for testing invoice processing.
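The six-step traversal above can be sketched as a breadth-first search over a foreign-key map. This is a minimal illustration: the `fk_refs` dictionary stands in for live FK lookups against the database, and the `(table, id)` record shape is an assumption, not any specific tool's API.

```python
from collections import deque

def subset_bfs(seed, fk_refs, max_depth=10):
    """Collect the connected subgraph reachable from `seed` via FK references."""
    subset = {seed}
    queue = deque([(seed, 0)])
    while queue:
        record, depth = queue.popleft()
        if depth >= max_depth:
            continue  # depth limit reached; stop expanding this branch
        for ref in fk_refs.get(record, []):
            if ref not in subset:  # terminate once no new records are discovered
                subset.add(ref)
                queue.append((ref, depth + 1))
    return subset

# Toy schema: an invoice references a customer and a line item,
# which in turn reference an address and a product.
fk_refs = {
    ("invoice", 1042): [("customer", 7), ("invoice_item", 1)],
    ("invoice_item", 1): [("product", 3)],
    ("customer", 7): [("address", 12)],
}
subset = subset_bfs(("invoice", 1042), fk_refs)
# subset contains all 5 connected records
```

In a real extractor the inner loop would issue one query per FK column; batching newly discovered IDs per table keeps the query count proportional to depth rather than to record count.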

| Strategy | When to Use | Example |
|---|---|---|
| High-connectivity seed | Maximum coverage per seed | An invoice with many line items, taxes, and payments |
| Edge case seed | Testing specific scenarios | A cancelled invoice, a partial payment, a multi-currency transaction |
| Representative seed | General regression testing | An average transaction that exercises the common code path |
| Multiple seeds | Broad coverage | One seed per major workflow (invoice, payment, credit note, journal entry) |

Pick 3-5 seeds that together cover the critical business scenarios. This typically produces a subset that is 0.1-2% of the production database but exercises 60-80% of the code paths.

The traversal depends on discovering and following foreign key relationships. In well-designed schemas, explicit FKs make this straightforward. Legacy systems rarely have well-designed schemas.

Schema-defined foreign keys — the easy case.

-- DDL tells you the relationship
ALTER TABLE invoice_item
ADD CONSTRAINT fk_invoice
FOREIGN KEY (invoice_id)
REFERENCES invoice(id);

Parse DDL or query information_schema.key_column_usage to extract all explicit FKs automatically.
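`information_schema.key_column_usage` is the MySQL/PostgreSQL route. As a self-contained illustration, the sketch below uses SQLite's equivalent, `PRAGMA foreign_key_list`, which exposes the same from-column/to-table/to-column triples; the schema is a toy assumption.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE invoice (id INTEGER PRIMARY KEY);
    CREATE TABLE invoice_item (
        id INTEGER PRIMARY KEY,
        invoice_id INTEGER REFERENCES invoice(id)
    );
""")

# Each row: (id, seq, ref_table, from_col, to_col, on_update, on_delete, match)
fks = conn.execute("PRAGMA foreign_key_list(invoice_item)").fetchall()
for _, _, ref_table, from_col, to_col, *_ in fks:
    print(f"invoice_item.{from_col} -> {ref_table}.{to_col}")
# prints: invoice_item.invoice_id -> invoice.id
```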

Define FK resolution rules in a configuration file rather than hardcoding them:

{
  "relationships": [
    {
      "from": { "table": "invoice_item", "column": "invoice_id" },
      "to": { "table": "invoice", "column": "id" },
      "type": "explicit"
    },
    {
      "from": { "table": "gl_entry", "column": "party" },
      "to": { "table": "$party_type", "column": "name" },
      "type": "polymorphic",
      "discriminator": "party_type"
    },
    {
      "from": { "table": "sales_invoice", "column": "company" },
      "to": { "table": "company", "column": "name" },
      "type": "implicit"
    }
  ]
}

This configuration is reusable across subset runs and version-controlled alongside the modernization spec.
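A traversal step might consume these rules as sketched below. The `resolve_target` helper and the dict shapes for rules and rows are illustrative assumptions; the key point is that a polymorphic rule reads its target table from the row's discriminator column rather than from the rule itself.

```python
def resolve_target(rule, row):
    """Resolve which (table, column) a row's FK value points to under one rule.

    For polymorphic rules, the target table name is stored in the row's
    discriminator column (e.g. gl_entry.party_type names the target table),
    replacing the "$party_type" placeholder from the config.
    """
    to_table = rule["to"]["table"]
    if rule["type"] == "polymorphic":
        to_table = row[rule["discriminator"]]
    return to_table, rule["to"]["column"]

rule = {
    "from": {"table": "gl_entry", "column": "party"},
    "to": {"table": "$party_type", "column": "name"},
    "type": "polymorphic",
    "discriminator": "party_type",
}
row = {"party": "Acme Corp", "party_type": "customer"}
resolve_target(rule, row)  # -> ("customer", "name")
```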

Subsetting extracts real data. Before using it in development or CI environments, anonymize personally identifiable information (PII) and sensitive business data.

| Technique | How It Works | Preserves | Use Case |
|---|---|---|---|
| Faker replacement | Replace real values with fake but realistic ones | Data type and format | Names, emails, phone numbers |
| Consistent hashing | Same input always produces the same fake output | Referential consistency | Customer names referenced from multiple tables |
| Format-preserving encryption | Encrypted value has same format as original | Format constraints (e.g., phone number length) | Regulated data with format requirements |
| Nullification | Replace with NULL or empty string | Nothing | Data that is genuinely unnecessary |
| Bucketing | Replace precise values with ranges | Approximate distribution | Salaries ($50K-$60K), ages (30-35) |

If “Jane Doe” appears as a customer name in invoice.customer_name, payment.party_name, and address.contact_name, all three must map to the same fake name. Otherwise, the subset has broken references that pass FK checks but fail business logic that compares names.

Real: "Jane Doe" → hash → deterministic seed → Faker → "Alice Chen"

invoice.customer_name: "Jane Doe" → "Alice Chen"
payment.party_name:    "Jane Doe" → "Alice Chen" (same output)
address.contact_name:  "Jane Doe" → "Alice Chen" (same output)

{
  "rules": [
    { "table": "*", "column": "email", "method": "faker", "type": "email" },
    { "table": "*", "column": "phone", "method": "faker", "type": "phone" },
    { "table": "*", "column": "*_name", "method": "consistent_hash", "seed": "project-secret" },
    { "table": "employee", "column": "salary", "method": "bucket", "size": 10000 },
    { "table": "*", "column": "password_hash", "method": "nullify" }
  ]
}
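The `consistent_hash` method can be sketched without the Faker dependency: hash the real value together with the project secret, seed a PRNG with the digest, and draw from a pool of fake names. The pool here is a four-entry illustration; a real implementation would draw from a much larger pool (or Faker itself) to avoid collisions.

```python
import hashlib
import random

FAKE_NAMES = ["Alice Chen", "Bob Singh", "Carol Diaz", "Dan Okafor"]  # illustrative pool

def consistent_fake_name(real_name: str, secret: str = "project-secret") -> str:
    """Deterministically map a real name to a fake one.

    The same input and secret always yield the same output, so references
    to "Jane Doe" in the invoice, payment, and address tables all agree.
    """
    digest = hashlib.sha256(f"{secret}:{real_name}".encode()).hexdigest()
    rng = random.Random(int(digest, 16))  # deterministic seed derived from the hash
    return rng.choice(FAKE_NAMES)

# All three columns map to the same fake name:
a = consistent_fake_name("Jane Doe")  # invoice.customer_name
b = consistent_fake_name("Jane Doe")  # payment.party_name
assert a == b
```

Keeping the secret out of version control matters: anyone holding it can replay the mapping and de-anonymize the subset by brute force over known names.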

The entire subsetting process — seeds, traversal rules, depth limits, anonymization — should be defined in configuration, not code. This makes it repeatable, auditable, and version-controlled.

{
  "version": "1.0",
  "seeds": [
    {
      "table": "sales_invoice",
      "filter": { "name": "SINV-2025-001042" },
      "label": "Standard invoice with taxes"
    },
    {
      "table": "payment_entry",
      "filter": { "name": "PE-2025-000891" },
      "label": "Multi-currency payment"
    }
  ],
  "traversal": {
    "maxDepth": 10,
    "excludeTables": ["__global_search", "activity_log", "error_log"],
    "excludeColumns": ["_liked_by", "_comments"]
  },
  "anonymization": {
    "enabled": true,
    "rules": "See anonymization config above"
  },
  "output": {
    "format": "sql",
    "splitByTable": true,
    "directory": ".agents/modernization/fixtures/"
  }
}

Developer Onboarding

New developers run the subset tool to populate their local database. They get a small, realistic, anonymized dataset without needing access to production.

CI Pipeline

CI runs subset extraction weekly from a staging environment. Parity tests run against this fresh subset. Results feed into parity-tests.json confidence scores.

Parity Testing

Each parity test references specific subset records. The test knows which seed produced the data and can reproduce the exact scenario.

Migration Dry Runs

Before migrating a bounded context, run the subset through both old and new implementations. Compare outputs record by record.
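The record-by-record comparison can be sketched as below. The natural-key field and the row dicts are illustrative assumptions; the essential move is keying both outputs by the same identifier so missing, extra, and differing records all surface.

```python
def diff_records(old_rows, new_rows, key="name"):
    """Compare legacy and modern outputs record by record, keyed by a natural key.

    Returns (key, old_record, new_record) triples for every mismatch;
    a missing record shows up as None on one side.
    """
    old = {r[key]: r for r in old_rows}
    new = {r[key]: r for r in new_rows}
    return sorted(
        (k, old.get(k), new.get(k))
        for k in old.keys() | new.keys()
        if old.get(k) != new.get(k)
    )

legacy = [{"name": "GL-001", "debit": 118.0}, {"name": "GL-002", "debit": 18.0}]
modern = [{"name": "GL-001", "debit": 118.0}, {"name": "GL-002", "debit": 17.5}]
diff_records(legacy, modern)
# one mismatch: GL-002 differs on debit
```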

To test the GL Entry engine in isolation, you need a connected subgraph starting from a single invoice:

| Step | Records Collected | Entity Types |
|---|---|---|
| Seed: Sales Invoice SINV-001042 | 1 | Sales Invoice |
| Items, taxes, payment schedule | 8 | Sales Invoice Item, Sales Taxes and Charges, Payment Schedule |
| Customer, company, cost center | 3 | Customer, Company, Cost Center |
| Accounts (revenue, tax, receivable, COGS) | 4 | Account |
| GL Entries for this invoice | 6 | GL Entry |
| Currency, fiscal year, tax rule | 3 | Currency, Fiscal Year, Tax Rule |
| Account types, parent accounts | 5 | Account (parents in hierarchy) |
| Total | 30 | 12 entity types |

30 records across 12 entity types — compared to a production database with millions of rows. This subset exercises the full GL posting workflow while fitting in a test fixture file.

Data subsetting supports parity testing, which feeds parity-tests.json:

| Subsetting Output | Spec Integration |
|---|---|
| Subset configuration | Referenced in parity-tests.json as fixtureSource |
| Fixture files | Stored in .agents/modernization/fixtures/ |
| Seed descriptions | Documented in parity test description fields |
| Coverage metrics | Feed confidence scores in parity test results |