Developer Onboarding
New developers run the subset tool to populate their local database. They get a small, realistic, anonymized dataset without needing access to production.
Parity testing (proving the new system behaves identically to the old) requires realistic data. Production databases are too large for developer machines, random sampling destroys referential integrity, and synthetic data misses real-world edge cases. Data subsetting solves this by extracting a minimal but complete slice of production data that preserves all relationships.
| Approach | Size | Referential Integrity | Edge Cases | Risk |
|---|---|---|---|---|
| Full production copy | 100% | Preserved | All present | PII exposure, storage cost |
| Random sampling | 1-5% | Broken | Randomly lost | Broken FK references, useless tests |
| Synthetic generation | N/A | By construction | None | Does not reflect reality |
| Seed-based subsetting | 0.1-2% | Preserved | Preserved for seed entities | Manageable with anonymization |
Subsetting gives you the smallest dataset that is still correct — every foreign key resolves, every child record has its parent, and every lookup table entry referenced by the data is included.
The core algorithm is breadth-first search starting from a “seed” entity — a specific record that represents a meaningful business scenario.
Seed: Invoice #1042 │ ├──▶ Customer "Acme Corp" │ ├──▶ Address "123 Main St" │ └──▶ Contact "Jane Doe" │ ├──▶ Invoice Item × 3 │ ├──▶ Product "Widget A" │ ├──▶ Product "Widget B" │ └──▶ Product "Gadget C" │ └──▶ Product Category "Electronics" │ ├──▶ Tax Detail × 2 │ ├──▶ Tax Rule "GST 18%" │ └──▶ Tax Rule "CESS 1%" │ ├──▶ Company "PearlThoughts" │ ├──▶ Currency "INR" │ └──▶ Fiscal Year "2025-2026" │ └──▶ GL Entry × 6 └──▶ Account × 4 └──▶ Account Type "Revenue"Starting from one invoice, the traversal collected 25+ records across 12 entity types. This connected subgraph is the minimum viable dataset for testing invoice processing.
| Strategy | When to Use | Example |
|---|---|---|
| High-connectivity seed | Maximum coverage per seed | An invoice with many line items, taxes, and payments |
| Edge case seed | Testing specific scenarios | A cancelled invoice, a partial payment, a multi-currency transaction |
| Representative seed | General regression testing | An average transaction that exercises the common code path |
| Multiple seeds | Broad coverage | One seed per major workflow (invoice, payment, credit note, journal entry) |
Pick 3-5 seeds that together cover the critical business scenarios. This typically produces a subset that is 0.1-2% of the production database but exercises 60-80% of the code paths.
The traversal depends on discovering and following foreign key relationships. In well-designed schemas, explicit FKs make this straightforward. Legacy systems rarely have well-designed schemas.
Schema-defined foreign keys — the easy case.
-- DDL tells you the relationshipALTER TABLE invoice_item ADD CONSTRAINT fk_invoice FOREIGN KEY (invoice_id) REFERENCES invoice(id);Parse DDL or query information_schema.key_column_usage to extract all explicit FKs automatically.
Convention-based references — columns named *_id or *_name that reference other tables without a schema constraint.
Detection strategies:
customer_id, parent_type, ownerLink("Customer") in Frappe, belongs_to :customer in RailsThese are the most common type in legacy systems and the hardest to discover reliably.
Type + ID pairs where a discriminator column determines which table the ID references.
party_type = "Customer" → customer.name = partyparty_type = "Supplier" → supplier.name = partyparty_type = "Employee" → employee.name = partyRequires framework-specific knowledge or runtime analysis to resolve. The subsetter must read the discriminator value to know which table to follow.
String literals in code that reference entities by name.
frappe.get_doc("Sales Invoice", invoice_name)frappe.db.get_value("Customer", customer_name, "credit_limit")These are invisible in the schema. Discovery requires scanning application code for entity name patterns or monitoring database queries at runtime.
Define FK resolution rules in a configuration file rather than hardcoding them:
{ "relationships": [ { "from": { "table": "invoice_item", "column": "invoice_id" }, "to": { "table": "invoice", "column": "id" }, "type": "explicit" }, { "from": { "table": "gl_entry", "column": "party" }, "to": { "table": "$party_type", "column": "name" }, "type": "polymorphic", "discriminator": "party_type" }, { "from": { "table": "sales_invoice", "column": "company" }, "to": { "table": "company", "column": "name" }, "type": "implicit" } ]}This configuration is reusable across subset runs and version-controlled alongside the modernization spec.
Subsetting extracts real data. Before using it in development or CI environments, anonymize personally identifiable information (PII) and sensitive business data.
| Technique | How It Works | Preserves | Use Case |
|---|---|---|---|
| Faker replacement | Replace real values with fake but realistic ones | Data type and format | Names, emails, phone numbers |
| Consistent hashing | Same input always produces the same fake output | Referential consistency | Customer names referenced from multiple tables |
| Format-preserving encryption | Encrypted value has same format as original | Format constraints (e.g., phone number length) | Regulated data with format requirements |
| Nullification | Replace with NULL or empty string | Nothing | Data that is genuinely unnecessary |
| Bucketing | Replace precise values with ranges | Approximate distribution | Salaries ($50K-$60K), ages (30-35) |
If “Jane Doe” appears as a customer name in invoice.customer_name, payment.party_name, and address.contact_name, all three must map to the same fake name. Otherwise, the subset has broken references that pass FK checks but fail business logic that compares names.
Real: "Jane Doe" → hash → deterministic seed → Faker → "Alice Chen"
invoice.customer_name: "Jane Doe" → "Alice Chen"payment.party_name: "Jane Doe" → "Alice Chen" (same output)address.contact_name: "Jane Doe" → "Alice Chen" (same output){ "rules": [ { "table": "*", "column": "email", "method": "faker", "type": "email" }, { "table": "*", "column": "phone", "method": "faker", "type": "phone" }, { "table": "*", "column": "*_name", "method": "consistent_hash", "seed": "project-secret" }, { "table": "employee", "column": "salary", "method": "bucket", "size": 10000 }, { "table": "*", "column": "password_hash", "method": "nullify" } ]}The entire subsetting process — seeds, traversal rules, depth limits, anonymization — should be defined in configuration, not code. This makes it repeatable, auditable, and version-controlled.
{ "version": "1.0", "seeds": [ { "table": "sales_invoice", "filter": { "name": "SINV-2025-001042" }, "label": "Standard invoice with taxes" }, { "table": "payment_entry", "filter": { "name": "PE-2025-000891" }, "label": "Multi-currency payment" } ], "traversal": { "maxDepth": 10, "excludeTables": ["__global_search", "activity_log", "error_log"], "excludeColumns": ["_liked_by", "_comments"] }, "anonymization": { "enabled": true, "rules": "See anonymization config above" }, "output": { "format": "sql", "splitByTable": true, "directory": ".agents/modernization/fixtures/" }}Developer Onboarding
New developers run the subset tool to populate their local database. They get a small, realistic, anonymized dataset without needing access to production.
CI Pipeline
CI runs subset extraction weekly from a staging environment. Parity tests run against this fresh subset. Results feed into parity-tests.json confidence scores.
Parity Testing
Each parity test references specific subset records. The test knows which seed produced the data and can reproduce the exact scenario.
Migration Dry Runs
Before migrating a bounded context, run the subset through both old and new implementations. Compare outputs record by record.
To test the GL Entry engine in isolation, you need a connected subgraph starting from a single invoice:
| Step | Records Collected | Entity Types |
|---|---|---|
| Seed: Sales Invoice SINV-001042 | 1 | Sales Invoice |
| Items, taxes, payment schedule | 8 | Sales Invoice Item, Sales Taxes and Charges, Payment Schedule |
| Customer, company, cost center | 3 | Customer, Company, Cost Center |
| Accounts (revenue, tax, receivable, COGS) | 4 | Account |
| GL Entries for this invoice | 6 | GL Entry |
| Currency, fiscal year, tax rule | 3 | Currency, Fiscal Year, Tax Rule |
| Account types, parent accounts | 5 | Account (parents in hierarchy) |
| Total | 30 | 12 entity types |
30 records across 12 entity types — compared to a production database with millions of rows. This subset exercises the full GL posting workflow while fitting in a test fixture file.
Data subsetting supports parity testing, which feeds parity-tests.json:
| Subsetting Output | Spec Integration |
|---|---|
| Subset configuration | Referenced in parity-tests.json as fixtureSource |
| Fixture files | Stored in .agents/modernization/fixtures/ |
| Seed descriptions | Documented in parity test description fields |
| Coverage metrics | Feeds confidence scores in parity test results |