Codebase Analysis

Before migrating anything, you need a structured understanding of the legacy system. Codebase analysis builds that understanding through five complementary techniques, each producing data that feeds ModernizeSpec’s complexity.json and domains.json specification files.

Legacy systems are complex along multiple axes. Analyzing only lines of code or cyclomatic complexity misses coupling patterns, framework dependencies, and business rule density. A thorough analysis examines the system across 12 dimensions.

| Dimension | What It Measures | Why It Matters |
|---|---|---|
| Data model | Entities, relationships, cardinality | Defines migration unit boundaries |
| Domain boundaries | Business areas and their ownership | Determines bounded context extraction |
| Code structure | Inheritance, mixins, module organization | Reveals coupling and seam locations |
| API surface | Endpoints, parameters, response shapes | Defines the contract to preserve |
| Business rules | Validation, calculation, workflow logic | The core value to extract |
| Parity requirements | Which behaviors must match exactly | Scopes the testing effort |
| Team capacity | Available skills, domain knowledge | Constrains extraction sequencing |
| Dependencies | Internal and external coupling | Determines extraction order |
| Test coverage | Existing test breadth and depth | Affects confidence in changes |
| Transaction patterns | Write paths, consistency boundaries | Identifies aggregate roots |
| Event flows | Hooks, callbacks, implicit triggers | Reveals hidden execution chains |
| Technical debt | Dead code, workarounds, TODOs | Identifies what not to migrate |

The framework is intentionally broad, and not every dimension applies equally to every system: a stateless microservice has minimal transaction-pattern complexity, while for a financial ledger system transaction patterns are the dominant concern.

  1. Score each dimension on a 1-5 scale for the overall system
  2. Score each module/component individually across dimensions
  3. Aggregate scores into a complexity tier (see Complexity Scoring below)
  4. Record results in complexity.json with per-component breakdowns
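
The recording step can be sketched as follows. This is a minimal illustration, not ModernizeSpec's actual schema: the field names dimensions and overallScore come from the mapping table later on this page, while the components wrapper, the name and score keys, the sample scores, and the choice of a plain average are all assumptions made for the example.

```python
import json
from statistics import mean

# Hypothetical 1-5 scores per component; dimension names follow the table above.
component_scores = {
    "accounts_controller": {"Dependencies": 5, "Event flows": 4, "Test coverage": 2},
    "mode_of_payment": {"Dependencies": 1, "Event flows": 1, "Test coverage": 3},
}


def build_complexity_record(scores):
    """Aggregate per-component dimension scores into one serializable record."""
    components = []
    for name, dims in sorted(scores.items()):
        components.append({
            "name": name,  # assumed key, not confirmed by the spec
            "dimensions": [{"name": d, "score": s} for d, s in dims.items()],
            # Assumption: the overall score is a plain average of the dimensions.
            "overallScore": round(mean(dims.values()), 2),
        })
    return {"components": components}


record = build_complexity_record(component_scores)
print(json.dumps(record, indent=2))
```

A real project would likely weight dimensions differently per system, as the paragraph on dimension relevance suggests; the average is only the simplest placeholder.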

A dependency graph makes invisible coupling visible. Build a directed graph where nodes are code artifacts (files, classes, functions, modules) and edges represent relationships.

| Edge Type | Example | Detection Method |
|---|---|---|
| Import/require | import { Ledger } from './ledger' | Static analysis of import statements |
| Function call | calculateTax(invoice) | AST traversal of call expressions |
| Inheritance | class SalesInvoice extends TransactionBase | Class hierarchy extraction |
| Data flow | order.total = sum(order.items) | Assignment and parameter tracking |
| Configuration | hooks = { "validate": "custom_validator" } | Framework config file parsing |
| String reference | frappe.get_doc("Sales Invoice", name) | String literal analysis (framework-specific) |
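
As an illustration of the first edge type, the sketch below extracts import edges from Python source using the standard library's ast module. The function name import_edges and the sample module names are hypothetical; other languages would need their own parser, and the remaining edge types need deeper analysis than this.

```python
import ast


def import_edges(module_name, source):
    """Return sorted (importer, imported) edges found in one Python source string."""
    edges = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            for alias in node.names:
                edges.add((module_name, alias.name))
        elif isinstance(node, ast.ImportFrom) and node.module:
            # Bare relative imports ("from . import x") have no module name
            # and are skipped in this sketch.
            edges.add((module_name, node.module))
    return sorted(edges)


sample = "import json\nfrom ledger import Ledger\n"
print(import_edges("invoice", sample))
```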

Once constructed, extract these metrics from the graph:

| Metric | Formula | Use |
|---|---|---|
| Fan-in | Count of incoming edges | High fan-in = widely depended upon, extract early |
| Fan-out | Count of outgoing edges | High fan-out = depends on many things, extract late |
| Coupling factor | Fan-in + fan-out | Overall coupling measure |
| Betweenness centrality | Fraction of shortest paths passing through node | Identifies bridge components |
| Strongly connected components | Nodes in mutual dependency cycles | Identifies tightly coupled clusters |
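
A minimal sketch of three of these metrics in plain Python, assuming the graph is given as a list of (source, target) pairs. The edge list is hypothetical. Strongly connected components are found with an iterative form of Kosaraju's algorithm. Betweenness centrality is left out because it is usually taken from a graph library rather than written by hand.

```python
from collections import defaultdict


def graph_metrics(edges):
    """Compute fan-in, fan-out, and coupling factor for every node."""
    fan_in, fan_out = defaultdict(int), defaultdict(int)
    nodes = set()
    for src, dst in set(edges):  # de-duplicate repeated edges
        fan_out[src] += 1
        fan_in[dst] += 1
        nodes.update((src, dst))
    return {n: {"fanIn": fan_in[n], "fanOut": fan_out[n],
                "coupling": fan_in[n] + fan_out[n]} for n in sorted(nodes)}


def strongly_connected_components(edges):
    """Kosaraju's algorithm, iterative; returns only components with a cycle."""
    forward, backward = defaultdict(list), defaultdict(list)
    nodes = set()
    for src, dst in edges:
        forward[src].append(dst)
        backward[dst].append(src)
        nodes.update((src, dst))

    # Pass 1: record nodes in order of DFS completion on the forward graph.
    order, seen = [], set()
    for start in sorted(nodes):
        if start in seen:
            continue
        seen.add(start)
        stack = [(start, iter(forward[start]))]
        while stack:
            node, neighbours = stack[-1]
            advanced = False
            for nxt in neighbours:
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append((nxt, iter(forward[nxt])))
                    advanced = True
                    break
            if not advanced:
                order.append(node)
                stack.pop()

    # Pass 2: sweep the reversed graph in reverse completion order.
    components, assigned = [], set()
    for start in reversed(order):
        if start in assigned:
            continue
        assigned.add(start)
        component, todo = [], [start]
        while todo:
            node = todo.pop()
            component.append(node)
            for nxt in backward[node]:
                if nxt not in assigned:
                    assigned.add(nxt)
                    todo.append(nxt)
        if len(component) > 1:
            components.append(sorted(component))
    return sorted(components)


# Hypothetical edges for illustration only.
edges = [
    ("sales_invoice", "accounts_controller"),
    ("payment_entry", "accounts_controller"),
    ("accounts_controller", "gl_entry"),
    ("gl_entry", "accounts_controller"),
]
metrics = graph_metrics(edges)
cycles = strongly_connected_components(edges)
print(metrics["accounts_controller"], cycles)
```

In this toy graph accounts_controller has the highest coupling factor and sits in a two-node cycle with gl_entry, which is the kind of cluster that has to be extracted together or broken first.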

Dependency graphs with hundreds of nodes are unreadable as raw diagrams. Use these techniques:

  • Filter by threshold — Show only edges above a minimum weight
  • Cluster by module — Collapse internal edges, show only cross-module connections
  • Highlight hotspots — Color nodes by complexity tier or coupling factor
  • Interactive exploration — ModernizeSpec’s Dependency Graph visualization supports drill-down
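
The first two techniques can be combined in a few lines. The sketch below assumes a module_of lookup from artifact to module, which is a hypothetical structure; it collapses artifact-level edges into weighted cross-module edges and then drops those below a threshold. File and module names are made up for the example.

```python
from collections import Counter


def cross_module_edges(edges, module_of, min_weight=1):
    """Collapse artifact-level edges into weighted module-level edges."""
    weights = Counter()
    for src, dst in edges:
        a, b = module_of[src], module_of[dst]
        if a != b:  # edges inside one module are collapsed away
            weights[(a, b)] += 1
    # Keep only edges at or above the threshold.
    return {pair: w for pair, w in weights.items() if w >= min_weight}


# Hypothetical artifacts and module assignments.
module_of = {
    "sales_invoice.py": "accounts",
    "gl_entry.py": "accounts",
    "stock_entry.py": "stock",
    "stock_ledger.py": "stock",
}
edges = [
    ("sales_invoice.py", "gl_entry.py"),   # internal to accounts
    ("stock_entry.py", "gl_entry.py"),
    ("stock_ledger.py", "gl_entry.py"),
    ("sales_invoice.py", "stock_ledger.py"),
]
print(cross_module_edges(edges, module_of, min_weight=2))
```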

ERPNext’s dependency graph reveals that accounts_controller.py has the highest betweenness centrality in the Accounts module. It sits on the path between every financial transaction and the GL Entry engine. Any extraction plan must account for this bottleneck.

                     ┌──────────────────────┐
                     │ Sales Invoice (421)  │
                     └──────────┬───────────┘
          ┌─────────────────────┼─────────────────────┐
          │                     │                     │
┌─────────▼─────────┐  ┌────────▼────────┐ ┌──────────▼──────────┐
│ Tax Calculator    │  │ accounts_       │ │ Payment Terms       │
│ (fan-in: 12)      │  │ controller.py   │ │ (fan-in: 8)         │
└───────────────────┘  │ (4,412 lines)   │ └─────────────────────┘
                       │ (fan-in: 34)    │
                       │ (fan-out: 28)   │
                       └────────┬────────┘
                       ┌────────▼────────┐
                       │ GL Entry        │
                       │ Engine          │
                       └─────────────────┘

The fan-in of 34 means 34 other modules directly depend on accounts_controller.py. Changing it without breaking dependents requires extracting it through an anti-corruption layer.

Assign a numeric complexity score to each code artifact. This score determines extraction priority and effort estimation, and feeds directly into ModernizeSpec’s complexity.json.

For each module or file, compute a weighted score:

| Factor | Weight | Measurement |
|---|---|---|
| Lines of code | 1x | Raw LOC (excluding blanks and comments) |
| Cyclomatic complexity | 2x | Branches per function, averaged |
| Coupling factor | 3x | Fan-in + fan-out from dependency graph |
| Framework dependency depth | 2x | Layers of framework abstraction (0 = pure logic, 5 = deeply coupled) |
| Implicit behavior count | 3x | Hook registrations, event handlers, convention-driven execution |

The weights reflect a key insight: coupling and implicit behavior are harder to deal with than raw code volume. A 5,000-line file of pure business logic is easier to extract than a 500-line file deeply woven into framework hooks.
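
A minimal sketch of the weighted sum, using the weights from the table. The source does not say how raw measurements are normalized before weighting, so this example assumes every factor has already been mapped to a 0-5 scale. That normalization, the key names, and both sample profiles are assumptions made for illustration.

```python
# Weights from the factor table above.
WEIGHTS = {
    "lines_of_code": 1,
    "cyclomatic_complexity": 2,
    "coupling_factor": 3,
    "framework_dependency_depth": 2,
    "implicit_behavior_count": 3,
}


def complexity_score(factors):
    """Weighted sum of factor values; raises KeyError if a factor is missing."""
    return sum(WEIGHTS[name] * factors[name] for name in WEIGHTS)


# Assumed 0-5 normalized values (hypothetical profiles).
large_pure_logic = {
    "lines_of_code": 5, "cyclomatic_complexity": 2, "coupling_factor": 1,
    "framework_dependency_depth": 0, "implicit_behavior_count": 0,
}
small_hook_heavy = {
    "lines_of_code": 1, "cyclomatic_complexity": 2, "coupling_factor": 4,
    "framework_dependency_depth": 5, "implicit_behavior_count": 4,
}
print(complexity_score(large_pure_logic), complexity_score(small_hook_heavy))
```

Under these assumed inputs the large pure-logic file scores 12 and the small hook-heavy file scores 39, which matches the point that coupling and implicit behavior outweigh raw volume.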

Tier 1 (Standalone), score 0-20: pure business logic with no framework dependencies. Can be extracted by copying the code and adding tests.

Examples: Utility functions, math calculations, data transformations, format converters.

| Tier | Count | Examples |
|---|---|---|
| Tier 1 (Standalone) | ~40 doctypes | Mode of Payment, Currency Exchange, Cost Center |
| Tier 2 (Lightly Coupled) | ~120 doctypes | Tax Rule, Pricing Rule, Payment Terms |
| Tier 3 (Framework Entangled) | ~280 doctypes | Sales Invoice, Purchase Order, Stock Entry |
| Tier 4 (Core Infrastructure) | ~81 doctypes | DocType, Custom Field, Workflow, Report Builder |

The distribution shows that only ~8% of doctypes (Tier 1) are easily extractable. The majority, roughly 54%, sit at Tier 3 and require significant anti-corruption layer work.
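
The percentages follow directly from the approximate counts in the table; a quick check, treating the rounded counts as exact:

```python
# Approximate doctype counts per tier, taken from the table above.
tier_counts = {"Tier 1": 40, "Tier 2": 120, "Tier 3": 280, "Tier 4": 81}

total = sum(tier_counts.values())
shares = {tier: round(100 * n / total, 1) for tier, n in tier_counts.items()}
print(total, shares)
```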

Raw source code is the lowest level of understanding. Code intelligence builds upward through three layers, each adding meaning:

Layer 1, Symbol: what exists in the code. Functions, classes, variables, constants, types, interfaces.

  • Extracted via AST parsing
  • Language-specific but straightforward
  • Tools: Tree-sitter, Language Server Protocol (LSP) servers, ctags
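
For a Python codebase, symbol extraction can be sketched with the standard library's ast module, used here as a stand-in for the tools listed above. The function extract_symbols and the sample source are hypothetical, and the sketch covers only classes and functions, not variables or constants.

```python
import ast


def extract_symbols(source):
    """List (kind, name, line) for every class and function definition."""
    symbols = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.ClassDef):
            symbols.append(("class", node.name, node.lineno))
        elif isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            symbols.append(("function", node.name, node.lineno))
    return sorted(symbols, key=lambda s: s[2])  # order by source line


sample = '''class SalesInvoice:
    def calculate_net_total(self):
        return 0

def on_submit(doc):
    pass
'''
print(extract_symbols(sample))
```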

Layer 2, Relationship: how symbols connect. Function calls, inheritance chains, import graphs, data flow paths.

  • Built from the dependency graph (see above)
  • Combines static analysis with type information
  • Tools: Language servers, call graph generators, import analyzers

Layer 3, Semantic: what the code means in business terms. This is the hardest layer because it maps code artifacts to business concepts.

  • Requires domain knowledge or AI-assisted classification
  • Maps function calculate_net_total() to business concept “Invoice Total Calculation”
  • Maps class SalesInvoice to bounded context “Invoicing”
  • Maps hook on_submit to business event “Invoice Finalized”

The semantic layer is what ModernizeSpec’s domains.json captures: each bounded context groups code artifacts by their business meaning, not their file location.
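
One way to picture that grouping: start from per-artifact annotations and fold them into contexts. The sketch below is not the domains.json schema. Only contexts[].capabilities[] is named on this page, so the name and artifacts keys are assumptions, as are treating a business event as a capability and the post_ledger_entry entry.

```python
from collections import defaultdict

# Annotations produced by a human or an AI-assisted classifier.
# The first three follow the mappings listed above; the last is hypothetical.
annotations = [
    {"artifact": "calculate_net_total", "context": "Invoicing",
     "capability": "Invoice Total Calculation"},
    {"artifact": "SalesInvoice", "context": "Invoicing"},
    {"artifact": "on_submit", "context": "Invoicing",
     "capability": "Invoice Finalized"},
    {"artifact": "post_ledger_entry", "context": "Ledger",
     "capability": "Ledger Posting"},
]


def group_by_context(notes):
    """Group artifacts by business context rather than by file location."""
    contexts = defaultdict(lambda: {"capabilities": [], "artifacts": []})
    for note in notes:
        ctx = contexts[note["context"]]
        ctx["artifacts"].append(note["artifact"])
        capability = note.get("capability")
        if capability and capability not in ctx["capabilities"]:
            ctx["capabilities"].append(capability)
    return {"contexts": [{"name": name, **body}
                         for name, body in sorted(contexts.items())]}


domains = group_by_context(annotations)
print(domains)
```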

Start with Layer 1 (automated, fast), add Layer 2 (semi-automated, moderate effort), then build Layer 3 over time (human-guided, high value).

| Layer | Automation Level | Effort | Value |
|---|---|---|---|
| Symbol | Fully automated | Hours | Foundation for everything else |
| Relationship | Mostly automated | Days | Reveals coupling and seam locations |
| Semantic | Human-guided, AI-assisted | Weeks | Enables domain-driven extraction |

The entity relationship atlas maps every persistent entity to its relationships. In relational systems, this starts with the database schema. In document-oriented or ORM-heavy systems, it requires parsing model definitions.

| Type | Detection | Complexity |
|---|---|---|
| Explicit FK | Schema-defined foreign keys | Low (direct parsing) |
| Implicit FK | String-based lookups (get("Customer", name)) | Medium (requires pattern matching) |
| Polymorphic | Type discriminator columns, generic relations | High (requires runtime analysis) |
| Computed | Derived relationships via queries or aggregation | High (requires code tracing) |
| Convention-based | Framework naming conventions (e.g., parent_type + parent_name) | Medium (framework-specific) |

  1. Extract explicit relationships from schema definitions (DDL, ORM models, framework entity definitions)
  2. Scan for implicit references by searching for string literals that match entity names
  3. Map polymorphic associations by finding type discriminator patterns
  4. Validate with runtime data by querying actual FK values in production/staging databases
  5. Record in domains.json as relationship edges between bounded context entities
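
Step 2 can be sketched for Python source by walking the AST for string literals that exactly match a known entity name. The function implicit_references and the sample code are hypothetical. Exact matching will miss names built at runtime and can flag strings that merely happen to equal an entity name, so hits should be treated as candidates to review.

```python
import ast


def implicit_references(source, entity_names):
    """Find string literals in Python source that exactly match an entity name."""
    hits = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Constant) and isinstance(node.value, str):
            if node.value in entity_names:
                hits.append((node.value, node.lineno))
    return sorted(hits, key=lambda h: (h[1], h[0]))


# The sample is only parsed, never executed, so frappe need not be installed.
sample = '''def post(doc):
    entry = frappe.get_doc("GL Entry", doc.name)
    customer = frappe.get_doc("Customer", doc.customer)
    note = "not an entity"
'''
entities = {"GL Entry", "Customer", "Sales Invoice"}
print(implicit_references(sample, entities))
```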

ERPNext’s DocType system uses a Link field type for explicit relationships and Dynamic Link for polymorphic associations. A Sales Invoice has:

  • Explicit links: Customer, Company, Cost Center, Project, Currency
  • Dynamic links: party_type + party (can be Customer, Supplier, Employee, etc.)
  • Child table links: Sales Invoice Item, Sales Taxes and Charges, Payment Schedule
  • Implicit references: accounts_controller.py references GL Entry by string name

The atlas reveals that a single Sales Invoice touches 23 other doctypes directly and 47 transitively — critical information for extraction planning.
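
Direct and transitive counts like these come from a reachability walk over the atlas. The sketch below uses breadth-first search over a small, hypothetical subset of links, not the real ERPNext schema. Its transitive set includes the direct neighbours, which is one possible reading of the figures above.

```python
from collections import deque


def reachable(links, start):
    """Return (direct, transitive) sets of entities reachable from start.

    The transitive set includes the direct neighbours.
    """
    direct = set(links.get(start, ())) - {start}
    seen, queue = {start}, deque([start])
    while queue:
        for nxt in links.get(queue.popleft(), ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return direct, seen - {start}


# Hypothetical link map for illustration only.
links = {
    "Sales Invoice": ["Customer", "Company", "Sales Invoice Item"],
    "Sales Invoice Item": ["Item"],
    "Item": ["Item Group"],
    "Customer": ["Customer Group"],
}
direct, transitive = reachable(links, "Sales Invoice")
print(len(direct), len(transitive))
```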

These five techniques produce structured data. Map their outputs to ModernizeSpec files:

| Technique Output | Spec File | Specific Fields |
|---|---|---|
| Multi-dimension scores | complexity.json | dimensions[], overallScore |
| Dependency graph metrics | complexity.json | coupling.fanIn, coupling.fanOut |
| Complexity tier assignment | complexity.json | tier, extractionEffort |
| Semantic layer mapping | domains.json | contexts[].capabilities[] |
| Entity relationship atlas | domains.json | contexts[].entities[], relationships[] |

Run npx modernizespec validate after populating these files to verify schema compliance.