Codebase Analysis

Before migrating anything, you need a structured understanding of the legacy system. Codebase analysis builds that understanding through five complementary techniques, each producing data that feeds ModernizeSpec’s complexity.json and domains.json specification files.

Multi-Dimension Analysis Framework

Legacy systems are complex along multiple axes. Analyzing only lines of code or cyclomatic complexity misses coupling patterns, framework dependencies, and business rule density. A thorough analysis examines the system across 12 dimensions.

Dimension	What It Measures	Why It Matters
Data model	Entities, relationships, cardinality	Defines migration unit boundaries
Domain boundaries	Business areas and their ownership	Determines bounded context extraction
Code structure	Inheritance, mixins, module organization	Reveals coupling and seam locations
API surface	Endpoints, parameters, response shapes	Defines the contract to preserve
Business rules	Validation, calculation, workflow logic	The core value to extract
Parity requirements	Which behaviors must match exactly	Scopes the testing effort
Team capacity	Available skills, domain knowledge	Constrains extraction sequencing
Dependencies	Internal and external coupling	Determines extraction order
Test coverage	Existing test breadth and depth	Affects confidence in changes
Transaction patterns	Write paths, consistency boundaries	Identifies aggregate roots
Event flows	Hooks, callbacks, implicit triggers	Reveals hidden execution chains
Technical debt	Dead code, workarounds, TODOs	Identifies what not to migrate

The framework is intentionally broad. Not every dimension applies equally to every system — a stateless microservice has minimal transaction pattern complexity, while a financial ledger system has it as the dominant concern.

How to Apply

1. Score each dimension on a 1-5 scale for the overall system
2. Score each module/component individually across dimensions
3. Aggregate scores into a complexity tier (see Complexity Scoring below)
4. Record results in complexity.json with per-component breakdowns
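The per-component scoring step can be sketched as follows. This is a minimal illustration, not ModernizeSpec's own tooling: the dimension keys, the component names, the sample scores, and the use of a plain mean are all assumptions made for the example.

```python
from statistics import mean

# The 12 dimensions from the table above, as hypothetical snake_case keys.
DIMENSIONS = [
    "data_model", "domain_boundaries", "code_structure", "api_surface",
    "business_rules", "parity_requirements", "team_capacity", "dependencies",
    "test_coverage", "transaction_patterns", "event_flows", "technical_debt",
]

def summarize(component_scores):
    """Validate 1-5 scores and return the mean score per component."""
    summary = {}
    for component, scores in component_scores.items():
        missing = set(DIMENSIONS) - set(scores)
        if missing:
            raise ValueError(f"{component}: missing dimensions {sorted(missing)}")
        for dim in DIMENSIONS:
            if not 1 <= scores[dim] <= 5:
                raise ValueError(f"{component}.{dim}: {scores[dim]} is outside 1-5")
        summary[component] = round(mean(scores[d] for d in DIMENSIONS), 2)
    return summary

# Hypothetical scores for two components (illustrative values only).
scores = {
    "sales_invoice": {**{d: 4 for d in DIMENSIONS}, "test_coverage": 2},
    "currency_exchange": {**{d: 1 for d in DIMENSIONS}, "business_rules": 2},
}
summary = summarize(scores)
print(summary)
```

A plain mean treats every dimension as equally important. If one dimension dominates your system, a weighted aggregate may be a better fit.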

Dependency Graph Construction

A dependency graph makes invisible coupling visible. Build a directed graph where nodes are code artifacts (files, classes, functions, modules) and edges represent relationships.

Edge Types to Track

Edge Type	Example	Detection Method
Import/require	`import { Ledger } from './ledger'`	Static analysis of import statements
Function call	`calculateTax(invoice)`	AST traversal of call expressions
Inheritance	`class SalesInvoice extends TransactionBase`	Class hierarchy extraction
Data flow	`order.total = sum(order.items)`	Assignment and parameter tracking
Configuration	`hooks = { "validate": "custom_validator" }`	Framework config file parsing
String reference	`frappe.get_doc("Sales Invoice", name)`	String literal analysis (framework-specific)
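For a Python codebase, several of these edge types can be detected with the standard library's `ast` module. The sketch below is one possible approach, assuming Python 3.9+ (for `ast.unparse`). The function name `extract_edges`, the edge-type labels, and the rule that any call ending in `get_doc` with a literal first argument counts as a string reference are assumptions for illustration.

```python
import ast

def extract_edges(module_name, source):
    """Return (source, target, edge_type) tuples found in one module's source."""
    edges = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            for alias in node.names:
                edges.append((module_name, alias.name, "import"))
        elif isinstance(node, ast.ImportFrom):
            edges.append((module_name, node.module or ".", "import"))
        elif isinstance(node, ast.ClassDef):
            for base in node.bases:
                edges.append((node.name, ast.unparse(base), "inheritance"))
        elif isinstance(node, ast.Call):
            callee = ast.unparse(node.func)
            edges.append((module_name, callee, "call"))
            # Framework-specific heuristic: get_doc("<DocType>", ...) lookups.
            if (callee.endswith("get_doc") and node.args
                    and isinstance(node.args[0], ast.Constant)
                    and isinstance(node.args[0].value, str)):
                edges.append((module_name, node.args[0].value, "string_reference"))
    return edges

# Hypothetical module source used only to exercise the extractor.
sample = '''
from ledger import Ledger

class SalesInvoice(TransactionBase):
    def validate(self):
        customer = frappe.get_doc("Customer", self.customer)
        return calculate_tax(self)
'''
edges = extract_edges("sales_invoice", sample)
for edge in edges:
    print(edge)
```

Data-flow and configuration edges are not covered here. They usually need assignment tracking or framework config parsing on top of a pass like this one.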

Graph Metrics

Once constructed, extract these metrics from the graph:

Metric	Formula	Use
Fan-in	Count of incoming edges	High fan-in = widely depended upon, extract early
Fan-out	Count of outgoing edges	High fan-out = depends on many things, extract late
Coupling factor	Fan-in + fan-out	Overall coupling measure
Betweenness centrality	Fraction of shortest paths passing through node	Identifies bridge components
Strongly connected components	Nodes in mutual dependency cycles	Identifies tightly coupled clusters
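Fan-in, fan-out, coupling factor, and strongly connected components can be computed from a plain edge list without a graph library. The sketch below uses Tarjan's algorithm for the cycle detection. The node names and edges are hypothetical and do not reflect ERPNext's real graph. Betweenness centrality is omitted, and a graph library is the more practical route for it.

```python
from collections import defaultdict

def fan_metrics(edges):
    """Compute fan-in, fan-out, and coupling factor for every node."""
    fan_in, fan_out = defaultdict(int), defaultdict(int)
    for src, dst in set(edges):  # ignore duplicate edges
        fan_out[src] += 1
        fan_in[dst] += 1
    nodes = {node for edge in edges for node in edge}
    return {n: {"fan_in": fan_in[n], "fan_out": fan_out[n],
                "coupling": fan_in[n] + fan_out[n]} for n in nodes}

def strongly_connected_components(edges):
    """Tarjan's algorithm; returns only components with more than one node."""
    graph = defaultdict(list)
    for src, dst in edges:
        graph[src].append(dst)
        graph[dst]  # make sure sink nodes are registered too
    index, low, on_stack, stack, result = {}, {}, set(), [], []

    def visit(node):
        index[node] = low[node] = len(index)
        stack.append(node)
        on_stack.add(node)
        for nxt in graph[node]:
            if nxt not in index:
                visit(nxt)
                low[node] = min(low[node], low[nxt])
            elif nxt in on_stack:
                low[node] = min(low[node], index[nxt])
        if low[node] == index[node]:  # node is the root of a component
            component = set()
            while True:
                member = stack.pop()
                on_stack.discard(member)
                component.add(member)
                if member == node:
                    break
            if len(component) > 1:
                result.append(component)

    for node in list(graph):
        if node not in index:
            visit(node)
    return result

# Hypothetical edges: (dependent, dependency).
edges = [
    ("sales_invoice", "accounts_controller"),
    ("payment_entry", "accounts_controller"),
    ("accounts_controller", "gl_entry"),
    ("gl_entry", "accounts_controller"),
]
metrics = fan_metrics(edges)
cycles = strongly_connected_components(edges)
print(metrics["accounts_controller"], cycles)
```

The recursive traversal is fine for illustration, but a very deep dependency chain can exceed Python's recursion limit. An iterative variant avoids that.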

Visualization

Dependency graphs with hundreds of nodes are unreadable as raw diagrams. Use these techniques:

Filter by threshold — Show only edges above a minimum weight
Cluster by module — Collapse internal edges, show only cross-module connections
Highlight hotspots — Color nodes by complexity tier or coupling factor
Interactive exploration — ModernizeSpec’s Dependency Graph visualization supports drill-down
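The first two techniques, thresholding and clustering by module, can be combined in a single pass over the edge list. This is a minimal sketch. The function name `collapse_to_modules`, the file names, and the file-to-module mapping are hypothetical.

```python
from collections import Counter

def collapse_to_modules(edges, module_of, min_weight=1):
    """Collapse file-level edges into weighted cross-module edges.

    Edges inside a single module are dropped. The weight of a module pair
    is the number of file-level edges that cross between the two modules.
    """
    weights = Counter()
    for src, dst in edges:
        src_module, dst_module = module_of[src], module_of[dst]
        if src_module != dst_module:
            weights[(src_module, dst_module)] += 1
    return {pair: w for pair, w in weights.items() if w >= min_weight}

# Hypothetical file-to-module mapping and file-level edges.
module_of = {
    "sales_invoice.py": "accounts", "tax.py": "accounts", "gl_entry.py": "accounts",
    "stock_entry.py": "stock", "bin.py": "stock",
}
file_edges = [
    ("sales_invoice.py", "tax.py"),
    ("sales_invoice.py", "gl_entry.py"),
    ("stock_entry.py", "gl_entry.py"),
    ("stock_entry.py", "tax.py"),
    ("stock_entry.py", "bin.py"),
]
module_edges = collapse_to_modules(file_edges, module_of, min_weight=2)
print(module_edges)
```

Five file-level edges collapse to a single cross-module edge here, which is the kind of reduction that makes a large graph readable.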

ERPNext Example

ERPNext’s dependency graph reveals that accounts_controller.py has the highest betweenness centrality in the Accounts module. It sits on the path between every financial transaction and the GL Entry engine. Any extraction plan must account for this bottleneck.

                     ┌──────────────────────┐
                     │  Sales Invoice (421)  │
                     └──────────┬───────────┘
                                │
          ┌─────────────────────┼─────────────────────┐
          │                     │                      │
┌─────────▼─────────┐ ┌────────▼────────┐ ┌──────────▼──────────┐
│  Tax Calculator    │ │  accounts_      │ │  Payment Terms      │
│  (fan-in: 12)      │ │  controller.py  │ │  (fan-in: 8)        │
└────────────────────┘ │  (4,412 lines)  │ └─────────────────────┘
                       │  (fan-in: 34)   │
                       │  (fan-out: 28)  │
                       └────────┬────────┘
                                │
                       ┌────────▼────────┐
                       │  GL Entry       │
                       │  Engine         │
                       └─────────────────┘

The fan-in of 34 means 34 other modules directly depend on accounts_controller.py. Changing it without breaking dependents requires extracting it through an anti-corruption layer.

Complexity Scoring

Assign a numeric complexity score to each code artifact. This score determines extraction priority and effort estimation, and feeds directly into ModernizeSpec’s complexity.json.

Scoring Formula

For each module or file, compute a weighted score:

Factor	Weight	Measurement
Lines of code	1x	Raw LOC (excluding blanks and comments)
Cyclomatic complexity	2x	Branches per function, averaged
Coupling factor	3x	Fan-in + fan-out from dependency graph
Framework dependency depth	2x	Layers of framework abstraction (0 = pure logic, 5 = deeply coupled)
Implicit behavior count	3x	Hook registrations, event handlers, convention-driven execution

The weights reflect a key insight: coupling and implicit behavior are harder to deal with than raw code volume. A 5,000-line file of pure business logic is easier to extract than a 500-line file deeply woven into framework hooks.
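A sketch of the weighted sum, under one loud assumption: each factor is first normalized to a 0-5 scale, the scale the table gives only for framework dependency depth. The source does not say how lines of code or coupling counts are normalized, so the factor keys and sample values below are illustrative only.

```python
# Weights from the table above; the factor keys are hypothetical names.
WEIGHTS = {
    "loc": 1,
    "cyclomatic": 2,
    "coupling": 3,
    "framework_depth": 2,
    "implicit_behavior": 3,
}

def complexity_score(factors):
    """Weighted sum of factors, each assumed to be pre-normalized to 0-5."""
    for name in WEIGHTS:
        if not 0 <= factors[name] <= 5:
            raise ValueError(f"{name}={factors[name]} is outside the assumed 0-5 scale")
    return sum(WEIGHTS[name] * factors[name] for name in WEIGHTS)

# A large file of pure business logic: high volume, almost no coupling.
large_pure_logic = {"loc": 5, "cyclomatic": 3, "coupling": 1,
                    "framework_depth": 0, "implicit_behavior": 0}
# A small file deeply woven into framework hooks.
small_hook_heavy = {"loc": 1, "cyclomatic": 2, "coupling": 4,
                    "framework_depth": 4, "implicit_behavior": 5}

pure_score = complexity_score(large_pure_logic)
hook_score = complexity_score(small_hook_heavy)
print(pure_score, hook_score)
```

With these sample values the large file scores 14 and the small hook-heavy file scores 40, which matches the point about coupling and implicit behavior outweighing raw volume.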

Complexity Tiers

Tier 1 (Standalone). Score: 0-20. Pure business logic with no framework dependencies. Can be extracted by copying and adding tests.

Examples: Utility functions, math calculations, data transformations, format converters.

ERPNext Example: Tier Distribution

Tier	Count	Examples
Tier 1 (Standalone)	~40 doctypes	Mode of Payment, Currency Exchange, Cost Center
Tier 2 (Lightly Coupled)	~120 doctypes	Tax Rule, Pricing Rule, Payment Terms
Tier 3 (Framework Entangled)	~280 doctypes	Sales Invoice, Purchase Order, Stock Entry
Tier 4 (Core Infrastructure)	~81 doctypes	DocType, Custom Field, Workflow, Report Builder

The distribution reveals that only ~8% of doctypes are easily extractable (Tier 1). The majority sit at Tier 3, requiring significant anti-corruption layer work.
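The shares behind that reading can be checked directly from the table. The counts are the approximate figures given above, so the percentages are approximate as well.

```python
# Approximate doctype counts per tier, taken from the table above.
tier_counts = {"Tier 1": 40, "Tier 2": 120, "Tier 3": 280, "Tier 4": 81}

total = sum(tier_counts.values())
shares = {tier: round(100 * count / total, 1) for tier, count in tier_counts.items()}

for tier, share in shares.items():
    print(f"{tier}: {share}% of {total} doctypes")
```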

Code Intelligence Layers

Raw source code is the lowest level of understanding. Code intelligence builds upward through three layers, each adding meaning:

Layer 1: Symbol Layer

What exists in the code. Functions, classes, variables, constants, types, interfaces.

Extracted via AST parsing
Language-specific but straightforward
Tools: Tree-sitter, language server protocols, ctags
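For Python sources, a symbol layer can be approximated with the standard library's `ast` module instead of the tools listed above. The sketch below is illustrative. The `extract_symbols` name, the symbol-kind labels, and the sample source are assumptions.

```python
import ast

def extract_symbols(source):
    """Return (kind, name, line) tuples for definitions found in the source."""
    symbols = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            symbols.append(("function", node.name, node.lineno))
        elif isinstance(node, ast.ClassDef):
            symbols.append(("class", node.name, node.lineno))
        elif isinstance(node, ast.Assign):
            for target in node.targets:
                if isinstance(target, ast.Name):
                    symbols.append(("variable", target.id, node.lineno))
    return sorted(symbols, key=lambda symbol: symbol[2])

# Hypothetical source used only to exercise the extractor.
sample = (
    "TAX_RATE = 0.18\n"
    "\n"
    "class SalesInvoice:\n"
    "    def calculate_net_total(self):\n"
    "        total = 0\n"
    "        return total\n"
)
symbols = extract_symbols(sample)
for symbol in symbols:
    print(symbol)
```

This pass does not distinguish methods from free functions or constants from local variables. A production symbol index would also record scope.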

Layer 2: Relationship Layer

How symbols connect. Function calls, inheritance chains, import graphs, data flow paths.

Built from the dependency graph (see above)
Combines static analysis with type information
Tools: Language servers, call graph generators, import analyzers

Layer 3: Semantic Layer

What the code means in business terms. This is the hardest layer — it maps code artifacts to business concepts.

Requires domain knowledge or AI-assisted classification
Maps function calculate_net_total() to business concept “Invoice Total Calculation”
Maps class SalesInvoice to bounded context “Invoicing”
Maps hook on_submit to business event “Invoice Finalized”

The semantic layer is what ModernizeSpec’s domains.json captures: each bounded context groups code artifacts by their business meaning, not their file location.

Building Intelligence Incrementally

Start with Layer 1 (automated, fast), add Layer 2 (semi-automated, moderate effort), then build Layer 3 over time (human-guided, high value).

Layer	Automation Level	Effort	Value
Symbol	Fully automated	Hours	Foundation for everything else
Relationship	Mostly automated	Days	Reveals coupling and seam locations
Semantic	Human-guided, AI-assisted	Weeks	Enables domain-driven extraction

Entity Relationship Atlas

The entity relationship atlas maps every persistent entity to its relationships. In relational systems, this starts with the database schema. In document-oriented or ORM-heavy systems, it requires parsing model definitions.

Relationship Types

Type	Detection	Complexity
Explicit FK	Schema-defined foreign keys	Low — direct parsing
Implicit FK	String-based lookups (`get("Customer", name)`)	Medium — requires pattern matching
Polymorphic	Type discriminator columns, generic relations	High — requires runtime analysis
Computed	Derived relationships via queries or aggregation	High — requires code tracing
Convention-based	Framework naming conventions (e.g., `parent_type` + `parent_name`)	Medium — framework-specific

Building the Atlas

1. Extract explicit relationships from schema definitions (DDL, ORM models, framework entity definitions)
2. Scan for implicit references by searching for string literals that match entity names
3. Map polymorphic associations by finding type discriminator patterns
4. Validate with runtime data by querying actual FK values in production/staging databases
5. Record in domains.json as relationship edges between bounded context entities
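The implicit-reference scan can be sketched with a regular expression over quoted string literals. This is a rough heuristic with assumed names (`find_implicit_references`, the sample source, the entity set). It only catches single-line literals that match an entity name exactly.

```python
import re
from collections import Counter

# Matches single-line string literals in either quote style.
STRING_LITERAL = re.compile(r"""(["'])(.+?)\1""")

def find_implicit_references(source, entity_names):
    """Count string literals that exactly match a known entity name."""
    hits = Counter()
    for match in STRING_LITERAL.finditer(source):
        literal = match.group(2)
        if literal in entity_names:
            hits[literal] += 1
    return dict(hits)

# Hypothetical source and entity list.
sample = '''
gl = frappe.get_doc("GL Entry", name)
customer = frappe.get_doc('Customer', self.customer)
frappe.msgprint("Customer saved")
'''
entities = {"GL Entry", "Customer", "Supplier"}
references = find_implicit_references(sample, entities)
print(references)
```

A matching literal is only a candidate relationship: the same string can appear in a label or log message. That is why validation against runtime data remains a separate step.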

ERPNext Example

ERPNext’s DocType system uses a Link field type for explicit relationships and Dynamic Link for polymorphic associations. A Sales Invoice has:

Explicit links: Customer, Company, Cost Center, Project, Currency
Dynamic links: party_type + party (can be Customer, Supplier, Employee, etc.)
Child table links: Sales Invoice Item, Sales Taxes and Charges, Payment Schedule
Implicit references: accounts_controller.py references GL Entry by string name

The atlas reveals that a single Sales Invoice touches 23 other doctypes directly and 47 transitively — critical information for extraction planning.

Feeding the Spec

These five techniques produce structured data. Map their outputs to ModernizeSpec files:

Technique Output	Spec File	Specific Fields
Multi-dimension scores	`complexity.json`	`dimensions[]`, `overallScore`
Dependency graph metrics	`complexity.json`	`coupling.fanIn`, `coupling.fanOut`
Complexity tier assignment	`complexity.json`	`tier`, `extractionEffort`
Semantic layer mapping	`domains.json`	`contexts[].capabilities[]`
Entity relationship atlas	`domains.json`	`contexts[].entities[]`, `relationships[]`

Run npx modernizespec validate after populating these files to verify schema compliance.
