Codebase Analysis
Before migrating anything, you need a structured understanding of the legacy system. Codebase analysis builds that understanding through five complementary techniques, each producing data that feeds ModernizeSpec’s complexity.json and domains.json specification files.
Multi-Dimension Analysis Framework
Legacy systems are complex along multiple axes. Analyzing only lines of code or cyclomatic complexity misses coupling patterns, framework dependencies, and business rule density. A thorough analysis examines the system across 12 dimensions.
| Dimension | What It Measures | Why It Matters |
|---|---|---|
| Data model | Entities, relationships, cardinality | Defines migration unit boundaries |
| Domain boundaries | Business areas and their ownership | Determines bounded context extraction |
| Code structure | Inheritance, mixins, module organization | Reveals coupling and seam locations |
| API surface | Endpoints, parameters, response shapes | Defines the contract to preserve |
| Business rules | Validation, calculation, workflow logic | The core value to extract |
| Parity requirements | Which behaviors must match exactly | Scopes the testing effort |
| Team capacity | Available skills, domain knowledge | Constrains extraction sequencing |
| Dependencies | Internal and external coupling | Determines extraction order |
| Test coverage | Existing test breadth and depth | Affects confidence in changes |
| Transaction patterns | Write paths, consistency boundaries | Identifies aggregate roots |
| Event flows | Hooks, callbacks, implicit triggers | Reveals hidden execution chains |
| Technical debt | Dead code, workarounds, TODOs | Identifies what not to migrate |
The framework is intentionally broad. Not every dimension applies equally to every system — a stateless microservice has minimal transaction pattern complexity, while a financial ledger system has it as the dominant concern.
How to Apply
- Score each dimension on a 1-5 scale for the overall system
- Score each module/component individually across dimensions
- Aggregate scores into a complexity tier (see Complexity Scoring below)
- Record results in complexity.json with per-component breakdowns
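The scoring steps above can be sketched in a few lines of Python. This is a minimal illustration, not ModernizeSpec's own aggregation: the component names, the dimension keys, and the choice of a plain average are all assumptions made for the example.

```python
# Hypothetical 1-5 scores; only four of the twelve dimensions are shown for brevity.
COMPONENT_SCORES = {
    "sales_invoice": {"data_model": 4, "business_rules": 5, "dependencies": 5, "test_coverage": 2},
    "mode_of_payment": {"data_model": 1, "business_rules": 1, "dependencies": 1, "test_coverage": 3},
}


def overall_score(scores):
    """Average the per-dimension scores after checking they sit on the 1-5 scale."""
    for dimension, score in scores.items():
        if not 1 <= score <= 5:
            raise ValueError(f"{dimension}: {score} is outside the 1-5 scale")
    return round(sum(scores.values()) / len(scores), 2)


def rank_components(components):
    """Return (name, overall score) pairs, most complex first."""
    ranked = [(name, overall_score(scores)) for name, scores in components.items()]
    return sorted(ranked, key=lambda pair: pair[1], reverse=True)


ranking = rank_components(COMPONENT_SCORES)
```

A plain average treats every dimension as equally important. If one dimension dominates your system, as transaction patterns do in a ledger, a weighted average may fit better.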
Dependency Graph Construction
A dependency graph makes invisible coupling visible. Build a directed graph where nodes are code artifacts (files, classes, functions, modules) and edges represent relationships.
Edge Types to Track
| Edge Type | Example | Detection Method |
|---|---|---|
| Import/require | import { Ledger } from './ledger' | Static analysis of import statements |
| Function call | calculateTax(invoice) | AST traversal of call expressions |
| Inheritance | class SalesInvoice extends TransactionBase | Class hierarchy extraction |
| Data flow | order.total = sum(order.items) | Assignment and parameter tracking |
| Configuration | hooks = { "validate": "custom_validator" } | Framework config file parsing |
| String reference | frappe.get_doc("Sales Invoice", name) | String literal analysis (framework-specific) |
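As a rough sketch of the static-analysis rows in this table, the following uses Python's standard `ast` module to pull import, inheritance, and call edges out of a source string. It assumes the legacy code is Python; the sample source and the `sales_invoice` module name are made up for the example. Data flow, configuration, and string-reference edges need other techniques and are not covered here.

```python
import ast

# Hypothetical legacy module, held in a string so the example is self-contained.
SOURCE = '''
import ledger
from taxes import calculate_tax

class SalesInvoice(TransactionBase):
    def submit(self):
        calculate_tax(self)
        ledger.post(self)
'''


def extract_edges(module_name, source):
    """Return (source_node, edge_type, target) tuples found by walking the AST."""
    edges = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            edges += [(module_name, "import", alias.name) for alias in node.names]
        elif isinstance(node, ast.ImportFrom):
            # node.module is None for "from . import x"; record that as "."
            edges.append((module_name, "import", node.module or "."))
        elif isinstance(node, ast.ClassDef):
            for base in node.bases:
                edges.append((node.name, "inheritance", ast.unparse(base)))
        elif isinstance(node, ast.Call):
            edges.append((module_name, "call", ast.unparse(node.func)))
    return edges


edges = extract_edges("sales_invoice", SOURCE)
```

Call edges are recorded by the name used at the call site (`ledger.post`), not by the resolved definition. Resolving them to actual targets requires type or import information from the relationship layer.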
Graph Metrics
Once constructed, extract these metrics from the graph:
| Metric | Formula | Use |
|---|---|---|
| Fan-in | Count of incoming edges | High fan-in = widely depended upon, extract early |
| Fan-out | Count of outgoing edges | High fan-out = depends on many things, extract late |
| Coupling factor | Fan-in + fan-out | Overall coupling measure |
| Betweenness centrality | Fraction of shortest paths passing through node | Identifies bridge components |
| Strongly connected components | Nodes in mutual dependency cycles | Identifies tightly coupled clusters |
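To make the metrics concrete, here is a small pure-Python sketch that computes fan-in, fan-out, coupling factor, and strongly connected components (using Kosaraju's algorithm) over a toy edge list. The edges are hypothetical. Betweenness centrality is left out because a correct implementation is longer; a graph library is the usual choice for it.

```python
from collections import defaultdict

# Hypothetical edges, written as (dependent, dependency).
EDGES = [
    ("sales_invoice", "accounts_controller"),
    ("purchase_invoice", "accounts_controller"),
    ("accounts_controller", "gl_entry"),
    ("gl_entry", "accounts_controller"),  # mutual dependency, forms a cycle
    ("sales_invoice", "tax_calculator"),
]


def degree_metrics(edges):
    """Fan-in, fan-out, and coupling factor (their sum) for every node."""
    fan_in, fan_out = defaultdict(int), defaultdict(int)
    for src, dst in edges:
        fan_out[src] += 1
        fan_in[dst] += 1
    nodes = {node for edge in edges for node in edge}
    return {
        n: {"fanIn": fan_in[n], "fanOut": fan_out[n], "coupling": fan_in[n] + fan_out[n]}
        for n in nodes
    }


def strongly_connected_components(edges):
    """Kosaraju's algorithm (recursive, so intended for small graphs)."""
    graph, reverse, nodes = defaultdict(list), defaultdict(list), set()
    for src, dst in edges:
        graph[src].append(dst)
        reverse[dst].append(src)
        nodes.update((src, dst))

    order, seen = [], set()

    def visit(node):  # first pass: record nodes in order of DFS completion
        seen.add(node)
        for nxt in graph[node]:
            if nxt not in seen:
                visit(nxt)
        order.append(node)

    for node in sorted(nodes):
        if node not in seen:
            visit(node)

    components, assigned = [], set()

    def collect(node, component):  # second pass: DFS over reversed edges
        assigned.add(node)
        component.add(node)
        for prev in reverse[node]:
            if prev not in assigned:
                collect(prev, component)

    for node in reversed(order):
        if node not in assigned:
            component = set()
            collect(node, component)
            components.append(component)
    return components


metrics = degree_metrics(EDGES)
cycles = [c for c in strongly_connected_components(EDGES) if len(c) > 1]
```

Components with more than one node are the tightly coupled clusters; everything inside one usually has to be extracted together or separated by breaking an edge first.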
Visualization
Dependency graphs with hundreds of nodes are unreadable as raw diagrams. Use these techniques:
- Filter by threshold — Show only edges above a minimum weight
- Cluster by module — Collapse internal edges, show only cross-module connections
- Highlight hotspots — Color nodes by complexity tier or coupling factor
- Interactive exploration — ModernizeSpec’s Dependency Graph visualization supports drill-down
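The first two techniques, threshold filtering and module clustering, amount to a simple preprocessing pass before rendering. The sketch below assumes file paths whose first segment is the module name; the paths themselves are invented for illustration.

```python
from collections import Counter

# Hypothetical file-level edges; the module is taken to be the first path segment.
FILE_EDGES = [
    ("accounts/sales_invoice.py", "accounts/accounts_controller.py"),
    ("accounts/sales_invoice.py", "stock/stock_ledger.py"),
    ("accounts/payment_entry.py", "stock/stock_ledger.py"),
    ("stock/stock_entry.py", "accounts/accounts_controller.py"),
]


def module_of(path):
    return path.split("/", 1)[0]


def collapse_to_modules(edges, min_weight=1):
    """Drop intra-module edges, count cross-module ones, keep those at or above min_weight."""
    weights = Counter(
        (module_of(src), module_of(dst))
        for src, dst in edges
        if module_of(src) != module_of(dst)
    )
    return {pair: weight for pair, weight in weights.items() if weight >= min_weight}


module_edges = collapse_to_modules(FILE_EDGES, min_weight=2)
```

The resulting weighted edge map is small enough to hand to any graph renderer, and raising `min_weight` trims the picture further.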
ERPNext Example
ERPNext’s dependency graph reveals that accounts_controller.py has the highest betweenness centrality in the Accounts module. It sits on the path between every financial transaction and the GL Entry engine. Any extraction plan must account for this bottleneck.
                        ┌──────────────────────┐
                        │ Sales Invoice (421)  │
                        └──────────┬───────────┘
                                   │
              ┌────────────────────┼─────────────────────┐
              │                    │                     │
    ┌─────────▼─────────┐ ┌────────▼────────┐ ┌──────────▼──────────┐
    │ Tax Calculator    │ │ accounts_       │ │ Payment Terms       │
    │ (fan-in: 12)      │ │ controller.py   │ │ (fan-in: 8)         │
    └───────────────────┘ │ (4,412 lines)   │ └─────────────────────┘
                          │ (fan-in: 34)    │
                          │ (fan-out: 28)   │
                          └────────┬────────┘
                                   │
                          ┌────────▼────────┐
                          │ GL Entry        │
                          │ Engine          │
                          └─────────────────┘

The fan-in of 34 means 34 other modules directly depend on accounts_controller.py. Changing it without breaking dependents requires extracting it through an anti-corruption layer.
Complexity Scoring
Assign a numeric complexity score to each code artifact. This score determines extraction priority and effort estimation, and feeds directly into ModernizeSpec’s complexity.json.
Scoring Formula
For each module or file, compute a weighted score:
| Factor | Weight | Measurement |
|---|---|---|
| Lines of code | 1x | Raw LOC (excluding blanks and comments) |
| Cyclomatic complexity | 2x | Branches per function, averaged |
| Coupling factor | 3x | Fan-in + fan-out from dependency graph |
| Framework dependency depth | 2x | Layers of framework abstraction (0 = pure logic, 5 = deeply coupled) |
| Implicit behavior count | 3x | Hook registrations, event handlers, convention-driven execution |
The weights reflect a key insight: coupling and implicit behavior are harder to deal with than raw code volume. A 5,000-line file of pure business logic is easier to extract than a 500-line file deeply woven into framework hooks.
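The table gives weights but not how raw measurements are brought onto a common scale. The sketch below is one possible reading: it assumes each factor is normalized to 0-100 against a cap and then combined as a weighted average, so the result lands in the same 0-100 range the tiers use. The caps and the two sample modules are hypothetical, not values from ModernizeSpec.

```python
# Weights from the scoring table above.
WEIGHTS = {
    "loc": 1,
    "cyclomatic": 2,
    "coupling": 3,
    "framework_depth": 2,
    "implicit_behavior": 3,
}

# Hypothetical caps used to normalize each raw measurement to 0-100.
CAPS = {
    "loc": 5000,
    "cyclomatic": 25,
    "coupling": 60,
    "framework_depth": 5,
    "implicit_behavior": 20,
}


def complexity_score(raw):
    """Weighted average of capped, normalized factors on a 0-100 scale."""
    weighted = 0.0
    for factor, weight in WEIGHTS.items():
        normalized = min(raw[factor] / CAPS[factor], 1.0) * 100
        weighted += weight * normalized
    return round(weighted / sum(WEIGHTS.values()), 1)


# A large file of pure logic versus a small file woven into framework hooks.
pure_logic = {"loc": 5000, "cyclomatic": 5, "coupling": 3, "framework_depth": 0, "implicit_behavior": 0}
hook_heavy = {"loc": 500, "cyclomatic": 10, "coupling": 40, "framework_depth": 4, "implicit_behavior": 15}

pure_score = complexity_score(pure_logic)
hook_score = complexity_score(hook_heavy)
```

Under these assumptions the 5,000-line pure-logic file scores well below the 500-line hook-heavy file, which is the ordering the weights are meant to produce.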
Complexity Tiers
Tier 1 (Standalone). Score: 0-20. Pure business logic with no framework dependencies. Can be extracted by copying and adding tests.
Examples: Utility functions, math calculations, data transformations, format converters.
Tier 2 (Lightly Coupled). Score: 21-50. Business logic with some framework calls that can be replaced with interfaces. Extract by introducing ports.
Examples: Validation rules that call a database lookup, formatters that use framework i18n.
Tier 3 (Framework Entangled). Score: 51-80. Logic deeply interleaved with framework behavior. Requires careful seam identification and anti-corruption layers.
Examples: Controllers that mix routing, validation, and persistence. ORM models with business logic in callbacks.
Tier 4 (Core Infrastructure). Score: 81-100. Framework internals, ORM engines, permission systems. Extraction means replacing the framework itself.
Examples: The ORM layer, authentication/authorization engine, multi-tenancy infrastructure.
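Mapping a score to a tier is a simple range lookup. The sketch below uses the four score bands listed above and the tier names from the ERPNext distribution table. Placing a fractional score such as 20.5 in the next tier up is an assumption, since the bands are only stated for whole numbers.

```python
# (upper bound, tier number, tier name), taken from the score bands above.
TIERS = [
    (20, 1, "Standalone"),
    (50, 2, "Lightly Coupled"),
    (80, 3, "Framework Entangled"),
    (100, 4, "Core Infrastructure"),
]


def tier_for(score):
    """Return (tier number, tier name) for a 0-100 complexity score."""
    if not 0 <= score <= 100:
        raise ValueError(f"score {score} is outside the 0-100 range")
    for upper, number, name in TIERS:
        if score <= upper:
            return number, name
    raise AssertionError("unreachable: 100 is the last upper bound")


example = tier_for(61.4)
```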
ERPNext Example: Tier Distribution
| Tier | Count | Examples |
|---|---|---|
| Tier 1 (Standalone) | ~40 doctypes | Mode of Payment, Currency Exchange, Cost Center |
| Tier 2 (Lightly Coupled) | ~120 doctypes | Tax Rule, Pricing Rule, Payment Terms |
| Tier 3 (Framework Entangled) | ~280 doctypes | Sales Invoice, Purchase Order, Stock Entry |
| Tier 4 (Core Infrastructure) | ~81 doctypes | DocType, Custom Field, Workflow, Report Builder |
The distribution reveals that only ~8% of doctypes (roughly 40 of 521) are easily extractable (Tier 1). The majority, about 54%, sit at Tier 3 and require significant anti-corruption layer work.
Code Intelligence Layers
Raw source code is the lowest level of understanding. Code intelligence builds upward through three layers, each adding meaning:
Layer 1: Symbol Layer
What exists in the code. Functions, classes, variables, constants, types, interfaces.
- Extracted via AST parsing
- Language-specific but straightforward
- Tools: Tree-sitter, language server protocols, ctags
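For a Python codebase, the symbol layer can be approximated with the standard `ast` module alone. The sketch below is illustrative: the sample source is invented, and treating upper-case assignments as constants is a naming-convention assumption, not something the parser knows.

```python
import ast

# Hypothetical source file, inlined so the example is self-contained.
SOURCE = '''
MAX_DISCOUNT = 0.25

class SalesInvoice:
    def calculate_net_total(self):
        return sum(item.amount for item in self.items)

def validate_currency(code):
    return len(code) == 3
'''


def extract_symbols(source):
    """Return (kind, name, line) tuples for classes, functions, and constants."""
    symbols = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.ClassDef):
            symbols.append(("class", node.name, node.lineno))
        elif isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            symbols.append(("function", node.name, node.lineno))
        elif isinstance(node, ast.Assign):
            for target in node.targets:
                # Convention-based guess: UPPER_CASE names are constants.
                if isinstance(target, ast.Name) and target.id.isupper():
                    symbols.append(("constant", target.id, node.lineno))
    return sorted(symbols, key=lambda symbol: symbol[2])


symbols = extract_symbols(SOURCE)
```

For multi-language codebases, a parser generator such as Tree-sitter plays the same role with one grammar per language.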
Layer 2: Relationship Layer
How symbols connect. Function calls, inheritance chains, import graphs, data flow paths.
- Built from the dependency graph (see above)
- Combines static analysis with type information
- Tools: Language servers, call graph generators, import analyzers
Layer 3: Semantic Layer
What the code means in business terms. This is the hardest layer because it maps code artifacts to business concepts.
- Requires domain knowledge or AI-assisted classification
- Maps function calculate_net_total() to business concept “Invoice Total Calculation”
- Maps class SalesInvoice to bounded context “Invoicing”
- Maps hook on_submit to business event “Invoice Finalized”
The semantic layer is what ModernizeSpec’s domains.json captures: each bounded context groups code artifacts by their business meaning, not their file location.
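One way to picture the semantic layer is as a list of annotations that is then grouped by bounded context. The sketch below reuses the three mappings listed above and adds one hypothetical entry (`make_gl_entries`) to show a second context. Assigning the function and the hook to the "Invoicing" context, and the output keys `name` and `symbols`, are assumptions for illustration; only `capabilities` is a field name taken from this page.

```python
from collections import defaultdict

SEMANTIC_MAP = [
    {"symbol": "calculate_net_total", "kind": "function",
     "concept": "Invoice Total Calculation", "context": "Invoicing"},
    {"symbol": "SalesInvoice", "kind": "class", "concept": None, "context": "Invoicing"},
    {"symbol": "on_submit", "kind": "hook",
     "concept": "Invoice Finalized", "context": "Invoicing"},
    # Hypothetical entry, added only to show a second bounded context.
    {"symbol": "make_gl_entries", "kind": "function",
     "concept": "Ledger Posting", "context": "General Ledger"},
]


def group_by_context(annotations):
    """Group annotated symbols by bounded context, independent of file location."""
    contexts = defaultdict(lambda: {"symbols": [], "capabilities": []})
    for entry in annotations:
        context = contexts[entry["context"]]
        context["symbols"].append(entry["symbol"])
        if entry["concept"]:
            context["capabilities"].append(entry["concept"])
    return [{"name": name, **data} for name, data in sorted(contexts.items())]


contexts = group_by_context(SEMANTIC_MAP)
```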
Building Intelligence Incrementally
Start with Layer 1 (automated, fast), add Layer 2 (semi-automated, moderate effort), then build Layer 3 over time (human-guided, high value).
| Layer | Automation Level | Effort | Value |
|---|---|---|---|
| Symbol | Fully automated | Hours | Foundation for everything else |
| Relationship | Mostly automated | Days | Reveals coupling and seam locations |
| Semantic | Human-guided, AI-assisted | Weeks | Enables domain-driven extraction |
Entity Relationship Atlas
The entity relationship atlas maps every persistent entity to its relationships. In relational systems, this starts with the database schema. In document-oriented or ORM-heavy systems, it requires parsing model definitions.
Relationship Types
| Type | Detection | Complexity |
|---|---|---|
| Explicit FK | Schema-defined foreign keys | Low — direct parsing |
| Implicit FK | String-based lookups (get("Customer", name)) | Medium — requires pattern matching |
| Polymorphic | Type discriminator columns, generic relations | High — requires runtime analysis |
| Computed | Derived relationships via queries or aggregation | High — requires code tracing |
| Convention-based | Framework naming conventions (e.g., parent_type + parent_name) | Medium — framework-specific |
Building the Atlas
- Extract explicit relationships from schema definitions (DDL, ORM models, framework entity definitions)
- Scan for implicit references by searching for string literals that match entity names
- Map polymorphic associations by finding type discriminator patterns
- Validate with runtime data by querying actual FK values in production/staging databases
- Record in domains.json as relationship edges between bounded context entities
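The second step, scanning for implicit references, can be sketched as a string-literal search against the list of known entity names. The sample source below is invented (`make_entry` is a hypothetical helper), and the regex is deliberately simple: it handles single-line quoted strings only.

```python
import re

# Entity names would normally come from step 1 (the explicit schema extraction).
ENTITY_NAMES = {"Customer", "GL Entry", "Sales Invoice"}

# Hypothetical legacy code, held as text; it is scanned, never executed.
SOURCE = '''
doc = frappe.get_doc("Sales Invoice", name)
customer = get("Customer", doc.customer)
label = "Customer"
make_entry("GL Entry", doc)
'''

# Matches a single- or double-quoted literal on one line.
STRING_LITERAL = re.compile(r"""(["'])(.*?)\1""")


def find_implicit_references(source, entity_names):
    """Return (line number, entity name) for every literal that names an entity."""
    hits = []
    for lineno, line in enumerate(source.splitlines(), start=1):
        for match in STRING_LITERAL.finditer(line):
            if match.group(2) in entity_names:
                hits.append((lineno, match.group(2)))
    return hits


hits = find_implicit_references(SOURCE, ENTITY_NAMES)
```

The `label = "Customer"` line is reported even though it is not a lookup. String scanning over-reports by design, which is why the runtime validation step follows it.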
ERPNext Example
ERPNext’s DocType system uses a Link field type for explicit relationships and Dynamic Link for polymorphic associations. A Sales Invoice has:
- Explicit links: Customer, Company, Cost Center, Project, Currency
- Dynamic links: party_type + party (can be Customer, Supplier, Employee, etc.)
- Child table links: Sales Invoice Item, Sales Taxes and Charges, Payment Schedule
- Implicit references: accounts_controller.py references GL Entry by string name
The atlas reveals that a single Sales Invoice touches 23 other doctypes directly and 47 transitively — critical information for extraction planning.
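Direct and transitive counts like these come from a reachability walk over the atlas. The sketch below runs a breadth-first search over a toy link map; the links are illustrative and far smaller than the real ERPNext graph. Here the transitive set includes the direct links, which is one possible way of counting.

```python
from collections import deque

# Toy link map: entity -> entities it references. Not the real ERPNext atlas.
LINKS = {
    "Sales Invoice": ["Customer", "Company", "Sales Invoice Item"],
    "Sales Invoice Item": ["Item"],
    "Customer": ["Customer Group"],
    "Company": [],
    "Item": ["Item Group"],
}


def reachable(start, links):
    """Every entity reachable from start by following links (breadth-first)."""
    seen, queue = set(), deque([start])
    while queue:
        current = queue.popleft()
        for target in links.get(current, []):
            if target not in seen and target != start:
                seen.add(target)
                queue.append(target)
    return seen


direct = set(LINKS["Sales Invoice"])
transitive = reachable("Sales Invoice", LINKS)
```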
Feeding the Spec
These five techniques produce structured data. Map their outputs to ModernizeSpec files:
| Technique Output | Spec File | Specific Fields |
|---|---|---|
| Multi-dimension scores | complexity.json | dimensions[], overallScore |
| Dependency graph metrics | complexity.json | coupling.fanIn, coupling.fanOut |
| Complexity tier assignment | complexity.json | tier, extractionEffort |
| Semantic layer mapping | domains.json | contexts[].capabilities[] |
| Entity relationship atlas | domains.json | contexts[].entities[], relationships[] |
Run npx modernizespec validate after populating these files to verify schema compliance.
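As an illustration of the mapping table, the sketch below assembles one component record using the field names listed there and serializes it as JSON. The surrounding layout (a top-level `components` array and a `name` key), the tier and `extractionEffort` values, and the shape of each `dimensions` item are assumptions; check the complexity.json specification for the actual schema. The fan-in and fan-out figures are the ones shown for accounts_controller.py in the dependency graph example.

```python
import json


def component_entry(name, dimensions, overall, fan_in, fan_out, tier, effort):
    """Build one component record using the field names from the mapping table."""
    return {
        "name": name,  # assumed key, not confirmed by the mapping table
        "dimensions": dimensions,
        "overallScore": overall,
        "coupling": {"fanIn": fan_in, "fanOut": fan_out},
        "tier": tier,
        "extractionEffort": effort,
    }


entry = component_entry(
    name="accounts_controller",
    dimensions=[{"name": "dependencies", "score": 5}],  # hypothetical item shape
    overall=4.2,    # hypothetical value
    fan_in=34,
    fan_out=28,
    tier=3,         # hypothetical value
    effort="high",  # hypothetical value and format
)

# "components" as the top-level key is an assumption.
document = json.dumps({"components": [entry]}, indent=2)
```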
See Also
- complexity.json Specification — Schema reference for complexity data
- domains.json Specification — Schema reference for domain mapping
- Domain Decomposition — Next step: break the system into bounded contexts
- Complexity Heatmap — Interactive visualization of complexity data