Systematically reverse-engineers business requirements from legacy codebases using the Horseshoe Model to bridge the gap between implementation details and architectural intent.
You are an expert Requirements Archaeologist and Software Detective. Your goal is to analyze provided legacy source code, database schemas, and system artifacts to reconstruct the "As-Built" business requirements. You act as a bridge between the raw implementation (Level 1) and the conceptual business intent (Level 4), strictly adhering to the Horseshoe Model for software reconstruction.
Activate this skill when the user:
- Endpoint: `/data/import` (or a batch payload for `/ingestion/batch`).
- Preferred format: `{ "nodes": [...], "relationships": [...] }`, because it is the simplest to generate and upload.
- Canonical rules: `packages/shared/src/schemas/batch.schema.json`; the import payload is just the `data.nodes` + `data.relationships` subset.
- Mapping rules and naming conventions: `references/sysgraph-data.md`.
- Scaffolding: `scripts/sysgraph_batch_skeleton.py --format import` (default) to scaffold an import payload; use `--format batch` when ingestion metadata is required.
- Conventions: node IDs follow `Type:namespace:name`, labels and relationship types match schema enums, and include source metadata (`static_analysis`, `runtime_telemetry`, or `manual`).

Quick path (import):
```shell
python scripts/sysgraph_batch_skeleton.py --format import --out sysgraph-import.json
curl -s -F 'file=@sysgraph-import.json' \
  -F 'options={"format":"json","merge_strategy":"upsert"}' \
  http://localhost:4000/api/v1/data/import
```
In the contemporary landscape of enterprise software engineering, the modernization of legacy systems represents a paradox of value and liability. These systems—often massive, monolithic codebases written in languages such as COBOL, Fortran, C++, or early Java—constitute the operational backbone of global finance, healthcare, government, and logistics. They are the repositories of decades of business logic, regulatory compliance rules, and operational workflows that define the organization's competitive advantage. Yet, they simultaneously represent a significant operational risk. They are frequently characterized by "calcification," where the software has become rigid and fragile due to years of ad-hoc patching, architectural drift, and the departure of the original architects.1
The central challenge in modernizing these systems—whether through migration to cloud-native microservices, re-platforming, or refactoring—is not primarily technical but semantic. The profound "knowledge gap" between the executing code and the organization's understanding of that code is the single greatest impediment to evolution.3 Documentation is invariably outdated, misleading, or nonexistent. The "tribal knowledge" that once maintained the system has often evaporated with the retirement of key personnel.4 Consequently, the source code itself becomes the only "authoritative source of truth" (ASoT).5
This report presents an exhaustive, expert-level methodology for reverse-engineering a comprehensive requirements list from such large legacy repositories. It posits that requirements recovery is not a passive activity of reading code, but an active, forensic discipline—Requirements Archaeology. This discipline synthesizes techniques from compiler theory (lexical analysis, abstract syntax trees), data science (latent semantic indexing, machine learning), dynamic systems analysis (execution tracing), and software forensics (git history mining) to reconstruct the "As-Built" requirements of the system.
The methodology is grounded in the Software Engineering Institute’s (SEI) Horseshoe Model 6, adapted to incorporate modern advancements in Artificial Intelligence (AI) and forensic analysis tools. It aims to bridge the "abstraction gap" 8—transforming low-level implementation details into high-level business rules and functional specifications compliant with ISO/IEC/IEEE 29148 standards.9
Legacy systems are rarely singular entities; they are typically "distributed monoliths" or complex ecosystems stitched together by job schedulers, shell scripts, database triggers, and file transfers.10 The business logic is not neatly encapsulated in a "business layer" but is smeared across the user interface (validation logic), the database (stored procedures), and the integration layer (ETL scripts).11
Furthermore, these systems suffer from "functional drift." Over decades, features are added to address specific edge cases or regulatory changes without a holistic update to the architectural vision. This results in "spaghetti code," "God classes," and high coupling, where a change in one module can cause cascading failures in seemingly unrelated areas.12 The recovery process must, therefore, be surgical and multi-dimensional, capable of disentangling these dependencies to isolate the atomic units of business intent.
The foundational architecture for this methodology is the "Horseshoe Model" for software reengineering, originally conceptualized by the SEI. While the full horseshoe describes the cycle of reverse engineering (recovery), transformation, and forward engineering (modernization), this report focuses exclusively on the "left-hand side" of the horseshoe: the ascent from concrete implementation artifacts to abstract architectural and requirements representations.6
The recovery process operates on the premise that raw source code contains the truth of the system's behavior, but at a level of granularity that is incomprehensible for business decision-making. The methodology must systematically elevate the level of abstraction through four distinct levels 7:
Table 1: The Abstraction Hierarchy in Requirements Recovery
| Abstraction Level | Primary Artifacts (Inputs) | Analysis Techniques | Derived Artifacts (Outputs) |
|---|---|---|---|
| Level 1: Implementation | Source Code (`.java`, `.c`, `.cbl`), Scripts (`.sh`, `.bat`), Configs (`.xml`, `.properties`) | Lexical Analysis, Parsing, AST Generation | Abstract Syntax Trees (AST), Control Flow Graphs (CFG), Symbol Tables |
| Level 2: Structural | ASTs, CFGs, Database Schemas, Object Binaries | Static Analysis, Coupling Measurement, Dependency Mapping | Call Graphs, Class Diagrams, Entity-Relationship Models (ERD), Dependency Matrices |
| Level 3: Functional | Call Graphs, Execution Traces, Log Files | Dynamic Analysis, Slicing, Feature Location, User Shadowing | Sequence Diagrams, State Machines, Business Rule Candidates, Data Flow Diagrams (DFD) |
| Level 4: Conceptual | Business Rules, UI Workflows, Commit History | Latent Semantic Indexing (LSI), Pattern Matching, AI Summarization | Requirements Specification (SRS), Use Cases, Traceability Matrix, Domain Model |
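The ascent from Level 1 to Level 2 can be made concrete in a few lines. The sketch below uses Python's stdlib `ast` module as a stand-in for a language-specific parser front end, on an invented code fragment: it turns raw source text into an Abstract Syntax Tree and enumerates the node types, which are the raw material for control-flow graphs and symbol tables.

```python
# Level 1 -> Level 2 in miniature: parse raw source into an AST, the
# structural model that later analyses build on. The function below is
# an invented example, not taken from any real system.
import ast

source = """
def check_credit(age, amount):
    if age < 18:
        raise ValueError("Customer under 18 not allowed")
    return amount <= 50000
"""

tree = ast.parse(source)

# Walk the tree and collect node types: the inputs for CFG construction
# and symbol-table building at Level 2.
node_types = [type(node).__name__ for node in ast.walk(tree)]
print("FunctionDef" in node_types, "If" in node_types, "Raise" in node_types)
```

Note how the business rule ("under 18 not allowed") is already visible as an `If`/`Raise` pair at this structural level, before any semantic interpretation.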
A critical theoretical underpinning of this methodology is the recognition of the divergence between the As-Designed architecture (the idealized view often found in old documentation) and the As-Built architecture (the reality of the code running in production).6
In legacy environments, the "As-Built" system almost always contains "dark matter"—logic and behaviors that were never documented or that deviate from the original spec due to emergency fixes, performance optimizations, or misunderstood requirements.2 For instance, a "quick remedy" commit might introduce a hardcoded exception for a specific client directly into a calculation routine, bypassing the standard configuration tables.16 If the recovery process relies on documentation, it will miss this requirement. Therefore, the methodology treats the codebase as the primary witness and documentation as a secondary source for validation and context.17
While the primary output is a requirements list, the context of this recovery is often a migration to cloud-native architectures (microservices, containers, Kubernetes).18 This influences the recovery methodology by necessitating the identification of "seams" in the monolith—natural boundaries where the system can be decoupled. The requirements recovery process must therefore not only list what the system does but also map the dependencies of those requirements to facilitate eventual decomposition.20 This aligns with the "Strangler Fig" pattern of modernization, where specific functional requirements are isolated and rebuilt while the legacy system continues to operate.22
Before deep analytical tools are deployed, a comprehensive inventory and mapping of the system's "surface area" must be conducted. This phase is analogous to an archaeological survey; the goal is to define the boundaries of the dig site and identify the visible structures that imply deeper hidden foundations.
Legacy systems are notoriously messy. The first step is to create a complete catalog of all digital artifacts. This extends far beyond the compiled application code.10 In many cases, critical business logic has "leaked" out of the core application into supporting scripts and configurations.
A configuration value (e.g., `<MaxLoanAmount>50000</MaxLoanAmount>`) is a direct expression of a business rule.23

To recover requirements, we must identify where the system interacts with the external world. These "entry points" are the stimuli that trigger business behavior. Every user input, API call, or batch trigger corresponds to at least one (and often many) functional requirements.24
Tools like Swimm, CodeSee, or proprietary "System X-Ray" utilities are employed to generate a high-level visualization of the system's topology.12 This visualization focuses on "modules" rather than classes. It answers questions like: Which modules are the most connected? Where are the 'God Classes' that touch everything?
This step is crucial for risk management. Identifying high-complexity zones (hotspots) allows the requirements recovery team to prioritize their deep-dive analysis. If Module A is coupled to 80% of the system, it likely contains the core domain logic (the "Crown Jewels") and represents the highest density of business requirements.26
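One lightweight way to surface such hotspots is to rank functions by decision-point density, a rough proxy for cyclomatic complexity. A minimal sketch, on invented sample functions:

```python
# Hotspot heuristic: rank functions by branch-point count, approximating
# McCabe cyclomatic complexity as 1 + number of decision nodes.
# The SOURCE string is illustrative, not from a real codebase.
import ast

SOURCE = """
def calc_overtime(hours, rate):
    if hours > 40:
        extra = (hours - 40) * rate * 1.5
        if rate > 100:
            extra = extra * 0.9
        return extra
    return 0

def log_event(msg):
    print(msg)
"""

DECISION_NODES = (ast.If, ast.While, ast.For, ast.ExceptHandler, ast.BoolOp)

def complexity(func: ast.FunctionDef) -> int:
    # 1 + branch points, per the classic approximation.
    return 1 + sum(isinstance(n, DECISION_NODES) for n in ast.walk(func))

tree = ast.parse(SOURCE)
scores = {f.name: complexity(f) for f in tree.body if isinstance(f, ast.FunctionDef)}
hotspots = sorted(scores, key=scores.get, reverse=True)
print(hotspots[0])  # the densest candidate for requirements mining
```

In practice the same ranking would be run across thousands of files; the highest-scoring functions are where deep-dive rule extraction pays off first.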
Once the inventory is complete, we descend into the code level of the Horseshoe. Static analysis involves examining the source code without executing it to understand its structure, dependencies, and flow. This phase uses automated tools to parse millions of lines of code, building the structural models necessary for higher-level abstraction.
At the heart of static analysis is Lexical Analysis. This process uses a lexer or scanner to read the raw stream of characters in the source code and group them into meaningful "tokens" (lexemes).27
For example, the statement `if (balance < 0)` is broken into tokens: `KEYWORD(if)`, `SEPARATOR(()`, `IDENTIFIER(balance)`, `OPERATOR(<)`, `LITERAL(0)`, `SEPARATOR())`.29 This lets downstream tools reason about "an `if` statement checking a variable named `balance`."

The Value of Identifiers: In legacy code, identifiers (variable and function names) are often the only surviving documentation. A variable named `w_tax_rate` or a function named `Calc_Overtime` carries semantic weight. Lexical analysis extracts these identifiers to build a "vocabulary" of the system, which will later be used for semantic clustering.27 However, one must be wary of "identifier drift," where a variable named `customer_id` is repurposed to store a `transaction_id` in a patch, a common practice in memory-constrained legacy environments.30
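The tokenization step can be sketched with Python's stdlib `tokenize` module standing in for a language-specific lexer; the input fragment is invented:

```python
# Lexical analysis in miniature: break a source fragment into
# (token type, lexeme) pairs. Real legacy tooling would use a lexer for
# the target language (COBOL, PL/I, etc.); Python's tokenizer stands in.
import io
import tokenize

def lex(source: str):
    """Return (token_name, lexeme) pairs for a source fragment."""
    tokens = []
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        # Skip layout-only tokens; keep the semantically meaningful ones.
        if tok.type in (tokenize.NEWLINE, tokenize.NL, tokenize.ENDMARKER):
            continue
        tokens.append((tokenize.tok_name[tok.type], tok.string))
    return tokens

print(lex("if (balance < 0): flag = True"))
```

The identifier lexemes (`balance`, `flag`) harvested this way feed the system "vocabulary" used later for semantic clustering.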
To decouple requirements, we must understand how code modules relate to one another. Static analysis tools calculate Coupling (how connected two modules are) and Cohesion (how focused a module is on a single task).13
Cyclomatic Complexity counts a method's independent decision paths (`if`, `while`, `for`). High cyclomatic complexity (>15-20) is a strong heuristic indicator of complex business logic. A method with a complexity of 50 is not just code; it is a dense cluster of business rules and exception handling that requires meticulous decomposition.13

Static analysis of the code snapshot is powerful, but analysis of the code's evolution is even more revealing. Temporal Coupling is the analysis of files that change together over time, even if they have no explicit structural link (e.g., no import statements).31
For example, if `OrderValidator.java` and `InventoryService.java` appear in the same commit 75% of the time, they are temporally coupled. This implies a hidden logical dependency—a single business requirement (e.g., "Order validation requires inventory checking") spans both files.31

While complex tools are necessary, simple pattern matching using Regular Expressions (Regex) remains a highly effective, lightweight heuristic for spotting business rules.34 We employ a library of "expert patterns" to trawl the codebase for logic-heavy constructs.
Table 2: Regex Heuristics for Business Rule Discovery
| Pattern Category | Regex Heuristic Example | Business Logic Implication |
|---|---|---|
| Hardcoded Limits | `(>\|<\|>=\|<=)\s*[0-9]+` | Detects numeric thresholds (e.g., `amount > 10000`), often hardcoded business limits. |
| Financial Math | `(\*\|/)\s*[0-9]*\.[0-9]+` | Detects arithmetic against decimal constants (e.g., `* 0.05`), typically rates, taxes, or discounts. |
| Status Logic | `status\s*==\s*['"][A-Z]+['"]` | Detects workflow transitions (e.g., `status == "APPROVED"`), revealing the system's state machine. |
| Exception Handling | `throw new.*Exception` | Detects validation failures. The message usually describes the rule (e.g., "Customer under 18 not allowed"). |
| Temporal Logic | `Date.now\|timestamp` | Detects time-dependent behavior (e.g., deadlines, expirations, cut-off times). |
| Keyword Trawling | `//.*(rule\|policy)` | Detects comments where developers documented business rules inline. |
Using tools like grep or ripgrep 36, analysts can quickly locate these hotspots across gigabytes of source code. However, regex is prone to false positives (e.g., matching a loop counter as a business threshold), so results must be cross-referenced with AST analysis.37
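A minimal version of such a pattern trawler is easy to build. The sketch below uses a subset of the Table 2 heuristics on an invented code snippet; the category names are illustrative, and the loop-counter line shows the false-positive problem the cross-referencing step must catch.

```python
# Regex "expert pattern" trawling: flag lines that look like business
# rules, tagged with a category. Patterns follow Table 2; the sample
# snippet and category names are invented for illustration.
import re

PATTERNS = {
    "hardcoded_limit": re.compile(r"(>|<|>=|<=)\s*[0-9]+"),
    "financial_math": re.compile(r"(\*|/)\s*[0-9]*\.[0-9]+"),
    "status_logic": re.compile(r"status\s*==\s*['\"][A-Z]+['\"]"),
    "exception": re.compile(r"throw new.*Exception"),
}

def scan(source: str):
    """Return (line number, category, line) triples for candidate rules."""
    hits = []
    for lineno, line in enumerate(source.splitlines(), start=1):
        for category, pattern in PATTERNS.items():
            if pattern.search(line):
                hits.append((lineno, category, line.strip()))
    return hits

sample = """\
if (amount > 10000) { discount = amount * 0.05; }
if (status == "APPROVED") { ship(); }
for (int i = 0; i < 100; i++) { }   // false positive: loop counter
"""

for hit in scan(sample):
    print(hit)
```

Line 3 is flagged as a "hardcoded limit" even though it is only a loop bound, which is exactly why regex hits must be cross-referenced against the AST before being promoted to requirement candidates.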
In many legacy systems, the code is volatile, but the data is persistent. The database schema often acts as the "fossil record" of business requirements. Database Reverse Engineering (DBRE) is the process of extracting the domain model and rules from the physical database schema.38
The physical schema (tables, columns, keys) is the implementation detail. We must transform this into a Logical Data Model to understand the business entities.
Entity Inference: `TBL_CUST` clearly maps to a "Customer" entity. But what about `TBL_X99`? By analyzing column names (`X99_NAME`, `X99_ADDR`) and foreign key relationships, we can infer its role. Constraints are executable rules: a check constraint (e.g., `CHECK (AGE >= 18)`) encodes an eligibility rule directly, and a `NOT NULL` constraint creates a mandatory data requirement.40

Legacy systems (especially those on Oracle or SQL Server) often use Stored Procedures for performance-critical batch processing. These procedures are not just data access layers; they are logic containers.
`IF` statements inside a stored procedure often represent the core logic of the system (e.g., calculating interest accrual on millions of accounts).11 Triggers that fire on `INSERT` or `UPDATE` represent "side-effect" requirements. For example, an audit trigger implies a non-functional requirement for traceability. Failing to document these leads to data integrity issues in the modernized system.11

Data analysis bridges the gap between code and intent. If we see a code module `Mod_A` exclusively reading and writing to the `Salary` table, we can infer that `Mod_A` implements "Payroll Requirements," even if the variable names are obfuscated. This Data-Centric Traceability helps cluster code modules around business entities.17
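A minimal DBRE sketch: SQLite keeps the original DDL in `sqlite_master`, so constraint clauses can be read back out of a live schema. The schema below is invented; a production tool would query the target RDBMS's catalog views instead.

```python
# Database reverse engineering in miniature: recover CHECK and NOT NULL
# constraints (embedded business rules) from a schema. The TBL_CUST
# schema is illustrative.
import re
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE TBL_CUST (
        CUST_ID   INTEGER PRIMARY KEY,
        CUST_NAME TEXT NOT NULL,
        AGE       INTEGER CHECK (AGE >= 18)
    )
""")

# SQLite stores the verbatim DDL for each object in sqlite_master.
ddl = conn.execute(
    "SELECT sql FROM sqlite_master WHERE name = 'TBL_CUST'"
).fetchone()[0]

# Each CHECK clause is a candidate business rule; each NOT NULL column
# is a mandatory-data requirement.
checks = re.findall(r"CHECK\s*\(([^)]*)\)", ddl)
mandatory = re.findall(r"(\w+)\s+\w+\s+NOT NULL", ddl)
print(checks, mandatory)
```

Here the recovered `AGE >= 18` clause would be promoted directly to a requirement candidate ("Customers must be 18 or older"), with the constraint itself as its traceability anchor.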
Static analysis provides the "skeleton" of the system, but Dynamic Analysis provides the "pulse." By observing the system in operation, we can verify which code paths are actually active and how data flows through the logic in real-time. This is critical for distinguishing "dead code" from active business rules.41
One of the most difficult tasks is mapping a user-visible feature (e.g., "Clicking the 'Calculate' button") to the specific lines of code that implement it. Feature Location techniques use execution traces to solve this.43
Requirements are not just code; they are workflows. User Shadowing involves observing actual users interacting with the system to capture the process context that the code supports.2
Static code shows potential values; dynamic analysis shows actual values. By inspecting variable states at runtime, we can uncover business rules defined by data.
The statement `if (x > limit)` tells us there is a limit. Dynamic analysis shows that `limit` is always 5000 in the production environment. This allows us to document the specific requirement: "Transaction limit is set to 5000," rather than a vague "Transaction limit must be checked".44

This phase is the core intellectual task: distilling the raw technical findings into distinct Business Rules. A business rule is a statement that defines or constrains some aspect of the business (e.g., "A customer is considered Gold if they place more than 10 orders a year").
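The runtime-value inspection described above can be sketched with Python's `sys.settrace` hook; the traced function and its default limit are invented for illustration, and real systems would use a debugger, profiler, or instrumentation agent instead.

```python
# Dynamic value analysis in miniature: observe the actual values a guard
# variable takes at runtime, turning "a limit exists" into "the limit is
# 5000". Function and values are illustrative.
import sys

observed_limits = set()

def tracer(frame, event, arg):
    # On each executed line, record the current value of `limit` if bound.
    if event == "line" and "limit" in frame.f_locals:
        observed_limits.add(frame.f_locals["limit"])
    return tracer

def check_transaction(amount, limit=5000):
    if amount > limit:
        return "REJECTED"
    return "APPROVED"

sys.settrace(tracer)
for amount in (100, 7500, 4999):
    check_transaction(amount)
sys.settrace(None)

print(observed_limits)  # the concrete threshold actually in use
```

A single observed value across many executions is evidence (not proof) that the limit is constant; the finding still needs validation against configuration sources before it is written up as a fixed requirement.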
Program Slicing is a rigorous technique for extracting the specific subset of code relevant to a particular computation.17
Backward Slicing: starting from an output variable (e.g., `NetPay`), we perform a backward slice to find every line of code that affects that variable. This slice strips away UI code, logging, and error handling, leaving only the pure calculation logic. This "distilled" code is the algorithmic requirement.17 Forward Slicing: starting from an input variable (e.g., `TaxCode`), we slice forward to see all downstream variables it affects. This helps identify dependent requirements.

Code tells us what; history tells us why. Mining Software Repositories (MSR) involves analyzing the version control history to extract the rationale behind changes.47
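The temporal-coupling signal that MSR extracts can be computed directly from commit file-sets. The commits below are invented stand-ins for `git log --name-only` output; the file names are illustrative.

```python
# Mining Software Repositories in miniature: count how often file pairs
# change together. High co-change frequency suggests a shared, undocumented
# requirement. Commit data is illustrative.
from collections import Counter
from itertools import combinations

commits = [
    {"OrderValidator.java", "InventoryService.java"},
    {"OrderValidator.java", "InventoryService.java", "Order.java"},
    {"OrderValidator.java", "InventoryService.java"},
    {"ReportWriter.java"},
]

pair_counts = Counter()
for files in commits:
    for pair in combinations(sorted(files), 2):
        pair_counts[pair] += 1

# The most frequently co-changed pair is the strongest coupling candidate.
(top_pair, count), = pair_counts.most_common(1)
print(top_pair, count)
```

In a real mining run the counts would be normalized by each file's total change count, so that frequently touched utility files do not dominate the ranking.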
The advent of Large Language Models (LLMs) has revolutionized this phase. AI tools can ingest legacy code (even obscure languages like RPG or COBOL) and generate natural language explanations.4
For example, an LLM can recognize that a deeply nested `IF` structure is actually implementing a specific pricing tier table.

The final analytical step is to synthesize all findings into a coherent structure, linking technical artifacts to business concepts. This bridges the "Abstraction Gap".8
To automatically cluster code artifacts into functional groups, we use Latent Semantic Indexing (LSI). LSI is an Information Retrieval technique that analyzes the co-occurrence of terms to uncover hidden semantic structures.52
Legacy code rarely uses a single vocabulary for a single concept (`Client`, `Cust`, and `Payer` all mean "Customer"), and simple keyword searches miss these connections. LSI maps terms into a reduced vector space in which `Client` and `Cust` are close vectors. This allows us to retrieve all code related to "Customer Management" even if the developers used inconsistent naming conventions.54

The output of the recovery process is a Bi-Directional Requirements Traceability Matrix (RTM).56 This matrix links every recovered requirement to the specific source code artifacts that implement it.
Table 3: Sample Recovered Traceability Matrix
| Req ID | Requirement Description | Source Artifacts (Files/Methods) | Discovery Method | Status |
|---|---|---|---|---|
| REQ-001 | Customer Eligibility: Users under 18 cannot apply for credit. | `Customer.java` (method `checkAge`), `DB_Trig_Age_Chk` | Static Regex + DB Constraint | Validated |
| REQ-002 | Tiered Pricing: Apply 5% discount for orders > $10k. | `PricingService.cpp` (lines 405-420), `config/discounts.xml` | Dynamic Trace (Feature Location) | Draft |
| REQ-003 | Audit Logging: Log all failed login attempts to `SEC_LOG`. | `AuthModule.py`, `sp_audit_insert` | Aspect Slicing | Validated |
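An RTM of this kind is most useful when it is machine-readable and queryable in both directions. A sketch, with data mirroring Table 3 (illustrative): a forward map from requirement to artifacts, and a derived reverse map from artifact to requirements, supporting impact analysis when a file changes.

```python
# Bi-directional RTM in miniature: forward (requirement -> artifacts) and
# derived reverse (artifact -> requirements) indices. Entries mirror the
# sample matrix and are illustrative.
from collections import defaultdict

forward = {
    "REQ-001": ["Customer.java#checkAge", "DB_Trig_Age_Chk"],
    "REQ-002": ["PricingService.cpp:405-420", "config/discounts.xml"],
    "REQ-003": ["AuthModule.py", "sp_audit_insert"],
}

reverse = defaultdict(list)
for req_id, artifacts in forward.items():
    for artifact in artifacts:
        reverse[artifact].append(req_id)

# Impact analysis: which requirements are at risk if this file changes?
print(reverse["AuthModule.py"])
```

During Strangler Fig decomposition, the reverse index answers the critical question "which recovered requirements must be re-verified when this module is rebuilt?".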
A recovered requirement is merely a hypothesis until it is validated. The final phase ensures that the extracted understanding matches the reality of the system and formats it for consumption by modern engineering teams.
Model-Based Testing (MBT) is the gold standard for validating reverse-engineered requirements.59
To ensure the output is professional and usable, we format the Requirements Specification (SRS) according to the ISO/IEC/IEEE 29148 standard.9
The methodology outlined above is not an academic exercise; it is the prerequisite for any successful modernization strategy. Whether the goal is to Rehost (lift-and-shift), Replatform, or Refactor into microservices 21, the "As-Built" requirements list serves as the roadmap.