Compiler Architecture

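The XXML compiler is a multi-stage compiler that translates XXML source code into native executables via LLVM IR. It follows a traditional architecture with four distinct phases: lexical analysis, syntax analysis, semantic analysis, and LLVM code generation. Clang/LLVM and the platform linker then turn the generated IR into an executable.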

Compilation Pipeline

text
Source Code (.XXML)
       ↓
   [Lexer] ← Error Reporter
       ↓
    Tokens
       ↓
   [Parser] ← Error Reporter
       ↓
     AST
       ↓
[Semantic Analyzer] ← Symbol Table ← Error Reporter
       ↓
  Validated AST
       ↓
[LLVM Backend] ← Error Reporter
       ↓
  LLVM IR (.ll)
       ↓
[Clang/LLVM]
       ↓
  Object Code (.obj)
       ↓
[Platform Linker] + Runtime Library
       ↓
  Native Executable

Phase 1: Lexical Analysis

The lexer tokenizes source code into a stream of tokens, handling keywords, identifiers, literals, and operators.

Token Types

  • Keywords: import, Namespace, Class, Method, etc.
  • Identifiers: Regular and angle-bracketed (<name>)
  • Literals: Integer (42i), String ("text"), Boolean
  • Operators: +, -, *, /, ::, ., ->, etc.
  • Delimiters: [, ], {, }, (, ), ;, ,
  • Special: ^, &, % (ownership modifiers)
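
As a rough illustration of how these token categories can be recognized, the following C++ sketch scans a small subset: keywords, plain and angle-bracketed identifiers, integer literals with the `i` suffix, string literals, and single-character symbols. The names `Token`, `TokenKind`, and `tokenize` are hypothetical and not taken from the XXML source. Multi-character operators such as `::` and `->`, source locations, and error reporting through the Error Reporter are left out for brevity.

```cpp
#include <cctype>
#include <cstddef>
#include <set>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical token model; the real lexer's types may differ.
enum class TokenKind { Keyword, Identifier, AngleIdentifier, Integer, String, Symbol };

struct Token {
    TokenKind kind;
    std::string text;
};

std::vector<Token> tokenize(const std::string& src) {
    // Small subset of the keywords listed above.
    static const std::set<std::string> keywords = {"import", "Namespace", "Class", "Method"};
    auto isIdentStart = [](char c) {
        return std::isalpha(static_cast<unsigned char>(c)) != 0 || c == '_';
    };
    auto isIdentChar = [](char c) {
        return std::isalnum(static_cast<unsigned char>(c)) != 0 || c == '_';
    };
    auto isDigit = [](char c) { return std::isdigit(static_cast<unsigned char>(c)) != 0; };

    std::vector<Token> out;
    std::size_t i = 0;
    while (i < src.size()) {
        char c = src[i];
        if (std::isspace(static_cast<unsigned char>(c))) {
            ++i;  // skip whitespace
        } else if (isIdentStart(c)) {
            std::size_t start = i;
            while (i < src.size() && isIdentChar(src[i])) ++i;
            std::string word = src.substr(start, i - start);
            TokenKind kind = keywords.count(word) ? TokenKind::Keyword : TokenKind::Identifier;
            out.push_back({kind, word});
        } else if (isDigit(c)) {
            std::size_t start = i;
            while (i < src.size() && isDigit(src[i])) ++i;
            std::string digits = src.substr(start, i - start);
            if (i < src.size() && src[i] == 'i') ++i;  // integer suffix, as in 42i
            out.push_back({TokenKind::Integer, digits});
        } else if (c == '"') {
            std::size_t end = src.find('"', i + 1);
            if (end == std::string::npos) throw std::runtime_error("unterminated string");
            out.push_back({TokenKind::String, src.substr(i + 1, end - i - 1)});
            i = end + 1;
        } else if (c == '<' && i + 1 < src.size() && isIdentStart(src[i + 1])) {
            // Candidate angle-bracketed identifier: <name>
            std::size_t j = i + 1;
            while (j < src.size() && isIdentChar(src[j])) ++j;
            if (j < src.size() && src[j] == '>') {
                out.push_back({TokenKind::AngleIdentifier, src.substr(i + 1, j - i - 1)});
                i = j + 1;
            } else {
                out.push_back({TokenKind::Symbol, "<"});  // plain less-than
                ++i;
            }
        } else {
            out.push_back({TokenKind::Symbol, std::string(1, c)});
            ++i;
        }
    }
    return out;
}
```

Calling `tokenize("Class <MyClass> 42i \"text\" ;")` yields five tokens under these rules: a keyword, an angle-bracketed identifier, an integer, a string, and a symbol.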

Phase 2: Syntax Analysis

The parser uses recursive descent with precedence climbing for expressions, building an Abstract Syntax Tree (AST).

AST Hierarchy

text
ASTNode (abstract)
├── Declaration
│   ├── ImportDecl
│   ├── NamespaceDecl
│   ├── ClassDecl
│   ├── PropertyDecl
│   ├── ConstructorDecl
│   ├── MethodDecl
│   ├── ParameterDecl
│   └── EntrypointDecl
├── Statement
│   ├── InstantiateStmt
│   ├── RunStmt
│   ├── ForStmt
│   ├── ExitStmt
│   └── ReturnStmt
├── Expression
│   ├── IntegerLiteralExpr
│   ├── StringLiteralExpr
│   ├── BoolLiteralExpr
│   ├── IdentifierExpr
│   ├── ReferenceExpr
│   ├── MemberAccessExpr
│   ├── CallExpr
│   └── BinaryExpr
└── TypeRef

Expression Precedence

text
Expression (lowest precedence)
├── Logical OR (||)
├── Logical AND (&&)
├── Equality (==, !=)
├── Comparison (<, >, <=, >=)
├── Addition (+, -)
├── Multiplication (*, /, %)
├── Unary (-, !, &)
└── Primary (literals, identifiers, calls)
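
The precedence table above can be driven by a single loop. The sketch below is a minimal precedence-climbing parser over pre-split tokens. It is an illustration, not the XXML parser: `ExprParser` and `precedenceOf` are hypothetical names, all binary operators are assumed to be left-associative, and the result is a fully parenthesized string rather than AST nodes so that the tree shape is easy to inspect.

```cpp
#include <cstddef>
#include <map>
#include <stdexcept>
#include <string>
#include <utility>
#include <vector>

// Binding strength per binary operator, mirroring the table above
// (a higher number binds tighter). Returns -1 for non-operators.
int precedenceOf(const std::string& op) {
    static const std::map<std::string, int> table = {
        {"||", 1}, {"&&", 2}, {"==", 3}, {"!=", 3},
        {"<", 4},  {">", 4},  {"<=", 4}, {">=", 4},
        {"+", 5},  {"-", 5},  {"*", 6},  {"/", 6}, {"%", 6}};
    auto it = table.find(op);
    return it == table.end() ? -1 : it->second;
}

class ExprParser {
public:
    explicit ExprParser(std::vector<std::string> tokens) : tokens_(std::move(tokens)) {}

    std::string parseExpression(int minPrec = 1) {
        std::string lhs = parseUnary();
        // Keep consuming operators that bind at least as tightly as minPrec.
        while (pos_ < tokens_.size() && precedenceOf(tokens_[pos_]) >= minPrec) {
            std::string op = tokens_[pos_++];
            // Left-associative: the right operand must bind strictly tighter.
            std::string rhs = parseExpression(precedenceOf(op) + 1);
            lhs = "(" + lhs + " " + op + " " + rhs + ")";
        }
        return lhs;
    }

private:
    std::string parseUnary() {
        if (pos_ < tokens_.size() && (tokens_[pos_] == "-" || tokens_[pos_] == "!")) {
            std::string op = tokens_[pos_++];
            return "(" + op + parseUnary() + ")";
        }
        return parsePrimary();
    }

    std::string parsePrimary() {
        if (pos_ >= tokens_.size()) throw std::runtime_error("unexpected end of input");
        if (tokens_[pos_] == "(") {
            ++pos_;
            std::string inner = parseExpression();
            if (pos_ >= tokens_.size() || tokens_[pos_] != ")") {
                throw std::runtime_error("expected ')'");
            }
            ++pos_;
            return inner;
        }
        return tokens_[pos_++];  // literal or identifier
    }

    std::vector<std::string> tokens_;
    std::size_t pos_ = 0;
};
```

With the tokens `1 + 2 * 3`, `parseExpression()` returns `(1 + (2 * 3))`, because `*` sits lower in the table and therefore binds tighter than `+`.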

Phase 3: Semantic Analysis

The semantic analyzer validates the AST using a hierarchical symbol table and performs type checking.

Symbol Table Structure

text
Global Scope
├── Namespace: RenderStar
│   └── Namespace: Default
│       └── Class: MyClass
│           ├── Property: x (Integer^)
│           ├── Property: someString (String^)
│           ├── Constructor: default
│           └── Method: someMethod
│               ├── Parameter: int1 (Integer%)
│               └── Parameter: str (String&)
└── Entrypoint Scope
    ├── Variable: myClass
    ├── Variable: myString
    └── For Loop Scope
        └── Variable: x
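
One common way to model such a hierarchy is a tree of scopes in which each scope points at its parent, so that lookups walk outward until a match is found. The following sketch assumes that design; `Scope` and `Symbol` are hypothetical names, and types are stored as plain strings for simplicity.

```cpp
#include <memory>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// Hypothetical symbol record; a real table would store richer type information.
struct Symbol {
    std::string name;
    std::string type;  // e.g. "Integer^"
};

class Scope {
public:
    explicit Scope(std::string name, Scope* parent = nullptr)
        : name_(std::move(name)), parent_(parent) {}

    // Creates a nested scope owned by this scope.
    Scope& addChild(const std::string& name) {
        children_.push_back(std::make_unique<Scope>(name, this));
        return *children_.back();
    }

    // Returns false when the name already exists in this scope
    // (a duplicate declaration).
    bool declare(const Symbol& symbol) {
        return symbols_.emplace(symbol.name, symbol).second;
    }

    // Searches this scope first, then each enclosing scope in turn.
    const Symbol* lookup(const std::string& name) const {
        for (const Scope* s = this; s != nullptr; s = s->parent_) {
            auto it = s->symbols_.find(name);
            if (it != s->symbols_.end()) return &it->second;
        }
        return nullptr;
    }

private:
    std::string name_;
    Scope* parent_;
    std::unordered_map<std::string, Symbol> symbols_;
    std::vector<std::unique_ptr<Scope>> children_;
};
```

In this model, a variable declared in the entrypoint scope is visible from the nested for-loop scope, while the loop variable `x` is not visible from the entrypoint scope.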

Semantic Checks

  1. Name Resolution - Identifiers declared before use, no duplicates
  2. Type Checking - Variable initialization, argument types, return types
  3. Ownership Validation - Owned values not aliased, references point to valid owners
  4. Class Validation - Base classes exist, no circular inheritance
  5. Method Validation - Method exists, correct arguments
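
To make one of these checks concrete, here is a minimal sketch of circular-inheritance detection. It assumes the analyzer has already collected a map from each class name to its base class name (`InheritanceMap`, a hypothetical representation). Classes that extend None are simply absent from the map.

```cpp
#include <string>
#include <unordered_map>
#include <unordered_set>

// Maps a class name to the name of its base class (assumed representation).
using InheritanceMap = std::unordered_map<std::string, std::string>;

// Returns true if following base-class links from `start` revisits any class.
bool hasCircularInheritance(const InheritanceMap& bases, const std::string& start) {
    std::unordered_set<std::string> seen;
    std::string current = start;
    while (true) {
        if (!seen.insert(current).second) return true;  // already visited: cycle
        auto it = bases.find(current);
        if (it == bases.end()) return false;  // reached a class with no base
        current = it->second;
    }
}
```

The walk visits each class on the chain at most once, so the check is linear in the depth of the inheritance chain. Verifying that every named base class actually exists would be a separate lookup against the symbol table.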

Phase 4: LLVM Code Generation

Modular Codegen System

text
ModularCodegen (orchestrator)
├── ExprCodegen     - Expression code generation
│   ├── BinaryCodegen       - Binary operations (+, -, *, /, comparisons)
│   ├── CallCodegen         - Method/function calls
│   ├── IdentifierCodegen   - Variable references
│   ├── LiteralCodegen      - Integer, float, string, bool literals
│   └── MemberAccessCodegen - Property access, method resolution
├── StmtCodegen     - Statement code generation
│   ├── AssignmentCodegen   - Variable assignments
│   ├── ControlFlowCodegen  - If/else, while, for loops
│   └── ReturnCodegen       - Return statements
├── DeclCodegen     - Declaration code generation
│   ├── ClassCodegen        - Class structure generation
│   ├── ConstructorCodegen  - Constructor methods
│   ├── MethodCodegen       - Regular methods
│   ├── NativeMethodCodegen - FFI native methods
│   └── EntrypointCodegen   - Main function generation
├── FFICodegen      - Foreign function interface
├── MetadataGen     - Reflection metadata
├── PreambleGen     - Runtime preamble generation
└── TemplateGen     - Template instantiation

Type-Safe LLVM IR

A compile-time type-safe abstraction prevents invalid IR generation:

Component       Description
TypedValue<T>   Type-safe value wrappers (IntValue, FloatValue, PtrValue, BoolValue)
AnyValue        Runtime type variant when compile-time type is unknown
IRBuilder       Type-safe instruction builder preventing invalid IR
TypedModule     Module with type context and constant factories

Note

Integer operations accept and return only IntValue, float operations return FloatValue, and comparisons return BoolValue, so mismatched operand types are rejected at compile time rather than producing invalid IR.
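
The idea can be sketched with tag types, assuming that "compile-time" here refers to the C++ build of the compiler itself. The wrapper and builder below reuse the names TypedValue, IntValue, FloatValue, BoolValue, and IRBuilder from the table. Everything else is hypothetical: the method names (`createAdd`, `createFAdd`, `createICmpSLT`), the tag structs, and the choice to emit textual IR instead of calling the LLVM API.

```cpp
#include <string>
#include <type_traits>

// Tag types standing in for IR types (illustrative only).
struct IntTag {};
struct FloatTag {};
struct BoolTag {};

// A value is just an SSA name here; the tag exists only in the C++ type system.
template <typename Tag>
struct TypedValue {
    std::string name;
};

using IntValue = TypedValue<IntTag>;
using FloatValue = TypedValue<FloatTag>;
using BoolValue = TypedValue<BoolTag>;

// Passing a FloatValue where an IntValue is expected does not compile,
// because the two wrapper types are unrelated.
static_assert(!std::is_convertible<FloatValue, IntValue>::value,
              "float values must not convert to int values");

// Minimal builder that appends textual IR; a real builder would emit
// LLVM instructions instead.
class IRBuilder {
public:
    IntValue createAdd(const IntValue& a, const IntValue& b) {
        return {emit("add i64 " + a.name + ", " + b.name)};
    }
    FloatValue createFAdd(const FloatValue& a, const FloatValue& b) {
        return {emit("fadd double " + a.name + ", " + b.name)};
    }
    BoolValue createICmpSLT(const IntValue& a, const IntValue& b) {
        return {emit("icmp slt i64 " + a.name + ", " + b.name)};
    }
    const std::string& ir() const { return ir_; }

private:
    // Appends one instruction and returns the fresh SSA name it defines.
    std::string emit(const std::string& instruction) {
        std::string result = "%t" + std::to_string(counter_++);
        ir_ += "  " + result + " = " + instruction + "\n";
        return result;
    }

    std::string ir_;
    int counter_ = 0;
};
```

With this shape, a call such as `builder.createAdd(intValue, floatValue)` is a C++ type error, so the mistake surfaces when the compiler is built instead of when the generated IR is verified.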

Translation Examples

Classes to Structs

xxml
[ Class <MyClass> Final Extends None
    [ Public <>
        Property <x> Types Integer^;
        Property <name> Types String^;
    ]
]

Becomes:

llvm
%MyClass = type { ptr, ptr }  ; x, name as pointers

Methods to Functions

xxml
Method <foo> Returns Integer^ Parameters (Parameter <x> Types Integer%) Do
{
    Return x.add(Integer::Constructor(1));
}

Becomes:

llvm
define ptr @MyClass_foo(ptr %this, ptr %x) {
entry:
    ; ... method body
    ret ptr %result
}

For Loops to Basic Blocks

xxml
For (Integer^ <i> = 0 .. 10) -> { ... }

Becomes:

llvm
for.init:
    %i = alloca ptr
    store ptr %zero, ptr %i
    br label %for.cond
for.cond:
    ; %i.val is the current integer value of i (load elided)
    %cmp = icmp slt i64 %i.val, 10
    br i1 %cmp, label %for.body, label %for.end
for.body:
    ; ... loop body
    br label %for.inc
for.inc:
    ; increment i
    br label %for.cond
for.end:
    ; continue after loop

Runtime Integration

Compiled programs link against libXXMLLLVMRuntime:

  • Memory management (xxml_malloc, xxml_free)
  • Core types (Integer, String, Bool, Float, Double)
  • Console I/O (Console_printLine, etc.)
  • Reflection runtime for type introspection

Error Reporting

The ErrorReporter accumulates errors during compilation:

text
filename.xxml:10:5: error: Undeclared identifier 'foo' [3000]
    Run System::Print(foo);
                      ^
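
An accumulating reporter of this kind might look like the sketch below. The `Diagnostic` fields and the `report`, `hasErrors`, `errorCount`, and `format` methods are hypothetical names chosen for illustration. Placing the caret directly under the reported column is also an assumption about the layout.

```cpp
#include <cstddef>
#include <sstream>
#include <string>
#include <utility>
#include <vector>

// Hypothetical diagnostic record; field names are illustrative.
struct Diagnostic {
    std::string file;
    int line;    // 1-based
    int column;  // 1-based
    std::string message;
    int code;
    std::string sourceLine;  // text of the offending line
};

class ErrorReporter {
public:
    // Errors are collected rather than thrown, so a phase can keep going
    // and report several problems in one run.
    void report(Diagnostic d) { diagnostics_.push_back(std::move(d)); }
    bool hasErrors() const { return !diagnostics_.empty(); }
    std::size_t errorCount() const { return diagnostics_.size(); }

    // Renders every diagnostic as "file:line:col: error: message [code]",
    // followed by the source line and a caret marker.
    std::string format() const {
        std::ostringstream out;
        for (const Diagnostic& d : diagnostics_) {
            out << d.file << ':' << d.line << ':' << d.column
                << ": error: " << d.message << " [" << d.code << "]\n";
            out << "    " << d.sourceLine << '\n';
            // Four spaces of indent plus (column - 1) spaces put the caret
            // under the reported column.
            out << std::string(4 + static_cast<std::size_t>(d.column - 1), ' ') << "^\n";
        }
        return out.str();
    }

private:
    std::vector<Diagnostic> diagnostics_;
};
```

A driver built on this sketch would typically check `hasErrors()` after each phase and stop before code generation if any diagnostics were recorded.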

Error Categories

  • Lexer Errors: Unexpected character, unterminated string, invalid number
  • Parser Errors: Unexpected token, missing delimiter, invalid syntax
  • Semantic Errors: Undeclared identifier, type mismatch, duplicate declaration
  • CodeGen Errors: Internal errors (should not occur)

Design Patterns

Visitor Pattern

Used for AST traversal in semantic analysis and code generation:

cpp
class ASTVisitor {
public:
    virtual ~ASTVisitor() = default;

    // One overload per concrete AST node type.
    virtual void visit(ClassDecl& node) = 0;
    virtual void visit(MethodDecl& node) = 0;
    // ...
};
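
To show how such a visitor is typically driven, the self-contained sketch below adds an `accept` method to two simplified node types and implements one concrete pass. The node fields, the `accept` hook, and the `NameCollector` pass are assumptions made for illustration and may not match the real AST classes.

```cpp
#include <memory>
#include <string>
#include <utility>
#include <vector>

struct ClassDecl;
struct MethodDecl;

class ASTVisitor {
public:
    virtual ~ASTVisitor() = default;
    virtual void visit(ClassDecl& node) = 0;
    virtual void visit(MethodDecl& node) = 0;
};

// Simplified nodes; accept() is the usual double-dispatch entry point.
struct ASTNode {
    virtual ~ASTNode() = default;
    virtual void accept(ASTVisitor& visitor) = 0;
};

struct MethodDecl : ASTNode {
    std::string name;
    explicit MethodDecl(std::string n) : name(std::move(n)) {}
    void accept(ASTVisitor& visitor) override { visitor.visit(*this); }
};

struct ClassDecl : ASTNode {
    std::string name;
    std::vector<std::unique_ptr<MethodDecl>> methods;
    explicit ClassDecl(std::string n) : name(std::move(n)) {}
    void accept(ASTVisitor& visitor) override { visitor.visit(*this); }
};

// Example pass: collects "Class_method" names, similar in spirit to the
// @MyClass_foo function name shown in the translation examples.
class NameCollector : public ASTVisitor {
public:
    std::vector<std::string> names;

    void visit(ClassDecl& node) override {
        current_ = node.name;
        for (auto& method : node.methods) method->accept(*this);
    }
    void visit(MethodDecl& node) override {
        names.push_back(current_ + "_" + node.name);
    }

private:
    std::string current_;
};
```

Each new pass becomes another `ASTVisitor` subclass, so the node classes themselves do not need to change when a pass is added.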

Builder Pattern

Used in the parser to construct AST nodes.

Strategy Pattern

Code generation backends are treated as interchangeable strategies. The LLVM backend is the current implementation, and an interpreter backend is planned.
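
A minimal sketch of that arrangement follows. The `Backend` interface, the class names, and the `makeBackend` factory are hypothetical, and a plain string stands in for the validated AST to keep the example small.

```cpp
#include <memory>
#include <stdexcept>
#include <string>

// Hypothetical interface: each backend turns a validated program into output.
class Backend {
public:
    virtual ~Backend() = default;
    virtual std::string name() const = 0;
    virtual std::string generate(const std::string& program) = 0;
};

class LLVMBackend : public Backend {
public:
    std::string name() const override { return "llvm"; }
    std::string generate(const std::string& program) override {
        return "; LLVM IR for " + program + "\n";
    }
};

// Placeholder for the planned interpreter; it does not execute anything here.
class InterpreterBackend : public Backend {
public:
    std::string name() const override { return "interpreter"; }
    std::string generate(const std::string& program) override {
        return "interpreted " + program + "\n";
    }
};

// The driver selects a strategy by name and uses it only through Backend.
std::unique_ptr<Backend> makeBackend(const std::string& kind) {
    if (kind == "llvm") return std::make_unique<LLVMBackend>();
    if (kind == "interpreter") return std::make_unique<InterpreterBackend>();
    throw std::invalid_argument("unknown backend: " + kind);
}
```

Because the rest of the pipeline depends only on the interface, adding a backend would not require changes to the lexer, parser, or semantic analyzer.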

Performance

Time Complexity

  • Lexing: O(n) where n = source size
  • Parsing: O(n) where n = number of tokens
  • Semantic Analysis: O(n * m) where n = nodes, m = average scope depth
  • Code Generation: O(n) where n = AST nodes

Optimization Opportunities

  • String Interning: Reduce memory for repeated identifiers
  • Arena Allocation: Faster AST node allocation
  • Lazy Symbol Resolution: Delay expensive lookups
  • Incremental Compilation: Recompile only changed modules
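
As an example of the first item, string interning can be sketched in a few lines. `StringInterner` is a hypothetical helper, not an existing XXML class. It stores each distinct string once and hands out a stable pointer, so identifier comparisons become pointer comparisons.

```cpp
#include <cstddef>
#include <string>
#include <unordered_set>

// Stores each distinct string once and returns a stable pointer to it.
class StringInterner {
public:
    const std::string* intern(const std::string& text) {
        // std::unordered_set never relocates its elements, so the returned
        // address stays valid for the lifetime of the interner.
        return &*pool_.insert(text).first;
    }

    std::size_t size() const { return pool_.size(); }

private:
    std::unordered_set<std::string> pool_;
};
```

Interning the same identifier twice returns the same pointer, and the pool grows only when a new spelling is seen.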

LLVM Optimizations

The LLVM backend leverages LLVM's optimization passes:

  • Constant folding (the compiler also folds constants during its own compile-time evaluation)
  • Dead code elimination
  • Inline expansion
  • Common subexpression elimination

Future Enhancements

  • Interpreter: Direct AST interpretation for debugging
  • JIT Compiler: Runtime compilation for dynamic scenarios
  • Debug Information: DWARF debug info in generated binaries

Source Files

Component               Location
Lexer                   include/Lexer/, src/Lexer/
Parser                  include/Parser/, src/Parser/
Semantic Analyzer       include/Semantic/, src/Semantic/
LLVM Backend            include/Backends/, src/Backends/
Import Resolver         include/Import/, src/Import/
Common Infrastructure   include/Common/, src/Common/

Next Steps

See the CLI Reference for compilation options, or explore the Import System for module resolution.