Compiler Architecture
The XXML compiler translates XXML source code into native executables via LLVM IR. It follows a traditional multi-stage compiler architecture with distinct phases.
Compilation Pipeline
Source Code (.XXML)
↓
[Lexer] ← Error Reporter
↓
Tokens
↓
[Parser] ← Error Reporter
↓
AST
↓
[Semantic Analyzer] ← Symbol Table ← Error Reporter
↓
Validated AST
↓
[LLVM Backend] ← Error Reporter
↓
LLVM IR (.ll)
↓
[Clang/LLVM]
↓
Object Code (.obj)
↓
[Platform Linker] + Runtime Library
↓
Native Executable
Phase 1: Lexical Analysis
The lexer tokenizes source code into a stream of tokens, handling keywords, identifiers, literals, and operators.
Token Types
- Keywords: import, Namespace, Class, Method, etc.
- Identifiers: Regular and angle-bracketed (<name>)
- Literals: Integer (42i), String ("text"), Boolean
- Operators: +, -, *, /, ::, ., ->, etc.
- Delimiters: [, ], {, }, (, ), ;, ,
- Special: ^, &, % (ownership modifiers)
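The token categories above can be illustrated with a small hand-written scanner. This is a minimal sketch, not the XXML lexer itself: `TokenKind`, `Token`, `isKeyword`, and `tokenize` are hypothetical names, the keyword set is only the subset named in this document, and the sketch assumes that an angle-bracketed identifier is `<` immediately followed by a name and `>`, and that an integer literal is a run of digits with an optional `i` suffix.

```cpp
#include <cctype>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical token categories, mirroring the list above.
enum class TokenKind { Keyword, Identifier, AngleIdentifier, Integer, String, Symbol };

struct Token {
    TokenKind kind;
    std::string text;
};

// Assumed keyword subset, taken from the examples in this document.
inline bool isKeyword(const std::string& s) {
    return s == "import" || s == "Namespace" || s == "Class" || s == "Method";
}

inline std::vector<Token> tokenize(const std::string& src) {
    std::vector<Token> out;
    std::size_t i = 0;
    auto isAlpha = [](char c) { return std::isalpha(static_cast<unsigned char>(c)) != 0 || c == '_'; };
    auto isDigit = [](char c) { return std::isdigit(static_cast<unsigned char>(c)) != 0; };
    while (i < src.size()) {
        char c = src[i];
        if (std::isspace(static_cast<unsigned char>(c))) { ++i; continue; }
        if (isAlpha(c)) {  // keyword or regular identifier
            std::size_t start = i;
            while (i < src.size() && (isAlpha(src[i]) || isDigit(src[i]))) ++i;
            std::string word = src.substr(start, i - start);
            out.push_back({isKeyword(word) ? TokenKind::Keyword : TokenKind::Identifier, word});
        } else if (isDigit(c)) {  // integer literal such as 42i
            std::size_t start = i;
            while (i < src.size() && isDigit(src[i])) ++i;
            if (i < src.size() && src[i] == 'i') ++i;  // optional suffix
            out.push_back({TokenKind::Integer, src.substr(start, i - start)});
        } else if (c == '"') {  // string literal, escapes not handled
            std::size_t end = src.find('"', i + 1);
            if (end == std::string::npos) end = src.size();  // unterminated: take the rest
            out.push_back({TokenKind::String, src.substr(i + 1, end - i - 1)});
            i = (end < src.size()) ? end + 1 : end;
        } else if (c == '<' && i + 1 < src.size() && isAlpha(src[i + 1])) {
            std::size_t j = i + 1;  // candidate angle-bracketed identifier
            while (j < src.size() && (isAlpha(src[j]) || isDigit(src[j]))) ++j;
            if (j < src.size() && src[j] == '>') {
                out.push_back({TokenKind::AngleIdentifier, src.substr(i + 1, j - i - 1)});
                i = j + 1;
            } else {  // no closing '>': treat '<' as an ordinary operator
                out.push_back({TokenKind::Symbol, "<"});
                ++i;
            }
        } else {  // operators and delimiters, with two-character "::" and "->"
            std::string sym(1, c);
            if (i + 1 < src.size()) {
                std::string two = src.substr(i, 2);
                if (two == "::" || two == "->") sym = two;
            }
            out.push_back({TokenKind::Symbol, sym});
            i += sym.size();
        }
    }
    return out;
}
```

A real lexer would also track line and column for each token so the error reporter can point at the offending character, and would report unterminated strings instead of accepting them.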
Phase 2: Syntax Analysis
The parser uses recursive descent with precedence climbing for expressions, building an Abstract Syntax Tree (AST).
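Precedence climbing can be sketched in a few lines. The example below is an illustration under simplifying assumptions, not the XXML parser: it works on pre-split string tokens, returns a fully parenthesized string instead of AST nodes, and uses the hypothetical names `precedenceOf` and `ExprParser`. The binding levels follow the order listed under Expression Precedence in this section.

```cpp
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Binding power per binary operator; 0 means "not a binary operator".
inline int precedenceOf(const std::string& op) {
    if (op == "||") return 1;
    if (op == "&&") return 2;
    if (op == "==" || op == "!=") return 3;
    if (op == "<" || op == ">" || op == "<=" || op == ">=") return 4;
    if (op == "+" || op == "-") return 5;
    if (op == "*" || op == "/" || op == "%") return 6;
    return 0;
}

// Hypothetical parser over pre-split tokens.
class ExprParser {
public:
    explicit ExprParser(std::vector<std::string> tokens) : tokens_(std::move(tokens)) {}

    // Parses operators whose precedence is at least minPrec.
    std::string parseExpression(int minPrec = 1) {
        std::string lhs = parsePrimary();
        while (pos_ < tokens_.size()) {
            const std::string op = tokens_[pos_];
            int prec = precedenceOf(op);
            if (prec == 0 || prec < minPrec) break;
            ++pos_;
            // prec + 1 makes every binary operator left-associative.
            std::string rhs = parseExpression(prec + 1);
            lhs = "(" + lhs + " " + op + " " + rhs + ")";
        }
        return lhs;
    }

private:
    std::string parsePrimary() {
        if (pos_ >= tokens_.size()) return "<missing>";
        std::string tok = tokens_[pos_++];
        if (tok == "(") {  // parenthesized sub-expression
            std::string inner = parseExpression(1);
            if (pos_ < tokens_.size() && tokens_[pos_] == ")") ++pos_;
            return inner;
        }
        if (tok == "-" || tok == "!") return "(" + tok + parsePrimary() + ")";  // unary
        return tok;  // literal or identifier
    }

    std::vector<std::string> tokens_;
    std::size_t pos_ = 0;
};
```

With this scheme, adding a precedence level only requires a new entry in `precedenceOf`; the recursive-descent parts for declarations and statements are unaffected.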
AST Hierarchy
ASTNode (abstract)
├── Declaration
│ ├── ImportDecl
│ ├── NamespaceDecl
│ ├── ClassDecl
│ ├── PropertyDecl
│ ├── ConstructorDecl
│ ├── MethodDecl
│ ├── ParameterDecl
│ └── EntrypointDecl
├── Statement
│ ├── InstantiateStmt
│ ├── RunStmt
│ ├── ForStmt
│ ├── ExitStmt
│ └── ReturnStmt
├── Expression
│ ├── IntegerLiteralExpr
│ ├── StringLiteralExpr
│ ├── BoolLiteralExpr
│ ├── IdentifierExpr
│ ├── ReferenceExpr
│ ├── MemberAccessExpr
│ ├── CallExpr
│ └── BinaryExpr
└── TypeRef
Expression Precedence
Expression (lowest precedence)
├── Logical OR (||)
├── Logical AND (&&)
├── Equality (==, !=)
├── Comparison (<, >, <=, >=)
├── Addition (+, -)
├── Multiplication (*, /, %)
├── Unary (-, !, &)
└── Primary (literals, identifiers, calls)
Phase 3: Semantic Analysis
The semantic analyzer validates the AST using a hierarchical symbol table and performs type checking.
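A hierarchical symbol table is commonly built as a chain of scopes, each pointing at its parent. The sketch below shows that idea only; the `Scope` class and its `declare` and `lookup` methods are hypothetical names, and types are stored as plain strings for brevity.

```cpp
#include <optional>
#include <string>
#include <unordered_map>

// Hypothetical scope node: owns its symbols and points at the enclosing scope.
class Scope {
public:
    explicit Scope(const Scope* parent = nullptr) : parent_(parent) {}

    // Returns false if the name is already declared in this same scope.
    bool declare(const std::string& name, const std::string& type) {
        return symbols_.emplace(name, type).second;
    }

    // Walks outward through enclosing scopes until the name is found.
    std::optional<std::string> lookup(const std::string& name) const {
        for (const Scope* s = this; s != nullptr; s = s->parent_) {
            auto it = s->symbols_.find(name);
            if (it != s->symbols_.end()) return it->second;
        }
        return std::nullopt;
    }

private:
    const Scope* parent_;
    std::unordered_map<std::string, std::string> symbols_;
};
```

In this layout a failed `declare` corresponds to a duplicate-declaration error and a failed `lookup` to an undeclared identifier. Each lookup visits at most one table per enclosing scope, so its cost grows with scope depth.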
Symbol Table Structure
Global Scope
├── Namespace: RenderStar
│   └── Namespace: Default
│       └── Class: MyClass
│           ├── Property: x (Integer^)
│           ├── Property: someString (String^)
│           ├── Constructor: default
│           └── Method: someMethod
│               ├── Parameter: int1 (Integer%)
│               └── Parameter: str (String&)
└── Entrypoint Scope
    ├── Variable: myClass
    ├── Variable: myString
    └── For Loop Scope
        └── Variable: x
Semantic Checks
- Name Resolution - Identifiers declared before use, no duplicates
- Type Checking - Variable initialization, argument types, return types
- Ownership Validation - Owned values not aliased, references point to valid owners
- Class Validation - Base classes exist, no circular inheritance
- Method Validation - Method exists, correct arguments
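As one concrete instance of the checks above, class validation (base classes exist, no circular inheritance) can be done by walking each class's base chain. The sketch below is an assumption about how such a check might look, not the analyzer's actual code: `InheritanceError` and `checkInheritance` are hypothetical names, and the string "None" marks a class without a base, following the `Extends None` syntax used in this document.

```cpp
#include <string>
#include <unordered_map>
#include <unordered_set>

enum class InheritanceError { None, MissingBase, Cycle };

// `bases` maps each declared class name to its base class name.
inline InheritanceError checkInheritance(
        const std::unordered_map<std::string, std::string>& bases,
        const std::string& cls) {
    std::unordered_set<std::string> seen{cls};
    std::string current = cls;
    while (true) {
        auto it = bases.find(current);
        if (it == bases.end()) return InheritanceError::MissingBase;  // undeclared class
        const std::string& base = it->second;
        if (base == "None") return InheritanceError::None;            // reached a root
        if (!seen.insert(base).second) return InheritanceError::Cycle;
        current = base;
    }
}
```

The walk terminates because every step either reaches a root, hits an undeclared name, or revisits a class already in `seen`.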
Phase 4: LLVM Code Generation
Modular Codegen System
ModularCodegen (orchestrator)
├── ExprCodegen - Expression code generation
│ ├── BinaryCodegen - Binary operations (+, -, *, /, comparisons)
│ ├── CallCodegen - Method/function calls
│ ├── IdentifierCodegen - Variable references
│ ├── LiteralCodegen - Integer, float, string, bool literals
│ └── MemberAccessCodegen - Property access, method resolution
├── StmtCodegen - Statement code generation
│ ├── AssignmentCodegen - Variable assignments
│ ├── ControlFlowCodegen - If/else, while, for loops
│ └── ReturnCodegen - Return statements
├── DeclCodegen - Declaration code generation
│ ├── ClassCodegen - Class structure generation
│ ├── ConstructorCodegen - Constructor methods
│ ├── MethodCodegen - Regular methods
│ ├── NativeMethodCodegen - FFI native methods
│ └── EntrypointCodegen - Main function generation
├── FFICodegen - Foreign function interface
├── MetadataGen - Reflection metadata
├── PreambleGen - Runtime preamble generation
└── TemplateGen - Template instantiation
Type-Safe LLVM IR
A compile-time type-safe abstraction prevents invalid IR generation:
| Component | Description |
|---|---|
| TypedValue<T> | Type-safe value wrappers (IntValue, FloatValue, PtrValue, BoolValue) |
| AnyValue | Runtime type variant when compile-time type is unknown |
| IRBuilder | Type-safe instruction builder preventing invalid IR |
| TypedModule | Module with type context and constant factories |
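The idea behind these components can be sketched with tag types: each wrapper carries its IR type in the C++ type system, so a builder method can only be called with operands of the right kind. The code below is a simplified illustration that emits textual IR and does not use the LLVM API. The type names `TypedValue`, `IntValue`, `FloatValue`, `BoolValue`, and `IRBuilder` come from the table above, but the tag structs and the method names `createAdd`, `createFAdd`, and `createICmpSLT` are assumptions made for this example.

```cpp
#include <string>

// Hypothetical tag types standing in for IR types.
struct IntTag {};
struct FloatTag {};
struct BoolTag {};

template <typename Tag>
struct TypedValue {
    std::string name;  // SSA register name, e.g. "%t0"
};
using IntValue = TypedValue<IntTag>;
using FloatValue = TypedValue<FloatTag>;
using BoolValue = TypedValue<BoolTag>;

// Toy builder that appends textual IR instead of calling LLVM.
class IRBuilder {
public:
    IntValue createAdd(const IntValue& a, const IntValue& b) {
        return {emit("add i64 " + a.name + ", " + b.name)};
    }
    FloatValue createFAdd(const FloatValue& a, const FloatValue& b) {
        return {emit("fadd double " + a.name + ", " + b.name)};
    }
    BoolValue createICmpSLT(const IntValue& a, const IntValue& b) {
        return {emit("icmp slt i64 " + a.name + ", " + b.name)};
    }
    const std::string& ir() const { return ir_; }

private:
    // Allocates a fresh register, records the instruction, returns the register.
    std::string emit(const std::string& inst) {
        std::string reg = "%t" + std::to_string(next_++);
        ir_ += "  " + reg + " = " + inst + "\n";
        return reg;
    }
    std::string ir_;
    int next_ = 0;
};
```

With these definitions, a call such as `builder.createAdd(intValue, floatValue)` is rejected by the C++ compiler, because `FloatValue` does not convert to `IntValue`.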
Note
Integer operations return IntValue, float operations return FloatValue, and comparisons return BoolValue. The compiler catches type mismatches at compile time.
Translation Examples
Classes to Structs
[ Class <MyClass> Final Extends None
    [ Public <>
        Property <x> Types Integer^;
        Property <name> Types String^;
    ]
]
Becomes:
%MyClass = type { ptr, ptr } ; x, name as pointers
Methods to Functions
Method <foo> Returns Integer^ Parameters (Parameter <x> Types Integer%) Do
{
    Return x.add(Integer::Constructor(1));
}
Becomes:
define ptr @MyClass_foo(ptr %this, ptr %x) {
entry:
  ; ... method body
  ret ptr %result
}
For Loops to Basic Blocks
For (Integer^ <i> = 0 .. 10) -> { ... }
Becomes:
for.init:
  %i = alloca ptr
  store ptr %zero, ptr %i
  br label %for.cond
for.cond:
  ; %i.val is the current integer value of i (its load is elided here)
  %cmp = icmp slt i64 %i.val, 10
  br i1 %cmp, label %for.body, label %for.end
for.body:
  ; ... loop body
  br label %for.inc
for.inc:
  ; increment i
  br label %for.cond
for.end:
  ; continue after loop
Runtime Integration
Compiled programs link against libXXMLLLVMRuntime:
- Memory management (xxml_malloc, xxml_free)
- Core types (Integer, String, Bool, Float, Double)
- Console I/O (Console_printLine, etc.)
- Reflection runtime for type introspection
Error Reporting
The ErrorReporter accumulates errors during compilation:
filename.xxml:10:5: error: Undeclared identifier 'foo' [3000]
Run System::Print(foo);
^
Error Categories
- Lexer Errors: Unexpected character, unterminated string, invalid number
- Parser Errors: Unexpected token, missing delimiter, invalid syntax
- Semantic Errors: Undeclared identifier, type mismatch, duplicate declaration
- CodeGen Errors: Internal errors (should not occur)
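The accumulate-then-print behavior and the message layout shown in the sample diagnostic can be sketched as follows. This is an illustration only: the `Diagnostic` struct and the `report`, `hasErrors`, and `format` methods are hypothetical, and the real ErrorReporter interface may differ. The sketch assumes 1-based columns and places the caret under the reported column.

```cpp
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical diagnostic record.
struct Diagnostic {
    std::string file;
    int line;
    int column;  // 1-based
    std::string message;
    int code;
    std::string sourceLine;
};

class ErrorReporter {
public:
    // Errors are stored rather than thrown, so later phases can keep reporting.
    void report(Diagnostic d) { diagnostics_.push_back(std::move(d)); }
    bool hasErrors() const { return !diagnostics_.empty(); }

    // Renders "file:line:col: error: message [code]", the source line, and a caret.
    std::string format() const {
        std::string out;
        for (const Diagnostic& d : diagnostics_) {
            out += d.file + ":" + std::to_string(d.line) + ":" + std::to_string(d.column) +
                   ": error: " + d.message + " [" + std::to_string(d.code) + "]\n";
            out += d.sourceLine + "\n";
            std::size_t pad = d.column > 0 ? static_cast<std::size_t>(d.column - 1) : 0;
            out += std::string(pad, ' ') + "^\n";
        }
        return out;
    }

private:
    std::vector<Diagnostic> diagnostics_;
};
```

Collecting diagnostics in a list lets a single compiler run surface several independent errors, and a driver can check `hasErrors()` between phases to decide whether to continue.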
Design Patterns
Visitor Pattern
Used for AST traversal in semantic analysis and code generation:
class ASTVisitor {
public:
    virtual ~ASTVisitor() = default;
    virtual void visit(ClassDecl& node) = 0;
    virtual void visit(MethodDecl& node) = 0;
    // ...
};
Builder Pattern
Used in the parser to construct AST nodes.
Strategy Pattern
Code generation backends are selected as strategies: the LLVM backend is the current implementation, and an interpreter backend is planned.
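As an illustration of the visitor pattern described in this section, the sketch below shows the double dispatch between nodes and a visitor. Only `ASTVisitor`, `ClassDecl`, `MethodDecl`, and `ASTNode` are names from this document; the `accept` method, the node fields, and the `NameCollector` pass are assumptions made for the example. The pass builds names in the `Class_method` style seen in the translation example (`@MyClass_foo`).

```cpp
#include <memory>
#include <string>
#include <utility>
#include <vector>

struct ClassDecl;
struct MethodDecl;

class ASTVisitor {
public:
    virtual ~ASTVisitor() = default;
    virtual void visit(ClassDecl& node) = 0;
    virtual void visit(MethodDecl& node) = 0;
};

struct ASTNode {
    virtual ~ASTNode() = default;
    virtual void accept(ASTVisitor& v) = 0;  // assumed dispatch hook
};

struct MethodDecl : ASTNode {
    std::string name;
    explicit MethodDecl(std::string n) : name(std::move(n)) {}
    void accept(ASTVisitor& v) override { v.visit(*this); }
};

struct ClassDecl : ASTNode {
    std::string name;
    std::vector<std::unique_ptr<MethodDecl>> methods;
    explicit ClassDecl(std::string n) : name(std::move(n)) {}
    void accept(ASTVisitor& v) override { v.visit(*this); }
};

// Example pass: collects function names for every method of every class.
class NameCollector : public ASTVisitor {
public:
    std::vector<std::string> names;
    void visit(ClassDecl& node) override {
        current_ = node.name;
        for (auto& m : node.methods) m->accept(*this);  // recurse into children
    }
    void visit(MethodDecl& node) override { names.push_back(current_ + "_" + node.name); }

private:
    std::string current_;
};
```

Each new pass is a new `ASTVisitor` subclass, so semantic analysis and code generation can share one set of node classes without adding pass-specific methods to them.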
Performance
Time Complexity
- Lexing: O(n) where n = source size
- Parsing: O(n) where n = number of tokens
- Semantic Analysis: O(n * m) where n = nodes, m = average scope depth
- Code Generation: O(n) where n = AST nodes
Optimization Opportunities
- String Interning: Reduce memory for repeated identifiers
- Arena Allocation: Faster AST node allocation
- Lazy Symbol Resolution: Delay expensive lookups
- Incremental Compilation: Recompile only changed modules
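Of the opportunities above, string interning is the simplest to sketch. The example below assumes a hypothetical `StringInterner` class; it stores each distinct identifier once and hands out a stable pointer, so later equality checks become pointer comparisons.

```cpp
#include <cstddef>
#include <string>
#include <unordered_set>

class StringInterner {
public:
    // Returns the same pointer for equal strings. Pointers stay valid for the
    // interner's lifetime because unordered_set never relocates its elements.
    const std::string* intern(const std::string& text) {
        return &*pool_.insert(text).first;
    }
    std::size_t size() const { return pool_.size(); }

private:
    std::unordered_set<std::string> pool_;
};
```

A lexer using this approach would intern each identifier as it is scanned, so that tokens, AST nodes, and symbol table entries all share one copy of the text.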
LLVM Optimizations
The LLVM backend leverages LLVM's optimization passes:
- Constant folding (also performed during compile-time evaluation)
- Dead code elimination
- Inline expansion
- Common subexpression elimination
Future Enhancements
- Interpreter: Direct AST interpretation for debugging
- JIT Compiler: Runtime compilation for dynamic scenarios
- Debug Information: DWARF debug info in generated binaries
Source Files
| Component | Location |
|---|---|
| Lexer | include/Lexer/, src/Lexer/ |
| Parser | include/Parser/, src/Parser/ |
| Semantic Analyzer | include/Semantic/, src/Semantic/ |
| LLVM Backend | include/Backends/, src/Backends/ |
| Import Resolver | include/Import/, src/Import/ |
| Common Infrastructure | include/Common/, src/Common/ |
Next Steps
Learn about the CLI Reference for compilation options, or explore the Import System for module resolution.