A derivation of a string for a grammar is a sequence of grammar rule applications that transform the start symbol into the string. A derivation proves that the string belongs to the grammar's language. (Source)
A strategy is followed that deterministically determines the next nonterminal to rewrite:
- in a leftmost derivation, it is always the leftmost nonterminal;
- in a rightmost derivation, it is always the rightmost nonterminal. (Source)
- Also known as parse tree
- Output of a parser
- Complete representation of the input in the leafs
- Names of Rules in the interior nodes
- How it is matched
- Also known as syntax tree
- Output of a parser
- Free of inessential nodes
- e.g. grouping parentheses are implicit in the tree structure
- Incomplete representation of the input in the leafs
- Names of Rules in the interior nodes
- How it is matched
Front end
- Preprocessing
- Lexical analysis (lexing)
- Syntax analysis (parsing)
- leads to an CST or AST
- Semantic analysis
- Generation of intermediate representation
Middle end
- Analysis
- Target independent optimizations
Back end
- Target specific optimizations
- Code generation or interpretation
- Assembly of object files (in case the generated code belongs to an assembly language)
- Linking
Often an existing toolchain for a language such as C/C++ is used as middle and back end. In this case the front end would transpile the custom language into C/C++.
This approach might provide a simple solution in case the custom language is close to C/C++. Otherwise the mapping might turn out to be complex (e.g. if memory management by garbage collection is required), which is a downside of the approach.
This problem is a motivation for the LLVM project.
Top-down parser
first looks at the highest level of the parse tree and works down the parse tree by using the rewriting rules of a formal grammar Top-down parsing can be viewed as an attempt to find left-most derivations (Source)
Bottom-up parser
recognizes the text's lowest-level small details first, before its mid-level structures, and leaving the highest-level overall structure to last. (Source)
Shift-reduce parser
- Kind of bottom-up parser
Recursive descent parser
- Kind of top-down parser
- Uses mutually-recursive functions rather than tables
- Lower complexity (in many cases, no parser generator is required)
- Faster
Recursive ascent parser
- Kind of bottom-up parser
- Uses mutually-recursive functions rather than tables
- Lower complexity (in many cases, no parser generator is required)
- Faster
Table based parser
- Lower memory footprint
- Higher complexity (usually parser generator is required)
Packrat parser
Motivation
- Placing restrictions on a context-free language grammar keeps runtime complexity at bay
LL(k)
- Top-Down
- Processes input left to right
- Leftmost derivation
- k is a lookahead to amke a decision
LL
- Synonym for LL(1)
LL(*)
- LL(k) which is not restricted to a finite number k
LF(k)
- Also known as strong LL-parser
- Top-Down
LL(k)
- Top-Down
- Processes input left to right
- Leftmost derivation
- k is a lookahead
LR(k)
- Bottom-Up-Parser
- Processes input left to right
- Rightmost derivation
- k is a lookahead to amke a decision
LR
- Synonym for LR(1)
LALR(1)
- Look-Ahead LR parser
-
Also known as compiler compiler
-
Usually supports generation of a lexer and parser based on a given grammar
-
Among the most popular generators are Flex+Bison (LALR) and ANTLR (LL(*))
- Generates a lexer
- Supported target language is C++
License
| Tool | License |
|---|---|
| Lex | various |
| Flex | BSD |
- Generates a parser
- Supported target language is C++
- Table driven (smaller memory footprint)
License
| Tool | License |
|---|---|
| Yacc | various |
| Bison | GPL |
- Generates a lexer, parser, syntax analyzer, visitor (tree walker) and/or listener based on a given LL(*) grammer (
.g4format) - Recursive descent parser
- Requires Java Runtime
- Supported target languages: Java, C#, Python, JavaScript, Go, C++, Swift
- An IDE tool, called ANTLRworks, is available
- May be used along with StringTemplates as shown in this article
References
- Let's build a compiler - wir entwickeln praxisnah einen Compiler mit Java und ANTLR4
- Antlr4 - Visitor vs Listener Pattern
- Supports labels for rules (
rule # label) - Supports labels for rule elements (
label=token)
Resources
| Feature | Visitors | Listeners |
|---|---|---|
| Explicit call required to include children | Yes | No |
| Use a return value | Yes | No |
| Support for large parse trees | No | Yes |
References
Follow instuctions on antlr.org
antlr4 -o src Demo.g4
javac -cp lib/antlr.jar src/*.java
grun src/Demo expr
1+2*3Emmit EOF (Linux: Ctrl+D, win: Ctrl+Z)
antlr4 -o src -Dlanguage=Cpp -package theNamespace Demo.g4Resources
- Initially started by Chris Latner
- Motivated by the way JIT compilers worked at the time, which performed optimization during run-time. This would therefore be repeatet each time the application was executed unless the result is cached. Thus the working hypothesis was to move as much of the optimization to compile-time.
- Since typically programs are optimized by manipulating an intermediate representation (IR), the term Low Level Virtual Machine (LLVM) was coined.
- Since then LLVM developed to an umbrella project, featuring A core project as well as other subprojects concerned with the development of the
clangfront-end and various tools
Resources
- Optimizer
- JIT
- GC
The IL can be provided in 2 representations:
- human readable ascii representation
- binary bitcode representation
Bidirectional transformation between representations is possible using the following commands:
llvm-as(transforms IL in ascii representation to bitcode representation)llvm-dis(transforms IL in bitcode representation to ascii representation)
Language aspects:
;introduces a line comment- identifiers are prefixed according to their nature:
@for global identifiers (functions, global variables)%for local identifiers (register names, types)$for comdats!for metadata#for attribute groups^for a entry of a module summary
- Pointer types are speficied by
<type> * - Array types are speficied by
[ <size> x <type> ] - Vector types are speficied by
< <size> x <type> > - Structure types are specified by either
{ <type list> }(regular) or<{ <type list> }>(packed)
In order to compile for a specific target additional information must be provided:
- a datalayout which specifies
- Byte order (endianness)
- Type specific allocation sizes
- Type specific memory alignments
- Style of name mangling
- …
- a target triple which specifies
- Architecture
- Vendor
- Operating system
- optional: Environment
- The source filename
A module may also specify flags for upcoming stages in the toolchain such as e.g. linker options and magic lobal variables.
- A front-end that outputs the IL is required
- This may be achieved by utilization of the core libraries (C++) whilch provide classes such as e.g. an
IRBuilder
- This may be achieved by utilization of the core libraries (C++) whilch provide classes such as e.g. an
References
Commands (excerpt):
llvm-link(linker for IL in bitcode representation)lli(executes IL in bitcode representation using JIT compilation)lldb(debugger)opt(optimizer)llc(compiles IL in any representation to assembly code)
See also LLVM Command Guide for a full listing
Build from source: Requirements: Make, CMake, Compiler Repository: https://github.com/llvm/llvm-project
Pre-build: http://releases.llvm.org/download.html
Based on multiple passes of different types:
- Module Pass
- Call Graph Pass
- Function Pass
- Basic Block Pass
- Immutable Pass, Region Pass, MachineFunction Pass
Within each of the mentioned passes:
- Analysis Pass
- Transform Pass
clang++ -S -emit-llvm source.cppObservations
- IL uses SSA
opt -dot-cfg-only input.ll > /dev/null
xdot cfg.main.dot#include "llvm/Pass.h"
#include "llvm/IR/Function.h"
#include "llvm/Support/raw_ostream.h"
using namespace llvm;
namespace {
struct DemoPass : public FunctionPass {
static char ID;
DemoPass() : FunctionPass(ID) {}
bool runOnFunction(Function &F) override {
errs().write_escaped(F.getName()) << '\n';
return false;
}
}
}
char DemoPass::ID = 0;
static RegisterPass<Hello> X("name_of_the_pass",
"Description of the pass",
false,
false);opt -load path/to/DemoPass.so -name_of_the_pass < input.bc- LLVM Compiler Infrastructure
- LLVM Tutorial: Table of Contents
- LLVM
- LLVM for Grad Students
- Introduction to LLVM Building simple program analysis tools and instrumentation
- YOW! 2016 Erik Corry - Building Your Own Compiler The Slightly Easier Way With LLVM
- My First LLVM Compiler
Build demo
- Install ANTLR into
demodirectory - Install LLVM from source
- Build demo using the following commands
cd demo
cmake .
cmake --build .