Home
Welcome to the femtoc wiki! This wiki serves as developer (and eventually, user) documentation for the reference femto compiler toolchain.
Femtoc is written in the Zig systems programming language. While Zig is still maturing, it was deemed stable enough to use; it allowed me to get started on a prototype much faster than C or Rust would have, while avoiding the complexity of C++. Nevertheless, it performs at least as well as these other popular systems languages, and has very good FFI support for directly calling into the C ABI.
The femtoc codebase sits somewhere between a toy language and a full-fledged compiler frontend project (like Zig's) in terms of complexity. There is an emphasis on performance, correctness, and good architecture that isn't always present in toy languages. However, since femto is early in development, it hasn't grown to the scale of compilers like Zig's or Rust's, whose frontends run to many tens of thousands of lines of code.
The codebase is heavily inspired by and based on the Zig self-hosted compiler. However, it has been simplified (Zig is a much more complicated language and a more ambitious project than femto) and will continue to diverge from zigc.
Femtoc has a fairly traditional structure for a compiler frontend. By frontend, I mean that the majority of femtoc's code is dedicated to converting .fm source code into an intermediate representation (IR) that can be used by a compiler backend project to generate machine-specific object and linked binary files. The primary backend is the LLVM project, the state-of-the-art compiler backend that powers the clang C/C++ compiler, the reference Rust and Zig compilers, and dozens of other projects. LLVM provides high-quality optimizations, support for many architectures, assembly and linkage, and debug (DWARF) support, which allows me to prototype quickly.
However, it is also explicitly my goal NOT to develop a hard dependency on LLVM. Other backends exist and may be useful in certain scenarios. While it is not currently (as of 03/13/23) a high priority to develop other backends, the code will be structured so that LLVM can be plugged in as just one of several backends. Examples of other backends that we may want to support:
- QBE is a very simple compiler backend written in pure C. It is most notably used by the Hare programming language, as well as the cproc C compiler. QBE has a much more limited set of optimizations and supported architectures. However, its simplicity makes it much easier to validate and contribute to, and it runs very fast compared to LLVM.
- I plan to have a C backend that will generate (hopefully MISRA-compliant) C code. This helps with bootstrapping femto code on a system with only a C compiler, allows easier migration for adopters that want to start combining femto with C code, etc.
- We may create a custom backend for use with embedded devices that LLVM and QBE have limited support for. This includes older architectures like AVR and PIC, and if we feel like it, esoteric and fun architectures like 6502, (e)Z80, and 68K. This serves multiple purposes: it's easier to design a backend ourselves for this than to add support to LLVM, it allows us to tailor optimizations toward low-end, simple architectures (which have very different performance characteristics than modern X86/AARCH64/RISCV64), and it also serves as a learning opportunity in creating a compiler backend.
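To make the C backend idea concrete, here is a rough, hypothetical sketch (in Python for brevity, not actual femtoc code): given a flat list of three-address IR operations, each operation can be emitted as one C statement assigning to a fresh temporary. The IR shape and the `emit_c` helper are both invented for illustration.

```python
# Hypothetical flat IR for the expression `1 + 2 * 3`, three-address style:
# each tuple is either ("const", value) or (operator, left_index, right_index),
# where indices refer to earlier entries in the list.
ops = [("const", 1), ("const", 2), ("const", 3), ("*", 1, 2), ("+", 0, 3)]

def emit_c(ops, name="expr"):
    """Emit a C function computing the IR, one statement per operation."""
    lines = [f"static int {name}(void) {{"]
    for i, op in enumerate(ops):
        if op[0] == "const":
            lines.append(f"    int t{i} = {op[1]};")
        else:
            kind, left, right = op
            lines.append(f"    int t{i} = t{left} {kind} t{right};")
    lines.append(f"    return t{len(ops) - 1};")  # last op holds the result
    lines.append("}")
    return "\n".join(lines)
```

A real backend would of course handle types, control flow, and name mangling, but the core idea is the same: a flat IR maps almost one-to-one onto simple C statements, which is what makes a C backend attractive for bootstrapping.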
That said, what is femtoc's frontend architecture? It consists of the following steps:
- lexing: Source code is split into a stream of lexical tokens which represent conceptual elements like keywords, identifiers, literals, and punctuation.
- parsing: The token stream is matched against known language constructs to generate an abstract syntax tree (nested tree structure) of function declarations, statements, expressions, and literals.
- hir generation: The AST is used to generate a flatter (but still block-nested) high-level intermediate representation, which splits nested expressions into multiple consecutive operations, performs identifier resolution, and converts more complex structures and syntax sugar to simpler sets of instructions.
- mir generation: HIR is converted to a mid-level intermediate representation. During this stage, types are checked (HIR is only loosely aware of types), compile-time code evaluation is performed, nested functions and structures are pulled out into the top level, and all source files/HIR translation units are combined into a single compilation unit. MIR is responsible for most of the semantic analysis: type checking, overflow and bounds checking, verifying that functions are called correctly, etc.
- codegen: MIR is sent to a backend (such as LLVM) to generate an object file.
- linking: The object file is linked with external libraries/etc to generate an executable binary.
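The first three stages can be illustrated with a deliberately tiny sketch, written in Python rather than Zig and sharing no names with the actual femtoc codebase: a lexer for integer arithmetic, a precedence-climbing parser that builds a nested AST, and a flattening pass that splits nested expressions into consecutive indexed operations, loosely analogous to what HIR generation does.

```python
import re

# --- lexing: split source text into a stream of tokens ---
TOKEN_RE = re.compile(r"\s*(?:(\d+)|(.))")  # integer literals or single chars

def lex(src):
    """Return a list of (kind, text) tokens."""
    tokens = []
    for num, op in TOKEN_RE.findall(src):
        tokens.append(("int", num) if num else ("op", op))
    return tokens

# --- parsing: build a nested AST, honoring operator precedence ---
def parse(tokens):
    """Precedence-climbing parser producing nested tuples."""
    prec = {"+": 1, "-": 1, "*": 2, "/": 2}
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else (None, None)

    def expr(min_prec=0):
        nonlocal pos
        kind, text = tokens[pos]
        pos += 1
        assert kind == "int", "expected an integer literal"
        left = ("int", int(text))
        while True:
            kind, op = peek()
            if kind != "op" or op not in prec or prec[op] < min_prec:
                break
            pos += 1
            right = expr(prec[op] + 1)  # +1 makes operators left-associative
            left = (op, left, right)
        return left

    return expr()

# --- "HIR"-style flattening: nested tree -> consecutive operations ---
def flatten(node, code):
    """Append ops to `code`; each op refers to earlier ops by index."""
    if node[0] == "int":
        code.append(("const", node[1]))
    else:
        op, left, right = node
        flatten(left, code)
        left_idx = len(code) - 1
        flatten(right, code)
        right_idx = len(code) - 1
        code.append((op, left_idx, right_idx))
    return code
```

For `1 + 2 * 3`, the parser yields the nested tree `("+", ("int", 1), ("*", ("int", 2), ("int", 3)))`, and flattening turns it into five consecutive operations ending with the addition. The real pipeline does far more (identifier resolution, desugaring, block nesting), but the tree-to-flat-list transformation is the essential shape of it.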
Here is a WIP list of resources for learning more about compiler design: