Looking at your description of the problems you are encountering and at the architecture and design you are describing in general (i.e. without even considering the fact that it is a compiler), your problems seem to be pretty standard ones:
- shared (or, even worse, global) mutable state that makes it unclear what is actually going on
- large units of functionality
And the solutions are also pretty standard:
- don't share, don't mutate, or both; i.e. have pure functions that take an AST as an argument and return an updated AST as the result
- break them up into smaller ones
Of course, that is easier said than done … after all, you probably would have already done it if it was easy.
So, looking specifically at compilers, what can we do about shared mutable state? Like I said: instead of mutating a single AST, have the functions take an AST as an argument and return a new one as the result. This may sound pretty expensive, since it seems to require copying the entire AST, but unless you actually measure it, you cannot be sure. And once you do determine that it is slowing you down, the nice thing about immutable trees is that you can use structural sharing: you only have to copy the path from the root to the updated node, and the rest of the tree can be shared between the old and the new version.
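To make this concrete, here is a minimal sketch (in Scala here, but the idea is language-agnostic) of what such a pure, structurally sharing pass could look like; the Expr node types and the constant-folding pass are invented for illustration, not taken from any particular compiler:

```scala
// A tiny immutable AST; the node types are invented for illustration.
sealed trait Expr
case class Lit(value: Int)           extends Expr
case class Add(lhs: Expr, rhs: Expr) extends Expr
case class Mul(lhs: Expr, rhs: Expr) extends Expr

// A "pass": takes an AST, returns a new AST, mutates nothing.
// Here: constant folding, e.g. Add(Lit(1), Lit(2)) becomes Lit(3).
def foldConstants(e: Expr): Expr = e match {
  case Add(l, r) =>
    (foldConstants(l), foldConstants(r)) match {
      case (Lit(a), Lit(b))                   => Lit(a + b)
      case (l2, r2) if (l2 eq l) && (r2 eq r) => e           // nothing changed below: reuse this node
      case (l2, r2)                           => Add(l2, r2) // rebuild only the changed spine
    }
  case Mul(l, r) =>
    (foldConstants(l), foldConstants(r)) match {
      case (Lit(a), Lit(b))                   => Lit(a * b)
      case (l2, r2) if (l2 eq l) && (r2 eq r) => e
      case (l2, r2)                           => Mul(l2, r2)
    }
  case lit: Lit => lit                                       // leaves are shared as-is
}
```

Only the spine above a folded node is rebuilt; every untouched subtree is physically the same object in the old and the new tree, which is exactly the structural sharing mentioned above.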
Breaking up the large pieces of functionality can be done by making a multi-pass compiler. Originally, compilers used multiple passes because the whole compiler, or the whole data set (or both) simply didn't fit into the extremely restricted memory of the early computers. But, it turns out that you can use multiple passes to simplify each individual pass.
Basically, each pass is like its own mini-compiler: it reads input in some language, does "stuff" with it, and writes output in some other "language". For example, the lexer pass reads C♯ and writes a token stream, the parser pass reads a token stream and writes a Parse Tree, the semantic analyzer pass reads a Parse Tree and writes an Abstract Syntax Tree, the typer pass reads an Abstract Syntax Tree and writes a Type-Annotated AST, and so on.
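As a sketch of what this looks like in code, here is a hypothetical pipeline in which every pass is just a function from one representation to the next; all the type and pass names are placeholders, and the bodies are stubbed out:

```scala
// Hypothetical intermediate representations, one per pass boundary.
case class SourceFile(text: String)
case class Token(kind: String, lexeme: String)
case class ParseTree()   // structure elided
case class Ast()         // structure elided
case class TypedAst()    // structure elided

// Each pass is a small, independently testable function (stubbed here).
def lex(src: SourceFile): Vector[Token]     = Vector.empty
def parse(tokens: Vector[Token]): ParseTree = ParseTree()
def analyze(tree: ParseTree): Ast           = Ast()
def typeCheck(ast: Ast): TypedAst           = TypedAst()

// The front end is nothing more than the composition of its passes.
def frontEnd(src: SourceFile): TypedAst =
  typeCheck(analyze(parse(lex(src))))
```

Each stage can be swapped out, tested with hand-written inputs, and reasoned about on its own, which is the whole point of the exercise.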
You can do this recursively on multiple scales, i.e. break the compiler up into a front end, a middle end, and a back end, break each of those again into multiple passes, and so on.
Don't forget that modern compilers must do a lot of stuff that older compilers typically didn't: an IDE is kind of a compiler, too. In fact, it does pretty much everything a compiler does except the actual code generation: lexing, parsing, type checking, type inference, name resolution, etc., all in order to support syntax highlighting, warnings, errors, quick fixes, refactorings, auto completion, documentation popups, and all those little light bulbs, squiggly lines, hints, and helpers that you are used to. So, why write two compilers? Why not use the same compiler for both?
In order to do that, the compiler needs to be able to process incomplete and invalid input: you want all of this to happen while the programmer is typing, and while they are typing, the program doesn't compile 99.9% of the time; the compiler not only needs to deal with that, it needs to be helpful about it. Also, re-compiling the entire project on each keystroke would be insane, so the compiler needs to be able to compile small pieces of code individually as the programmer types them and integrate the results with its view of the rest of the code. The compiler also needs to be asynchronous, concurrent, and re-entrant, since it not only runs in parallel with the rest of the IDE, but often even in parallel with itself (e.g. building the project while the programmer is already writing new code, which needs to be highlighted, etc.).
These requirements are very different from those of a traditional batch compiler, and yet it makes sense to use the same compiler for both batch compilation and the IDE: that way, the two can never disagree and never get out of sync.
Here are a couple of pointers to compilers that are written in a somewhat non-traditional way, one you won't find in textbooks, but that IMO improves maintainability and evolvability:
Roslyn

The C♯ compiler Roslyn is actually very close to a traditional compiler design, but its ASTs are completely immutable (and actually persistent). This allows multiple compilation processes to operate on the same AST without stepping on each other's toes, which is important if you integrate the compiler into an IDE (e.g. the syntax highlighter and the code style checker may traverse the AST at the same time that the code generator is building a solution).
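This is not Roslyn's API, just a small sketch of why immutability buys you that: any number of analyses can walk the same tree concurrently, and "changing" the tree just means deriving a new one. The node and analysis names below are made up.

```scala
import scala.concurrent.{Await, Future}
import scala.concurrent.duration.Duration
import scala.concurrent.ExecutionContext.Implicits.global

// A tiny immutable tree standing in for a syntax tree.
sealed trait Node
case class Leaf(name: String)       extends Node
case class Branch(kids: List[Node]) extends Node

def size(n: Node): Int = n match {
  case Leaf(_)      => 1
  case Branch(kids) => 1 + kids.map(size).sum
}

@main def concurrentReads(): Unit = {
  val tree: Node = Branch(List(Leaf("a"), Branch(List(Leaf("b"), Leaf("c")))))

  // Because the tree can never change underneath them, a "highlighter", a
  // "style checker", and a pass deriving a new tree can all run at once,
  // with no locks and no risk of observing a half-updated tree.
  val highlighter  = Future(size(tree))
  val styleChecker = Future(size(tree))
  val derived      = Future(Branch(List(tree, Leaf("extra")))) // derives, never mutates

  println(Await.result(Future.sequence(List(highlighter, styleChecker)), Duration.Inf))
  println(Await.result(derived, Duration.Inf))
}
```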
Dotty

Dotty is both a language intended to test out concepts for future versions of Scala and a compiler intended to test out concepts for future versions of the Scala compiler. We will ignore Dotty the language for this question; only the compiler is relevant.
The Dotty compiler takes things even further than Roslyn when it comes to immutability: it is built like a database, more precisely, like a temporal database.
See Martin Odersky's talk Compilers are Databases from the JVM Language Summit 2015 about the design of the Dotty compiler.
The basic idea of the Dotty compiler is that there is no mutable state: everything is fully immutable and purely functional. This is achieved by taking ideas from purely functional (aka "temporal") databases. Data that in a traditional compiler would change over time (such as a symbol table) is instead represented as (timestamp, value) pairs, i.e. as values indexed by time. (They don't use actual time, though, but rather a notion of time internal to the compiler, based on the run number and the compiler phase.)
In particular, this means that there is no symbol table. Instead, the role the symbol table plays in a traditional compiler is split across multiple data structures, all of which are immutable; some are time-invariant, some are time-varying. These are References, Denotations, and Symbols; in the talk, the discussion of References starts around 30:30, Denotations around 34:26, and Symbols around 37:30.
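The following is not Dotty's actual implementation, only a toy sketch of the "values indexed by time" idea; the names Period, Denotation, and TimeIndexed are made up for illustration:

```scala
// The compiler's internal notion of "time": run number plus phase.
case class Period(runId: Int, phaseId: Int) {
  def notAfter(other: Period): Boolean =
    runId < other.runId || (runId == other.runId && phaseId <= other.phaseId)
}

// What a name *means* at some point in time (hugely simplified).
case class Denotation(name: String, tpe: String)

// Instead of overwriting a symbol-table entry, each phase that changes what a
// name means records a new value for a later period; history is never lost.
case class TimeIndexed[A](history: List[(Period, A)]) {
  // The value visible at a period is the latest one recorded at or before it.
  def at(p: Period): Option[A] =
    history
      .filter { case (q, _) => q.notAfter(p) }
      .sortBy { case (q, _) => (q.runId, q.phaseId) }
      .lastOption
      .map(_._2)

  // "Updating" produces a new immutable history; old entries are shared.
  def updated(p: Period, value: A): TimeIndexed[A] =
    TimeIndexed((p, value) :: history)
}

@main def denotationDemo(): Unit = {
  val xs = TimeIndexed(List(Period(1, 2) -> Denotation("xs", "List[Int]")))
    .updated(Period(1, 7), Denotation("xs", "Array[Int]")) // e.g. after an erasure-like phase
  println(xs.at(Period(1, 3))) // Some(Denotation(xs,List[Int]))
  println(xs.at(Period(1, 9))) // Some(Denotation(xs,Array[Int]))
}
```

The same reference can thus mean different things at different phases, and earlier answers remain queryable instead of being destroyed by an in-place update.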
Dotty also uses a comparatively large number of passes compared to other industrial-strength production-quality compilers, with each individual pass being relatively simple. For performance reasons, there is a framework that can automatically fuse multiple passes back together into a single pass; the important thing is that the resulting large monster pass is generated automatically, not written by a human, and thus doesn't hurt maintainability.
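The following is not Dotty's actual fusion framework, just a sketch of the underlying idea: each small pass is a per-node rewrite, and "fusing" them means composing the rewrites and applying them in a single traversal, so the tree is walked once instead of once per pass. All names are invented.

```scala
sealed trait Tree
case class Num(n: Int)            extends Tree
case class Neg(t: Tree)           extends Tree
case class Plus(a: Tree, b: Tree) extends Tree

// Each "mini pass" rewrites only the single node handed to it.
type NodeRewrite = Tree => Tree

val simplifyDoubleNeg: NodeRewrite = {
  case Neg(Neg(t)) => t
  case other       => other
}

val foldPlusZero: NodeRewrite = {
  case Plus(t, Num(0)) => t
  case other           => other
}

// Fusing: compose all per-node rewrites into one and apply it in a single
// bottom-up traversal of the tree.
def fuse(passes: List[NodeRewrite]): Tree => Tree = {
  val combined = passes.foldLeft[NodeRewrite](identity)(_ andThen _)
  def walk(t: Tree): Tree = {
    val rebuilt = t match {
      case Neg(x)     => Neg(walk(x))
      case Plus(a, b) => Plus(walk(a), walk(b))
      case n: Num     => n
    }
    combined(rebuilt)
  }
  walk
}

// One traversal that applies both rewrites at every node.
val singlePass: Tree => Tree = fuse(List(simplifyDoubleNeg, foldPlusZero))
```

The individual rewrites stay tiny and readable; only the mechanically generated fused function is big.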
Compilers written in Haskell
The Haskell community has a lot of very interesting approaches to various problems, and compilers are no exception. One example is structuring compilers as Monad Transformer Stacks, where each individual language feature is represented as a Monad Transformer, so that the language can be composed out of lots of little independent features. Another is the Idris compiler, which is written in Haskell using an "Elaboration Monad" (you can roughly think of "Elaboration" as "Semantic Analysis"), with different language features written as "elaboration scripts" inside that monad.
The Nanopass Framework

The idea of nanopasses is basically: instead of breaking a compiler up into 2, 3, or 10 passes, why not break it up into 20, 30, or 100+ extremely simple, extremely small passes that each do only one thing, and do it well?
If you do that naively, you typically end up with a lot of code duplication: every pass takes in some input language, does a tiny thing, and produces some output language. The code for parsing all those input languages and generating all those output languages is very repetitive, especially since the individual passes are extremely small and the languages of consecutive passes therefore tend to be very similar.
That's what the nanopass framework is for: it contains two DSLs plus the associated machinery. One DSL is for defining languages, with support for defining them differentially (in other words, "inheriting" a language from another and only spelling out what changed). The other is for defining nanopasses, with support for writing code only for the parts of the language the pass actually manipulates; no-op "passthrough" code for everything else is generated automatically. As a result, typical language definitions and passes are only a couple of lines of code each.
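The real nanopass framework is a pair of Scheme DSLs built on macros; the hand-written Scala sketch below only shows the flavour of a single nanopass, and incidentally also shows the boilerplate (the near-duplicate language definition and the passthrough cases) that the framework generates for you. The languages and the lowering are made up.

```scala
// Source language L0.
sealed trait L0
case class Var0(name: String)            extends L0
case class Lam0(param: String, body: L0) extends L0
case class App0(fn: L0, arg: L0)         extends L0
case class If0(c: L0, thn: L0, els: L0)  extends L0

// Target language L1 is "L0 minus the `if` form". Note how much of L0 it has
// to repeat: exactly the duplication the framework's language DSL removes.
sealed trait L1
case class Var1(name: String)            extends L1
case class Lam1(param: String, body: L1) extends L1
case class App1(fn: L1, arg: L1)         extends L1   // no If1: the form is gone

// One nanopass: lower `if` into a call to a hypothetical "prim-if" primitive
// with thunked branches. Only the If0 case is interesting; the rest is the
// passthrough boilerplate the pass DSL would generate automatically.
def lowerIf(e: L0): L1 = e match {
  case If0(c, t, f) =>
    App1(App1(App1(Var1("prim-if"), lowerIf(c)),
              Lam1("_", lowerIf(t))),
         Lam1("_", lowerIf(f)))
  case Var0(n)    => Var1(n)
  case Lam0(p, b) => Lam1(p, lowerIf(b))
  case App0(f, a) => App1(lowerIf(f), lowerIf(a))
}
```

With the boilerplate generated for you, the hand-written part of such a pass shrinks to essentially the one interesting case.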
The nanopass framework was successfully used to re-architect the Chez Scheme compiler. This effort is described in Andy Keep's PhD thesis, A Nanopass Framework for Commercial Compiler Development, and summarized in a short paper of the same name as well as in a talk given at ClojureConj 2013.