Writing a simple Compiler on my own - Finishing Off The Grammar/Parser (part 1)
[Custom Thumbnail]
All the Code that will be mentioned in this article can be found at the Github repository:
https://github.com/drifter1/compiler
under the folder "Finishing Off The Grammar-Parser-part1".
Introduction
Hey it's a me again @drifter1! Today we continue with my Compiler Series, a series where we implement a complete compiler for a simple C-like language by using the C-tools Flex and Bison in lots of parts. Because of the "failed" Hardfork, my previous article wasn't voted because it payed out before the utopian-io bot was able to vote again. Either way, now that blockchain operations seem stable again, I'm back to continue with another article/part of this series.
The topics that we will cover today are:
- Talk about the Grammar of the Language (somewhat recap)
- Which grammar rules are already in the parser and which are missing
- Finish off the Bison Parser by adding the missing rules
- Fix-up some of the rules that are easy to fix
Next time we will tweak and fix even more things to get rid of some more grammar-rule bugs and to make our grammar simpler and easier to understand. That's why this is part 1!
Requirements:
Actually you need to read and understand all the topics that I covered in the series as a whole, as these articles will give you access to knowledge about:
- What Compiler Design is (mainly the steps)
- For which exact Language the Compiler is build for (Tokens and Grammar)
- How to use Flex and Bison
- How to implement a lexer and parser for the language using those tools
- What the Symbol Table is and how we implement it
- How we combine Flex and Bison together
- How we can pass information from the Lexer to the Parser
Difficulty:
Talking about the series in general this series can be rated:
- Intermediate to Advanced
Today's topic(s) can be rated:
- Medium, as lots of the bugs will not be fixed today
So, without further ado, let's now finally start with the actual Tutorial...
Actual Tutorial Content
Grammar of the Language
The Language for which we implement the Compiler is a simplified version of the C Programming Language. We already mentioned in previous parts that we have:
- less reserved words and keywords (only the basic ones)
- basic data types char, int, float, double and boolean ("virtually" to check conditional statements), arrays and pointers to all the data types
- no preprocessor, meaning that we don't have include, define etc. statements
- a simple "compiler"-based function for printing which will be recognized by the name "print"
- simple regular expressions for the lexical units and tokens
- a simplified syntax that only lets us define variables at the beginning and doesn't contain a main-function
- static (non-dynamic) semantic rules, meaning that nothing will be checked at Runtime
- Only static memory allocations (for now)
The basic structure of a programm in this Language is:
main function that contains:
- variable declarations (datatype, name and optional initialization, arrays and pointers)
- body/statemets (arithmetic expressions, if-else, for, while statements, function calls)
In this strict order!
other functions that consist of:
- the function declaration (return-value datatypes, name, parameters)
- variable declarations
- main body / statements
Again, in this strict order to stay simple.
An example that contains most of the previously mentioned things is:
// main function
// declarations
int i;
double val = 2.5, res[10];
// statements
for(i = 0; i < 10; i++){
res[i] = operation(val, i);
val = res[i];
print(res[i]);
print('\n');
}
return;
// functions
double operation (double value, int i){ /* function declaration */
// declarations
double res;
// statements
res = value*i + i;
return res;
}
which you can also find with the name "example2.c" in the repository.
Which grammar rules are already in the parser and which are missing
The parser that we implemented in Bison until now, contains most of the needed grammar rules. The main part that is missing is anything that has to with "other" functions besides the main function of the program. We have to declare rules about how we call a function in the "main" or other functions and how we define functions and their statements. Knowing that the program as a whole is expressed by the "program"-rule which is:
program: declarations statements ;
We can clearly see that no functions can be declared yet.
Also, checking the "statement"-rule we can see that there is no such expression that "calls a function". This part of calling functions will be very interesting later on, as we will have to check if the number of parameters and the datatypes of the values are correct. This is something that we will cover from next time, where we will get into Semantics.
All the other rules of the Grammar might seem complete, but they aren't! One big thing that is not been included yet is allowing variables that are declared to also be directly assigned to a value. This initialization will be optional for each "basic" datatype, arrays and pointers. Worth noting is that we will only allow an initialization with constants (ICONST, FCONST, CCONST). For arrays we will define a "special way" of initialization, which will be similar to the one in "regular" C. If an initilization is "allowed" will be controlled by the Semantics, meaning that we don't have to write complex rules that split based on the type of variable, array or pointer that we are declaring.
One more interesting thing is that we don't have a "NULL"-value for pointers yet, something that we will get back to another time. But, just know for now that we will have to change the initialization afterwards to let us initialize pointers to the special "NULL"-value, which means that the pointer doesn't look anywhere.
So, the summarize really quick, what's missing is:
- function declarations that include the "actual" declaration, the declaration of variables inside of a function and the "main" body of a function (statements)
- "function call" statement
- including initialization of variables
Finish off the Bison Parser by adding the missing rules
Let's get into each of the previous things one by one..
Function declaration
The first thing that we have to do is create a new rule that let's us define functions (more than one). The "program"-rule, which is the main rule of the grammar has to be extended to allow such functions to exist. Let's define this new rule as "functions" and each of the functions that it let's us declare as "function". That way the main-rule (program) of the grammar now looks as following:
program: declarations statements functions;
The "functions"-rule is a left-recursive rule that let's us declare more than one functions, which can be done following the same structure as the "declarations" and "statements" rules. This means that the "functions"-rule is defined as follows:
functions: functions function | function;
You can see that we don't let the "functions"-rule be empty directly. This is something that will be "allowed" by the "function"-rule.
So, how do we declare a function? A function declaration follows the structure:
return_type function_name ( //parameters ) {
// declarations
// statements
}
From this we can already see that we have to declare the following:
- a return type rule, which is something that we somewhat have ("type"-rule)
- the name of a function is a identifier (ID)
- a parameters rule that let's us define more than one parameters separated with a comma (',')
- declarations and statements, which we already have implemented!
Let's start by declaring a "return type" rule. We already have a "type"-rule, which contains all the basic datatypes, which are also the ones that we will allow to be used. But, we don't want to allow simple datatypes only, but also pointers to these datatypes. When we declared pointers we already implemented a rule called "pointer" that let's us add one or more '*". Not all the return types have to be pointers, which means that we will also include a sub-rule that is just a type. The "pointer"-rule will that way become optional. Using these two rules we can easily define a "return_type"-rule as follows:
return_type: pointer type | type;
The "parameters"-rule will be similar to how we declare variables (separated with ','). We will use lot's of the already declared rules used in the "declarations"-rule. First of all we want to allow more than one parameters, something that directly shows us that we have to write a left-recursive function for that. Also note that we have to keep in mind that not all functions have parameters. Either way, the "parameters"-rule is:
parameters: parameters COMMA parameter | parameter | /* empty */ ;
Empty parameters will be controlled by the main "function"-rule.
Each parameter is just a datatype followed by a variable, which can be a simple identifier (ID), a pointer or an array. The datatypes are already described by the "type"-rule. The second part is exactly the "variable"-rule that we used for declarations. That way the "parameter"-rule is defined as:
parameter : type variable;
From all that the "function"-rule now ends up consisting of two parts:
function: function_head function_tail | /* empty */ ;
function_head: return_type ID LPAREN param_empty RPAREN;
function_tail: LBRACE declarations statements RBRACE;
The "param_empty"-rule allows us to have empty parameters or not, being:
param_empty: parameters | /* empty */;
Function calling
Looking at the previously declared "function_head"-rule we can tweak that rule to easily call functions as an "statement" and as an "expression" (cause functions can also return values). That way we define a "function_call"-rule that just consists of the name (ID) and parameters in parantheses as:
function_call: ID LPAREN call_params RPAREN;
The "call_params" rule will now be implementing parameters without a type, which is very easy to do. We should also think about the parameter of "STRING"-type that will be used by the compiler-function "print()". That way we end up with:
call_params: call_param | STRING | /* empty */
call_param : call_param COMMA variable | variable ;
The new "statement"-rule that way now is:
statement:
if_statement | for_statement | while_statement | assigment |
CONTINUE SEMI | BREAK SEMI | RETURN SEMI | function_call SEMI
;
Which will be managing functions that don't return a value (void return type). This is something that we will check in the next article :)
The "function_call"-rule will be "inserted" as a new subrule for the "expression"-rule, to allow the use of the return value of a function as a somewhat "variable" of an mathematical expression. Therefore, the "expression"- rule now is:
expression:
expression ADDOP expression |
expression MULOP expression |
expression DIVOP expression |
expression INCR |
INCR expression |
expression OROP expression |
expression ANDOP expression |
NOTOP expression |
expression EQUOP expression |
expression RELOP expression |
LPAREN expression RPAREN |
variable |
sign constant |
function_call
;
Allow Initialization of Variables
Lastly, let's also allow the initialization of variables. An easy way of doing it is by inserting a new optional rule after each variable in the "names"-rule that is used in the "declaration"-rule. This optional rule will firstly allow the initialization using a constant that are already declared by the "constant"-rule and also an array, which is something that we will implement as another rule. Let's define this rule with the name "init". That way the "names"-rule becomes:
names: variable init | names COMMA variable init;
The "init"-rule can be declared easily using the following two rules:
init: ASSIGN init_value | /* empty */ ;
init_value: constant | array_init;
where "array_init" is a rule for array initialization that we will cover in a bit.
Arrays in C can be initialized (at the moment of declaration) following the structure: { value, value, ... }. This is exactly what we will use. The "array_init"-rule has to consist of a "values"-rule that let's us have many constant values separated by ','. With all that in mind we end up with:
array_init: LBRACE values RBRACE;
values: values COMMA constant | constant;
Fixing some more stuff
For statement
The "for_statement"-rule was declared wrong! The first part of the parenthesis should be an "assignment" and not an "expression", meaning that we also have to remove the SEMI, cause a SEMI is already included in that rule. That way the rule becomes:
for_statement: FOR LPAREN assigment expression SEMI expression RPAREN tail ;
Weird Tail rule
The "tail"-rule was a little bit weird. I changed it into this now:
tail: LBRACE statements RBRACE ;
Using arrays in expressions
When declaring variables we might not want to allow an ID to be used in the brackets, but when using them inside of loops the actual Index can be a integer constant (ICONST), identifier (ID) or even a whole expression. That way we have to "extend" the "array"-rule to allow such things. This can be done easily by changing "ICONST" into "expression". The new rule is:
array: array LBRACK expression RBRACK | LBRACK expression RBRACK
Note that we now have to check much more things in the Semantic analysis later on. To prevent that we could separate the rule by having this new rule only for the variables used in expression. We will see how it works out later on and might tweak it if it's needed.
Know when functions are starting
To know when the actual functions are starting we should place a "RETURN SEMI" after the statements and before the functions rule of the "program"-rule. This looks as following:
program: declarations statements RETURN SEMI functions;
The previously added rule should therefore be completely removed from the "statements"-rule, turning it into:
statement:
if_statement | for_statement | while_statement | assigment |
CONTINUE SEMI | BREAK SEMI | function_call SEMI
;
Next time we will continue with this procedure to tweak even more things. Some of them where already mentioned during this article, some others will be completely new. Note that the compiler currently gives us a syntax error even for the example "example.c", which doesn't use any of the rules that we "included" today. Don't worry everything will get fixed next time!
Previous parts of the series
- Introduction -> What is a compiler, what you have to know and what you will learn
- A simple C Language -> Simplified C, comparison with C, tokens, basic structure
- Lexical Analysis using Flex -> Theory, Regular Expressions, Flex, Lexer
- Symbol Table (basic structure) ->Why Symbol Tables, Basic Implementation
- Using Symbol Table in the Lexer -> Flex and Symbol Table combination
- Syntax Analysis Theory -> Syntax Analysis, grammar types and parsing
- Bison basics -> Bison tutorial actually
- Creating a grammar for our Language -> Grammar and first Parser
- Combine Flex and Bison -> lexer and parser combined
- Passing information from Lexer to Parser -> Bug Fix, yylval variable, YYSTYPE union, Setting up the Parser, Passing information "directly" and through the symbol table (special case) with examples.
Final words | Next up on the project
And this is actually it for today's post and I hope that you learned something from this article!
Next up on the series are:
- Continuing on with "Finishing Off The Grammer/Parser (part2)"
- Semantic analysis (predicates, more about the symbol table)
- Intermediate Code generation (Abstract Syntax Tree)
- Machine Code generation (MIPS Assembly)
I would like to note that other articles might come in-between and some might even need more than one article to complete (this exact thing occurs today). Also, after all these things are done we could also get into Optimizations or could even extend the Language by adding complex datatypes, even more rules etc.
See ya next time!
GitHub Account:
Keep on drifting!
Hi @drifter1!
Your post was upvoted by @steem-ua, new Steem dApp, using UserAuthority for algorithmic post curation!
Your UA account score is currently 3.505 which ranks you at #6140 across all Steem accounts.
Your rank has improved 5 places in the last three days (old rank 6145).
In our last Algorithmic Curation Round, consisting of 407 contributions, your post is ranked at #66.
Evaluation of your UA score:
Feel free to join our @steem-ua Discord server
Nice
Hey, @drifter1!
Thanks for contributing on Utopian.
We’re already looking forward to your next contribution!
Get higher incentives and support Utopian.io!
Simply set @utopian.pay as a 5% (or higher) payout beneficiary on your contribution post (via SteemPlus or Steeditor).
Want to chat? Join us on Discord https://discord.gg/h52nFrV.
Vote for Utopian Witness!