Writing a simple Compiler on my own - Action Rules for Expressions [C][Flex][Bison]

in #utopian-io6 years ago (edited)

[Custom Thumbnail]
All the Code of the series can be found at the Github repository: https://github.com/drifter1/compiler

Introduction

    Hello it's a me again @drifter1! Today we continue with my Compiler Series, a series where we implement a complete compiler for a simple C-like language by using the C-tools Flex and Bison. In this article we will write the needed action rules for Expressions, excluding function call for now!

More specifically, the topics that we will cover today are:

  1. Separating the operator tokens of the lexer
  2. Action Rules for Expressions
  3. Running the compiler

Requirements:

    Actually you need to read and understand all the topics that I covered in the series as a whole, as these articles will give you access to knowledge about:
  • What Compiler Design is (mainly the steps)
  • For which exact Language the Compiler is build for (Tokens and Grammar)
  • How to use Flex and Bison
  • How to implement a lexer and parser for the language using those tools
  • What the Symbol Table is and how we implement it
  • How we combine Flex and Bison together
  • How we can pass information from the Lexer to the Parser
  • How we define operator priorities, precedencies and associativity
  • What Semantic Analysis is (Attributes, SDT etc.)
  • How we do the so called "Scope Resolution"
  • How we declare types and check the type inter-compatibility for different cases, including function parameters
  • How we check function calls later on using a "Revisit queue", as function declarations mostly happen after functions get used (in our Language)
  • Intermediate Code Representations, including AST's
  • How we implement an AST (structure and management)
  • Action Rules for Declarations and Initializations

Difficulty:

Talking about the series in general this series can be rated:
  • Intermediate to Advanced
Today's topic(s) can be rated:
  • Medium
So, without further ado, let's now finally start with the actual Tutorial...

Actual Tutorial Content

Separating the operator tokens of the lexer

    Having different operators that return the same token to the parser, we of course can't know which exact operator we have in each case. So, before getting into the actual expression rules, we first have to find a way to overcome this problem! To fix it we can set a specific value for the Value union val.ival, which depends on each operator. Value val is the token type that we chose to have for most of the tokens.

More specifically, the tokens that represent more than one operators are:
  • ADDOP, which is used for '+' and '-'
  • INCR, which is used for "++" and "--"
  • EQUOP, which is used for all the equality operators
  • RELOP, which is used for all the relational operators

    In the AST structure we defined some enums for all these operator types to be able to separate them in each expression node. And so it's wise to use these already defined values for the separation of the operators in the lexer. This means that we have to set the val.ival value and then return the token type to the parser.
So, the lexer.l file now contains:
...
"+" { yylval.val.ival = ADD; return ADDOP; } "-" { yylval.val.ival = SUB; return ADDOP; } "*" { return MULOP; } "/" { return DIVOP; } "++" { yylval.val.ival = INC; return INCR; } "--" { yylval.val.ival = DEC; return INCR; } "||" { return OROP; } "&&" { return ANDOP; } "!" { return NOTOP; } "==" { yylval.val.ival = EQUAL; return EQUOP; } "!=" { yylval.val.ival = NOT_EQUAL; return EQUOP; } ">" { yylval.val.ival = GREATER; return RELOP; } "<" { yylval.val.ival = LESS; return RELOP; } ">=" { yylval.val.ival = GREATER_EQUAL; return RELOP; } "<=" { yylval.val.ival = LESS_EQUAL; return RELOP; }
...

Pretty simple right? Let's now get into the expressions!

Action Rules for Expressions

    The grammar of our language allows many types of expressions. In general, we can say that we have the following expression cases:


So, an expression can be:
  • var_ref, which is a variable with or without reference
  • sign constant, which is a constant with or without sign
  • function_call, that we will cover in a later on article
  • An expression in parenthesis "( )"
  • Increment expression
  • Arithmetic expression node
  • Boolean expression node
  • Equality expression node
  • Relational expression node
Let's take a look at each of them, grouping similar ones together!

Variable with or without reference

A var_ref node can be visualized as following:

    Let's expain this a little bit better. The "var_ref" rule has two sub-rules: "variable" and "REFER variable", where REFER is the symbol '&', mainly used for address references. As we saw last time the variable rule is used for simple variables (ID), pointers and arrays, and so for any kind of variable that can occur in our language. The variable rule gives back the variable information in form of a symbol table entry. So, this information is accessible in both sub-rules. The rule with the '&' symbol will have to "add" some new information, which tells us that we are talking about a reference of a variable. There is no entry for that in the symbol table definition, but we should also not add this information there! Any change in the symbol table entry happens globally, which is not something that we want to happen! You might have noticed that we don't have a AST Node type that stores symbol table entries yet. An expression is an AST node and so passing the information of this variable reference higher, can't be done right now!
    So, where does all this all lead us into? Well, we need a new AST node type for variable references. Let's call this new Node_Type "REF_NODE" and the corresponding structure "AST_Node_Ref". This structure will store a symbol table entry and an integer value "ref", where '0' means that we have no reference, whilst '1' means that we have a reference. Creating this new structure we of course also have to create a function that creates such a node. We should don't forget to make changes in the node printing function "ast_print_node", so that it includes this new type in it's implementation.
The ast.h file will now contain:
typedef enum Node_Type {
	BASIC_NODE,  // no special usage (for roots only)
	// declarations
	DECL_NODE,   // declaration
	CONST_NODE,  // constant
	// statements
	IF_NODE,     // if statement
	ELSIF_NODE,  // else if branch
	FOR_NODE,    // for statement
	WHILE_NODE,  // while statement
	ASSIGN_NODE, // assigment
	SIMPLE_NODE, // continue, break and "main" return statements
	INCR_NODE,   // increment statement (non-expression one)
	FUNC_CALL,   // function call
	// expressions
	ARITHM_NODE, // arithmetic expression
	BOOL_NODE,   // boolean expression
	REL_NODE,    // relational expression
	EQU_NODE,    // equality expression
	REF_NODE,	 // identifier in expression
	// functions
	FUNC_DECL,   // function declaration
	RETURN_NODE, // return statement of functions
}Node_Type;
...
typedef struct AST_Node_Ref{ enum Node_Type type; // node type
// symbol table entry list_t *entry;
// reference or not1 int ref; // 0: not reference, 1: reference }AST_Node_Ref;
...
AST_Node *new_ast_ref_node(list_t *entry, int ref);
...

In the ast.c the node creation is implemented such as:
AST_Node *new_ast_ref_node(list_t *entry, int ref){
	// allocate memory
	AST_Node_Ref *v = malloc (sizeof (AST_Node_Ref));
// set entries v->type = REF_NODE; v->entry = entry; v->ref = ref;
// return type-casted result return (struct AST_Node *) v; }

    The print node function has no big changes and so I will leave it to you to go to GitHub to check it out :P

    So, by having this new node type we can now set the types of the non-terminals that occur in expressions in general and the case that we are covering right now. The "var_ref" rule of course has to be of type "node" being a AST_Node. Of course an expression is also a node and so we have the following non-terminal definition:

%type <node> expression var_ref

    In the expression rule we only have to set the expression node equal to the node that was created by the var_ref rule. So, the actual work will be done in the actual var_ref rule. Depending on the case, var_ref will create an AST_Node of the type "Ref_Node" with or without reference, which is denoted by the value of the second parameter "ref".

These action rules look as following:

expression:
... other sub-rules ...
| var_ref { $$ = $1; /* just pass information */ }
... other sub-rules ...
;
...
var_ref: variable { $$ = new_ast_ref_node($1, 0); /* no reference */ } | REFER variable { $$ = new_ast_ref_node($2, 1); /* reference */ } ;
...

Constant with or without sign

This is a somewhat simple case that can be visualized as following:


    The "sign constant" rule is build up of a constant, that already stores a constant in form of an constant AST node and the optional sign rule. The sign rule either is an ADDOP (where we will have to check if it's an '-') or nothing/empty. Storing an integer value in form of the Value.ival entry we can pass this information over to the "sign constant" rule, through the sign rule. Let's say that '0' means no sign, whilst '1' means a sign which of course is also correct. When having a sign we of course have to inverse the value of the constant depending on the constant type, which is pretty easy to do using a switch case statement!

The token definition of the sign rule is:
%type <val> sign

And so the action code for these rules is:

expression:
... other sub-rules ...
| sign constant { /* sign */ if($1.ival == 1){ AST_Node_Const *temp = (AST_Node_Const*) $2;
/* inverse value depending on the constant type */ switch(temp->const_type){ case INT_TYPE: temp->val.ival *= -1; break; case REAL_TYPE: temp->val.fval *= -1; break; case CHAR_TYPE: /* sign before char error */ fprintf(stderr, "Error having sign before character constant!\n"); exit(1); break; } $$ = (AST_Node*) temp; } /* no sign */ else{ $$ = $2; } ast_traversal($$); /* just for testing */ }
... other sub-rules ...
; sign: ADDOP { /* plus sign error */ if($1.ival == ADD){ fprintf(stderr, "Error having plus as a sign!\n"); exit(1); } else{ $$.ival = 1; /* sign */ } } | /* empty */ { $$.ival = 0; /* no sign */ } ;
...

Simple cases

    The parenthesis and increment rules are pretty easy to do. The parenthesis is the easiest of them all, as the "parent" expression just has to get the value of the expression inside of the parenthesis. In the same way, storing the increment or decrement type inside of the token INCR and knowing the position (prefix or postfix) of the operator in the expression, we can easily create a Incr-Node! So, let's do this!
expression:
... other sub-rules ...
| ID INCR { /* increment */ if($2.ival == INC){ $$ = new_ast_incr_node($1, 0, 0); } else{ $$ = new_ast_incr_node($1, 1, 0); } ast_traversal($$); /* just for testing */ } | INCR ID { /* increment */ if($1.ival == INC){ $$ = new_ast_incr_node($2, 0, 1); } else{ $$ = new_ast_incr_node($2, 1, 1); } ast_traversal($$); /* just for testing */ }
... other sub-rules ...
| LPAREN expression RPAREN { $$ = $2; /* just pass information */ }
... other sub-rules ...
;

Operator cases

The rest of the operators can be summarized in the following visualization:


    All the operators are very simple. They have a left and right side and a operator depending on their type which can be: arithmetic, boolean, equality and relational. The information about the exact type of operator is either directly known from the token (token of one operator) or is stored in the token inside of the Value.ival entry. All except the NOTOP operator have two childs (left an right). For the NOTOP operator we add the single operand as the left child, whilst the other is NULL.

In code the action rules are:
expression:
    expression ADDOP expression
	{ 
	    $$ = new_ast_arithm_node($2.ival, $1, $3);
		ast_traversal($$); /* just for testing */
	}
	| expression MULOP expression
	{
	    $$ = new_ast_arithm_node(MUL, $1, $3);
		ast_traversal($$); /* just for testing */
	}
	| expression DIVOP expression
	{
		$$ = new_ast_arithm_node(DIV, $1, $3);
		ast_traversal($$); /* just for testing */
	}
... other sub-rules ...
| expression OROP expression { $$ = new_ast_bool_node(OR, $1, $3); ast_traversal($$); /* just for testing */ } | expression ANDOP expression { $$ = new_ast_bool_node(AND, $1, $3); ast_traversal($$); /* just for testing */ } | NOTOP expression { $$ = new_ast_bool_node(NOT, $2, NULL); ast_traversal($$); /* just for testing */ } | expression EQUOP expression { $$ = new_ast_equ_node($2.ival, $1, $3); ast_traversal($$); /* just for testing */ } | expression RELOP expression { $$ = new_ast_rel_node($2.ival, $1, $3); ast_traversal($$); /* just for testing */ }
... other sub-rules ...
;

Running the compiler

    Running the compiler for the "full_example.c" file, the compiler prints out the following info-messages:




    So, our compiler now detects the types of expressions that occur during the syntax analysis. We use the traversal function on each node, which explains why some nodes get print out more than one times! Constants, reference nodes etc. are childs of other nodes and so get print out on their own, but also during the traversal of their parent nodes...and the parent nodes of their parent nodes.

Note that we will also have to do type checking someday in the future!

RESOURCES

References:

No references, just using code that I implemented in my previous articles.

Images:

All of the images are custom-made!

Previous parts of the series


Final words | Next up on the project

     And this is actually it for today's post! I hope that I explained everything as much as I needed to, meaning that you learned something out of it.
Next up on this series are:
  • Semantic analysis (using more action rules in Bison)
  • Intermediate Code generation (using the AST structure for more cases)
  • Machine Code generation (MIPS Assembly)
     Which are all topics that will need more than one article to complete. Also, note that we might also get into Optimizations later on, or could even extend the Language by adding complex datatypes (structs and unions), more rules etc.
So, see ya next time!

GitHub Account:

https://github.com/drifter1

Keep on drifting! ;)
Sort:  

Thank you for your contribution @drifter1.

Again an excellent tutorial with the theory explained very well. Thanks for your good work on developing the tutorials.

We look forward to seeing upcoming tutorials.

Your contribution has been evaluated according to Utopian policies and guidelines, as well as a predefined set of questions pertaining to the category.

To view those questions and the relevant answers related to your post, click here.


Need help? Write a ticket on https://support.utopian.io/.
Chat with us on Discord.
[utopian-moderator]

Thank you for your review, @portugalcoin! Keep up the good work!

Hey, @drifter1!

Thanks for contributing on Utopian.
We’re already looking forward to your next contribution!

Get higher incentives and support Utopian.io!
Simply set @utopian.pay as a 5% (or higher) payout beneficiary on your contribution post (via SteemPlus or Steeditor).

Want to chat? Join us on Discord https://discord.gg/h52nFrV.

Vote for Utopian Witness!

Hi @drifter1!

Your post was upvoted by @steem-ua, new Steem dApp, using UserAuthority for algorithmic post curation!
Your UA account score is currently 3.497 which ranks you at #6394 across all Steem accounts.
Your rank has dropped 3 places in the last three days (old rank 6391).

In our last Algorithmic Curation Round, consisting of 213 contributions, your post is ranked at #183.

Evaluation of your UA score:
  • You're on the right track, try to gather more followers.
  • The readers appreciate your great work!
  • Try to work on user engagement: the more people that interact with you via the comments, the higher your UA score!

Feel free to join our @steem-ua Discord server

Coin Marketplace

STEEM 0.17
TRX 0.15
JST 0.028
BTC 58320.47
ETH 2367.43
USDT 1.00
SBD 2.45