Writing a simple Compiler on my own - Action Rules for Expressions [C][Flex][Bison]

drifter1 (67)in #utopian-io • 6 years ago (edited)

[Custom Thumbnail]
All the Code of the series can be found at the Github repository: https://github.com/drifter1/compiler

Introduction

Hello it's a me again @drifter1! Today we continue with my Compiler Series, a series where we implement a complete compiler for a simple C-like language by using the C-tools Flex and Bison. In this article we will write the needed action rules for Expressions, excluding function call for now!

More specifically, the topics that we will cover today are:

Separating the operator tokens of the lexer
Action Rules for Expressions
Running the compiler

Requirements:

Actually you need to read and understand all the topics that I covered in the series as a whole, as these articles will give you access to knowledge about:

What Compiler Design is (mainly the steps)
For which exact Language the Compiler is build for (Tokens and Grammar)
How to use Flex and Bison
How to implement a lexer and parser for the language using those tools
What the Symbol Table is and how we implement it
How we combine Flex and Bison together
How we can pass information from the Lexer to the Parser
How we define operator priorities, precedencies and associativity
What Semantic Analysis is (Attributes, SDT etc.)
How we do the so called "Scope Resolution"
How we declare types and check the type inter-compatibility for different cases, including function parameters
How we check function calls later on using a "Revisit queue", as function declarations mostly happen after functions get used (in our Language)
Intermediate Code Representations, including AST's
How we implement an AST (structure and management)
Action Rules for Declarations and Initializations

Difficulty:

Talking about the series in general this series can be rated:

Intermediate to Advanced

Today's topic(s) can be rated:

Medium

So, without further ado, let's now finally start with the actual Tutorial...

Actual Tutorial Content

Separating the operator tokens of the lexer

Having different operators that return the same token to the parser, we of course can't know which exact operator we have in each case. So, before getting into the actual expression rules, we first have to find a way to overcome this problem! To fix it we can set a specific value for the Value union val.ival, which depends on each operator. Value val is the token type that we chose to have for most of the tokens.

More specifically, the tokens that represent more than one operators are:

ADDOP, which is used for '+' and '-'
INCR, which is used for "++" and "--"
EQUOP, which is used for all the equality operators
RELOP, which is used for all the relational operators

In the AST structure we defined some enums for all these operator types to be able to separate them in each expression node. And so it's wise to use these already defined values for the separation of the operators in the lexer. This means that we have to set the val.ival value and then return the token type to the parser.
So, the lexer.l file now contains:

...

"+"				 { yylval.val.ival = ADD; return ADDOP; }
"-"				 { yylval.val.ival = SUB; return ADDOP; }
"*"				 { return MULOP; }
"/"				 { return DIVOP; }
"++"			 { yylval.val.ival = INC; return INCR; }
"--"			 { yylval.val.ival = DEC; return INCR; }
"||"			 { return OROP; }
"&&"			 { return ANDOP; }
"!"				 { return NOTOP; }
"=="		     { yylval.val.ival = EQUAL; return EQUOP; }
"!="			 { yylval.val.ival = NOT_EQUAL; return EQUOP; }
">"				 { yylval.val.ival = GREATER; return RELOP; }
"<"				 { yylval.val.ival = LESS; return RELOP; }
">="			 { yylval.val.ival = GREATER_EQUAL; return RELOP; }
"<="			 { yylval.val.ival = LESS_EQUAL; return RELOP; }

...

Pretty simple right? Let's now get into the expressions!

Action Rules for Expressions

The grammar of our language allows many types of expressions. In general, we can say that we have the following expression cases:

So, an expression can be:

var_ref, which is a variable with or without reference
sign constant, which is a constant with or without sign
function_call, that we will cover in a later on article
An expression in parenthesis "( )"
Increment expression
Arithmetic expression node
Boolean expression node
Equality expression node
Relational expression node

Let's take a look at each of them, grouping similar ones together!

Variable with or without reference

A var_ref node can be visualized as following:

Let's expain this a little bit better. The "var_ref" rule has two sub-rules: "variable" and "REFER variable", where REFER is the symbol '&', mainly used for address references. As we saw last time the variable rule is used for simple variables (ID), pointers and arrays, and so for any kind of variable that can occur in our language. The variable rule gives back the variable information in form of a symbol table entry. So, this information is accessible in both sub-rules. The rule with the '&' symbol will have to "add" some new information, which tells us that we are talking about a reference of a variable. There is no entry for that in the symbol table definition, but we should also not add this information there! Any change in the symbol table entry happens globally, which is not something that we want to happen! You might have noticed that we don't have a AST Node type that stores symbol table entries yet. An expression is an AST node and so passing the information of this variable reference higher, can't be done right now!
So, where does all this all lead us into? Well, we need a new AST node type for variable references. Let's call this new Node_Type "REF_NODE" and the corresponding structure "AST_Node_Ref". This structure will store a symbol table entry and an integer value "ref", where '0' means that we have no reference, whilst '1' means that we have a reference. Creating this new structure we of course also have to create a function that creates such a node. We should don't forget to make changes in the node printing function "ast_print_node", so that it includes this new type in it's implementation.
The ast.h file will now contain:

typedef enum Node_Type {
	BASIC_NODE,  // no special usage (for roots only)
	// declarations
	DECL_NODE,   // declaration
	CONST_NODE,  // constant
	// statements
	IF_NODE,     // if statement
	ELSIF_NODE,  // else if branch
	FOR_NODE,    // for statement
	WHILE_NODE,  // while statement
	ASSIGN_NODE, // assigment
	SIMPLE_NODE, // continue, break and "main" return statements
	INCR_NODE,   // increment statement (non-expression one)
	FUNC_CALL,   // function call
	// expressions
	ARITHM_NODE, // arithmetic expression
	BOOL_NODE,   // boolean expression
	REL_NODE,    // relational expression
	EQU_NODE,    // equality expression
	REF_NODE,	 // identifier in expression
	// functions
	FUNC_DECL,   // function declaration
	RETURN_NODE, // return statement of functions
}Node_Type;

...

typedef struct AST_Node_Ref{
	enum Node_Type type; // node type

	// symbol table entry
	list_t *entry;

	// reference or not1
	int ref; // 0: not reference, 1: reference
}AST_Node_Ref;

...

AST_Node *new_ast_ref_node(list_t *entry, int ref);

...

In the ast.c the node creation is implemented such as:

AST_Node *new_ast_ref_node(list_t *entry, int ref){
	// allocate memory
	AST_Node_Ref *v = malloc (sizeof (AST_Node_Ref));

	// set entries
	v->type = REF_NODE;
	v->entry = entry;
	v->ref = ref;

	// return type-casted result
	return (struct AST_Node *) v;	
}

    The print node function has no big changes and so I will leave it to you to go to GitHub to check it out :P

    So, by having this new node type we can now set the types of the non-terminals that occur in expressions in general and the case that we are covering right now. The "var_ref" rule of course has to be of type "node" being a AST_Node. Of course an expression is also a node and so we have the following non-terminal definition:

%type <node> expression var_ref

    In the expression rule we only have to set the expression node equal to the node that was created by the var_ref rule. So, the actual work will be done in the actual var_ref rule. Depending on the case, var_ref will create an AST_Node of the type "Ref_Node" with or without reference, which is denoted by the value of the second parameter "ref".

These action rules look as following:


expression:

... other sub-rules ...

| var_ref
	{ 
		$$ = $1; /* just pass information */
	}

... other sub-rules ...

;

...

var_ref: variable
	{
		$$ = new_ast_ref_node($1, 0); /* no reference */
	}
	| REFER variable
	{
		$$ = new_ast_ref_node($2, 1); /* reference */
	}
; 

...

Constant with or without sign

This is a somewhat simple case that can be visualized as following:

The "sign constant" rule is build up of a constant, that already stores a constant in form of an constant AST node and the optional sign rule. The sign rule either is an ADDOP (where we will have to check if it's an '-') or nothing/empty. Storing an integer value in form of the Value.ival entry we can pass this information over to the "sign constant" rule, through the sign rule. Let's say that '0' means no sign, whilst '1' means a sign which of course is also correct. When having a sign we of course have to inverse the value of the constant depending on the constant type, which is pretty easy to do using a switch case statement!

The token definition of the sign rule is:
%type <val> sign

And so the action code for these rules is:


expression:

... other sub-rules ...

| sign constant
	{
		/* sign */
		if($1.ival == 1){
			AST_Node_Const *temp = (AST_Node_Const*) $2;

			/* inverse value depending on the constant type */
			switch(temp->const_type){
				case INT_TYPE:
					temp->val.ival *= -1;
					break;
				case REAL_TYPE:
					temp->val.fval *= -1;
					break;
				case CHAR_TYPE:
					/* sign before char error */
					fprintf(stderr,
					    "Error having sign before character constant!\n");
					exit(1);
					break;
			}
			$$ = (AST_Node*) temp;
		}
		/* no sign */
		else{
			$$ = $2;
		}
	    ast_traversal($$); /* just for testing */
	}

... other sub-rules ...

;
sign: ADDOP
	{ 
		/* plus sign error */
		if($1.ival == ADD){
			fprintf(stderr, "Error having plus as a sign!\n");
			exit(1);
		}
		else{
			$$.ival = 1; /* sign */
		}
	}
	| /* empty */
	{ 
		$$.ival = 0; /* no sign */
	} 
;

...

Simple cases

The parenthesis and increment rules are pretty easy to do. The parenthesis is the easiest of them all, as the "parent" expression just has to get the value of the expression inside of the parenthesis. In the same way, storing the increment or decrement type inside of the token INCR and knowing the position (prefix or postfix) of the operator in the expression, we can easily create a Incr-Node! So, let's do this!

expression:

... other sub-rules ...

	| ID INCR
	{
		/* increment */
		if($2.ival == INC){
			$$ = new_ast_incr_node($1, 0, 0);
		}
		else{
			$$ = new_ast_incr_node($1, 1, 0);
		}
		ast_traversal($$); /* just for testing */		
	}
	| INCR ID
	{ 
		/* increment */
		if($1.ival == INC){
			$$ = new_ast_incr_node($2, 0, 1);
		}
		else{
			$$ = new_ast_incr_node($2, 1, 1);
		}
		ast_traversal($$); /* just for testing */
	}

... other sub-rules ...

	| LPAREN expression RPAREN
	{
		$$ = $2; /* just pass information */
	}

... other sub-rules ...

;

Operator cases

The rest of the operators can be summarized in the following visualization:

All the operators are very simple. They have a left and right side and a operator depending on their type which can be: arithmetic, boolean, equality and relational. The information about the exact type of operator is either directly known from the token (token of one operator) or is stored in the token inside of the Value.ival entry. All except the NOTOP operator have two childs (left an right). For the NOTOP operator we add the single operand as the left child, whilst the other is NULL.

In code the action rules are:

expression:
    expression ADDOP expression
	{ 
	    $$ = new_ast_arithm_node($2.ival, $1, $3);
		ast_traversal($$); /* just for testing */
	}
	| expression MULOP expression
	{
	    $$ = new_ast_arithm_node(MUL, $1, $3);
		ast_traversal($$); /* just for testing */
	}
	| expression DIVOP expression
	{
		$$ = new_ast_arithm_node(DIV, $1, $3);
		ast_traversal($$); /* just for testing */
	}

... other sub-rules ...

	| expression OROP expression
	{
		$$ = new_ast_bool_node(OR, $1, $3);
		ast_traversal($$); /* just for testing */
	}
	| expression ANDOP expression
	{
		$$ = new_ast_bool_node(AND, $1, $3);
		ast_traversal($$); /* just for testing */
	}
	| NOTOP expression
	{
	    $$ = new_ast_bool_node(NOT, $2, NULL);
		ast_traversal($$); /* just for testing */
	}
	| expression EQUOP expression
	{
		$$ = new_ast_equ_node($2.ival, $1, $3);
		ast_traversal($$); /* just for testing */
	}
	| expression RELOP expression
	{
		$$ = new_ast_rel_node($2.ival, $1, $3);
		ast_traversal($$); /* just for testing */
	}

... other sub-rules ...

;

Running the compiler

Running the compiler for the "full_example.c" file, the compiler prints out the following info-messages:

So, our compiler now detects the types of expressions that occur during the syntax analysis. We use the traversal function on each node, which explains why some nodes get print out more than one times! Constants, reference nodes etc. are childs of other nodes and so get print out on their own, but also during the traversal of their parent nodes...and the parent nodes of their parent nodes.

Note that we will also have to do type checking someday in the future!

RESOURCES

References:

No references, just using code that I implemented in my previous articles.

Images:

All of the images are custom-made!

Previous parts of the series

Introduction -> What is a compiler, what you have to know and what you will learn
A simple C Language -> Simplified C, comparison with C, tokens, basic structure
Lexical Analysis using Flex -> Theory, Regular Expressions, Flex, Lexer
Symbol Table (basic structure) ->Why Symbol Tables, Basic Implementation
Using Symbol Table in the Lexer -> Flex and Symbol Table combination
Syntax Analysis Theory -> Syntax Analysis, grammar types and parsing
Bison basics -> Bison tutorial actually
Creating a grammar for our Language -> Grammar and first Parser
Combine Flex and Bison -> lexer and parser combined
Passing information from Lexer to Parser -> Bug Fix, yylval variable, YYSTYPE union, Setting up the Parser, Passing information "directly" and through the symbol table (special case) with examples.
Finishing Off The Grammar/Parser (part 1) -> Adding the missing grammar rules, small fixes
Finishing Off The Grammar/Parser (part 2) -> Operator priorities, precedencies and associativity, complete example file (for testing), grammar rule visualization, finish off grammar/parser
Semantic Analysis Theory -> What is Semantic Analysis about, CSG and CSL, Attribute Grammars, Syntax Directed Translations (SDT)
Semantics Examples -> Visualization of Semantics for different rules/cases, needed attributes, what we have to implement from now on
Scope Resolution using the Symbol Table -> What we have now, scope resolution, integration, compiler output
Type Declaration and Checking -> Type declaration and type checking code, "simple" code integration
Function Semantics (part 1) -> Code for parameter definitions, function declarations and parameter compatibility checking
Function Semantics (part 2) -> Revisit queue concept, basic implementation, inserting undeclared identifiers
Abstract Syntax Tree Principle -> Intermediate code generation and representations, how to design an AST
Abstract Syntax Tree Structure -> Node, operator and value types, AST node structures, some management functions, integration with the rest of the compiler
Abstract Syntax Tree Management -> Tweaking Nodes, Node creation and AST traversal functions, "manual" testing of the AST
Action Rules for Declarations and Initializations -> Replacing many values with a Value union, Other smaller changes, Action rules for declarations, Action rules for initializations

Final words | Next up on the project

And this is actually it for today's post! I hope that I explained everything as much as I needed to, meaning that you learned something out of it.
Next up on this series are:

Semantic analysis (using more action rules in Bison)
Intermediate Code generation (using the AST structure for more cases)
Machine Code generation (MIPS Assembly)

Which are all topics that will need more than one article to complete. Also, note that we might also get into Optimizations later on, or could even extend the Language by adding complex datatypes (structs and unions), more rules etc.
So, see ya next time!

GitHub Account:

https://github.com/drifter1

Keep on drifting! ;)

#tutorials #programming #compiler #coding

6 years ago in #utopian-io by drifter1 (67)

$19.98

Sort:

Trending

[-]

portugalcoin (72) 6 years ago

Thank you for your contribution @drifter1.

Again an excellent tutorial with the theory explained very well. Thanks for your good work on developing the tutorials.

We look forward to seeing upcoming tutorials.

Your contribution has been evaluated according to Utopian policies and guidelines, as well as a predefined set of questions pertaining to the category.

To view those questions and the relevant answers related to your post, click here.

Need help? Write a ticket on https://support.utopian.io/.
Chat with us on Discord.
[utopian-moderator]

$7.79

11 votes

[-]

utopian-io (71) 6 years ago

Thank you for your review, @portugalcoin! Keep up the good work!

$0.00

[-]

utopian-io (71) 6 years ago

Hey, @drifter1!

Thanks for contributing on Utopian.
We’re already looking forward to your next contribution!

Get higher incentives and support Utopian.io!
Simply set @utopian.pay as a 5% (or higher) payout beneficiary on your contribution post (via SteemPlus or Steeditor).

Want to chat? Join us on Discord https://discord.gg/h52nFrV.

Vote for Utopian Witness!

$0.00

1 vote

[-]

steem-ua (64) 6 years ago

Hi @drifter1!

Your post was upvoted by @steem-ua, new Steem dApp, using UserAuthority for algorithmic post curation!
Your UA account score is currently 3.497 which ranks you at #6394 across all Steem accounts.
Your rank has dropped 3 places in the last three days (old rank 6391).

In our last Algorithmic Curation Round, consisting of 213 contributions, your post is ranked at #183.

Evaluation of your UA score:

You're on the right track, try to gather more followers.
The readers appreciate your great work!
Try to work on user engagement: the more people that interact with you via the comments, the higher your UA score!

Feel free to join our @steem-ua Discord server

$0.00

1 vote

STEEM 0.17

TRX 0.15

JST 0.028

BTC 58320.47

ETH 2367.43

USDT 1.00

SBD 2.45

Writing a simple Compiler on my own - Action Rules for Expressions [C][Flex][Bison]

Introduction

More specifically, the topics that we will cover today are:

Requirements:

Difficulty:

Actual Tutorial Content

Separating the operator tokens of the lexer

Action Rules for Expressions

Variable with or without reference

Constant with or without sign

Simple cases

Operator cases

Running the compiler

RESOURCES

References:

Images:

Previous parts of the series

Final words | Next up on the project

GitHub Account:

Hi @drifter1!

Evaluation of your UA score:

Coin Marketplace