Writing a simple Compiler on my own - Action Rules for Declarations and Initializations [C][Flex][Bison]

in utopian-io •  last year  (edited)

[Custom Thumbnail]
All the Code of the series can be found at the Github repository: https://github.com/drifter1/compiler

Introduction

    Hello it's a me again @drifter1! Today we continue with my Compiler Series, a series where we implement a complete compiler for a simple C-like language by using the C-tools Flex and Bison. In this article we will write the needed Action Rules for Declarations, including initialization cases, by creating the needed AST Nodes and passing information between the non-terminal rules and through temporary global variables.

The topics that we will cover today are:

  1. Replacing many values with a Value union
  2. Other smaller changes
  3. Action rules for declarations
  4. Action rules for initializations
  5. Running the Compiler for "full_example.c"

Requirements:

    Actually you need to read and understand all the topics that I covered in the series as a whole, as these articles will give you access to knowledge about:
  • What Compiler Design is (mainly the steps)
  • For which exact Language the Compiler is build for (Tokens and Grammar)
  • How to use Flex and Bison
  • How to implement a lexer and parser for the language using those tools
  • What the Symbol Table is and how we implement it
  • How we combine Flex and Bison together
  • How we can pass information from the Lexer to the Parser
  • How we define operator priorities, precedencies and associativity
  • What Semantic Analysis is (Attributes, SDT etc.)
  • How we do the so called "Scope Resolution"
  • How we declare types and check the type inter-compatibility for different cases, including function parameters
  • How we check function calls later on using a "Revisit queue", as function declarations mostly happen after functions get used (in our Language)
  • Intermediate Code Representations, including AST's
  • How we implement an AST (structure and management)

Difficulty:

Talking about the series in general this series can be rated:
  • Intermediate to Advanced
Today's topic(s) can be rated:
  • Medium
So, without further ado, let's now finally start with the actual Tutorial...

Actual Tutorial Content

Replacing many values with a Value union

    In the constant nodes of the AST, the value of the constant is being stored in a custom union structure called "Value". This makes things much easier as we don't have to check which value of the 4 types: "int, char, real, string", we have to set. Having separate variables of course also costs in memory, making this "new" way more memory-efficient.

Symbol Table Header (symtab.h)

    Knowing that the AST structures uses symbol table entries, we should move this union to the "symtab.h" file. It's not so important, but is a nice touch! Taking a look at the different structures of this file, we can see that we define a different variable for each of the types: "int, char, real". All these variables can be replaced by the "Value" union.
So:
  • The "Param" structure will now contain a single "val" variable of type "Value"
  • The "list_t" symbol table entry structure has to contain a "val" variable for the value of the entry and a dynamic "vals" array to store the array elements.
In code these structures now looks as following:
typedef struct Param{
	// parameter type and name
	int par_type;
	char param_name[MAXTOKENLEN];
// to store the value Value val; int passing; // value or reference }Param;
...
typedef struct list_t{ // name, size of name, scope and occurrences (lines) char st_name[MAXTOKENLEN]; int st_size; int scope; RefList *lines;
// to store value Value val;
// type int st_type;
// for arrays (info type), for pointers (pointing type) // and for functions (return type) int inf_type;
// array stuff Value *vals; int array_size;
// function parameters Param *parameters; int num_of_pars;
// pointer to next item in the list struct list_t *next; }list_t;

Syntax Analyzer (parser.y)

    Until now, our YYSTYPE union contained each of the different types of variables in different entries. Let's change that by replacing all of them with one single "Value" union entry, that does all at once! So, now the new YYSTYPE union structure looks as following:
/* YYSTYPE union */
%union{
	// different types of values
	Value val;   
// structures list_t* symtab_item; AST_Node* node;
// Other entries }

Lexical Analyzer (lexer.l)

    In the same way, changing the YYSTYPE union of the parser, to contain a single union of all the different types, we of course now have to tweak our return code of the constant tokens! The new action rules of the lexer for the constants become:
{ICONST} 	    { yylval.val.ival = atoi(yytext); return ICONST; }
{FCONST} 		 { yylval.val.fval = atof(yytext); return FCONST; }
{CCONST} 		 { yylval.val.cval = yytext[0];    return CCONST; }
{STRING} 		 { 
				   yylval.val.sval = malloc(yyleng * sizeof(char));
				   strcpy(yylval.val.sval, yytext);  return STRING;
				 }
    Pretty simple right? We are just setting the corresponding entry of the Value union, instead of setting the yylval."something" entry of the YYSTYPE union, that we did before.

Other smaller changes

    Thinking about the language, the language of course can contain variables and functions of type "void", which is something that we don't included yet! Including this new type and seeing that the symbol table dump file was a "mess" when talking about alignment, I will also tweak the code of the "symtab_dump.out" file generation.

    Including the new type of void is pretty straight forward! Token types are defined inside of the Semantics header "semantics.h". So, what we have to do is add a new "define" statement for the void type. Keeping the order and "pattern" of the other types, we have to define a token type with the name "VOID_TYPE", which has the integer value '7'. In code the new token type definitions look as following:
#define UNDEF 0
#define INT_TYPE 1
#define REAL_TYPE 2
#define CHAR_TYPE 3
#define ARRAY_TYPE 4
#define POINTER_TYPE 5
#define FUNCTION_TYPE 6
#define VOID_TYPE 7

    Tweaking the code of the "symtab_dump()" function actually happened after I finished off all of the action rules, that we will get into in a bit. As seeing the changes after an actual file generation makes it simpler to make changes. In general the changes that I made are:
  • First of all, including the new void type as a type of variable, pointer and function.
  • Changing the "spacing" of the titles Name, Type and Scope to fit larger types
  • Changing the "spacing" of the different keywords and text that get's print out for the Type
The new code of the function is:

void symtab_dump(FILE * of){
  int i;
  fprintf(of,"------------ -------------- ------ ------------\n");
  fprintf(of,"Name         Type           Scope  Line Numbers\n");
  fprintf(of,"------------ -------------- ------ ------------\n");
  for (i=0; i < SIZE; ++i){ 
    if (hash_table[i] != NULL){ 
      list_t *l = hash_table[i];
      while (l != NULL){ 
        RefList *t = l->lines;
        fprintf(of,"%-13s",l->st_name);
        if (l->st_type == INT_TYPE)        fprintf(of,"%-15s","int");
        else if (l->st_type == REAL_TYPE)  fprintf(of,"%-15s","real");
        else if (l->st_type == CHAR_TYPE)  fprintf(of,"%-15s","char");
        else if (l->st_type == VOID_TYPE)  fprintf(of,"%-15s","void");
        else if (l->st_type == ARRAY_TYPE){
          fprintf(of,"array of ");
          if (l->inf_type == INT_TYPE)         fprintf(of,"%-6s","int");
          else if (l->inf_type  == REAL_TYPE)  fprintf(of,"%-6s","real");
          else if (l->inf_type  == CHAR_TYPE)  fprintf(of,"%-6s","char");
          else fprintf(of,"%-13s","undef");
        }
        else if (l->st_type == POINTER_TYPE){
          fprintf(of,"pointer to ");
          if (l->inf_type == INT_TYPE)         fprintf(of,"%-4s","int");
          else if (l->inf_type  == REAL_TYPE)  fprintf(of,"%-4s","real");
          else if (l->inf_type  == CHAR_TYPE)  fprintf(of,"%-4s","char");
          else if (l->inf_type  == VOID_TYPE)  fprintf(of,"%-4s","void");
          else fprintf(of,"%-4s","undef");
        }
        else if (l->st_type == FUNCTION_TYPE){
          fprintf(of,"func ret ");
          if (l->inf_type == INT_TYPE)         fprintf(of,"%-6s","int");
          else if (l->inf_type  == REAL_TYPE)  fprintf(of,"%-6s","real");
          else if (l->inf_type  == CHAR_TYPE)  fprintf(of,"%-6s","char");
          else if (l->inf_type  == VOID_TYPE)  fprintf(of,"%-6s","void");
          else fprintf(of,"%-4s","undef");
        }
        else fprintf(of,"%-15s","undef"); // if UNDEF or 0
        fprintf(of,"  %d  ",l->scope);
        while (t != NULL){
          fprintf(of,"%4d ",t->lineno);
          t = t->next;
        }
        fprintf(of,"\n");
        l = l->next;
      }
    }
  }
}

    After finishing off the Action Rules for Declarations and setting the types of the entries successfully, our "symtab_dump.out" file gives us:

Just a little sneak-peak of what's to follow now!

Action Rules for Declarations

The declaration rule can be visualized as following:


    This graph shows us what we are expecting and from which rule we have to get it! The simplest is of course the data type which is a plain integer that get's set depending on the datatype token that we have in the declaration. This can of course be char, int, float, double or void, which correspond to the token types INT_TYPE, CHAR_TYPE, REAL_TYPE, REAL_TYPE and VOID_TYPE. But, how exactly do we store information in non-terminals and then pass it to others?

Non-terminal definitions

    In the same way that we defined tokens and their type inside of "<>", we can also define the type of the non-terminals. Thinking about declarations, by not taking care of initializations yet, we understand that:
  • The program, declarations and declaration rules of course have to be of type "node"
  • The type rule has to store some integer value that corresponds to a datatype
  • The variable rule has to store an symbol table entry (symtab_item)
  • The array rule has to store the array size
These rule declarations look as following:
%type <node> program
%type <node> declarations declaration
%type <data_type> type
%type <symtab_item> variable
%type <array_size> array

The datatype

    The datatype is pretty simple! We just have to set the value of the "type" non-terminal to the corresponding value that each token type has. Accessing the left-part and so "type" works using "$$". The actions are written inside of "{ }" after each of the different cases of the "type" rule. Thinking about all that the new "type" rule that includes actions looks as following:
type: INT  	   { $$ = INT_TYPE;   }
	| CHAR 		{ $$ = CHAR_TYPE;  }
	| FLOAT 	{ $$ = REAL_TYPE;  }
	| DOUBLE 	{ $$ = REAL_TYPE;  }
	| VOID 		{ $$ = VOID_TYPE;  }
;

The names rule

    Names are of course far more complicated than that. The "names" rule of course allows two different kinds of sub-rules: "variable" and "init".
Visualizing the "names"rule we can see that:


    And so we understand that both of them have to insert a value into an "names" array, while also keeping the actual count of names (names_count) somewhere. The information of the sub-rules can be given to the parent node "names", but to make it even simpler, we can make those variables global! So, what we have to do is just add the symbol table entry that we get from the "variable" and "init" rule to the global "names". This should be better done using an function let's say "add_to_names()" defined in the parser. Accessing the information of an non-terminal inside of a rule can be done using the '$' followed by the position in the rule, where the non-terminal occurs. So, knowing that both "variable" and "init" contain a symbol table entry, the "names" rule becomes:
names: names COMMA variable
	{
		add_to_names($3);
	}
	| names COMMA init
	{
		add_to_names($3);
	}
	| variable
	{
		add_to_names($1);
	}
	| init
	{ 
		add_to_names($1);
	}
;

    The global variables that we include will be stored in the beginning of the parser file. We of course also have to declare the "add_to_names()" function declaration. So, the parser now begins as following:
%{
	...
// for declarations void add_to_names(list_t *entry); list_t **names; int nc = 0; %}

The function that we are calling is defined as:
void add_to_names(list_t *entry){
	// first entry
	if(nc == 0){
		nc = 1;
		names = (list_t **) malloc( 1 * sizeof(list_t *));
		names[0] = entry;
	}
	// general case
	else{
		nc++;
		names = (list_t **) realloc(names, nc * sizeof(list_t *));
		names[nc - 1] = entry;		
	}
}

    You can see that we are splitting it up, depending on the value of "nc" which stores the names_count! After that we either have to allocate memory for the "names" array or have to just reallocate it to be able to store another entry. In the end we just insert the "new" entry. Managing the "nc" variable will be crucial, but as you will see later on in the "declaration" rule, it will be very easy!

The variable rule

The variable rule can be visualized such as:


    We can see that we can have 3 different types of variables. Of course the "variable" non-terminal will store an symbol table entry. The simple ID is pretty straight-forward, cause we just have to set the "variable" equal to that ID. In pointers we also have to set the st_type , so that we know that we have to do with pointers. Arrays are a little bit more complicated, as we will also have to get the array_size, which also makes things more interesting! Allowing any expression will not be allowed now and so we add a new sub-rule for the "array" rule that has an ICONST (integer constant) inside of the brackets, giving us the array_size. Seeing an expression when declaring (easily checked with "declare == 1") we will throw off an error!
So, in the end the "variable" and "array" rules are:
variable: ID { $$ = $1; }
	| pointer ID
	{
		$2->st_type = POINTER_TYPE;
		$$ = $2;
	}
	| ID array
	{
		$1->st_type = ARRAY_TYPE;
		$1->array_size = $2;
		$$ = $1;
	}
;
pointer: pointer MULOP | MULOP ; /* for now we suppose everything is a simple pointer! */
array: array LBRACK expression RBRACK { // if declaration then error! if(declare == 1){ fprintf(stderr, "Array declaration at %d contains expression!\n", lineno); } } | LBRACK expression RBRACK { // if declaration then error! if(declare == 1){ fprintf(stderr, "Array declaration at %d contains expression!\n", lineno); } } | LBRACK ICONST RBRACK { // set array_size for declaration purposes $$ = $2.ival; } ;

The declaration rule

    Leaving initializations for later and knowing what the code for them will do, we can easily add an action rule for creating the declaration node! After the SEMI, we will create an declaration node using the information stored in the non-terminal "$1", which is "type" and has the data_type of the declaration. We will also pass the "names" and "nc" global variables, by also re-initializing the "nc" variable to '0' afterwards! After that we can create a new temporary "AST_Node_Decl" pointer called "temp" so that we can type-cast the newly created node and then go to the actual important aspect of declaring the types of the names, looping through each of them, depending on their type. Of course pointers and arrays will need a special treatment! In the end and for testing purposes, we will also traverse through the node (print it out actually) using the function "ast_traversal()"!
So, the "declaration" rule and it's action code looks as following now:
declaration: type { declare = 1; } names { declare = 0; } SEMI
    {
        int i;
        $$ = new_ast_decl_node($1, names, nc);
        nc = 0;
AST_Node_Decl *temp = (AST_Node_Decl*) $$;
// declare types of the names for(i=0; i < temp->names_count; i++){ // variable if(temp->names[i]->st_type == UNDEF){ set_type(temp->names[i]->st_name, temp->data_type, UNDEF); } // pointer else if(temp->names[i]->st_type == POINTER_TYPE){ set_type(temp->names[i]->st_name, POINTER_TYPE, temp->data_type); } // array else if(temp->names[i]->st_type == ARRAY_TYPE){ set_type(temp->names[i]->st_name, ARRAY_TYPE, temp->data_type); } } ast_traversal($$); /* just for testing */ } ;

Action rules for Initializations

    What is left now is taking care of initializations. Visualizing them we can see that:


So, there are two cases:
  1. A variable init, where we will get the value of the constant (stored in a corresponding constant ast node) and then set the ST entry's value of ID equal to that value.
  2. An array init, where we will set the array_size using the "array" rule, but will also have to add the values of the different array entries.
Let's first define the new non-terminals:
%type <symtab_item> init var_init array_init
%type <node> constant

Doing we of course change the YYSTYPE union again to:
%union{
	// different types of values
	Value val;
// structures list_t* symtab_item; AST_Node* node;
// for declarations int data_type; int const_type;
// for arrays int array_size; }

    Both cases of initialization will return a symtab_entry and so we can already write the "init" rule as:
init:
	var_init { $$ = $1; }
	| array_init { $$ = $1; }
; 

Let's now get into each case separately...

ID initialization

    The constant rule is pretty simple to make. We just create a new node depending on the type of constant and set the "constant" non-terminal equal to that node! So, the new rule is:
constant:
	ICONST   { $$ = new_ast_const_node(INT_TYPE, $1);  }
	| FCONST { $$ = new_ast_const_node(REAL_TYPE, $1); }
	| CCONST { $$ = new_ast_const_node(CHAR_TYPE, $1); }
;

    By creating this rule, we can now easily write the action code of "var_init", which of course get's the value and datatype of the constant, type-casting it correctly, returning the "changed" ID to the main rule "var_init". So, the new rule becomes:
var_init : ID ASSIGN constant
{ 
	AST_Node_Const *temp = (AST_Node_Const*) $$;
	$1->val = temp->val;
	$1->st_type = temp->const_type;
	$$ = $1;
}
;

Array initialization

    Initializing an array, we of course also have to get the values of the array. To get the values we can use a similar principle to the names! Just make two new global variables and a function that adds values to this new "vals" array. The global variables and function are:
%{
	...
	// for the initializations of arrays
	void add_to_vals(Value val);
	Value *vals;
	int vc = 0;
%}
...
void add_to_vals(Value val){ // first entry if(vc == 0){ vc = 1; vals = (Value *) malloc(1 * sizeof(Value)); vals[0] = val; } // general case else{ vc++; vals = (Value *) realloc(vals, vc * sizeof(Value)); vals[vc - 1] = val; } }

    Using this function, the "values" rule is pretty simple. We just create a temporary node, type-casting the constant node so that we can access the value, and in the end call the function "add_to_vals()" with this value as parameter! So, the new "values" rule is:
values: values COMMA constant 
	{
		AST_Node_Const *temp = (AST_Node_Const*) $3;
		add_to_vals(temp->val);
	}
	| constant
	{
		AST_Node_Const *temp = (AST_Node_Const*) $1;
		add_to_vals(temp->val);
	}
;

    In the end, we of course also have to write the actual "array_init" rule. First we get the needed "array_size" stored in the "array" rule and set the corresponding entry of ID. The "vals" array is global and can be accessed easily and so we also set the ID entry "vals" equal to that. The actual vals count stored in the global variable "vc" has to be re-initialized to '0' of course. Of course we should check if the number of values used for the initialization is equal to the "array_size" variable, else we will return an error! So, in the end the "array_init" rule is:
array_init: ID array ASSIGN LBRACE values RBRACE
{
	if($1->array_size != vc){
	    fprintf(stderr, "Array init at %d doesn't contain the right amount of values!\n", lineno);
	}
	$1->vals = vals;
	$1->array_size = $2;
	$$ = $1;
	vc = 0;
}
;

Running the Compiler for "full_example.c"

    Running the compiler for an example like "full_example.c" our compiler will now set the types of the variables and also their initialization values, when talking about an initialization that happens while declaring. So, what we are expecting to see is messages from the ast_traversal() function that tell us the data-type of declaration and how many names have been declared. Also, the "symtab_dump.out" file, for which I gave a sneak-peak previously in this article, will now longer have "UNDEF" as the type of variable, but will print out the actual type, including pointers and arrays. Function are not taken care yet and we will get into them later on!

    Knowing that we are hiding scopes and so removing elements from the declarations that happen inside of functions, we are expecting to see messages around all of the declarations, but only the variables of the main function inside of the dump file!

The declarations of the main function of "full_example.c" are:
int i;
char c = 'c';
double val = 2.5, res[6];
double *p;

Running the compiler we get:


    You can see that there are 4 messages that have to do with these exact 4 declarations! The number of names is of course right, seeing that the 3rd declaration has 2 names, whilst the others have 1. The data-type values are also right, knowing that we defined int as '1', real as '2' and char as '3', something that we also saw again previously!
The dump file is:


Which is clearly right, seeing that:
  • 'i' is an integer variable
  • 'c' is a char variable
  • 'val' and 'res' are both doubles (noted as real) and res being an array of it
  • 'p' is an pointer of double and so real

RESOURCES

References:

No references, just using code that I implemented in my previous articles.

Images:

All of the images are custom-made!

Previous parts of the series


Final words | Next up on the project

     And this is actually it for today's post! I hope that I explained everything as much as I needed to, meaning that you learned something out of it.
Next up on this series are:
  • Semantic analysis (using more action rules in Bison)
  • Intermediate Code generation (using the AST structure for more cases)
  • Machine Code generation (MIPS Assembly)
     Which are all topics that will need more than one article to complete. Also, note that we might also get into Optimizations later on, or could even extend the Language by adding complex datatypes (structs and unions), more rules etc.
So, see ya next time!

GitHub Account:

https://github.com/drifter1

Keep on drifting! ;)
Authors get paid when people like you upvote their post.
If you enjoyed what you read here, create your account today and start earning FREE STEEM!
Sort Order:  

Another great piece of work @drifter1, and glad to see a first output of running the compiler, great stuff ! :)

Your contribution has been evaluated according to Utopian policies and guidelines, as well as a predefined set of questions pertaining to the category.

To view those questions and the relevant answers related to your post, click here.


Need help? Write a ticket on https://support.utopian.io/.
Chat with us on Discord.
[utopian-moderator]

Thank you for your review, @mcfarhat! Keep up the good work!

Very nice!

Hey, @drifter1!

Thanks for contributing on Utopian.
Congratulations! Your contribution was Staff Picked to receive a maximum vote for the tutorials category on Utopian for being of significant value to the project and the open source community.

We’re already looking forward to your next contribution!

Get higher incentives and support Utopian.io!
Simply set @utopian.pay as a 5% (or higher) payout beneficiary on your contribution post (via SteemPlus or Steeditor).

Want to chat? Join us on Discord https://discord.gg/h52nFrV.

Vote for Utopian Witness!

As a follower of @followforupvotes this post has been randomly selected and upvoted! Enjoy your upvote and have a great day!

Hi @drifter1!

Your post was upvoted by @steem-ua, new Steem dApp, using UserAuthority for algorithmic post curation!
Your post is eligible for our upvote, thanks to our collaboration with @utopian-io!
Feel free to join our @steem-ua Discord server