5.4 Translation of Arithmetic Expressions

To give an example of translation programs in Refal, we consider the translation of an arithmetic expression into an assembly language for a one-address computer. Elementary operands in arithmetic expressions will be variables (represented by identifiers) and whole numbers. The five arithmetic operations are allowed: + , - , * , / , ^ (the last for exponentiation) with the usual priorities; parentheses can also be used.

The one-address computing machine has an accumulator and memory storage. The computer code consists of operations LOAD , STORE , ADD , SUB , MUL , DIV , POWER and MINUS , separated by semicolons. Operation LOAD x , where x is an address in the memory, puts the value stored at x in the accumulator; STORE x puts the contents of the accumulator into the address x in the memory. Binary operations are performed on the accumulator and a memory address, and the result is put in the accumulator, e.g., SUB x subtracts the contents of x from the accumulator and puts the difference back in the accumulator. The MINUS operation changes the sign in the accumulator.

We shall translate arithmetic expressions into the assembler language for the described machine. This means that symbolic addresses will be used instead of real machine addresses. We shall use variable identifiers as their symbolic addresses. Instead of storing a constant at a certain address and then using this address, we shall use the constants themselves. The translation of an arithmetic expression is a program which computes its value and puts it in the accumulator. For instance, the expression:

  (x1 + 25)*factor
will be translated into:
  LOAD x1;
  ADD 25;
  MUL factor;

When we evaluate an arithmetic expression, we often need additional locations in the memory to store intermediate results of computation. The symbolic addresses for these locations will be $1, $2, $3 ... etc. The expression:

  (a + 318)*(b - c)
will be translated as:
  LOAD b;
  SUB c;
  STORE $1;
  LOAD a;
  ADD 318;
  MUL $1;

We assign the name TRAREX to our translation program. The full text of the program is listed below, followed by comments on its development.


  * Program TRAREX, 
  * Translation of arithmetic expressions
  $ENTRY Go { =  <Open 'r' 1 <Arg 1>>
                 <Open 'w' 2 <Arg 2>>
                 <Write <Translate <Input 1>>>; }
   
  $EXTRN Input;
   
  Translate {
      = ;
     e1 = <Code-gen (1) <Parse e1> <DG 'compl'>>
            }
  * Last symbol from a given list in expression
  * <Last (e.Symb-List) e.Expr ()>  ==  e1 sX(e2)
  *    where sX is the symbol looked for, 
  *    and e1 sX e2 is e.Expr, 
  * or                             ==  (e.Expr)
  *    if there are no such at top level
  Last {
     (eA sX eB) e1 sX(e2) = e1 sX(e2);
     (eA) e1 tX(e2) = <Last (eA) e1(tX e2)>;
     (eA) (e2) = (e2);  }
   
  * <Parse e.Arithm-expr>  ==  e.Tree
  * e.Tree  == s.Operation (e.Tree-1) e.Tree-2
  *         == s.Operand
  *         ==  empty
  Parse {
    e.Expr, <Last ('+-') e.Expr()>: e1 s.Op(e2) =
                     s.Op(<Parse e1>) <Parse e2>;
    e.Expr, <Last ('*/') e.Expr()>: e1 s.Op(e2)
          = s.Op(<Parse e1>) <Parse e2>;
     e1'^'e2 = '^'(<Parse e1>) <Parse e2>;
     s.Symb, <Type s.Symb>: 'F's.Symb = s.Symb;
     s.Symb, <Type s.Symb>: 'N's.Symb = s.Symb;
     (e.Expr) = <Parse e.Expr>;
      = ;
     e.Exp = 
        <Prout 'Syntax error.Cannot parse 'e.Exp>
        <BR 'compl=' Fail>;
         };
  * Code generation.
  * <Code-gen (sN) e.Parsing-tree>  
  *                     ==  assembler code
  * sN is the number of the first location 
  * free for intermediate results.
  Code-gen {
     e1 Fail  = ;
     (sN) '-' ()e2 =    
               <Code-gen (sN) e2>
               'MINUS ;'; 
     (sN) s.Op(e1)s2 =    
               <Code-gen (sN) e1>
               <Code-op s.Op> <Outform s2>';';
     (sN) '+'(s1)e2 =  
               <Code-gen (sN) e2>
               <Code-op '+'> <Outform s1>';';
     (sN) '*'(s1)e2 =    
               <Code-gen (sN) e2>
               <Code-op '*'> <Outform s1>';';
     (sN) s.Op(e1)e2 =    
               <Code-gen (sN) e2>
               'STORE $'<Symb sN>';'
               <Code-gen (<Add (sN)1>) e1>
               <Code-op s.Op> '$'<Symb sN>';';
     (sN) s.Symb =    
               'LOAD '<Outform s.Symb> ';';
     (sN) eX = (Synt-err)';';
           };
   
  * Convert operands into character strings
  Outform { 
        sS, <Type sS>:
           {'F'sS = <Explode sS>;
            'N'sS = <Symb sS>; }; 
           }
  * Assembler codes of operations
  Code-op {
     '+' = 'ADD ';
     '-' = 'SUB ';
     '*' = 'MUL ';
     '/' = 'DIV ';
     '^' = 'POWER ';
           };
  * Write assembler code in a column      
  * (on screen and on disk)
  Write {
     (Synt-err)';' e2 = <Prout 'Syntax error'>;
     e1';' e2 = <Prout <Put 2 e1>> <Write e2>;
     = ;  }

This program reads the expression from one file and puts the translation into another file. It also shows the translation on the terminal. The program requires one external function, Input , which is defined in the module reflib supplied with the Refal system package; we assume that reflib.rsl is stored in the current directory. To call trarex , enter:

  refgo trarex+reflib  input-file output-file

The starting function Go uses the built-in function Arg to make parameters entered with the call of the program available to the Refal machine. <Arg 1> is evaluated to input-file, and <Arg 2> to output-file. Function Open associates these files with reading and writing by channels 1 and 2, respectively.

Translation from one programming language to another includes the following three stages:

  1. Lexical analysis. At this stage characters of the input text are converted into the minimal lexical units of the source language.
  2. Syntax analysis, or parsing. Here the text in the source language is converted into a tree structure which expresses the meaning of the text in a form convenient for the third stage.
  3. Code generation. The target language code is produced from the parsing tree.

In our case, lexical units are identifiers, whole numbers, operation signs, and parentheses. The function Input takes care of all lexical analysis we need. Not only does it implode strings of characters representing identifiers and numbers into atomic symbols, it converts character-parentheses into Refal parentheses which are properly paired. This is already a beginning of parsing tree formation; it makes the next stage, syntax parsing by function Parse , easier.

NOTE: The character - on the keyboard is used as both a hyphen and a minus sign. We use the former in identifiers. Therefore, x-1 will be taken as an identifier. To subtract, surround - by blanks.

The input of Parse is the result of Input . Its output is a parsing tree to be used by the code-generation function Code-gen (see the definition of Translate ). The exact format of this tree should be chosen so as to make the code generation easier.

The main criterion for designing a representation of a tree-like object in Refal is that all cases of the object which we may wish to distinguish in programs must be distinguishable through the use of simple patterns. The use of open e-variables in patterns should be limited to only those cases where it is strictly necessary, such as when we look for an element of a list. Unnecessary parentheses should be avoided. Often we include in the format of the objects a few symbols to distinguish between cases and to enhance readability. In our case the tree has a simple structure. In code generation we must distinguish between elementary operands and subtrees. We use a format where elementary operands are always symbols, while subtrees never are. Thus we need no additional discriminators.

The case of an elementary operator (a leaf of the tree) is given by the formula:

  e.Tree == s.Operand
while the case of a binary operation is:
  e.Tree == s.Operation (e.Tree-1) e.Tree-2

We must also allow for one unary operation in arithmetic expressions: the change of sign represented by the same symbol '-' as the binary operation of subtraction. When we parse an expression we can treat the unary minus in the same fashion as the binary minus; the only difference is that in the case of the unary minus the first operand is empty. Thus we add one more case for the data type e.Tree :

  e.Tree  == - () e.Tree

and we consider this as a separate case at the stage of code generation.

While designing data structures in Refal we have to take into account the interests of both the make-it function, and the use-it function. It often happens that a structure originally designed in the interests of one of the parties needs modifications in order to better satisfy the other party. Sometimes you will do a few such adjustments before coming to final data structures as you go into details of the algorithm. This may require some effort, but what you gain is a better representation and, almost certainly, a better understanding of the algorithm. This will pay off at the time of testing and debugging.

In parsing we use the ability of the Refal machine to jump from a left or right parenthesis to the the other member of the pair and to pass expressions both from left to right and right to left. For left-associative operations, namely + , - , * , and / , the topmost operation in the parsing tree is the last one at the main parentheses level, e.g.:

  a - b - c = (a - b) - c

We cannot use open e-variables to locate the proper minus sign in this expression, because the left-to-right order is built into the definition of lengthening. But we can examine term by term from right to left until we discover the first minus. This is done by the function Last , which finds the last entry of a symbol from a given list (``to find'' in Refal means to put parentheses in the appropriate places). (The form of the input and output of Last is described in detail by the program's comments.)

The exponentiation operation ^ is right-associative, so we can use open variables in top-down parsing when looking for it.

All this can be clearly seen in the definition of Parse . Each sentence there corresponds to one syntactic category of the source language. First we break the expression by the top-level additive operations + and - ; then the multiplicative operations * and / . The third sentence takes care of ^ . The order of these sentences is important; it reflects the relation of seniority between operations.

But what to do with the unary minus operation? Its bond is tighter than the additive and multiplicative operations, but looser than that of exponentiation. One solution is to add the corresponding sentence between the second and the third sentences. But then we need to be sure that the first sentence will not work with the empty value of e1 . This could be done by replacing e1 by tX e1 . However, we use another solution in the program. We simply allow the first (as well as the second) argument of binary operations to be empty, thus extending the definition of a parsing tree. When it comes to code generation we allow only one case where the leaf of the tree is empty -- that of the unary - .

However, this simplicity does not come without a price to pay. An expression like ()-x will be considered legitimate and equivalent to -x . (See an exercise below for a better solution.)

The remaining five sentences of Parse take care of identifiers, numbers, parentheses, empty expressions, and error situations.

Error handling is an important part of translation. In our program lexical errors (including unpaired parentheses) are caught by the standard function Input . In the process of parsing many parallel calls of the function Parse may appear in the view field. If one of these calls meets a syntax error, there is no sense in generating the code. The program should send an error message and either stop the process or go on parsing in order to discover other possible errors, but stopping short of code generation in any case. The latter is preferable, and we have chosen this option in our program. To send a message to Code-gen , Parse buries the symbol Fail under the name 'compl' (for `completion'). In the call of <Code-gen> we dig 'compl' and add it on the right side of the argument. If nothing was buried (no errors), the addition will be empty. The first sentence of Code-gen checks whether Fail is there and terminates execution if it is.

Function Code-gen generates the assembler code from the parsing tree of an arithmetic expression. It has one more argument -- the number indicating the first free location for intermediate results of computation. It is 1 in the beginning.

When our target machine executes an operation, the first operand is in the accumulator and the second by the address specified. However, a good code generator should use the commutativity of addition and multiplication and reverse the order of operands when it can improve the code. For instance, with the order of operands specified in the expression

  a * (b - c)
the translation is:
  LOAD b;
  SUB c;
  STORE $1;
  LOAD a;
  MUL $1;
There is, of course, a better translation:
  LOAD b;
  SUB c;
  MUL a;
It is this translation that will be given by our program. There are special sentences in the definition of Code-gen which take care of the situations where the reverse order is preferable.

Some errors may not be discovered until code generation, e.g., an empty operand (except in the case of the first operand of - ). The last sentence of Code-gen will discover such an error, and the output function Write will print out a message and stop. As for other sentences of Code-gen , it is easier to grasp how they work from the Refal text, than from written explanations.

Exercise 5.3 Modify TRAREX so that it works in the interactive mode, asking the user to supply an expression, translating it and asking again, until the user types END instead of an expression.

Exercise 5.4 As mentioned above, TRAREX accepts ()-x as -x . Modify Parse so that an empty sub-expression is always considered an error.