Pascal Implementation by Steven Pemberton and Martin Daniels
Code generation can be split into several parts:
Code is produced in symbolic form. Each line output is a comment or a label definition, an instruction or an end assembly line.
A comment is a line beginning with the letter I followed by a number: this is purely for the human reader, and just numbers every tenth instruction output.
For example
I 10
(Routine putic does this, and is its only task.)
Label definitions are lines beginning with the letter L
followed by a number, optionally followed by an equals sign
followed by another number. For example:
L 10
L 12=4
The first case is a code label, and is the destination of a jump or
call within code. The second case defines a value for use by
ENT
instructions (ENT
instructions are
generated before the value is known).
Instruction lines start with a space followed by a three-letter name optionally followed by other information. An instruction can be 'typed' or 'untyped', and can have zero, one or two parameters.
An example of an untyped instruction without parameters is the stop instruction:
STP
Typed instructions are qualified by an extra letter giving information on the type of its operands. For example
EQUI
compares two integers for equality
EQUR
compares two reals.
Parameters consist of a P and a Q parameter, either of which may be absent. Typically Q is the major parameter, for instance the address of a variable, with P qualifying it in some way, such as which region to find it in. Examples:
LAO 9
Load onto the stack the address of the
variable at address 9 in the outer region.
LDA 0 5
Load the address of the variable
at address 5 in the local region.
LDOI 10
Load the contents of the outer
integer variable at address 10.
LODI 1 5
Load the contents of the integer
variable at address 5 of the region one level out.
An end assembly line consists solely of the letter Q:
Q
The names of these routines have a loosely adhered to coding for their names: the word gen, followed by a digit, optionally followed by a t. The digit indicates the number of parameters, and the t means 'typed'.
This is the simplest of this set of routines, and a good place to start. It is used to output untyped instructions with no parameters:
ABI
, ABR
, SQI
,
SQR
, TRC
, ODD
,
CHR
, EOF
, FLT
,
NOT
, SGS
, UNI
,
MPI
, FLO
, MPR
,
INT
, DVR
, DVI
,
MOD
, AND
, NGI
,
NGR
, ADI
, ADR
,
SBI
, SBR
, DIF
,
IOR
, INN
, UJC
,
STP
.
[1833] Fop is the index of the instruction to be output, and is used to index the array mn [292], to give its name, and cdx [294], to give its effect on the stack (see later).
[1835] Prcode is the boolean indicating whether code is to be output or not. Puttic simply outputs a comment line in the code every tenth instruction.
Mn is initialised at [3890-905] with the names of the instructions, so this writeln prints the name of the required instruction.
[1836] Variable ic, the instruction counter, is incremented for each instruction output. Mes measures the effect that this instruction has on the runtime stack, in terms of operands put on or taken off the stack.
procedure gen0(fop: oprange); begin gen(fop); writeln(prr) end;
At run-time the main program, and each invocation of a routine, has associated with it a 'stack frame' which contains the data areas for the routine.These areas are:
The maximum size necessary for each of these areas can be calculated at compile-time, and mes is used for calculating the size of the local stack area.
Each instruction may have an effect on this local stack at
run-time. For instance LDC
, 'load constant', loads a
constant on to the top of the stack, and so can be said to have an
effect of +1. Similarly, ADI
takes the top two stack items and
replaces them by their integer sum, so this can be said to have an
effect of -1.
Similarly all other instructions can be so classified, with zero for instructions with no effect.
This information is recorded in the array cdx initialised [3948-63].
In the same way, the standard functions and procedures have an effect on the stack, and these effects are recorded in the array pdx, initialised [3964-9]. The P-code names of the standard routines are held in the array sna, initialised [3880-5].
When a routine is compiled, there are two variables, topnew and topmax, initialised to lcaftermarkstack [3472]. Topnew indicates the current size of the local stack area, adjusted for every instruction output, topmax the largest value topnew has had in the current routine and therefore the required size of the local stack within the frame.
[1826] Calculate the new value of topnew.
[1827] Update topmax if necessary.
MST
(mark stack) instruction, generated for every
routine call, is changed from zero to
lcaftermarkstack.LDC
it does not know what type of
constant is being loaded) it assumes the worst, and multiplies the
change value by the size of the largest possible item that can be
loaded on the stack, maxstack. The value of
maxstack is the maximum of intsize,
realsize, charsize, boolsize,
ptrsize, and setsize, rounded to a multiple of
stackal if necessary (see [34-56]).This routine is used for untyped instructions with one parameter
(Q): CSP
, LCA
, LAO
,
IXA
, MST
, MOV
and for
RET
which is typed with no parameters.
[1844-8] For the instruction CSP
(call
standard procedure), the effect of this instruction depends on which
routine is being called, so pdx is used instead of
cdx. Fp2 is the index of the standard routine
being called.
[1850-9] Instruction LCA
loads the address of a
string. Here fp2 indexes an array of strings, and the characters of
the string are output, surrounded by single quotes. Note that the
string is padded out with spaces to make it exactly strglgth
characters long.
[1860] Instruction RET
is actually a
typed instruction, with a single character: P for procedures, and I,
R, C, B, A for integer, real, character, boolean, and pointer
functions respectively. Fp2 is just the ord of
the character. However, gen1 is only called for the
RETP
case. Gen0t is used for the others.
[1861] In all other cases, fp2 is just a value to be printed, for example
MOV 6
CSP
, LCA
, and RET
should each
have their own routine; it is pure contortion fitting them here.The LCA
case needs greater study, because of its
use of the array cstptr.
Referring to the declaration of cstptr [1813], there is a comment indicating that it is used for non-integer constants (real, string, and set; chars are treated as integers anyway), so that they can be treated as integers within the compiler. (There is also a confusing mention of a nonexistent procedure writeout; more on that later).
However, locating all the places where cstptr is used, the following cases emerge: [1853] in gen1, [1885] and [1893] in gen2, where the constant stored in cstptr is printed out; and [1976], [2018], [2880] where a value is added to the array (printing an error message if full). But in these three latter cases, putting a value in is immediately followed by a call to either gen1 or gen2 using that value. Cstptrix the index into cstptr is never decremented, and cstptr is never used except in these six cases.
In other words, all the values stored are used once and once only, immediately after they are stored, and so cstptr need not be an array, but only a single variable - 64 elements are completely wasted, and error 254 need never occur!
In fact, this array is probably a left-over from a previous version of the compiler. Throughout the compiler there are clues - like the comment at [1815], and the restriction that goto's may not lead out of a routine - that once upon a time the compiler stored the code of a routine in an array, and printed it out at the end of the routine. This probably included filling in the addresses of goto's as well, and when this organisation was changed, the storing of constants was overlooked.
Anyway, the cure for all this is to scrap cstptr completely, delete [1849-59], and introduce a new procedure genlca(fcp: csp); and replacing [2015-20] by genlca(cval.valp) and similarly for gen2 and the calls around [1976] and [2880].
This routine should be used for untyped instructions with two
parameters (P and Q); however the only one that conforms to this is
LDA
. The others are:
EQU
, GEQ
, GRT
,
LEQ
, LES
, NEQ
which are typed with no parameters unless the type is M when there is one, and
LDC
which is typed with one parameter, except for LDCN
which
has none.
[1874-5] For LDA
(load address,
fop = 50), output the two parameters. Gen2 is
never called with fop = 45, 54,or 56 (CHK
,
LOD
, STR
).
[1876-80] The compare instructions EQU
,
GEQ
, GRT
, LEQ
,
LES
, and NEQ
. The type character is printed,
and if this is M, for string comparisons, the length of the string is
also printed.
[1881-98] LDC (load constant). In this case fp1 indicates what type of constant is to be loaded (1 integer, 2 real, 3 boolean, 4 nil (in which case fp2 is not used), 5 set, and 6 character), and fp2 is the constant to be loaded, or a pointer to it.
LDA
, the comparison instructions,
and for LDC
.LCA
, which
loads the address of a string (the value of a string is never loaded),
should really be a LDC
typed for string, and let the
interpreter deal with the difference, especially since the interpreter
already does special things for reals, sets and large integers.[1895] The ':3' is a mistake. If sethigh is greater than 99 (which is quite possible) then k may well be a three digit number, in which case subsequent values would run into each other. A better coding would be
if k in pval then write(' ', k:1)
This routine is used from gen0t, gen1t, gen2t to output the type character of typed instructions.
[1909] The type character for integer is 'i'.
[1911] For boolean 'b'.
[1913] For character 'c'.
[1915] Enumerations apart from boolean are treated as integer.
[1916] 'r' for real.
[1917] For subranges, gentypindicator is called recursively for the type of the subrange.
[1918] 'a' (address) for pointers.
[1919] 's' for sets.
[1920] 'm' (multiple) for arrays and records.
[1921] It should never be possible for this routine to be called for files, tagfields, and variants, so this would be a compiler error.
[1909-16] could be written more simply as
if fsp=boolptr then write(prr, 'b') else if fsp=charptr then write(prr, 'c') else if fsp=realptr then write(prr, 'r') else write(prr, 'i') (*intptr and enumerations*)
since it does not repeat the 'i' case (see [3060-5]).
These are all used for typed instructions. Gen0t for instructions with no parameters:
STO
, ORD
, and RET
,
gen1t for instructions with one parameter:
LDO
, IND
, SRO
, INC
, and DEC
,
and gen2t for instructions with two parameters:
LOD
, STR
, CHK
.
CHK
will have a first
parameter larger than maxlevel [33].Instruction CHK
is used to check that the value on
top of the stack is within the limits of the two parameters, for
example
CHKI 0 9
would check that the integer is between 0 and
9. However CHKA
, for checking addresses, does not need
two parameters, since the correct range of values is determined by the
interpreter not the compiler, except for whether the address is
allowed to be nil or not. CHKA
is generated from two
places, [2211] when dereferencing pointers, where the first parameter
of 1 indicates that the address being checked may not be nil, and [3131] when assigning pointers, where the zero
indicates nil is acceptable. In fact these two calls
are the only major uses of maxaddr, which could easily be
eliminated.
These are for the instructions
FJP
, UJP
, XJP
, CUP
, and ENT
.
Despite its name, genujpxjp is
also used in one place for FJP
[3379]. Otherwise the
routine names reflect their uses.
These five instructions all share the property that they
have a label parameter, indicated in the code by the letter L
followed by an integer. For example,
FJP L 10
FJP
.In fact the letter L
before the label number is
redundant, and can only be for human readability, since it is always
known from context whether a parameter is a label or not (for instance
the parameter of FJP
is always a label, so FJP
10
would be sufficient).
If this letter were removed, then these three routines could be removed, and gen1 and gen2 used instead. Of course, this would also imply a slight change in the interpreter.
These two routines are for creating labels to be referred to in the code, and outputting them.
[848] Intlabel, declared [298], initialised to zero [3802], is just incremented each time genlabel is called, thus delivering a unique integer for each label.
[2077] A generated label is output. (This is done only once for each label).
Genlabel could be a function, replacing code like [1722-3]
genlabel(lbname); pfname:=lbname
with
pfname:=genlabel
When assigning an address to a variable, two values have to be taken into account:
Both these values of course depend on the target machine, which is why the compiler is parameterised with a set of these values for all the simple types [34 et seq.].
For example, some machines require four store units to hold an integer, and require that it be stored starting on an even store boundary. This would be reflected by values of 4 for intsize and 2 for intal.
When addresses are assigned to objects, they are assigned as offsets relative to a fixed point. So the first item is assigned the initial offset, and subsequent offsets increase from there.
As an example, suppose that integers need 2 units on an even boundary, and characters need one unit on any boundary. Then intsize and intal are 2, charsize and charal are 1. Also, assume the initial offset is zero. Now, space is to be allocated for two integers, a character, and another integer.
The first integer can be assigned offset zero, and the displacement incremented by intsize.
Now the next integer can start at offset 2, as this is an even boundary:
The next item is a character, which may start at any boundary:
Now the next item is an integer, and the current displacement is odd, so it has to be aligned to the next even boundary before allocating the address, so that location 5 is wasted:
All objects are allocated in this way: variables, parameters, fields of records, and elements of arrays.
For variables, there is a global variable lc[213] that holds the current displacement within the current routine.
There are two routines to achieve address assignment: alignquot [646-66] for calculating the alignment value for an object, and align [668-74] for making sure the current displacement is a multiple of the alignment.
For instance, to allocate an integer variable in the current procedure would need
align(intptr, lc); lc:=lc+intsize
All procedures and functions use the first few locations of their data space for system values and function results, so that the initial value of the displacement for these cases is not zero, but the value called lcaftermarkstack[55].
[648] Default of 1.
[652-6] Standard types: get the value from the appropriate constant.
[657] Parameters to procedures. Because of the way they are evaluated by putting them on the stack, they have an alignment constant independent of their types.
[658] The alignment of a subrange is that of its rangetype.
[659-61] Pointers, sets, files.
[662] The alignment of an array is that of its element type.
[663] Records have an alignment constant independent of their fields.
[664] Alignquot is not called for variants and tag fields.
The first 3 lines could be rephrased as
if fsp=nil then alignquot:=i else with fsp^ do
The workings of this routine are not immediately apparent. It is required to make sure that flc, the current displacement, is a multiple of k the required alignment value. The effect wanted is
while (flc mod k) <> 0 do flc:=flc+1
but without using a loop.
Since flc may already be a multiple of k, the problem can be restated as finding the next multiple of k that is greater than (flc - 1).
The next multiple of k below or equal to a value X is X - X mod k
Therefore the multiple of k above X is
X - X mod k+k.
Substituting (flc - 1) for X gives
(flc-1) - (flc-1) mod k + k.
However, flc at its smallest may be zero, so (flc-1)can be negative, and Pascal's mod is not defined for negative operands. So since
(a mod b) = ((a+b) mod b),
k can be added, giving
(flc - 1) - (flc - 1 + k) mod k + k.
Rearranging
(flc-1+k) - (flc-1+k) mod k.
This has been simplified to
l:=flc-1; flc:=l+k - (l+k) mod k
A better simplification would be
l:=f-1+k; f:=l-l mod k
This procedure makes sure that code has been generated to completely load the current expression onto the stack.
[1961] If typtr is nil, a compilation error occurred with this expression, and so no code need be generated for it.
[1964-81] Generate code to load a constant: boolean [1965], character [1968], integer or enumeration [1969], nil [1971], real [1978], and set [1980]. (String constants are loaded in loadaddress.)
[1982-8] Code to load a variable.
[1984] Load a variable at the outermost level, that is, a global variable.
[1985] Load a variable local to some routine. Level - vlevel is the number of levels out from this level. Level is the current level, vlevel the level that the variable was declared at. Therefore a difference of zero means the variable is local to this routine, a difference of one means it comes from the surrounding routine, and so on.
[1986] Code to load a value indirectly, via an address on the top of the stack.
[1989] If kind=expr the expression is already fully loaded.
[1991] Set the kind to expression.
LDO
, that loads a global
variable, is an optimisation, since LOD
could
still be used here. The same holds for the other two global
accessing instructions LAO
and SRO
.Unlike load and loadaddress which both use the global attributes gattr, store is passed the attributes it is to use via the formal parameter fattr. This is because for instance with the statement
i:=1
by the time that store is called, gattr refers to the right-hand side, whereas store needs the attributes of the left-hand side, which must therefore be saved and passed to the routine (see [3112], [3128], [3133], [3135]).
[1997] Fattr.kind is always varbl.
[1998] If typtr is nil then some compilation error occurred.
[2000] Store into a direct global variable.
[2001] Store into a direct non-global variable.
[2002] Loadaddress is called for indirect variables prior to calling store, and sets idplmt to zero [2031]. Therefore if idplmt is not zero it is a compiler error.
[2003] Store indirect.
This is the equivalent of load, but loads addresses rather than values.
[2014-21] Load is used to load small values. Only the addresses of potentially large objects are loaded.
This section then, is for loading the address of
a string constant, for which there is a special instruction
LCA
(load constant address).
[2023] Load the address of a global variable.
[2024] Load the address of a non-global variable
[2025-6] If access = indrct then an address is already loaded. More code only needs to be generated for the case of a field of this variable being accessed (for example a[i].left), in which case the address only needs to be incremented by the offset of the field, using an INCA instruction (increment address).
Nilptr is a structure with
form pointer, used in gentypindicator
[1904] to generate the A of INCA
.
[2029] Loadaddress is never called for kind = expr.
[2031] Reset the attributes of gattr.
As mentioned before, there is no need for a
separate LCA
instruction. LDCM
would do as
well, and the assembler could differentiate as necessary.
This generates code to check if a value is in a required range. It is only ever called if debug is true, i.e. if the doption has been set. It is called from two places: [2647] to check an actual parameter fits a formal parameter, and [3127] to check the right hand side of an assignment fits the left, if the left is a subrange variable.
Fsp [2062] is the type of the target variable (for example a in a:=b).
Other productions of the instruction CHK
occur at
[2153] for array subscripts
[2211] for dereferencing a pointer
[3132] for assigning a pointer
[3270] for the controlling expression of a case statement.
Unfortunately, the generation of checking code is far from optimal. For instance a call to a procedure like
p('+');
or an assignment
ch:='+';
would both generate code to check that '+' was a character. You might like to consider what changes would be necessary to improve on this. See (Welsh, 1978) for an interesting article on optimal checking code.
Pascal Implementation by Steven Pemberton and Martin Daniels
Copyright © 1982, 2002 Steven Pemberton and Martin Daniels, all rights reserved.