Part I. Introduction to Parsing
A. Introduction
This is the first of a series of articles on language translation. This is commonly called “parsing”, and I will no doubt lapse from time to time into this usage, but it is better to reserve the word for one phase of translation.
The techniques I will discuss are used in compilers and interpreters, of course, but also in many more-common circumstances. For example, when you add a formula to a spreadsheet, the application must translate the formula into a form it can use. Many database programs let you specify formats for input fields -- these must translate your format, and then use the result to translate the input. Consider the expression search in MPW. I guarantee you there’s a translator operating there. And, of course, my own program Idealiner uses a translator when you specify topic numbers.
All the code examples in this series of articles will be MPW tools, to minimize interface considerations. I doubt if you’ll see a single Macintosh trap before the tenth article. I will focus instead on the internals of language translation.
Language translation can be divided into three stages: first, the LEXICAL ANALYZER (also called the “scanner”) divides the input text into individual symbols, pieces like identifiers, operators and constants; then the PARSER interprets the sequence of symbols according to the language’s grammar; and lastly the CODE GENERATOR responds in an appropriate fashion to the statements the parser has processed. These three phases operate simultaneously, the parser calling the lexical analyzer for input, and then calling the code generator for output.
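To make that division concrete, here is a minimal sketch of mine (the names are invented for illustration, not a real API) showing the three stages cooperating, with the parser driving the other two:

#include <stdio.h>

static const char *source = "ab";    /* pretend input text */

static int next_token(void)          /* lexical analyzer: text -> symbols */
{
    return *source ? *source++ : 0;  /* one character = one "token" here */
}

static void emit(int token)          /* code generator: respond to the parse */
{
    printf("emit %c\n", token);
}

static void parse(void)              /* parser: calls the analyzer for input,
                                        the generator for output */
{
    int t;

    while ((t = next_token()) != 0)
        emit(t);
}

int main(void)
{
    parse();
    return 0;
}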
A(1). Parsing
The first three parts in the series will focus on parsing.
The first part of the article (this very one) will introduce parsing and the parser-generator YACC (Yet Another Compiler Compiler). YACC converts a set of GRAMMAR RULES into C source for a parser that will accept that grammar. You still have to write the lexical analyzer and the code generator (although YACC provides a framework for code generation), but YACC takes care of the most tedious part of the job.
The second part will go beneath the surface and explore the inner workings of YACC. This will be somewhat technical, and may seem unnecessary, but it will provide an essential foundation for later topics. The article will cover deriving parse tables from YACC output, parsing expressions by hand (using the parse tables), and YACC’s debug output.
The next article will discuss ambiguities, associativity and error detection. An understanding of the first two topics is essential to writing correct and efficient grammars. The third covers detecting and reporting errors by type and location.
A(2). Lexical Analysis
With the third article, I will return to the first stage of language translation, lexical analysis. The article will begin with a discussion of state machines, then present a simple but effective table-driven analyzer.
The fourth article will be an excursion into a seemingly unrelated field: hash tables and binary trees. The idea is to develop some tools to increase the power of the lexical analyzer in the following article.
The fifth article will extend the analyzer by adding a symbol table. The routines developed in the fourth article will give us a way to save symbols; the example program will be an improved version of the MPW Canon tool.
A(3). And a Useful Example
Now, I’m pretty sure of the preceding, because I’ve already written the articles! What follows is a forecast of what I’ll do next.
The plan is to build an MPW tool that will preprocess either Pascal or C source files and convert inline assembly code into Pascal inline routines or C direct functions, as appropriate. That is, you will be able to write assembly language instructions, and the preprocessor will convert your assembly into machine language. Your assembler routines will be declared with high-level headers, a la Lightspeed C, and you will be able to refer to routine arguments and local variables by name, rather than indexing off the stack, a real convenience. I’m going to write this tool because, damn it, I want to use it!
This will be a major project: I’ll need to parse Pascal and C well enough to find routines, and I’ll need to parse assembler completely. I’ll then need to write a more-or-less complete assembler. It could take six or eight columns.
B. Grammar Descriptions
Here’s a reference for this topic: Compilers: Principles, Techniques and Tools, by Aho, Sethi and Ullman (Addison-Wesley, 1986). You may have heard of some of these guys. I will refer to it as “ASU”.
A computer language is composed of a set of symbols (the “words” of the language), and a set of grammar rules that determine how these words are formed into statements (the “sentences” of a computer language). As I said earlier, the lexical analyzer is in charge of extracting the “words” from the input; it is the job of the parser to make meaningful “sentences” from these “words”.
A bit of terminology: classes of symbols (such as “integer” or “identifier”) are called TOKENS, and specific instances of tokens (such as “252” or “sysbeep”) are called LEXEMES. The lexical analyzer will typically determine both the type and value of a symbol that it reads; the former is the token classification and the latter is the lexical value. The parser cares about tokens; the code generator cares about values.
I guarantee that I will use the word “token” to mean both token and lexeme. The meaning should be clear from the context.
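To pin the distinction down, here is a toy sketch of mine (everything in it is illustrative; the code 257 mimics the sort of value YACC assigns to a %token name):

#include <stdio.h>

#define NUM 257            /* token: the class "number" */

int yylval;                /* lexical value of the last symbol read */

int classify(char *lexeme) /* toy analyzer for a single symbol */
{
    if (sscanf(lexeme, "%d", &yylval) == 1)
        return NUM;        /* token is NUM; yylval holds the lexeme's value */
    return lexeme[0];      /* otherwise, a single-character token */
}

int main()
{
    int token = classify("252");
    printf("token %d, value %d\n", token, yylval);
    return 0;
}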
B(1). Grammar Rules
A grammar rule is called a PRODUCTION. It is a substitution rule; the single symbol of the left-hand side (often, but not by me, abbreviated LHS) produces the expansion on the right-hand side (RHS). Here’s an example:
stmt -> ID '=' expr ';'
A ‘stmt’ can be expanded to (that’s what the ‘->’ means) a series of symbols consisting of an ‘ID’, an ‘=’ sign, an ‘expr’, and a semicolon. Symbols within quotation marks are explicit, and must appear as written; other symbols are token classes.
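For instance (an example of mine, not part of the grammar above), the statement

total = count + 1 ;

matches this production: ‘total’ is the ID, and the whole line fits the right-hand side once “count + 1” has been reduced to an ‘expr’.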
Recursive definition is permitted. Here’s an example:

expr -> expr '+' NUM
Applying this rule to the first, we discover that:

stmt -> ID '=' expr '+' NUM ';'
stmt -> ID '=' expr '+' NUM '+' NUM ';'
And so on. It is possible to define a complete language such as Pascal or C with relatively few such productions, even though there are infinitely many legal statements in the language.
Now, everyone can make up his own conventions, of course, but I will distinguish two kinds of non-explicit symbols by showing one in all-caps and one in all-lower-case. All-caps symbols are defined by the lexical analyzer, not by the parser. Thus, they will not appear on the left-hand side of any production. These symbols are called TERMINALS, because they terminate expansion. When you expand to a terminal, you can expand no further. The lower-case symbols are, surprise!, NON-TERMINALS. They are defined by the parser rather than the lexical analyzer, appear somewhere as the left-hand side of one or more productions, and do not terminate expansion. These distinctions are important!
As the above examples suggest, several productions can share the same left-hand side:

expr -> expr '+' NUM
expr -> NUM
This pair of productions expands to arbitrary sums: start with the first production, then substitute the first production into itself to add another term, or the second to terminate the expansion.
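To see this in action, expand the first production into itself once, then finish with the second:

expr
-> expr '+' NUM            (first production)
-> expr '+' NUM '+' NUM    (first production again)
-> NUM '+' NUM '+' NUM     (second production)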
B(2). An Example Grammar
This will be fun. If you don’t want to mess with abstract symbols, just skip this whole section; the result is all you need. My development of the grammar won’t be very theoretical, though.
A parser reads the input symbols from left to right, until it has read the right-hand side of a production. Then it REDUCES the input, running the production backwards and replacing the right-hand side with the left-hand side. This is the point at which code is generated and expressions evaluated: when a reduction occurs.
So to see if a grammar does what we want, we can start with a test statement that should be accepted by the grammar, and see what happens when we play parser.
The grammar we are shooting for will accept simple algebraic expressions, involving addition, subtraction, multiplication and division of numbers.
B(2)(a). First Try
Earlier, we saw productions that can expand to arbitrary sums:

expr -> expr '+' NUM
expr -> NUM
We can add a few more to get the other three usual operators:

/* 1 */
(1) expr -> expr '+' NUM
(2) expr -> expr '-' NUM
(3) expr -> expr '*' NUM
(4) expr -> expr '/' NUM
(5) expr -> NUM
Let’s try a simple test:

NUM '+' NUM '*' NUM
-> expr '+' NUM '*' NUM    (rule 5; remember that we read from the left)
-> expr '*' NUM            (rule 1)
-> expr                    (rule 3)
The addition was the first thing to go; therefore, it was performed first. This is contrary to our expectations, I hope (my past as a math teacher shows itself!). The grammar won’t work. We need to make sure that multiplication and division are performed before addition and subtraction.
B(2)(b). Second Try
So let’s introduce another non-terminal. We want to cluster products together, to ensure that they are evaluated first, so let’s make them a separate non-terminal:

/* 2 */
(1) expr -> expr '+' term
(2) expr -> expr '-' term
(3) expr -> term
(4) term -> term '*' NUM
(5) term -> term '/' NUM
(6) term -> NUM
Now try the test string again:

NUM '+' NUM '*' NUM
-> term '+' NUM '*' NUM    (rule 6)
-> expr '+' NUM '*' NUM    (rule 3)
-> expr '+' term '*' NUM   (rule 6)
Oops, it looks like we have a choice here: rule 1 or rule 4. We really don’t, though; reducing by rule 1 would leave us with “expr ‘*’ NUM”, and we don’t have any rule with “expr ‘*’” in it.

-> expr '+' term           (rule 4)
-> expr                    (rule 1)
So the multiplication happened first, just like we wanted.
B(2)(c). Third Try
Suppose, however, that we wanted the addition to occur first. Then we need to add some parentheses:

/* 3 */
(1) expr -> expr '+' term
(2) expr -> expr '-' term
(3) expr -> term
(4) term -> term '*' NUM
(5) term -> term '/' NUM
(6) term -> '(' expr ')'
(7) term -> NUM
Now we treat an addition or subtraction that occurs within parentheses as if the sum or difference were a ‘term’ (rule 6). And that should do it.
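To check, play parser on a parenthesized test string; notice that the addition is reduced -- and so evaluated -- before the multiplication:

'(' NUM '+' NUM ')' '*' NUM
-> '(' term '+' NUM ')' '*' NUM     (rule 7)
-> '(' expr '+' NUM ')' '*' NUM     (rule 3)
-> '(' expr '+' term ')' '*' NUM    (rule 7)
-> '(' expr ')' '*' NUM             (rule 1)
-> term '*' NUM                     (rule 6)
-> term                             (rule 4)
-> expr                             (rule 3)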
C. YACC: Yet Another Compiler Compiler
YACC is a UNIX tool that builds parsers from grammars. We can take the grammar just developed, supplement it with a minimal bit of C code, and YACC will write the complete parser for us. Or, if the parser is only a small part of a big program, YACC will write that small part, and we can then link it to the main program. In either case, YACC saves us from the unbearable tedium of computing parse tables by hand.
A YACC input file is divided into three parts: the declaration part, the grammar part, and the program part. (Remind you of anything? My past as a COBOL programmer coming out!) The three sections are separated by lines beginning:

%%
YACC writes a C source file as its output; it will also write a file of parser information if you desire it. We’ll look at that next time.
C(1). The Declaration Section
The declarations include a set of C-style declarations that are used by YACC to build the parser. For now, there are only two sorts of declarations that concern us.
The first is the “%token” declaration. YACC will automatically recognize one-character tokens like ‘+’ and ‘-’. All other terminal tokens must be declared with the %token statement, e.g.,

%token NUM
Non-terminals do not need to be declared; they are implicitly declared in the grammar rules.
The second declaration type lets us pass regular C declarations through YACC to the C compiler. These look like this:

/* 4 */
%{
#define blip 12
%}
Anything between the %{ and the %} is passed through unchanged, and written by YACC to the C source file.
It is customary to include a #define for YYSTYPE. What is YYSTYPE? It is the type of the parser’s stack. For now, there’s no need to worry about it. It is “int” by default, and int will work fine for what we’re doing this month. Later, after I’ve discussed how the parser operates, and we know what sort of things go on the stack, we’ll come back to it.
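(Just as a taste: if we someday want the stack to hold longs instead of ints, a single pass-through definition in the declaration section does it. We won’t need anything like this until later in the series.)

%{
#define YYSTYPE long
%}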
C(2). The Grammar Section
The grammar section includes all the grammar productions, like those we discussed earlier. They are written in a somewhat non-standard format (taking ASU’s notation as the standard, which I think is reasonable). They also provide a framework for code generation.
C(2)(a). Production Rule Format
The arrow in productions is replaced with a colon, and productions with the same left-hand side are combined into one, with the right-hand sides separated with vertical bars, |. The last right-hand side is terminated with a semicolon. The usual practice is to format the productions like this:
/* 5 */
expr    : expr '+' term
        | expr '-' term
        | term
        ;

term    : term '*' NUM
        | term '/' NUM
        | '(' expr ')'
        | NUM
        ;
in place of:
expr -> expr '+' term
expr -> expr '-' term
expr -> term
term -> term '*' NUM
term -> term '/' NUM
term -> '(' expr ')'
term -> NUM
C(2)(b). Code Generation
You can follow each right-hand side with some pseudo-C code that the parser will then call when the production rule is executed (i.e., if you read that stuff about reductions, when the input is reduced by that production).
Here’s an example. Suppose you have the production rules:

expr    : expr '+' term
        | expr '-' term
        | term
        ;
The question is, what is the value of the “expr”? In the first case, it’s the value of the first token of the right-hand side plus the value of the third; in the second, it’s the first minus the third; and in the third, it’s just the value of the first token. So we write:
/* 6 */
expr    : expr '+' term
            {
            $$ = $1 + $3;
            }
        | expr '-' term
            {
            $$ = $1 - $3;
            }
        | term
            {
            $$ = $1;
            }
        ;
This isn’t hard to figure out; $$ is the value of the left-hand side, $1 is the value of the first token of the right-hand side, $3 the value of the third token. Using those symbols, you just write straight C code. You are not limited to a single line, and you can call routines that you have written elsewhere. Don’t forget the braces or the semicolons.
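For instance, the first action could just as well hand the work to a routine of your own (add() here is a hypothetical routine of mine, defined in the program section or in another file):

expr    : expr '+' term
            {
            $$ = add($1, $3);
            }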
C(3). The Program Section
The program section is made up of straight C code, and is copied unaltered into the C source file. While YACC requires that you supply some routines to it, you can put them in another file, and this section can be completely empty. In simpler cases, however, the program section allows you to write your entire program in the YACC source file.
Here’s what YACC requires: a lexical analyzer, called yylex(), and an error routine, yyerror(). Common sense requires a main() routine as well.
The prototype for yylex() is:

/* 7 */
int yylex();
The parser routine, called yyparse(), will call yylex() whenever it needs input. yylex() reads the next token, sets the global variable “yylval” to the value of the token (optional; this is the “$” value used in the production code), and returns the token type. You might wonder where it reads the token from, since it hasn’t any arguments. The answer is, it uses global variables of some sort, such as a global string or a global file reference, that are set up by the main() routine.
The error routine is:

/* 8 */
void yyerror(char *message);
The message is generated by the parser when it detects an error; yyerror()’s job is to notify the user that something has gone wrong. Of course, you don’t have to live with YACC’s default error messages, which are rather unilluminating; you can call yyerror() yourself. More on this in a future article.
D. An MPW Hex Calculator
Now for some real code. Our example program is an MPW tool, a hex calculator. It will evaluate expressions using +, -, * and /, and will also properly evaluate expressions with parentheses in them. The name of the tool is “Hex”; you can invoke it with an expression, e.g.:

Hex '2 - ( 3 - 4 )'
in which case it will evaluate, print the result, and exit. Notice that in this case, the expression must be in quotation marks, or MPW will treat each token separately. Note also that the tokens must be separated by spaces. This is to simplify the lexical analyzer; we will relax this requirement in a future version.
The tool may also be invoked without an expression to evaluate. It will then go into a loop, reading in expressions and evaluating them, e.g.:

Hex
? 2 - ( 3 - 4 )
= 3
? 64 * 8
= 320
?
The loop ends on a blank line.
Here are the input globals and the code for the main() routine:
/* 9 */
char *input;
char *token;

void main(int argc, char *argv[])
{
    char thestring[256];

    if (argc < 1)
        printf("\tImpossible error!\n");
    else if (argc > 2)
        printf("\tHey! One at a time!\n");
    else if (argc == 2)
    {
        input = argv[1];              /* expression given on the command line */
        yyparse();
    }
    else
    {
        printf("? ");                 /* interactive loop */
        while (strlen(gets(thestring)) > 2)
        {
            input = &thestring[2];    /* skip the echoed "? " prompt */
            yyparse();
            printf("? ");
        }
    }
}
“input” is the input buffer, and holds the expression to evaluate. “token” is filled by yylex() with the current token. It’s useful for debugging and error reporting. Finally, in the read loop, notice that the gets() routine will read the “? ” prompt as well as the expression, this being MPW, which is why “input” points to thestring[2].
D(1). The Lexical Analyzer and Error Routines
This is about as simple as a lexical analyzer can be. The strtok() routine will return the tokens in the input string, so long as they’re separated by spaces (or by the newline at the end of input). Then if sscanf() can read a hex number, that’s what the token must be; if it isn’t a number, it must be an operator or a parenthesis, so we return the first character.
This routine is so simple it is vulnerable to pathological input -- please don’t TRY to break it! Where’s the challenge? This is just a stopgap, good enough to serve until we get a real lexical analyzer.
/* 10 */
int yylex()
{
    if (input == 0)
        token = strtok(0, " ");        /* continue in the current string */
    else
    {
        token = strtok(input, " ");    /* first call: start on a new string */
        input = 0;
    }

    if (token == 0)
        return('\n');                  /* no more tokens: signal end of input */
    else if (sscanf(token, "%x", &yylval) == 1)
        return(NUM);                   /* a hex number; yylval holds its value */
    else
        return(token[0]);              /* an operator or parenthesis */
}
The error routine is even simpler. It just prints out the parser’s default error message, which is the blindingly helpful “syntax error”. We do add the current token, which may be helpful.
/* 11 */
#define yyerror(x)                      \
{                                       \
    printf("\t%s [%s]\n", x, token);    \
    return(0);                          \
}
(This is divided into separate lines to fit in Mac Toot’s narrow columns; that’s not the way we’d write it in C, of course!) Note the return statement. yyerror() is called from within the parser routine yyparse(), so that’s what we’re returning from. The effect is to abort the translation.
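Written normally, the whole macro goes on a single line:

#define yyerror(x) { printf("\t%s [%s]\n", x, token); return(0); }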
D(2). The Grammar
The grammar we will use is the same as that developed earlier. There’s one additional production: we have to give YACC a START SYMBOL, which is the left-hand side of the first production. In this case, “prob” is short for “problem”.

prob -> expr '\n'
The newline is a single-character token returned by yylex() to signal the end of input. So if the entire input string is an expression, we’ve got a complete problem. Any other occurrence of a newline is an error.
/* 12 */
prob -> expr '\n'
expr -> expr '+' term
expr -> expr '-' term
expr -> term
term -> term '*' NUM
term -> term '/' NUM
term -> '(' expr ')'
term -> NUM
D(3). The Value Calculation and Output
Here’s the actual YACC input file, declaration and grammar sections. The declaration section consists of a single %token declaration, making NUM a terminal symbol. The grammar section includes all the productions listed above, each with some associated code.
/* 13 */
%token NUM
%%
prob    : expr '\n'
            {
            printf("\t= %X\n", $1);
            return(0);
            }
        ;

expr    : expr '+' term
            {
            $$ = $1 + $3;
            }
        | expr '-' term
            {
            $$ = $1 - $3;
            }
        | term
            {
            $$ = $1;
            }
        ;

term    : term '*' NUM
            {
            $$ = $1 * $3;
            }
        | term '/' NUM
            {
            $$ = $1 / $3;
            }
        | '(' expr ')'
            {
            $$ = $2;
            }
        | NUM
            {
            $$ = $1;
            }
        ;
D(4). The Make File
YACC goes right into your make file, just like any other MPW tool. Hmm. I thought I said that YACC was a UNIX tool... The truth is that there is an MPW version of YACC available, called MACYACC (I have renamed the tool on my system to make typing easier).
# 14
Hex.c ƒ Hex.make Hex.y
    Yacc -VHex.out Hex.y

Hex.c.o ƒ Hex.make Hex.c
    C -r Hex.c

Hex ƒƒ Hex.make Hex.c.o
    Link -w -t MPST -c 'MPS ' ∂
        Hex.c.o ∂
        "{Libraries}"Interface.o ∂
        "{CLibraries}"CRuntime.o ∂
        "{CLibraries}"StdCLib.o ∂
        "{CLibraries}"CSANELib.o ∂
        "{CLibraries}"Math.o ∂
        "{CLibraries}"CInterface.o ∂
        "{Libraries}"ToolLibs.o ∂
        -o Hex