Compiling SPAD

This is one of a set of pages about the Axiom CAS and its forks, particularly FriCAS; these pages start here. This particular page discusses the SPAD compiler. For an attempt at specifying an approximate syntax for SPAD, see this page.

There are many possible improvements that could be made to SPAD.

Although there is a lot of good work going on in the various Axiom forks (and a lot of divergence between them), the improvements seem very gradual. I have therefore been thinking about what the issues would be in building an alternative SPAD compiler.

A common way to generate a compiler is to define the language recursively using a BNF-like notation; the compiler can then be generated automatically by programs such as ANTLR or YACC. The compilation is divided into stages such as lexical analysis, parsing, semantic analysis and code generation.

Unfortunately SPAD can't be defined in this way because of its special syntax, in particular the requirement for pattern matching of expressions. In order to know which function to call, the compiler needs to know not only the name of the function but also the types of the parameters and the return type of the function. Where function calls are nested this requires recursive matching.

SPAD therefore uses a style of parser known as a Pratt parser (see box on right), which is driven by a table of operator precedences.

Pratt parser

Based on a theory of parsing presented in:
Pratt, Vaughan R., "Top Down Operator Precedence", ACM Symposium on Principles of Programming Languages, Boston, MA, October 1973.
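The mechanism can be sketched with a toy Pratt parser, written here in Python rather than the Lisp actually used. The binding-power pairs are borrowed from the Led table further down this page, but the tokenizer and everything else are invented for illustration:

```python
import re

# Toy Pratt parser sketch (Python, not the actual Lisp code).
# Binding powers borrowed from the Led table on this page:
# (left-binding-power, right-binding-power). Everything else invented.
LED = {"+": (700, 701), "*": (800, 801), "**": (900, 901)}

def tokenize(src):
    # identifiers plus the three operators we model
    return re.findall(r"\*\*|[+*]|\w+", src)

def parse(tokens):
    pos = 0

    def expression(min_bp):
        # NUD: in this toy grammar every operand is a bare identifier
        nonlocal pos
        left = tokens[pos]
        pos += 1
        # LED loop: consume operators while they bind tightly enough
        while pos < len(tokens) and tokens[pos] in LED:
            lbp, rbp = LED[tokens[pos]]
            if lbp < min_bp:
                break
            op = tokens[pos]
            pos += 1
            left = (op, left, expression(rbp))
        return left

    return expression(0)

print(parse(tokenize("a + b * c ** d")))
# → ('+', 'a', ('*', 'b', ('**', 'c', 'd')))
```

Note that there are no grammar rules at all: adding an operator to the language is just adding a row to the table, which is why the real parser stays so small.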

The SPAD compiler and the interpreter are both driven from database files (interp.daase, operation.daase, category.daase, compress.daase, browse.daase), and this involves a lot of custom code (different for the compiler and the interpreter) written in Lisp (and possibly 'boot' code, see below).

'Boot' code is part way between Lisp and SPAD. Presumably the compiler is not written in SPAD itself to avoid the bootstrapping issue. Axiom no longer uses Boot code; however, Boot has special support for pattern matching, so it is still used by FriCAS.

Could we combine the best features of both compiler types?

It seems to me that this pattern matching is an issue concerned with the way the language is presented in text. That is, an operator like * is defined differently for multiplication of, say, floating point numbers and multiplication of, say, matrices, but we use the same symbol because that is the established mathematical notation.

So we need to convert from a single symbol, convenient for human readers, to the many different function calls required by the computer. Could this stage be separated out as a front end, allowing the rest to be done using a standard parser generator?

So I was thinking of a first stage Pratt compiler which would take a line like:

myVector * myMatrix

and convert it to an intermediate form like:

_*(myVector:Vector DF,myMatrix:Matrix DF)$Matrix DF

This intermediate form would be hard for humans to read, but it could easily be handled by a parser built with a standard parser generator.
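The proposed first stage could be sketched roughly like this. Note that the type environment and the signature table below are entirely invented for the example, and real resolution would have to recurse through nested expressions:

```python
# Toy sketch of the proposed first stage: rewrite an overloaded operator
# into an explicitly qualified call. The type environment and signature
# table are invented for this example; real resolution is recursive.

env = {"myVector": "Vector DF", "myMatrix": "Matrix DF"}

# which domain exports the operation for these argument types (invented)
signatures = {("*", "Vector DF", "Matrix DF"): "Matrix DF"}

def qualify(op, left, right):
    lt, rt = env[left], env[right]
    result = signatures[(op, lt, rt)]
    return "_%s(%s:%s,%s:%s)$%s" % (op, left, lt, right, rt, result)

print(qualify("*", "myVector", "myMatrix"))
# → _*(myVector:Vector DF,myMatrix:Matrix DF)$Matrix DF
```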

Current Parser

The current parser has the following stages:

Stage 1: Spad syntax.

Stage 2: Prefix internal language.

Stage 3: Lisp s-expressions, using "(DEF ...", "(IN ...", "(LEAVE ..." and other forms that tend to mirror Spad language constructs. These are the equivalent of AST forms in a conventional compiler. Lisp can easily manipulate these forms, and a lot of optimizations are applied before they get compiled.

Stage 4: Common Lisp. When using GCL, the Common Lisp gets translated to C and the C is compiled using GCC.
Tim:

The issue of better error messages is a question of keeping context around. Good error messages require that you keep a trail of where you are in the original code and how each sub-expression in the original code relates to the compiled form. This is a matter of careful book-keeping. It also requires that you relate the failure to the intention of the compiler, a more subtle issue. Even more subtle is the nuance of the language details. Good error messages are hard.

The Spad compiler was written as a research effort to explore ideas like encapsulation and dispatching not only on the types of the arguments but also on the type of the returned value (something very few languages allow, even today). The people who wrote it were also the only users of it, so the error messages were perfectly obvious to us.


Waldek:

The Spad compiler is divided into stages. Lexical analysis and parsing are not a serious problem. Standard tools could help a bit, but the part that tools can handle is actually very simple, while some other aspects (like whitespace sensitivity) must be done in an ad-hoc way because the tools do not support them.

When one comes to semantic analysis, things get more tricky. First, when trying to write a simple specification of what the compiler should do, one quickly sees that the desired behaviour is not computable. So one has to introduce restrictions to turn the problem into a computable one (while retaining as many of the nice features as possible). The second difficulty is that the compiler currently does rather strange things in some places, but if one tries to disable such behaviour the algebra no longer compiles.

Let me add that I do not expect really good error messages from the Spad parser or lexer: basically Spad syntax contains so little redundancy that the parser has no chance to guess what malformed code was intended to say. But the messages could be better than they are now: currently indentation and parentheses get mixed up between the lexer and the parser, so indentation errors, and similarly unbalanced parentheses, give quite strange error messages. I have experimental code which uses the Spad parser with the lexer from the interpreter. When ready, this code should give much clearer messages about indentation errors and unbalanced parentheses. But at the moment it interprets indentation differently than old Spad, so I need to fix this.

Concerning tools: of the existing tools, in my experience parser and lexer generators are useful; the others I have found of little use. In particular I have met generators of data structures and tree-walker generators. I would say that the language itself should have appropriate support for data structures, and tools do not change much here. Similarly for tree walkers: we use tree walkers to perform some operation, and typically the tree-walker code is small compared to the operations. Actually, in my experience ML-style pattern matching helps more for creating tree walkers than external tools do.

As I wrote, for FriCAS a lexer generator would help. But we have a working lexer; it has a few hundred lines, so not using a lexer generator is not a big deal. I have doubts whether a parser generator would help: Spad syntax is very well suited to the existing parser and not so well to typical parser generators. More precisely, the bulk of the syntax is in operator priorities. Because we use _four_ priorities per operator, Spad priorities are awkward to describe in a parser generator. On the other hand, the part of the FriCAS parser handling priorities is quite small, and the other parts of Spad syntax are rather simple, so it is easy to handle them in a hand-written parser. For some comparison: the current parser has 543 lines, _including_ the priority tables. The previous version, using a home-grown parser generator (META), had slightly more than 200 lines. The grammar rules of a PL/1 parser have more than 600 lines. The grammar plus actions for GNU Pascal have more than 2000 lines. So the potential gain from using a parser generator is quite limited.

As I wrote, for tree walkers I find ML-style pattern matching quite useful. Now, Boot has special support for pattern matching -- not as good as ML's, but better than that of many "standard" languages.

Now, if you want to use ML (or Haskell, ...) and associated tools and develop an alternative compiler, then go ahead.

However:

- Both the compiler and the Spad runtime have to deal with types. It is preferable to share the type machinery.

- We have the interpreter, and I feel that the interpreter and the Spad compiler could share a lot (currently there is only limited sharing).

So the code of the Spad compiler needs to communicate with the algebra code, which is easy if the Spad compiler is written in Spad, Boot or Lisp but gets tricky otherwise. Let me say that I considered using a lexer generator and communicating with the lexer via an FFI -- my conclusion was that the gain from the lexer generator was not worth the trouble with the FFI.

Concerning the difficulty of compiling Spad: the main work is in type checking. Parametric types, overloading and conditionals involving types make it more difficult than in other languages -- doing it really well is a research topic. After type checking it is possible to generate code relatively easily. The Spad compiler could do more optimizations after type checking, and code generation could probably be more efficient. But currently most of the compile time is spent in type checking, and type checking generates the error messages, so clearly improvements in type checking will be visible to users. Also, type checking currently resolves overloading and after that essentially forgets the types (it keeps only the type of the result). This means that to do more after type checking one probably needs to re-work type checking to keep types (and other information) around longer.


From newaux.lisp.pamphlet

Some operators have both a Led and a Nud property (e.g. - ). This terminology is from the Pratt parser.

The translator for SPAD is a modification of the Pratt parser which branches to special handlers when it is most convenient and practical to do so (Pratt's scheme cannot handle local contexts very easily).

Both LEDs and NUDs have right and left binding powers. This is meaningful for prefix and infix operators. These powers are stored as the values of the LED and NUD properties of an atom, if the atom has such a property.


The Special-Handler is the name of a function to be evaluated when that keyword is encountered.

The default values of the Left and Right Binding-Powers are NIL. NIL is a legitimate value signifying no precedence. If the Special-Handler is NIL, this is just an ordinary operator (as opposed to an operator needing special handling, like if-then-else).
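The pair of binding powers also encodes associativity. Assuming the usual Pratt convention (the parser consumes the next operator only while its left binding power is at least the current right binding power), a right power one above the left gives left associativity, and one below gives right associativity; a sketch:

```python
# With the usual Pratt loop ("consume the next operator only while its
# left binding power is >= the current right binding power"), the pair
# of powers encodes associativity:
#   rbp = lbp + 1  ->  left-associative   (a*b*c parses as (a*b)*c)
#   rbp = lbp - 1  ->  right-associative  (a:=b:=c parses as a:=(b:=c))

def associativity(lbp, rbp):
    return "left" if rbp > lbp else "right"

print(associativity(800, 801))   # *  from the Led table -> left
print(associativity(125, 124))   # := (LET)              -> right
```

This is why * below has powers 800/801 while := (LET) has 125/124.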

Led table

Used for infix/suffix operators. That is, an operator with a Led property takes an operand on its left.

Operator Left-Binding-Power Right-Binding-Power Special handling
* 800 801  
|rem| 800 801  
|mod| 800 801  
|quo| 800 801  
|div| 800 801  
/ 800 801  
** 900 901  
^ 900 901  
|exquo| 800 801  
+ 700 701  
\- 700 701  
\-\> 1001 1002  
\<\- 1001 1002  
\: 996 997  
\:\: 996 997  
\@ 996 997  
|pretend| 995 996  
\.
\! 1002 1001  
\, 110 111  
\; 81 82 PARSE-SemiColon
\< 400 400  
\> 400 400  
\<\< 400 400  
\>\> 400 400  
\<= 400 400  
\>= 400 400  
= 400 400  
^= 400 400  
\~= 400 400  
|in| 400 400  
|case| 400 400  
|add| 400 120  
|with| 2000 400 PARSE-InfixWith
|has| 400 400  
|where| 121 104  
|when| 112 190  
|otherwise| 119 190 PARSE-Suffix
|is| 400 400  
|isnt| 400 400  
|and| 250 251  
|or| 200 201  
/\\ 250 251  
\\/ 200 201  
\.\. SEGMENT 401 699 PARSE-Seg
=\> 123 103  
+-\> 995 112  
== DEF 122 121  
==\> MDEF 122 121  
\| 108 111  
\:- LETD 125 124  
\:= LET 125 124  

Nud table

Used for prefix/nilfix operators. That is, an operator with a Nud property takes no operand on its left.

Operator Left-Binding-Power Right-Binding-Power Special Handling
|for| 130 350 PARSE-Loop
|while| 130 190 PARSE-Loop
|until| 130 190 PARSE-Loop
|repeat| 130 190 PARSE-Loop
|import| 120 0 PARSE-Import
|unless|
|add| 900 120  
|with| 1000 300 PARSE-With
|has| 400 400  
\- 701 700  
\+ 701 700  
\# 999 998  
\! 1002 1001  
\' 999 999 PARSE-Data
\<\< 122 120 PARSE-LabelExpr
\>\>
^ 260 259 NIL
\-\> 1001 1002  
\: 194 195  
|not| 260 259 NIL
\~ 260 259 nil
\= 400 700  
|return| 202 201 PARSE-Return
|leave| 202 201 PARSE-Leave
|exit| 202 201 PARSE-Exit
|from|
|iterate|
|yield|
|if| 130 0 PARSE-Conditional
\| 0 190  
|suchthat|
|then| 0 114  
|else| 0 114  

Gliph Table

Gliphs are symbol clumps. The gliph property of a symbol gives the tree describing the tokens which begin with that symbol.

The token reader uses the gliph property to determine the longest token.
Thus [[:=]] is read as one token not as [[:]] followed by [[=]].

<<GLIPHTable>>=
(mapcar #'(lambda (x) (makeprop (car x) 'gliph (cdr x)))
`(
( \| (\)) )
( * (*) )
( \( (<) (\|) )
( + (- (>)) )
( - (>) )
( < (=) (<) )
;; ( / (\\) ) breaks */xxx
( \\ (/) )
( > (=) (>) (\)))
( = (= (>)) (>) )
( \. (\.) )
( ^ (=) )
( \~ (=) )
( \: (=) (-) (\:))))
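The longest-match rule that this table implements can be modelled as a small trie walk. The Python below is an illustration only (and covers just a few of the real entries), not the actual token reader:

```python
# Model of the gliph longest-match rule: each starting symbol maps to a
# tree of continuations and the reader follows it as far as possible.
# Only a few of the real entries are modelled here.

GLIPH = {
    ":": {"=": {}, "-": {}, ":": {}},        # :=  :-  ::
    "-": {">": {}},                           # ->
    "=": {">": {}, "=": {">": {}}},           # ==  ==>  =>
}

def read_token(src, i):
    """Return (token, next_index), taking the longest gliph match."""
    tok, tree = src[i], GLIPH.get(src[i], {})
    i += 1
    while i < len(src) and src[i] in tree:
        tree = tree[src[i]]
        tok += src[i]
        i += 1
    return tok, i

print(read_token(":= 1", 0))    # → (':=', 2), one token, not ':' then '='
print(read_token("==> x", 0))   # → ('==>', 3)
```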

Rename Token Table

RENAMETOK defines alternate token strings which can be used on different
keyboards to enter equivalent tokens.
<<RENAMETOKTable>>=
(mapcar
#'(lambda (x) (MAKEPROP (CAR X) 'RENAMETOK (CADR X)) (MAKENEWOP X NIL))
'((\(\| \[) ; (| |) means []
(\|\) \])
(\(< \{) ; (< >) means {}
(>\) \})))

Generic function table

GENERIC operators may be suffixed by [[$]] qualifications in SPAD code.
[[$]] is then followed by a domain label, such as I for Integer, which
signifies which domain the operator refers to. For example [[+$Integer]]
is [[+]] for Integers.
<<GENERICTable>>=
(mapcar #'(lambda (x) (MAKEPROP X 'GENERIC 'TRUE))
'(- = * |rem| |mod| |quo| |div| / ** |exquo| + - < > <= >= ^= ))

Character Syntax Table

<<CharacterSyntaxTable>>=
(defun SPECIALCASESYNTAX () (OR (AND (char= TOK #\#) (DIGITP CHR))))

(defun TERMINATOR (CHR)
 (member CHR '(#\Space #\( #\) #\. #\; #\, #\Return) :test #'char=))



metadata block
see also:
Correspondence about this page


This site may have errors. Don't use for critical systems.

Copyright (c) 1998-2017 Martin John Baker - All rights reserved - privacy policy.