This is part of some experimental code that I am writing to implement the FriCAS interpreter in SPAD. For an overview of this experiment see the page here. For information about how this is done in the current boot/lisp code see the page here.
Here we describe the scanner (tokeniser) for our interpreter. It takes the string holding the input line and converts it to a list of tokens.
How It Works
Each token generated by this code consists of a token type and a string holding its actual value. For instance, if the token type is 'key' then the string holds the particular keyword, such as "macro".
Token Type | Meaning |
---|---|
id | identifier such as the name of a variable |
key | keyword |
integer | A numeric integer literal. If it is negative, the sign is not held in this token; a separate '-' token precedes it. |
rinteger | |
float | A floating-point literal. Besides digits it may contain '.', 'e', 'E' and '-', which makes it difficult to scan as a single terminal value. |
string | any characters wrapped in double quotes. |
comment | |
negcomment | |
error | |
spaces | |
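A token as described above can be modelled as a (type, value) pair. The following is a minimal Python sketch of that idea, not the SPAD code itself; the names `Token`, `TOKEN_TYPES` and `make_token` are hypothetical and exist only for illustration.

```python
from typing import NamedTuple

# Token type names copied from the table above.
TOKEN_TYPES = ("id", "key", "integer", "rinteger", "float", "string",
               "comment", "negcomment", "error", "spaces")


class Token(NamedTuple):
    kind: str   # one of TOKEN_TYPES
    value: str  # the text (or keyword name) carried by the token

    def __str__(self) -> str:
        # Mirror the kind="value" form used in the session output.
        return f'{self.kind}="{self.value}"'


def make_token(kind: str, value: str) -> Token:
    """Build a token, rejecting type names that are not in the table."""
    if kind not in TOKEN_TYPES:
        raise ValueError(f"unknown token type: {kind}")
    return Token(kind, value)
```

Keeping the type as a plain name and the value as a string means later stages can test the type without reparsing the text.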
This tokeniser is driven by a state table. As we scan across the input line, the table determines the next state from the current state and the character just read.
Current State (rows) / Character Just Read (columns) | space | double quote | alphabetic | numeric | other |
---|---|---|---|---|---|
init | space | string | sym | integ | op |
space | space | string | sym | integ | op |
string | string | init | string | string | string |
sym | space | string | sym | sym (symbol names can contain numeric values) | op |
integ | space | string | sym or float if 'e' or 'E' | integ | op or float if '.' |
float | space | string | sym or float if 'e' or 'E' | float | op |
op | space | string | sym | integ | op |
comment | comment | comment | comment | comment | comment |
The 'comment' state is triggered if the text collected in the 'op' state contains '--' or '++'.
Each time the state changes a new token is added to the list being generated.
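The table-driven scan can be sketched in Python as follows. This is an illustration only, not the SPAD implementation: the names `TRANSITIONS`, `classify`, `next_state` and `scan` are hypothetical, and two behaviours are assumptions on my part, namely that an 'integ' segment is relabelled (not split) when it becomes a 'float', and that a closing double quote stays with its string. The sketch produces raw (state, text) segments; mapping them to the final token types is left out.

```python
CLASSES = ("space", "quote", "alpha", "digit", "other")


def row(*states):
    """Pair one table row with the character classes, in column order."""
    return dict(zip(CLASSES, states))


# Transcribed from the state table above: TRANSITIONS[state][class] -> next.
TRANSITIONS = {
    "init":    row("space", "string", "sym", "integ", "op"),
    "space":   row("space", "string", "sym", "integ", "op"),
    "string":  row("string", "init", "string", "string", "string"),
    "sym":     row("space", "string", "sym", "sym", "op"),
    "integ":   row("space", "string", "sym", "integ", "op"),
    "float":   row("space", "string", "sym", "float", "op"),
    "op":      row("space", "string", "sym", "integ", "op"),
    "comment": row("comment", "comment", "comment", "comment", "comment"),
}


def classify(ch):
    if ch.isspace():
        return "space"
    if ch == '"':
        return "quote"
    if ch.isalpha():
        return "alpha"
    if ch.isdigit():
        return "digit"
    return "other"


def next_state(state, ch):
    nxt = TRANSITIONS[state][classify(ch)]
    # Conditional cells of the table: 'e'/'E' and '.' lead into 'float'.
    if state in ("integ", "float") and ch in ("e", "E"):
        nxt = "float"
    if state == "integ" and ch == ".":
        nxt = "float"
    return nxt


def scan(line):
    """Split a line into (state, text) segments, one per state change."""
    segments = []
    state = "init"
    for ch in line:
        nxt = next_state(state, ch)
        closing_quote = state == "string" and nxt == "init"
        promoted = state == "integ" and nxt == "float"  # assumption: relabel
        if closing_quote:
            segments[-1][1] += ch          # quote belongs to the string
        elif nxt == state or promoted:
            segments[-1][0] = nxt
            segments[-1][1] += ch
        else:
            segments.append([nxt, ch])     # state changed: start new segment
        state = nxt
        # 'op' text containing '--' or '++' switches to the comment state.
        if state == "op" and ("--" in segments[-1][1] or "++" in segments[-1][1]):
            state = "comment"
            segments[-1][0] = "comment"
    return [(s, t) for s, t in segments]
```

With this sketch, `scan("2e-6")` stops the float at the '-' and yields three segments, which is the same shape as the float problem listed under To Do.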
In the case of errors an error token is put in the token list. There is a function to scan the list for error tokens. If it finds one, the following stages of parsing need not be carried out and the error string can be displayed.
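The error check can be sketched in Python as below. The function name `first_error` and the representation of tokens as (type, value) pairs are assumptions made for illustration; the actual SPAD function is not shown on this page.

```python
def first_error(tokens):
    """Return the value of the first error token, or None if there is none."""
    for kind, value in tokens:
        if kind == "error":
            return value
    return None


def check_before_parsing(tokens):
    """Decide whether the later parsing stages should run at all."""
    bad = first_error(tokens)
    if bad is not None:
        return f"error at: {bad}"   # display the error string, skip parsing
    return "ok"
```

Returning the offending text, not just a flag, lets the caller display it without scanning the list a second time.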
Testing It
We can try out the tokeniser in isolation by calling 'spadTokenise' from the existing interpreter. For information about downloading and compiling the code see this page.
(1) -> spadTokenise("1+2")
   (1)  [integer="1",key="PLUS",integer="2"]
                                                    Type: Tokeniser
(2) -> spadTokenise("1.0 + a3")
   (2)  [float="1.0",spaces=" ",key="PLUS",spaces=" ",id="a3"]
                                                    Type: Tokeniser
(3) -> spadTokenise("b2= -3")
   (3)  [id="b2",key="EQUAL",spaces=" ",key="MINUS",integer="3"]
                                                    Type: Tokeniser
To Do
There are still some things to be fixed:
(4) -> spadTokenise("b2=-3")
   (4)  [id="b2",error="=-",integer="3"]
                                                    Type: Tokeniser
(5) -> spadTokenise("2e-6")
   (5)  [float="2e",key="MINUS",integer="6"]
                                                    Type: Tokeniser
- Runs of non-alphanumeric symbols need to be split into separate operators. For example, a=-3 only works if we put a space between = and -.
- Floats are not yet handled correctly. For example, 3e-5 is split at the '-' instead of being scanned as a single float.
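One possible way to handle the first item is to split a run of operator characters by longest match against a table of known operators. The Python sketch below shows this idea only; it is not how the SPAD code currently works. The `OPERATORS` table is hypothetical and contains just the three keyword names that appear in the session output above.

```python
# Hypothetical operator table: only the names seen in the sessions above.
OPERATORS = {"+": "PLUS", "-": "MINUS", "=": "EQUAL"}


def split_operators(run):
    """Split a run such as '=-' into key tokens using longest match."""
    tokens = []
    longest = max(len(op) for op in OPERATORS)
    i = 0
    while i < len(run):
        # Try the longest candidate first, then shorter ones.
        for size in range(min(longest, len(run) - i), 0, -1):
            piece = run[i:i + size]
            if piece in OPERATORS:
                tokens.append(("key", OPERATORS[piece]))
                i += size
                break
        else:
            # No known operator starts here: report one character as an error.
            tokens.append(("error", run[i]))
            i += 1
    return tokens
```

Trying the longest candidate first matters once multi-character operators are added to the table, since a shorter operator could otherwise shadow a longer one that starts with the same character.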
Next step
The output of this tokeniser is passed on to the parser as described on the page here.