Lexer


The module lexer is implemented in the C source file `lexer.c'. This module converts the characters it reads into a list of tokens. The lexer recognizes the basic lexical elements, such as numbers, strings and keywords. It reads the characters provided by the reader and groups them into lexical elements. For example, whenever the lexical analyzer sees a " character it starts to process a string until it finds the closing ". When it does, the module creates a new token, links it to the end of the list and goes on.
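The string example above can be sketched in a few lines of C. This is only an illustration of the idea, not the code in `lexer.c': the Token type and the functions append_token and collect_strings are hypothetical names, escape sequences are not handled, and everything outside the quotes is simply skipped.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical token type; the real lexeme structure in lexer.c differs. */
typedef struct Token {
  char *text;            /* contents of the string literal, without quotes */
  struct Token *next;    /* next token in the list, or NULL */
} Token;

/* Append a copy of text[0..len) to the end of the list.
   Returns the new token, or NULL if memory ran out. */
static Token *append_token(Token **head, Token **tail,
                           const char *text, size_t len) {
  Token *t = malloc(sizeof *t);
  if (t == NULL) return NULL;
  t->text = malloc(len + 1);
  if (t->text == NULL) { free(t); return NULL; }
  memcpy(t->text, text, len);
  t->text[len] = '\0';
  t->next = NULL;
  if (*tail != NULL) (*tail)->next = t; else *head = t;
  *tail = t;
  return t;
}

/* Collect every "..." literal found in src into a linked list.
   Assumption: no escape sequences; an unterminated string ends the scan. */
Token *collect_strings(const char *src) {
  Token *head = NULL, *tail = NULL;
  assert(src != NULL);
  while (*src != '\0') {
    if (*src == '"') {
      const char *start = ++src;            /* first character after the quote */
      while (*src != '\0' && *src != '"') src++;
      if (*src != '"') break;               /* no closing quote found */
      if (append_token(&head, &tail, start, (size_t)(src - start)) == NULL)
        break;
    }
    src++;                                  /* skip closing quote or other char */
  }
  return head;
}
```

Calling collect_strings on the text print "hello", "world" yields a list of two tokens holding hello and world, in that order. The caller owns the list and has to free it.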

To do this the lexical analyzer has to know what is a keyword, string or number.

Because general-purpose, table-driven lexical analyzers are usually rather slow, ScriptBasic uses a proprietary lexical analyzer that is partially table driven, but not as general purpose as one created with the program LEX.

Some rules are coded into the C code of the lexical analyzer, while others are defined in tables. Even the rules coded into the C program are usually parameterized in the module object.

Let's see the module object definition from the file `lexer.c'. (Note that the C .h header files are extracted from the .c files, so there is no need to maintain function prototypes twice.)

Note, however, that this is a copy of the actual definition in the file `lexer.c', and it may have changed since I wrote this manual. At the time of writing the lexer object was:

typedef struct _LexObject {
  int (*pfGetCharacter)(void *);
  char * (*pfFileName)(void *);
  long (*pfLineNumber)(void *);
  void *pvInput;
  void *(*memory_allocating_function)(size_t, void *);
  void (*memory_releasing_function)(void *, void *);
  void *pMemorySegment;

  char *SSC;
  char *SCC;
  char *SFC;
  char *SStC;
  char *SKIP;
  char *ESCS;
  long fFlag;

  pReportFunction report;
  void *reportptr;
  int iErrorCounter;
  unsigned long fErrorFlags;

  char *buffer;
  long cbBuffer;

  pLexNASymbol pNASymbols;
  int cbNASymbolLength;
  pLexNASymbol pASymbols;
  pLexNASymbol pCSymbols;

  pLexeme pLexResult;
  pLexeme pLexCurrentLexeme;
  struct _PreprocObject *pPREP;
} LexObject, *pLexObject;

This struct contains the global variables of the lexer module. In the first "section" of the structure you can see the variables that may already sound familiar from the module reader. These parameterize the memory allocation and the input source for the module. The input functions are usually set so that the characters come from the module reader, but there is no objection in principle to using another character source for the purpose.

The variable pvInput is not altered by the module. It is only passed to the input functions. The function pointer name pfGetCharacter speaks for itself: like getc, it returns the next character. However, when this function pointer is set to point to the function reader_NextCharacter, the input is already preprocessed a bit: the include and import directives have been processed.

This has an interesting consequence that you may recognize now if you read the reader module and this module definition carefully: include and import work inside multi-line strings. (OK, I have not talked about multi-line strings so far, so do not feel ashamed if you did not realize this.)

The function pointers pfFileName and pfLineNumber should point to functions that return the file name and the line number of the last read character. This is something that a getc will not provide, but the reader functions do. This will allow the lexical analyzer to store the file name and the line number for each token.
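To make the three input callbacks more concrete, here is a minimal sketch of an alternative character source that reads from a string in memory instead of the reader module. The type StringInput and the functions strin_GetCharacter, strin_FileName and strin_LineNumber are hypothetical and not part of ScriptBasic; only the function signatures are taken from the structure shown above.

```c
#include <assert.h>
#include <stdio.h>   /* EOF, size_t */

/* Hypothetical in-memory input source; not part of ScriptBasic. */
typedef struct StringInput {
  const char *text;   /* NUL-terminated source text */
  size_t pos;         /* index of the next character to return */
  long line;          /* line number of the last character returned; start at 1 */
  char *name;         /* name reported in place of a file name */
} StringInput;

/* Same shape as int (*pfGetCharacter)(void *):
   returns the next character, or EOF at the end of the text. */
int strin_GetCharacter(void *pv) {
  StringInput *in = pv;
  assert(in != NULL);
  if (in->text[in->pos] == '\0') return EOF;
  /* The previously returned character ended a line, so this one
     belongs to the next line. */
  if (in->pos > 0 && in->text[in->pos - 1] == '\n') in->line++;
  return (unsigned char)in->text[in->pos++];
}

/* Same shape as char * (*pfFileName)(void *). */
char *strin_FileName(void *pv) {
  StringInput *in = pv;
  assert(in != NULL);
  return in->name;
}

/* Same shape as long (*pfLineNumber)(void *). */
long strin_LineNumber(void *pv) {
  StringInput *in = pv;
  assert(in != NULL);
  return in->line;
}
```

Under these assumptions, a lexer object would be wired up by storing the three functions in pfGetCharacter, pfFileName and pfLineNumber, and the address of a StringInput in pvInput, which the lexer passes back unchanged on every call.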

The next group of variables seems frightening and unreadable at first, but this book is here to explain them. These variables define what is a string, what is a symbol, what has to be treated as unimportant space and so on. In most programming languages symbols start with an alpha character and continue with alphanumeric characters. But what is an alpha character? Is _ one, or is $ a valid alphanumeric character? Well, for the lexer module, if any of these characters appears in the variable SSC then the answer is yes. The name stands for Symbol Start Characters. But let's go through all these variables one by one.
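The idea of describing character classes with plain strings can be sketched as follows. This is not the code from `lexer.c': the function symbol_length is hypothetical, and the sketch assumes that SCC lists the characters that may continue a symbol, which the text above does not state explicitly.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Return the length of the symbol at the start of s, or 0 if s does not
   start with a symbol. ssc lists the characters that may start a symbol;
   scc is assumed to list the characters that may continue one. */
size_t symbol_length(const char *s, const char *ssc, const char *scc) {
  size_t n;
  assert(s != NULL && ssc != NULL && scc != NULL);
  /* strchr also matches the terminating NUL, so test for it first. */
  if (*s == '\0' || strchr(ssc, *s) == NULL) return 0;
  n = 1;
  while (s[n] != '\0' && strchr(scc, s[n]) != NULL) n++;
  return n;
}
```

With a start set of lowercase letters plus _ and a continuation set that adds digits and $, the text a1$ = 3 begins with a symbol of length 3, while 1abc does not begin with a symbol at all. These character sets are only examples and are not the ScriptBasic defaults.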

The default values for these variables are set in the function lex_InitStructure. Interestingly these default values are perfectly ok for ScriptBasic.

The field pNASymbols points to an array that contains the non-alpha symbols list. Each element of this array contains a string that is the textual representation of the symbol and a code, which is the token code of the symbol. For example the table NASYMBOLS in file `syntax.c' is:

LexNASymbol NASYMBOLS[] = {
  { "@\\" , CMD_EXTOPQN } ,
  { "@`" , CMD_EXTOPQO } ,
  { "@'" , CMD_EXTOPQP } ,
  { "@" , CMD_EXTOPQQ } ,
  ...
  { "@" , CMD_EXTOPQ } ,
  { "^" , CMD_POWER } ,
  { "*" , CMD_MULT } ,
  { NULL, 0 }
};

When the lexical analyzer finds something that is not a string, number or alphanumeric symbol, it tries to read forward and recognize any of the non-alpha tokens listed in this table. It is extremely important that the symbols are ordered in this table so that longer symbols come first: a symbol abc must not be listed before abcd. Otherwise abcd will never be found!
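Why the order matters is easy to see in a small sketch. The code below assumes that the table is scanned from the top and that the first entry matching the input wins; the type NaSymbol and the function match_na_symbol are hypothetical stand-ins, not the actual ScriptBasic definitions.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for a table entry: symbol text and token code. */
typedef struct NaSymbol {
  const char *text;   /* NULL marks the end of the table */
  int code;
} NaSymbol;

/* Return the code of the first table entry that is a prefix of input and
   store the matched length in *len. Returns 0 and sets *len to 0 if no
   entry matches. */
int match_na_symbol(const NaSymbol *table, const char *input, size_t *len) {
  const NaSymbol *p;
  assert(table != NULL && input != NULL && len != NULL);
  for (p = table; p->text != NULL; p++) {
    size_t n = strlen(p->text);
    if (strncmp(input, p->text, n) == 0) {
      *len = n;
      return p->code;
    }
  }
  *len = 0;
  return 0;
}
```

With a table that lists <= before <, the input <= 3 is recognized as the two-character symbol. If the table lists < first, the same input matches < and the entry for <= is never reached, which is exactly the problem described above.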

The variable cbNASymbolLength is nothing to care about. This is used internally and is calculated automatically by the lexical analyzer.

The variable pASymbols is similar to pNASymbols and points to the same kind of table. This variable, however, should point to an array that contains the alphanumeric symbols. For ScriptBasic it points to the array ASYMBOLS in the file `syntax.c'.

The order of the words in this array is not important, except that listing more frequent words earlier results in faster compilation.

The field pCSymbols points to an array that is used only for debugging purposes. I mean debugging the ScriptBasic code itself, not debugging BASIC programs.

The rest of the variables are used by the functions that iterate through the list of tokens when the syntax analyzer reads the token list or to report errors during lexical analysis. Error reporting is detailed in a separate section.

The tables that list the lexical elements are not maintained "by hand". The source for ScriptBasic syntax is maintained in the file `syntax.def' and the program `syntaxer.pl' creates the C syntax file `syntax.c' from the syntax definition.

The program `syntaxer.pl' is so complex that two years after I wrote it I had a hard time understanding it, and I rather treat it as holy code: blessed and untouchable. (OK, the code is quite complex, but if a bug were found in it I could work out what I did in a few hours. After all, the brain that created that code once belonged to me.)

