Create your own scripting language in C++ with Qt (Part 1)

28 12 2011

I’m being writing my own scripting language in Qt for the past month or so. My original goal was to have immediate feedback when trying various drawing methods using Qt drawing class QPainter. First I thought I could use QtScript and make it a 2 week programming task, but I decided to spice it a bit and write the script language from scratch.

In this series of blog posts, I’m doing to teach you how I did it. I assume you know a bit about regular expressions, C++ and Qt. I will try to simplify the theory behind the best I can.

Bear in mind that my method is not the most efficient way to implement a scripting language. My goal with this project was to learn and have fun, not to worry about performance too much.

Basics of compilation

Overview of the process

In its simplest form, the compilation process is the act of translating an input into another format. Your C++ compiler takes a text file with instructions in it and translate it into machine code.

A compiler needs several building blocks to understand a language and create another output. There are the lexical analysis (Lexer), the grammar analysis (Parser),  an abstract syntax tree (AST) and an Interpreter or a CodeGenerator in a compiled language. I left out the semantic analysis because in our case, it will be done in the interpreter.

What’s lexical analysis and grammar analysis ? We going to find out in the following sections.

Lexical analysis

First let’s take a English sentence and identify the various lexical class.

My father gave me a nice gift.

If you were paying any attention in our English or French class. You will identify these as

Fragment Lexical class
My article
father noun
gave verb
me pronoun
a article
nice adjective
gift noun

Identifying the various components of a sentence helps us in the grammar analysis by giving us a simplified representation of the sentence. If we put the sentence like this:

article noun verb pronoun article adjective noun

It is now easier to tell that our sentence is grammatically correct. If we shuffle the words around a bit like this:

father my me gave nice a gift ->
noun article pronoun verb adjective article noun

We’re still able to identify the various components but the sentence make no sense from a grammar point of view. A noun is not supposed to be followed by a article for instance.

In a programming language, we can apply the same process

void PrintDebugInformation(const char *format, int arg1)
{
    printf(format,arg1);
}

Here’s the various components

Fragment Lexical class
void KEYWORD_VOID
PrintDebugInformation IDENTIFIER
( LEFT_PARENTHESIS
const KEYWORD_CONST
char KEYWORD_CHAR
* POINTER_SIGN
format IDENTIFIER
, COMMA
int KEYWORD_INT
arg1 IDENTIFIER
) RIGHT_PARENTHESIS
{ LEFT_BLOCK_SIGN
printf IDENTIFIER
( LEFT_PARENTHESIS
format IDENTIFIER
, COMMA
arg1 IDENTIFIER
) RIGHT_PARENTHESIS
; STATEMENT_END
} RIGHT_BLOCK_SIGN

Thus the job of a Lexer is to identify lexical classes, called tokens, from the input text. Like the English grammar analysis, it will simplify the grammar analysis. Note that I named the lexical class all in caps. It helps to separate them from a grammar rule.

Grammar analysis

A grammar is a set of rules to describe a syntax, A syntax is how the lexical classes are structured together to create a cohesive and logical form.  Let’s take a C function call as an example.

printf(format,arg1);

The lexical analysis give us this:

IDENTIFIER LEFT_PARENTHESIS IDENTIFIER COMMA IDENTIFIER RIGHT_PARENTHESIS STATEMENT_END

We could formulate a rule to identify a function call like:

functionCall ::= IDENTIFIER LEFT_PARENTHESIS IDENTIFIER COMMA IDENTIFIER RIGHT_PARENTHESIS STATEMENT_END

But it assume we pass only 2 arguments to our functionCall. What if we want to support 0, 1 or more arguments ? What if we could use the zero-or-more operator(*) from regular expression ? Using the * operator, our functionCall rule looks like this

functionCall ::= IDENTIFIER LEFT_PARENTHESIS IDENTIFIER (COMMA IDENTIFIER)* RIGHT_PARENTHESIS STATEMENT_END

Okay, now we can express various function arguments, but what if our functionCall does not have arguments ? We can use another regular expression operator, the zero-or-one operator(?) on the first IDENTIFIER after the parenthesis to parse the functionCall correctly.

functionCall ::= IDENTIFIER LEFT_PARENTHESIS IDENTIFIER? (COMMA IDENTIFIER)* RIGHT_PARENTHESIS STATEMENT_END

We should be fine with our functionCall rule right ?  What if we want to pass a number to a function ? We could use the alternative operator(|) right ?

functionCall ::= IDENTIFIER LEFT_PARENTHESIS (IDENTIFIER|NUMBER)? (COMMA (IDENTIFIER|NUMBER))* RIGHT_PARENTHESIS STATEMENT_END

While it could work if we only want to support variables and numbers as function arguments, it is not flexible enough if we want to support different types of arguments like a math expression, a function call, etc. We will use another grammar rule to express the term of a functionCall

term ::= IDENTIFIER | NUMBER
functionCall ::= IDENTIFIER LEFT_PARENTHESIS term? (COMMA term)* RIGHT_PARENTHESIS STATEMENT_END

During the parsing process, the term rule in the functionCall rule will be expanded into the right lexical class. Here’s an example of a concrete parse tree of the function call defined above. As you can see, our lexical classes are all leaves in the tree. The goal of a rule is to create leaf nodes in a parse tree (conceptually of course).

The Parser job is to take a text input and validate the syntax of a language. It uses the grammar rules to define the order of parsing. But it can do more than just validate the syntax of a language. For each rule, we can add code to be executed each time the parser encounter that rule. It can be used to create the nodes in the Abtract Syntax Tree for instance.

In a next part, we’ll go deeper into defining grammar rules and how it translate to C++ code.

Abstract Syntax Tree

An abstract syntax tree is a tree thats represent operations in your language. It is called abstract because it doesn’t store the whole source code, it only store the data needed for each operation. Why we don’t store the whole information ? Because most information is redundant or only used to identify a language construct. Let’s take a variable declaration statement.

int myVariable = 3*5*x;

It generates this Abtract Syntax Tree:

As you can see, the int type, equal sign and the semicolon are not expressed in the tree. We just don’t need them. I also put the data stored in each node to show that an abstract syntax is not a dumb representation of the source code syntax, it is used for interpretation or code generation.

Interpreter

An Interpreter takes an AST, traverse it and execute an operation on each node. In a nutshell, it’s the Interpreter pattern in the GoF Design Patterns book. We could also use the Visitor pattern to visit each node and make the AST independent of the code execution mechanism (ex: interpreted, JIT, compiled)

Conclusion

In this part, I introduced you to various concepts related to compilation: lexing, parsing, abstract tree and interpretation. In the next part, we will start coding the scripting language by implementing mathematic expression in our scripting language. Don’t worry, I won’t fall in the trap of others compilations tutorial, I will go further than math expressions 🙂

Advertisements