I’ve been working with PHP for years, and I studied C++ mainly for educational purposes. So I decided to combine both worlds and start writing my own interpreter, documenting the journey from day zero.
For example, today I learned how bytecode is generated. Because of that, I finally understand tools like OPcache. If you spot any inaccuracies, I'd be happy for you to point them out. Happy reading...
Architecture
The Jim PHP architecture is divided into three levels.
Each level behaves like an object, and these three objects communicate
with each other.
LEXER - Splits the source code into tokens
PARSER - Builds the AST (Abstract Syntax Tree) from the tokens
INTERPRETER - Analyzes the AST and executes its nodes
Other Details
This project is inspired by Jim Tcl by Salvatore Sanfilippo.
Jim PHP follows a different approach in its architecture.
The Lexer (Tokenizer) follows a similar philosophy, but the Parser and Interpreter
are based on different ideas.
Note: Jim PHP uses an AST-based interpreter, not a runtime-oriented one like Jim Tcl.
Daily Goal
[DONE] DAY ZERO
Set up Git and GitHub. Studied the general architecture.
Wrote the README file and the CMakeLists.txt.
Understood the basic structure and goals.
[DONE] DAY ONE
Started studying how PHP code could be executed. Jim PHP can run code in three ways, similar to Jim Tcl:
- Inline string: std::string php_code = "1+1;"; (for testing only)
- Command line: jimphp -r 'echo 1+1;'
- File execution: jimphp sum.php
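The three ways of running code listed above could be told apart with a small argument check. This is only a sketch: RunMode, RunRequest, readFile, and parseArgs are hypothetical names for illustration, not taken from the Jim PHP sources, and the fallback to a hardcoded string when no arguments are given is an assumption.

```cpp
#include <cassert>
#include <fstream>
#include <sstream>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical names: not part of the actual Jim PHP code base.
enum class RunMode { Inline, CommandLine, File };

struct RunRequest {
    RunMode mode;
    std::string code;  // PHP source that would be handed to the Lexer
};

// Read a whole file into a string; throws if the file cannot be opened.
std::string readFile(const std::string& path) {
    std::ifstream in(path);
    if (!in) throw std::runtime_error("cannot open " + path);
    std::ostringstream buffer;
    buffer << in.rdbuf();
    return buffer.str();
}

// args holds the command-line arguments without the program name.
RunRequest parseArgs(const std::vector<std::string>& args) {
    if (args.empty()) {
        // Assumption: with no arguments, use a hardcoded string (testing only).
        return {RunMode::Inline, "1+1;"};
    }
    if (args[0] == "-r") {
        if (args.size() < 2) throw std::invalid_argument("-r needs a code string");
        return {RunMode::CommandLine, args[1]};
    }
    // Anything else is treated as a path to a PHP file.
    return {RunMode::File, readFile(args[0])};
}
```

From main, this could be called as parseArgs(std::vector<std::string>(argv + 1, argv + argc)), so that jimphp -r 'echo 1+1;' selects the command-line mode and jimphp sum.php selects file execution.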
Worked on inline string execution and Lexer implementation with a token structure. We need tokens because the Parser will operate on individual tokens. Lexer.cpp can now tokenize expressions — "1+1" becomes "1", "+", "1".
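As a rough illustration of this first tokenizing step, here is a minimal sketch that splits "1+1" into three tokens. The TokenType and Token definitions and the tokenize function are assumptions made for the example; the real Lexer.cpp may be organized differently.

```cpp
#include <cassert>
#include <cctype>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical token layout; the real Jim PHP token structure may differ.
enum class TokenType { Number, Operator };

struct Token {
    TokenType type;
    std::string text;
};

// Split a simple arithmetic expression into number and operator tokens.
std::vector<Token> tokenize(const std::string& source) {
    std::vector<Token> tokens;
    std::size_t i = 0;
    while (i < source.size()) {
        unsigned char c = static_cast<unsigned char>(source[i]);
        if (std::isspace(c)) {
            ++i;  // whitespace separates tokens but is not a token itself
        } else if (std::isdigit(c)) {
            // Consume a whole run of digits as one number token.
            std::size_t start = i;
            while (i < source.size() &&
                   std::isdigit(static_cast<unsigned char>(source[i]))) {
                ++i;
            }
            tokens.push_back({TokenType::Number, source.substr(start, i - start)});
        } else if (c == '+' || c == '-' || c == '*' || c == '/') {
            tokens.push_back({TokenType::Operator, std::string(1, source[i])});
            ++i;
        } else {
            throw std::runtime_error(std::string("Unknown character: ") + source[i]);
        }
    }
    return tokens;
}
```

Calling tokenize("1+1") returns three tokens with the texts "1", "+", and "1", which is the shape the Parser would consume one token at a time.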
[DONE] DAY TWO
Started fixing issues in Lexer.cpp.
Issue #1: If you hardcode a PHP expression like:
std::string php_code = "(10.2+0.5(2-0.4))2+(2.14)";
the Lexer would return "Unknown character" because it didn’t yet recognize symbols like ), {, and so on.
Yesterday (day one), Jim PHP was tested only with simple expressions like "1+1". Obviously, that’s not acceptable. We needed a better Lexer that can tokenize code more accurately and recognize symbols properly.
Jim PHP now implements these categories for token structures:
- Char Tokens: a-z A-Z and _
- Num Tokens: 0-9
- Punct (Punctuation) Tokens: . , : ;
- Oper (Operator) Tokens: + - * / = % ^
- Parent (Parenthesis) Tokens: ()[]{}
- SChar (Special char) Tokens: ! @ # $ & ? < > \ | ' " and == != >= <= && ||
This way we can write more complex PHP expressions like:
std::string php_code = "$hello = 5.5 + 10 * (3 - 1); // test! @#|_\\";
Result:
SCHAR: $ | CHAR: hello_user | OPER: = | NUM: 5 | PUNCT: . | NUM: 5...
At this point the Lexer can recognize complex PHP expressions.
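The token categories from day two can be sketched as a small classifier. The category names follow the list in this post, but the functions classify, isTwoCharSymbol, lex, label, and describe are hypothetical, and so is the "PARENT" label. Grouping letters and digits into runs and checking two-character symbols before single characters are assumptions about how the real Lexer behaves.

```cpp
#include <cassert>
#include <cctype>
#include <string>
#include <vector>

enum class Category { Char, Num, Punct, Oper, Parent, SChar, Unknown };

struct Token {
    Category category;
    std::string text;
};

// Classify a single character into one of the token categories.
Category classify(char ch) {
    if ((ch >= 'a' && ch <= 'z') || (ch >= 'A' && ch <= 'Z') || ch == '_') return Category::Char;
    if (ch >= '0' && ch <= '9') return Category::Num;
    auto in = [ch](const char* set) { return std::string(set).find(ch) != std::string::npos; };
    if (in(".,:;")) return Category::Punct;
    if (in("+-*/=%^")) return Category::Oper;
    if (in("()[]{}")) return Category::Parent;
    if (in("!@#$&?<>\\|'\"")) return Category::SChar;
    return Category::Unknown;
}

bool isTwoCharSymbol(const std::string& s) {
    return s == "==" || s == "!=" || s == ">=" || s == "<=" || s == "&&" || s == "||";
}

std::vector<Token> lex(const std::string& src) {
    std::vector<Token> out;
    std::size_t i = 0;
    while (i < src.size()) {
        if (std::isspace(static_cast<unsigned char>(src[i]))) { ++i; continue; }
        // Two-character symbols come first so "==" is not split into "=" and "=".
        if (i + 1 < src.size() && isTwoCharSymbol(src.substr(i, 2))) {
            out.push_back({Category::SChar, src.substr(i, 2)});
            i += 2;
            continue;
        }
        Category cat = classify(src[i]);
        std::size_t start = i++;
        // Letters and digits are grouped into runs; other categories stay single.
        if (cat == Category::Char || cat == Category::Num) {
            while (i < src.size() && classify(src[i]) == cat) ++i;
        }
        out.push_back({cat, src.substr(start, i - start)});
    }
    return out;
}

std::string label(Category c) {
    switch (c) {
        case Category::Char:   return "CHAR";
        case Category::Num:    return "NUM";
        case Category::Punct:  return "PUNCT";
        case Category::Oper:   return "OPER";
        case Category::Parent: return "PARENT";
        case Category::SChar:  return "SCHAR";
        default:               return "UNKNOWN";
    }
}

// Join tokens as "LABEL: text | LABEL: text | ...".
std::string describe(const std::vector<Token>& tokens) {
    std::string out;
    for (std::size_t i = 0; i < tokens.size(); ++i) {
        if (i > 0) out += " | ";
        out += label(tokens[i].category) + ": " + tokens[i].text;
    }
    return out;
}
```

With this sketch, describe(lex("$hello = 5.5")) gives "SCHAR: $ | CHAR: hello | OPER: = | NUM: 5 | PUNCT: . | NUM: 5". Note that a decimal number is still three tokens at this stage, so joining them would be left to a later step.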
[DONE] DAY THREE
Today we need to understand how these tokens will be handled in order to build the AST (Abstract Syntax Tree). We are now in the parser stage: after the Lexer, the next step is to build the AST from the generated tokens.
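My first question was: what is an Abstract Syntax Tree? In very simple terms, it is a tree that describes the structure of the code: each node is an operation or a value, and details that only matter to the human reader, such as spaces, are gone. Conceptually, it is like a contract between the human writing the code and the program that has to execute it: the code must be organized and cleaned into this form before it can be run.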
Let's take this expression: 3 + 5 * 2
The Lexer has already dropped the unnecessary spaces, so the tree only has to represent the operations, like this:

The root is the addition:                +
                                        / \
It needs 3 and the multiplication:     3   *
                                          / \
The multiplication is between:           5   2

Because the multiplication sits deeper in the tree, it is evaluated before the addition: 5 * 2 = 10, then 3 + 10 = 13.
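To make the tree for 3 + 5 * 2 concrete, here is a minimal sketch that builds it by hand and evaluates it recursively. The Node layout and the helpers number, binary, buildExample, and evaluate are hypothetical, made up for this example. A real parser would build the same shape from the tokens instead of hardcoding it.

```cpp
#include <cassert>
#include <memory>
#include <stdexcept>
#include <utility>

// Hypothetical node layout, for illustration only.
struct Node {
    char op;       // '+', '-', '*', '/' for operators, 0 for a number leaf
    double value;  // used only when op == 0
    std::unique_ptr<Node> left;
    std::unique_ptr<Node> right;
};

// Create a leaf node holding a number.
std::unique_ptr<Node> number(double v) {
    auto n = std::make_unique<Node>();
    n->op = 0;
    n->value = v;
    return n;
}

// Create an operator node with two children.
std::unique_ptr<Node> binary(char op, std::unique_ptr<Node> l, std::unique_ptr<Node> r) {
    auto n = std::make_unique<Node>();
    n->op = op;
    n->value = 0;
    n->left = std::move(l);
    n->right = std::move(r);
    return n;
}

// Builds the tree for 3 + 5 * 2: the addition is the root,
// the multiplication is its right child.
std::unique_ptr<Node> buildExample() {
    return binary('+', number(3), binary('*', number(5), number(2)));
}

// Evaluate children first, then apply the operator of the current node.
double evaluate(const Node& node) {
    if (node.op == 0) return node.value;
    assert(node.left && node.right);  // operators always have two children
    double l = evaluate(*node.left);
    double r = evaluate(*node.right);
    switch (node.op) {
        case '+': return l + r;
        case '-': return l - r;
        case '*': return l * r;
        case '/': return l / r;
        default: throw std::runtime_error("unknown operator");
    }
}
```

Evaluating the example tree returns 13: the recursion reaches the multiplication node first, computes 10, and only then performs the addition at the root. Operator precedence is encoded by the shape of the tree, so the interpreter does not need to know about it.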
Text limit reached here... Continue reading the README on GitHub if you like.