The world's Best* operator precedence-aware parsing algorithm
We here at Gimbling Science take cursedness very seriously. I’m Gim! I own the place. This is my first article on this (unchanged until now) website as part of The Writing Gaggle in the Rust Community Discord Server. I’m generally pretty excited.
Cat: And so am I!
That eager voice you heard is the lovely Cat, my assistant. Rest assured, she won’t interrupt you too much. Isn’t that right, Cat?
Cat: I’ll try (meow)
She’s the backbone of this corporation. Cute as a cat (duh), too. Sorry, fellas. She’s married. To cursed!
# Right, where were we?
Let’s hold our meows together, folks, and get our weapons together. My weapon of choice is F#.
For the sake of simplicity, this article will only focus on basic math. Emphasis on basic.
Cat: Did you just~
Shhh.
# What’s the fuzz, Gim?
Now we all know the good ol’ Pratt Parsing. Some people even prefer recursive descent. But nothing is as cursed as The FORTRAN way.
Let’s dive into Wikipedia…
Another approach is to first fully parenthesize the expression, inserting a number of parentheses around each operator, such that they lead to the correct precedence even when parsed with a linear, left-to-right parser. This algorithm was used in the early FORTRAN I compiler: [7]
The Fortran I compiler would expand each operator with a sequence of parentheses. In a simplified form of the algorithm, it would
- replace
+
and–
with))+((
and))-((
, respectively;- replace
*
and/
with)*(
and)/(
, respectively;- add
((
at the beginning of each expression and after each left parenthesis in the original expression; and- add
))
at the end of the expression and before each right parenthesis in the original expression.Although not obvious, the algorithm was correct, and, in the words of Knuth, “The resulting formula is properly parenthesized, believe it or not.” [8]
Look, I said cursed for a reason. And in the words of Knuth, it works. So don’t touch it.
Cat: I don’t think Knuth said that, Gim…
# The Lexer
That means all we have is parenthesis for scoping, integers for integers, and operators for operating.
|
|
Let’s go step-by-step, understanding what’s going on here. Now I’m a Span/ReadOnlySpan shill. Functional programming doesn’t have to be slow! I blame the linked lists! The govermen~
Cat: Sir, the lexer.
Right. A brief overview of what’s going on here is this. Our main lexer is in a private function lex_
. I’ll explain later why this needs to be a thing.
We take in a parameter, containing the original source code. Strings in .NET land are arrays of chars. They even have an implicit conversion operator to ReadOnlySpan<char>
. The second being the index we’re currently at. And the third being a collector that we append our tokens into.
We get our current character by indexing the span with the current index, and we iterate by recursing with the index bumped up. This is fast! As fast as traditional looping. Although, this is talks for later.
If the current index is same as the length of the original source, we can conclude that we have reached the end of the source and we can stop lexing further. Now it’s pretty simple, our grammar isn’t really that complex. In a real compiler, I would use a lexer generator like Logos instead however.
We match over the character, ignoring the whitespace entirely and just bumping the index. Then matching over the different operators and…
Cat: WTF?!
Cat, did you not read the script? Just scroll up a bit, you lazy bastard.
At the lexer stage, we transform the operators into what the algorithm defines.
- replace
+
and–
with))+((
and))-((
, respectively;- replace
*
and/
with)*(
and)/(
, respectively;
Same with the parenthesis.
- add
((
at the beginning of each expression and after each left parenthesis in the original expression; and- add
))
at the end of the expression and before each right parenthesis in the original expression.
To lex integers, we increment the current index till source[index]
is a digit. Then we fetch a sub-span from the original span, convert it into a string, and parse it as a double
.
Cat: Sir, you remember that time you threatened people to blow them up with combustible floats?
Cat, this is desparate times.
That’s our entire lexer! That’s all we have. Fear not, war hero, we will implement this same algorithm in the future in a real language with complex expressions and not just integers. We don’t even have unops yet.
# The Parser
Our parser needs to parse into a some structure. Our grammar isn’t particularly complex enough to have a proper AST. However, we can represent an expression using a structure like this.
|
|
We even have a cute little ToString method so we can sanely print our expressions for debugging (Who knows?)
|
|
If you can understand F#, or ML, or any ML-like language, try having a go for yourself. (Definitely not making my job here easier.)
Cat: whistles He always was a lazy bu~
Pick your next words very wisely, Cat.
Fear not, our parser is separate into two parts.
# The LHS
First, we match over an integer token, if it’s matched then, then we form an integer expression. Then we match over paranthesis, no fancy business here really, we just iterate over the LParen (
and then parse an expression, ending with consuming an RParen )
. Consuming being, that we expect a specific token to be there, and bailing if the token isn’t there.
Note how we didn’t really care about the operators…till now.
# The RHS
Now we look if our LHS expression is followed with an operator, i.e Add/Mul/Sub/Div. Then, we parse another expression and we form the appropriate expression. If the operator was Add, then an addition expression, and so on.
# What the F#$%?
If you’re coming from Pratt Parsing, this will be Really odd for you. This is very similar, yet so far. We don’t care about the precedence magic in our parser At All.
But we might be forgetting something here…
- add
((
at the beginning of each expression and after each left parenthesis in the original expression; and- add
))
at the end of the expression and before each right parenthesis in the original expression.
Ahaha!~ Let’s get at it. Careful readers might’ve noticed that similar to our lexer pattern before, our parser function was named parse_
.
So, let’s a make a wrapper that implements these additions.
|
|
There we go.
# The Interpreter
Cat: Sir, It’s 22:42
Sleep is for the weak, Cat.
|
|
Look, do I need to explain this? Something something, is left to the reader as an exercise.
# The Final Result
So, Olympian/War Hero/Homeless person, did you have fun on this wild journey from the past? (Say yes, you. I still have a dozen of those combustible floats with me.)
Old algorithms are a blast, and this is no exception. It works and it works magically well. And, it’s pretty non-obvious on why. But it clicks after a hot second of “WTF?”, it simulates precedence using parenthesis depth.
Now, I’m going to hit the sack. Zzz…..
Cat: He’s gone, you should too! He sleep-walks with them combustible floats at hand. Be free! You don’t want to be in his cursed cave again!
Cat, I’m still awake.
Cat: Fuck.