Foreword
Recently, while working on a BI product, a client requested a feature to configure summary rows in table displays, similar to the SUM
function in Excel. This feature would automatically calculate and display the sum of all row data in the configured summary columns.
Implementing simple functions like SUM
or other mathematical functions is not complex. The challenge lies in the fact that for certain columns, direct addition doesn’t hold meaning. For instance, metrics like CTR
(Click-Through Rate), which are inherently percentages, require summing the click count and impression count separately before calculating the ratio for the summary.
Some BI tools offer basic calculation formulas for metric manipulation, while tools like Excel or online spreadsheets provide more complex functions.
In the aforementioned scenario, providing users with an input box to enter formulas and mathematical expressions would significantly enhance flexibility. However, from a technical implementation standpoint, handling this with simple regular expression parsing or similar methods becomes impractical. The product manager, based on typical use cases, designed a generic calculation template that functions like a drop-down menu for selecting formulas and operators. However, I personally find this approach somewhat inelegant and consider it an interim solution.
Therefore, I decided to apply some of the compiler principles I learned in university to explore how to tackle the input parsing problem. The goal was to determine if providing a direct input box for users could achieve a more flexible approach.
Compiler Principles Overview
The university textbook we used for compiler principles was the renowned “Dragon Book,” so named because of its purple dragon cover. The actual English title is “Compilers: Principles, Techniques, and Tools”.
The general steps involved in a compiler are as follows:
high level
language
│
┌─────────▼─────────┐
│ lexical analyzer │
└─────────┬─────────┘
│
┌─────────▼─────────┐
│ syntax analyzer │
└─────────┬─────────┘
│
┌─────────▼─────────┐
│ semantic analyzer │
┌──────────────┐ └─────────┬─────────┘ ┌──────────────┐
│ symbol table │ │ │error handling│
└──────────────┘ ┌─────────▼─────────┐ └──────────────┘
│ intermediate code │
│ generator │
└─────────┬─────────┘
│
┌─────────▼─────────┐
│ code optimiser │
└─────────┬─────────┘
│
┌─────────▼─────────┐
│ target code │
│ generation │
└───────────────────┘
▼
assembly code
Building a compiler from scratch is a challenging task. In fact, the 2020 Turing Award was awarded to two of the authors of the “Dragon Book” in recognition of their contributions to the foundational algorithms and theory underlying programming language implementation. Due to my limited knowledge, I won’t delve into excessive detail regarding compiler theory and optimization techniques here.
Fortunately, there are numerous tools available to assist us in building simple compilers, such as yacc
and antlr
.
Exploration
Basic Arithmetic Expressions
Let’s begin by exploring the process of building an interpreter for simple mathematical expressions. First, we need to define the grammar for these expressions, which typically resemble the following form:
$$ (1 + 2) * 3 $$
Expressions involve addition, subtraction, multiplication, division, parentheses, and basic elements like numbers. We can define the following grammar description (using antlr syntax):
grammar Calculator;
// parser
start : expr | <EOF> ;
expr
: NUMBER # DOUBLE
| '-' expr # UMINUS
| '(' expr ')' # PARENGRP
| expr mulop expr # MULOPGRP
| expr addop expr # ADDOPGRP
;
addop : '+' | '-' ;
mulop : '*' | '/' ;
// lexer
NUMBER : ('0' .. '9') + ('.' ('0' .. '9') +)? ;
WS : [ \r\n\t] + -> skip ;
Using this grammar, the previous example (1 + 2) * 3
can be represented as:
Tools
The above example demonstrates the grammar definition using antlr
. There’s an excellent tool for front-end development called chevrotain
, which allows us to conveniently write lexical, syntactic, and parsing rules in JavaScript and then perform parsing in the front-end code to generate results.
To build a mathematical expression parser using compiler principles, the first step is to define its grammar. We’ve already defined the grammar for mathematical expressions using antlr
in the previous section. A grammar provides a precise description of the syntactic rules, outlining the valid grammatical structures and organization within a language.
Below is a grammar diagram generated using the tools provided by chevrotain
:
Mathematical Expressions - Lexer
In the previous section, we saw the final grammar diagram generated using Chevrotain. To actually write a parser, after defining the grammar, we need to describe its lexical rules. In Chevrotain, we create lexical tokens using createToken
, as shown below:
const AdditionOperator = createToken({
name: "AdditionOperator",
pattern: Lexer.NA,
});
const Plus = createToken({
name: "Plus",
pattern: /\+/,
categories: AdditionOperator,
});
const Minus = createToken({
name: "Minus",
pattern: /-/,
categories: AdditionOperator,
});
const MultiplicationOperator = createToken({
name: "MultiplicationOperator",
pattern: Lexer.NA,
});
const Multi = createToken({
name: "Multi",
pattern: /\*/,
categories: MultiplicationOperator,
});
const Div = createToken({
name: "Div",
pattern: /\//,
categories: MultiplicationOperator,
});
const LParen = createToken({ name: "LParen", pattern: /\(/ });
const RParen = createToken({ name: "RParen", pattern: /\)/ });
const NumberLiteral = createToken({
name: "NumberLiteral",
pattern: /[1-9]\d*/,
});
// marking WhiteSpace as 'SKIPPED' makes the lexer skip it.
const WhiteSpace = createToken({
name: "WhiteSpace",
pattern: /\s+/,
group: Lexer.SKIPPED,
});
Here, we define AdditionOperator
and MultiplicationOperator
as abstract tokens, with their actual lexical representation expressed by their leaf nodes Plus
& Minus
and Multi
& Div
, respectively. The remaining tokens represent left and right parentheses, numbers, and whitespace.
Lexical analysis involves parsing the raw input into a sequence of tokens. After defining these tokens, we can use the provided Lexer
class to create a lexical analyzer.
const allTokens = [
WhiteSpace, // whitespace is normally very common so it should be placed first to speed up the lexer's performance
Plus,
Minus,
Multi,
Div,
LParen,
RParen,
NumberLiteral,
AdditionOperator,
MultiplicationOperator,
];
const CalculatorLexer = new Lexer(allTokens);
Mathematical Expressions - Grammar Rules
With tokens and a lexical analyzer in place, we use the CstParser
base class to define our grammar rules, as follows:
class CalculatorPure extends CstParser {
constructor() {
super(allTokens);
const $ = this;
$.RULE("expression", () => {
$.OR([
{ ALT: () => $.SUBRULE($.uminusExpression) },
{ ALT: () => $.SUBRULE($.additionExpression) },
]);
});
$.RULE("uminusExpression", () => {
$.CONSUME(Minus);
$.SUBRULE($.expression);
});
// lowest precedence thus it is first in the rule chain
// The precedence of binary expressions is determined by how far down the Parse Tree
// The binary expression appears.
$.RULE("additionExpression", () => {
$.SUBRULE($.multiplicationExpression, { LABEL: "lhs" });
$.MANY(() => {
// consuming 'AdditionOperator' will consume either Plus or Minus as they are subclasses of AdditionOperator
$.CONSUME(AdditionOperator);
// the index "2" in SUBRULE2 is needed to identify the unique position in the grammar during runtime
$.SUBRULE2($.multiplicationExpression, { LABEL: "rhs" });
});
});
$.RULE("multiplicationExpression", () => {
$.SUBRULE($.atomicExpression, { LABEL: "lhs" });
$.MANY(() => {
$.CONSUME(MultiplicationOperator);
// the index "2" in SUBRULE2 is needed to identify the unique position in the grammar during runtime
$.SUBRULE2($.atomicExpression, { LABEL: "rhs" });
});
});
$.RULE("atomicExpression", () =>
$.OR([
// parenthesisExpression has the highest precedence and thus it appears
// in the "lowest" leaf in the expression ParseTree.
{ ALT: () => $.SUBRULE($.parenthesisExpression) },
{ ALT: () => $.CONSUME(NumberLiteral) },
// { ALT: () => $.SUBRULE($.powerFunction) },
])
);
$.RULE("parenthesisExpression", () => {
$.CONSUME(LParen);
$.SUBRULE($.expression);
$.CONSUME(RParen);
});
// very important to call this after all the rules have been defined.
// otherwise the parser may not work correctly as it will lack information
// derived during the self analysis phase.
this.performSelfAnalysis();
}
}
Chevrotain parsers use the following rules:
- CONSUME - maps to a single token
- SUBRULE - references another rule
- OR - chooses between multiple alternatives
- OPTION - optional rule
- MANY - repeats one or more times
- AT_LEAST_ONE - repeats one or more times
- MANY_SEP - repeats one or more times with separators
- AT_LEAST_ONE_SEP - repeats one or more times with separators
The CalculatorPure
class above inherits from CstParser
to generate a Concrete Syntax Tree (CST). In contrast, many might be more familiar with the concept of an Abstract Syntax Tree (AST). The key differences between the two are:
- An AST typically doesn’t include all syntactic information. This means reconstructing the original text accurately from an AST is not always possible.
- An AST doesn’t represent the entire parse tree. It usually contains only nodes relevant to specific parse tree nodes, not all nodes (mainly leaf nodes).
Other concepts, such as Grammar and Syntax, can be described as follows (quoted from ChatGPT):
In compiler principles, Grammar and Syntax are related but not identical concepts.
Grammar is a set of rules that describe a programming language or natural language. It defines the valid syntactic structure and sentence forms within the language. Grammars are typically expressed using Productions, where the left-hand side is a Non-terminal symbol and the right-hand side is a sequence of Terminals and/or Non-terminals. Grammars are categorized into different types, such as Context-Free Grammar and Context-Sensitive Grammar, which strictly define the language’s syntactic structure.
Syntax refers to rules and conventions within a programming language that define the valid structure of programs and statements. Syntax specifies the keywords, operators, identifiers, statements, expressions, and other elements that can be used in a program, as well as how they can be combined. Syntax is an abstract description method used to ensure the correctness and consistency of program structure. The syntax of a programming language is usually based on a specific grammar, so syntax can be considered a concrete implementation of grammar within a programming language.
Mathematical Expressions - Semantic Analysis
Through syntactic analysis, we generate a syntax tree. However, the syntax tree merely analyzes the input text based on grammatical rules. In practical scenarios, we also need it to produce actual results or other data structures. Taking the mathematical expression example, after parsing, we expect to obtain the final calculated result, such as 9
for the expression (1 + 2) * 3
. This is where semantic analysis comes in. We traverse the syntax tree and perform corresponding actions (semantics) for each node, such as mathematical operations, ultimately leading to the output result. This process is recursive.
Chevrotain provides the Visitor
class for CSTs, which facilitates syntax tree traversal. For the previous CalculatorPure
class, we can easily write its interpreter as follows:
const parser = new CalculatorPure([]);
const defaultRuleName = "expression";
// ----------------- Interpreter -----------------
const BaseCstVisitor = parser.getBaseCstVisitorConstructor();
class CalculatorInterpreter extends BaseCstVisitor {
constructor() {
super();
// This helper will detect any missing or redundant methods on this visitor
this.validateVisitor();
}
expression(ctx) {
if (ctx.additionExpression){
return this.visit(ctx.additionExpression);
} else { // uminus
return this.visit(ctx.uminusExpression);
}
}
uminusExpression(ctx) {
return -1 * this.visit(ctx.expression);
}
additionExpression(ctx) {
let result = this.visit(ctx.lhs);
// "rhs" key may be undefined as the grammar defines it as optional (MANY === zero or more).
if (ctx.rhs) {
ctx.rhs.forEach((rhsOperand, idx) => {
// there will be one operator for each rhs operand
let rhsValue = this.visit(rhsOperand);
let operator = ctx.AdditionOperator[idx];
if (tokenMatcher(operator, Plus)) {
result += rhsValue;
} else {
// Minus
result -= rhsValue;
}
});
}
return result;
}
multiplicationExpression(ctx) {
let result = this.visit(ctx.lhs);
// "rhs" key may be undefined as the grammar defines it as optional (MANY === zero or more).
if (ctx.rhs) {
ctx.rhs.forEach((rhsOperand, idx) => {
// there will be one operator for each rhs operand
let rhsValue = this.visit(rhsOperand);
let operator = ctx.MultiplicationOperator[idx];
if (tokenMatcher(operator, Multi)) {
result *= rhsValue;
} else {
// Division
result /= rhsValue;
}
});
}
return result;
}
atomicExpression(ctx) {
if (ctx.parenthesisExpression) {
// passing an array to "this.visit" is equivalent
// to passing the array's first element
return this.visit(ctx.parenthesisExpression);
} else if (ctx.NumberLiteral) {
// If a key exists on the ctx, at least one element is guaranteed
return parseInt(ctx.NumberLiteral[0].image, 10);
}
}
parenthesisExpression(ctx) {
// The ctx will also contain the parenthesis tokens, but we don't care about those
// in the context of calculating the result.
return this.visit(ctx.expression);
}
}
Mathematical Expressions - Testing
With the lexical, syntactic, and semantic processing defined above, we can use these definitions to handle actual text and interpret it to generate output. The relevant test code is as follows:
import {expect, test} from 'vitest';
import { CalculatorLexer, CalculatorInterpreter, parser, defaultRuleName } from "./calculator.js";
const visitor = new CalculatorInterpreter();
function parse(lexResult, startRuleName) {
parser.reset();
parser.input = lexResult.tokens;
let value = parser[startRuleName]();
return { value: value, errors: parser.errors };
}
function evaluate(text) {
let lexResult = CalculatorLexer.tokenize(text);
let parseResult = parse(lexResult, defaultRuleName);
if (parseResult.errors.length > 0) {
return new Error(parseResult.errors);
}
let result = visitor.visit(parseResult.value);
return result;
}
const testData = [{
message: 'negivate number',
calc: '-1',
value: -1
}, {
message: 'basic add',
calc: '1+2',
value: 3
}, {
message: 'basic minus',
calc: '1-2',
value: -1
}, {
message: 'basic multi',
calc: '1 * 2',
value: 2
}, {
message: 'basic div',
calc: '10 / 5',
value: 2
}, {
message: 'paren',
calc: '(1+2)*3',
value: 9
}, {
message: 'negivate with paren',
calc: '-(1+2)',
value: -3
}]
for (const data of testData) {
const {message, calc, value} = data;
test(message, () => {
expect(evaluate(calc)).toBe(value);
});
}
test('expect error', () => {
expect(evaluate('abc')).toBeInstanceOf(Error);
});
For example, running evaluate('(1 + 2) * 3')
will output the expected number 9
.
Expansion
The above example demonstrates a simple mathematical expression parser with a defined set of grammar and parsing rules. We can extend this further to support functions and other symbols. The extended grammar definition is as follows:
grammar formula;
expression
: '-' expression
| additionalExpression;
additionalExpression
: multiplicationExpression (additionalOperator multiplicationExpression)?;
additionalOperator
: '+' | '-';
multiplicationExpression
: atomicExpression (multiplicationOperator additionalExpression)?;
multiplicationOperator
: '*' | '/';
atomicExpression
: parenthesisExpression | NumberLiteral | formulaExpression;
parenthesisExpression
: LParen expression RParen;
LParen
: '(';
RParen
: ')';
NumberLiteral
: ('0' .. '9') + ('.' ('0' .. '9') +)? ;
formulaExpression
: FormulaLiteral LParen ReferenceLiteral RParen;
FormulaLiteral
: 'SUM' | 'AVG' | 'MAX' | 'MIN';
ReferenceLiteral
: ('a' .. 'z' | 'A' .. 'Z' | '_' | '0' .. '9') + ;
The visualization after expansion looks like this:
Here, the definition of formulaExpression
in the syntactic analysis Parser is as follows:
$.RULE("formulaExpression", () => {
$.CONSUME(FormulaLiteral, {LABEL: "formula"});
$.CONSUME(LParen);
$.CONSUME(ReferenceLiteral, {LABEL: "reference"});
$.CONSUME(RParen);
});
The corresponding parsing in the semantic interpreter is as follows (where this.data
is data passed in the constructor):
formulaExpression(ctx) {
let formula = ctx.formula[0];
let reference = ctx.reference[0].image;
if (!this.data.hasOwnProperty(reference)) {
throw new Error(`Unknown reference: ${reference}`);
}
let values = this.data[reference];
if (values.length === 0) {
throw new Error(`Empty reference: ${reference}`);
}
if (tokenMatcher(formula, Sum)) {
return _.sum(values);
} else if (tokenMatcher(formula, Avg)) {
return _.sum(values) / values.length;
} else if (tokenMatcher(formula, Max)) {
return _.max(values);
} else if (tokenMatcher(formula, Min)) {
return _.min(values);
}
}
The complete code is available at https://github.com/tomowang/formula-parser-demo
Afterword
Compiler principles involve a vast amount of knowledge beyond what I’ve covered here. Topics like the differences between LL and LR grammars, and techniques for eliminating left recursion, are worth exploring when time permits.