Writing regex engine in 2023
1 min readJun 20, 2023
Why? For the fun and none of the profit.
Basic intro – I’ve implemented D language std regex library, including all of the quirky features of a modern regex. Still I hate it for the following reasons:
- Unclear memory allocation model, mostly malloc with hacky ref counting on top
- All of the features of PCRE regex, including backreferences and lookaround
- Speeeed, many cool optimization did not made it in, because std moves slooowly
- Interface – many optimizations are lost due to missing interface like hasMatch which mearly checks the presense of the match without trying to extract anything out of text
- Horrible bytecode – it’s both too complex to manipulate and difficult to execute, best of both worlds heh.
So the ground rules for the proper regex library that I call rewind regex https://github.com/DmitryOlshansky/rewind-regex
First the basics:
- Use AST plain simple, linked class-based AST, yes it is GC allocated
- Plot AST as DOT language graphs for visualization
- Bytecode – simple cisc byte oriented with lookup tables stored inline
- Matchers – simply a stack of engine with rising accuracy at the cost of more expansive matching
- JIT – just a kind of matcher that JITs bytecode to machine code, this is arch dependant and slated for second or third release
This mostly covers all of the key points, last but not least I have beatiful documentation and test suite from std.regex. Truly the docs and test are the most valuable asset.