Subtle Regex Bug
While extending the Chess.com core game logic to support variants back in 2016, I decided to store new variant-related state as optional extra fields at the end of a FEN string.
Using a simplified version for illustration, I took the following regex:
let good = /^(p) ?(c?|-) ?(e?|-)$/;
…and removed the $ from the end:
let bad = /^(p) ?(c?|-) ?(e?|-)/;
My reasoning was that this was exactly the same regular expression as before, except it would now permit arbitrary additional characters at the end. (The use-case for this particular regex was in the PGN viewer and in correspondence chess, which didn’t support variants, so it didn’t matter what those characters were.)
What I failed to notice, however, was that the original regex had been written to allow for missing fields. It reads as:
- start
- (p) # position
- possible space
- (possible c or "-") # castling
- possible space
- (possible e or "-") # en passant
- end
Notice that everything after the “position” bit has “possible” in front of it (except for “end”). This means that a lone p is a match for this expression. Notice also that both castling and en passant can optionally be "-". How does the regex matcher know which is which?
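As a quick check of that reading, here is a minimal sketch that runs the simplified regex against a few inputs. The helper name `fields` and the sample strings are my own, chosen only for illustration.

```javascript
// Simplified "good" regex from above.
const good = /^(p) ?(c?|-) ?(e?|-)$/;

// Hypothetical helper: return the three captures, or null if there is no match.
function fields(regex, fen) {
  const m = fen.match(regex);
  return m ? m.slice(1) : null;
}

console.log(fields(good, "p"));     // → ["p", "", ""]   a lone p matches
console.log(fields(good, "p c e")); // → ["p", "c", "e"] all fields present
console.log(fields(good, "p - -")); // → ["p", "-", "-"] both fields set to "-"
```

When every field is present the captures line up, which is why the ambiguity is easy to miss.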
It turns out (I think) that the original regex was broken to begin with. My simplified “good” version certainly is: with all fields present, it works as expected. But with en passant omitted and castling set to "-" (FEN = "p -"), the "-" is actually assigned to the en passant capture. Castling is an empty string.
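The misassignment can be reproduced with a short sketch. The variable names here are illustrative; the pattern is the same simplified “good” regex.

```javascript
// Same pattern as the simplified "good" regex.
const simplified = /^(p) ?(c?|-) ?(e?|-)$/;

// Castling "c", en passant omitted: "c" lands in the castling capture.
const withC = "p c".match(simplified).slice(1);
console.log(withC);    // → ["p", "c", ""]

// Castling "-", en passant omitted: "-" lands in the en passant capture.
const withDash = "p -".match(simplified).slice(1);
console.log(withDash); // → ["p", "", "-"]
```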
In the “bad” version, with the $ marker removed, the regex only behaves as expected when all fields are present and none of them is "-":
let bad = /^(p) ?(c?|-) ?(e?|-) ?/;
With this regex, the string "p - e" produces an empty string for both castling and en passant! For some reason, not needing to match an end marker causes the regex to lazily match as little as possible, but only when encountering the potentially ambiguous "-". I’m still not sure exactly why this is. If you know, feel free to email me or discuss on HN.
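One plausible explanation, offered as a sketch rather than a definitive diagnosis: JavaScript's matcher tries the alternatives in a group from left to right, and c? can always succeed by matching nothing. The "-" alternative is then only tried if something later in the pattern fails and forces backtracking. With $ in place, that failure happens; without it, the first (empty) match is accepted. The `reordered` pattern and the `groups` helper below are hypothetical names of my own, not the fix that was actually shipped.

```javascript
// The "bad" regex from above, with no end anchor.
const bad = /^(p) ?(c?|-) ?(e?|-) ?/;

// Hypothetical variant: try "-" before the optional letter in each group.
const reordered = /^(p) ?(-|c?) ?(-|e?)/;

// Hypothetical helper: return the three captures, or null if there is no match.
function groups(regex, fen) {
  const m = fen.match(regex);
  return m ? m.slice(1) : null;
}

console.log(groups(bad, "p - e"));       // → ["p", "", ""]   both fields come back empty
console.log(groups(bad, "p c e"));       // → ["p", "c", "e"] fine when no "-" is involved

console.log(groups(reordered, "p - e")); // → ["p", "-", "e"]
console.log(groups(reordered, "p -"));   // → ["p", "-", ""]
console.log(groups(reordered, "p"));     // → ["p", "", ""]
```

Putting "-" first means the group consumes the dash whenever one is there, so the result no longer depends on a later failure to trigger backtracking.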