In computer programming, the lexer hack (as opposed to "a lexer hack") describes a common solution to the problems in parsing ANSI C, due to the reference grammar being context-sensitive. In C, classifying a sequence of characters as a variable name or a type name requires contextual information of the phrase structure, which prevents one from having a context-free lexer.
The problem is that in the following code, the lexical class of A
cannot be determined without further contextual information:
This code could be multiplication of two variables, in which case A
is a variable; written unambiguously:
Alternatively, it could be casting the dereferenced value of B
to the type A
, in which case A
is a typedef name; written unambiguously:
In more detail, in a compiler, the lexer performs one of the earliest stages of converting the source code to a program. It scans the text to extract meaningful tokens, such as words, numbers, and strings. The parser analyzes sequences of tokens attempting to match them to syntax rules representing language structures, such as loops and variable declarations. A problem occurs here if a single sequence of tokens can ambiguously match more than one syntax rule.
This ambiguity can happen in C if the lexer does not distinguish between variable and typedef identifiers. For example, in the C expression:
the lexer may find these tokens:
The problem is precisely that the lexical class of A cannot be determined without further context: the parser can interpret this as variable A multiplied by B or as type A casting the dereferenced value of B. This is known as the "typedef-name: identifier" problem, due to the name of the problematic production rule.
The solution generally consists of feeding information from the semantic symbol table back into the lexer. That is, rather than functioning as a pure one-way pipeline from the lexer to the parser, there is a backchannel from semantic analysis back to the lexer. This mixing of parsing and semantic analysis is generally regarded as inelegant, which is why it is called a "hack".