Lexical Analysis
Lexical Analyzer
Lexical analyzer is the first phase of a compiler
- task: read input characters and produce a sequence of tokens that the parser uses for syntax analysis
- remove white spaces
You separate the analysis phase of compiling into lexical and syntax analysis because:
- simpler (layered) design
- compiler efficiency
Specialized tools have been designed to help automate the construction of both separately
Lexemes
sequence of characters in the source program that matched by the pattern for a token
- a lexeme is a basic lexical unit of a language
- lexemes of a programming language include its
Identifiers
- name of variables, methods, classes, packages and interfaces
Literals
- fixed values (e.g. "1", "17.56", "0xFFE")
Operators
- for Maths, Boolean and logical operations
Special words: keywords (e.g. "if", "for", "public")
Tokens, Patterns, Lexemes
Token: category of lexemes
A pattern is a rule describing the set of lexemes that can represent a particular token in source program
Lexical Errors
- Few errors are discernible at the lexical level alone
- lexical analyzer has a very localized view of a source program
- Let some other phrase of compiler handle any error
Specification of Tokens
We need a powerful notation to specify the patterns for the tokens
- regular expressions to the rescue
In the process of studying regular expressions, we will discuss:
- operations on languages
- regular definitions
- notational shorthands
Operations on languages
- Union between languages: the set of strings that belong to at least one of both languages
- Concatenation: the set of all strings that contain substrings from the languages being concatenated
- Intersection: set of all strings from both languages
- Kleene closure: set of all strings that are concatenations of 0 or more strings from the original language
- Positive closure: set of all strings that are concatenations of 1 or more strings from the original language
Regular expressions
A compact notation for describing a string. Typically an identifier is a letter followed by zero or more letters or digits (letter (letter | digit)) where | = or and = zero or more instances
Rules
is a regular that denotes , the set containing empty string.
If is a symbol in then is a regular expression that denotes , the set containing the string .
Suppose and are regular expressions denoting the languages and , then:
- is a regular expression denoting
- denotes
- is a regular expression denoting
Regular definitions
If is an alphabet of basic symbols, then a regular definition is a sequence of definitions of the form:
> ...
where each is a distinct name, and each is a regular expression over the symbols in
Notational shorthands
- One or more instances: a+ denotes the set of all strings of one or more a's
- Zero or more instances: a* denotes all strings of zero or more a's
- Character classes: the notation [abc] where a, b and c denotes the regular expression
- Abbreviated character classes: the notation [a-z] denotes the regular expression
Finite State Automata
You can use state machines to tell if a string follows a regular expression. These states machines are used in a program called a Recognizer. A *recognizer for a language is a program that takes as input a string and answers
- "yes" if is a lexeme of the language
- "no" otherwise
Finite automation compiles a regular expression into a recognizer by constructing a generalized transition diagram. It can be deterministic or nondeterministic.
- Nondeterministic mean thats on transition out of state may be possible on the same input symbol
Nondeterministic Finite Automata (NFA)
A set of states . A set input symbols that belong to alphabet . A set of transitions that are triggered by the processing of a character. A single state that is distinguished as the start (initial state). A set of states distinguished as accepting (final) states.
Deterministic Finite Automata (DFA)
A DFA is a special case of NFA in which
- no state has an -transition
- for each state and input symbol , there is at most one edge labeled leaving
Practical Regular expressions
To match a single digit, we use: [0-9]. We can also use \d
. The slash is an escape character used to distinguish it from letter d. Similar to match a non-digit character, we can use the notation \D
.
To match an alphanumeric character, we can use ths notation: [a-zA-Z0-9]
. Or we can use \w
. Similarly, we can use \W
for non-alphanumeric characters.
You can use \p{L}
to show the English alphabet plus more letter characters defined in Unicode.
A wildcard can be used to match any character you can use \.
.
To match everything except a set of character, in other terms exclusion, we can use [^abc]
.
To specify a range of how many times a character can be repeated we use a {x,y}
, where a is any string and x,y are integer ranges.
We can use ?
to represent optional character. An underscore represents whitespace or \s
and \S
for no whitespace.
To match the beginning of a line use ^
while $
to match the end. To match a whole line use both.
Lastly, \b
is used to match the start/end of a string and \B
to do the opposite.