Lexical Analysis

View on GitHub

Lexical Analyzer

Lexical analyzer is the first phase of a compiler

task: read input characters and produce a sequence of tokens that the parser uses for syntax analysis
remove white spaces

You separate the analysis phase of compiling into lexical and syntax analysis because:

simpler (layered) design
compiler efficiency

Specialized tools have been designed to help automate the construction of both separately

Lexemes

sequence of characters in the source program that matched by the pattern for a token
- a lexeme is a basic lexical unit of a language
- lexemes of a programming language include its
Identifiers
- name of variables, methods, classes, packages and interfaces
Literals
- fixed values (e.g. "1", "17.56", "0xFFE")
Operators
- for Maths, Boolean and logical operations
Special words: keywords (e.g. "if", "for", "public")

Tokens, Patterns, Lexemes

Token: category of lexemes

A pattern is a rule describing the set of lexemes that can represent a particular token in source program

Lexical Errors

Few errors are discernible at the lexical level alone
- lexical analyzer has a very localized view of a source program
Let some other phrase of compiler handle any error

Specification of Tokens

We need a powerful notation to specify the patterns for the tokens

regular expressions to the rescue

In the process of studying regular expressions, we will discuss:

operations on languages
regular definitions
notational shorthands

Operations on languages

Union between languages: the set of strings that belong to at least one of both languages
Concatenation: the set of all strings that contain substrings from the languages being concatenated
Intersection: set of all strings from both languages
Kleene closure: set of all strings that are concatenations of 0 or more strings from the original language
Positive closure: set of all strings that are concatenations of 1 or more strings from the original language

Regular expressions

A compact notation for describing a string. Typically an identifier is a letter followed by zero or more letters or digits (letter (letter | digit)) where | = or and = zero or more instances

Rules

$\epsilon$ is a regular that denotes ${\epsilon}$ , the set containing empty string.

If $a$ is a symbol in $\Sigma$ then $a$ is a regular expression that denotes ${a}$ , the set containing the string $a$ .

Suppose $r$ and $s$ are regular expressions denoting the languages $L$ and $M$ , then:

$(r) | (s)$ is a regular expression denoting $L \union M$
$(r)(s)$ denotes $LM$
$(r) \*$ is a regular expression denoting $(L)\*$

Regular definitions

If $\Sigma$ is an alphabet of basic symbols, then a regular definition is a sequence of definitions of the form:

$d_1 \Rightarrow r_1$ > $d_2 \Rightarrow r_2$ ... $d_n \Rightarrow r_n$

where each $d_i$ is a distinct name, and each $r_i$ is a regular expression over the symbols in $\Sigma \union {d_1, d_2, ..., d_{i-1}}$

Notational shorthands

One or more instances: a+ denotes the set of all strings of one or more a's
Zero or more instances: a* denotes all strings of zero or more a's
Character classes: the notation [abc] where a, b and c denotes the regular expression $a | b | c$
Abbreviated character classes: the notation [a-z] denotes the regular expression $a | b | ... | z$

Finite State Automata

You can use state machines to tell if a string follows a regular expression. These states machines are used in a program called a Recognizer. A *recognizer for a language is a program that takes as input a string $x$ and answers

"yes" if $x$ is a lexeme of the language
"no" otherwise

Finite automation compiles a regular expression into a recognizer by constructing a generalized transition diagram. It can be deterministic or nondeterministic.

Nondeterministic mean thats on transition out of state may be possible on the same input symbol

Nondeterministic Finite Automata (NFA)

A set of states $S$ . A set input symbols that belong to alphabet $\Sigma$ . A set of transitions that are triggered by the processing of a character. A single state $s_0$ that is distinguished as the start (initial state). A set of states $F$ distinguished as accepting (final) states.

Deterministic Finite Automata (DFA)

A DFA is a special case of NFA in which

no state has an $\epsilon$ -transition
for each state $s$ and input symbol $a$ , there is at most one edge labeled $a$ leaving $s$

Practical Regular expressions

To match a single digit, we use: [0-9]. We can also use \d. The slash is an escape character used to distinguish it from letter d. Similar to match a non-digit character, we can use the notation \D.

To match an alphanumeric character, we can use ths notation: [a-zA-Z0-9]. Or we can use \w. Similarly, we can use \W for non-alphanumeric characters.

You can use \p{L} to show the English alphabet plus more letter characters defined in Unicode.

A wildcard can be used to match any character you can use \..

To match everything except a set of character, in other terms exclusion, we can use [^abc].

To specify a range of how many times a character can be repeated we use a {x,y}, where a is any string and x,y are integer ranges.

We can use ? to represent optional character. An underscore represents whitespace or \s and \S for no whitespace.

To match the beginning of a line use ^ while $ to match the end. To match a whole line use both.

Lastly, \b is used to match the start/end of a string and \B to do the opposite.

Compilers

Finite State Automata