Lexical Analysis

Lexical Analyzer

Lexical analyzer is the first phase of a compiler

  • task: read input characters and produce a sequence of tokens that the parser uses for syntax analysis
  • remove white spaces

You separate the analysis phase of compiling into lexical and syntax analysis because:

  • simpler (layered) design
  • compiler efficiency

Specialized tools have been designed to help automate the construction of both separately

Lexemes

  • sequence of characters in the source program that matched by the pattern for a token

    • a lexeme is a basic lexical unit of a language
    • lexemes of a programming language include its
  • Identifiers

    • name of variables, methods, classes, packages and interfaces
  • Literals

    • fixed values (e.g. "1", "17.56", "0xFFE")
  • Operators

    • for Maths, Boolean and logical operations
  • Special words: keywords (e.g. "if", "for", "public")

Tokens, Patterns, Lexemes

Token: category of lexemes

A pattern is a rule describing the set of lexemes that can represent a particular token in source program

Lexical Errors

  • Few errors are discernible at the lexical level alone
    • lexical analyzer has a very localized view of a source program
  • Let some other phrase of compiler handle any error

Specification of Tokens

We need a powerful notation to specify the patterns for the tokens

  • regular expressions to the rescue

In the process of studying regular expressions, we will discuss:

  • operations on languages
  • regular definitions
  • notational shorthands

Operations on languages

  • Union between languages: the set of strings that belong to at least one of both languages
  • Concatenation: the set of all strings that contain substrings from the languages being concatenated
  • Intersection: set of all strings from both languages
  • Kleene closure: set of all strings that are concatenations of 0 or more strings from the original language
  • Positive closure: set of all strings that are concatenations of 1 or more strings from the original language

Regular expressions

A compact notation for describing a string. Typically an identifier is a letter followed by zero or more letters or digits (letter (letter | digit)) where | = or and = zero or more instances

Rules

ϵ\epsilon is a regular that denotes ϵ{\epsilon}, the set containing empty string.

If aa is a symbol in Σ\Sigma then aa is a regular expression that denotes a{a}, the set containing the string aa.

Suppose rr and ss are regular expressions denoting the languages LL and MM, then:

  • (r)(s)(r) | (s) is a regular expression denoting L\unionML \union M
  • (r)(s)(r)(s) denotes LMLM
  • (r)\*(r) \* is a regular expression denoting (L)\*(L)\*

Regular definitions

If Σ\Sigma is an alphabet of basic symbols, then a regular definition is a sequence of definitions of the form:

d1r1d_1 \Rightarrow r_1 > d2r2d_2 \Rightarrow r_2 ... dnrnd_n \Rightarrow r_n

where each did_i is a distinct name, and each rir_i is a regular expression over the symbols in Σ\uniond1,d2,...,di1\Sigma \union {d_1, d_2, ..., d_{i-1}}

Notational shorthands

  1. One or more instances: a+ denotes the set of all strings of one or more a's
  2. Zero or more instances: a* denotes all strings of zero or more a's
  3. Character classes: the notation [abc] where a, b and c denotes the regular expression abca | b | c
  4. Abbreviated character classes: the notation [a-z] denotes the regular expression ab...za | b | ... | z

Finite State Automata

You can use state machines to tell if a string follows a regular expression. These states machines are used in a program called a Recognizer. A *recognizer for a language is a program that takes as input a string xx and answers

  • "yes" if xx is a lexeme of the language
  • "no" otherwise

Finite automation compiles a regular expression into a recognizer by constructing a generalized transition diagram. It can be deterministic or nondeterministic.

  • Nondeterministic mean thats on transition out of state may be possible on the same input symbol

Nondeterministic Finite Automata (NFA)

A set of states SS. A set input symbols that belong to alphabet Σ\Sigma. A set of transitions that are triggered by the processing of a character. A single state s0s_0 that is distinguished as the start (initial state). A set of states FF distinguished as accepting (final) states.

Deterministic Finite Automata (DFA)

A DFA is a special case of NFA in which

  • no state has an ϵ\epsilon-transition
  • for each state ss and input symbol aa, there is at most one edge labeled aa leaving ss

Practical Regular expressions

To match a single digit, we use: [0-9]. We can also use \d. The slash is an escape character used to distinguish it from letter d. Similar to match a non-digit character, we can use the notation \D.

To match an alphanumeric character, we can use ths notation: [a-zA-Z0-9]. Or we can use \w. Similarly, we can use \W for non-alphanumeric characters.

You can use \p{L} to show the English alphabet plus more letter characters defined in Unicode.

A wildcard can be used to match any character you can use \..

To match everything except a set of character, in other terms exclusion, we can use [^abc].

To specify a range of how many times a character can be repeated we use a {x,y}, where a is any string and x,y are integer ranges.

We can use ? to represent optional character. An underscore represents whitespace or \s and \S for no whitespace.

To match the beginning of a line use ^ while $ to match the end. To match a whole line use both.

Lastly, \b is used to match the start/end of a string and \B to do the opposite.