Parsing

We saw that top-down parsers may need to backtrack when they select the wrong production. We want to avoid backtracking.

This is where predictive parsers come in useful

  • LL(1): left to right scan, left-most derivation, 1-token look ahead
  • LR(1): left to right scan, right-most derivation, 1-token look ahead

LL(1) Grammar

In order to use LL(1) parsers, the context-free grammar has to be:

  • unambiguous
  • without left recursion
  • left factored

Eliminating Left Recursion

Consider the grammar fragment:

<foo> ::= <foo>A
| B

where A and B do not start with <foo>. We can re-write this as:

<foo> ::= B<bar>
<bar> ::= A<bar>
| epsilon

where <bar> is a new non-terminal. This fragment contains no left recursion.

Left Factoring

For any two productions, we would like a distinct way of choosing the correct production to expand.

We define FIRST (A) as the set of terminals that appear first in some string derived from A.

For a terminal w, we can say: wFIRST(A)    Awzw \in FIRST(A) \iff A \Rightarrow wz

Now going back to our two productions, we do FIRST(production 1) \cap FIRST(production 2) =ϕ=\phi. This would allow the parser to make a correct choice with a look ahead of only one symbol.

FIRST and FOLLOW sets

FIRST Set Calculation

Rules to calculate the FIRST set:

  1. FIRST(terminal) is (terminal)
  2. If A::=aA ::= a, and a is a terminal:
  • aFIRST(A){a} \in FIRST (A)
  1. If A::=BA ::= B and rule B does not exist in grammar:
  • FIRST(B)FIRST(A)FIRST(B) \in FIRST(A)
  1. If A::=BA ::= B and rule B does exist in the grammar:
  • (FIRST(B)ϵ)FIRST(a)FIRST(A){(FIRST(B) - \epsilon) \cup FIRST(a)} \in FIRST (A)

FOLLOW Set Calculation

The follow set of non-terminal AA contains all the terminals that appear after AA in any string generate by the grammar GG.

Rules to calculate the FOLLOW set:

  1. {$} \in FOLLOW(S)
  2. If A::=aBA ::= aB
  • FOLLOW(A) \in FOLLOW (B)
  1. If A::=aByA ::= aBy and yy does not exist in the grammar
  • FIRST(y) \in FOLLOW(B)
  1. If A::=aByA ::= aBy and yy does exist in the grammar:
  • {(FIRST(y)-\epsilon)} \cup FOLLOW(A)} \in FOLLOW(B)

Parsing tables

Image

LL(1) Parsing

In order to implement an LL(1) parser, we need to use the following data structures:

  • parse table (2D array)
  • stack (contains the derivations)
  • list (that will contain the token input stream)

LL(1) Error Recovery

What happens when the parser discovers an error?

  • Approach 1: stop all parsing activity and return an error message
  • Approach 2: try to continue parsing (is possible) and see if there are more errors along the way

Which approach does your compiler take?

An error is detected when:

  • the terminal on top of the stack does not match the next input token
  • the parsing table cell from which we are supposed to pull the next production is empty

What does the parser do?

  • It enters the panic-mode error Recovery
  • Based on the idea of skipping symbols on the input until a token in the SYNCH set appears

Let SS be a set of tokens called a synchronization set (SYNCH). Let sS,ss \in S, s is called a synchronization token.

Place all symbols in FOLLOW(A) into the SYNCH(A) set for non-terminal A. If we skip tokens until an element of SYNCH(A) is seen and we pop A from the stack, it's likely that parsing can continue.

The panic-mode error recovery can be implemented using the SYNCH set(s) as follows:

  • Scenario 1: If there is a non-terminal at the top of the stack, discard input tokens until you find a synch token, then pop the non-terminal
  • Scenario 2: If there is a terminal at the top of the stack, we could try popping it to see whether we can continue. Assume that the input string is actually missing that terminal

A NON LL(1) Grammar

Consider the grammar:

<stmt> ::= if <expr> then <stmt>
| if <expr> then <stmt> else <stmt>

Needs left factoring, which gives:

<stmt’> ::= else <stmt> | ε

Let’s get the FIRST and FOLLOW sets

FIRST(stmt) = {if}
FIRST(stmt’)={else, ε}
FOLLOW(stmt) = {$, else}
FOLLOW(stmt’)={$, else}

The problem arises because for an input token else and stack top of stmt’, we do not know which production to choose:

  • <stmt’> else <stmt>
  • <stmt’> ε

Therefore, this is not an LL(1) grammar