Syntax Analysis

Context free grammars

A Context Free Grammar (CFG) consists of

Terminals
Non-terminals
Start symbol
Productions

A language that can be generated by a grammar is said to be a context-free language.

Terminals

They are the basics symbols from which strings are formed. These are the tokens that were produced by the Lexical Analyser

Non-terminals

They are syntactic variables that denote sets of strings. One non-terminal is distinguished as the start symbol.

The productions of a grammar specify the manner in which the terminal and non-terminals can be combined to form strings.

Derivations

<expr> = <expr> <op> <expr>

This is read as "expr derives expr op expr".

<expr> = <expr> <op> <expr>
  = id <op> <expr>
  = id * <expr>
  = id*id

This is called a derivation of id*id from expr.

If $A::=\gamma$ is a production and $\alpha$ and $\beta$ are arbitrary strings of grammar symbols we say: $\alphaA\beta \Rightarrow \alpha\gamma\beta$

If $\alpha_1 \Rightarrow \alpha_2 \Rightarrow ... \Rightarrow \alpha_n$ , we say $\alpha_1$ derives $\alpha_n$ .

$\Rightarrow$ means "derives in one step"
$\Rightarrow^*$ means "derives in zero or more steps."
$\Rightarrow^+$ means "derives in one or more steps."
$G$ is grammar
$S$ is the start symbol
$L(G)$ is the language generated by $G$

Strings in $L(G)$ may contain only terminal symbols of $G$ . A string of terminals $w$ is said to be in L(G) if and only if $S\Rightarrow^+ W$ .

A language that can be generated by a grammar is said to be a context-free language. If two grammars generate the same language, the grammars are said to be equivalent.

Parse Trees

      expr
       |
  |    |     |
expr   op   expr
  |    |     |
 id    +     id

This is called: Leftmost derivation

Precedence

A problem with parse trees is that it does not enforce precedence. However we can enhance production rules to add precedence.

<expr> ::= <expr> + <term>
        |  <expr> - <term>
        |  <term>
<term> ::= <term> * <factor>
        |  <term> / <factor>
        |  <factor>
<factor> ::= num
          | id

Ambiguity

A grammar that produces more than one parse tree for some sentence is said to be ambiguous.

Eliminating Ambiguity

Sometimes ambiguous grammar can be rewritten to eliminate the ambiguity.

<stmt> ::= <matched>
        |  <unmatched>
<matched> ::= if <expr> then <matched> else <matched>
           |  other stmts
<unmatched> ::= if <expr> then <stmt>
             |  if <expr> then <matched> else <unmatched>

Top-down parsing

A top-down parser starts with the root of the parse tree, labelled with the start or goal symbol of the grammar.

To build a parse tree, we repeat the following steps until the leaves of the parse tree match the input string.

At a node labelled A, select a production A ::= $\alpha$ and construct the appropriate child for each symbol of $\alpha$
When a terminal is added to the parse tree that does not match the input string, backtrack
Find the next non-terminal to be expanded.

Left recursion

Finite State Automata

Software Requirement Analysis