Syntax Analysis

Context free grammars

A Context Free Grammar (CFG) consists of

  • Terminals
  • Non-terminals
  • Start symbol
  • Productions

A language that can be generated by a grammar is said to be a context-free language.

Terminals

They are the basics symbols from which strings are formed. These are the tokens that were produced by the Lexical Analyser

Non-terminals

They are syntactic variables that denote sets of strings. One non-terminal is distinguished as the start symbol.

The productions of a grammar specify the manner in which the terminal and non-terminals can be combined to form strings.

Derivations

<expr> = <expr> <op> <expr>

This is read as "expr derives expr op expr".

<expr> = <expr> <op> <expr>
= id <op> <expr>
= id * <expr>
= id*id

This is called a derivation of id*id from expr.

If A::=γA::=\gamma is a production and α\alpha and β\beta are arbitrary strings of grammar symbols we say: \alphaAβαγβ\alphaA\beta \Rightarrow \alpha\gamma\beta

If α1α2...αn\alpha_1 \Rightarrow \alpha_2 \Rightarrow ... \Rightarrow \alpha_n, we say α1\alpha_1 derives αn\alpha_n.

  • \Rightarrow means "derives in one step"

  • \Rightarrow^* means "derives in zero or more steps."

  • +\Rightarrow^+ means "derives in one or more steps."

  • GG is grammar

  • SS is the start symbol

  • L(G)L(G) is the language generated by GG

Strings in L(G)L(G) may contain only terminal symbols of GG. A string of terminals ww is said to be in L(G) if and only if S+WS\Rightarrow^+ W.

A language that can be generated by a grammar is said to be a context-free language. If two grammars generate the same language, the grammars are said to be equivalent.

Parse Trees

expr
|
| | |
expr op expr
| | |
id + id

This is called: Leftmost derivation

Precedence

A problem with parse trees is that it does not enforce precedence. However we can enhance production rules to add precedence.

<expr> ::= <expr> + <term>
| <expr> - <term>
| <term>
<term> ::= <term> * <factor>
| <term> / <factor>
| <factor>
<factor> ::= num
| id

Ambiguity

A grammar that produces more than one parse tree for some sentence is said to be ambiguous.

Eliminating Ambiguity

Sometimes ambiguous grammar can be rewritten to eliminate the ambiguity.

<stmt> ::= <matched>
| <unmatched>
<matched> ::= if <expr> then <matched> else <matched>
| other stmts
<unmatched> ::= if <expr> then <stmt>
| if <expr> then <matched> else <unmatched>

Top-down parsing

A top-down parser starts with the root of the parse tree, labelled with the start or goal symbol of the grammar.

To build a parse tree, we repeat the following steps until the leaves of the parse tree match the input string.

  1. At a node labelled A, select a production A ::= α\alpha and construct the appropriate child for each symbol of α\alpha
  2. When a terminal is added to the parse tree that does not match the input string, backtrack
  3. Find the next non-terminal to be expanded.

Left recursion