Library grammar

Require Import List.

Grammars

Given two finite sets of terminal and non-terminal symbols, a grammar is a finite set of productions of the form S ~> w where S is a non-terminal symbol and w is a list of (terminal or non-terminal) symbols.
We shall use lowercase identifiers for terminals (e.g. a, b, x) and uppercase ones for non-terminals (e.g. S, E, U).
We shall identify both kinds of symbols by natural numbers in Coq (in practice, only finitely many natural number identifiers will be used in a grammar) as shown in the next definition.

Definition tsym := nat.
Definition nsym := nat.

Inductive sym := T : tsym -> sym | NT : nsym -> sym.

Definition grammar : Set := list (nsym * list sym).

Derivations

Grammars define languages, through the notion of derivation: starting from an initial list of symbols u, one can repeatedly expand occurrences of any non-terminal S into some w such that the grammar contains the production S ~> w. The terminal words obtained in this way form the language of u.
One does not loose anything by restricting to left derivations, i.e. derivations where only leftmost non-terminals are expanded. It will be more convenient for us to work with this notion, which is defined next.

Inductive deriv (g:grammar) : list sym -> list sym -> Prop :=
  | deriv_nil : deriv g nil nil
  | deriv_refl : forall l w s, deriv g l w -> deriv g (s::l) (s::w)
  | deriv_NT :
      forall exp l w u,
      In (u,exp) g ->
      deriv g (exp ++ l) w ->
      deriv g (NT u :: l) w.

Parsing and LL(1) grammars

Given a terminal word, we want to efficiently determine whether the word belongs to the language of the initial non-terminal NT 0. There is an obvious non-deterministic algorithm doing this, but a linear-time deterministic algorithm is desirable.
We will work on the class of LL(1) grammars, where such parsers exists, and moreover the parsing function only needs to look at the next terminal to determine which productions to use.
The example grammar ABAB.g is not LL(1) because we need to look at the next two symbols to know where we should use S ~> A or S ~> B. However, Dyck.g is.
We define next some notions that are useful to determine when a grammar is LL(1), and build parsers for them.

Section LL.

All definitions in this section will be parameterized by G.
  Variable G : grammar.

A list of symbols is nullable if it can derive the empty word.
  Definition Nullable e := deriv G e nil.

First a e means that a word in the language of e starts with the terminal a.
  Definition First a e := exists v, deriv G e (T a :: v).

A list of symbols is reachable if it is the suffix of a list that can be derived from the initial non-terminal.
  Definition Reachable u := exists w, deriv G (NT 0 :: nil) (w ++ u).

  Definition Follow a U := exists v, Reachable (NT U :: v) /\ First a v.

  Definition End U := exists v, Reachable (NT U :: v) /\ Nullable v.

The high-level idea behind LL(1) grammars is that, when attempting to determine whether w belongs to the language of NT U :: l, we want to be able to determine a unique production U ~> e which can make this work. When two productions could work, we say that there is a conflict.
Conflicts can happen in several cases, for instance when w = a::_ and there are productions U ~> e and U ~> e' such that First a e and First a e'. It also happens when First a e, Nullable e' and Follow a U: intuitively, we could obtain a from e, but we could also get rid of e' and then obtain a from l because it may follow U.
A grammar is strongly LL(1) when there is no distinct productions U ~> e and U ~> e' that are in conflict, in the following sense: instead of definition a conjunction of conditions, we use a record, so that conjuncts get names. When r : no_conflict u e e', nullable_nullable r : Nullable e -> Nullable e' -> e = e'.

  Record no_conflict u e e' : Prop := {
    first_first : forall a, First a e -> First a e' -> e = e' ;
    first_follow : forall a, First a e -> Nullable e' -> Follow a u -> e = e' ;
    nullable_nullable : Nullable e -> Nullable e' -> e = e'
  }.

  Definition LL_1 :=
    forall u e e', In (u,e) G -> In (u,e') G -> no_conflict u e e'.

End LL.

Common utilities

You can put below some lemmas, definitions, or tactics that you use in several other files. You cannot leave admitted stuff, axioms, variables... since this would break the ability to Load grammar.
Tactic for proving membership in an explicit list.
Ltac auto_In := repeat (try apply in_eq ; apply in_cons).