MENTAL
 Main Menu
 Applications
 Linguistics
 Formal Grammars and Regular Expressions


Formal Grammars and Regular Expressions
 FORMAL GRAMMARS
AND REGULAR
EXPRESSIONS

"In essence, grammar is one and the same in all languages, though it varies accidentally" (Roger Bacon).

"The totality of propositions is language" (Wittgenstein, Tractatus 4.001).

"To understand a language means to master a technique" (Wittgenstein, Philosophical Investigations).



Formal Grammars

A formal grammar is a set of production rules that specify the possible character strings that define a formal language. Obviously, a formal grammar only describes the forms of character strings, since a language cannot express its own semantics.

The production rules are of the form PQ or else P :: = Q, where P and Q (the left and right sides, respectively, of the rule), in general, character strings (terminal symbols) and non-terminal symbols. Examples:
  1. SbS   Sc
    represent the language {c, bc, bbc, bbbc, bbbbc, ...}.

  2. SaBc   Bu   BuB
    represent the language {auc, auuc, auuuc, auuuuc, ...}.
Rule features: A grammar can be considered the "generator" of all possible character strings in the language. A grammar can also be used to "recognize" whether or not a given string belongs to a given language. These are two contrary processes: one forward and one backward.

The language of grammars (the production rules) is really a metalanguage that allows describing particular languages.


Chomsky's hierarchy

The concept of generative grammar was introduced by Noam Chomsky in 1956. Because of its abstract character, it formalized linguistics.

The so-called "Chomsky hierarchy" is a classification of formal languages based on the different types of formal grammars that generate formal languages. This hierarchy consists of 4 levels, from greater to lesser generality, and where each level encompasses the lower ones:
  1. Type 0 grammars (without restrictions). They include all formal grammars.

  2. Type 1 grammars (context-sensitive grammars), which generate context-sensitive languages. The rules are of the form αAβ → αγβ, with A being nonterminal and α. β and γ strings of terminal and non-terminal symbols. The strings α and β can be empty (∧).

  3. Type 2 grammars (context-free grammars), which generate context-independent languages. The rules are of the form A → γ, with A being non-terminal and γ being a string of terminals and non-terminals.

  4. Type 3 grammars (regular grammars), which generate regular languages. The production rules are of the form Aa, A → ∧ and AaB or AaB, where a is a terminal symbol, and A and B are non-terminal symbols.

Regular Expressions

Regular expressions, first studied by Stephen Kleene [1956], are used to describe languages. A language is a set of strings of symbols of a certain alphabet Σ. The language associated with a regular expression r is denoted by L(r) and the corresponding set is called a "regular set".

Two regular expressions r and s are said to be equivalent if L(r) = L(s).

The rules that define regular expressions on a Σ alphabet are of two types: initial rules and combinatorial rules.


Initial rules
  1. (empty string) is a regular expression representing the set {∧}.

  2. If a is a symbol of Σ, a is a regular expression representing the set {a}.

  3. If r is a regular expression designating L(r), then (r) is another regular expression that is included in L(r).

Combinatorial rules

Regular expressions are formed by the following combinatorial mechanisms: Alternate, Repetition, and Concatenation, along with a Distribution mechanism associated with Alternate.
  1. Alternate.
    The "|" symbol is used.

    If r and s are regular expressions, then r|s is another regular expression representing the union of L(r) and L(s): L(r) ∪ L(s).

    Examples:

    r = a|b|c   L(r) = {a, b, c}
    s = a|b|d   L(s) = {a, b, di>}
    r|s = a|b|c|d   L(r|s) = L(r) ∪ L(s) = {a, b, c, d}

  2. Repetition.
    The symbol "*" (as superscript) is used to indicate zero or more repetitions of the concatenated regular expression. Zero repetitions is the empty string.

    If r is a regular expression, r* is the regular expression

      ∧|r|rr|rrr|rrrr|...

    Repetition is a particular case of alternative, in this case to describe alternatives of infinite length.

    Examples:

      a* = ∧|a|aa|aaa|...
      r = ab
      r* = ∧|ab|abab|ababab|...

  3. Concatenation.
    There is no explicit operator. If r and s are regular expressions. then rs is another regular expression. For example, if a, b and c are &Sigma symbols, then abc is the string formed by the concatenation of the symbols a, b and c.

    The concatenation operation is distributive with respect to the alternative (expressed explicitly by the "|" operator or implicitly by the "*" operator).

    Examples:

      (a|b|c)d = ad|bd|cd
      d(a|b|c) = da|db|dc
      (a|b)(c|d) = ac|ad|bc|bd
      ab* = a|ab|abb|abbb|...
      a*b = b|ab|aab|aaab|...

Operator priority

In order to save parentheses the following conventions are used:
  1. The priorities, from highest to lowest, are: "*", concatenation and "|".
  2. The three operations are associative on the right.
For example, a|b*c is interpreted as a|((b*)c ), i.e., the set of strings consisting of either a, or zero or more b's followed by c:
Properties

PropertyExpressions
Commutativer|s = s|r
Associative(rs)t = r(st)< br>(r|s)|t = r|(s|t)
Distributiver(s|t) = rs|< i>rt
(r|s)t = rt|st
Idempotencer** = r*
Repetition of Aletrnative(a|b)* = (a*b*)*
Empty stringr∧ = ∧r = r
∧* = ∧
(r|∧)* = r*
Equal elementsr|r = r


Notation extension

In order to simplify the specification of certain frequently occurring constructions the following symbols are used:
  1. + ("+" sign as superscript): indicates repetition of one or more times:

      r+ = r|rr|rrr|...

    It follows that

      r* = ∧|r+ = r+|∧
      r+ = rr* = r*r

  2. Exponents. They are used to indicate repetition of a regular expression a finite number of times:

      r2 = rr, r3 = rrr, etc.

  3. ?: indicates optional appearance.

    r? is an abbreviation of r|&∧. Therefore, L(r?) = L(r)∪{∧}

  4. -: indicates range, together with square brackets.

      [a-z] equivale a a|b|...|z

    By extension, [abc] is equivalent to a|b|c

    Examples:

      [0-9] (one digit). Equivalent to 0|1|...|9
      [Ll]os. Equivalent to Los|los

Regular expression and grammar

Any language that can be described by a regular expression can also be described by a regular grammar. For example, the regular expression a*b* can be expressed by the regular grammar
Limitations of regular expressions

For the purposes of describing a language, regular expressions have the advantage of providing a more compact and understandable notation than a regular grammar. But they have limitations:
Regular definitions

In order to overcome the latter limitation, so-called "regular definitions" are used: regular expressions are given names, using the same syntax as the rules of a grammar, and these names are used as if they were symbols.

To distinguish symbols from names, the latter are usually written in bold type. For example, Pascal identifiers are defined as follows:
Specification in MENTAL of Formal Grammars and Regular Expressions

Formal Grammars

A production rule is specified by a deferred substitution expression, which indicates representation. For example, the rules representing language {auc, auuc, auuuc, auuuuc, ...}, are expressed as a grammar (g):
Regular expressions
Advantages of MENTAL to represent languages

Bibliography