Formal Grammars and Regular Expressions

FORMAL GRAMMARS
AND REGULAR
EXPRESSIONS

"In essence, grammar is one and the same in all languages, though it varies accidentally" (Roger Bacon).

"The totality of propositions is language" (Wittgenstein, Tractatus 4.001).

"To understand a language means to master a technique" (Wittgenstein, Philosophical Investigations).

Formal Grammars
A formal grammar is a set of production rules that specify the possible character strings that define a formal language. Obviously, a formal grammar only describes the forms of character strings, since a language cannot express its own semantics.

The production rules are of the form P → Q or else P :: = Q, where P and Q (the left and right sides, respectively, of the rule), in general, character strings (terminal symbols) and non-terminal symbols. Examples:

S → bS S → c
represent the language {c, bc, bbc, bbbc, bbbbc, ...}.
S → aBc B → u B → uB
represent the language {auc, auuc, auuuc, auuuuc, ...}.

Rule features:

Non-terminal symbols are specified by uppercase letters and terminal symbols by lowercase letters.
One of the rules of the grammar must include on its left side an initial non-terminal symbol from which potentially all character strings in the language are generated, and is usually represented by S.
The same non-terminal symbol may appear on the left side of several rules.
The meaning of a rule P → Q is that P is defined as Q, or that P represents or generates Q. In turn, the symbols of Q can appear as antecedents in other rules, repeating this process until there are only non-terminal symbols. This is precisely the advantage of production rules: they can be recursive. Recursion makes it possible to describe an infinite language (of infinite strings of characters).
Recursion can be direct, as in the case of rule A → aA or indirectly (through other rules) as A → aB and B → cA.

A grammar can be considered the "generator" of all possible character strings in the language. A grammar can also be used to "recognize" whether or not a given string belongs to a given language. These are two contrary processes: one forward and one backward.

The language of grammars (the production rules) is really a metalanguage that allows describing particular languages.

Chomsky's hierarchy

The concept of generative grammar was introduced by Noam Chomsky in 1956. Because of its abstract character, it formalized linguistics.

The so-called "Chomsky hierarchy" is a classification of formal languages based on the different types of formal grammars that generate formal languages. This hierarchy consists of 4 levels, from greater to lesser generality, and where each level encompasses the lower ones:

Type 0 grammars (without restrictions). They include all formal grammars.
Type 1 grammars (context-sensitive grammars), which generate context-sensitive languages. The rules are of the form αAβ → αγβ, with A being nonterminal and α. β and γ strings of terminal and non-terminal symbols. The strings α and β can be empty (∧).
Type 2 grammars (context-free grammars), which generate context-independent languages. The rules are of the form A → γ, with A being non-terminal and γ being a string of terminals and non-terminals.
Type 3 grammars (regular grammars), which generate regular languages. The production rules are of the form A → a, A → ∧ and A → aB or A → aB, where a is a terminal symbol, and A and B are non-terminal symbols.

Regular Expressions
Regular expressions, first studied by Stephen Kleene [1956], are used to describe languages. A language is a set of strings of symbols of a certain alphabet Σ. The language associated with a regular expression r is denoted by L(r) and the corresponding set is called a "regular set".

Two regular expressions r and s are said to be equivalent if L(r) = L(s).

The rules that define regular expressions on a Σ alphabet are of two types: initial rules and combinatorial rules.

Initial rules

(empty string) is a regular expression representing the set {∧}.
If a is a symbol of Σ, a is a regular expression representing the set {a}.
If r is a regular expression designating L(r), then (r) is another regular expression that is included in L(r).

Combinatorial rules

Regular expressions are formed by the following combinatorial mechanisms: Alternate, Repetition, and Concatenation, along with a Distribution mechanism associated with Alternate.

Alternate.
The "|" symbol is used.

If r and s are regular expressions, then r|s is another regular expression representing the union of L(r) and L(s): L(r) ∪ L(s).

Examples:

r = a|b|c L(r) = {a, b, c}
s = a|b|d L(s) = {a, b, di>}
r|s = a|b|c|d L(r|s) = L(r) ∪ L(s) = {a, b, c, d}
Repetition.
The symbol "*" (as superscript) is used to indicate zero or more repetitions of the concatenated regular expression. Zero repetitions is the empty string.

If r is a regular expression, r* is the regular expression
Repetition is a particular case of alternative, in this case to describe alternatives of infinite length.

Examples:
Concatenation.
There is no explicit operator. If r and s are regular expressions. then rs is another regular expression. For example, if a, b and c are &Sigma symbols, then abc is the string formed by the concatenation of the symbols a, b and c.

The concatenation operation is distributive with respect to the alternative (expressed explicitly by the "|" operator or implicitly by the "*" operator).

Examples:

Operator priority

In order to save parentheses the following conventions are used:

The priorities, from highest to lowest, are: "*", concatenation and "|".
The three operations are associative on the right.

For example, a|b*c is interpreted as a|((b*)c ), i.e., the set of strings consisting of either a, or zero or more b's followed by c:

bbc

bbbc

Properties

Property	Expressions
Commutative	r\|s = s\|r
Associative	(rs)t = r(st)< br>(r\|s)\|t = r\|(s\|t)
Distributive	r(s\|t) = rs\|< i>rt (r\|s)t = rt\|st
Idempotence	r** = r*
Repetition of Aletrnative	(a\|b)* = (ab)*
Empty string	r∧ = ∧r = r ∧* = ∧ (r\|∧)* = r*
Equal elements	r\|r = r

Notation extension

In order to simplify the specification of certain frequently occurring constructions the following symbols are used:

⁺ ("+" sign as superscript): indicates repetition of one or more times:
It follows that
Exponents. They are used to indicate repetition of a regular expression a finite number of times:
?: indicates optional appearance.

r? is an abbreviation of r|&∧. Therefore, L(r?) = L(r)∪{∧}
-: indicates range, together with square brackets.
By extension, [abc] is equivalent to a|b|c

Examples:

Regular expression and grammar

Any language that can be described by a regular expression can also be described by a regular grammar. For example, the regular expression a*b* can be expressed by the regular grammar

Limitations of regular expressions

For the purposes of describing a language, regular expressions have the advantage of providing a more compact and understandable notation than a regular grammar. But they have limitations:

You cannot express recursive definitions, as in grammars.
They do not allow describing more complex forms, such as those with certain restrictions. For example, balanced or nested forms such as the set of all strings of balanced parentheses:
But it could be described with a grammar:
Another example would be strings of the form wcw, being w = a*b*, but keeping the same string on both sides of c.
Exponent ranges (repetitions) cannot be expressed.
You cannot specify a regular expression using the names of other regular expressions.

Regular definitions

In order to overcome the latter limitation, so-called "regular definitions" are used: regular expressions are given names, using the same syntax as the rules of a grammar, and these names are used as if they were symbols.

To distinguish symbols from names, the latter are usually written in bold type. For example, Pascal identifiers are defined as follows:

letter

digit

letter

digit

Specification in MENTAL of Formal Grammars and Regular Expressions
Formal Grammars

A production rule is specified by a deferred substitution expression, which indicates representation. For example, the rules

aBc

representing language {auc, auuc, auuuc, auuuuc, ...}, are expressed as a grammar (g):


( g = {(S =: aBc) (B =: u) (B =: uB)} )

Regular expressions

Empty string.

(∧ =: θ)
(the symbol ∧, the logical conjunction operator, is redefined)
Concatenation.

⟨( rs = (r s) )⟩
Alternative.

⟨( r|s = {r s} )⟩

This definition fulfills the commutative and associative properties, as well as the distributive:

⟨( r(s|t) = {(r s)|(r t)} )⟩ ⟨( (s|t)r = {(s r)|(t r)} )⟩
Repeat 0 or more times.

⟨( r* = {∧ r★[1...]} )⟩
Repeat 1 or more times.

⟨( r* = {r★[1...]} )⟩
Exponents (repeat n times).

⟨( r^n = r★n )⟩
(the exponentiation operator is redefined)
Optional appearance.

⟨( r? = {r ∧} )⟩
Numerical range.

⟨( [n1-n2] = n1...n2 )⟩
Alphabetical range.

(al = ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnmnopqrstuvwxyz)

(A =: al\1) (B =: al\2) ...

A...D (rep. (al\1 ... al\4) rep. ( al\[1...4] ) rep. ABCD
Regular expression names.

(name = r) or (name =: r)

Advantages of MENTAL to represent languages

MENTAL is a universal language, a parent language or metalanguage of all particular languages. Therefore, all particular languages are connected at a deep level with the mother language. MENTAL is a boundary language and is its own metalanguage.
A complete formal language is available to express all kinds of expressions, not just strings of symbols.
The distinction between applications, grammars and regular expressions is blurred. They are manifestations of the same deep and universal language.
The universal semantic primitives of MENTAL establish the degrees of freedom, with which the limitations of production rules and regular expressions are overcome. For example, you can describe conditional expressions, functions, generic expressions, assign names to all kinds of expressions, and so on. You can also define parameterized grammars, i.e. categories of grammars.

Bibliography

Alfonseca, Manuel; Sancho, Justo; Martínez Orga, Miguel. Teoría de lenguajes y autómatas. Publicaciones R.A.E.C., 1997.
Fernández, Gregorio; Sáez Vacas, Fernando. Fundamentos de Informática. Lógica, Autómatas, Algoritmos y Lenguajes. Anaya Multimedia, 1995.
Hausser, Roland R. Foundations of Computational Linguistics. Springer-Verlag, 1999.
Isasi Viñuela, Pedro, Martínez Fernández, Paloma; Borrajo Millán, Daniel. Lenguajes, gramáticas y autómatas. Un enfoque práctico. Pearson Addison-Wesley, 2008.
Kleene, Stephen C. Representation of Events in Nerve Nets and Finite Automata. In Automata Studies. Claude Shannon and John McCarthy, eds, 1956.
Martin, John. Lenguajes formales y teoría de la computación. McGraw-Hill Interamericana, 2004.
Shannon, C.; McCarthy, J. Automata Studies. Princeton University Press, 1956.

	MENTAL
	Main Menu
	Applications
	Linguistics
	Formal Grammars and Regular Expressions