"A document's marking labels describe its structure" (Charles Goldfarb).
The SGML Language
SGML stands for "Standard Generalized Markup Language". "Markup" refers to the assignment of attributes to the different parts of a document. An attribute is a "name-value" pair, where "name" identifies the attribute and "value" is the value associated with it.
SGML is a generic language that was created to describe the logical structure of document content, but its scope is universal and extends beyond the world of documentation. Although applications of the standard have so far focused on a single kind of information structure (document content), the language is generic and is therefore potentially applicable to other information structures, including databases. SGML has now been freed from document-oriented concepts, so it can describe any information structure conceptually, independently of the application or tool used.
SGML is really a metalanguage, since it is a language for defining specific markup languages. A specific markup language has its specific vocabulary and a syntax that defines the relationships between its elements.
The logical structure of a document
The description of the logical structure of a document is based on identifying the types of its component elements (chapters, sections, paragraphs, lists, etc.), their attributes, and the relationships between these elements. This description is completely independent of the processes that can later be applied to the document:
Formatting or external representation.
Search for information.
Selection of component elements.
Format conversion.
Parsers.
Editors.
Etc.
This is what is called "Open Information Management" (OIM), that is, making information available to all types of applications.
The identification of the different elements of a document is done by marking them with tags. These tags have the following characteristics:
They are enclosed in angle brackets in order to differentiate them from the text.
The beginning of each element is indicated by an initial tag, which contains the element type (called the generic identifier, GI) and optionally one or more attributes specifying defining characteristics of the element.
Example: <cap nro=19>
In this case, the GI is "cap" (chapter) and "nro" (number) is an attribute, whose associated value is 19.
Actually, the GI can be considered as just another attribute, whose value is precisely the name of the element type (GI=cap). In this way, the syntax would be more homogeneous:
<GI=cap nro=19>
There is an optional attribute called ID. It is a unique identifier, which serves to distinguish the element from all others having the same GI.
GI and ID are primary attributes. The rest of the attributes are secondary.
The end of an element is indicated by an end tag, starting with "</", followed by the GI and ">". Examples:
</p> (end of paragraph)
</cap> (end of chapter)
<p>This is a text</p> (a paragraph)
An element can contain other elements. An element can even contain another element of the same type (GI). The level of nesting is deduced from the position of the element relative to the surrounding start and end tags. Example:
<table nro=1> (start 1st table)
<item>item 1</item>
<item>item 2</item>
<table nro=2> (start 2nd table inside 1st table)
<item>item A</item>
<item>item B</item>
</table> (end of the second table)
</table> (end of the first table)
Sometimes it is not necessary to indicate the end of an element, since it is implied by the appearance of another element following it. Example:
<p>paragraph
<p>another paragraph (the previous paragraph is supposed to have ended)
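The tag rules listed above (a GI plus optional attributes in the start tag, nesting deduced from start and end tags, and an end tag that may be implied) can be illustrated with a small Python sketch. It is not a real SGML parser: the regular expressions, the function names and the assumption that only <p> may omit its end tag are choices made for this illustration.

```python
import re

# Matches a start tag (<gi name=value ...>) or an end tag (</gi>).
TAG = re.compile(r"<(/?)([A-Za-z][\w.-]*)([^<>]*)>")
# Matches one name=value pair; the value may be quoted or bare.
ATTR = re.compile(r"""([A-Za-z][\w.-]*)\s*=\s*("[^"]*"|'[^']*'|[^\s"'>]+)""")

# Assumption for this sketch: only <p> may have its end tag omitted.
IMPLIED_END = {"p"}


def parse_attributes(text):
    """Return the name=value pairs of a start tag as a dictionary."""
    return {name: value.strip("\"'") for name, value in ATTR.findall(text)}


def outline(document):
    """Return (depth, GI, attributes) for every element, in document order."""
    open_elements = []  # stack of GIs whose end tag has not been seen yet
    result = []
    for closing, gi, rest in TAG.findall(document):
        if closing:
            # An end tag closes the innermost open element with that GI,
            # together with anything still open inside it.
            if gi in open_elements:
                while open_elements.pop() != gi:
                    pass
            continue
        if gi in IMPLIED_END and open_elements and open_elements[-1] == gi:
            # A new <p> implies that the previous <p> has ended.
            open_elements.pop()
        result.append((len(open_elements), gi, parse_attributes(rest)))
        open_elements.append(gi)
    return result


sample = """
<table nro=1>
<item>item 1</item>
<table nro=2>
<item>item A</item>
</table>
</table>
<p>paragraph
<p>another paragraph
"""

for depth, gi, attributes in outline(sample):
    print("  " * depth + gi, attributes)
```

The stack is what turns the position of the tags into a nesting level: the depth of an element is simply the number of elements still open when its start tag appears.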
Generalized markup vs. specific markup
Specific markup consists of inserting, within the text of a document, controls related to a specific process. For example, in the case of a document formatting process (text presentation):
.NL 1 (skip a line).
.SA 4 (indent 4).
Specific markup, also called procedural markup, has the following disadvantages:
It is rigid. If you decide to change the style of the document, you have to revise all the controls.
It is a slow and delicate process that leads to errors.
It requires training, even when macros (derived controls) are defined: although macros simplify the markup, they must be added to the vocabulary of primitive controls.
The different elements or components of a document are not clearly differentiated, which can lead to ambiguities regarding controls.
Generalized markup, on the other hand, offers:
Simplicity.
Markup is a simple operation in which it is difficult to make mistakes, because all it does is describe the structure of the document.
Clarity.
A document with generalized markup makes it easy to understand its structure.
Accessibility.
The elements of the documents are identified, thus making them available and accessible information resources, with the information released and not hidden.
Openness to applications.
Accessibility of information opens the document to multiple applications.
Abstraction.
It separates the logical structure of the document from its content (the text).
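As a rough illustration of "openness to applications", the following Python sketch feeds one structurally marked-up document to two unrelated processes, a formatter and a search. The sample document, its element names and both functions are hypothetical, and XML syntax with Python's standard xml.etree.ElementTree parser is used only for convenience.

```python
import xml.etree.ElementTree as ET

# Hypothetical document: the tags describe structure only, never appearance.
SOURCE = """
<chapter nro="19">
  <title>Markup languages</title>
  <p>Generalized markup describes structure.</p>
  <p>Specific markup describes a process.</p>
</chapter>
"""


def render_plain_text(chapter):
    """One application: format the chapter for display."""
    lines = []
    for element in chapter:
        if element.tag == "title":
            lines.append(element.text.upper())
        elif element.tag == "p":
            # The indentation is decided here, not inside the document.
            lines.append("    " + element.text)
    return "\n".join(lines)


def find_paragraphs(chapter, word):
    """Another application: search the paragraphs for a word."""
    return [p.text for p in chapter.iter("p") if word.lower() in p.text.lower()]


chapter = ET.fromstring(SOURCE.strip())
print(render_plain_text(chapter))
print(find_paragraphs(chapter, "process"))
```

Changing the presentation (for example, the paragraph indentation) only touches render_plain_text; the document and the search are unaffected, which is the rigidity that procedural controls such as ".SA 4" cannot avoid.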
Definition of document types
In SGML, a document type is defined by means of a DTD (Document Type Definition), which specifies the structural constraints imposed on documents of a certain type. There are DTDs defined for military applications, for the aeronautical industry, for large corporations, etc.
DTDs are written in a variant of regular expression notation [see Applications - Linguistics - Formal Grammars and Regular Expressions], in which the following symbols are used:
Symbol   Meaning
&        Union of elements of a set (elements in any order)
,        Separation of elements of a sequence
( )      Grouping of elements
|        Alternative elements
?        Optional element
*        Repetition of an element zero or more times
+        Repetition of an element one or more times
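Because this notation is so close to regular expressions, a content model can be checked with an ordinary regex engine. The Python sketch below is only an illustration under stated assumptions: the function names and the sample model "(title, p+, (list | table)*)" are hypothetical, and the "&" connector is deliberately left out because it has no direct regex equivalent.

```python
import re

# Element names, or one of the symbols of the content-model notation.
TOKEN = re.compile(r"[A-Za-z][\w.-]*|[(),|?*+&]")


def content_model_to_regex(model):
    """Translate a content model such as "(title, p+)" into a compiled regex."""
    parts = []
    for token in TOKEN.findall(model):
        if token == "&":
            raise NotImplementedError('the "&" connector is not handled in this sketch')
        if token == "(":
            parts.append("(?:")              # grouping, without capturing
        elif token == ",":
            continue                         # a sequence is plain concatenation
        elif token in ")|?*+":
            parts.append(token)              # same meaning as in regex notation
        else:
            # An element name, encoded as the name followed by one space.
            parts.append("(?:" + re.escape(token) + " )")
    return re.compile("".join(parts))


def conforms(children, model):
    """Check whether a list of child element names satisfies the content model."""
    encoded = "".join(name + " " for name in children)
    return content_model_to_regex(model).fullmatch(encoded) is not None


MODEL = "(title, p+, (list | table)*)"
print(conforms(["title", "p", "p", "list"], MODEL))
print(conforms(["p", "title"], MODEL))
```

Each child element is encoded as its name followed by a space, so that a short name such as "p" can never match part of a longer one.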
Limitations of SGML
It has been said that SGML walks between humanism and science. Indeed, the document encoding system is very intuitive and easy to understand, because it is based on the concept of attribute, which has a high semantic level. But SGML suffers from many limitations:
It is not a language in itself. It is only a syntactic system for describing or representing (in particular domains) hierarchical structures of elements defined by attributes.
It is not generic enough. It is noted that the standard was born document-oriented.
Although it is a language defined to describe concrete documents and document types, there is no parallel language for its processing, leaving this task to other languages. In this sense, the language is not complete, as it lacks the operational component.
The DTD language is a separate language, different from SGML, and it has the limitations of the regular expressions on which it relies.
No relationships can be established between elements except for those in the hierarchy.
Attributes are limited to the form "name=value". The value of an attribute cannot be composite, i.e. a sequence (or set) of values, of other attributes, etc.
There are no distributive, repetitive, conditional, recursive, etc. forms.
It does not allow automatic inferences, such as derived attributes.
The XML Language
XML (eXtensible Markup Language) is a subset of SGML, a simplified SGML published in 1998 by the W3C (World Wide Web Consortium), for use on the Internet and in all types of applications in general.
SGML is more powerful and flexible than XML. In SGML it is even possible to change the syntax of the angle brackets. But SGML is more complex and more difficult to implement than XML. SGML is currently being replaced by XML, because XML is simpler, because it integrates better with the current Web and because it is one of the technologies chosen for the future Semantic Web [see Applications - Computing - Semantic Web].
Example
A hierarchical structure of information specified using XML is, for example, the following:
XML is not a language per se. It is a formal system for representing (in particular domains) hierarchical structures of tags with values. The semantics is not formal but implicit: the meaning of an attribute is based on its name and is oriented to human consumption. What does exist is a structural semantics based on hierarchies of elements.
The syntax is simple, but not simple enough, as it repeats the tag (at the beginning and at the end).
There are no relationships between elements other than those established through the hierarchy.
Attributes are limited to the form "name = value". The value of an attribute cannot be composite, i.e. a sequence (or set) of values, of other attributes, etc.
The language is declarative, with no operational component.
There are no distributive, repetitive, conditional, recursive, etc. forms.
It does not allow inferences to be drawn.
XML Schema (XMLS)
XML Schema is a language for defining XML document types or structures, that is, for specifying which documents are syntactically valid. It has the advantage that its own syntax is also XML (as opposed to DTDs, which are not written in SGML notation).
Limitations:
The syntax is complex, rigid and unnatural.
The data types are limited, and do not constitute a robust system such as exists in programming languages.
It is an isolated language, specialized in the definition of XML document types.
It is not possible to define schemas of schemas, i.e. hierarchies of document types.
Data type inheritance (adopted from object-oriented programming languages) is very complex.
It is not possible to generate document types dynamically, since there is no base language capable of generating them.
The universality of XML
XML is currently being applied to the specification of all kinds of information structures. This is, at the very least, a debatable approach, since claiming to use XML "for everything" leads to an inconsistency similar to claiming in OOP (Object Oriented Programming) that "everything is an object". For example,
In OOP the expression a+b is interpreted as the message +b addressed to the object a.
In XML, this same expression would be represented:
In the first case a semantic error (of interpretation) is committed and in the second case an error of representation, by adopting unnecessary complexity.
It should be borne in mind that semantics comes first and syntax second, and that the syntax should be as simple, readable and adequate as possible and should evoke the associated semantics.
MENTAL as a Generalized Markup Language
MENTAL provides a complete language for the generalized markup philosophy, making it unnecessary to use a special language for this field. It can be applied to the specification of all types of information structures, overcoming the limitations of SGML (and its simplified version XML).
Markup with MENTAL is mainly done with the primitive "/" (particularization):
The primary expression can be any type (a sequence, a set, a generic expression, etc.), but it is usually a sequence or a set.
The secondary expression in general can be any expression, but usually an attribute name is sufficient.
There may be hierarchies of attributes.
The attribute name need not be repeated, because the scope of an attribute (or set of attributes) is given by the primary expression it applies to.
Example
In MENTAL, the above example would be specified as follows:
If you want to restrict the values associated with the parameters, you would have to include conditions. For example, length, type (numeric, alphanumeric, date, etc.), ranges of values, etc.
Use of XML syntax
We can, if we wish, use XML syntax. To do so, we can use the following definition:
〈( <y>x</y> =: y/x )〉
As the angle brackets and the slash have meaning in MENTAL, they should be differentiated by, for example, another color. Potential substitution has been used, so representation is being indicated. For example, the HTML statement
<p height=12 color=blue font=Arial>text</p>
represents in MENTAL the expression
text/(height/12 color/blue font/Arial)
Advantages of MENTAL as a generalized markup language
The notation is homogeneous, much simpler and more expressive.
No need for different and specific languages: SGML, DTD, XML and XMLS.
The distinction with a (hierarchical) record of a database is diluted.
Paradoxically, although MENTAL is a very simple language (conceptually and syntactically), its possibilities are much greater than those of SGML and XML, since it has all the capabilities of a complete language: creating higher-order types, making queries, assigning names, accessing attributes, making modifications, deferred evaluation, automatic inferences, using generic, shared, virtual and linked expressions, etc.
Addenda
Origin of SGML
Charles F. Goldfarb, together with Edward Mosher and Raymond Lorie, invented GML (the letters are also the initials of their last names), the precursor of SGML, at IBM in 1969. With it they introduced the concept of "markup" as a means of structuring and sharing the content of a document between different applications. In 1974 SGML was born as an evolution of GML, although it took more than a decade before it was fully developed and standardized. SGML has been an international standard since 1986 (ISO 8879).
The standard does not define tags, although a basic set appears as an annex to ISO 8879, in which examples of application of the language appear. Today, SGML, although a widely accepted and widespread international language for information exchange, has been replaced in practice by XML, as it is simpler.
Some developments in SGML are:
HTML (Hypertext Markup Language) is an SGML application, defined by a DTD, used to encode web pages. It is a language oriented to document presentation and to the definition of hyperlinks between documents. There is an XML version of HTML called XHTML, that is, the HTML language with XML syntax.
The TCIF (Telecommunications Industry Forum) has adopted SGML in the documentation of telecommunications systems and equipment.
The computer industry is applying SGML encoding for on-line documentation.
The Association of American Publishers designed a set of tags for books and magazines that has become an ANSI standard.
The aviation industry and airline association has developed a set of SGML tags for aircraft maintenance and operating manuals.
The TEI (Text Encoding Initiative) is an international effort to standardize the encoding of literary texts.
SDIF (SGML Document Interchange Format) is an ISO 9069 (1988) standard for SGML document interchange in an open systems environment.
A complete description of SGML can be found in [Goldfarb, 1991], written by the main driving force behind the language. A more practical approach is provided by [Herwijnen, 1994]. [Wright, 1992] explains SGML as a technique for freeing information.
Bibliography
Castro, Elizabeth. Guía de aprendizaje XML. Pearson Education, S.A., 2001.
Clark, James. Comparison of SGML and XML. Internet.
Goldfarb, Charles F. The SGML Handbook. Oxford University Press, 1991.
Goldfarb, Charles F. The XML Handbook. Prentice Hall, 2003.
Goldfarb, Charles F. A Generalized Approach to Document Markup. SIGPLAN Notices, June 1981.
Goldfarb, Charles; Prescod, Paul. Manual de XML. Prentice Hall Iberia, 1999.
González, Oscar. XML. Anaya Multimedia, 2001.
Herwijnen, Eric van. Practical SGML. Springer, 1994.
Wright, Haviland. SGML Frees Information. BYTE, June 1992.