XML TEI Procedure
XML (eXtensible Markup Language) is a markup format for describing the structure of normalized data. It is particularly well suited to the critical edition of documents, because a single file contains the original text, the tags that mark the annotations, and the full description of those annotations.
As a result, all the work carried out on the project is embedded in the XML documents. All the information, from the creation of the original resource through to the final annotations, can be accessed from the XML document, which facilitates data exchange and processing.
The TEI (Text Encoding Initiative) consortium develops and maintains a standard for the representation of texts in digital form. The format is widely used by researchers to present their data to the community and ensure its durability.
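To illustrate how a single XML document can carry both the text and its annotations, here is a minimal sketch in Python. The fragment is hypothetical and deliberately simplified: it uses the TEI namespace and common TEI element names (`teiHeader`, `p`, `seg`), but it is not taken from the Aliento corpus, it is not a complete schema-valid TEI file, and the `ana` value `#virtue` is an invented label.

```python
import xml.etree.ElementTree as ET

# Namespace used by TEI documents.
TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

# Hypothetical, simplified fragment (not an actual Aliento file).
SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader>
    <fileDesc>
      <titleStmt><title>Hypothetical sample text</title></titleStmt>
    </fileDesc>
  </teiHeader>
  <text>
    <body>
      <p>Silence is <seg type="annotated" ana="#virtue">the garment of the wise</seg>.</p>
    </body>
  </text>
</TEI>"""


def extract(xml_source):
    """Return the title, the plain text, and the annotated segments."""
    root = ET.fromstring(xml_source)
    title = root.find(".//tei:title", TEI_NS).text
    paragraph = root.find(".//tei:p", TEI_NS)
    # itertext() walks the text inside and around the tags, so the
    # original wording is recovered without the markup.
    plain_text = "".join(paragraph.itertext())
    # Each <seg> carries an annotation reference in its "ana" attribute.
    annotations = [
        (seg.get("ana"), seg.text)
        for seg in paragraph.iterfind("tei:seg", TEI_NS)
    ]
    return title, plain_text, annotations


title, plain_text, annotations = extract(SAMPLE)
print(title)
print(plain_text)
print(annotations)
```

The same file yields the unannotated reading text and the list of annotated segments, which is the property that makes the format convenient for exchange and processing.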
Tagging in Unicode / UTF-8
The Aliento corpus consists of texts written in both ancient alphabets (original texts) and modern ones (annotations). Several alphabets and two different reading directions can coexist in the same document.
We had to use a character encoding system that makes it possible to represent all these characters in the same document:
Unicode is the computing standard used for the representation and manipulation of text. In its present version, it can represent 136,755 characters and 139 scripts.
UTF-8 is the Unicode encoding we have used. It covers the project's needs for representing the different scripts and their reading directions.
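The coexistence of scripts and reading directions in one string can be checked with a short Python sketch. The three sample characters (Latin, Hebrew, Arabic) are chosen for illustration only and are an assumption, not a statement about which scripts the corpus contains. The standard `unicodedata` module reports each character's name and bidirectional class (`L` for left-to-right, `R` and `AL` for right-to-left).

```python
import unicodedata

# Illustrative sample: Latin "a", Hebrew alef, Arabic alef.
sample = "a\u05d0\u0627"


def describe(text):
    """Return (character, name, bidi class, UTF-8 byte length) per character."""
    rows = []
    for ch in text:
        rows.append((
            ch,
            unicodedata.name(ch),
            unicodedata.bidirectional(ch),
            len(ch.encode("utf-8")),
        ))
    return rows


for ch, name, bidi, size in describe(sample):
    print(f"U+{ord(ch):04X}  {name:<22} bidi={bidi:<2} utf8_bytes={size}")

# Encoding to UTF-8 and decoding again returns the same string.
assert sample.encode("utf-8").decode("utf-8") == sample
```

The direction of each character is a property defined by Unicode itself, so a UTF-8 file does not need extra markup just to record whether a given letter reads left to right or right to left.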