Text Encoding Initiative

The Text Encoding Initiative (TEI) is a text-centric community of practice in the academic field of digital humanities, operating continuously since the 1980s. The community currently runs a mailing list, meetings and a conference series, and maintains an eponymous technical standard, a journal, a SourceForge repository and a toolchain.

The TEI Guidelines, which collectively define an XML format, are the defining output of the community of practice. The format differs from other well-known open formats for text (such as HTML and OpenDocument) in that it is primarily semantic rather than presentational; the semantics and interpretation of every tag and attribute are specified. The Guidelines define some 500 different textual components and concepts (word, sentence, character, glyph, person, etc.); each is grounded in one or more academic disciplines, and examples are given.
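The difference between semantic and presentational markup can be illustrated with a minimal sketch. The fragment below is invented for illustration and is not a complete TEI document (the mandatory header is omitted); the helper name names_by_kind is hypothetical. Because the tags say what a string is rather than how it should look, a program can query the text by meaning:

```python
import xml.etree.ElementTree as ET

# The TEI namespace; the "tei" prefix is only a local convenience.
NS = {"tei": "http://www.tei-c.org/ns/1.0"}

# Invented sample fragment (header omitted, so not a valid TEI document).
SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <text>
    <body>
      <p><persName>Margaret Cavendish</persName> published in
         <placeName>London</placeName>.</p>
    </body>
  </text>
</TEI>"""


def names_by_kind(tei_xml):
    """Collect the text of personal and place names, keyed by kind."""
    root = ET.fromstring(tei_xml)
    return {
        "persons": [el.text for el in root.iterfind(".//tei:persName", NS)],
        "places": [el.text for el in root.iterfind(".//tei:placeName", NS)],
    }


print(names_by_kind(SAMPLE))
```

An HTML rendering of the same sentence would typically mark both names up identically (for example in italics), leaving no reliable way to tell a person from a place.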

The standard is split into two parts: a discursive textual description with extended examples and discussion, and a set of tag-by-tag definitions. Schemata in most of the modern formats (DTD, RELAX NG and W3C XML Schema) are generated automatically from the tag-by-tag definitions. A number of tools support the production of the Guidelines and their application to specific projects.

A number of special tags are used to work around restrictions imposed by the underlying Unicode text model: glyph allows the representation of characters that do not qualify for inclusion in Unicode, and choice overcomes the required strict linearity by grouping alternative encodings of the same stretch of text.
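As a rough sketch of how choice sidesteps linearity, the fragment below records both the original misspelling (sic) and an editorial correction (corr) at the same point in the text, and a small function flattens the paragraph to one reading or the other. The sample sentence and the function name read are invented for illustration, and sic/corr is only one of the pairs that choice can group.

```python
import xml.etree.ElementTree as ET

# Element names in ElementTree carry the namespace in braces.
TEI = "{http://www.tei-c.org/ns/1.0}"

# Invented sample: one paragraph with two alternative readings of a word.
SAMPLE = (
    '<p xmlns="http://www.tei-c.org/ns/1.0">I have '
    "<choice><sic>recieved</sic><corr>received</corr></choice>"
    " your letter.</p>"
)


def read(element, prefer):
    """Flatten an element to text, keeping only the preferred alternative
    (for example "sic" or "corr") inside each <choice>."""
    parts = [element.text or ""]
    for child in element:
        if child.tag == TEI + "choice":
            # Descend only into the alternative the caller asked for.
            for alternative in child:
                if alternative.tag == TEI + prefer:
                    parts.append(read(alternative, prefer))
        else:
            parts.append(read(child, prefer))
        # Text following the child belongs to the parent's content.
        parts.append(child.tail or "")
    return "".join(parts)


paragraph = ET.fromstring(SAMPLE)
print(read(paragraph, "sic"))
print(read(paragraph, "corr"))
```

Both readings live in a single file, so a diplomatic transcription and a corrected reading text can be produced from the same source.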

Most users of the format do not use the complete range of tags but produce a customization, using a project-specific subset of the tags and attributes defined by the Guidelines. The TEI defines a sophisticated customization mechanism known as ODD ("One Document Does it all") for this purpose. In addition to documenting and describing each TEI tag, an ODD specification specifies its content model and other usage constraints, which may be expressed using Schematron.
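A customization of this kind might look like the small schemaSpec below, which selects whole modules and, for one module, only a few elements. This is a sketch under assumptions: the identifier myproject and the particular selection are invented, and the Python code merely reads the declaration; generating actual schemata from an ODD is the job of the TEI toolchain, not of this snippet.

```python
import xml.etree.ElementTree as ET

NS = {"tei": "http://www.tei-c.org/ns/1.0"}

# Hypothetical customization: three whole modules plus three elements
# picked from the "core" module.
ODD = """<schemaSpec xmlns="http://www.tei-c.org/ns/1.0"
            ident="myproject" start="TEI">
  <moduleRef key="tei"/>
  <moduleRef key="header"/>
  <moduleRef key="textstructure"/>
  <moduleRef key="core" include="p hi name"/>
</schemaSpec>"""


def summarize_modules(odd_xml):
    """Map each referenced module to the elements selected from it,
    or to the string "all" when no include list is given."""
    spec = ET.fromstring(odd_xml)
    summary = {}
    for ref in spec.iterfind("tei:moduleRef", NS):
        include = ref.get("include")
        summary[ref.get("key")] = include.split() if include else "all"
    return summary


print(summarize_modules(ODD))
```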

Project | Strengths
British National Corpus | 100 million word snapshot of current English
Oxford Text Archive | >1 GB of linguistic data and electronic texts in 25 languages
Perseus Project | Greek and Latin texts
EpiDoc | Epigraphy and papyrology
Women Writers Project | Early modern women writers (Margaret Cavendish, Eliza Haywood, etc.)
New Zealand Electronic Text Centre | New Zealand and Pacific Islands texts
The SWORD Project | Bible software, dictionaries, Christian literature
FreeDict | Bilingual dictionaries
Text Creation Partnership | Early English and American books

  • 1987 Work on what would become the TEI was started by the Association for Computers and the Humanities, the Association for Computational Linguistics, and the Association for Literary and Linguistic Computing. This culminated in the Closing Statement of the Vassar Planning Conference.
  • 1994 TEI P3 released, co-edited by Lou Burnard (at Oxford University) and Michael Sperberg-McQueen (then at the University of Illinois at Chicago, later at the W3C).
  • 1999 TEI P3 updated.
  • 2002 TEI P4 released, moving from SGML to XML; adoption of Unicode, which XML parsers are required to support.
  • 2007 TEI P5 released, including integration with the xml:lang and xml:id attributes from the W3C (these had previously been attributes in the TEI namespace), regularization of local pointing attributes to use the hash sign (#), as in HTML, and unification of the ptr and xptr tags. Together with many other additions, these changes make P5 more regular and bring it closer to current XML practice as promoted by the W3C and as used by other XML vocabularies. Maintenance and feature update versions of TEI P5 have been released at least twice a year since 2007.
  • 2011 TEI P5 v2.0.1 released with support for genetic editing. Among many other additions, the genetic editing features allow encoding of texts without interpretation as to their specific semantics.
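
The P5 pointing conventions mentioned in the 2007 entry (xml:id on the target, a hash-prefixed reference on the pointer) can be sketched as follows. The fragment and the function name resolve_local_pointers are invented for illustration, and the sketch assumes each target attribute holds a single local pointer.

```python
import xml.etree.ElementTree as ET

NS = {"tei": "http://www.tei-c.org/ns/1.0"}
# xml:id lives in the predefined XML namespace, not the TEI one.
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

# Invented sample: the second paragraph points at the first.
SAMPLE = """<body xmlns="http://www.tei-c.org/ns/1.0">
  <p xml:id="p1">First paragraph.</p>
  <p>See <ptr target="#p1"/> for context.</p>
</body>"""


def resolve_local_pointers(root):
    """Map each local '#id' pointer on a <ptr> to the element carrying
    that xml:id, or to None if no such element exists."""
    by_id = {el.get(XML_ID): el for el in root.iter() if el.get(XML_ID)}
    resolved = {}
    for ptr in root.iterfind(".//tei:ptr", NS):
        target = ptr.get("target", "")
        if target.startswith("#"):
            resolved[target] = by_id.get(target[1:])
    return resolved


body = ET.fromstring(SAMPLE)
for pointer, element in resolve_local_pointers(body).items():
    print(pointer, "->", element.text if element is not None else None)
```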


