VTD-XML

Developer(s)	XimpleWare

Stable release	2.12 / Nov 19, 2015
Operating system	Portable
Platform	Java, C#, C and C++
Type	XML parser/indexer/slicer/editor library
License	GPL and Proprietary License
Website	vtd-xml.sourceforge.net VTD-XML blog

Virtual Token Descriptor for eXtensible Markup Language (VTD-XML) refers to a collection of cross-platform XML processing technologies centered on a non-extractive, "document-centric" parsing technique called Virtual Token Descriptor (VTD). Depending on the perspective, VTD-XML can be viewed as one of the following:

VTD-XML is developed by XimpleWare and dual-licensed under the GPL and a proprietary license. It was originally written in Java, but is now also available in C, C++ and C#.

Traditionally, a lexical analyzer represents tokens (the smallest indivisible units of character data) as discrete string objects. This approach is called extractive parsing. In contrast, non-extractive tokenization keeps the source text intact and describes each token by its offset and length within that text.
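The difference between the two approaches can be illustrated with a minimal sketch. It is not VTD-XML code: the class `TokenizeDemo` and its methods are hypothetical names, and it splits plain text on spaces rather than parsing XML. The extractive version allocates a new string per token, while the non-extractive version records only an offset and a length into the untouched source.

```java
import java.util.ArrayList;
import java.util.List;

public class TokenizeDemo {
    /** Extractive: every token is copied out into its own String object. */
    static List<String> extractive(String text) {
        List<String> tokens = new ArrayList<>();
        int i = 0, n = text.length();
        while (i < n) {
            while (i < n && text.charAt(i) == ' ') i++;   // skip separators
            int start = i;
            while (i < n && text.charAt(i) != ' ') i++;   // scan one token
            if (i > start) tokens.add(text.substring(start, i));
        }
        return tokens;
    }

    /** Non-extractive: every token is an {offset, length} pair into the source. */
    static List<int[]> nonExtractive(String text) {
        List<int[]> tokens = new ArrayList<>();
        int i = 0, n = text.length();
        while (i < n) {
            while (i < n && text.charAt(i) == ' ') i++;   // skip separators
            int start = i;
            while (i < n && text.charAt(i) != ' ') i++;   // scan one token
            if (i > start) tokens.add(new int[] {start, i - start});
        }
        return tokens;
    }

    public static void main(String[] args) {
        String source = "alpha beta  gamma";
        for (int[] t : nonExtractive(source)) {
            // The token text is recovered on demand from the intact source.
            System.out.println(t[0] + "," + t[1] + " -> "
                    + source.substring(t[0], t[0] + t[1]));
        }
    }
}
```

Because the source text is never modified, a token's text can be materialized only when it is actually needed.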

Virtual Token Descriptor (VTD) applies the concept of non-extractive, document-centric parsing to XML processing. A VTD record uses a 64-bit integer to encode the offset, length, token type and nesting depth of a token in an XML document. Because all VTD records are 64 bits in length, they can be stored efficiently and managed as an array.
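As a rough sketch of how four fields can share one 64-bit integer, the following packs and unpacks a record with bit operations. The field widths and positions used here (4-bit type, 8-bit depth, 20-bit length, 30-bit offset, 2 unused bits) are an assumption made for illustration; the text above states only that all four values fit in 64 bits. The class and method names are hypothetical.

```java
public class VtdRecordSketch {
    // Assumed layout, high to low bits: type(4) | depth(8) | length(20) | unused(2) | offset(30)
    static final int TYPE_SHIFT = 60, DEPTH_SHIFT = 52, LENGTH_SHIFT = 32;
    static final int TYPE_MAX = 0xF, DEPTH_MAX = 0xFF;
    static final int LENGTH_MAX = 0xFFFFF, OFFSET_MAX = 0x3FFFFFFF;

    /** Packs the four fields into one long, rejecting values that do not fit. */
    static long pack(int type, int depth, int length, int offset) {
        if (type < 0 || type > TYPE_MAX || depth < 0 || depth > DEPTH_MAX
                || length < 0 || length > LENGTH_MAX
                || offset < 0 || offset > OFFSET_MAX) {
            throw new IllegalArgumentException("field out of range");
        }
        return ((long) type << TYPE_SHIFT)
                | ((long) depth << DEPTH_SHIFT)
                | ((long) length << LENGTH_SHIFT)
                | (long) offset;
    }

    static int type(long rec)   { return (int) (rec >>> TYPE_SHIFT) & TYPE_MAX; }
    static int depth(long rec)  { return (int) (rec >>> DEPTH_SHIFT) & DEPTH_MAX; }
    static int length(long rec) { return (int) (rec >>> LENGTH_SHIFT) & LENGTH_MAX; }
    static int offset(long rec) { return (int) rec & OFFSET_MAX; }

    public static void main(String[] args) {
        // Fixed-width records can live in a plain long[] with no per-token objects.
        long[] records = { pack(0, 0, 4, 1), pack(0, 1, 4, 7) };
        for (long r : records) {
            System.out.println("type=" + type(r) + " depth=" + depth(r)
                    + " length=" + length(r) + " offset=" + offset(r));
        }
    }
}
```

Storing records in a primitive array is what makes the fixed 64-bit width useful: the whole token table is one contiguous allocation.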

Location Caches (LC) build on VTD records to provide efficient random access. Organized as tables, with one table per nesting depth level, LCs contain entries modeling an XML document's element hierarchy. An LC entry is a 64-bit integer encoding a pair of 32-bit values. The upper 32 bits identify the VTD record for the corresponding element. The lower 32 bits identify that element's first child in the LC at the next lower nesting level.
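The LC entry described above, with a VTD record index in the upper 32 bits and a first-child index in the lower 32 bits, might be encoded as in the following sketch. The class and method names are hypothetical, and using -1 in the lower half to mean "no child" is an assumption made for this example rather than something stated in the text.

```java
public class LocationCacheSketch {
    /** Assumed sentinel for an element with no child elements. */
    static final int NO_CHILD = -1;

    /** Upper 32 bits: index of the element's VTD record. Lower 32 bits: first child's LC index. */
    static long entry(int vtdIndex, int firstChildIndex) {
        // Mask the lower half so a negative child index does not sign-extend into the upper half.
        return ((long) vtdIndex << 32) | (firstChildIndex & 0xFFFFFFFFL);
    }

    static int vtdIndex(long entry)   { return (int) (entry >>> 32); }
    static int firstChild(long entry) { return (int) entry; }

    public static void main(String[] args) {
        // One table per nesting depth, as described above (indices are made up).
        long[] level1 = { entry(1, NO_CHILD), entry(2, 0) };
        long[] level2 = { entry(3, NO_CHILD) };

        long parent = level1[1];
        int child = firstChild(parent);
        if (child != NO_CHILD) {
            // Follow the link into the table for the next nesting level.
            System.out.println("first child VTD index: " + vtdIndex(level2[child]));
        }
    }
}
```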

Virtually all the core benefits of VTD-XML are inherent to non-extractive, document-centric parsing which provides these characteristics:

Combining those characteristics permits thinking of XML purely as syntax (bits, bytes, offsets, lengths, fragments, namespace-compensated fragments, and document composition) instead of the serialization/deserialization of objects. This is a powerful way to think about XML/SOA applications.

VTD-XML conforms strictly to XML 1.0 (except for the DTD part) and Namespaces in XML 1.0. It essentially conforms to the XPath 1.0 specification (with some subtle differences in the underlying data model), with extensions for the XPath 2.0 built-in functions.

...