
UTF-9 and UTF-18


UTF-9 and UTF-18 (9- and 18-bit Unicode Transformation Format, respectively) were two April Fools' Day RFC joke specifications for encoding Unicode on systems where the nonet (nine-bit group) is a better fit for the native word size than the octet, such as the 36-bit PDP-10 and the UNIVAC 1100/2200 series. Both encodings were specified in RFC 4042, written by Mark Crispin (inventor of IMAP) and released on April 1, 2005. The encodings suffer from a number of flaws, and their author confirmed that they were intended as a joke.

Unlike some of the "specifications" given in other April 1 RFCs, however, UTF-9 and UTF-18 are technically possible to implement, and have in fact been implemented in PDP-10 assembly language. They are not endorsed by the Unicode Consortium.

Like UTF-8, which is a variable-width encoding with 8-bit code units, UTF-9 is variable-width: it places each octet of a Unicode code point in the low 8 bits of an output nonet and uses the high bit to indicate continuation. This means that ASCII and Latin-1 characters take one nonet each, the remaining BMP characters take two nonets each, and non-BMP code points take three. Code points that require multiple nonets are stored starting with the most significant non-zero octet.
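
The scheme described above can be illustrated with a short Python sketch. It is not taken from RFC 4042 or from the PDP-10 implementation; the function names utf9_encode and utf9_decode are hypothetical, nonets are modeled as plain integers in the range 0 to 0x1FF, and the sketch assumes that every nonet except the last one of a code point carries the continuation bit. Surrogate handling and malformed input are left out.

```python
CONTINUATION = 0x100  # high (ninth) bit of a nonet


def utf9_encode(code_point: int) -> list[int]:
    """Encode one Unicode code point as a list of 9-bit values (sketch)."""
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError("code point out of Unicode range")
    # Split into octets, most significant first, dropping leading zero octets.
    octets = code_point.to_bytes(3, "big").lstrip(b"\x00") or b"\x00"
    # Assumption: all nonets but the last have the continuation bit set.
    return [CONTINUATION | o for o in octets[:-1]] + [octets[-1]]


def utf9_decode(nonets: list[int]) -> list[int]:
    """Decode a sequence of 9-bit values back into code points (sketch)."""
    code_points = []
    value = 0
    for nonet in nonets:
        value = (value << 8) | (nonet & 0xFF)  # append the low 8 bits
        if not nonet & CONTINUATION:           # last nonet of this code point
            code_points.append(value)
            value = 0
    return code_points


if __name__ == "__main__":
    for ch in "A\u00e9\u20ac\U0001F600":
        print(f"U+{ord(ch):04X}", [f"{n:03o}" for n in utf9_encode(ord(ch))])
```

Running the script prints each code point with its nonets in octal, the notation traditionally used on 36-bit machines: a Latin-1 character yields one nonet, a BMP character such as U+20AC yields two, and a non-BMP character such as U+1F600 yields three.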

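This table shows the UTF-9 encoding scheme (the x characters are replaced by the bits of the code point):

Code point range       Nonets   Nonet 1     Nonet 2     Nonet 3
U+0000 to U+00FF       1        0xxxxxxxx
U+0100 to U+FFFF       2        1xxxxxxxx   0xxxxxxxx
U+10000 to U+10FFFF    3        1000xxxxx   1xxxxxxxx   0xxxxxxxx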

...
Wikipedia

...