*** Welcome to piglix ***

UTF-8


UTF-8 is a character encoding capable of encoding all possible characters, or code points, defined by Unicode and originally designed by Ken Thompson and Rob Pike.

The encoding is variable-length and uses 8-bit code units. It was designed for backward compatibility with ASCII and to avoid the complications of endianness and byte order marks in the alternative UTF-16 and UTF-32 encodings. The name is derived from Unicode (or Universal Coded Character Set) Transformation Format – 8-bit.

UTF-8 is the dominant character encoding for the World Wide Web, accounting for 87.9% of all Web pages in November 2016 (the most popular East Asian encodings, Shift JIS and GB 2312, have 1.1% and 0.8% respectively). The Internet Mail Consortium (IMC) recommended that all e-mail programs be able to display and create mail using UTF-8, and the W3C recommends UTF-8 as the default encoding in XML and HTML.

UTF-8 encodes each of the 1,112,064 valid code points in the Unicode code space (1,114,112 code points minus 2,048 surrogate code points) using one to four 8-bit bytes (a group of 8 bits is known as an octet in the Unicode Standard). Code points with lower numerical values (i.e., earlier code positions in the Unicode character set, which tend to occur more frequently) are encoded using fewer bytes. The first 128 characters of Unicode, which correspond one-to-one with ASCII, are encoded using a single octet with the same binary value as ASCII, so that valid ASCII text is valid UTF-8-encoded Unicode as well. Since ASCII bytes do not occur when encoding non-ASCII code points into UTF-8, UTF-8 is safe to use within most programming and document languages that interpret certain ASCII characters in a special way, such as end of string.


...
Wikipedia

...