
Punycode


Punycode is a way to represent Unicode using the limited subset of ASCII characters permitted in Internet host names. For example, "München" (the German name for the city of Munich) is encoded as "Mnchen-3ya". Using Punycode, host names containing Unicode characters are transcoded to a subset of ASCII consisting of letters, digits, and hyphens, known as the Letter-Digit-Hyphen (LDH) subset.
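As a quick check of the example above, the following sketch uses the "punycode" codec that ships with Python's standard library. The helper names to_punycode and from_punycode are illustrative only and do not come from the article.

```python
def to_punycode(text: str) -> str:
    """Encode a Unicode string with Python's built-in punycode codec."""
    return text.encode("punycode").decode("ascii")


def from_punycode(label: str) -> str:
    """Decode a Punycode string back to Unicode."""
    return label.encode("ascii").decode("punycode")


for name in ["München", "bücher"]:
    encoded = to_punycode(name)
    # Round-trip to confirm the encoding is reversible.
    print(name, "->", encoded, "->", from_punycode(encoded))
```

Note that this codec produces the bare Punycode string; it does not add the "xn--" prefix that IDNA places in front of encoded labels.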

While in theory the Domain Name System (DNS) supports arbitrary sequences of octets in domain name labels, the DNS standards strongly recommend the LDH subset of ASCII conventionally used for host names, and require that string comparisons between DNS domain names be case-insensitive, assuming ASCII. The Punycode syntax is a method of encoding strings containing Unicode characters, such as internationalized domain names (IDNs), into the LDH subset of ASCII favored by DNS. It is specified in IETF Request for Comments (RFC) 3492.
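The two constraints described above can be sketched in a few lines. The function names is_ldh_label and same_ascii_label are hypothetical, and the sketch deliberately ignores other host name rules such as label length limits and the position of hyphens.

```python
import string

# The Letter-Digit-Hyphen (LDH) subset of ASCII.
LDH = set(string.ascii_letters + string.digits + "-")


def is_ldh_label(label: str) -> bool:
    """True if the label is non-empty and uses only letters, digits, and hyphens."""
    return bool(label) and all(ch in LDH for ch in label)


def same_ascii_label(a: str, b: str) -> bool:
    """Compare two ASCII labels case-insensitively.

    Raises UnicodeEncodeError if either label contains non-ASCII characters.
    """
    return a.encode("ascii").lower() == b.encode("ascii").lower()


print(is_ldh_label("München"))     # contains a non-ASCII character
print(is_ldh_label("Mnchen-3ya"))  # the Punycode form is pure LDH
print(same_ascii_label("MNCHEN-3YA", "mnchen-3ya"))
```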

As stated in RFC 3492, "Punycode is an instance of a more general algorithm called Bootstring, which allows strings composed from a small set of 'basic' code points to uniquely represent any string of code points drawn from a larger set." Punycode is Bootstring with particular parameter values appropriate for the encoding of labels in the IDNA framework.

The procedure for Punycode encoding is demonstrated here using the example of the string "bücher" (German for "books"), which is encoded as the label "bcher-kva".
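For readers who want to see the whole procedure in one place, here is a minimal encoder sketch written from the pseudocode in RFC 3492. The constants are the Bootstring parameter values that the RFC fixes for Punycode; the function names (digit, adapt, punycode_encode) are illustrative, and the sketch omits the overflow checks that the RFC requires of fixed-width implementations.

```python
# Bootstring parameter values that RFC 3492 assigns to Punycode.
BASE, TMIN, TMAX = 36, 1, 26
SKEW, DAMP = 38, 700
INITIAL_BIAS, INITIAL_N = 72, 128


def digit(d: int) -> str:
    """Map a digit value 0..35 to 'a'..'z' followed by '0'..'9'."""
    return chr(ord("a") + d) if d < 26 else chr(ord("0") + d - 26)


def adapt(delta: int, numpoints: int, first: bool) -> int:
    """Bias adaptation, applied after each delta is emitted."""
    delta = delta // DAMP if first else delta // 2
    delta += delta // numpoints
    k = 0
    while delta > ((BASE - TMIN) * TMAX) // 2:
        delta //= BASE - TMIN
        k += BASE
    return k + ((BASE - TMIN + 1) * delta) // (delta + SKEW)


def punycode_encode(text: str) -> str:
    # Copy the basic (ASCII) code points, then a delimiter if there were any.
    basic = [c for c in text if ord(c) < 128]
    out = list(basic)
    h = b = len(basic)  # h: code points handled so far, b: basic count
    if b:
        out.append("-")
    n, delta, bias = INITIAL_N, 0, INITIAL_BIAS
    while h < len(text):
        # Next smallest code point that has not been handled yet.
        m = min(ord(c) for c in text if ord(c) >= n)
        delta += (m - n) * (h + 1)
        n = m
        for c in text:
            cp = ord(c)
            if cp < n:
                delta += 1
            elif cp == n:
                # Emit delta as a generalized variable-length integer.
                q, k = delta, BASE
                while True:
                    t = TMIN if k <= bias else TMAX if k >= bias + TMAX else k - bias
                    if q < t:
                        break
                    out.append(digit(t + (q - t) % (BASE - t)))
                    q = (q - t) // (BASE - t)
                    k += BASE
                out.append(digit(q))
                bias = adapt(delta, h + 1, h == b)
                delta = 0
                h += 1
        delta += 1
        n += 1
    return "".join(out)


print(punycode_encode("bücher"))
print(punycode_encode("München"))
```

In practice a maintained implementation, such as the codec in a language's standard library, is preferable to a hand-written one; the sketch is only meant to make the steps concrete.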

First, all basic ASCII characters in the string are copied from input to output, skipping over any other characters. For example, "bücher" is copied to "bcher". If any characters were copied, an ASCII hyphen is then appended to the output (e.g., "bücher" → "bcher-"). Because the hyphen is itself a basic character, hyphens from the input may already appear in the output before this added delimiter. This causes no ambiguity, however, because no later part of the encoding process can introduce another ASCII hyphen; the last ASCII hyphen, if any, marks the end of the basic characters.
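The basic-character step, and the way a decoder can undo it by splitting at the last hyphen, can be sketched as follows. The names copy_basic and split_basic are hypothetical helpers, not part of any library.

```python
def copy_basic(text: str) -> str:
    """Copy the ASCII characters of text, then append '-' if any were copied."""
    basic = "".join(ch for ch in text if ord(ch) < 128)
    return basic + "-" if basic else basic


def split_basic(label: str) -> tuple[str, str]:
    """Split an encoded label at its last hyphen into (basic, encoded tail).

    If the label has no hyphen, the basic part is empty.
    """
    basic, _, tail = label.rpartition("-")
    return basic, tail


print(copy_basic("bücher"))      # basic characters followed by the delimiter
print(copy_basic("a-bü"))        # a hyphen from the input precedes the delimiter
print(split_basic("bcher-kva"))  # only the last hyphen is treated as the delimiter
```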

