In computing, UTF-16 is a 16-bit Unicode Transformation Format, a character encoding form that provides a way to represent a series of abstract characters from Unicode and ISO/IEC 10646 as a series of 16-bit words suitable for storage or transmission via data networks. UTF-16 is officially defined in Annex Q of ISO/IEC 10646-1. It is also described in The Unicode Standard version 3.0 and higher, as well as in the IETF's RFC 2781.
UTF-16 represents a character assigned within the lower 65536 code points of Unicode or ISO/IEC 10646 as a single code value equal to the character's code point: 0 for code point 0, hexadecimal FFFD for code point FFFD, for example.
UTF-16 represents a character above hexadecimal FFFF as a surrogate pair of code values from the range D800-DFFF. For example, the character at code point hexadecimal 10000 becomes the code value sequence D800 DC00, and the character at hexadecimal 10FFFD, the highest code point in Unicode that can be assigned to a character, becomes the code value sequence DBFF DFFD. Unicode and ISO/IEC 10646 do not assign characters to any of the code points in the D800-DFFF range, so an individual code value from a surrogate pair never represents a character on its own.
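This mapping can be captured in a few lines of Python; the following is a minimal sketch (the function name encode_utf16 is introduced here purely for illustration, it is not part of any standard):

```python
def encode_utf16(cp: int) -> list[int]:
    """Map a Unicode code point to its UTF-16 code value(s)."""
    if cp > 0x10FFFF:
        raise ValueError("beyond the Unicode code space")
    if cp < 0x10000:
        return [cp]                       # lower 65536 code points: one code value
    v = cp - 0x10000                      # 20-bit value
    return [0xD800 | (v >> 10),           # first code value: high 10 bits
            0xDC00 | (v & 0x3FF)]         # second code value: low 10 bits

# The two examples from the text:
assert encode_utf16(0x10000) == [0xD800, 0xDC00]
assert encode_utf16(0x10FFFD) == [0xDBFF, 0xDFFD]
```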
These code values are then serialized as 16-bit words, one word per code value. Because the endianness of these words varies according to the computer architecture, UTF-16 specifies three encoding schemes.
The UTF-16 encoding scheme requires that the byte order be declared by prepending a Byte Order Mark (BOM) before the first serialized character. This BOM is the encoded form of the Zero-Width No-Break Space character, hexadecimal FEFF, and manifests as the byte sequence FE FF for big-endian or FF FE for little-endian. A BOM at the beginning of UTF-16 encoded data is considered a signature separate from the text itself; it is there for the benefit of the decoder.
The UTF-16LE and UTF-16BE encoding schemes are identical to the UTF-16 encoding scheme except that, rather than using a BOM, the byte order is implicit in the name of the encoding (LE for little-endian, BE for big-endian). An initial FEFF in UTF-16LE or UTF-16BE encoded data is not interpreted as a BOM; it is part of the text itself.
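The difference between the three schemes is easy to observe with Python's built-in codecs, as in this small check (the plain utf-16 codec writes a BOM in the host machine's byte order; the output shown assumes a little-endian host):

```python
text = "z"
print(text.encode("utf-16-be").hex(" "))  # 00 7a        -> big-endian, no BOM
print(text.encode("utf-16-le").hex(" "))  # 7a 00        -> little-endian, no BOM
print(text.encode("utf-16").hex(" "))     # ff fe 7a 00  -> BOM, then native order
```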
The IANA has approved UTF-16, UTF-16BE, and UTF-16LE for use on the Internet, under those exact names (matched case-insensitively). The aliases UTF_16 or UTF16 may be meaningful in some programming languages or software applications, but they are not standard names.
code point | character | UTF-16 code value(s) | glyph* |
---|---|---|---|
122 (hex 7A) | small z (Latin) | 007A | z |
27700 (hex 6C34) | water (Chinese) | 6C34 | 水 |
119070 (hex 1D11E) | musical G clef | D834 DD1E | 𝄞 |
"水z𝄞" (water, z, G clef), UTF-16 encoded | ||
---|---|---|
labeled encoding | byte order | byte sequence |
UTF-16LE | little-endian | 34 6C, 7A 00, 34 D8 1E DD |
UTF-16BE | big-endian | 6C 34, 00 7A, D8 34 DD 1E |
UTF-16 | little-endian, with BOM | FF FE, 34 6C, 7A 00, 34 D8 1E DD |
UTF-16 | big-endian, with BOM | FE FF, 6C 34, 00 7A, D8 34 DD 1E |
* Appropriate font and software are required to see the correct glyphs.
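The byte sequences in the table can be reproduced with the same built-in codecs; a quick sketch (the escape sequences spell out 水, z, and the G clef):

```python
text = "\u6C34z\U0001D11E"  # 水, z, musical G clef
print(text.encode("utf-16-le").hex(" "))  # 34 6c 7a 00 34 d8 1e dd
print(text.encode("utf-16-be").hex(" "))  # 6c 34 00 7a d8 34 dd 1e
```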
Encoding of characters:
Characters with values less than 0x10000 are represented as a single 16-bit integer with a value equal to that of the character number. (Values between 0xD800 and 0xDFFF are reserved for use with UTF-16 and don't have any characters assigned to them.)
Characters with values between 0x10000 and 0x10FFFF are represented by a 16-bit integer with a value between 0xD800 and 0xDBFF followed by a 16-bit integer with a value between 0xDC00 and 0xDFFF. Since the value 0x10FFFF - 0x10000 = 0xFFFFF requires 20 bits, the high 10 bits are encoded in the first word and the low 10 bits in the second word; the reverse combination is sketched after this list.
Characters with values greater than 0x10FFFF cannot be encoded in UTF-16.
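Decoding reverses the split: subtract the range offsets from the two words, stitch the 10-bit halves back together, and add 0x10000. A minimal sketch (the name decode_pair is introduced here for illustration only):

```python
def decode_pair(w1: int, w2: int) -> int:
    """Combine a UTF-16 surrogate pair back into a Unicode code point."""
    assert 0xD800 <= w1 <= 0xDBFF and 0xDC00 <= w2 <= 0xDFFF
    return 0x10000 + ((w1 - 0xD800) << 10) + (w2 - 0xDC00)

assert decode_pair(0xD800, 0xDC00) == 0x10000
assert decode_pair(0xDBFF, 0xDFFD) == 0x10FFFD
```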
Example: The character with the value v=0x64321 is to be encoded.
v′ := v - 0x10000 = 0x54321 = 0101 0100 0011 0010 0001 (binary, 20 bits)

The correct encoding for this character is the following word sequence:

vh = 0101010000 // high 10 bits
vl = 1100100001 // low 10 bits

w1 := 0xD800 // the resulting 1st word, initialized to the lower bound of the high-surrogate range
w2 := 0xDC00 // the resulting 2nd word, initialized to the lower bound of the low-surrogate range

w1 := w1 | vh = 1101 1000 0000 0000 | 01 0101 0000 = 1101 1001 0101 0000 = 0xD950
w2 := w2 | vl = 1101 1100 0000 0000 | 11 0010 0001 = 1101 1111 0010 0001 = 0xDF21

The character is therefore encoded as the word sequence 0xD950 0xDF21.
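The result can be cross-checked with the built-in codec (the big-endian bytes d9 50 df 21 correspond to the words 0xD950 0xDF21):

```python
assert "\U00064321".encode("utf-16-be").hex(" ") == "d9 50 df 21"
```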