
Background


The Unicode standard defines the Universal Character Set (UCS), which assigns numbers to all the characters in all the alphabets of the world. The UCS is a superset of Latin-1 (ISO-8859-1), which in turn is a superset of ASCII. ASCII defines the first 128 characters and Latin-1 defines another 128 characters, thereby exhausting all the bits in an 8-bit byte.
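To see the superset relationship concretely, here is a small Python sketch (an illustration added here, not part of the original text; the characters are just examples): the code points of ASCII and Latin-1 characters are simply their familiar byte values carried over into the UCS.

    # Code points 0-127 are ASCII, 128-255 are the rest of Latin-1,
    # and everything above that exists only in the UCS.
    print(hex(ord("A")))   # 0x41   -- ASCII
    print(hex(ord("æ")))   # 0xe6   -- Latin-1 (same as its Latin-1 byte value)
    print(hex(ord("€")))   # 0x20ac -- beyond Latin-1, UCS only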

The UCS defines many more characters, so one byte per character is not enough. Unicode uses 31 bits, so the logical size of each character would be 4 bytes (32 bits). The problem with such wide characters is that the full width only pays off if your use of the ~2 billion characters is evenly distributed; most people use no more than 256 of those characters in their documents, so there's a lot of wasted space.
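The waste is easy to measure. The following Python sketch (an illustration assumed here, not from the original text) compares a fixed 4-byte encoding (UTF-32) with UTF-8 for a plain ASCII text.

    text = "Hello, world! " * 1000        # plain ASCII text
    utf8 = text.encode("utf-8")           # 1 byte per ASCII character
    utf32 = text.encode("utf-32-be")      # fixed 4 bytes per character
    print(len(utf8), len(utf32))          # 14000 vs 56000 -- four times the space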

The UTF-8 encoding is a way of transforming 4-byte wide characters into sequences of 1 to 6 bytes. It is backwards compatible with ASCII, meaning that text encoded in ASCII is automatically valid UTF-8 as well. Characters outside ASCII (including the rest of Latin-1) are represented by two or more bytes in UTF-8. That's why 'æ', 'ø', and 'å' turn into two-letter combinations when a UTF-8 encoded text is viewed as Latin-1.
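The two-letter effect can be reproduced directly. This Python sketch (an illustration, not from the original text) encodes the Norwegian letters as UTF-8 and then misreads the resulting bytes as Latin-1.

    s = "æøå"
    utf8_bytes = s.encode("utf-8")       # each letter becomes 2 bytes in UTF-8
    print(utf8_bytes)                    # b'\xc3\xa6\xc3\xb8\xc3\xa5'
    print(utf8_bytes.decode("latin-1"))  # Ã¦Ã¸Ã¥ -- the two-letter combinations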

All of the above is dealt with in much more detail in the UTF-8 and Unicode FAQ for Unix/Linux, which is useful for a lot more than just Unix/Linux.
