
Background


The Unicode standard defines the Universal Character Set (UCS), which assigns numbers to all the characters in all the alphabets of the world. The UCS is a superset of Latin-1 (ISO-8859-1), which in turn is a superset of ASCII. ASCII defines the first 128 characters and Latin-1 defines another 128 characters, thereby exhausting all the bits in an 8-bit byte.
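To see the superset relationship concretely, here is a small Python sketch (an illustration added here, not part of the original text; the characters are just examples): the code points of ASCII and Latin-1 characters are simply their familiar byte values carried over into the UCS.

    # Code points 0-127 are ASCII, 128-255 are the rest of Latin-1,
    # and everything above that exists only in the UCS.
    print(hex(ord("A")))   # 0x41   -- ASCII
    print(hex(ord("æ")))   # 0xe6   -- Latin-1 (same as its Latin-1 byte value)
    print(hex(ord("€")))   # 0x20ac -- beyond Latin-1, UCS only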

The UCS defines many more characters, so one byte per character is not enough. Unicode uses 31 bits, so the logical size of each character would be 4 bytes (32 bits). The problem with such wide characters is that the full width only pays off if your use of the ~2 billion characters is evenly distributed; most people use no more than 256 of those characters in their documents, so there's a lot of wasted space.
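The waste is easy to measure. The following Python sketch (an illustration assumed here, not from the original text) compares a fixed 4-byte encoding (UTF-32) with UTF-8 for a plain ASCII text.

    text = "Hello, world! " * 1000        # plain ASCII text
    utf8 = text.encode("utf-8")           # 1 byte per ASCII character
    utf32 = text.encode("utf-32-be")      # fixed 4 bytes per character
    print(len(utf8), len(utf32))          # 14000 vs 56000 -- four times the space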

The UTF-8 encoding is a way of transforming 4-byte wide characters into sequences of 1 to 6 bytes. It is backwards compatible with ASCII, meaning that text encoded in ASCII is automatically valid UTF-8 as well. Characters outside ASCII (including the rest of Latin-1) are represented by two or more bytes in UTF-8. That's why 'æ', 'ø', and 'å' turn into two-letter combinations when a UTF-8 encoded text is viewed as Latin-1.
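The two-letter effect can be reproduced directly. This Python sketch (an illustration, not from the original text) encodes the Norwegian letters as UTF-8 and then misreads the resulting bytes as Latin-1.

    s = "æøå"
    utf8_bytes = s.encode("utf-8")       # each letter becomes 2 bytes in UTF-8
    print(utf8_bytes)                    # b'\xc3\xa6\xc3\xb8\xc3\xa5'
    print(utf8_bytes.decode("latin-1"))  # Ã¦Ã¸Ã¥ -- the two-letter combinations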

All of the above is dealt with in much more detail in the UTF-8 and Unicode FAQ for Unix/Linux, which is useful for a lot more than just Unix/Linux.
