Working with ISO-8859-1 and Unicode Character Sets

Introduction

This article gives a brief and not too technical explanation of character encoding, and of the titular character encoding methods. I also outline how to work with the methods, how to fix some common problems and how to choose which encoding system to use.

Why is Character Encoding Important?

If a web developer includes an image in some HTML markup, he/she does not have to specify in what fomat the media was saved – the browser rendering engine will interpret that using a signature in the media file; similarly a media player will interpret a video file to discover which format the file is in.

Unfortunately character strings have no signature that allows the processing engine to automatically determine the format of the character encoding, in situations where multiple formats may be encountered, such as a web page, or if .NET or Java process external text files. The developer needs to inform the relevant engine what the character encoding format is.

When Character Encoding Goes Bad

This is a common sight on web pages:-

The price is �100 or about �120

… or the same text showing a different error:-

The price is £100 or about €120

The correct text should displayed as:-

The price is £100 or about €120

See below for a detailed explaination of the problem and the solution.

What is Character Encoding?

Character encoding is the means by which the characters are stored in a sequence (or stream) of bytes.

One Byte Per Character

The simplest format is the use of a single byte for a character giving 255 possible characters, 0 is usually the terminating character..This is sufficient to display most characters in most western languages, or most characters in any given language.

Two Bytes Per Character

If you have ever programmed in Java or .NET, you will almost certainly have encountered 2 byte (or 16 bit) character encoding, since strings are handled internally in this format. This allows the representation of 65535 characters which may initially seem to be sufficient to represent every possible character in written worldwide culture, and it usually is, but not always.

Unicode

Unicode simplifies things by allowing any character to be displayed within a single and huge character encoding system, which includes thousands of characters, more than can be represented by a 16 bit character encoding.

It also provides a more space efficient format than the aforementioned 16 bit encoding scheme, the popular UTF-8, and you may also encounter  UTF-16 or even UTF-32.

For a more detailed explaination of Unicode see our earlier blog on the subject.

ISO-8859-1 Encoding

ISO-8859-1 is actually a subset of Unicode. It comprises the first 255 Unicode characters (see below for the full character set) and is also sometimes known as Latin-1 since it features most of the characters that are used by Western European languages.

(The developer should be aware that the first 127 characters are encoded identically in ISO-8859-1 and UTF-8, as a single byte).

Many web pages created by English and other Western European language speakers are still encoded in ISO-8859-1, since this is sufficient to represent any possible character that they wish to display.

ISO-8859-1 vs UTF-8

When faced with the choice of character encoding, the choice is between flexibility and storage space and simplicity.

If only ISO-8859-1 characters are to be used in a project (such as a website), then ISO-8859-1 does offer a slight benefit in terms of storage space, and therefore in the case of a web page, of download size.

Fans of the Swedish/Danish TV show The Bridge will be familiar with the events contained in this sample string:-

Saga Norén leaves Malmö and crosses the Øresund Bridge

The text above comprises 54 characters. All the characters are present within the ISO-8859-1 character set, and so the string can be stored as 54 bytes using a simple one character per byte encoding.

If however the string is stored in UTF-8, it requires 57 bytes. This is because the three non-English characters (which are outside of the lower 1-127 range) are stored in two bytes using UTF-8. There is thus a slight space advantage.

I would nevertheless choose UTF-8 to give flexibility to show any possible future characters. Unicode wins.

Web Page Character Encoding Errors Explained

Remember the incorrectly displayed web page text shown above?

Error 1 was:-

The price is �100 or about �120

Error 2 was:-

The price is £100 or about €120

What has gone wrong? Well, the first example shows what happens when text that has been encoded as ISO-8859-1 is displayed on a web page which has told the viewing web browser that the contents are encoded as UTF-8.

The characters £ and are outside of the lower range (1-127) and are therefore encoded differently in UTF-8 and ISO-8859-1.

The second example shows the opposite; text encoded as UTF-8 is displayed in a page which has informed the web browser that the contents are encoded in ISO-8859-1.

Put simply, the web page encoding information does not match the contents, and horrid errors are shown.

In order to display this correct text…

The price is £100 or about €120

.. the simple solution to both problems is to establish which encoding should be used, and then within the

<head>

…of an HTML 4 or earlier page, use

<meta http-equiv="content-type" content="text/html;
charset=utf-8" />

…to specify UTF-8 contents or

<meta http-equiv="content-type" content="text/html;
charset=iso-8859-1" />

…if the contents are ISO-8859-1.

For HTML 5 specifying the character set is simpler:-

<meta charset="utf-8" />

The above code fragments are suitable for flat HTML pages; PHP programmers would use

header("Content-Type: text/html;charset=utf-8");

and a JSP page would use

<%@ page contentType="text/html;charset=UTF-8" %>

…to show just a couple of common examples.

Working with Text Files

A simple text file, as we have seen, carries no header or signature to indicate in what encoding format the text was saved. The programmer should determine that encoding format carefully.

For example, to read an ISO-8859-1 text file containing our 54 character sentence above, in C#, you would:-

StreamReader tr = null;

try
{
   tr = new StreamReader("saga.txt",
                          Encoding.GetEncoding("iso-8859-1"));
   String testline = tr.ReadLine();
}
catch
{
}
finally
{
   tr.Close();

}

The above code will ensure that the non-English characters are read correctly into the .NET String class instance.

Reference: The ISO-8859-1 Character Set

These are the displayable characters in the ISO-8859-1 character set, with their Hexadecimal values. Characters 0x20 (space) to 0xff are shown.

Character Hex Character Hex Character Hex Character Hex
20 ! 21 22 # 23
$ 24 % 25 & 26 27
( 28 ) 29 * 2A + 2B
, 2C 2D . 2E / 2F
0 30 1 31 2 32 3 33
4 34 5 35 6 36 7 37
8 38 9 39 : 3A ; 3B
< 3C = 3D > 3E ? 3F
@ 40 A 41 B 42 C 43
D 44 E 45 F 46 G 47
H 48 I 49 J 4A K 4B
L 4C M 4D N 4E O 4F
P 50 Q 51 R 52 S 53
T 54 U 55 V 56 W 57
X 58 Y 59 Z 5A [ 5B
\ 5C ] 5D ^ 5E _ 5F
` 60 a 61 b 62 c 63
d 64 e 65 f 66 g 67
h 68 i 69 j 6A k 6B
l 6C m 6D n 6E o 6F
p 70 q 71 r 72 s 73
t 74 u 75 v 76 w 77
x 78 y 79 z 7A { 7B
| 7C } 7D ~ 7E  7F
80  81 82 ƒ 83
84 85 86 87
ˆ 88 89 Š 8A 8B
Œ 8C  8D Ž 8E  8F
 90 91 92 93
94 95 96 97
˜ 98 99 š 9A 9B
œ 9C  9D ž 9E Ÿ 9F
A0 ¡ A1 ¢ A2 £ A3
¤ A4 ¥ A5 ¦ A6 § A7
¨ A8 © A9 ª AA « AB
¬ AC ­ AD ® AE ¯ AF
° B0 ± B1 ² B2 ³ B3
´ B4 µ B5 B6 · B7
¸ B8 ¹ B9 º BA » BB
¼ BC ½ BD ¾ BE ¿ BF
À C0 Á C1 Â C2 Ã C3
Ä C4 Å C5 Æ C6 Ç C7
È C8 É C9 Ê CA Ë CB
Ì CC Í CD Î CE Ï CF
Ð D0 Ñ D1 Ò D2 Ó D3
Ô D4 Õ D5 Ö D6 × D7
Ø D8 Ù D9 Ú DA Û DB
Ü DC Ý DD Þ DE ß DF
à E0 á E1 â E2 ã E3
ä E4 å E5 æ E6 ç E7
è E8 é E9 ê EA ë EB
ì EC í ED î EE ï EF
ð F0 ñ F1 ò F2 ó F3
ô F4 õ F5 ö F6 ÷ F7
ø F8 ù F9 ú FA û FB
ü FC ý FD þ FE ÿ FF

Leave a Reply