Character sets (charsets) are used by browsers to convert a stream of bytes into readable characters. Each character is represented by a numeric value, and each value is assigned a corresponding character in a table. Hundreds of character encodings are in use. Here are a few of the character encodings most commonly used on the web, ordered by popularity:
- UTF-8 (Unicode) Covers: Worldwide
- ISO-8859-1 (Latin alphabet part 1) Covers: North America, Western Europe, Latin America, the Caribbean, Canada, Africa
- WINDOWS-1252 (Latin I)
- ISO-8859-15 (Latin alphabet No. 9) Covers: Similar to ISO-8859-1 but replaces some less common symbols with the euro sign and some other missing characters
- WINDOWS-1251 (Cyrillic)
- ISO-8859-2 (Latin alphabet part 2) Covers: Eastern Europe
- GB2312 (Chinese Simplified)
- WINDOWS-1253 (Greek)
- WINDOWS-1250 (Central Europe)
- US-ASCII (basic English)
Note that the popularity of a particular charset depends greatly on the geographical region. You can find the names of all registered character encodings in the IANA registry.
As you can see, there are many encodings to choose from, so the character encoding should always be specified in the HTTP Content-Type response header sent together with the document. Without a specified charset, you risk that the characters in your document will be interpreted and displayed incorrectly.
In the Hypertext Transfer Protocol (HTTP), a header is the part of a message that contains additional text fields sent to or from the server. When a browser requests a webpage, the web server sends, in addition to the HTML source code of the page, fields containing metadata that describe the settings and operational parameters of the response. In other words, the HTTP header is a set of fields containing supplemental information about the user request or the server response.
In the example above, the “Response Headers” contain several fields with information about the server, content and encoding, where the line
Content-Type: text/html; charset=utf-8
informs the browser that the characters in the document are encoded using the UTF-8 charset.
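As a rough illustration of how a client might use this field, the Python sketch below reads the charset from a header block and decodes the body with it. The header text, the `decode_body` helper and the ISO-8859-1 fallback are assumptions made for this example; real browsers apply more elaborate detection rules when the charset is missing.

```python
from email import message_from_string

# Hypothetical response headers, written out by hand for this example.
RAW_HEADERS = (
    "Content-Type: text/html; charset=utf-8\r\n"
    "\r\n"
)


def decode_body(raw_headers: str, body: bytes, fallback: str = "iso-8859-1") -> str:
    """Decode body using the charset parameter of the Content-Type header.

    The fallback charset is an assumption of this sketch, not a rule from the article.
    """
    headers = message_from_string(raw_headers)
    charset = headers.get_content_charset() or fallback
    return body.decode(charset)


body = "café €".encode("utf-8")

# With the charset declared, the text round-trips correctly.
declared = decode_body(RAW_HEADERS, body)

# Without it, the fallback table is used and the non-ASCII characters are garbled.
undeclared = decode_body("Content-Type: text/html\r\n\r\n", body)
```

Here `declared` holds the original text, while `undeclared` shows the kind of corruption that appears when the bytes are read with the wrong table.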
How to add charset information to the response?
Although it is possible to specify the character encoding in an HTML document using the meta http-equiv pragma directive, this approach disables the lookahead downloader in Internet Explorer 8, which increases page load times. Including the content type in a meta element can also cause problems with duplicated or inconsistent information. Some web servers automatically send HTTP headers containing a content type; when the document suggests a different one via a meta tag, the conflicting information can “confuse” browsers.
In HTML5, the declaration of the charset via meta looks as follows:
<meta charset="UTF-8">
In HTML4 or older, the declaration of the charset looks as follows:
<meta http-equiv="Content-type" content="text/html;charset=UTF-8">
In an XML document, the encoding can be declared by adding the “encoding” attribute as follows:
<?xml version="1.0" encoding="UTF-8"?>
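Because a charset declared in the document can disagree with the one in the HTTP header, it can be useful to check the two against each other. The following Python sketch does this for the two HTML meta forms shown above; the class and function names are hypothetical, and treating a missing declaration as "no conflict" is an assumption of this example.

```python
import codecs
from email.message import Message
from html.parser import HTMLParser


def charset_of(content_type: str):
    """Return the charset parameter of a Content-Type value, or None."""
    msg = Message()
    msg["Content-Type"] = content_type
    return msg.get_content_charset()


class MetaCharsetParser(HTMLParser):
    """Records the first charset declared through a meta element."""

    def __init__(self):
        super().__init__()
        self.charset = None

    def handle_starttag(self, tag, attrs):
        if tag != "meta" or self.charset is not None:
            return
        attributes = dict(attrs)
        if attributes.get("charset"):
            # HTML5 form: <meta charset="UTF-8">
            self.charset = attributes["charset"].lower()
        elif (attributes.get("http-equiv") or "").lower() == "content-type":
            # HTML4 form: <meta http-equiv="Content-type" content="...">
            content = attributes.get("content")
            if content:
                self.charset = charset_of(content)


def charsets_agree(content_type_header: str, html: str) -> bool:
    """True unless the header and the meta element name different encodings."""
    parser = MetaCharsetParser()
    parser.feed(html)
    header_charset = charset_of(content_type_header)
    if header_charset is None or parser.charset is None:
        return True  # nothing to compare (an assumption of this sketch)
    # codecs.lookup normalizes aliases; it raises LookupError for unknown names.
    return codecs.lookup(header_charset).name == codecs.lookup(parser.charset).name
```

For example, `charsets_agree("text/html; charset=ISO-8859-1", '<meta charset="UTF-8">')` reports a mismatch, which is the incoherent situation described above.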
The best approach is to provide the charset information in the HTTP header. How to do it depends entirely on the server-side technology you are using. For example, if you are generating your content using PHP, add the following line of code at the top of your page, before any output is sent:
header("Content-Type: text/html; charset=utf-8");
Alternatively, if you are using Apache, you can do it with an .htaccess directive as follows:
AddType 'text/html; charset=UTF-8' html
And if you’re using nginx with the headers-more module, which provides the more_set_headers directive, add the following to your config:
more_set_headers -t 'text/html' 'Content-Type: text/html; charset=utf-8';
These are, of course, only a few examples. Your solution will depend on your server configuration.
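As one more illustration, here is a minimal sketch of the same idea in Python using the standard-library WSGI interface. The page content, the function names, and the host and port are placeholders chosen for this example, not part of any particular setup.

```python
from wsgiref.simple_server import make_server


def app(environ, start_response):
    """Tiny WSGI application that declares its charset in the Content-Type header."""
    # Encode the page with the same charset that the header announces.
    body = "<!DOCTYPE html><title>Café</title><p>Price: 5 €</p>".encode("utf-8")
    start_response(
        "200 OK",
        [
            ("Content-Type", "text/html; charset=utf-8"),
            ("Content-Length", str(len(body))),
        ],
    )
    return [body]


def serve(host: str = "localhost", port: int = 8000) -> None:
    """Run the application on a development server (placeholder host and port)."""
    with make_server(host, port, app) as server:
        server.serve_forever()
```

Calling `serve()` starts a local development server; the important part is that the bytes in the body are produced with the same encoding that the header names.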