HTML Character Set

To correctly display an HTML page, the browser must know the character set to be used (the character encoding).


Which character encoding is correct for HTML?

The default character encoding in HTML5 is UTF-8.

This was not always the case. The character encoding of the early web was ASCII.

Later, from HTML 2.0 to HTML 4.01, ISO-8859-1 was considered the standard.

With the arrival of XML and HTML5, UTF-8 finally took over and solved many character encoding problems.

The following is a brief overview of the character encoding standards.


In the beginning: ASCII

Computers store data (numbers, text, images) electronically as binary ones and zeros (for example, 01000101).

To standardize the storage of alphanumeric characters, ASCII (the American Standard Code for Information Interchange) was created. It assigns each character a unique 7-bit binary number and covers the digits 0-9, the uppercase and lowercase letters (a-z, A-Z), and some special characters such as ! $ + - ( ) @ < >.

Because ASCII uses only 7 bits of each byte for the character (the eighth bit served as a parity bit for transmission error checking), it can represent only 128 different characters. The first 32 of these (codes 0 to 31) are reserved as control characters.
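The 7-bit limit can be made concrete with a minimal Python sketch. Python is used here only for illustration, and the helper name `is_ascii` is hypothetical. The sketch prints the code and 7-bit pattern of a few characters and checks whether a string fits in ASCII.

```python
# Each ASCII character maps to a number from 0 to 127, which fits in 7 bits.
for ch in ["E", "a", "0", "$"]:
    code = ord(ch)
    print(ch, code, format(code, "07b"))

# 7 bits give 2**7 = 128 possible codes.
ascii_codes = range(2 ** 7)

# Codes 0 to 31 are control characters (line feed, tab, and so on).
control_codes = [c for c in ascii_codes if c < 32]


def is_ascii(text: str) -> bool:
    """Return True if every character in text can be encoded as ASCII."""
    try:
        text.encode("ascii")
        return True
    except UnicodeEncodeError:
        return False


print(is_ascii("Hello"))      # plain English text fits
print(is_ascii("caf\u00e9"))  # the accented e is outside ASCII
```

Padded to a full byte with a leading zero, the letter E (code 69) is the bit pattern 01000101.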

The biggest drawback of ASCII is that it excludes non-English letters.

ASCII is still in widespread use today, especially in large computer systems.

For a closer look at ASCII, see the complete ASCII reference.


In Windows: ANSI

ANSI (also known as Windows-1252) was the default character set in Windows up to and including Windows 95.

ANSI is an extension of ASCII that adds international characters. It uses a full byte (8 bits) per character, so it can represent 256 different characters.

Because ANSI became the default character set in Windows, all browsers support it.

For a closer look at ANSI, see the complete ANSI reference.


In HTML 4: ISO-8859-1

Because most countries use characters that are not part of ASCII, the HTML 2.0 standard changed the default character encoding to ISO-8859-1.

ISO-8859-1 is an extension of ASCII that adds international characters. Like ANSI, it uses a full byte (8 bits) to represent 256 different characters.

Note: When a browser detects ISO-8859-1 in a page, it usually defaults to ANSI, because the two are identical except in the 32 positions from 0x80 to 0x9F, where ANSI defines additional printable characters.
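The relationship between the two character sets can be checked with a short Python sketch. It assumes Python's `cp1252` codec as a stand-in for ANSI and its strict `iso-8859-1` codec, which, unlike browsers, does not fall back to ANSI. The variable names are illustrative only.

```python
# Three bytes: 'A' (ASCII), 0xE9 (shared by both sets), 0x80 (differs).
sample = bytes([0x41, 0xE9, 0x80])

as_ansi = sample.decode("cp1252")        # 0x80 becomes the euro sign
as_latin1 = sample.decode("iso-8859-1")  # 0x80 becomes a control character

# Collect every byte value that the two character sets decode differently.
# errors="replace" covers the few positions cp1252 leaves undefined.
differing = [
    b for b in range(256)
    if bytes([b]).decode("cp1252", errors="replace")
    != bytes([b]).decode("iso-8859-1")
]

print(repr(as_ansi))
print(repr(as_latin1))
print(f"{differing[0]:#04x} to {differing[-1]:#04x}: {len(differing)} positions")
```

Every other byte value, including all of ASCII, decodes to the same character in both sets.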

If an HTML 4 page uses a character set other than ISO-8859-1, you need to specify it in a <meta> tag, as follows:

Example

<meta http-equiv="Content-Type" content="text/html;charset=ISO-8859-8">
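Why the declaration matters can be sketched in Python by decoding the same bytes with two different character sets. This is only an illustration under stated assumptions: the Hebrew sample word and the variable names are arbitrary choices, and real browsers apply additional detection rules.

```python
# A Hebrew sample word, written with escapes so the source stays ASCII.
text = "\u05e9\u05dc\u05d5\u05dd"

# The bytes as they would be stored for a page saved in ISO-8859-8.
raw = text.encode("iso-8859-8")

# Decoding with the declared character set recovers the original text.
declared = raw.decode("iso-8859-8")

# Decoding with the HTML 4 default instead yields unrelated Latin letters.
assumed_default = raw.decode("iso-8859-1")

print(len(raw))                 # one byte per character
print(declared == text)         # the declaration gives the right result
print(assumed_default == text)  # the default does not
```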

Note

The default character set in HTML5 is UTF-8.
All HTML 4 processors also support UTF-8, and all HTML5 and XML processors support both UTF-8 and UTF-16.

For a closer look at ISO-8859-1, see the complete ISO-8859-1 reference.


In HTML5: Unicode (UTF-8)

Because the character sets listed above are limited in size and incompatible with one another in multilingual environments, the Unicode Consortium developed the Unicode Standard.

The Unicode Standard covers (almost) all characters, punctuation marks, and symbols.

Unicode enables text to be processed, stored, and transported independently of platform and language.

The default character encoding in HTML5 is UTF-8.
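How UTF-8 accommodates so many characters can be seen in a small Python sketch. UTF-8 is a variable-length encoding: ASCII characters keep their single-byte form, and other characters take two to four bytes. The sample characters and variable names below are arbitrary choices for illustration.

```python
# One sample character for each UTF-8 length, written as escapes.
samples = ["A", "\u00e9", "\u20ac", "\U0001F600"]

utf8_lengths = {}
for ch in samples:
    encoded = ch.encode("utf-8")
    utf8_lengths[ch] = len(encoded)
    hex_bytes = " ".join(f"{b:02x}" for b in encoded)
    print(f"U+{ord(ch):04X}: {len(encoded)} byte(s): {hex_bytes}")

# ASCII text is byte-for-byte identical when encoded as UTF-8.
same_as_ascii = "Hello".encode("utf-8") == "Hello".encode("ascii")
print(same_as_ascii)
```

Because of this backward compatibility, a pure ASCII page is also a valid UTF-8 page.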

For a closer look at Unicode (UTF-8), see the complete Unicode reference.