HTML Document Character Set 

Contents

  1. The Document Character Set
  2. Character entities

Human languages define a large number of text characters and human beings have invented a wide variety of systems for representing these characters in a computer. Unless proper precautions are taken, differing character representations may not be understood by user agents in all parts of the world.

The Document Character Set 

To promote interoperability, SGML requires that each application (including HTML), as part of its definition, define its document character set. A document character set is a set of abstract characters (such as the Cyrillic letter "I", the Chinese character meaning "water", etc.) and a corresponding set of integer references to those characters. SGML considers a document to be a sequence of references in the document character set.
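As an informal illustration of "abstract characters plus integer references", the following Python sketch prints the integer reference of the two characters mentioned above. It assumes that Python's built-in Unicode support is an acceptable stand-in for the document character set; the variable names are illustrative only.

```python
import unicodedata

# Two abstract characters mentioned in the text and their integer references.
cyrillic_i = "\u0418"   # Cyrillic capital letter I
water = "\u6c34"        # Chinese character meaning "water"

for ch in (cyrillic_i, water):
    # ord() gives the integer reference; unicodedata.name() gives the character's name.
    print(f"U+{ord(ch):04X}", ord(ch), unicodedata.name(ch))

# In the SGML view, a document is a sequence of such integer references.
document = cyrillic_i + water
references = [ord(ch) for ch in document]
```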

The document character set for HTML is the Universal Character Set (UCS) of [ISO10646]. This set is character-by-character equivalent to Unicode 2.0 ([UNICODE]). Both of these standards are updated from time to time with new characters and the amendments should be consulted at the respective Web sites.

In the current specification, references to ISO/IEC-10646 or Unicode imply the same document character set. However, the current document also refers to the Unicode specification for other issues such as the bidirectional text algorithm.

Conforming HTML user agents may receive or output a document, or represent a document internally, using any character encoding. A character encoding represents some subset of the document character set. Character encodings such as ISO-8859-1 (commonly referred to as "Latin-1" since it encodes most Western European languages), ISO-8859-5 (which supports Cyrillic), SHIFT_JIS (a Japanese encoding), and euc-jp (another Japanese encoding) save bandwidth by representing only slices of the document character set.
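The idea that each encoding covers only a slice of the document character set can be sketched in Python. The helper name can_encode is hypothetical, and the codec names are those used by Python's standard codec registry, not names taken from this specification.

```python
def can_encode(text, charset):
    """Return True if every character of text exists in the given encoding."""
    try:
        text.encode(charset)
        return True
    except UnicodeEncodeError:
        return False

cyrillic_i = "\u0418"   # Cyrillic capital letter I
water = "\u6c34"        # Chinese character meaning "water"

# Each encoding accepts only the characters in its own subset.
for charset in ("iso-8859-1", "iso-8859-5", "shift_jis", "euc_jp", "utf-8"):
    print(charset, can_encode(cyrillic_i, charset), can_encode(water, charset))
```

A single-byte encoding such as ISO-8859-5 stores the Cyrillic letter in one byte, whereas UTF-8 needs two; this is the bandwidth saving the paragraph refers to.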

Thus, character encodings allow authors to work with a convenient subset of the document character set. Authors should not have to know anything about the underlying character encoding of the document or tool they are using: writing Japanese in a UTF-8 editor is as easy as writing Japanese in a JIS or SHIFT_JIS editor.

Character encodings also mean that authors are not required to enter a document's text in the form of references to the document character set. Requiring authors to work with such a large character encoding would be cumbersome and wasteful (although encodings such as UTF-8 that cover all of Unicode do exist).

To allow this convenience, conforming user agents must correctly map to [UNICODE] all characters in any character encoding ("charset") they recognize (or behave as if they did). A list of recommended character encodings for various scripts and languages will be provided in a separate document.
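A minimal sketch of this mapping requirement, assuming Python's codecs as the user agent's decoding tables: the same character arrives as different byte sequences in different encodings, and each must decode to the same Unicode character.

```python
water = "\u6c34"   # Chinese character meaning "water"

# The same abstract character as bytes in three different encodings.
encoded = {cs: water.encode(cs) for cs in ("euc_jp", "shift_jis", "utf-8")}

# Decoding with the matching charset must recover the same Unicode character.
decoded = {cs: raw.decode(cs) for cs, raw in encoded.items()}

for cs, raw in encoded.items():
    print(cs, raw.hex(), f"U+{ord(decoded[cs]):04X}")
```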

How does a user agent know which character encoding has been used to encode a given document?

In many cases, before a Web server sends an HTML document over the Web, it tries to figure out the character encoding (by a variety of techniques such as examining the first few bytes of the file, checking its encoding against a database of known files and encodings, etc.). The server transmits the document and the name of the character encoding to the receiving user agent by way of the charset parameter of the HTTP "Content-Type" field. For example, the following HTTP header announces that the character encoding is "euc-jp".

Content-Type: text/html; charset=euc-jp

The value of the "charset" parameter must be the name of a "charset" as defined in [RFC2045].
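One way a receiving program might extract the charset parameter is sketched below using Python's standard email package, which parses MIME-style header fields. The function name charset_from_content_type is hypothetical.

```python
from email.message import Message

def charset_from_content_type(value):
    """Return the charset parameter of a Content-Type value, or None if absent."""
    msg = Message()
    msg["Content-Type"] = value
    # get_content_charset() returns the parameter in lower case.
    return msg.get_content_charset()

announced = charset_from_content_type("text/html; charset=euc-jp")
missing = charset_from_content_type("text/html")
```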

Unfortunately, not all servers send information about the character encoding (even when the character encoding is different from the widely used ISO-8859-1 encoding). HTML therefore allows authors a way to tell user agents which character encoding has been used by specifying it explicitly in the document header with the META element. For example, to specify that the character encoding of the current document is "euc-jp", include the following META declaration:

<META http-equiv="Content-Type" content="text/html; charset=euc-jp">

This mechanism has a notable limitation: the user agent cannot interpret the META element to determine the character encoding if it doesn't already know the character encoding of the document. The META declaration must therefore be used only when the character encoding is organized such that ASCII characters stand for themselves, at least until the META element is parsed. In this case, conforming user agents must correctly interpret the META element.
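The limitation can be illustrated with a rough Python sketch that scans the raw bytes of a document for a META declaration. This is not a conforming parser: the function name sniff_meta_charset, the 1024-byte limit, and the regular expression (which assumes http-equiv appears before content) are all assumptions made for illustration.

```python
import re

# Matches <META http-equiv="Content-Type" ... charset=NAME> in raw bytes.
# Assumes the http-equiv attribute precedes the content attribute.
META_CHARSET = re.compile(
    rb"""<meta[^>]+http-equiv\s*=\s*["']?content-type["']?[^>]*?charset\s*=\s*([\w.:-]+)""",
    re.IGNORECASE,
)

def sniff_meta_charset(raw, limit=1024):
    """Look for a META charset declaration in the first `limit` bytes."""
    match = META_CHARSET.search(raw[:limit])
    return match.group(1).decode("ascii").lower() if match else None

page = '<HTML><HEAD><META http-equiv="Content-Type" Content="text/html; charset=euc-jp"></HEAD>'

# Works when ASCII characters stand for themselves in the byte stream...
found = sniff_meta_charset(page.encode("euc_jp"))
# ...and fails when they do not, as in UTF-16.
not_found = sniff_meta_charset(page.encode("utf-16"))
```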

To sum up, conforming user agents must observe the following priorities when determining a document's character encoding (from highest priority to lowest):

  1. Explicit user action to override erroneous behavior.
  2. An HTTP "charset" parameter in a "Content-Type" field.
  3. A META declaration with "http-equiv" set to "Content-Type" and a value set for "charset".
  4. The "charset" attribute set for the A and LINK elements.
  5. User agent heuristics and user settings. For example, user agents typically assume that in the absence of other indicators, the character encoding is ISO-8859-1. This assumption may lead to an unreadable presentation of certain documents.

In all cases, the value of the "charset" attribute or parameter must be the name of a "charset" as defined in [RFC2045].
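The priority order above can be sketched as a simple fall-through in Python. The function and parameter names are hypothetical, and the final default stands in for whatever heuristics and user settings a particular user agent applies.

```python
def choose_encoding(user_override=None, http_charset=None, meta_charset=None,
                    link_charset=None, default="iso-8859-1"):
    """Pick a character encoding name, highest-priority source first."""
    candidates = (
        user_override,   # 1. explicit user action
        http_charset,    # 2. HTTP Content-Type charset parameter
        meta_charset,    # 3. META declaration in the document
        link_charset,    # 4. charset attribute on the A or LINK element
    )
    for candidate in candidates:
        if candidate:
            return candidate.lower()
    return default       # 5. user agent heuristics and user settings

chosen = choose_encoding(http_charset="EUC-JP", meta_charset="shift_jis")
```

In this example the HTTP parameter wins over the META declaration, so chosen is "euc-jp".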

If, for a specific application, it becomes necessary to refer to characters outside [ISO10646], characters should be assigned to a private zone to avoid conflicts with present or future versions of the standard. This is highly discouraged, however, for reasons of portability.
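As a small illustration, a tool could check whether a character lies in a private use zone by testing its Unicode general category, which is "Co" for private use characters. The helper name is_private_use is hypothetical.

```python
import unicodedata

def is_private_use(ch):
    """Return True if ch belongs to a Unicode private use area."""
    return unicodedata.category(ch) == "Co"

in_private_zone = is_private_use("\ue000")   # first code point of the Private Use Area
standard_char = is_private_use("\u6c34")     # an ordinary, standardized character
```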

Note: Modern web servers can be configured with information about which document is using which character encoding. Webmasters should use these facilities but should take pains to configure the server properly.

Character entities 

Your hardware and software configuration probably won't allow you to refer to all Unicode characters through simple input mechanisms, so SGML offers character encoding-independent mechanisms for specifying any character from the document character set.

Numeric character references specify the integer reference of a Unicode character. A numeric character reference with the syntax &#D; refers to Unicode decimal character number D. A numeric character reference with the syntax &#xH; refers to Unicode hexadecimal character number H. The hexadecimal representation is a new SGML convention and is particularly useful since character standards use hexadecimal representations.

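Here are some examples: &#229; (in decimal) represents the letter "a" with a small circle above it, and &#xE5; (in hexadecimal) represents the same character; &#1048; (in decimal) represents the Cyrillic capital letter "I"; and &#x6C34; (in hexadecimal) represents the Chinese character meaning "water".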
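Both reference syntaxes can be tried out with Python's standard html module. The helper names decimal_ref and hex_ref are hypothetical; html.unescape is used here only as a convenient resolver and follows current HTML parsing rules rather than this specification.

```python
import html

def decimal_ref(ch):
    """Return the decimal numeric character reference for one character."""
    return f"&#{ord(ch)};"

def hex_ref(ch):
    """Return the hexadecimal numeric character reference for one character."""
    return f"&#x{ord(ch):X};"

a_ring = "\u00e5"   # lower case a with ring
print(decimal_ref(a_ring), hex_ref(a_ring))

# Decimal and hexadecimal forms resolve to the same characters.
resolved = [html.unescape(ref) for ref in ("&#229;", "&#xE5;", "&#1048;", "&#x6C34;")]
```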

To give authors a more intuitive way to refer to characters in the document character set, HTML offers a set of named character entities. Named character references replace integer references with symbolic names. The named entity &aring; refers to the same Unicode character as &#229;. There is no named entity for the Cyrillic capital letter "I". The full list of named character entities is included in this specification.
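The correspondence between names and integer references can be inspected in Python's html.entities module, whose name2codepoint and codepoint2name tables cover the HTML 4 named entities. This is a sketch for checking the two claims above, not a substitute for the full list in this specification.

```python
from html.entities import codepoint2name, name2codepoint

# The named entity and the numeric reference point at the same character.
aring = name2codepoint["aring"]

# The Cyrillic capital letter "I" (U+0418) has no name in the HTML 4 tables.
has_name_for_cyrillic_i = 0x0418 in codepoint2name
```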

Four named character entities (&lt;, &gt;, &amp;, and &quot;) deserve special mention since they are frequently used to "escape" special characters. For text appearing as part of the content of an element, you should escape < as &lt; to avoid possible confusion with the beginning of a tag. The & character should be escaped as &amp; to avoid confusion with the beginning of an entity reference.

You should also escape & within attribute values since entity references are allowed within CDATA attribute values. In addition, you should escape > as &gt; to avoid problems with older user agents that incorrectly treat this character as the end of a tag when it appears in a quoted attribute value.

Rather than worry about rules for quoting attribute values, it's often easier to encode any instance of " as &quot; and to always use " for quoting attribute values. Many people find it simpler to always escape these four characters in element content and attribute values.
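The "always escape these characters" approach might look like the following Python sketch. The function name escape_four is hypothetical. The standard library's html.escape does much the same, but with its default settings it also escapes the apostrophe, so a hand-written version is shown to match the text exactly.

```python
import html

def escape_four(text):
    """Escape &, <, > and " for use in element content or attribute values."""
    # & must be replaced first, or the & in the other entities would be escaped again.
    return (text.replace("&", "&amp;")
                .replace("<", "&lt;")
                .replace(">", "&gt;")
                .replace('"', "&quot;"))

original = 'a < b & "c" > d'
escaped = escape_four(original)

# Resolving the entities again recovers the original text.
round_trip = html.unescape(escaped)
```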

Names of named character entities are case-sensitive. Thus, &Aring; refers to a different character (upper case A, ring) than &aring; (lower case a, ring).
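A quick check of the case distinction, using Python's html.unescape as the resolver (an assumption; any entity-aware parser would do):

```python
import html
import unicodedata

upper = html.unescape("&Aring;")
lower = html.unescape("&aring;")

# The two names resolve to different characters.
names = [unicodedata.name(upper), unicodedata.name(lower)]
print(names)
```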

Note: In SGML, it is possible to eliminate the final ";" after a numeric or named character reference in some cases (e.g., at a line break or directly before a tag). In other circumstances it may not be eliminated (e.g., in the middle of a word). We strongly suggest using the ";" in all cases to avoid problems with user agents that require this character to be present.