Edmon Chung
Internet Draft Neteka
<draft-chung-idnop-charprep-00.txt>
Intended Category: Informational April 2003
CHARPREP - Character Equivalency Preparations for IDN
STATUS OF THIS MEMO
This document is an Internet-Draft and is in full conformance with
all provisions of Section 10 of RFC2026.
Internet-Drafts are working documents of the Internet Engineering
Task Force (IETF), its areas, and its working groups. Note that
other groups may also distribute working documents as Internet-
Drafts. Internet-Drafts are draft documents valid for a maximum of
six months and may be updated, replaced, or obsoleted by other
documents at any time. It is inappropriate to use Internet-Drafts as
reference material or to cite them other than as "work in progress."
The reader is cautioned not to depend on the values that appear in
examples to be current or complete, since their purpose is primarily
educational. Distribution of this memo is unlimited.
The list of current Internet-Drafts can be accessed at
http://www.ietf.org/ietf/1id-abstracts.txt
The list of Internet-Draft Shadow Directories can be accessed at
http://www.ietf.org/shadow.html.
Abstract
Charprep intends to take up where Nameprep [NAMEPREP] left off to
provide additional preventive measures to bridge the user's conceptual
perception of a multilingual domain name with the domain matching
process. The critical development from Nameprep is that common user
perception is taken into account. That is, Charprep strives to take
the 'case-insensitivity' concept of user-friendliness to another
level for IDNs because of the inherent complexity and potential
confusion that could arise from the use of multilingual characters in
domain names.
Charprep is designed to be a framework for Zone Administrators (e.g.
domain registries) to employ relevant equivalency tables to compute,
from the original string, the variants that could confuse users. The
management of Reserved Variants (RV) and Zone Variants (ZV) together
with the original string (Primary Domain) is discussed in Zoneprep
[ZONEPREP].
Furthermore, Charprep and Zoneprep are designed to be a recommended
feature to be offered to users by a Zone Administrator (e.g. Domain
Chung [Page 1]
IDNOP-CHARPREP April 2003
Registries) in the management of Internationalized domain names
(IDN). A key concept is that these are done without affecting the
IDN protocol specified in [RFC3490], [RFC3491] and [RFC3492].
Table of Contents
1. Introduction
1.1 Terminology
1.2 Nomenclature
1.3 Disclaimer
2. Importance of Charprep
3. Equivalency versus Prohibition
4. Character Equivalency Preparations
5. Charprep Tables and Profiles
5.1 Codepoints Inclusion Table
6.2 Charprep Table
6.3 Publishing of Charprep Profiles
6.4 Generation of Charprep Equivalence Set
7. IANA Considerations
8. Security Considerations
Acknowledgements
1. Introduction
During the discussions to establish an IDN protocol, a great number
of problematic issues surrounding name equivalency were uncovered.
The current Nameprep document decided to constrain its scope of
application:
"Although it would be easy to use the process in this step to
"correct" perceived mis-features or bugs in the current character
standards, [Nameprep] expressly does not do so."
Charprep will continue to uphold the spirit of Nameprep to, "allow as
wide of a range of characters as possible to be allowed in host
names... The user should not be limited to only entering exactly the
characters that might have been used, but to instead be able to enter
characters that unambiguously [represents] the characters in the
[perceived] host name."
In other words, users should be able to use different but
perceptually equivalent characters (codepoints) and still arrive at
the perceived domain.
This document does not include the specific character equivalency
preparation (Charprep) tables, nor does it provide explicit policies
for the use of the Charprep tables. Rather, it intends to briefly
describe the problem of character equivalency issues for IDNs as well
as to suggest a framework for the publishing of Charprep tables for
different languages.
1.1 Terminology
The key words "MUST", "SHALL", "REQUIRED", "SHOULD", "RECOMMENDED",
and "MAY" in this document are to be interpreted as described in RFC
2119 [RFC2119].
1.2 Nomenclature
As in the Unicode Standard [UNICODE], Unicode code points are denoted
by "U+" followed by four to six hexadecimal digits.
The following terms will carry specific definitions within this
document:
Zone Administrator - A domain operator or service that manages
sub-domain delegations. This includes domain registries such as TLD
registries as well as operators of SLDs that issue third-level
domains, etc.
Registration - Entry of a domain into the zone file of an
authoritative name server.
Resolution - Matching or lookup of domain names within the name
server.
IDN - Internationalized Domain Name: a domain name containing one or
more characters outside the A-Z, a-z, 0-9 and "-" repertoire.
1.3 Disclaimer
This document does NOT intend to provide any discussion on
equivalence policies of any scripts, nor does it intend to suggest
any type of policies. Zone Administrators SHOULD consult with and
understand the needs of their user base before deciding and
publishing their own policies. Examples provided in this document
are for explanation only.
2. Importance of Charprep
The best way to illustrate the importance and need for Charprep is
through the following simple example:
Suppose a person obtained a domain <alpha><beta>.example from the
.example Zone Administrator. The person now advertises his domain as
<ALPHA><BETA>.example (Alpha and Beta in capital letters). A user
seeing this perceives the domain as AB.example. The user now
attempts to access the domain and fails.
It is true that the characters <ALPHA> and <A> are not technically
equivalent, but because of their perceived equivalence they confuse
the user and therefore defeat the purpose of having a human-friendly
domain name system.
More importantly, it could create a security issue whereby a domain
name is maliciously registered to confuse the end user. For example,
suppose the AB.example site is an e-commerce site. A malevolent
registrant may register the domain <ALPHA><BETA>.example and set up a
link to it on a competing site. The end user will not be able to
realize that s/he is being brought to a different site because the
display will always look like "AB.example".
Charprep will provide a framework for the publishing of Charprep
tables that can be used by Zone Administrators to create a set of
variants from the original submitted domain (Primary Domain) that may
cause user confusion. Further management of this set of variants
with regard to zone file entries is discussed in Zoneprep [ZONEPREP].
3. Equivalency versus Prohibition
A common misconception is that equivalence preparations prohibit the
use of mapped characters. This is NOT true. For example, even if
<ALPHA> is deemed equivalent to <A>, and vice versa, this does not
prohibit a Zone Administrator from offering a domain name that
contains <ALPHA>, or <A>, or both. To resolve possible conflicts, the
first-come, first-served rule employed by most Zone Administrators
today may naturally apply.
Another common misconception is that character equivalence
consideration requires word or phrase semantic (orthographic)
equivalence. This is also NOT true. Charprep is not concerned with
the resulting phrase or word, but focuses on the character itself.
Therefore, even though a character may be semantically different, it
MAY still be considered as equivalent (e.g. <ALPHA> versus <A>). Or
in the inverse, even though a character may be visually different, it
MAY still be considered equivalent (as in the case for Traditional
versus Simplified Chinese characters).
4. Character Equivalency Preparations
Throughout the IDN discussions, character equivalency issues were
repeatedly brought up. While they were appropriately dismissed as a
core protocol concern, the importance of Charprep has never been
discounted, especially by zone operators who have started to deploy
IDNs and from a policy point of view, such as in the discussions at
ICANN.
Charprep is important because characters that may be perceptually
equivalent, whether visually or contextually, may occupy different
"codepoints" (as specified in Unicode), and therefore make them
"technically" distinct and unique "characters", yet in real-life they
are perceived and considered to be the same.
For example, the Greek capital letter <ALPHA> is visually identical
to the English capital letter <A>, yet they occupy two different
codepoints in the Unicode scheme. The implication is that
<ALPHA>.example and <A>.example are technically two distinct domain
names even though, when displayed, they may appear identical:
"A.example" and "A.example". Furthermore, the Cyrillic capital letter
"A" is also visually identical to <ALPHA> and <A>.
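As a small illustration (not part of the Charprep framework itself), the following Python sketch uses the standard unicodedata module to show that the three look-alike capital letters are distinct codepoints. The names LOOKALIKES and describe are chosen here for illustration only.

```python
import unicodedata

# Three capital letters that are usually rendered with the same glyph:
# Latin A, Greek Alpha and Cyrillic A.
LOOKALIKES = ["\u0041", "\u0391", "\u0410"]

def describe(ch: str) -> str:
    """Return 'U+XXXX NAME' for a single character."""
    return f"U+{ord(ch):04X} {unicodedata.name(ch)}"

descriptions = [describe(ch) for ch in LOOKALIKES]
for line in descriptions:
    print(line)

# The labels look alike when displayed but never compare equal.
labels = [ch + ".example" for ch in LOOKALIKES]
all_distinct = len(set(labels)) == len(labels)
```

A plain string comparison therefore treats the three labels as unrelated names, which is exactly the gap that an equivalency table is meant to close.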
For another example, within the Chinese language, one particular
character may have a number of different visual representations, yet
they are conceptually equivalent. The most noticeable case is the
Traditional Chinese versus the Simplified Chinese representation of a
character (e.g. 發 [U+767C("fa"-prosper)] and 发 [U+53D1("fa"-prosper
| hair)]). To complicate matters, these relationships may not be
one-to-one: in different contexts a character may take on a
semantically different meaning, creating additional variants of the
root character (e.g. 发 [U+53D1("fa"-prosper | hair)] and 髮
[U+9AEE("fa"-hair)]).
Furthermore, parts of the Japanese and Korean languages utilize a
subset of the Chinese character repertoire. Two characters that are
considered perceptually equivalent in the context of the Chinese
language may nevertheless be considered distinct and unique in
Japanese Kanji (e.g. 國 [U+570B("guo"-country<cn>)("goku"-a name<jp>)]
and 国 [U+56FD("guo"-country<cn>)("koku"-country<jp>)]).
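The language dependence described above can be sketched as follows. This is only an illustration under stated assumptions: the PROFILES structure, the language keys "zh" and "ja", and the function name equivalent are hypothetical and do not come from any published Charprep profile.

```python
# The two codepoints discussed above.
GUO_TRADITIONAL = "\u570b"
GUO_SIMPLIFIED = "\u56fd"

# Hypothetical per-language equivalence sets. A real profile would be
# published by the relevant Zone Administrator.
PROFILES = {
    "zh": [{GUO_TRADITIONAL, GUO_SIMPLIFIED}],  # treated as one character
    "ja": [],                                   # treated as distinct
}

def equivalent(a: str, b: str, language: str) -> bool:
    """True if a and b are identical or share an equivalence set."""
    if a == b:
        return True
    return any(a in s and b in s for s in PROFILES[language])

print(equivalent(GUO_TRADITIONAL, GUO_SIMPLIFIED, "zh"))
print(equivalent(GUO_TRADITIONAL, GUO_SIMPLIFIED, "ja"))
```

Keeping the equivalence data outside the comparison function mirrors the idea that the same pair of codepoints can be handled differently under different profiles.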
It is therefore very important to preserve the perceptual
expectations of the end user for multilingual domain names, and to
maintain the user-friendly spirit of domain names so that they
continue to be a useful and human-friendly means of direct
navigation and resource addressing over the Internet.
5. Charprep Tables and Profiles
Charprep deals with perceptual equivalency of characters. Characters
are units of visual or graphical representation of the written form
of languages. Scripts best define the collection of a set of
characters. Charprep profiles MAY utilize the ISO15924: Codes for
the representation of names for scripts, as the guide for identifying
scripts and managing Charprep tables. Multiple scripts may share one
Charprep profile and vice versa. Charprep profiles MAY also define
their own Codepoint Inclusion table.
Each Charprep Profile SHOULD consist of the following three elements:
1. Charprep Report
2. Codepoints Inclusion
3. Charprep Table
The Charprep Report should describe the policy and give the
rationale for its equivalency determinations.
If the Charprep report simply identifies the set of one or more
script codes [ISO15924], a Codepoints Inclusion table is not
necessary. If a finer-grained approach is desired, a Codepoints
Inclusion Table SHOULD be included. A Codepoints Inclusion Table
simply provides a set of codepoints that is intended for the
corresponding Charprep Table.
{Note: Current documents of reference include [TSCONV], [JPCHAR] and
[HANGULCHAR], along with [IDN-Admin]}
5.1 Codepoints Inclusion Table
The Codepoints Inclusion Table should simply be a list of codepoints
that are intended to be included within the Charprep profile:
#Codepoints Inclusion Table for XXX
#version x.x
#script: XXX YYY
U+XXXX; Optional Remarks
U+XXXX; Optional Remarks
U+XXXX; Optional Remarks
...
Note that a Codepoints Inclusion Table name and a version number MUST
be included as part of the header of the table. Optionally, the
scripts considered within the table may be listed. If multiple
scripts are used, a space-separated list of script codes [ISO15924]
should be provided.
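A reader of such a table might parse it along the following lines. This is a minimal sketch, not a normative parser: the function name parse_inclusion_table, the dictionary layout, the sample entries and the script codes in SAMPLE are assumptions made for illustration, and the first "#" line that is neither a version nor a script line is assumed to carry the table name.

```python
import re

SAMPLE = """\
#Codepoints Inclusion Table for XXX
#version 1.0
#script: Latn Grek
U+0061; Latin small letter a
U+03B1; Greek small letter alpha
U+0430
"""

# "U+" followed by four to six hex digits, then optional "; remarks".
ENTRY = re.compile(r"^U\+([0-9A-Fa-f]{4,6})\s*(?:;\s*(.*))?$")

def parse_inclusion_table(text):
    """Return (header, codepoints) parsed from an inclusion table."""
    header = {"name": None, "version": None, "scripts": []}
    codepoints = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith("#"):
            body = line[1:].strip()
            if body.lower().startswith("version"):
                header["version"] = body[len("version"):].strip()
            elif body.lower().startswith("script:"):
                header["scripts"] = body[len("script:"):].split()
            elif header["name"] is None:
                header["name"] = body
            continue
        match = ENTRY.match(line)
        if match is None:
            raise ValueError(f"malformed entry: {line!r}")
        codepoints[int(match.group(1), 16)] = match.group(2) or ""
    # The table name and version number are mandatory header fields.
    if header["name"] is None or header["version"] is None:
        raise ValueError("table name and version are required")
    return header, codepoints

header, codepoints = parse_inclusion_table(SAMPLE)
```

The sketch rejects a table without a name or version, reflecting the MUST above, while the script line stays optional.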
6.2 Charprep Table
The Charprep Table MUST have 3 columns. Each entry MUST fill the
first 2 columns; the third column is optional:
  Codept   Equivalent Set            Remarks
+--------+-------------------------+------------------------------+
| U+XXXX | U+XXXX U+XXXX U+XXXX ...| Optional Remarks             |
:        :                         :                              :
There should be one entry for each Nameprep-ed codepoint considered
in the Charprep table. The Equivalent Set column consists of a set
of one or more space delimited codepoints corresponding to the
codepoint in the first column. For multi-codepoint entries, the
convention: U+XXXX+XXXX is used. Optional Remarks may be provided
for each entry. For example:
  Codept   Charprep Variants         Remarks
+--------+-------------------------+------------------------------+
| U+0061 | U+03B1 U+0430           | Greek & Cyrillic <A>         |
+--------+-------------------------+------------------------------+
| U+03B1 | U+0061 U+0430           | English & Cyrillic <A>       |
+--------+-------------------------+------------------------------+
| U+0430 | U+0061 U+03B1           | English & Greek <A>          |
+--------+-------------------------+------------------------------+
:        :                         :                              :
Note that the number of entries in the Charprep Table might NOT be
the same as in the Codepoints Inclusion Table for the same Charprep
profile.
Note also that a Charprep Table may not be necessary if the policy of
the Charprep profile is simply to have a Codepoints Inclusion Table.
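The codepoint notation used in the table, including the U+XXXX+XXXX convention for multi-codepoint entries, could be decoded roughly as shown below. The function names are hypothetical, and the sequence U+0061+0301 is an invented example of a multi-codepoint entry rather than one taken from a real table.

```python
def parse_codepoint_token(token: str) -> str:
    """Convert 'U+XXXX' or 'U+XXXX+XXXX...' into a Python string."""
    if not token.upper().startswith("U+"):
        raise ValueError(f"not a codepoint token: {token!r}")
    chars = []
    for part in token[2:].split("+"):
        # Each codepoint is written with four to six hexadecimal digits.
        if not 4 <= len(part) <= 6:
            raise ValueError(f"expected 4 to 6 hex digits: {part!r}")
        chars.append(chr(int(part, 16)))
    return "".join(chars)

def parse_equivalent_set(column: str) -> list:
    """Split a space-delimited Equivalent Set column into strings."""
    return [parse_codepoint_token(token) for token in column.split()]

single = parse_codepoint_token("U+0061")
sequence = parse_codepoint_token("U+0061+0301")
equivalents = parse_equivalent_set("U+03B1 U+0430")
```

Returning strings rather than integers lets single codepoints and multi-codepoint sequences be handled uniformly by later steps.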
6.3 Publishing of Charprep Profiles
A Zone Administrator, especially a Top-Level Domain Registry, SHOULD
publish Charprep profiles for all scripts (languages) in which it
allows registrations, and make them publicly available so that end
users can understand the registration policies.
The Codepoints Inclusion Tables and Charprep Tables SHOULD exist in
flat file format with the semi-colon used as a column delimiter. For
example:
#Charprep Table for XXX
#version x.x
#script: XXX YYY
U+0061; U+03B1 U+0430; Greek & Cyrillic <A>
U+03B1; U+0061 U+0430; English & Cyrillic <A>
U+0430; U+0061 U+03B1; English & Greek <A>
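Reading this flat file format might look like the sketch below. It assumes that lines beginning with "#" are header or comment lines and that remarks can be discarded; the names to_char and parse_charprep_table are illustrative and not defined by this document.

```python
SAMPLE = """\
#Charprep Table for XXX
#version x.x
#script: XXX YYY
U+0061; U+03B1 U+0430; Greek & Cyrillic <A>
U+03B1; U+0061 U+0430; English & Cyrillic <A>
U+0430; U+0061 U+03B1; English & Greek <A>
"""

def to_char(token):
    """Convert 'U+XXXX' (or 'U+XXXX+XXXX') into a string."""
    return "".join(chr(int(h, 16)) for h in token[2:].split("+"))

def parse_charprep_table(text):
    """Map each codepoint to its list of equivalents; remarks are dropped."""
    table = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        # At most three semicolon-delimited columns; the third is optional.
        columns = [c.strip() for c in line.split(";", 2)]
        if len(columns) < 2 or not columns[1]:
            raise ValueError(f"first two columns are required: {line!r}")
        table[to_char(columns[0])] = [to_char(t) for t in columns[1].split()]
    return table

table = parse_charprep_table(SAMPLE)
```

Limiting the split to two semicolons means a remark that itself contains a semicolon is kept as part of the third column instead of being treated as extra columns.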
6.4 Generation of Charprep Equivalence Set
Charprep does not discuss the specific policies for managing DNS
zone files or how the generated variants are managed within them. The
Charprep tables and profiles enable Zone Administrators to create a
set of variants from a given IDN.
For example, based on the examples above, the domain:
<03B1><03B1>.example [<Alpha><Alpha>.example]
would generate a set of 8 Charprep Variants:
<03B1><0061>.example
<03B1><0430>.example
<0061><0061>.example
<0061><03B1>.example
<0061><0430>.example
<0430><0061>.example
<0430><03B1>.example
<0430><0430>.example
The management of the variants and how they should be represented
in the DNS zone file is discussed further in Zoneprep [ZONEPREP].
Zoneprep describes a framework for Zone Administrators to prepare
their zone files based on Zoneprep profiles.
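The variant generation described in this section can be sketched as a Cartesian product over the per-character equivalents. This is one possible approach under the assumption that every position in the label is substituted independently; the function name charprep_variants is hypothetical, and the TABLE data is taken from the example Charprep Table.

```python
from itertools import product

# Equivalence data from the example Charprep Table.
TABLE = {
    "\u0061": ["\u03b1", "\u0430"],
    "\u03b1": ["\u0061", "\u0430"],
    "\u0430": ["\u0061", "\u03b1"],
}

def charprep_variants(label, table):
    """Return all variants of label, excluding the label itself."""
    # For each position: the original character plus its equivalents.
    choices = [[ch] + table.get(ch, []) for ch in label]
    variants = ["".join(combo) for combo in product(*choices)]
    return [v for v in variants if v != label]

variants = charprep_variants("\u03b1\u03b1", TABLE)
for v in variants:
    print("".join(f"<{ord(ch):04X}>" for ch in v) + ".example")
```

With this approach, a label of n characters that each have k equivalents yields (k+1)^n - 1 variants, so the set grows quickly with label length; for the two-character example this gives 3^2 - 1 = 8.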
7. IANA Considerations
There are no explicit IANA considerations required for Charprep.
IANA may however decide to maintain a registry for Charprep Profiles
as described in Section 6.
8. Security Considerations
This document does not discuss DNS security issues, and it is
believed that the proposal does not introduce security problems
beyond those that already exist or are anticipated from adding
multilingual characters to the DNS and/or using ACE.
Charprep considerations could however help to improve the security
and authenticity for the usage of IDNs by reducing the confusion of
perceptually equivalent characters.
Acknowledgements
This document incorporates many of the discussions from the CJK
community (CNNIC, TWNIC, JPRS and KRNIC) and the JET (Joint
Engineering Team), as well as discussions at different forums
including IETF and ICANN. More importantly, it draws on discussions
in the document "Internationalized Domain Names Registration and
Administration Guideline for Chinese, Japanese and Korean"
[IDN-Admin].
Furthermore, many valuable comments and discussions with the
following people were incorporated:
Xiaodong (Sheldon) Lee
Kenny Huang
Paul Hoffman
Mark Davis
Vincent Chen
References
[TSCONV] XiaoDong LEE, et al., "Traditional and Simplified Chinese
Conversion", November 2001
[JPCHAR] Yoshiro Yoneya & Yasuhiro Morishita, JPNIC, "Japanese
characters in multilingual domain name labels", March 2,
2001
[HANGULCHAR] Soobok Lee & GyeongSeog Gim, "Hangeul NAMEPREP
recommendation version 1.0", June 2001
[RFC1034] Mockapetris, P., "Domain Names - Concepts and
Facilities," STD 13, RFC 1034, USC/ISI, November 1987
[RFC1035] Mockapetris, P., "Domain Names - Implementation and
Specification," STD 13, RFC 1035, USC/ISI, November
1987
[RFC2119] S. Bradner, "Key words for use in RFCs to Indicate
Requirement Levels," RFC 2119, March 1997
[RFC2181] R. Elz, University of Melbourne & R. Bush, RGnet, Inc.,
"Clarifications to the DNS Specification", July 1997
[RFC3454] P. Hoffman, IMC & VPNC & M. Blanchet, Viagenie,
"Preparation of Internationalized Strings ("stringprep")",
December 2002
[RFC3490] P. Faltstrom, Cisco, P. Hoffman, IMC & VPNC & A. Costello,
UC Berkeley, "Internationalizing Domain Names in
Applications (IDNA)", March 2003
[RFC3491] P. Hoffman, IMC & VPNC & M. Blanchet, Viagenie, "Nameprep:
A Stringprep Profile for Internationalized Domain Names
(IDN)", March 2003
[RFC3492] A. Costello, Univ. of California, Berkeley, "Punycode: A
Bootstring encoding of Unicode for Internationalized
Domain Names in Applications (IDNA)", March 2003
[IDN-Admin] Editors: James SENG & John KLENSIN; Authors: K. KONISHI,
K. HUANG, H. QIAN & Y. KO, "Internationalized Domain Names
Registration and Administration Guideline for Chinese,
Japanese and Korean"
Authors:
Edmon Chung
Neteka
Suite 100,
243 College St., Toronto,
Ontario, Canada M5T 1R5
edmon@neteka.com