Internet DRAFT - draft-chung-idn-ace37
Internet Draft Edmon Chung, Neteka Inc.
<draft-chung-idn-ace37-00.txt> David Leung, Neteka Inc.
June 2001
ACE Utilizing All 37 Alphanumeric Characters (ACE37)
STATUS OF THIS MEMO
This document is an Internet-Draft and is in full conformance with
all provisions of Section 10 of RFC2026.
Internet-Drafts are working documents of the Internet Engineering
Task Force (IETF), its areas, and its working groups. Note that
other groups may also distribute working documents as Internet-
Drafts. Internet-Drafts are draft documents valid for a maximum of
six months and may be updated, replaced, or obsoleted by other
documents at any time. It is inappropriate to use Internet-Drafts
as reference material or to cite them other than as "work in
progress."
The reader is cautioned not to depend on the values that appear in
examples to be current or complete, since their purpose is primarily
educational. Distribution of this memo is unlimited.
The list of current Internet-Drafts can be accessed at
http://www.ietf.org/ietf/1id-abstracts.txt
The list of Internet-Draft Shadow Directories can be accessed at
http://www.ietf.org/shadow.html.
Abstract
ACE37 is a combination of DUDE-02, AMC-W/V and LACE. ACE37 utilizes
the simple one-pass algorithm of DUDE, the character block
considerations of AMC-W/V and the base-32 compression of LACE. It
also fully utilizes the entire LDH set currently allowed in the DNS
(A-Z, a-z, 0-9 and "-") within its character repertoire to optimize
performance and compression. Even in the worst-case scenario, an
ACE37 name can hold 21 characters, including Chinese, Japanese and
Korean names. Two Excel spreadsheets for ACE37 encoding and
decoding can be found at http://www.dnsii.org/ace37/ace37-encode.xls
and http://www.dnsii.org/ace37/ace37-decode.xls respectively.
While DUDE-02 provides a very efficient differential mechanism, its
compression is inefficient because it does not use all 5 bits of
each base-32 character for character information. The AMC series
is highly efficient in compression but requires complicated mode
changes and is therefore inefficient to process. LACE is moderate
in both respects: it requires a two-pass mechanism but utilizes
base-32 for good compression.
Chung & Leung [Page 1]
ACE37 ACE Utilizing All 37 Alphanumeric Characters July 2001
ACE37 uses simple character block shifting to achieve the
compression efficiency of the AMC series, retains the one-pass and
one mode XOR differential mechanism used by DUDE while embracing the
base-32 compression used by LACE for efficient character bit
information.
Terminology
The key words "MUST", "SHALL", "REQUIRED", "SHOULD", "RECOMMENDED",
and "MAY" in this document are to be interpreted as described in RFC
2119 [RFC2119].
LDH: Letters, Digits and Hyphens: a string of characters that
consists only of hyphens ("-"), English letters (A-Z, a-z) and
digits (0-9), and which might not be the result of an algorithm for
transcoding multilingual characters. For example:
whatever-you-want.example
ACE - ASCII Compatible Encoding: a string of characters resulting
from a particular algorithm for transforming multilingual character
information into an alphanumeric form acceptable by the existing
DNS. For example: bq--3bhc2zmh.tld. In essence, ACE is a subset of
LDH.
Hexadecimal values are shown preceded by "0x". For example, 0x60
is decimal 96. Binary values are shown preceded by "0b"; for
example, "0b1000" is decimal 8. As in the Unicode Standard
[UNICODE], Unicode code points are denoted by "U+" followed by four
to six hexadecimal digits, while a range of code points (or
hexadecimal numbers) is denoted by two hexadecimal numbers separated
by "..", with no prefixes.
Octets: sequences of 8 bits; Quintets: sequences of 5 bits;
Quartets: sequences of 4 bits; Duplets: sequences of 2 bits.
XOR: bitwise exclusive or. Given 2 nonnegative integers A and B, A
XOR B is the nonnegative integer value whose binary representation
is 1 wherever A and B disagree, and 0 wherever they agree.
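As a small illustration of the XOR definition, the following Python
sketch shows that applying the same XOR twice restores the original
value, which is the property a differential encoding relies on. The
variable names and values are illustrative only and are not part of
ACE37.

```python
# Illustrative values only; ^ is Python's bitwise exclusive-or operator.
a = 0b1100
b = 0b1010

diff = a ^ b          # 1 wherever a and b disagree, 0 wherever they agree
print(bin(diff))      # prints 0b110

# XOR-ing with the same value again restores the original operand.
restored = a ^ diff
print(restored == b)  # prints True
```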
Table Of Contents
1. Introduction
2. Code Block Shifting
3. Base-32 Characters
4. Base-4 Characters
5. LDH Considerations
6. Encoding Procedure
7. Decoding Procedure
8. Examples
9. Summary & Comparisons
10. Security Considerations
11. References
1. Introduction
ACE37 takes into account the recommendations and findings of the ACE
design team to create a "super-ACE" that incorporates the key
advantages of the various considered ACEs without complicated mode
changes. The encoding (Section 6) and decoding (Section 7) processes
are largely similar to, and as simple as, those of DUDE-02. The
encoding process for ACE37 in comparison with DUDE-02 can be
summarized as follows:
ACE37 Encoding Procedure         | DUDE Encoding Procedure
---------------------------------+---------------------------------
(1) let initial prev = 0x00      | (1) let initial prev = 0x60
(2) if n = LDH output "-n"       | (2) if n = hyphen output "-"
(3) code block shift to obtain   | (3) diff = prev XOR n
    ACE37 shifted n (Section 2)  | (4) prepend "0" to the last
(4) diff = prev XOR n            |     quartet and "1" to others
(5) output in appropriate base-4 | (5) output a base-32 character
    and base-32 form             |     for each corresponding
    (Sections 3&4)               |     quintet
(6) let prev = n                 | (6) let prev = n
Similarly, the decoding process can be described and compared:
ACE37 Decoding Procedure         | DUDE Decoding Procedure
---------------------------------+---------------------------------
(1) let initial prev = 0x00      | (1) let initial prev = 0x60
(2) if char = hyphen discard "-" | (2) if char = hyphen consume
    and output next char         |     and output 0x002D
(3) consume and convert char into| (3) consume and convert to
    duplets and quintets         |     quintets until encoun-
    (according to Sections 3&4)  |     tering a quintet with "0"
(4) concatenate to form diff     |     as first bit
    (based on Sections 4.1&4.2)  | (4) strip all first bits off
(5) let prev = prev XOR diff     | (5) concatenate to form diff
(6) reverse code block shifting  | (6) let prev = prev XOR diff
(7) output Unicode code point    | (7) output Unicode code point
The features of ACE37 include:
Unique & Reversible - the ACE37 encoding scheme yields a unique and
consistent result string for a given set of Unicode code points.
The encoded string could be decoded back to the original Unicode
code points without loss of character data.
Simple - ACE37 utilizes a one-pass system and the XOR differential
function to encode and decode. Code block shifting is done by a
simple calculation instead of mapping or creation of arbitrary
reference points. Complex mode changes are not required.
Spacious - With the code block shifting coupled with a base-32
scheme, ACE37 can accommodate up to 21 unique Han characters
(including CJK) within the 63 octets allowed by the DNS. Other
Latin based scripts can reach up to 31 characters.
Completeness - any sequence of Unicode code points
(U+0000..U+10FFFF) can be encoded. Restrictions on allowed code
points are not discussed, but it is expected that Nameprep
[Nameprep] will be applied prior to ACE37 encoding.
In essence, ACE37 captures the key criteria discussed by the
working group's ACE design team: reversibility, simplicity and
compression capability. Moreover, ACE37 utilizes a very simple code
block shifting mechanism (Section 2) to allow up to 21 arbitrary
CJK ideographs to be encoded within the 63-octet constraint.
2. Code Block Shifting
While the DNS was not originally designed for multilingual
characters, Unicode was not designed with the DNS in mind and
therefore code points were apparently not allocated in an ACE-
friendly way.
The AMC series [AMC-W & AMC-V] utilizes a number of reference points
to achieve better compression efficiency by anticipating and
minimizing delta between characters. For ACE37, a much simpler
rendering is used. More specifically, the entire character block
U+3000..U+9FFF for CJK ideographs is shifted down by 0x3000. That
is U+3000 will become 0x0000, U+4000 becomes 0x1000, and so on. To
compensate for the downwards shift, the general script and symbol
characters in U+0000..U+2FFF will be shifted upwards by 0x7000.
Therefore, U+0100 will become 0x7100, U+2000 becomes 0x9000, and so
on. All other code points (U+A000..U+10FFFF) are unchanged.
Original Unicode Allocation     | ACE37 Code Block Shifted
--------------------------------|-------------------------------
General Scripts  U+0000 -+      |     +- 0x0000 CJK Misc
                 U+1000  |      |     |  0x1000 CJK Ideographs
                         +---   |  -> |  0x2000
Symbols          U+2000 -+   \  | /   |  0x3000
                              \ |/    |  0x4000
CJK Misc         U+3000 -+     \/     |  0x5000
CJK Ideographs   U+4000  |     /\     +- 0x6000
                 U+5000  |    / |\
                 U+6000  +---   | \   +- 0x7000 General Scripts
                 U+7000  |      |  -> |  0x8000
                 U+8000  |      |     |
                 U+9000 -+      |     +- 0x9000 Symbols
                                |
Hangul           U+A000 -+      |     +- 0xA000 Hangul
                 U+B000  |      |     |  0xB000
                 U+C000  +------|---> |  0xC000
                 U+D000  |      |     |  0xD000
                  :  :  -+      |     +-  :  :
                                |
This shifting effectively moves the entire Han library to within
0x6FFF, so that each such character can be represented in 15 bits,
or exactly 3 base-32 characters (base-32 characters are detailed in
Section 3).
For example, the Chinese character for <change> with the original
Unicode code point at U+8F49, will be shifted to 0x5F49 and can be
represented in 3 quintets, and in turn with 3 base-32 characters:
Character: <change>
Unicode Code Point: U+8F49
ACE37 Shifted: 0x5F49
Corresponding Quartets: 0101 1111 0100 1001
Resulting Quintets: 10111 11010 01001
Base-32: nq9 (further discussed in Section 3)
This in turn means that any Chinese character could be represented
with 3 base-32 characters making the total possible characters
within a label, even without further compression introduced by the
XOR differential process (Section 6), to be at least 21. The ACE37
code block shifting process could be described as follows:
for each input code point = n
    if n <= 0x9FFF
        n = n - 0x3000        /*downwards shifting*/
        if n < 0
            n = 0xA000 + n    /*compensation for U+0000..U+2FFF*/
The character block shifting introduced here is extremely simple and
utilizes simple calculation that requires no mapping function. At
the same time, it achieves the goal in adjusting the Unicode
allocation so that it becomes more ACE friendly.
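The shift and its reversal can be sketched in Python as follows. This
is a minimal sketch that follows the prose description (down by
0x3000 for U+3000..U+9FFF, up by 0x7000 for U+0000..U+2FFF, all other
code points unchanged). The function names shift_code_point and
unshift_code_point are assumptions of this example and are not
defined by ACE37.

```python
def shift_code_point(n: int) -> int:
    """Apply the code block shift to a Unicode code point."""
    if n > 0x9FFF:
        return n              # U+A000..U+10FFFF are left unchanged
    if n >= 0x3000:
        return n - 0x3000     # U+3000..U+9FFF moves down to 0x0000..0x6FFF
    return n + 0x7000         # U+0000..U+2FFF moves up to 0x7000..0x9FFF


def unshift_code_point(s: int) -> int:
    """Reverse the code block shift."""
    if s > 0x9FFF:
        return s
    if s <= 0x6FFF:
        return s + 0x3000
    return s - 0x7000


print(hex(shift_code_point(0x8F49)))  # prints 0x5f49
print(hex(shift_code_point(0x0100)))  # prints 0x7100
```

Because the two moved blocks swap places inside 0x0000..0x9FFF, the
mapping is one-to-one and no lookup table is needed in either
direction.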
3. Base-32 Characters
Base-32 characters are used in LACE for compression, while DUDE-02
and the AMC series utilize them only for quartet flagging to
indicate the last quartet of each encoded code point. ACE37
utilizes base-32 characters for compression, while base-4
characters, which are introduced in Section 4, determine the
compressed code point brackets.
The following table shows the 32 base-32 characters and their
corresponding quintets:
Base-32 Character =to= Corresponding Quintet
0 = 00000 8 = 01000 g = 10000 o = 11000
1 = 00001 9 = 01001 h = 10001 p = 11001
2 = 00010 a = 01010 i = 10010 q = 11010
3 = 00011 b = 01011 j = 10011 r = 11011
4 = 00100 c = 01100 k = 10100 s = 11100
5 = 00101 d = 01101 l = 10101 t = 11101
6 = 00110 e = 01110 m = 10110 u = 11110
7 = 00111 f = 01111 n = 10111 v = 11111
With this layout of base-32 characters, it is also possible to
implement a computation based base-32 conversion instead of having
to resort to mapping and lookup tables:
For each quintet = q
if q <= 0x0F
then hex dump q to form base-32 character
if 0x10 <= q <= 0x1F
then q = q - 0x10
and char(q + 0x67) to form base-32 character
Note that 0x67 is the code value for the letter "g". Therefore, for
example if the quintet is 0b10001 its base-32 character can be
obtained by:
0x10 <= q=0b10001=0x11 <= 0x1F
therefore q = q - 0x10 = 0x11 - 0x10 = 0x01
and base-32 character = char(0x01 + 0x67)
char(0x68) = "h"
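The computation above can be sketched in Python as follows. The
helper names base32_char and to_base32 are assumptions of this
example; the second helper splits a value into big-endian quintets
before mapping each one, which reproduces the "nq9" example from
Section 2.

```python
def base32_char(q: int) -> str:
    """Map one quintet (0..31) to its base-32 character by computation."""
    if not 0 <= q <= 0x1F:
        raise ValueError("a quintet must be in the range 0..31")
    if q <= 0x0F:
        return format(q, "x")        # hex dump: "0".."9", "a".."f"
    return chr(q - 0x10 + 0x67)      # 0x67 is "g", so this yields "g".."v"


def to_base32(value: int, count: int) -> str:
    """Encode the low 5*count bits of value as count base-32 characters."""
    return "".join(
        base32_char((value >> (5 * i)) & 0x1F) for i in reversed(range(count))
    )


print(base32_char(0b10001))   # prints h
print(to_base32(0x5F49, 3))   # prints nq9
```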
4. Base-4 Characters
ACE37 goes beyond the 32 base-32 characters to include the
remaining 4 letters of the alphabet, {w,x,y,z}. These base-4
characters enable ACE37 to better utilize the existing "resources"
(the allowed characters) to represent IDN character information,
thereby making its encoding more efficient.
The set of base-4 characters is {w,x,y,z}, and they are used to
represent the following duplets (duplets are groups containing 2
bits):
Base-4 Character =to= Corresponding Duplet
w = 00
x = 01
y = 10
z = 11
4.1 Base-4 Indicators
Base-4 characters, while carrying character information, also double
as indicators for code point brackets. In DUDE-02, an extra bit
is prepended to each quartet, and the last quartet of each encoded
code point is prepended with "0", marking the end of the code
point. In ACE37, base-4 characters determine the length
(number of ACE37 characters) of the encoded code point. To be more
precise, the encoded bits are in fact those of the "diff" and not
of the code point itself (diff carries the same meaning as in
DUDE-02 and is further discussed in Sections 6 & 7).
The following table explains how base-4 characters are combined with
base-32 characters to form a representation of a diff (key: b4=base-
4, b32=base-32):
diff value               |bits| ACE37 Form
-------------------------|----|----------------------------
diff<=0x7F               |  7 | <b4><b32>
0x80<=diff<=0x7FFF       | 15 | <b32><b32><b32>
0x8000<=diff<=0x1FFFF    | 17 | w<b4><b32><b32><b32>
0x20000<=diff<=0xFFFFF   | 20 | ww<b32><b32><b32><b32>
0x100000<=diff<=0x10FFFF | 22 | <b4>w<b32><b32><b32><b32>
Note that the "bits" column represents the maximum number of
significant bits for the given diff value. For example when
diff<=0x7F, the maximum value is 0b1111111, therefore the number of
significant bits is 7.
Note also that to encode a 17-bit diff, the letter "w" is used as an
indicator to distinguish the sequence from the 7 bit diff where a
base-32 character is expected to follow a base-4 character. Since
"w" represents "00" that has no value, it will not be used in the
base-4 representation for a 17-bit diff (if a "00" is used, it means
that there are only 15 significant bits and therefore should use the
15 bit diff form). This is the case for the 20-bit form as well.
The "w" is used as an arbitrary indicator in the 22-bit form and
MUST be discarded during decoding.
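The form selection in the table can be sketched in Python as follows,
for any code point other than the first (the first code point is
treated separately in Section 4.2). The names B32, B4, quintets and
encode_diff are assumptions of this example. The sketch accepts a
diff of up to 22 bits in the last form, since the XOR of two values
can exceed the larger of them.

```python
B32 = "0123456789abcdefghijklmnopqrstuv"   # base-32 alphabet (Section 3)
B4 = "wxyz"                                # base-4 alphabet: w=00 .. z=11


def quintets(value: int, count: int) -> str:
    """Low 5*count bits of value as count base-32 characters, big-endian."""
    return "".join(B32[(value >> (5 * i)) & 0x1F] for i in reversed(range(count)))


def encode_diff(diff: int) -> str:
    """Encode a diff that does not belong to the first code point."""
    if not 0 <= diff <= 0x3FFFFF:
        raise ValueError("diff does not fit in 22 bits")
    if diff <= 0x7F:                         # 7-bit form: <b4><b32>
        return B4[diff >> 5] + quintets(diff, 1)
    if diff <= 0x7FFF:                       # 15-bit form: 3 x <b32>
        return quintets(diff, 3)
    if diff <= 0x1FFFF:                      # 17-bit form: w<b4> + 3 x <b32>
        return "w" + B4[diff >> 15] + quintets(diff, 3)
    if diff <= 0xFFFFF:                      # 20-bit form: ww + 4 x <b32>
        return "ww" + quintets(diff, 4)
    return B4[diff >> 20] + "w" + quintets(diff, 4)   # 22-bit form


print(encode_diff(0x3E))      # prints xu
print(encode_diff(0xD2))      # prints 06i
print(encode_diff(0x8000))    # prints wx000
```

In the 17-bit and 22-bit branches the leading duplet is never 00,
because the lower bound of each range forces one of its two bits to
be set; this is why "w" is free to act as an indicator there.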
By analyzing the ACE37 form, an encoded string could be successfully
returned to its original form. There is no overlap and the form can
be determined precisely. The following 5 rules dictate the 5
different ACE37 forms:
(1) Encode: if diff<=0x7F
Decode: if first character is <b4> AND next character NOT <b4>
Then it MUST be in 7-bit form: <b4><b32>
(2) Encode: if 0x80<=diff<=0x7FFF
Decode: if first character is <b32>
Then it MUST be a 15-bit form: <b32><b32><b32>
(3) Encode: if 0x8000<=diff<=0x1FFFF
Decode: if first character is "w" AND next character is <b4>
AND NOT "w"
Then it MUST be in 17-bit form: w<b4><b32><b32><b32>
(4) Encode: if 0x20000<=diff<=0xFFFFF
Decode: if first character is "w" AND next character is "w"
Then it MUST be in 20-bit form: ww<b32><b32><b32><b32>
(5) Encode: if 0x100000<=diff<=0x10FFFF
Decode: if first character is <b4> AND NOT "w"
AND next character is "w"
Then it MUST be 22-bit form: <b4>w<b32><b32><b32><b32>
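The decode side of the five rules can be sketched in Python as
follows: given an encoded string and a position, the sketch
identifies the form from the first one or two characters and returns
the diff together with the number of characters consumed. It covers
only code points other than the first, and the names B32, B4,
quintets_value and decode_diff are assumptions of this example.

```python
B32 = "0123456789abcdefghijklmnopqrstuv"
B4 = "wxyz"


def quintets_value(chars: str) -> int:
    """Concatenate the quintets of the given base-32 characters."""
    value = 0
    for ch in chars:
        value = (value << 5) | B32.index(ch)   # ValueError if not base-32
    return value


def decode_diff(s: str, i: int = 0):
    """Return (diff, characters consumed) for the form starting at s[i]."""
    c, nxt = s[i], s[i + 1 : i + 2]
    if c in B32:                                   # rule (2): 15-bit form
        head, skip, count = 0, 0, 3
    elif c in B4 and nxt != "" and nxt in B32:     # rule (1): 7-bit form
        head, skip, count = B4.index(c), 1, 1
    elif c == "w" and nxt == "w":                  # rule (4): 20-bit form
        head, skip, count = 0, 2, 4
    elif c == "w" and nxt in ("x", "y", "z"):      # rule (3): 17-bit form
        head, skip, count = B4.index(nxt), 2, 3
    elif c in ("x", "y", "z") and nxt == "w":      # rule (5): 22-bit form
        head, skip, count = B4.index(c), 2, 4
    else:
        raise ValueError("not a valid ACE37 form")
    body = s[i + skip : i + skip + count]
    if len(body) != count:
        raise ValueError("truncated ACE37 form")
    return (head << (5 * count)) | quintets_value(body), skip + count


print(decode_diff("xu"))        # prints (62, 2)
print(decode_diff("wx000"))     # prints (32768, 5)
```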
Note that the ACE37 scheme can effectively encode a diff of up to 22
significant bits, or 0x3FFFFF. Unicode code points range only over
0x0000..0x10FFFF, so the XOR of any two of them never exceeds
0x1FFFFF (21 bits); ACE37 is therefore able to handle any Unicode
code point.
Additionally, base-4 characters (and sometimes base-32 characters)
could be used for mixed-case annotation. This optional mixed-case
annotation mechanism is discussed in Appendix B.
4.2 First Code Point Considerations
There are additional considerations for the first code point that is
encoded or decoded to ensure that if the first code point is within
the first Unicode plane (U+0000..U+FFFF), it will not occupy more
than 4 ACE37 characters.
This special consideration affects only Rules (1), (3) and (4)
explained in Section 4.1. Rule (1) is discarded for the first code
point, so any diff up to 0x7FFF takes the form
<b32><b32><b32>. The form for Rule (3) becomes simply
<b4><b32><b32><b32> without the "w" indicator. Similarly, the form
for Rule (4) becomes w<b32><b32><b32><b32> with one less "w".
The first code point considerations can be summarized in the
following 4 rules:
(a) Encode: if diff<=0x7FFF
Decode: if first character is <b32>
Then it MUST be in 15-bit form: <b32><b32><b32>
(b) Encode: if 0x8000<=diff<=0x1FFFF
Decode: if first character is <b4> AND NOT "w"
Then it MUST be in 17-bit form: <b4><b32><b32><b32>
(c) Encode: if 0x20000<=diff<=0xFFFFF
Decode: if first character is "w"
Then it MUST be in 20-bit form: w<b32><b32><b32><b32>
(d) Encode & Decode: same as Rule (5) in Section 4.1
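Rules (a) to (d) can be sketched in Python as follows. The names
B32, B4, quintets and encode_first_diff are assumptions of this
example. For the first code point prev is 0x00, so the diff equals
the code-block-shifted code point itself.

```python
B32 = "0123456789abcdefghijklmnopqrstuv"
B4 = "wxyz"


def quintets(value: int, count: int) -> str:
    """Low 5*count bits of value as count base-32 characters, big-endian."""
    return "".join(B32[(value >> (5 * i)) & 0x1F] for i in reversed(range(count)))


def encode_first_diff(diff: int) -> str:
    """Encode the diff of the first code point (rules (a) to (d))."""
    if not 0 <= diff <= 0x3FFFFF:
        raise ValueError("diff does not fit in 22 bits")
    if diff <= 0x7FFF:                        # (a) always 3 x <b32>
        return quintets(diff, 3)
    if diff <= 0x1FFFF:                       # (b) <b4> + 3 x <b32>, no "w"
        return B4[diff >> 15] + quintets(diff, 3)
    if diff <= 0xFFFFF:                       # (c) a single "w" + 4 x <b32>
        return "w" + quintets(diff, 4)
    return B4[diff >> 20] + "w" + quintets(diff, 4)   # (d) same as rule (5)


print(encode_first_diff(0x00D1))    # prints 06h
print(encode_first_diff(0xC138))    # prints xg9o
print(encode_first_diff(0x261AF))   # prints w4odf
```

The three printed values are the opening characters of examples (E),
(C) and (G) in Section 8.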
Besides the special considerations for base-4 character usage, the
setting of prev is also treated specially for the first code point.
As laid out in Section 6, the first code point is detected by
evaluating prev. If prev = 0x00, the code point is assumed to be the
first, as 0x00 SHOULD NOT be a permitted input character. An LDH as
the first code point needs special handling. Normally, when n = LDH
is encountered (Section 5), it is output as "-n" and prev is not
changed. However, if the first code point is an LDH, then after
outputting "-n", prev is set to the code-block-shifted value of
lowercase(n), as in the decoding procedure of Section 7. This
ensures that only the first code point sees prev = 0x00.
5. LDH Considerations
Finally, the 37th character of the LDH repertoire, the hyphen, is
used to indicate LDH exceptions. Extending the hyphen
consideration of DUDE-02, ACE37 gives special consideration to the
entire LDH repertoire. All LDH characters are encoded "as is"
with the addition of a leading hyphen. For example, the character
"a" is encoded within ACE37 as "-a". The hyphen character "-"
is encoded as "--".
This ensures that each LDH character takes up only 2 character
positions within an ACE37 encoded string, and also allows
administrators to see the actual characters, similar to the AMC
series. Unlike in the AMC series, however, the hyphen does not
indicate an ongoing mode change; it applies only to the character
that follows, thereby retaining the simplicity of the DUDE-02
single-mode, single-pass philosophy.
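The LDH rule can be sketched in Python as follows. The names LDH,
is_ldh and encode_ldh are assumptions of this example.

```python
import string

# The LDH repertoire: letters (either case), digits and the hyphen.
LDH = set(string.ascii_letters + string.digits + "-")


def is_ldh(ch: str) -> bool:
    """True if ch is a single letter, digit or hyphen."""
    return ch in LDH


def encode_ldh(ch: str) -> str:
    """Encode one LDH character as a hyphen followed by the character."""
    if not is_ldh(ch):
        raise ValueError("not an LDH character")
    return "-" + ch


print(encode_ldh("a"))                         # prints -a
print("".join(encode_ldh(c) for c in "a-b"))   # prints -a---b
```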
6. Encoding Procedure
Similar to DUDE, all ordering of bits and quartets is big-endian.
The following describes the encoding procedure:
Set initial value for prev = 0x00
for each input code point = n
  if n is an LDH {A-Z, a-z, 0-9, -}
    output "-n"                     (Section 5: LDH Considerations)
    if prev = 0x00                  (Section 4.2: First Code Point)
      let prev = code-block-shifted lowercase(n)
  else perform code block shifting  (Section 2: Code Block Shifting)
    let diff = prev XOR n           (n after code block shifting)
    select the output form          (Section 4: Base-4 Characters)
    if diff<=0x7F
      if this is the first code point          (Section 4.2)
        then output 15-bit form: <b32><b32><b32>
      else, output 7-bit form: <b4><b32>
    if 0x80<=diff<=0x7FFF
      output 15-bit form: <b32><b32><b32>
    if 0x8000<=diff<=0x1FFFF
      if this is the first code point          (Section 4.2)
        then output 17-bit form: <b4><b32><b32><b32>
      else, output 17-bit form: w<b4><b32><b32><b32>
    if 0x20000<=diff<=0xFFFFF
      if this is the first code point          (Section 4.2)
        then output 20-bit form: w<b32><b32><b32><b32>
      else, output 20-bit form: ww<b32><b32><b32><b32>
    if 0x100000<=diff<=0x10FFFF
      output 22-bit form: <b4>w<b32><b32><b32><b32>
    let prev = n
end and obtain next n and return to: "for each input code point = n"
The following is a more comprehensive pseudo code:
let prev = 0x00
for each input integer n (in order) do begin
  if n = "-" or "0..9" or "A..Z" or "a..z"
    then output "hyphen"+"char(n)"
    if prev = 0x00
      let prev = code-block-shifted lowercase(n)
  else begin
    if n = 0x00
      then error and abort
    if n <= 0x9FFF
      n = n - 0x3000
      if n < 0
        then n = 0xA000 + n
    let diff = prev XOR n
    if diff <= 0x7F
      if prev = 0x00
        then output with 3 base-32 characters
      else, output first 2 bits with a base-4 character {wxyz}
        and remaining 5 bits with 1 base-32 character
    if 0x80 <= diff <= 0x7FFF
      then output all 15 bits with base-32 characters
    if 0x8000 <= diff <= 0x1FFFF
      if prev = 0x00
        then output first 2 bits with a base-4 {xyz} (except w)
        and output remaining 15 bits with base-32
      else, output "w"
        and output first 2 bits with a base-4 {xyz} (except w)
        and output remaining 15 bits with base-32
    if 0x20000 <= diff <= 0xFFFFF
      if prev = 0x00
        then output "w"
      else, output "ww"
      and output all 20 bits with base-32 characters
    if 0x100000 <= diff <= 0x10FFFF
      then output first 2 bits with a base-4 {xyz} (except w)
      and output "w"
      and output remaining 20 bits with base-32
    let prev = n
end
end
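The encoding procedure can be sketched in Python as follows. This is
a minimal sketch under stated assumptions, not a reference
implementation: the names B32, B4, LDH, shift, quintets and encode
are choices of this example, input is taken as a sequence of integer
code points that has already been case-folded, and after a leading
LDH character prev is set to the code-block-shifted lowercase value,
which is what the decoding procedure in Section 7 and the examples
in Section 8 use.

```python
B32 = "0123456789abcdefghijklmnopqrstuv"
B4 = "wxyz"
LDH = set("abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789-")


def shift(n: int) -> int:
    """Code block shifting (Section 2)."""
    if n > 0x9FFF:
        return n
    return n - 0x3000 if n >= 0x3000 else n + 0x7000


def quintets(value: int, count: int) -> str:
    """Low 5*count bits of value as count base-32 characters, big-endian."""
    return "".join(B32[(value >> (5 * i)) & 0x1F] for i in reversed(range(count)))


def encode(code_points) -> str:
    out = []
    prev = 0x00
    for n in code_points:
        if not 0 < n <= 0x10FFFF:
            raise ValueError("code point out of range")
        ch = chr(n)
        if ch in LDH:                          # Section 5: output "-n"
            out.append("-" + ch)
            if prev == 0x00:                   # Section 4.2: first code point
                prev = shift(ord(ch.lower()))
            continue
        n = shift(n)
        diff = prev ^ n
        first = prev == 0x00
        if diff <= 0x7F and not first:         # 7-bit form
            out.append(B4[diff >> 5] + quintets(diff, 1))
        elif diff <= 0x7FFF:                   # 15-bit form
            out.append(quintets(diff, 3))
        elif diff <= 0x1FFFF:                  # 17-bit form
            out.append(("" if first else "w") + B4[diff >> 15] + quintets(diff, 3))
        elif diff <= 0xFFFFF:                  # 20-bit form
            out.append(("w" if first else "ww") + quintets(diff, 4))
        else:                                  # 22-bit form
            out.append(B4[diff >> 20] + "w" + quintets(diff, 4))
        prev = n
    return "".join(out)


# Example (C) from Section 8: 6 hangul syllables.
print(encode([0xC138, 0xACC4, 0xC758, 0xBAA8, 0xB4E0, 0xC0AC]))
# prints xg9orfsqssvfg3i8t2c
```

The ACE prefix and the 63-octet length check are outside the scope
of this sketch.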
Nameprep [Nameprep] is not discussed in this document, but it is
expected to be applied for IDN. Regardless of the code point
presented, an encoder MUST NOT produce an incorrect output. The
encoder must fail if it encounters a negative input value.
The initial value used is 0x00 so that all domains beginning with a
CJK ideograph or a character within row 0 (U+0000..U+0FFF) will be
shorter. Note that after the code block shifting (Section 2), the
entire Han library is within 0x0000..0x6FFF, while row 0 is fitted
to 0x7000..0x7FFF. Therefore, by using an initial value of 0x00, the
diff for all Han and row 0 characters will be no greater than
0x7FFF. The initial value is also used as a check point for the
first code point considerations (Section 4.2).
Additionally, an optional mixed-case annotation mechanism is
discussed in Appendix B.
7. Decoding Procedure
A thorough description of the decoding rules, except for the final
reversal of the code block shifting, has been presented in Sections
4.1 and 4.2. The following is a brief outline of the decoding
procedure:
let prev = 0x00
while the input string is not exhausted
  if present character = hyphen            (Section 5: LDH
    discard and output next character       Considerations)
  else, depending on the presented form    (Section 4)
    convert into duplets and quintets      (Section 4 & 3)
    and concatenate to form diff
    let prev = prev XOR diff
    reverse code block shifting:           (Section 2)
    if prev<=0x9FFF
      and if prev<=0x6FFF
        output character = prev + 0x3000
      else, output character = prev - 0x7000
    else output character = prev
    output character
End
The following is a more comprehensive pseudo code for the decoding
procedure:
let prev = 0x00
while the input string is not exhausted do begin
  if present character = hyphen      /*Section 5:LDH Considerations*/
    then consume and discard hyphen
    and obtain the next character
    and output character
    if prev = 0x00                   /*Section 4.2:First Code Point*/
      let prev = code block shifted lowercase output character
  else begin
    if present character = Base-32 characters (0..v)
      consume present character and next 2 characters
      and convert them to quintets according to Base-32
      concatenate the resulting quintets to form diff
      /*15 bit form, 0x80<=diff<=0x7FFF*/
    if present character = Base-4 characters {xyz} and NOT w
      consume present character
      and convert it to a duplet according to Base-4
      if next character = w
        then consume and discard w and obtain next 4 characters
        consume and convert characters to
        quintets according to Base-32
        concatenate duplet with the 4 quintets to form diff
        /*22 bit form, 0x100000<=diff<=0x10FFFF*/
      else, if prev = 0x00
        obtain and consume next 3 characters
        and convert them to quintets according to Base-32
        concatenate duplet with the 3 quintets to form diff
        /*first code point: 17 bit form, 0x8000<=diff<=0x1FFFF*/
      else, if next character = Base-32 character (0..v)
        then consume and convert to quintet according to Base-32
        concatenate duplet with the quintet to form diff
        /*7 bit form, diff<=0x7F*/
      else, fail and indicate error
    if present character = w
      consume and discard "w" and obtain next character
      if prev = 0x00
        obtain and consume next 4 characters
        and convert characters to quintets based on Base-32
        concatenate the 4 quintets to form diff
        /*first code point: 20 bit form, 0x20000<=diff<=0xFFFFF*/
      else, if next character = Base-4 characters {xyz} and NOT w
        consume and convert to duplet according to Base-4
        and obtain and consume next 3 characters
        and convert to quintets according to Base-32
        concatenate duplet with the 3 quintets to form diff
        /*17 bit form, 0x8000<=diff<=0x1FFFF*/
      else, if next character = w
        then consume and discard w
        and obtain and consume next 4 characters
        and convert to quintets according to Base-32
        concatenate the 4 quintets to form diff
        /*20 bit form, 0x20000<=diff<=0xFFFFF*/
      else, if next character = Base-32 character (0..v)
        then consume and convert to quintet according to Base-32
        set quintet to diff
        /*7 bit form, diff<=0x7F*/
    fail upon encountering a non-ACE37 character
    or end-of-input
    let prev = prev XOR diff
    if prev <= 0x9FFF                /*reversal of the code    */
      and if prev <= 0x6FFF          /*block shifting described*/
        output = prev + 0x3000       /*in Section 2            */
      else, output = prev - 0x7000
    else, output prev
  end
end
encode the output sequence and compare it to the input string
fail if they do not match (case insensitively)
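The decoding procedure can be sketched in Python as follows. This is
a minimal sketch under stated assumptions, not a reference
implementation: the names B32, B4, shift, unshift, quintets_value and
decode are choices of this example, the first code point is detected
by prev = 0x00 as described in Section 4.2, and the final step of
re-encoding the output and comparing it with the input is omitted
for brevity.

```python
B32 = "0123456789abcdefghijklmnopqrstuv"
B4 = "wxyz"


def shift(n: int) -> int:
    """Code block shifting (Section 2)."""
    if n > 0x9FFF:
        return n
    return n - 0x3000 if n >= 0x3000 else n + 0x7000


def unshift(s: int) -> int:
    """Reversal of the code block shifting."""
    if s > 0x9FFF:
        return s
    return s + 0x3000 if s <= 0x6FFF else s - 0x7000


def quintets_value(chars: str) -> int:
    value = 0
    for ch in chars:
        value = (value << 5) | B32.index(ch)   # ValueError if not base-32
    return value


def decode(s: str) -> list:
    out, prev, i = [], 0x00, 0
    while i < len(s):
        if s[i] == "-":                        # Section 5: LDH literal
            lit = s[i + 1 : i + 2]
            if lit == "":
                raise ValueError("hyphen at end of input")
            out.append(ord(lit))
            if prev == 0x00:                   # Section 4.2
                prev = shift(ord(lit.lower()))
            i += 2
            continue
        c, nxt = s[i].lower(), s[i + 1 : i + 2].lower()
        first = prev == 0x00
        if c in B32:                                   # 15-bit form
            head, skip, count = "", 0, 3
        elif c in ("x", "y", "z") and nxt == "w":      # 22-bit form
            head, skip, count = c, 2, 4
        elif c == "w" and first:                       # first: 20-bit form
            head, skip, count = "", 1, 4
        elif c == "w" and nxt == "w":                  # 20-bit form
            head, skip, count = "", 2, 4
        elif c == "w" and nxt in ("x", "y", "z"):      # 17-bit form
            head, skip, count = nxt, 2, 3
        elif c in B4 and first:                        # first: 17-bit form
            head, skip, count = c, 1, 3
        elif c in B4 and nxt != "" and nxt in B32:     # 7-bit form
            head, skip, count = c, 1, 1
        else:
            raise ValueError("not a valid ACE37 form")
        body = s[i + skip : i + skip + count].lower()
        if len(body) != count:
            raise ValueError("unexpected end of input")
        high = (B4.index(head) << (5 * count)) if head else 0
        prev ^= high | quintets_value(body)
        out.append(unshift(prev))
        i += skip + count
    return out


print([hex(n) for n in decode("w4odfwg")])   # prints ['0x261af', '0x261bf']
```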
8. Examples
ACE37 is likely to be implemented with an ACE prefix in the form
"xx--". The actual prefix to be used is not discussed in this
document. The following examples are taken from the mailing list as
well as from DUDE-02 and the AMC series. The resulting ACE37 string
is compared with that using DUDE:
(A) JPNIC (the registry of .jp domain)
Unicode: U+793E U+56E3 U+6CD5 U+4EBA U+65E5 U+672C U+30CD U+30C3
U+30C8 U+30EF U+30FC U+30AF U+30A4 U+30F3 U+30D5 U+30A9
U+30E1 U+30FC U+30B7 U+30E7 U+30F3 U+30BB U+30F3 U+30BF
U+30FC
ACE37: i9urut6hm8jfaqv0m9dv1wewbx7wjyjwbynx6zsy8wtybygwky8y8ycy3
(57 char)
DUDE-02: (error: result string exceeds 59 characters*)
Note: 59 characters is the maximum allowable when the ACE
prefix "xx--" is included
(B) A health-insurance organization in Tokyo
Unicode: U+6771 U+4EAC U+90FD U+60C5 U+5831 U+30B5 U+30FC U+30D3
U+30B9 U+7523 U+696D U+5065 U+5EB7 U+4FDD U+967A U+7D44
U+5408
ACE37: drhaetvihk1o67ka44y9xfzahcqv2e6883micbaud7apuqac (48 char)
DUDE-02: (error: result string exceeds 59 characters)
(C) 6 hangul syllables
Unicode: U+C138 U+ACC4 U+C758 U+BAA8 U+B4E0 U+C0AC
ACE37: xg9orfsqssvfg3i8t2c (19 char)
DUDE-02: 6txiy79ny53nz79a8wizwwn (23 char)
(D) maji<de>koi<suru>5<byou><mae> (Latin, hiragana, kanji)
Unicode: U+006D U+0061 U+006A U+0069 U+3067 U+006B U+006F U+0069
U+3059 U+308B U+0035 U+79D2 U+524D
ACE37: -m-a-j-is0a-k-o-ixu06i-5iapqsv (30 char)
DUDE-02: pnmdvssqvssnegvsva7cvs5qz38hu53r (32 char)
(E) <pafii>de<runba> (Latin, katakana)
Unicode: U+30D1 U+30D5 U+30A3 U+30FC U+0064 U+0065 U+30EB U+30F3
U+30D0
ACE37: 06hw4zmyv-d-ewnwox3 (19 char)
DUDE-02: vs5bezgxrvs3ibvs2qtiud (22 char)
(F) <sono><supiido><de> (hiragana, katakana)
Unicode: U+305D U+306E U+30B9 U+30D4 U+30FC U+30C9 U+3067
ACE37: 02txj06nzdx8xl05e (17 char)
DUDE-02: vsvpvd7hypuivf4q (16 char)
(G) 2 Arbitrary Plane Two Code Points
Unicode: U+261AF U+261BF
ACE37: w4odfwg (7 char)
DUDE-02: uyt6rta (7 char)
(H) Czech: Pro<ccaron>prost<ecaron>nemluv<iacute><ccaron>esky
Unicode: U+0050 U+0072 U+006F U+010D U+0070 U+0072 U+006F U+0073
U+0074 U+011B U+006E U+0065 U+006D U+006C U+0075 U+0076
U+00ED U+010D U+0065 U+0073 U+006B U+0079
ACE37: -p-r-o0bt-p-r-o-s-twm-n-e-m-l-u-v0fm0f0-e-s-k-y (47 char)
DUDE-02: vauctptyctzpctptnhtyrtzfmibtjd3mt8atyitgtitc (44 char)
(I) Chinese
Unicode: U+4ED6 U+5011 U+7232 U+4EC0 U+9EBD U+4E0D U+8AAA U+4E2D
U+6587
ACE37: 7mmfm7oh3n7is3ts5gh57h47ata (27 char)
DUDE-02: w85gt86huuudv69c7szp7s5a6w4h6w2hu54k (36 char)
9. Summary & Comparisons
In summary, ACE37 is based on the DUDE-02 process with an improved
compression scheme for code point sequences that are less likely to
cluster too closely together, such as CJK ideographs.
The design team has indicated that 30 characters should generally
be sufficient. The Asian community has raised considerable concern
that 14-15 characters is definitely limiting, while there has been
little indication from the Latin community that length is a real
concern. ACE37 has therefore set its objective to raise the
possible number of characters in the worst-case scenario to around
20 characters.
ACE37 has succeeded in creating a very simple variation based on
the primary ACEs identified by the design team, yielding an ACE that
achieves dramatically better performance for CJK characters while
maintaining the simplicity of DUDE.
Key Improvements of ACE37 over DUDE-02
- much more spacious for Han characters. Improved worst-case
scenario to 21 Han ideographs by introducing code block shifting
and utilizing fully base-32 characters
- no need to arbitrarily pre-pend flagging bits to identify code
point brackets. Instead base-4 characters and diff forms are used
- base-32 and base-4 characters can be easily computed instead of
mapped using lookup tables
Key Improvements of ACE37 over the AMC series
- a simpler process, utilizing the one-pass differential
mechanism from DUDE-02
- a much simpler code block shifting process is used in ACE37 to
achieve a goal similar to that of the complex multiple reference
point system used by the AMC series
- base-32 and base-4 characters can be easily computed instead of
mapped using lookup tables
Key Improvements of ACE37 over LACE
- a simpler process, utilizing the one-pass differential
mechanism from DUDE-02
- much more spacious for Han characters. Improved worst-case
scenario to 21 Han ideographs by introducing code block shifting
and utilizing fully base-32 characters
- base-32 and base-4 characters can be easily computed instead of
mapped using lookup tables
Two Excel spreadsheets for ACE37 encoding and decoding can be found
at http://www.dnsii.org/ace37/ace37-encode.xls and
http://www.dnsii.org/ace37/ace37-decode.xls respectively. These
illustrate the simplicity of ACE37 and provide a handy tool for
checking ACE37 encoding and decoding algorithms. The ACE37-encode
spreadsheet also includes a DUDE-encode worksheet.
10. Security Considerations
This document does not talk about DNS security issues, and it is
believed that the proposal does not introduce additional security
problems not already existent and/or anticipated by adding
multilingual characters to DNS and/or using ACE.
11. References
[AMC-W] Adam M. Costello, "AMC-ACE-W version 0.1.0", May 31, 2001.
[AMC-V] Adam M. Costello, "AMC-ACE-V version 0.1.0", May 31, 2001.
[DUDE-02] Mark Welter, Brian W. Spolarich & Adam M.
Costello, "Differential Unicode Domain Encoding (DUDE)",
June 7, 2001.
[LACE] Mark Davis, IBM & Paul Hoffman, IMC & VPNC, "LACE: Length-
based ASCII Compatible Encoding for IDN", January 5, 2001.
[Nameprep] Paul Hoffman, IMC & VPNC & Marc Blanchet, ViaGenie,
"Preparation of Internationalized Host Names", February
24, 2001.
Appendix A. Acknowledgements
The ACE37 draft is a combination of DUDE-02, the AMC series and
LACE, and takes into consideration the report of the ACE design
team. The authors would therefore like to thank the authors of
DUDE-02 - Mark Welter, Brian W. Spolarich & Adam M. Costello; the
author of the AMC series - Adam M. Costello; the authors of LACE -
Mark Davis & Paul Hoffman; and the ACE design team and its advisors
- Adam M. Costello, Paul Hoffman, Makoto Ishisone, David Laurence,
Brian Spolarich, Rick Wesson, Marc Blanchet, Patrik Faltstrom and
Erik Nordmark for their inspiration.
Appendix B. Mixed-case annotation
This section is taken from DUDE and modified for ACE37.
In order to use ACE37 to represent case-insensitive Unicode strings,
higher layers need to case-fold the Unicode strings prior to ACE37
encoding. The encoded string can, however, use mixed-case base-4
characters as an annotation telling how to convert the folded
Unicode string into a mixed-case Unicode string for display
purposes.
Each Unicode code point (unless it is an LDH) is represented by a
sequence of base-4 and base-32 characters, the first of which is
usually a base-4 character, which is always a letter {wxyz} (as
opposed to a digit). If that letter is uppercase, it is a
suggestion that the Unicode character be mapped to uppercase (if
possible); if the letter is lowercase, it is a suggestion that the
Unicode character be mapped to lowercase (if possible).
If the code point is an LDH, for example "a", it will be represented
as "-a". To mark the case for an LDH, simply set the LDH to the
desired case following the "-". For example, if an uppercase "A" is
desired, the encoded form SHOULD be "-A".
Note that there is a possibility that no base-4 character is present
for a code point representation. That is the case for a 15-bit diff
form. In this case, the base-32 characters will be used for case
suggestion (if possible), similar to that discussed for using a
base-4 character. However, also note that there is a very remote
possibility that all 3 base-32 characters are digits. If this
happens, case unfolding will be aborted. Since case annotation is
an optional feature and used for display purposes only, this is not
considered to be a major concern. Moreover, the chance of this
happening is remote, at only (32639/27)/1114109, or about 0.1%.
ACE37 encoders and decoders are not required to support these
annotations, and higher layers need not use them.
For example: In order to suggest that example (H) in Section 8:
"Examples" be displayed as:
Czech: Pro<ccaron(uppercase)>prost<ecaron(uppercase)>
nemLUV<iacute(lowercase)><ccaron(lowercase)>esky
one could capitalize the ACE37 encoding as:
ACE37: -P-r-o0BT-p-r-o-s-tWM-n-e-m-L-U-V0fm0f0-e-s-k-y (47 char)
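The annotation step can be sketched in Python as follows. The sketch
assumes the encoded string has already been split into one token per
code point, and it upper-cases every letter of a flagged token, which
is an assumption read off the capitalized example rather than a rule
stated in the text. The names annotate and suggested_case are
hypothetical.

```python
def annotate(tokens, uppercase_flags):
    """Join per-code-point ACE37 tokens, upper-casing the flagged ones."""
    return "".join(
        tok.upper() if flag else tok.lower()
        for tok, flag in zip(tokens, uppercase_flags)
    )


def suggested_case(token):
    """Return 'upper', 'lower', or None when the token has no letter."""
    for ch in token:
        if ch.isalpha():
            return "upper" if ch.isupper() else "lower"
    return None   # e.g. three digits: no case suggestion can be carried


# Tokens of example (H), one per code point.
tokens = ["-p", "-r", "-o", "0bt", "-p", "-r", "-o", "-s", "-t", "wm",
          "-n", "-e", "-m", "-l", "-u", "-v", "0fm", "0f0",
          "-e", "-s", "-k", "-y"]
upper = {0, 3, 9, 13, 14, 15}   # P, <ccaron>, <ecaron>, L, U, V
flags = [i in upper for i in range(len(tokens))]

print(annotate(tokens, flags))
# prints -P-r-o0BT-p-r-o-s-tWM-n-e-m-L-U-V0fm0f0-e-s-k-y
```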
Authors:
Edmon Chung
Neteka Inc.
2462 Yonge St. Toronto,
Ontario, Canada M4P 2H5
edmon@neteka.com
David Leung
Neteka Inc.
2462 Yonge St. Toronto,
Ontario, Canada M4P 2H5
david@neteka.com