Internet Draft                                 Edmon Chung, Neteka Inc.
<draft-chung-idn-ace37-00.txt>                 David Leung, Neteka Inc.
                                                              June 2001
    
    
    
          ACE Utilizing All 37 Alphanumeric Characters (ACE37) 
                                     
                                     
STATUS OF THIS MEMO 
 
   This document is an Internet-Draft and is in full conformance with 
   all provisions of Section 10 of RFC2026.  
    
   Internet-Drafts are working documents of the Internet Engineering 
   Task Force (IETF), its areas, and its working groups.  Note that 
   other groups may also distribute working documents as Internet-
   Drafts.  Internet-Drafts are draft documents valid for a maximum of 
   six months and may be updated, replaced, or obsoleted by other 
   documents at any time.  It is inappropriate to use Internet-Drafts 
   as reference material or to cite them other than as "work in 
   progress."  
    
   The reader is cautioned not to depend on the values that appear in 
   examples to be current or complete, since their purpose is primarily 
   educational.  Distribution of this memo is unlimited. 
    
   The list of current Internet-Drafts can be accessed at  
   http://www.ietf.org/ietf/1id-abstracts.txt 
   The list of Internet-Draft Shadow Directories can be accessed at 
   http://www.ietf.org/shadow.html.  
    
Abstract 
    
   ACE37 is a combination of DUDE-02, AMC-W/V and LACE.  ACE37 utilizes 
   the simple one-pass algorithm of DUDE, the character block 
   considerations of AMC-W/V and the Base-32 compression of LACE.  It 
   also fully utilizes the entire LDH set currently allowed in the DNS 
   (A-z, 0-9 and "-") within its character repertoire to optimize 
   performance and compression.  Even in the worst-case scenario for 
   ACE37, any name can hold 21 characters, including Chinese, Japanese 
   and Korean names.  Two Excel spreadsheets for ACE37 encoding and 
   decoding can be found at http://www.dnsii.org/ace37/ace37-encode.xls 
   and http://www.dnsii.org/ace37/ace37-decode.xls respectively. 
    
   While DUDE-02 provides a very efficient differential mechanism, its 
   compression is inefficient because it does not use all 5 bits of 
   each base-32 character for character information.  The AMC series 
   is highly efficient in compression but requires complicated mode 
   changes and is therefore inefficient to process.  LACE is moderate 
   in both respects: it requires a two-pass mechanism but utilizes 
   base-32 for good compression. 
    

Chung & Leung                                                  [Page 1] 
ACE37       ACE Utilizing All 37 Alphanumeric Characters      July 2001  
 
   ACE37 uses simple character block shifting to achieve the 
   compression efficiency of the AMC series, retains the one-pass and 
   one mode XOR differential mechanism used by DUDE while embracing the 
   base-32 compression used by LACE for efficient character bit 
   information. 
    
Terminology 
    
   The key words "MUST", "SHALL", "REQUIRED", "SHOULD", "RECOMMENDED", 
   and "MAY" in this document are to be interpreted as described in RFC 
   2119 [RFC2119]. 
    
   LDH: Letters, Digits and Hyphens: a string of characters that 
   consists only of hyphens ("-"), English letters (A-z) and digits 
   (0-9), and which need not be the result of an algorithm for 
   transcoding multilingual characters.  For example: 
   whatever-you-want.example 
    
   ACE - ASCII Compatible Encoding: a string of characters resulting 
   from a particular algorithm for transforming multilingual character 
   information into an alphanumeric form acceptable by the existing 
   DNS.  For example: bq--3bhc2zmh.tld.  In essence, ACE is a subset of 
   LDH. 
    
   Hexadecimal values are shown preceded by "0x".  For example, 0x60 
   is decimal 96.  Binary values are shown preceded by "0b"; for 
   example, "0b1000" is decimal 8.  As in the Unicode Standard 
   [UNICODE], Unicode code points are denoted by "U+" followed by four 
   to six hexadecimal digits, while a range of code points (or 
   hexadecimal numbers) is denoted by two hexadecimal numbers separated 
   by "..", with no prefixes. 
    
   Octets: sequences of 8 bits; Quintets: sequences of 5 bits; 
   Quartets: sequences of 4 bits; Duplets: sequences of 2 bits. 
    
   XOR: bitwise exclusive or.  Given 2 nonnegative integers A and B, A 
   XOR B is the nonnegative integer whose binary representation has a 
   1 wherever A and B disagree, and a 0 wherever they agree. 
    
Table Of Contents 
    
   1. Introduction....................................................3 
   2. Code Block Shifting.............................................4 
   3. Base-32 Characters..............................................5 
   4. Base-4 Characters...............................................6 
   5. LDH Considerations..............................................9 
   6. Encoding Procedure..............................................9 
   7. Decoding Procedure.............................................11 
   8. Examples.......................................................13 
   9. Summary & Comparisons..........................................15 
   10. Security Considerations.......................................16 
   11. References....................................................16 

1. Introduction 
    
   ACE37 takes into account the recommendations and findings of the ACE 
   design team to create a "super-ACE" that incorporates the key 
   advantages of the various considered ACEs without complicated mode 
   changes.  The encoding (Section 6) and decoding (Section 7) 
   processes are largely similar to, and as simple as, those of 
   DUDE-02.  The ACE37 encoding process can be summarized and compared 
   with that of DUDE-02 as follows: 
    
        ACE37 Encoding Procedure     |     DUDE Encoding Procedure 
    ---------------------------------+--------------------------------- 
    (1) let initial prev = 0x00      | (1) let initial prev = 0x60 
    (2) if n = LDH output "-n"       | (2) if n = hyphen output "-" 
    (3) code block shift to obtain   | (3) diff = prev XOR n 
          ACE37 shifted n (Section 2)| (4) prepend "0" to the last 
    (4) diff = prev XOR n            |      quartet and "1" to others 
    (5) output in appropriate base-4 | (5) output a base-32 character 
          and base-32 form           |      for each corresponding 
          (Sections 3&4)             |      quintet 
    (6) let prev = n                 | (6) let prev = n 
    
   Similarly, the decoding process can be described and compared: 
    
        ACE37 Decoding Procedure     |     DUDE Decoding Procedure 
    ---------------------------------+--------------------------------- 
    (1) let initial prev = 0x00      | (1) let initial prev = 0x60 
    (2) if char = hyphen discard "-" | (2) if char = hyphen consume 
          and output next char       |       and output 0x002D 
    (3) consume and convert char into| (3) consume and convert to 
          duplets and quintets       |       quintets until encountering 
          (according to Sections 3&4)|       a quintet with "0" as 
    (4) concatenate to form diff     |       first bit 
          (based on Sections 4.1&4.2)| (4) strip all first bits off 
    (5) let prev = prev XOR diff     | (5) concatenate to form diff 
    (6) reverse code block shifting  | (6) let prev = prev XOR diff 
    (7) output Unicode code point    | (7) output Unicode code point 
    
   The features of ACE37 include: 
    
   Unique & Reversible - the ACE37 encoding scheme yields a unique and 
   consistent result string for a given set of Unicode code points.  
   The encoded string could be decoded back to the original Unicode 
   code points without loss of character data. 
    
   Simple - ACE37 utilizes a one-pass system and the XOR differential 
   function to encode and decode.  Code block shifting is done by a 
   simple calculation instead of mapping or creation of arbitrary 
   reference points. Complex mode changes are not required. 
    
   Spacious - With the code block shifting coupled with a base-32 
   scheme, ACE37 can accommodate up to 21 unique Han characters 
   (including CJK) within the 63 octets allowed by the DNS.  Other 
   Latin based scripts can reach up to 31 characters. 
    
   Completeness - any sequence of Unicode code points 
   (U+0000..U+10FFFF) could be encoded.  Restrictions on allowed code 
   points are not discussed, but it is expected that Nameprep 
   [Nameprep] will be used prior to ACE37 encoding. 
    
   In essence, it captures the key criteria discussed by the 
   workgroup ACE design team - reversibility, simplicity and 
   compression capability.  Moreover, ACE37 utilizes a very simple code 
   block shifting (Section 2) mechanism to allow up to any 21 CJK 
   ideographs to be encoded within the 63-octet constraint. 
    
2. Code Block Shifting 
    
   While the DNS was not originally designed for multilingual 
   characters, Unicode was not designed with the DNS in mind and 
   therefore code points were apparently not allocated in an ACE-
   friendly way. 
    
   The AMC series [AMC-W & AMC-V] utilizes a number of reference points 
   to achieve better compression efficiency by anticipating and 
   minimizing delta between characters.  For ACE37, a much simpler 
   rendering is used.  More specifically, the entire character block 
   U+3000..U+9FFF for CJK ideographs is shifted down by 0x3000.  That 
   is U+3000 will become 0x0000, U+4000 becomes 0x1000, and so on.  To 
   compensate for the downwards shift, the general script and symbol 
   characters in U+0000..U+2FFF will be shifted upwards by 0x7000.  
   Therefore, U+0100 will become 0x7100, U+2000 becomes 0x9000, and so 
   on.  All other code points (U+A000..U+10FFFF) are unchanged. 
    
      Original Unicode Allocation   |     ACE37 Code Block Shifted 
    --------------------------------|------------------------------- 
      General Scripts  U+0000 -+    |     +- 0x0000 CJK Misc 
                       U+1000  |    |     |  0x1000 CJK Ideographs 
                               +-   |  -> |  0x2000 
      Symbols          U+2000 -+ \  | /   |  0x3000 
                                  \ |/    |  0x4000 
      CJK Misc         U+3000 -+   \/     |  0x5000 
      CJK Ideographs   U+4000  |   /\     +- 0x6000 
                       U+5000  |  / |\ 
                       U+6000  +--  | \   +- 0x7000 General Scripts 
                       U+7000  |    |  -> |  0x8000 
                       U+8000  |    |     |  
                       U+9000 -+    |     +- 0x9000 Symbols 
                                    | 
      Hangul           U+A000 -+    |     +- 0xA000 Hangul 
                       U+B000  |    |     |  0xB000 
                       U+C000  +----|---> |  0xC000 
                       U+D000  |    |     |  0xD000 
        :                 :   -+    |     +-    :      : 
                                    | 
    

   This shifting effectively moves the entire Han library to within 
   0x6FFF and therefore could be represented in 15-bits or exactly 3 
   base-32 characters.  (details on base-32 characters in Section 3) 
    
   For example, the Chinese character for <change> with the original 
   Unicode code point at U+8F49, will be shifted to 0x5F49 and can be 
   represented in 3 quintets, and in turn with 3 base-32 characters: 
    
                    Character: <change> 
           Unicode Code Point: U+8F49 
                ACE37 Shifted: 0x5F49 
       Corresponding Quartets: 0101 1111 0100 1001 
           Resulting Quintets: 10111 11010 01001 
                      Base-32: nq9   (further discussed in Section 3) 
    
   This in turn means that any Chinese character could be represented 
   with 3 base-32 characters making the total possible characters 
   within a label, even without further compression introduced by the 
   XOR differential process (Section 6), to be at least 21.  The ACE37 
   code block shifting process could be described as follows: 
    
      for each input code point = n 
      if n <= 0x9FFF 
         n = n - 0x3000      /*downwards shifting*/ 
         if n < 0 
            n = 0xA000 + n   /*compensation for U+0000..U+2FFF*/ 
    
   The character block shifting introduced here is extremely simple and 
   utilizes simple calculation that requires no mapping function.  At 
   the same time, it achieves the goal in adjusting the Unicode 
   allocation so that it becomes more ACE friendly. 
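    
   The shifting and its reversal can be sketched in Python as follows. 
   This is a minimal illustration rather than a normative 
   implementation; the function names ace37_shift and ace37_unshift 
   are hypothetical.  The shift is written as two range tests, which 
   is assumed to be equivalent to subtracting 0x3000 and wrapping 
   negative results back into 0x7000..0x9FFF. 

```python
def ace37_shift(n: int) -> int:
    """Apply the Section 2 code block shift to one Unicode code point."""
    if not 0 <= n <= 0x10FFFF:
        raise ValueError("not a Unicode code point")
    if n <= 0x2FFF:       # general scripts and symbols move up by 0x7000
        return n + 0x7000
    if n <= 0x9FFF:       # U+3000..U+9FFF (CJK) moves down by 0x3000
        return n - 0x3000
    return n              # U+A000..U+10FFFF is unchanged


def ace37_unshift(m: int) -> int:
    """Reverse the shift, as done at the end of decoding (Section 7)."""
    if m <= 0x6FFF:
        return m + 0x3000
    if m <= 0x9FFF:
        return m - 0x7000
    return m


print(hex(ace37_shift(0x8F49)))   # 0x5f49
print(hex(ace37_shift(0x0100)))   # 0x7100
```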
    
3. Base-32 Characters 
    
   Base-32 characters are used in LACE for compression, while DUDE-02 
   and the AMC series utilize them only for quartet flagging, to 
   indicate the last quartet of each encoded code point.  ACE37 
   utilizes base-32 characters for compression while base-4 
   characters, which will be introduced in Section 4, determine the 
   compressed code point brackets. 
    
   The following table shows the 32 base-32 characters and their 
   corresponding quintets: 
    
   Base-32 Character =to= Corresponding Quintet 
       0 = 00000       8 = 01000       g = 10000       o = 11000 
       1 = 00001       9 = 01001       h = 10001       p = 11001 
       2 = 00010       a = 01010       i = 10010       q = 11010 
       3 = 00011       b = 01011       j = 10011       r = 11011 
       4 = 00100       c = 01100       k = 10100       s = 11100 
       5 = 00101       d = 01101       l = 10101       t = 11101 
       6 = 00110       e = 01110       m = 10110       u = 11110 
       7 = 00111       f = 01111       n = 10111       v = 11111 
    
   With this layout of base-32 characters, it is also possible to 
   implement a computation based base-32 conversion instead of having 
   to resort to mapping and lookup tables: 
    
      For each quintet = q 
          if q <= 0x0F 
             then hex dump q to form base-32 character 
          if 0x10 <= q <= 0x1F 
             then q = q - 0x10 
                and char(q + 0x67) to form base-32 character 
    
   Note that 0x67 is the code value for the letter "g".  Therefore, for 
   example if the quintet is 0b10001 its base-32 character can be 
   obtained by: 
    
      0x10 <= q=0b10001=0x11 <= 0x1F 
      therefore q = q - 0x10 = 0x11 - 0x10 = 0x01 
            and base-32 character = char(0x01 + 0x67) 
                char(0x68) = "h" 
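    
   The computation above can be sketched in Python as follows.  The 
   function names quintet_to_char and to_base32 are hypothetical; 
   to_base32 additionally assumes big-endian ordering of quintets, as 
   stated for the encoding procedure in Section 6. 

```python
def quintet_to_char(q: int) -> str:
    """Map one quintet (0..31) to its base-32 character without a table."""
    if not 0 <= q <= 0x1F:
        raise ValueError("a quintet must be in 0..31")
    if q <= 0x0F:
        return format(q, "x")          # "hex dump": 0-9, a-f
    return chr(q - 0x10 + 0x67)        # 0x67 is "g", so 16..31 -> g..v


def to_base32(value: int, count: int) -> str:
    """Write the low 5*count bits of value as base-32 characters,
    most significant quintet first."""
    return "".join(
        quintet_to_char((value >> (5 * i)) & 0x1F)
        for i in reversed(range(count))
    )


print(quintet_to_char(0b10001))   # h
print(to_base32(0x5F49, 3))       # nq9
```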
    
4. Base-4 Characters 
    
   ACE37 goes beyond the 32 base-32 characters to include the 
   remaining 4 letters of the alphabet, {w,x,y,z}.  These base-4 
   characters enable ACE37 to better utilize the existing "resources" 
   (the allowed characters) to represent IDN character information, 
   therefore making its encoding more efficient. 
    
   The set of base-4 characters is {w,x,y,z}; they are used to 
   represent the following duplets (duplets are groups containing 2 
   bits): 
    
   Base-4 Character =to= Corresponding Duplet 
                  w   =  00 
                  x   =  01 
                  y   =  10 
                  z   =  11 
    
4.1 Base-4 Indicators 
    
   Base-4 characters, while carrying character information, also 
   double as indicators for code point brackets.  In DUDE-02, an extra 
   bit is prepended to each quartet; the last quartet of each encoded 
   code point is prepended with "0", marking the end of the code 
   point.  In ACE37, base-4 characters determine the length (number 
   of ACE37 characters) of the encoded code point.  To be more 
   precise, the encoded bits are in fact the "diff" and not the code 
   point itself (diff carries the same meaning as in DUDE-02 and is 
   further discussed in Sections 6 & 7) 
    


   The following table explains how base-4 characters are combined with 
   base-32 characters to form a representation of a diff (key: b4=base-
   4, b32=base-32): 
    
             diff value         |bits|       ACE37 Form 
       -------------------------|----|---------------------------- 
                 diff<=0x7F     |  7 | <b4><b32> 
           0x80<=diff<=0x7FFF   | 15 | <b32><b32><b32> 
         0x8000<=diff<=0x1FFFF  | 17 | w<b4><b32><b32><b32> 
        0x20000<=diff<=0xFFFFF  | 20 | ww<b32><b32><b32><b32> 
       0x100000<=diff<=0x10FFFF | 22 | <b4>w<b32><b32><b32><b32> 
    
   Note that the "bits" column represents the maximum number of 
   significant bits for the given diff value.  For example when 
   diff<=0x7F, the maximum value is 0b1111111, therefore the number of 
   significant bits is 7. 
    
   Note also that to encode a 17-bit diff, the letter "w" is used as an 
   indicator to distinguish the sequence from the 7-bit form, where a 
   base-32 character is expected to follow a base-4 character.  Since 
   "w" represents "00", which contributes no value, it will never 
   appear as the base-4 character of a 17-bit form (a "00" there would 
   mean that there are only 15 significant bits, so the 15-bit form 
   should be used instead).  The same reasoning applies to the 20-bit 
   form.  The "w" in the 22-bit form is an arbitrary indicator and 
   MUST be discarded during decoding. 
    
   By analyzing the ACE37 form, an encoded string could be successfully 
   returned to its original form.  There is no overlap and the form can 
   be determined precisely.  The following 5 rules dictate the 5 
   different ACE37 forms: 
    
   (1) Encode: if diff<=0x7F 
       Decode: if first character is <b4> AND next character NOT <b4> 
               Then it MUST be in 7-bit form: <b4><b32> 
    
   (2) Encode: if 0x80<=diff<=0x7FFF 
       Decode: if first character is <b32> 
               Then it MUST be a 15-bit form: <b32><b32><b32> 
    
   (3) Encode: if 0x8000<=diff<=0x1FFFF 
       Decode: if first character is "w" AND next character is <b4> 
                  AND NOT "w" 
               Then it MUST be in 17-bit form: w<b4><b32><b32><b32> 
    
   (4) Encode: if 0x20000<=diff<=0xFFFFF 
       Decode: if first character is "w" AND next character is "w" 
               Then it MUST be in 20-bit form: ww<b32><b32><b32><b32> 
    
   (5) Encode: if 0x100000<=diff<=0x10FFFF 
       Decode: if first character is <b4> AND NOT "w" 
                  AND next character is "w" 
               Then it MUST be 22-bit form: <b4>w<b32><b32><b32><b32> 
    
   Note that the ACE37 scheme can effectively encode a diff of up to 22 
   significant bits or 0x3FFFFF.  The Unicode code points are expected 
   to range only between 0x0000..0x10FFFF, therefore ACE37 will be able 
   to handle any Unicode code point. 
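    
   The decode side of rules (1) to (5) can be sketched in Python as 
   follows.  The function name detect_form is hypothetical; it looks 
   only at the first two characters and returns the number of diff 
   bits and the number of ACE37 characters of the form.  It covers 
   code points other than the first and assumes lowercase input. 

```python
B32 = set("0123456789abcdefghijklmnopqrstuv")
XYZ = set("xyz")


def detect_form(s: str):
    """Return (bits, length) of the ACE37 form at the start of s."""
    if len(s) < 2:
        raise ValueError("every ACE37 form has at least 2 characters")
    c0, c1 = s[0], s[1]
    if c0 in B32:
        return 15, 3                  # rule (2): <b32><b32><b32>
    if (c0 == "w" or c0 in XYZ) and c1 in B32:
        return 7, 2                   # rule (1): <b4><b32>
    if c0 == "w" and c1 in XYZ:
        return 17, 5                  # rule (3): w<b4><b32><b32><b32>
    if c0 == "w" and c1 == "w":
        return 20, 6                  # rule (4): ww<b32><b32><b32><b32>
    if c0 in XYZ and c1 == "w":
        return 22, 6                  # rule (5): <b4>w<b32><b32><b32><b32>
    raise ValueError("not a valid ACE37 form")


print(detect_form("wx000"))   # (17, 5)
```

   Because the five tests are mutually exclusive, their order does not 
   matter; two consecutive characters from {x,y,z} match no rule and 
   are rejected. 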
    
   Additionally, base-4 characters (and sometimes base-32 characters) 
   could be used for mixed-case annotation.  This optional mixed-case 
   annotation mechanism is discussed in Appendix B. 
    
4.2 First Code Point Considerations 
    
   There are additional considerations for the first code point that is 
   encoded or decoded to ensure that if the first code point is within 
   the first Unicode plane (U+0000..U+FFFF), it will not occupy more 
   than 4 ACE37 characters. 
    
   This special consideration affects only Rules (1), (3) and (4) 
   explained in Section 4.1.  Rule (1) is discarded for the first code 
   point, therefore any diff up to 0x7FFF will be in the form 
   <b32><b32><b32>.  The form for Rule (3) becomes simply 
   <b4><b32><b32><b32> without the "w" indicator.  Similarly, the form 
   for Rule (4) becomes w<b32><b32><b32><b32> with one less "w". 
    
   The first code point considerations can be summarized in the 
   following 4 rules: 
    
   (a) Encode: if diff<=0x7FFF 
       Decode: if first character is <b32> 
               Then it MUST be in 15-bit form: <b32><b32><b32> 
    
   (b) Encode: if 0x8000<=diff<=0x1FFFF 
       Decode: if first character is <b4> AND NOT "w" 
               Then it MUST be in 17-bit form: <b4><b32><b32><b32> 
    
   (c) Encode: if 0x20000<=diff<=0xFFFFF 
       Decode: if first character is "w" 
               Then it MUST be in 20-bit form: w<b32><b32><b32><b32> 
    
   (d) Encode & Decode: same as Rule (5) in Section 4.1 
    
   Besides the special considerations for base-4 character usage, the 
   setting of prev is also treated specially for the first code point. 
   As laid out in Section 6, prev is evaluated to detect the first 
   code point.  If prev = 0x00, the current code point is assumed to 
   be the first, as 0x00 SHOULD NOT be a permitted character for 
   input.  A special consideration is needed when the first code point 
   is an LDH.  Normally, when n = LDH is encountered (Section 5), it 
   is output as "-n" and prev is not changed.  However, if the first 
   code point is an LDH, prev is updated to lowercase(n) after "-n" is 
   output.  This ensures that only the first code point will see 
   prev = 0x00. 
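    
   The encode side of Sections 4.1 and 4.2 can be sketched in Python 
   as a single function that takes the diff and a flag for the first 
   code point.  The name encode_diff and its parameters are 
   hypothetical.  Two assumptions are made: rule (b) is applied to the 
   whole range 0x8000..0x1FFFF, and the 22-bit form is accepted up to 
   0x3FFFFF, since the XOR of two code points may exceed 0x10FFFF. 

```python
B32 = "0123456789abcdefghijklmnopqrstuv"
B4 = "wxyz"


def quintets(value: int, count: int) -> str:
    """Low 5*count bits of value as base-32, most significant first."""
    return "".join(B32[(value >> (5 * i)) & 0x1F]
                   for i in reversed(range(count)))


def encode_diff(diff: int, first: bool) -> str:
    """Choose the ACE37 form for one diff (Sections 4.1 and 4.2)."""
    if diff < 0:
        raise ValueError("diff must be nonnegative")
    if diff <= 0x7FFF:
        if first or diff > 0x7F:
            return quintets(diff, 3)                      # 15-bit form
        return B4[diff >> 5] + quintets(diff, 1)          # 7-bit form
    if diff <= 0x1FFFF:                                   # 17-bit form
        return ("" if first else "w") + B4[diff >> 15] + quintets(diff, 3)
    if diff <= 0xFFFFF:                                   # 20-bit form
        return ("w" if first else "ww") + quintets(diff, 4)
    if diff <= 0x3FFFFF:                                  # 22-bit form
        return B4[diff >> 20] + "w" + quintets(diff, 4)
    raise ValueError("diff does not fit in 22 bits")


print(encode_diff(0x3771, first=True))    # drh
print(encode_diff(0x49, first=False))     # y9
```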
    
5. LDH Considerations 
    
   Finally, the 37th character of the entire LDH repertoire, the hyphen 
   will be used to indicate LDH exceptions.  Extending the hyphen 
   consideration of DUDE-02, ACE37 gives special consideration for the 
   entire LDH repertoire.  All LDH characters will be encoded "as is" 
   with the addition of a leading hyphen.  For example, the character 
   "a" will be encoded within ACE37 as "-a".  The hyphen character "-" 
   will be encoded as "--". 
    
   This ensures that each LDH character takes up only 2 character 
   spaces within an ACE37 encoded string and also allows 
   administrators to see the actual characters, similar to the AMC 
   series.  Unlike the AMC series however, the hyphen does not 
   indicate an ongoing mode change; it applies only to the following 
   character.  This retains the simplicity of the DUDE-02 single-mode, 
   single-pass philosophy. 
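    
   As a small illustration, the LDH escape can be sketched in Python 
   as follows.  The names LDH and escape_ldh are hypothetical. 

```python
import string

# The 37-character LDH repertoire, with letters in both cases.
LDH = set(string.ascii_letters + string.digits + "-")


def escape_ldh(ch: str) -> str:
    """Encode one LDH character "as is" behind a leading hyphen."""
    if ch not in LDH:
        raise ValueError("not an LDH character")
    return "-" + ch


print(escape_ldh("a"))   # -a
print(escape_ldh("-"))   # --
```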
    
6. Encoding Procedure 
    
   Similar to DUDE, all ordering of bits and quartets is big-endian. 
   The following describes the encoding procedure: 
    
   Set initial value for prev = 0x00 
   for each input code point = n 
      if n is an LDH {A-z, 0-9, -} 
         output "-n"                   (Section 5: LDH Considerations) 
         if prev = 0x00                (Section 4.2: First Code Point) 
            let prev = lowercase(n) 
         continue with the next n 
      else perform code block shifting (Section 2: Code Block Shifting) 
      let diff = prev XOR n            (n after code block shifting) 
      if diff<=0x7F --------------------------------------+ 
         and if this is the first code point (Section 4.2)| 
         then output 15-bit form: <b32><b32><b32>         | 
         else, output 7-bit form: <b4><b32>               | 
      if 0x80<=diff<=0x7FFF                               +-(Section 4: 
         output 15-bit form: <b32><b32><b32>              |   Base-4 
      if 0x8000<=diff<=0x1FFFF                            | Characters) 
         and if this is the first code point (Section 4.2)| 
         then output 17-bit form: <b4><b32><b32><b32>     | 
         else, output 17-bit form: w<b4><b32><b32><b32>   | 
      if 0x20000<=diff<=0xFFFFF                           | 
         and if this is the first code point (Section 4.2)| 
         then output 20-bit form: w<b32><b32><b32><b32>   | 
         else, output 20-bit form: ww<b32><b32><b32><b32> | 
      if 0x100000<=diff<=0x10FFFF                         | 
         output 22-bit form: <b4>w<b32><b32><b32><b32> ---+ 
      let prev = n 
   end and obtain next n and return to: "for each input code point = n" 
    
   The following is a more comprehensive pseudo code: 
    
   let prev = 0x00 
   for each input integer n (in order) do begin 
      if n = "-" or "0..9" or "A..Z" or "a..z" 
      then output "hyphen"+"char(n)" 
         if prev = 0x00 
            let prev = lowercase(n) 
    
      else begin 
         if n = 0x00 
            then error and abort 
         if n <= 0x9FFF 
            n = n - 0x3000 
            if n < 0 
            then n = 0xA000 + n 
    
         let diff = prev XOR n 
    
         if diff <= 0x7F 
            if prev = 0x00 
            then output all 15 bits with 3 base-32 characters 
            else, output first 2 bits with a base-4 character {wxyz} 
               and remaining 5 bits with 1 base-32 character 
    
         if 0x80 <= diff <= 0x7FFF 
         then output all 15 bits with 3 base-32 characters 
    
         if 0x8000 <= diff <= 0x1FFFF 
            if prev = 0x00 
            then output first 2 bits with a base-4 {xyz} (except w) 
               and output remaining 15 bits with base-32 
            else, output "w" 
               and output first 2 bits with a base-4 {xyz} (except w) 
               and output remaining 15 bits with base-32 
    
         if 0x20000 <= diff <= 0xFFFFF 
            if prev = 0x00 
            then output "w" 
            else, output "ww" 
            and output all 20 bits with base-32 characters 
    
         if 0x100000 <= diff <= 0x10FFFF 
         then output first 2 bits with a base-4 {xyz} (except w) 
            and output "w" 
            and output remaining 20 bits with base-32 
    
         let prev = n 
      end 
   end 
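    
   The whole encoding procedure can be sketched in Python as follows. 
   This is an illustrative sketch under stated assumptions, not a 
   reference implementation: all function names are hypothetical, the 
   first-code-point forms follow rules (a) to (d) of Section 4.2, the 
   22-bit form is accepted up to 0x3FFFFF, and for a leading LDH prev 
   is set to the unshifted lowercase(n) as written in this section. 
   No ACE prefix is added and Nameprep is not applied. 

```python
import string

B32 = "0123456789abcdefghijklmnopqrstuv"
B4 = "wxyz"
LDH = set(string.ascii_letters + string.digits + "-")


def shift(n):
    """Section 2 code block shifting."""
    if n <= 0x2FFF:
        return n + 0x7000
    if n <= 0x9FFF:
        return n - 0x3000
    return n


def quintets(value, count):
    return "".join(B32[(value >> (5 * i)) & 0x1F]
                   for i in reversed(range(count)))


def encode_diff(diff, first):
    """Sections 4.1 and 4.2: pick the form for one diff."""
    if diff <= 0x7FFF:
        if first or diff > 0x7F:
            return quintets(diff, 3)
        return B4[diff >> 5] + quintets(diff, 1)
    if diff <= 0x1FFFF:
        return ("" if first else "w") + B4[diff >> 15] + quintets(diff, 3)
    if diff <= 0xFFFFF:
        return ("w" if first else "ww") + quintets(diff, 4)
    if diff <= 0x3FFFFF:
        return B4[diff >> 20] + "w" + quintets(diff, 4)
    raise ValueError("diff does not fit in 22 bits")


def ace37_encode(text):
    out, prev = [], 0x00
    for ch in text:
        if ch in LDH:                       # Section 5: output "-n"
            out.append("-" + ch)
            if prev == 0x00:                # Section 4.2: leading LDH
                prev = ord(ch.lower())      # assumption: unshifted value
            continue
        if ord(ch) == 0x00:
            raise ValueError("U+0000 is not permitted")
        n = shift(ord(ch))
        out.append(encode_diff(prev ^ n, first=(prev == 0x00)))
        prev = n
    return "".join(out)


example_b = [0x6771, 0x4EAC, 0x90FD, 0x60C5, 0x5831, 0x30B5, 0x30FC,
             0x30D3, 0x30B9, 0x7523, 0x696D, 0x5065, 0x5EB7, 0x4FDD,
             0x967A, 0x7D44, 0x5408]
print(ace37_encode("".join(map(chr, example_b))))
# drhaetvihk1o67ka44y9xfzahcqv2e6883micbaud7apuqac
```

   The printed string is the ACE37 result given for Example (B) in 
   Section 8. 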
    
   Nameprep [NAMEPREP] is not discussed in this document, but it is 
   expected that it will be implemented for IDN.  Regardless of the 
   code point presented, an encoder MUST NOT produce an incorrect 
   output.  The encoder must fail if it encounters a negative input 
   value. 
    
   The initial value used is 0x00 so that all domains beginning with a 
   CJK ideograph or within row 0 (U+0000..U+0FFF) will be shorter.  
   Note that after the code block shifting (Section 2), the entire Han 
   library is within 0x0000..0x6FFF, while row 0 is fitted to 
   0x7000..0x7FFF.  Therefore by using an initial value of 0x00 the 
   diff for all Han and row 0 characters will be at most 0x7FFF.  The 
   initial value is also used as a check point for the first code point 
   considerations (Section 4.2). 
    
   Additionally, an optional mixed-case annotation mechanism is 
   discussed in Appendix B. 
    
7. Decoding Procedure 
    
   A thorough description of the decoding rules, except for the final 
   reversal of the code block shifting has been presented in Sections 
   4.1 and 4.2.  The following description is a brief representation of 
   the decoding procedure: 
    
   let prev = 0x00 
   while the input string is not exhausted 
      if present character = hyphen               (Section 5: LDH 
         discard and output next character         Considerations) 
      else, depending on the presented form       (Section 4) 
         convert into duplets and quintets        (Section 4 & 3) 
         and concatenate to form diff 
         let prev = prev XOR diff 
         reverse code block shifting:             (Section 2) 
            if prev<=0x9FFF 
               and if prev<=0x6FFF 
                      output character = prev + 0x3000 
               else, output character = prev - 0x7000 
            else output character = prev 
   End 
    
   The following is a more comprehensive pseudo code for the decoding 
   procedure: 
    
   let prev = 0x00 
   while the input string is not exhausted do begin 
      if present character = hyphen    /*Section 5:LDH Considerations*/ 
      then consume and discard hyphen 
         and obtain the next character 
         and output character 
         if prev = 0x00                /*Section 4.2:First Code Point*/ 
            let prev = code block shifted lowercase output character 
    
      else, 
         if present character = Base-32 characters (0..v) 
            consume present character and next 2 characters 
            and convert them to quintets according to Base-32 
            concatenate the resulting quintets to form diff 
            /*15 bit form, 0x80<=diff<=0x7FFF*/ 
    
         if present character = Base-4 characters {xyz} and NOT w 
            consume present character 
               and convert it to a duplet according to Base-4 
             
            if next character = w 
               then consume and discard w and obtain next 4 characters 
               consume and convert characters to 
                  quintets according to Base-32 
               concatenate duplet with the 4 quintets to form diff 
               /*22 bit form, 0x100000<=diff<=0x10FFFF*/ 
    
            else, if prev = 0x00 
               obtain and consume next 3 characters 
               and convert them to quintets according to Base-32 
               concatenate duplet with the 3 quintets to form diff 
               /*first code point: 17 bit form, 0x8000<=diff<=0x1FFFF*/ 
    
            else, if next character = Base-32 character (0..v) 
               then consume and convert to quintet according to Base-32 
               concatenate duplet with the quintet to form diff 
               /*7 bit form, diff<=0x7F*/ 
    
            else, if next character = Base-4 characters {xyz} and NOT w 
               then fail and indicate error 
    
         if present character = w 
            discard "w" and obtain next character 
    
            if prev = 0x00 
               obtain and consume next 4 characters 
               and convert characters to quintets based on Base-32 
               concatenate the 4 quintets to form diff 
               /*first code point: 20 bit form,*/ 
               /*0x20000<=diff<=0xFFFFF        */ 
    
            else, if next character = Base-4 characters {xyz} and NOT w 
               then consume and convert to duplet according to Base-4 
               and obtain and consume next 3 characters 
               and convert to quintets according to Base-32 
               concatenate duplet with the 3 quintets to form diff 
               /*17 bit form, 0x8000<=diff<=0x1FFFF*/ 
    
            else, if next character = w 
               then consume and discard w 
               and obtain and consume next 4 characters 
                  and convert to quintets according to Base-32 
               concatenate the 4 quintets to form diff 
               /*20 bit form, 0x20000<=diff<=0xFFFFF*/ 
    
            else, if next character = Base-32 character (0..v) 
               then consume and convert to quintet according to Base-32 
               set quintet to diff 
               /*7 bit form, diff<=0x7F*/ 
    
         fail upon encountering a non-ACE37 character 
            or end-of-input 
    
         let prev = prev XOR diff 
    
         if prev <= 0x9FFF                /*reversal of the code    */ 
            and if prev <= 0x6FFF         /*block shifting described*/ 
            output = prev + 0x3000        /*in Section 2            */ 
            else, output = prev - 0x7000 
         else, output = prev    
      end 
   end 
   encode the output sequence and compare it to the input string 
   fail if they do not match (case insensitively) 
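
   The final steps of the decoder above (XOR accumulation followed by 
   reversal of the code block shift) can be sketched in Python as 
   follows.  This is only an illustration: the function names are 
   hypothetical, the initial value prev = 0 is an assumption, the 
   forward shift() is simply derived as the inverse of the reversal 
   shown in the pseudocode, and the handling of LDH characters is 
   left out. 

```python
def unshift(prev: int) -> int:
    """Reverse the code block shift, mirroring the decoder pseudocode."""
    if prev <= 0x9FFF:
        if prev <= 0x6FFF:
            return prev + 0x3000
        return prev - 0x7000
    return prev


def shift(cp: int) -> int:
    """Assumed forward shift: the inverse of unshift()."""
    if 0x3000 <= cp <= 0x9FFF:
        return cp - 0x3000
    if cp <= 0x2FFF:
        return cp + 0x7000
    return cp


def apply_diffs(diffs, prev=0):
    """XOR each diff into prev (kept in shifted form) and unshift it."""
    out = []
    for diff in diffs:
        prev ^= diff
        out.append(unshift(prev))
    return out


# The first two diffs recovered from example (C) in Section 8.
print([hex(cp) for cp in apply_diffs([0xC138, 0x6DFC])])
```

   Note that prev itself stays in shifted form between iterations; 
   only the emitted code point is shifted back. 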
    
8. Examples 
    
   ACE37 is likely to be implemented with an ACE prefix in the form 
   "xx--".  The actual prefix to be used is not discussed in this 
   document.  The following examples are taken from the mailing list as 
   well as from DUDE-02 and the AMC series.  Each resulting ACE37 string 
   is compared with the DUDE-02 encoding of the same input: 
    
   (A) JPNIC (the registry of .jp domain) 
    
   Unicode: U+793E U+56E3 U+6CD5 U+4EBA U+65E5 U+672C U+30CD U+30C3  
            U+30C8 U+30EF U+30FC U+30AF U+30A4 U+30F3 U+30D5 U+30A9  
            U+30E1 U+30FC U+30B7 U+30E7 U+30F3 U+30BB U+30F3 U+30BF  
            U+30FC 
     ACE37: i9urut6hm8jfaqv0m9dv1wewbx7wjyjwbynx6zsy8wtybygwky8y8ycy3 
            (57 char) 
   DUDE-02: (error: result string exceeds 59 characters*) 
            Note: 59 characters is the maximum allowable when the ACE  
            prefix "xx--" is included 
    
    
   (B) A health-insurance organization in Tokyo 
    
   Unicode: U+6771 U+4EAC U+90FD U+60C5 U+5831 U+30B5 U+30FC U+30D3  
            U+30B9 U+7523 U+696D U+5065 U+5EB7 U+4FDD U+967A U+7D44  
            U+5408 
     ACE37: drhaetvihk1o67ka44y9xfzahcqv2e6883micbaud7apuqac (48 char) 
   DUDE-02: (error: result string exceeds 59 characters) 
    
    
    
    
   (C) 6 hangul syllables 
    
   Unicode: U+C138 U+ACC4 U+C758 U+BAA8 U+B4E0 U+C0AC 
     ACE37: xg9orfsqssvfg3i8t2c (19 char) 
   DUDE-02: 6txiy79ny53nz79a8wizwwn (23 char) 
    
    
   (D) maji<de>koi<suru>5<byou><mae>  (Latin, hiragana, kanji) 
    
   Unicode: U+006D U+0061 U+006A U+0069 U+3067 U+006B U+006F U+0069  
            U+3059 U+308B U+0035 U+79D2 U+524D 
     ACE37: -m-a-j-is0a-k-o-ixu06i-5iapqsv (30 char) 
   DUDE-02: pnmdvssqvssnegvsva7cvs5qz38hu53r (32 char) 
    
    
   (E) <pafii>de<runba>  (Latin, katakana) 
    
   Unicode: U+30D1 U+30D5 U+30A3 U+30FC U+0064 U+0065 U+30EB U+30F3  
            U+30D0 
     ACE37: 06hw4zmyv-d-ewnwox3 (19 char) 
   DUDE-02: vs5bezgxrvs3ibvs2qtiud (22 char) 
    
    
   (F) <sono><supiido><de>  (hiragana, katakana) 
    
   Unicode: U+305D U+306E U+30B9 U+30D4 U+30FC U+30C9 U+3067 
     ACE37: 02txj06nzdx8xl05e (17 char) 
   DUDE-02: vsvpvd7hypuivf4q (16 char) 
    
    
   (G) 2 Arbitrary Plane Two Code Points 
    
   Unicode: U+261AF U+261BF 
     ACE37: w4odfwg (7 char) 
   DUDE-02: uyt6rta (7 char) 
    
    
   (H) Czech: Pro<ccaron>prost<ecaron>nemluv<iacute><ccaron>esky 
    
   Unicode: U+0050 U+0072 U+006F U+010D U+0070 U+0072 U+006F U+0073  
            U+0074 U+011B U+006E U+0065 U+006D U+006C U+0075 U+0076  
            U+00ED U+010D U+0065 U+0073 U+006B U+0079 
     ACE37: -p-r-o0bt-p-r-o-s-twm-n-e-m-l-u-v0fm0f0-e-s-k-y (47 char) 
   DUDE-02: vauctptyctzpctptnhtyrtzfmibtjd3mt8atyitgtitc (44 char) 
    
    
   (I) Chinese 
    
   Unicode: U+4ED6 U+5011 U+7232 U+4EC0 U+9EBD U+4E0D U+8AAA U+4E2D 
            U+6587 
     ACE37: 7mmfm7oh3n7is3ts5gh57h47ata (27 char) 
   DUDE-02: w85gt86huuudv69c7szp7s5a6w4h6w2hu54k (36 char) 
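
   The 59-character ceiling noted under example (A) follows from the 
   63-octet limit on a DNS label less a 4-character ACE prefix.  A 
   minimal Python sketch of that check is given below; the prefix 
   "xx--" is only the placeholder form used in this section, and the 
   function name is hypothetical. 

```python
MAX_LABEL = 63    # DNS label limit in octets
PREFIX = "xx--"   # placeholder; the actual prefix is not specified here


def fits_in_label(encoded: str, prefix: str = PREFIX) -> bool:
    """Return True if prefix + encoded fits within one DNS label."""
    return len(prefix) + len(encoded) <= MAX_LABEL


# ACE37 string from example (A), 57 characters long.
example_a = "i9urut6hm8jfaqv0m9dv1wewbx7wjyjwbynx6zsy8wtybygwky8y8ycy3"
print(len(example_a), fits_in_label(example_a))
```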
    
9. Summary & Comparisons 
    
   In summary, ACE37 is based on the DUDE-02 process with an improved 
   compression scheme for code point sequences that are unlikely to 
   cluster closely together, such as CJK ideographs. 
    
   The design team has indicated that 30 characters should generally 
   be sufficient.  The Asian community has raised considerable concern 
   that 14-15 characters is too limiting, while the Latin community has 
   given little indication that length is a real concern.  ACE37 
   therefore sets as its objective raising the number of characters 
   possible in the worst-case scenario to about 20. 
    
   ACE37 is a very simple variation on the primary ACEs identified by 
   the design team.  It achieves markedly better compression for CJK 
   characters (27 characters against 36 for DUDE-02 in example (I) of 
   Section 8) while maintaining the simplicity of DUDE. 
    
   Key Improvements of ACE37 over DUDE-02 
   - much more room for Han characters.  The worst-case scenario 
     improves to 21 Han ideographs through code block shifting and 
     full use of the base-32 characters 
   - no need to arbitrarily prepend flagging bits to identify code 
     point brackets.  Instead, base-4 characters and diff forms are used 
   - base-32 and base-4 characters can be easily computed instead of 
     mapped using lookup tables 
    
   Key Improvements of ACE37 over the AMC series 
   - a simpler process, utilizing the one-pass differential 
     mechanism from DUDE-02 
   - a much simpler code block shifting process achieves a goal 
     similar to that of the complex multiple reference point system 
     used by the AMC series 
   - base-32 and base-4 characters can be easily computed instead of 
     mapped using lookup tables 
    
   Key Improvements of ACE37 over LACE 
   - a simpler process, utilizing the one-pass differential 
     mechanism from DUDE-02 
   - much more room for Han characters.  The worst-case scenario 
     improves to 21 Han ideographs through code block shifting and 
     full use of the base-32 characters 
   - base-32 and base-4 characters can be easily computed instead of 
     mapped using lookup tables 
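
   The point repeated in the lists above, that base-32 and base-4 
   characters can be computed rather than looked up, can be 
   illustrated with the following Python sketch.  The digit-then-
   letter ordering of the base-32 set (0..9, a..v) and the mapping 
   w, x, y, z to 0, 1, 2, 3 are assumptions here; they agree with 
   examples (C), (D) and (G) in Section 8.  All function names are 
   hypothetical. 

```python
def quintet(ch: str) -> int:
    """Compute the 5-bit value of one base-32 character (0..9, a..v)."""
    ch = ch.lower()
    if "0" <= ch <= "9":
        return ord(ch) - ord("0")
    if "a" <= ch <= "v":
        return ord(ch) - ord("a") + 10
    raise ValueError(f"not a base-32 character: {ch!r}")


def duplet(ch: str) -> int:
    """Compute the 2-bit value of one base-4 character (w..z)."""
    ch = ch.lower()
    if "w" <= ch <= "z":
        return ord(ch) - ord("w")
    raise ValueError(f"not a base-4 character: {ch!r}")


def concat(base4: str, base32: str) -> int:
    """Concatenate an optional duplet with quintets to form a diff."""
    value = duplet(base4) if base4 else 0
    for ch in base32:
        value = (value << 5) | quintet(ch)
    return value


print(hex(concat("x", "u")))     # 7 bit form taken from example (D)
print(hex(concat("x", "g9o")))   # 17 bit form taken from example (C)
print(hex(concat("", "4odf")))   # 20 bit form taken from example (G)
```

   Because both alphabets are contiguous runs of ASCII characters, 
   each value is obtained by a single subtraction. 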
    
   Two Excel spreadsheets for ACE37 encoding and decoding can be found 
   at http://www.dnsii.org/ace37/ace37-encode.xls and 
   http://www.dnsii.org/ace37/ace37-decode.xls respectively.  These 
   illustrate the simplicity of ACE37 and provide a handy tool for 
   checking ACE37 encoding and decoding algorithms.  The ACE37-encode 
   spreadsheet also includes a DUDE-encode worksheet. 
    
10. Security Considerations 
    
   This document does not address DNS security issues.  The proposal 
   is believed to introduce no security problems beyond those that 
   already exist or are anticipated from adding multilingual 
   characters to the DNS or from using an ACE. 
 
11. References 
 
   [AMC-W]   Adam M. Costello, "AMC-ACE-W version 0.1.0", May 31, 2001. 
    
   [AMC-V]   Adam M. Costello, "AMC-ACE-V version 0.1.0", May 31, 2001. 
    
   [DUDE-02] Mark Welter, Brian W. Spolarich & Adam M. 
             Costello, "Differential Unicode Domain Encoding (DUDE)", 
             June 7, 2001. 
    
   [LACE]    Mark Davis, IBM & Paul Hoffman, IMC & VPNC, "LACE: Length-
             based ASCII Compatible Encoding for IDN", January 5, 2001. 
    
   [Nameprep] Paul Hoffman, IMC & VPNC & Marc Blanchet, ViaGenie, 
              "Preparation of Internationalized Host Names", February 
              24, 2001. 
    
Appendix A. Acknowledgements 
    
   The ACE37 draft is a combination of DUDE-02, the AMC series and 
   LACE, and takes into consideration the report of the ACE design 
   team.  The authors would therefore like to thank the authors of 
   DUDE-02, Mark Welter, Brian W. Spolarich and Adam M. Costello; the 
   author of the AMC series, Adam M. Costello; the authors of LACE, 
   Mark Davis and Paul Hoffman; and the ACE design team and its 
   advisors, Adam M. Costello, Paul Hoffman, Makoto Ishisone, David 
   Laurence, Brian Spolarich, Rick Wesson, Marc Blanchet, Patrik 
   Faltstrom and Erik Nordmark, for their inspiration. 
    
Appendix B. Mixed-case annotation 
    
   This section is taken from DUDE and modified for ACE37. 
    
   In order to use ACE37 to represent case-insensitive Unicode strings, 
   higher layers need to case-fold the Unicode strings prior to ACE37 
   encoding.  The encoded string can, however, use mixed-case base-4 
   characters as an annotation telling how to convert the folded 
   Unicode string into a mixed-case Unicode string for display 
   purposes. 
    
   Each Unicode code point (unless it is an LDH) is represented by a 
   sequence of base-4 and base-32 characters, the first of which is 
   usually a base-4 character, which is always a letter {wxyz} (as 
   opposed to a digit).  If that letter is uppercase, it is a 
   suggestion that the Unicode character be mapped to uppercase (if 
   possible); if the letter is lowercase, it is a suggestion that the 
   Unicode character be mapped to lowercase (if possible). 
    
   If the code point is an LDH, for example "a", it will be represented 
   as "-a".  To mark the case for an LDH, simply set the LDH to the 
   desired case following the "-".  For example, if an uppercase "A" is 
   desired, the encoded form SHOULD be "-A". 
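
   Reading the case suggestion back for LDH characters can be sketched 
   in Python as below.  The sketch assumes that "-" occurs in an ACE37 
   string only as the marker preceding a literal LDH character; it 
   ignores the annotations carried by base-4 and base-32 characters, 
   and the function name is hypothetical. 

```python
def ldh_case_hints(encoded: str) -> str:
    """Collect the literal LDH character that follows each "-" marker."""
    hints = []
    i = 0
    while i < len(encoded):
        if encoded[i] == "-" and i + 1 < len(encoded):
            hints.append(encoded[i + 1])  # literal keeps its annotated case
            i += 2
        else:
            i += 1                        # part of a non-LDH representation
    return "".join(hints)


# Capitalized form of example (H) from the end of this appendix.
annotated = "-P-r-o0BT-p-r-o-s-tWM-n-e-m-L-U-V0fm0f0-e-s-k-y"
print(ldh_case_hints(annotated))
```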
    
   Note that there is a possibility that no base-4 character is present 
   for a code point representation.  That is the case for a 15-bit diff 
   form.  In this case, the base-32 characters will be used for case 
   suggestion (if possible), similar to that discussed for using a 
   base-4 character.  There is, however, a remote possibility that all 
   3 base-32 characters are digits.  If this happens, case unfolding 
   will be aborted.  Since case annotation is an optional feature used 
   for display purposes only, this is not considered a major concern.  
   The probability of this happening is only (32639/27)/1114109, or 
   about 0.1%. 
    
   ACE37 encoders and decoders are not required to support these 
   annotations, and higher layers need not use them. 
    
   For example:  In order to suggest that example (H) in Section 8: 
   "Examples" be displayed as: 
   Czech: Pro<ccaron(uppercase)>prost<ecaron(uppercase)> 
          nemLUV<iacute(lowercase)><ccaron(lowercase)>esky 
    
   one could capitalize the ACE37 encoding as: 
     ACE37: -P-r-o0BT-p-r-o-s-tWM-n-e-m-L-U-V0fm0f0-e-s-k-y (47 char) 
    
Authors: 
 
Edmon Chung 
Neteka Inc. 
2462 Yonge St. Toronto, 
Ontario, Canada M4P 2H5 
edmon@neteka.com 
 
David Leung 
Neteka Inc. 
2462 Yonge St. Toronto, 
Ontario, Canada M4P 2H5 
david@neteka.com 
    







