Internet DRAFT - draft-feather-hashed-uri
draft-feather-hashed-uri
INTERNET DRAFT C. Feather
Expires: 16 March 2003 Thus plc
17 September 2002
The Hashed URI
draft-feather-hashed-uri-03.txt
1. Status of this Document
This document is an Internet-Draft and is in full conformance with
Section 10 of RFC 2026.
Internet-Drafts are working documents of the Internet Engineering Task
Force (IETF), its areas, and its working groups. Note that other groups
may also distribute working documents as Internet-Drafts.
Internet-Drafts are draft documents valid for a maximum of six months
and may be updated, replaced, or made obsolete by other documents at any
time. It is inappropriate to use Internet-Drafts as reference material
or to cite them other than as "work in progress."
The list of current Internet-Drafts can be accesses at
http://www.ietf.org/ietf/1id-abstracts.txt.
The list of Internet-Draft shadow directories can be accessed
at http://www.ietf.org/shadow.html.
This section will be updated with the appropriate verbiage from RFC 2223
when this document has been found ready for publication as an RFC.
2. Abstract
There are situations where it is desirable to determine whether two
URIs are identical without revealing the value of one or the other if
they are not. For example, the publisher of filtering software may want
to provide a set of URLs to be filtered without identifying the actual
resources in question.
This problem has traditionally been done using cryptographically
secure hashes. The two items are both hashed and the resulting values
then compared. If they are equal it can be assumed that the two original
items were equal as well; if not, the hash does not reveal anything
about the original item. This technique is eminently suitable for this
situation.
This document describes a "hashed" URI naming scheme for representing
the hashed form of a URI [RFC2396] as another URI.
Feather [Page 1]
INTERNET DRAFT The Hashed URI 17 September 2002
3. Formal definitions
Formal definitions in this document use augmented Backus-Naur syntax
[RFC2234]. The key words "MUST", "MUST NOT", "REQUIRED", "SHALL",
"SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and
"OPTIONAL" in this document are to be interpreted as described in
[RFC2119].
4. The hashed URI scheme
4.1. Syntax
A hashed URI is formally described as follows:
hashed-uri = hashed-scheme ":" hash-data *flag
hash-data = hash-algorithm "=" hash-value
hashed-scheme = "hashed"
hash-algorithm = identifier
hash-value = 1*HEXDIG
identifier = 1*(ALPHA / DIGIT / "." / "-")
flag = "+frag" / "+query"
ALPHA = %x41-5A / %x61-7A ; A-Z / a-z
DIGIT = %x30-39 ; 0-9
HEXDIG = DIGIT / "A" / "B" / "C" / "D" / "E" / "F"
The entire URI is case-insensitive.
4.2. Identification of hash algorithms
Each hashing algorithm has an identifier for the purposes of the hashed
URI scheme. Names beginning with "x-" are available for private use. The
Internet Assigned Numbers Authority (IANA) will maintain a registry of
hash algorithms and the associated identifier. The initial contents of
this registry will be:
Identifier Definition
"md5" Message Digest algorithm 5 ("MD5") [RFC1321]
"sha1" Secure Hash Algorithm ("SHA-1") [FIPS-180-1]
A hashed URI MUST only use an identifier that is in the registry or
that is available for private use. An application that interprets hashed
URIs MUST accept "sha1" and SHOULD accept all identifiers in the
registry.
Feather [Page 2]
INTERNET DRAFT The Hashed URI 17 September 2002
4.3. Generating a hashed URI
A hashed URI is generated from an initial URI in the following five
steps.
(1) Any #fragment or ?query component is removed if appropriate. When
generating hashes for publication, whether or not this is appropriate
will depend on the purpose and possibly on the particular URI. When
generating a hash to compare against a hashed URI, the flags on the
latter can be used to determine which to retain.
(2) The URI is reduced to its equivalent canonical form. To do this, the
scheme name is converted to lowercase and a relative URI is converted to
the equivalent absolute URI; any other changes depend on the specific
scheme used. Except where done as part of a canonicalization algorithm
for the specific scheme, no encoding or decoding of escaped characters
is done. A possible canonicalization algorithm for hierarchical URIs is
given in Appendix A.
(3) The resulting string is passed through the appropriate hash
algorithm. The string does not include any trailing zero bytes, a
length, delimiting angle brackets, or any other ornamentation. This
produces a sequence of octets (16 octets for MD5, 20 octets for
SHA-1). If the algorithm requires the data to be a multiple of a
particular size and does not specify a padding rule, the string is
padded with zero bytes to that size (MD5 and SHA-1 both specify their
own padding rules and therefore no padding is done at this stage).
(4) Each octet is converted to two hexadecimal digits in the normal
manner and these are written in the same order as the corresponding
octets.
(5) If a #fragment or ?query was retained in step 1, the "+frag" or
"+query" flag is appended.
4.4. Examples
Initial: mailto:clive@demon.net
Hashed: hashed:md5=cd933f3b87ee60e58917448c9678ee32
Initial: http://www.thus.net
Hashed: hashed:md5=be96302a0468481aca954431cf293e05
or: hashed:sha1=87ed28009f511ed7ef630180a229ba257180a481
Initial: http://www.thus.net#end
Hashed: hashed:sha1=87ed28009f511ed7ef630180a229ba257180a481
or: hashed:sha1=31a79a96e0404778a3b95cb426b04b6114a604a5+frag
Initial: HTTP://www.Thus.net
Hashed: hashed:md5=be96302a0468481aca954431cf293e05
Initial: http://10.20.30.40/a%62c
Hashed: hashed:md5=754a2c63a13aa2e8153d027066e31f8f
Feather [Page 3]
INTERNET DRAFT The Hashed URI 17 September 2002
5. IANA considerations
Section 4.2 requires IANA to maintain a registry of hash algorithms and
identifiers. Since implementers will want to handle as many hashes as
possible, adding new algorithms places a significant burden on them.
Therefore new algorithms should only be added when they are widely
accepted in the community. The IETF Consensus process [RFC2434] is
recommended.
6. Security Considerations
The security considerations for URIs in [RFC2396] continue to apply.
The security of the hashed URI depends completely on that of the hash
algorithm used.
Where the initial URI is limited to a sufficiently small set of
possibilities, an exhaustive search can be used to determine the initial
URI from the hashed URI. Given H hashed URIs and P possible initial
URIs, the time take to do this search is O(H+P) and not O(HP). In
particular, the set of URIs of the form <http://www.domain>, where the
domain is in a major domain registry, may be small enough for such a
search to be feasible.
Because (subject to the above) the initial URI cannot be determined from
a hashed URI, it is necessary to trust the originator of the latter as
to whether the resource has any of the properties claimed for it.
7. References
[FIPS-180-1] "Secure Hash Standard", National Institute of Standards
and Technology, U.S. Department Of Commerce, April 1995.
Also known as: 59 Fed Reg 35317 (1994).
[RFC1321] Rivest, R., "The MD5 Message-Digest Algorithm", RFC
1321, April 1992.
[RFC2119] Bradner, S., "Keywords for use in RFCs to Indicate
Requirement Levels", RFC 2119.
[RFC2234] Crocker, D. and P. Overell, "Augmented BNF for Syntax
Specifications: ABNF", RFC 2234, November 1997.
[RFC2396] Berners-Lee, T., R. Fielding and L. Manister, "Uniform
Resource Identifiers (URI): Generic Syntax", RFC 2396,
August 1998.
[RFC2434] Narten, T. and H. Alvestrand, "Guidelines for Writing an
IANA Considerations Section in RFCs", RFC 2434,
October 1998.
Feather [Page 4]
INTERNET DRAFT The Hashed URI 17 September 2002
8. Acknowledgements
The assistance of Richard Clayton in preparing this document is greatly
appreciated. I am also grateful to Ian Cooper for his comments.
9. Change history
Changes in version -01:
* The "sha-1" algorithm identifier changed to "sha1".
* The "+frag" and "+query" flags added.
* The handling of fragments and queries clarified.
* Appendix A altered to eliminate empty path_segments in variant P.
* Some further examples added and others altered.
Changes in version -02:
* None.
Changes in version -03:
* Author's fax number.
10. Author's Address
Clive Feather
Thus plc
322 Regents Park Road
London
N3 2QQ
United Kingdom
Email: clive@demon.net
or: clive@davros.org
Tel: +44 20 8371 1138
Fax: +44 870 051 9937
This document expires 16 March 2003.
Feather [Page 5]
INTERNET DRAFT The Hashed URI 17 September 2002
A Canonicalization of hierarchical URIs
The following algorithm is proposed as a way of canonicalizing
hierarchical URIs; that is, those that use the production "hier_part"
from [RFC2396] appendix A.
In the real world it is not always possible to say whether two URIs
address the same resource, because this will depend on the actual
implementation deployed on network servers. Therefore the algorithm has
two variants - "N" and "P" - which respectively tend towards false
negatives (the URIs are assumed to be different) and false positives
(they are assumed to be the same). The choice of variant will depend on
the application domain.
(1) The changes described in section 4.3 steps (1) and (2) are made.
(2) The URI is decomposed into scheme, authority, and a list of
path_segments. If the scheme uses user/host/port triples as
authorities, the authority is further split into the three parts.
Any of these parts may be empty.
(3) If the port is the default port for the scheme, it is deleted.
(4) Variant N: the scheme and host (if a domain name) are lowercased.
Variant P: all the components are lowercased.
(5) The host (if an IP address) and port have all leading zeroes
removed.
(6) Variant P only: all empty path_segments are removed.
(7) Variant P only: if the last path_segment ends with one of the
strings in the first column, that string is replaced by the one in
the second column:
.html .htm
.jpeg .jpg
.text .txt
.ram .ra
(8) Scheme "http" only: if the last path_segment is "index.htm" or
"index.html", it is removed.
(9) The URI is recomposed, including only those components that remain
and are not empty strings. That is, it is the concatenation of:
[always] scheme ":"
[if there is an authority] "//" authority
[if there is a user and host] "//" user "@" host
[if there is only a host] "//" host
[if there is a port] ":" port
[for each path_segment in order] "/" path_segment
Feather [Page 6]
INTERNET DRAFT The Hashed URI 17 September 2002
Examples:
Original URI: "http://Fred@WWW.Thus.net:0112/Test/index.html"
Variant N: "http://Fred@www.thus.net:112/Test/"
Variant P: "http://fred@www.thus.net:112/test/"
Original URI: "http://111.011.101.001/test//Image.jpeg"
Variant N: "http://111.11.101.1/test//Image.jpeg"
Variant P: "http://111.11.101.1/test/image.jpg"
This document expires 16 March 2003.
Feather [Page 7]