Internet DRAFT - draft-dowser-spec
draft-dowser-spec
Internet-Draft Carlos Bueno
May 2004
A Distributed Web Search Protocol -- Dowser/0.1
draft-dowser-spec-00.txt
Status of This Memo
This document is an Internet-Draft and is subject to all provisions
of Section 10 of RFC2026.
Internet-Drafts are working documents of the Internet Engineering
Task Force (IETF), its areas, and its working groups. Note that
other groups may also distribute working documents as
Internet-Drafts.
Internet-Drafts are draft documents valid for a maximum of six
months and may be updated, replaced, or obsoleted by other
documents at any time. It is inappropriate to use Internet-
Drafts as reference material or to cite them other than as
"work in progress."
The list of current Internet-Drafts can be accessed at
http://www.ietf.org/1id-abstracts.html
The list of Internet-Draft Shadow Directories can be accessed at
http://www.ietf.org/shadow.html
This document Expires on November 04, 2004
Abstract
Dowser is an application-level, peer-to-peer protocol for creating a
searchable index and cache of documents. It is intended for both
small-scale intranets and Worldwide Web-scale indexes. This document
describes the messages that members, or "nodes", of the network pass
to each other to distribute workload, request & respond to queries,
and maintain the index.
Bueno [Page 1]
Internet-Draft Dowser/0.1 May 2004
Table of contents
1. Introduction ................................................. 3
1.1 Purpose .................................................... 3
1.2 Definitions ................................................ 3
1.3 URL vs. URI vs. URN ........................................ 4
1.4 Requirements ............................................... 5
2. Some Examples ................................................ 5
2.1 NODEFIND ................................................... 5
2.1.1 Node-id & seed ............................................ 6
2.1.2 Last-key .................................................. 6
2.1.3 Ring-id ................................................... 7
2.1.4 Port ...................................................... 7
2.1.5 Server Responses .......................................... 7
2.2 SEARCH ...................................................... 8
2.2.1 Expires ................................................... 9
2.2.2 Content-key ............................................... 9
2.2.3 Search syntax ............................................. 9
2.3. URL caching ............................................... 10
2.4. Content caching ........................................... 11
2.5. CRAWL ..................................................... 11
2.6.INDEXADD ................................................... 12
3. Announcing .................................................. 13
4. Redundancy .................................................. 14
5. Optional headers ............................................ 14
6. Error conditions ............................................ 14
7. Notes on implementation ..................................... 15
7.1 Ranking .................................................... 15
7.2 Proxying ................................................... 15
7.3 TCP vs UDP ................................................. 16
7.4 Ping/Pong .................................................. 16
7.5 MHTML ...................................................... 16
7.6 Spam ....................................................... 16
7.7 Indexing ................................................... 16
8. Acknowledgements ............................................ 16
9. References .................................................. 17
9.1 Informative ................................................ 17
9.2 Normative .................................................. 17
10. Security Considerations .................................... 17
10.1 Personal Information ...................................... 18
10.2 Sensitive data ............................................ 18
11. IPR & Copyright ............................................ 18
12. Contact Information ........................................ 19
Bueno [Page 2]
Internet-Draft Dowser/0.1 May 2004
1. Introduction
1.1 Purpose
The express purpose of the Dowser protocol is to encourage open
research into web-scale, decentralized indexing systems. The
specification for the Hypertext Transfer Protocol [RFC1945]
has this observation:
Practical information systems require more functionality than
simple retrieval, including search, front-end update, and
annotation. HTTP allows an open-ended set of methods to be used to
indicate the purpose of a request.
HTTP is in widespread use by hundreds of millions of people every
day, and they issue hundreds of millions of searches for information.
Most of these searches are served by centralized "search engines"
that cache copies of as many websites as possible to create their
indexes. The problems of scale have so far been admirably met
[GOOGLE], but the operating costs of web-scale engines are now out of
the reach of most Universities and corporations. Conversely, larger
and larger amounts of storage and processing power are available to
personal computer users every year.
Dowser is an extension to HTTP that can be used by itself or added to
existing HTTP implementations. Each node on the network claims a
small part of the keyspace of a distributed hash table [CHORD].
Indexes of documents are keyed under the hash of the word; documents
themselves are keyed by their Uniform Resource Identifier [URI].
There is also a "cache of caches", to allow popular documents or
indexes to be retrieved from any node that has a copy.
The syntax and formatting rules are inherited from HTTP/1.1
[RFC2616]. The required "Host" header in HTTP/1.1 does not really
apply to Dowser, but including it shouldn't hurt anyone. Several
other headers and return codes are adopted for use.
1.2 Definitions
node
In this protocol there is no difference between a "client" or a
"server". They are all equal "nodes" on the network, capable of
sending and serving requests. Each node is identified by a
40-digit hexadecimal number called a "node-id". They participate
in creating, storing and serving a distributed hash table. This
hash table can contain document caches, indexes of those
documents, and other data.
For sanity's sake, the term "server" can be taken to mean
"responding node" and "client" to mean "requesting node" in the
context of the conversation being described.
Bueno [Page 3]
Internet-Draft Dowser/0.1 May 2004
hash, key
Hash functions are used to evenly distribute data around the
network and to uniquely identify nodes and pieces of information.
Dowser uses the Secure Hash Algorithim, version 2 [SHA]. The terms
"hash" and "key" refer to the 40-digit hexadecimal number gotten
from passing some piece of information through the SHA function,
e.g.:
SHA("foo") == 0beec7b5ea3f0fdbc95d0dd47f3c5bc275da8a33
distributed hash table
A distributed hash table is a key-->value mapping that is
contained on many computers, where they keys are hashed.
range
The "range" of a node is defined by the first and last key of the
hash table it is expected to store data under.
left-hand, right-hand
The total keyspace is imagined as the perimeter of a circle, or a
"ring". While standing in the center, points to the left generally
decrease and points to the right generally increase, except at the
point where "fff..." meets "000...".
A useful analog is timezones, where West is left and East is
right. The time is always earlier to the left except at the
International Date Line.
1.3 URL vs. URI vs. URN
The protocol descibed here has a focus different from most
filesharing protocols. Searches for files are not broadcast but
targeted; nodes are authoritative regarding the location and
checksums of data in their ranges. Searches for a particular node are
sent to arbitrary nodes but not broadcast per se; they are not
forwarded but responded to directly.
In most ad-hoc filesharing protocols the only unique identifier is
the checksum. The filesystem names and metadata may be different for
different copies but the checksum remains the same. In HTTP jargon
this is a Universal Resource Name or URN , that which uniquely
identifies the file without necessarily referencing its location or
common name [RFC2396].
The content of a cached web page may change with time. This means the
checksum will change while its Univeral Resource Locator, the origin
of the page, remains constant.
Indexes built from these cached pages are also treated as files, one
index file per word or phrase. An index's Universal Resource Name is
Bueno [Page 4]
Internet-Draft Dowser/0.1 May 2004
that word. (More correctly it would be "dowser://foo", not "foo".)
Searching in Dowser therefore starts with a lookup for the node(s)
who handle the search term(s). The search terms (URN) are then sent
to those nodes, resulting in either a) the content of index file or
b) the latest checksum of the index file and a list of nodes that
have previously downloaded a copy. Retrieving a cache of a web page
is done the same way.
1.4 Requirements
The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
"SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
document are to be interpreted as described in RFC 2119 [34].
An implementation is not compliant if it fails to satisfy one or more
of the MUST or REQUIRED level requirements for the protocols it
implements. An implementation that satisfies all the MUST or REQUIRED
level and all the SHOULD level requirements for its protocols is said
to be "unconditionally compliant"; one that satisfies all the MUST
level requirements but not all the SHOULD level requirements for its
protocols is said to be "conditionally compliant."
In reality, the complications of this protocol are mostly in the
state information, not the ins and outs of the messages being passed.
2. Some Examples
We will assume the reader is familiar with HTTP transactions. It is
probably most instructive to give examples using the new methods and
then a discussion of particulars. Elipses ("...") are used to
truncate long hashes for readability.
2.1 NODEFIND
NODEFINDs are the bread and butter of routing between nodes. The
client wants to find the node(s) that have items under a certain hash
key, in this example, "5c89379d0aa9840ac910fd8cacbde2dbf877214a".
This can be a search term, url, or anything. The responding node may
or may not have this key. If not, it tells the client about nodes it
knows about that are closer to the target.
NODEFIND 5c89379d0aa9840ac910fd8cacbde2dbf877214a Dowser/0.1
Ring-Id: deadbeef00000000000000000000000000000000
Node-Id: 12222... a5754...
Last-key: 13333...
Port: 4666
- Response -
Bueno [Page 5]
Internet-Draft Dowser/0.1 May 2004
Dowser/0.1 310 Not in my range
Ring-Id: deadbeef00000000000000000000000000000000
Node-Id: 48888... de34....
Last-key: 49999...
192.0.2.10 8000 50000f... 599999...
192.0.2.20 6876 43333a... 610000...
- Or -
Dowser/0.1 211 That's me
Ring-Id: deadbeef00000000000000000000000000000000
Node-Id: 4000... de34....
Last-key: 7aaaa...
There is a lot to cover here. The first line of the request is like
HTTP, identifying the protocol and version, the 'method' or action to
be done, and the 'path' or key being requested. There are three
additional headers which MUST be included in any message: Ring-id,
Node-id, and Last-key. Additionally, the Port header MUST be included
in every request.
2.1.1 Node-id & seed
The node-id is comprised of two 40-digit hexadecimal numbers,
separated by whitespace. The first number serves as the FIRST key in
this node's range. This number MUST be generated by hashing some
random data iteratively. To help ensure this has occurred, the second
number (the "seed") MUST be the second-to-last result of the
iterations.
Node-id: b274f2e2a8d2881035af5866014e9ad5510ab15d
cdd2ae2594a83ef90c05ee6014b78631db8538d8
This example (linewrapped to fit) is well-formed because the first
number is the SHA hash of the second number. Servers MUST check that
the node-id and its "seed" are correct. If not, they MUST send a 412
"Precondition failed" error.
2.1.2 Last-key
Last-key is also an SHA hash, and is the LAST key in this node's
range. The last-key may change over time, as nodes leave or join the
network.
Note: it has been brought up that a node might be able to route a
large portion of traffic to itself by making its Last-key equal to
the node-id minus 1. In Internet Protocol terms, this would be
similar to setting your network address to 255.255.255.255. There is
Bueno [Page 6]
Internet-Draft Dowser/0.1 May 2004
nothing in the protocol to prevent this, but implementations should
come up with a way to deal with it reasonably. Our feeling that it is
self-correcting. If someone tries it on a network with a lot of nodes
one hopes he or she has good fire-supression equipment handy; but the
concern is the possible damage to the network routing.
2.1.3 Ring-id
Ring-id is yet another SHA hash to identify the group of nodes it
wishes to communicate with. This allows many Dowser networks to be
running on an internet at once. Alternate rings are useful for
testing implementations of the protocol on a large scale, or for
partitioning networks in general.
The "official" public ring-id:
Dowser/0.1 --> deadbeef00000000000000000000000000000000
If the Ring-id header sent does not match the ring-id of the server,
it MUST send a 412 "Precondition falied" message, and may provide
information in the body of the response as to the reason. Note: there
is a security risk in actually revealing the server's ring-id in the
response, if the ring is intended for private data.
2.1.4 Port
Simple enough: we need to advertise what TCP or UDP port we are
listening on for future reference.
2.1.5 Server Responses
If the hash is not in the listening node's range, it returns a 310
code and a list of nodes that are closer to the target. The list of
nodes takes this form:
IP PORT NODE-ID LAST-KEY
If it is in range, it sends a 211 "That's Me" code.
When overlap occurs a node may choose which one to contact first in
an implementation-dependent way, such as IP or network distance, or
closeness of the desired key to the node-id.
Note: IPv4 addresses are used in these examples, but implementations
SHOULD be written to handle IPv6 addresses as well, e.g.:
FEDC:BA98:7654:3210:FEDC:BA98:7654:3210
1080:0:0:0:8:800:200C:4171
3ffe:2a00:100:7031::1
1080::8:800:200C:417A
::192.9.5.5
Bueno [Page 7]
Internet-Draft Dowser/0.1 May 2004
::FFFF:129.144.52.38
2010:836B:4179::836B:4179
2.2 SEARCH
We have the basis of all conversations, and have described how nodes
identify themselves and each other. Now we can go to the next level:
content routing.
- Request -
SEARCH foo+bar+baz Dowser/0.1
Ring-Id: deadbeef00000000000000000000000000000000
Node-Id: 1d3470c... 55754...
Last-key: 1e400cc...
Port: 4666
- Response -
Dowser/0.1 200 OK
Ring-Id: deadbeef00000000000000000000000000000000
Node-Id: 48888... de34...
Last-key: 49999...
Expires: 864000
Content-key: 3fdb13677b10691debb3909dd917b00ee751115a
URL TITLE AGE SNIPPET [SNIPPET [SNIPPET [...]]]
URL TITLE AGE SNIPPET [SNIPPET [SNIPPET [...]]]
...
- Or -
Dowser/0.1 300 Multiple choices
Ring-Id: deadbeef00000000000000000000000000000000
Node-Id: 48888... de34...
Last-key: 49999...
Expires: 864000
Content-key: 3fdb13677b10691debb3909dd917b00ee751115a
192.0.2.10 8000 50000f... 599999...
192.0.2.20 6876 43333a... 610000...
A SEARCH request is almost exactly like a nodefind except the "path"
is a list of search terms. The server determines if any of the terms
fall into its range. How to break up the terms may be
implementation-specific, but the simplest is to break on whitespace.
If any are in-range it returns a 300 code, the "content-key" of the
data (a SHA checksum of the content) a list of the nodes that have
Bueno [Page 8]
Internet-Draft Dowser/0.1 May 2004
requested that data.
Or, it may return the index data itself. Requesting nodes SHOULD keep
a copy of the data it receives for a reasonable amount of time, as
hinted by the Expires: header. The data should be keyed under the
content-key, for retrieval via a CACHE request. This helps distribute
the workload. The index is in the form:
URL TITLE AGE SNIPPET [SNIPPET [SNIPPET [...]]]
And is tab-delimited. AGE is the number of seconds since the record
was created. One or more "SNIPPET"s contain text that form an
abstract of the document.
As always, the server can alternately send the standard 310 Not in
Range, or some error code.
2.2.1 Expires:
Indexes and caches of web documents are not long-lived in the way a
media file in traditional peer-to-peer networks are. The Expires
header hints at how long a node should cache this data, expressed in
seconds from the present time.
2.2.2 Content-key:
This is a SHA hash of the data's content, used to positively identify
it. This is different from the hash of the URL or search term. A page
whose URL is
"http://foo.example.com"
... and whose content is
"<h1>Foo!</h1>"
... would have a URL hash of cfab46bb7dbd11e6187360d429586e2942f2d42e
and a content-key of cf5ce65061218164e4148038cc3a56a9e988fe7a. The
URL hash is a unique identifier of the document's origin, while the
content-key is an identifier of a particular version of that
document.
2.2.3 Search syntax
There are a few schemes for search syntax out in the wild. The most
common has operands for words that must or must not appear, and exact
phrases. Implementations MUST honor these operands to the best of
their ability.
foo +bar
Bueno [Page 9]
Internet-Draft Dowser/0.1 May 2004
Means that the documents MUST contain "bar" and MAY contain "foo"
foo -bar
Means that the documents MUST contain "foo" and MUST NOT contain
"bar". When there is only one term without a "-" operand, the "+" is
implied.
"foo bar"
'foo bar'
Means that the documents MUST contain the exact phrase "foo bar", in
that order.
Note: in the SEARCH example in section 2.0, it may seem like the "+"
signs in "foo+bar+baz" are search syntax. They are not; the terms are
"url-encoded", meaning spaces are encoded as "+" and other non-word
characters are encoded as %HEX where HEX is the hexadecimal ASCII
code. Therefore "foo +bar" would appear as "foo+%2Bbar"
Implementations may honor other operands as they see fit.
2.3. URL caching
Like a SEARCH, a URLCACHE does not necessarily return the data -- it
can return the content-hash of the data and a list of the last X
nodes to request that data, where X is implementation-specific. If
none of the listed nodes have the data, the requesting node may then
request it from the original node.
- Request -
URLCACHE fb9642989a2222305832a5d9c29b34... Dowser/0.1
Ring-Id: deadbeef00000000000000000000000000000000
Node-Id: 1d3470c... a5754......
Last-key: 1e400...
Url: http://foo.example.com/some/page.html
Port: 4666
- Response -
Dowser/0.1 300 Multiple choices
Ring-Id: deadbeef00000000000000000000000000000000
Node-Id: 48888... de34...
Last-key: 49999...
Expires: 864000
Content-key: 3fdb13677b10691debb3909dd917b00ee751115a
192.0.2.10 8000 50000f... 599999...
192.0.2.20 6876 43333a... 610000...
- Or -
Bueno [Page 10]
Internet-Draft Dowser/0.1 May 2004
Dowser/0.1 200 OK
Ring-Id: deadbeef00000000000000000000000000000000
Node-Id: 48888... de34...
Last-key: 49999...
Expires: 864000
Content-key: 3fdb13677b10691debb3909dd917b00ee751115a
... (file data) ...
2.4. Content caching
This can be page caches, indexes, or really anything. A client sends
a request with the content-key as the path:
- Request -
CACHE 3fdb13677b10691debb3909dd917b00ee751115a Dowser/0.1
Ring-Id: deadbeef00000000000000000000000000000000
Node-Id: 1d3470c... de34...
Last-key: 1e400...
Url: http://foo.example.com/some/page.html
Port: 4666
- Response -
Dowser/0.1 200 OK
Ring-Id: deadbeef00000000000000000000000000000000
Node-Id: 48888... de34...
Last-key: 49999...
Expires: 864000
Content-key: 3fdb13677b10691debb3909dd917b00ee751115a
... (file data) ...
- Or -
Dowser/0.1 404 not found
Ring-Id: deadbeef00000000000000000000000000000000
Node-Id: 48888... de34...
Last-key: 49999...
## todo: content-length? simultaneous downloading? the Range: & Content-length: headers?
2.5. CRAWL
So we know how to find nodes and content. The last question is: where
does the content come from? And the indexes?
Bueno [Page 11]
Internet-Draft Dowser/0.1 May 2004
CRAWL http://foo.example.com/foo/bar/baz.html Dowser/0.1
Ring-Id: deadbeef00000000000000000000000000000000
Node-Id: 12222... a5754......
Last-key: 13333...
Referer: http://foo.example.com/foo/index.html
Port: 4666
Expires: 864000
Content-key: 3fdb13677b10691debb3909dd917b00ee751115a
- Response -
Dowser/0.1 202 Accepted
Ring-Id: deadbeef00000000000000000000000000000000
Node-Id: 48888... de34...
Last-key: 49999...
- Or -
Dowser/0.1 310 Not in my range
Ring-Id: deadbeef00000000000000000000000000000000
Node-Id: 48888... de34...
Last-key: 49999...
192.0.2.10 8000 50000f... 599999...
192.0.2.20 6876 43333a... 610000...
When a node crawls a page, it usually has links on it. If the hash of
those second-level URLs are not in the node's range, it sends a CRAWL
annoucement to the nodes that do handle that range, so they can
continue the crawl.
A node SHOULD first check (via URLCACHE) if a cached copy exists
before downloading a page directly.
Implementations SHOULD respect the "robots.txt" protocol for nice
spiders.
##todo: so where do we START crawling? that is not part of the
protocol because it could be a network of office machines sharing
their spreadsheets, or whatever -- need to say that in a nice way.
2.6.INDEXADD
An INDEXADD is an informational message to a node that a particular
URL contains a keyword inside the server's range.
INDEXADD 09dd917b00ee751115a3fdb13677b10691debb39 Dowser/0.1
Ring-Id: deadbeef00000000000000000000000000000000
Bueno [Page 12]
Internet-Draft Dowser/0.1 May 2004
Node-Id: 12222... a5754......
Last-key: 13333...
Term: foo bar baz
Url: http://foo.example.com/foo/bar/baz.html
Expires: 864000
Content-key: 3fdb13677b10691debb3909dd917b00ee751115a
Port: 4666
##todo: positions, counts, tdf*idf?
- Response -
Dowser/0.1 202 Accepted
Ring-Id: deadbeef00000000000000000000000000000000
Node-Id: 48888... de34..
Last-key: 49999...
- Or -
Dowser/0.1 310 Not in my range
Ring-Id: deadbeef00000000000000000000000000000000
Node-Id: 48888... de34..
Last-key: 49999...
192.0.2.10 8000 50000f... 599999...
192.0.2.20 6876 43333a... 610000...
The Term: header contains what words the server may be interested in,
separated by tabs. An INDEXADD message implies that the requesting
node has a copy of the document. Listening nodes should retrieve it
from that node via a CONTENTCACHE message. This puts a cost on adding
words to the index.
3. Announcing
When a new node joins the network, it must first be provided with a
list of "well-known" nodes and a ring-id. Knowing nothing else about
the network topology, it generates a node-id and sends a NODEFIND
request on that key to those well-known nodes for its own key:
NODEFIND 12222...0 Dowser/0.1
Ring-Id: deadbeef00000000000000000000000000000000
Node-Id: 12222... a5754...
Last-key: 12222...
Port: 4666
This eventually leads the new node to its "neighbors" on the ring.
The new node's range SHOULD initially be of the same length as its
left-hand neighbor, but MAY grow or shrink according to how much
capacity it has at hand, but the Last-key MUST NOT be less than the
node-id of its closest neighbor.
Bueno [Page 13]
Internet-Draft Dowser/0.1 May 2004
4. Redundancy
All nodes may not be available at all times, and the "overlap" is not
guaranteed, requiring some fiat redundancy. So:
Each node assumes a range starting with its nodeid and That defines
its "core" range. It SHALL also handle an "auxilary" range of equal
length, calculated by adding 8 to the first digit. The addtion does
not carry. For example:
Node A "Core" range:
9000000000000000000000000000000000000000 to
c000000000000000000000000000000000000000
Auxilary range:
1000000000000000000000000000000000000000 to
4000000000000000000000000000000000000000
5. Optional headers
To save on bandwidth, nodes MAY compress their responses. A
requesting node MAY indicate that it can accept compressed data with
this optional header:
Accept-Encoding: gzip, deflate
If a server compresses a response, it MUST inidcate that with this
header:
Content-encoding: gzip
The server MAY also send this optional header to SEARCH responses to
help guide the user:
Related-terms: meta, syntax, jargon
6. Error conditions
Many error codes from HTTP/1.1 apply to Dowser.
500 Internal Server error
501 Not Implemented
503 Server busy (Service unavailable)
Bueno [Page 14]
Internet-Draft Dowser/0.1 May 2004
505 HTTP Version Not Supported
400 Bad request
404 Not Found
412 Precondition failed
When a node's id and seed don't check out, for instance. This is
different from a mal-formed request which should be responded to
with a 400. Also, nodes may keep a "blacklist" of node-ids and/or
IPs that they do not wish to do business with (See SPAM). In the
case of blacklists, the server MAY simply close the connection,
but it's polite to give some error information.
Implementations MAY elect to withold the Ring-id header in this
case, for security.
7. Notes on implementation
A reference implementation is currently in alpha testing and should be
released by mid-summer 2004 [IMPL]
7.1 Ranking
There is not a lot of ranking data returned by SEARCH requests. The
idea is to return somewhat raw indexes to the client and let it sort
things out through intersecting multiple responses and its own
biases.
One time-saving trick may be employed by implementations: recall in
the SEARCH example, all of the search terms were sent, even though
it's unlikely one node will handle all of them. The reason is to
allow nodes to pick out terms that tend to accompany the terms they
are responsible for. The node may then keep an extra index of those
accompanying terms and use it to trim and/or rank the results sent
back.
7.2 Proxying
If a relatively new node receives a request for a key that is in its
range but not its datastore, it may create a new request to its
upstream neighbor for it, on the theory that it may still be in that
node's datastore. It then keeps a copy while forwarding it to the
original requesting node. This allows new nodes to build on the work
of older ones.
Implementations might want to work out semantics for long-lived
connections on the lines of HTTP/1.1 Keep-alive headers.
Bueno [Page 15]
Internet-Draft Dowser/0.1 May 2004
7.3 TCP vs UDP
Several methods (NODEFIND, CRAWL, INDEXADD) have two interesting
properties: they will be used often, and involve a very small amount
of data. It is tempting to implement those parts over UDP and the
rest over TCP, or to come up with some tricks to do it all over UDP.
7.4 Ping/Pong
Neighboring nodes should be in the habit of pinging their nearest
neighbors, if they have not heard from them in a while. A "ping" can
take the form of a NODEFIND using the neighbor's node-id.
7.5 MHTML
HTML documents of today tend to be comprised of many files including
graphics, stylesheets, etc. It may be useful for implementations to
offer "MHTML" versions of cached documents that contain these
supporting files in one MIME-encoded file. If a client supports
MHTML, it should include it in the Accept: header. Common-sense
restrictions apply, such as not including executable scripts or
plugins and files that do not come from the same domain.
7.6 Spam
The network described above may one day contain thousands of nodes.
Thousands of nodes implies thousands of users looking for
information, which nowadays means spam. An implementation may elect
to deal with it in the way email clients do, with filters built from
analysis of links classified by the user as "spam" or "useful". Even
in the abscence of spurious traffic a good Bayesian filter may help
tailor results.
A sophisticated method of injecting targeted spam involves generating
node-ids until one pair is "close enough" to the hash of a desired
keyword, where "close enough" is matching the first 4 or 5 digits.
Even thousands of nodes will be sparse in the keyspace, so an
exhaustive search is not necessary.
7.7 Indexing
A large sample of web documents suggests that there is a fairly
constant number of terms, something on the close order of 14 million
[GOOGLE].
An unpublished survey of a large sample of search engine queries
suggests that there is also a fairly constant number of unique terms
searched, on the close order of 1 million. This is very interesting.
Bueno [Page 16]
Internet-Draft Dowser/0.1 May 2004
Suppose we give more "weight" to words found in the query stream over
those that are not. As we've seen in the adventures of Internet
search engines over the last few years, documents often lie about
their contents, but users rarely lie about their desires. They are
vauge, perhaps, but not false.
. . . that is, until the query stream becomes a thing of value. A
clever person might include a popular word and a made-up one into his
or her web page, then send thousands of queries with both of those
words to artificially boost the correlation. I have a wonderful proof
describing how this gambit might be defeated, but unfortunately there
is not enough room to include it here.
8. Acknowledgements
This specification builds on the years of work put into designing and
implementing Dr. Berner-Lee's HTTP protocol by many thousands of
people around the world. You know who you are.
Ideas and algorithims were stolen liberally from Chord, Circle, and
Freenet. Lots and lots of ideas from the original distributed search
projects, Harvest and WebAnts.
Special thanks to Richard Vasquez and Thomas Lackner for their
devious minds and help with early implementations.
9. References
9.1 Informative
[GOOGLE] Brin & Page, "The Anatomy of a Search Engine", Stanford
University, 1997
[CHORD] Stoica, et al. "A Scalable Peer-to-peer Lookup Service for
Internet Applications", Sigcomm 2001
http://www.pdos.lcs.mit.edu/papers/chord:sigcomm01/
[IMPL] Dowser P2P client, http://dowser.sourceforge.net
9.2 Normative
[RFC2396] Berners-Lee, T., Fielding, R. and L. Masinter, "Uniform
Resource Identifiers (URI): Generic Syntax and Semantics", RFC
2396, August 1998.
[HTTP] Feilding, et al. "Hypertext Transfer Protocol -- HTTP/1.1",
RFC2616, June 1999
[SHA] Eastlake & Jones, US Secure Hash Algorithm 1 (SHA1), RFC3174,
September 2001
Bueno [Page 17]
Internet-Draft Dowser/0.1 May 2004
10. Security Considerations
10.1 Personal Information
Dowser deals directly with people's searching habits. While spreading
this information around is kind of the point, implementations SHOULD
take care not to "leak" any more than neccessary. For instance, it
may be tempting to skip the NODEFIND sequence and just issue SEARCHes
or URLCACHEs, since they also return node-finding information. This
sends people's search terms and requested urls to nodes that don't
strictly need it, and SHOULD be avoided.
Implementations can be expected to work closely with traditional HTTP
applications, with their own privacy & security considerations.
10.2 Sensitive data
Dowser was originally intended to deal only with publically
accessible information, but it can also be used for sharing
sensitive data among trusted computers, say, all the word processor
documents on an office LAN. For these situations, implementations
should use some sort of IP whitelisting or subnetting and make the
user aware of the risks.
11. IPR & Copyright
The IETF invites any interested party to bring to its
attention any copyrights, patents or patent applications, or
other proprietary rights which may cover technology that may be
required to practice this standard. Please address the
information to the IETF Executive Director.
Copyright (C) The Internet Society (2004). All Rights
Reserved.
This document and translations of it may be copied and
furnished to others, and derivative works that comment on or
otherwise explain it or assist in its implmentation may be
prepared, copied, published and distributed, in whole or in
part, without restriction of any kind, provided that the above
copyright notice and this paragraph are included on all such
copies and derivative works. However, this document itself may
not be modified in any way, such as by removing the copyright
notice or references to the Internet Society or other Internet
organizations, except as needed for the purpose of developing
Internet standards in which case the procedures for copyrights
defined in the Internet Standards process must be followed, or
as required to translate it into languages other than English.
The limited permissions granted above are perpetual and will
not be revoked by the Internet Society or its successors or
assigns.
Bueno [Page 18]
Internet-Draft Dowser/0.1 May 2004
This document and the information contained herein is provided
on an "AS IS" basis and THE INTERNET SOCIETY AND THE INTERNET
ENGINEERING TASK FORCE DISCLAIMS ALL WARRANTIES, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO ANY WARRANTY THAT THE USE
OF THE INFORMATION HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY
IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A
PARTICULAR PURPOSE.
12. Contact Information
Carlos Bueno
6231 SW 78th St., Ste. 20
Miami, FL, USA 33143
Email: carlos@bueno.org
This document expires on November 4, 2004
Bueno [Page 19]