Internet DRAFT - draft-giudici-web-robots-cntrl
INTERNET DRAFT F. Giudici, A. Sappia
Category: Informational University of Genoa, Italy
February 22, 1997 Expires August 22, 1997
An Extension to the Web Robots Control Method
for supporting Mobile Agents
Status of this Memo
This document is an Internet-Draft. Internet-Drafts are working
documents of the Internet Engineering Task Force (IETF), its areas,
and its working groups. Note that other groups may also distribute
working documents as Internet-Drafts.
Internet-Drafts are draft documents valid for a maximum of six months
and may be updated, replaced, or obsoleted by other documents at any
time. It is inappropriate to use Internet-Drafts as reference
material or to cite them other than as ``work in progress''.
To learn the current status of any Internet-Draft, please check the
``1id-abstracts.txt'' listing contained in the Internet-Drafts Shadow
Directories on ftp.is.co.za (Africa), nic.nordu.net (Europe),
munnari.oz.au (Pacific Rim), ds.internic.net (US East Coast), or
ftp.isi.edu (US West Coast).
1. Abstract
The Web Robots Control Standard [1] is a method for administrators of
sites on the World-Wide-Web to give instructions to visiting Web
robots. This document describes an extension for supporting Robots
based on Mobile Agents, in a way that is independent of the
technology used for their actual implementation.
2. Introduction
Web Robots are Web client programs that automatically traverse the
World Wide Web by retrieving a document and recursively retrieving
all documents that are referenced. Robots are used for maintenance,
indexing and search purposes.
``Classic'' Robots perform their job from the host from which they
Giudici Sappia draft-giudici-web-robots-cntrl-00.txt [Page 1]
INTERNET DRAFT An Extension to the Web Robots Control... Feb 22, 1997
have been launched; recent technologies make it possible to write
Robots that physically move through the network and operate within
the website that hosts the data being processed.
Mobile Robots can save bandwidth and computational power, and they
make personalized search robots possible. A more detailed discussion
of the pros and cons of Mobile Robots is beyond the scope of this
document.
Mobile Agents [5] is a technology that, among other things, allows
the implementation of Mobile Robots. Mobile Agents are a
computational paradigm in which programs can ``migrate'' from host to
host, preserving their current state.
To migrate through the Internet, Mobile Agents have to transfer both
their code and their internal data structures over the network. For
this purpose, they need a communication protocol.
To receive and execute a Mobile Agent, a host must be equipped with a
suitable daemon that listens on a port for incoming requests.
Given the protocol name and the port number on which the daemon is
listening, addresses for Mobile Agent destinations can be written in
the form of a URL [2] as follows:
<protocol> :// <network address> : <port number>
For instance, considering the Agent Transfer Protocol (ATP) [3] and
given a fictional site www.fict.org, a valid address for dispatching
a Mobile Agent could be
atp://www.fict.org:434
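As a minimal sketch (not part of this specification or of ATP), an
address of this form can be split into its three components with
Python's standard urllib.parse module. The helper name
split_agent_address is hypothetical and chosen only for illustration.

```python
from urllib.parse import urlsplit


def split_agent_address(address):
    """Split <protocol>://<network address>:<port number> into its parts.

    Hypothetical helper for illustration; not defined by this document.
    """
    parts = urlsplit(address)
    # All three components are required by the address form shown above.
    if not parts.scheme or not parts.hostname or parts.port is None:
        raise ValueError("expected <protocol>://<network address>:<port number>")
    return parts.scheme, parts.hostname, parts.port


print(split_agent_address("atp://www.fict.org:434"))
```

For the fictional address above, the three parts are the protocol
"atp", the network address "www.fict.org" and the port number 434.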
3. Specification
To control the way Robots can access a WWW site, a method is
currently in use [1]. Put simply, the method states that a special
document, named /robots.txt and whose MIME type is text/plain, should
be available at the root of the website. Referring to the previous
example, the URL of this document would be
http://www.fict.org/robots.txt
/robots.txt contains a list of records that describe in detail which
subtrees of the website are available for exploration by a given
Robot and which are not. The format of these records is the following
one:
<record name> ":" <record contents>
A typical example follows:
User-agent: webcrawler
Allow: /
Disallow: /reserved
The method specification allows extensions to this structure, so new
records can be added simply by defining new tokens.
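The record format can be illustrated with a short sketch that splits
each line of a /robots.txt body into a record name and its contents.
The function name parse_records is hypothetical. Stripping "#"
comments follows the convention of [1], and silently skipping lines
without a colon is a simplification made here, not a rule of the
method.

```python
def parse_records(text):
    """Return (name, contents) pairs for each '<name>: <contents>' line.

    Illustrative only: lines without a colon are skipped.
    """
    records = []
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop '#' comments, as in [1]
        if ":" not in line:
            continue
        # Split on the first colon only, so URLs in the contents survive.
        name, contents = line.split(":", 1)
        records.append((name.strip(), contents.strip()))
    return records


example = """\
User-agent: webcrawler
Allow: /
Disallow: /reserved
"""
print(parse_records(example))
```

Because the split is on the first colon only, a record whose contents
include a URL is kept whole, and a record with an unknown name is
returned like any other so that a caller can ignore or handle it.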
3.1. The Mobile-agent-server record
To control dispatching of Mobile Robots, a new record type is defined
with the following form (the formal syntax is described in the next
section):
Mobile-agent-server: <path> <url>
These records associate a well-defined path on the website with the
URL of a host that accepts Mobile Robots for exploring that path.
More than one Mobile-agent-server line can be used; in this case,
later lines always override earlier ones. Using multiple lines makes
it possible to assign different subtrees to different Mobile Agent
capable hosts, or to none at all. In the following example the website
root (/) is not assigned to any host, while /dir1 and /dir1/dir2 are
assigned to different targets:
Mobile-agent-server: / none
Mobile-agent-server: /dir1 atp://www.fict.org:544
Mobile-agent-server: /dir1/dir2 atp://www.fict.org:543
This mechanism is independent of the protocol and the programming
language used for implementing the Mobile Robot.
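One possible reading of the override rule is sketched below. It rests
on assumptions that this document does not state explicitly: a record
applies to every URL path that begins with the record's path (plain
string prefix matching), the record name is compared without regard
to case as in [1], and malformed lines are ignored. The function
names are hypothetical.

```python
def parse_agent_servers(lines):
    """Collect (path, target) pairs; target is None for "none"."""
    rules = []
    for line in lines:
        name, _, contents = line.partition(":")
        if name.strip().lower() != "mobile-agent-server":
            continue
        fields = contents.split()
        if len(fields) != 2:
            continue  # malformed line: ignored in this sketch
        path, target = fields
        rules.append((path, None if target == "none" else target))
    return rules


def resolve(rules, url_path):
    """Return the agent server URL for url_path, or None if unassigned."""
    result = None
    for path, target in rules:
        if url_path.startswith(path):  # assumption: prefix matching
            result = target  # a later matching line overrides earlier ones
    return result


rules = parse_agent_servers([
    "Mobile-agent-server: / none",
    "Mobile-agent-server: /dir1 atp://www.fict.org:544",
    "Mobile-agent-server: /dir1/dir2 atp://www.fict.org:543",
])
for p in ("/index.html", "/dir1/a.html", "/dir1/dir2/b.html"):
    print(p, resolve(rules, p))
```

With these rules, /index.html has no agent server, /dir1/a.html is
assigned to port 544 and /dir1/dir2/b.html to port 543. Note that
plain prefix matching would also treat /dir10 as lying under /dir1; a
stricter implementation might compare whole path segments instead.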
3.2. Formal Syntax
This is a BNF-like description of the Mobile-agent-server record
line, using the conventions of RFC 822 [4], except that "|" is used
to designate alternatives. Briefly, literals are quoted with "",
parentheses "(" and ")" are used to group elements, optional elements
are enclosed in [brackets], and elements may be preceded with <n>* to
designate n or more repetitions of the following element; n defaults
to 0.
The Mobile Robot extension defines a new record line as follows:
mobileagentrec = "Mobile-agent-server:" *space path
*space (simplified_url | "none")
simplified_url = scheme "://" net_loc
scheme = 1*( alpha | digit | "+" | "-" | "." )
net_loc = *( pchar | ";" | "?" )
space = 1*(SP | HT)
The simplified URL is a subcase of a URL as defined in RFC 1738 [2]
and only designates a protocol, a network location and a port number.
The syntax for "path" and other symbols is defined in RFC 1808 and
reproduced here for convenience:
path = fsegment *( "/" segment)
fsegment = 1*pchar
segment = *pchar
pchar = uchar | ":" | "@" | "&" | "="
uchar = unreserved | escape
unreserved = alpha | digit | safe | extra
escape = "%" hex hex
hex = digit | "A" | "B" | "C" | "D" | "E" | "F" |
"a" | "b" | "c" | "d" | "e" | "f"
alpha = lowalpha | hialpha
lowalpha = "a" | "b" | "c" | "d" | "e" | "f" | "g" |
"h" | "i" | "j" | "k" | "l" | "m" | "n" |
"o" | "p" | "q" | "r" | "s" | "t" | "u" |
"v" | "w" | "x" | "y" | "z"
hialpha = "A" | "B" | "C" | "D" | "E" | "F" | "G" |
"H" | "I" | "J" | "K" | "L" | "M" | "N" |
"O" | "P" | "Q" | "R" | "S" | "T" | "U" |
"V" | "W" | "X" | "Y" | "Z"
digit = "0" | "1" | "2" | "3" | "4" | "5" | "6" |
"7" | "8" | "9"
safe = "$" | "-" | "_" | "." | "+"
extra = "!" | "*" | "'" | "(" | ")" | ","
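The grammar can be approximated with a regular expression, as in the
following sketch. It departs from the BNF in three ways, all of them
assumptions made here: a leading "/" is accepted in the path, because
the examples in this document use paths such as /dir1 while "path" as
written must begin with a pchar; at least one space or tab is required
between the path and the target, since otherwise the two cannot be
told apart; and the record name is matched without regard to case.
The names RECORD and match_record are hypothetical.

```python
import re

# pchar: unreserved characters, ":", "@", "&", "=", or a %XX escape
PCHAR = r"(?:[A-Za-z0-9$\-_.+!*'(),:@&=]|%[0-9A-Fa-f]{2})"
REL_PATH = rf"{PCHAR}+(?:/{PCHAR}*)*"        # "path" as written in the BNF
PATH = rf"(?:/(?:{REL_PATH})?|{REL_PATH})"    # assumption: allow a leading "/"
SCHEME = r"[A-Za-z0-9+\-.]+"
NET_LOC = rf"(?:{PCHAR}|[;?])*"
TARGET = rf"(?:{SCHEME}://{NET_LOC}|none)"

RECORD = re.compile(
    rf"(?i:Mobile-agent-server:)[ \t]*(?P<path>{PATH})[ \t]+(?P<target>{TARGET})"
)


def match_record(line):
    """Return (path, target) if line is a valid record line, else None."""
    m = RECORD.fullmatch(line.strip())
    return (m.group("path"), m.group("target")) if m else None


print(match_record("Mobile-agent-server: /dir1 atp://www.fict.org:544"))
print(match_record("Mobile-agent-server: / none"))
print(match_record("Mobile-agent-server: /dir1"))
```

The first two lines are accepted and yield their path and target; the
third is rejected because it has no target.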
4. Examples
This section contains an example of how an extended /robots.txt may
be used.
Let us suppose that a fictional site has the following URLs:
http://www.fict.org/
http://www.fict.org/index.html
http://www.fict.org/services/
http://www.fict.org/services/index.html
http://www.fict.org/robots.txt
http://www.fict.org/home/
http://www.fict.org/home/user1/
http://www.fict.org/home/user1/index.html
http://www.fict.org/home/user2/
http://www.fict.org/home/user2/index.html
http://www.fict.org/home/user3/
http://www.fict.org/home/user3/index.html
Let user1.fict.org and user2.fict.org be two hosts equipped for
receiving Mobile Agents, for example by means of the ATP protocol.
The /robots.txt file contains Mobile Agent directives as follows:
Mobile-agent-server: / atp://www.fict.org:8001
Mobile-agent-server: /home/ none
Mobile-agent-server: /home/user1/ atp://user1.fict.org:854
Mobile-agent-server: /home/user2/ atp://user2.fict.org:831
The following matrix shows whether Mobile Agents are supported for
indexing a given document, and on which host:
URL                                          HOST
http://www.fict.org/                         atp://www.fict.org:8001
http://www.fict.org/index.html               atp://www.fict.org:8001
http://www.fict.org/services/                atp://www.fict.org:8001
http://www.fict.org/services/index.html      atp://www.fict.org:8001
http://www.fict.org/robots.txt               atp://www.fict.org:8001
http://www.fict.org/home/                    not available
http://www.fict.org/home/user1/              atp://user1.fict.org:854
http://www.fict.org/home/user1/index.html    atp://user1.fict.org:854
http://www.fict.org/home/user2/              atp://user2.fict.org:831
http://www.fict.org/home/user2/index.html    atp://user2.fict.org:831
http://www.fict.org/home/user3/              not available
http://www.fict.org/home/user3/index.html    not available
5. Security considerations
The Mobile-agent-server record can expose the existence of resources
not otherwise linked to on the site, which may help people guess
URLs.
If the exposed resource is the URL of a document, no further risks
are introduced beyond those already implied by the standard
mechanism.
If the exposed resource is the URL of a site that can host Mobile
Agents, security problems are to be dealt with at the site itself by
means of a proper security model that should allow incoming Robots to
only perform those operations needed for exploring the assigned
website subtrees. However, this issue depends on the specific
technology used to implement the Mobile Robots and is not discussed
here.
The same considerations about impersonation and encryption stated in
the Standard Specification also apply here.
6. References
[1] Koster, M. "A Standard for Robot Exclusion",
http://info.webcrawler.com/mak/projects/robot/norobots.html, June
1994.
[2] Berners-Lee, T., Masinter, L., and M. McCahill, "Uniform Resource
Locators (URL)", RFC 1738, CERN, Xerox PARC, University of Minnesota,
December 1994.
[3] Lange, D. B., "Agent Transfer Protocol - ATP/0.1 Draft", IBM
Tokyo Research Laboratory,
http://www.trl.ibm.co.jp/aglets/atp/atp.htm, July 1996.
[4] Crocker, D., "Standard for the Format of ARPA Internet Text
Messages", STD 11, RFC 822, UDEL, August 1982.
[5] Chang, D. T., and Lange, D. B., "Mobile Agents: A New Paradigm
for Distributed Object Computing on the WWW", IBM Tokyo Research
Laboratory, OOPSLA'96 Workshop "Toward the integration of WWW and
Distributed Object Technology",
http://www.trl.ibm.co.jp/aglets/atp/ma.html.
7. Authors' Addresses
Fabrizio Giudici, fritz@dibe.unige.it, phone: +39-10-3532192
Andrea Sappia, sappia@dibe.unige.it, phone: +39-10-3532192
Electronic Systems and Networking Group
Department of Biophysical and Electronic Engineering
University of Genoa
Via Opera Pia 11/a, 16145 - Genoa, ITALY
Expires August 22, 1997