Internet-Draft                                          P. Gietz
Category: Informational                  University of Tuebingen
<draft-gietz-ldapindex-00.txt>                     P. Valkenburg
Expires: December 25, 1999                               SURFnet
                                                       H. Bekker
                                                         SURFnet
Requirements and overview for an
European LDAP indexing service
Status of this Memo
This document is an Internet-Draft and is in full conformance
with all provisions of Section 10 of RFC2026.
This memo provides information for the Internet community.
This memo does not specify an Internet standard of any kind.
Distribution of this memo is unlimited.
Internet-Drafts are working documents of the Internet
Engineering Task Force (IETF), its areas, and its working
groups. Note that other groups may also distribute working
documents as Internet-Drafts.
Internet-Drafts are draft documents valid for a maximum of
six months and may be updated, replaced, or obsoleted by
other documents at any time. It is inappropriate to use
Internet-Drafts as reference material or to cite them other
than as "work in progress."
The list of current Internet-Drafts can be accessed at
http://www.ietf.org/ietf/1id-abstracts.txt
The list of Internet-Draft Shadow Directories can be
accessed at http://www.ietf.org/shadow.html.
This Internet-Draft will expire on December 25, 1999.
Abstract
This document describes the overall concept of a distributed
indexing system based on the Common Indexing Protocol (CIP)
[1], as it will be implemented in the EC co-funded
DESIRE II project and afterwards maintained as a service by
the European academia. Although the system is designed for
multi-purpose usage and is applicable to other indexing
problems as well, the main focus of this document lies on its
application as an LDAPv3 [2] Directory white pages index.
NOTE:
This document will be accompanied by a more technical document
about the server side of the indexing system.
1. Introduction
The main aim of this document is to define a directory indexing
system that is deployable in a European context, making
directory information available to the research community of
all participating countries. This indexing system fulfills a
need that was already postulated in [3].
To implement an indexing system on such a large scale,
hierarchical index creation and distribution is necessary
for reasons of overall performance and scalability. In such a
model, index servers located at higher levels of the hierarchy
gather the index objects of servers located at lower levels of
the hierarchy. For example, the index server of an organization
collects the index objects of all departmental directory
servers, and the index server of a country collects all index
objects of the organisational index servers. This ends up in
one root index server that includes the index objects of all
country level index servers that are part of the indexing
system.
Since it is not advisable to have one single point of
information retrieval to which all clients that want to
retrieve index information would have to connect, the
collection of index objects has to be redistributed down
the same hierarchy. Since the management of such a big
collection of index objects requires a considerable amount of
hardware power, the collection will not be distributed down to
every single server, but might only reach the country level.
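The upward gathering and the limited downward redistribution
described above can be illustrated with a minimal sketch. It
is not part of CIP or of the DESIRE II software; the class
Node, the level names and the functions gather() and
redistribute() are assumptions made only for this example, and
index objects are represented by plain strings keyed by a
made-up DSI.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """One server in the hierarchy (hypothetical model, not taken from CIP)."""
    name: str
    level: str  # "server", "organisation", "country" or "root"
    children: list = field(default_factory=list)
    own_objects: dict = field(default_factory=dict)  # DSI -> index object (data servers only)
    collection: dict = field(default_factory=dict)   # DSI -> index object held for searching


def gather(node):
    """Collect, unmodified, the index objects of this node and of everything below it."""
    collected = dict(node.own_objects)
    for child in node.children:
        collected.update(gather(child))
    node.collection = collected
    return collected


def redistribute(root):
    """Copy the complete root collection down to country level only."""
    for child in root.children:
        if child.level == "country":
            child.collection = dict(root.collection)


def build_example():
    """Build a two-country hierarchy, gather upwards, then redistribute."""
    de_server = Node("ldap.uni-a.example", "server", own_objects={"dsi-a": "index object A"})
    nl_server = Node("ldap.uni-b.example", "server", own_objects={"dsi-b": "index object B"})
    de_org = Node("uni-a", "organisation", children=[de_server])
    nl_org = Node("uni-b", "organisation", children=[nl_server])
    de = Node("de", "country", children=[de_org])
    nl = Node("nl", "country", children=[nl_org])
    root = Node("root", "root", children=[de, nl])
    gather(root)
    redistribute(root)
    return root, de, nl, de_org
```

In this sketch both country level nodes end up with the
complete collection, while the organisational node keeps only
the index objects it gathered itself.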
The mechanisms proposed in this document can be described as
a subset of the Common Indexing Protocol (CIP), which is seen
as the future standard for indexing in the Internet. Though not
all features defined in CIP are planned to be implemented, the
overall structure should be compatible with this standard.
The whole indexing system of gathering, distributing and
searching index objects should be managed by the server side.
Clients should not need special features for retrieving
the index information, which means that an index server has to
respond to a client the same way a normal server would when
it does not have the requested data: it just gives back a
referral to a server that might have them. In the case of an
index server the probability that the referral points
to a server that has the data is very high. That is the only
difference for the client. Although every client capable of
chasing referrals could be used in the proposed indexing
system, a client that includes special index related features
is preferable because of problems specific to index queries,
such as the possibly large number of referrals that have to
be dealt with. An index aware client can also provide a better
user interface that gives index specific information.
2. Gathering of Index objects
The atomic entities of the indexing system in its first stage
are the index objects of the individual servers that are
included in the indexing system.
The format of the index should be the Tagged Index Object
(TIO), as defined in [4]. The advantage of the TIO is that
all the indexed attributes of one directory entry can be
identified, and search filters including more than one
attribute can be used. The Data Set Identifier (DSI)
should be used to uniquely identify a given data set among all
data sets indexed. All index objects should be stored together
with the DSI and the base-URI(s), which are crucial for
generating referrals to the complete data of an indexed entry.
These index objects will not be modified in their content
during their transport up and down the hierarchy; they
will not be aggregated into bigger index objects. Although
such an aggregation is defined in 3.2.3 of [1], in combination
with the TIO it produces problems that are hard to manage.
Through aggregation the tags of the TIO would change, which
makes the retrieval more difficult. Since the index object
includes information about the data server in its MIME
transport header [5] (the DSI and one or more base-URIs),
retrieval would have to retrace the steps of aggregation to
finally reach the LDAP server. The update of index objects
would likewise be difficult in terms of retrieving the right
index entries in the right index objects, where again the whole
aggregation path has to be followed. If, as proposed here, the
index objects are not changed, the case of an update is quite
straightforward: a new index object is produced and the
old index object just has to be replaced in the index object
collections.
The DSI provides a perfect means for the identification of the
index object to be replaced. Incremental update of single
index objects is included in the TIO definition, which allows
you to specify data blocks for add, delete and update
operations. To unambiguously identify the record for the
delete and update operations a unique identifier of the entry
must be included in the index object. In the case of LDAP
directories this identifier would be the whole untokenized DN.
In a first approach the DESIRE II index system will not use
this feature of incremental updates.
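The whole-object replacement keyed by the DSI can be sketched
as follows. This is only an illustration under assumptions:
the names IndexObject and IndexObjectCollection are invented
here, the TIO body is kept as an opaque string, and
incremental updates are deliberately not modelled, in line
with the first approach described above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IndexObject:
    dsi: str          # Data Set Identifier of the indexed data set
    base_uris: tuple  # one or more base-URIs used to build referrals
    payload: str      # TIO body, treated as opaque and never rewritten


class IndexObjectCollection:
    """Holds unaggregated index objects, exactly one per DSI."""

    def __init__(self):
        self._by_dsi = {}

    def put(self, obj):
        """Add a new index object or replace the one with the same DSI.

        Returns the replaced object, or None if the DSI was not yet known.
        """
        previous = self._by_dsi.get(obj.dsi)
        self._by_dsi[obj.dsi] = obj
        return previous

    def get(self, dsi):
        """Return the index object stored for this DSI, or None."""
        return self._by_dsi.get(dsi)

    def __len__(self):
        return len(self._by_dsi)
```

Because every index object travels unmodified, an update at
any level of the hierarchy is a single put() with the new
object; no aggregation path has to be followed.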
The index objects can be built by dedicated crawlers that
crawl through the DIT sub tree of one server to collect the
data. In a second step, a TIO converter can then produce the
index object from those data. The decision which of the entries
to crawl and which attribute values to collect has to be made
by each participating organisation or by the maintainer of the
individual server. These definitions should be made
via crawler access policies stored in the directory itself and
understood by the crawler. A separate document will define
the mechanisms and the storage model for such a crawler access
policy. To make sure that only crawlers compliant with this
policy mechanism are able to get the data, the crawler
has to authenticate itself. In a first stage, the crawler could
be directed via access control mechanisms inherent in the
Directories. With such a mechanism in place it becomes
irrelevant, in terms of privacy, who maintains and
runs such a crawler. It could either be the organisation itself,
the National Research Network for all or a subset of
organisations in a country, or even the maintainer of the
central index objects at the root of the system.
The individual servers that are part of the index system will
be registered. Registered servers will be put in a list, which
will be accessed by the crawler or the maintainer of the
crawler to retrieve knowledge about host and port of the
server. The details of the registration process are outside
the scope of this document.
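How a crawler might combine the list of registered servers
with a crawler access policy can be sketched as follows. The
real policy format is left to a separate document, so
everything here is an assumption: RegisteredServer, CrawlPolicy
and crawl() are hypothetical names, directory entries are
plain in-memory dictionaries instead of LDAP search results,
and subtree exclusion is done by a naive string suffix
comparison on the DN.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RegisteredServer:
    """One line of the (hypothetical) registration list."""
    host: str
    port: int
    base_dn: str


@dataclass(frozen=True)
class CrawlPolicy:
    """Hypothetical stand-in for a crawler access policy."""
    allowed_attributes: frozenset  # lower-cased attribute names that may be collected
    excluded_subtrees: tuple = ()  # DNs of sub trees that must not be crawled


def crawl(entries, policy):
    """Return the data a policy-compliant crawler may collect.

    entries maps a DN to a dict of attribute name -> list of values.
    The DN is kept as the key so that it can serve as unique identifier.
    """
    collected = {}
    for dn, attributes in entries.items():
        normalized_dn = dn.lower()
        # Naive suffix check; a real crawler would compare parsed DNs.
        if any(normalized_dn.endswith(subtree.lower())
               for subtree in policy.excluded_subtrees):
            continue
        visible = {name: list(values) for name, values in attributes.items()
                   if name.lower() in policy.allowed_attributes}
        if visible:
            collected[dn] = visible
    return collected
```

The output of crawl() would then be handed to a TIO converter,
which is not shown here.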
3. Distribution of the index objects
To prevent a single index entry point to which all the world's
clients would connect, the gathered index objects (TIO
collections) have to be distributed downwards again. Every
country level should provide an index server for the complete
TIO collection. If appropriate, this index could be distributed
to several index servers at different locations in the
respective country.
The downward distribution of the indices, as well as the
upward sending of the indices to be gathered, can be
performed via simple FTP transfer for a proof of concept.
More advanced transport mechanisms defined in the CIP
Transport Protocols draft [6] can eventually be used instead.
4. Query routing
A client should not have to provide special features for
using the index system. It connects to an index server in the
same way it would connect to any other directory server. The
access protocol is plain LDAP (v3). The server should then
perform the following algorithm:
Perform a search in the locally stored data set, and return
the data if found. If no data matched the search filter, the
server should consult the index server to search for
appropriate entries and return the referrals to the entries,
based on the base-URI found in the index.
The user could influence this algorithm by adding a base DN
which defines the entry point and limits the search. The user
can thereby, for example, start the search from the root level,
or from any other level in the hierarchy. In any case the client
does not have to know anything about the indexing system
except the hostname and port number of one nearby server
that is part of the index system.
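The query routing algorithm of this section can be sketched as
follows. It is a simplified illustration, not an LDAP server:
handle_search() and matches() are hypothetical names, the
search filter is reduced to a case-insensitive equality match
on one attribute, index objects are modelled as dictionaries
holding complete lower-cased values rather than TIO tokens,
and the optional base DN is not modelled.

```python
def matches(entry, attribute, value):
    """Case-insensitive equality match on one attribute (simplified filter)."""
    wanted = value.lower()
    return any(v.lower() == wanted for v in entry.get(attribute, []))


def handle_search(local_entries, index_objects, attribute, value):
    """Answer from local data if possible, otherwise return referrals.

    local_entries: dict of DN -> {attribute: [values]}
    index_objects: list of dicts with the keys "base_uris" (list of URIs)
                   and "tokens" (attribute -> set of lower-cased values)
    """
    # Step 1: search the locally stored data set.
    found = {dn: entry for dn, entry in local_entries.items()
             if matches(entry, attribute, value)}
    if found:
        return {"entries": found, "referrals": []}

    # Step 2: nothing matched locally, so consult the index and
    # build referrals from the base-URIs of matching index objects.
    referrals = []
    for obj in index_objects:
        if value.lower() in obj["tokens"].get(attribute, set()):
            for uri in obj["base_uris"]:
                if uri not in referrals:
                    referrals.append(uri)
    return {"entries": {}, "referrals": referrals}
```

From the point of view of the client the result is either
data or a list of referrals, exactly as with any other
directory server.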
5. The overall concept
* A crawler collects the data to be indexed from standard
organisational LDAP servers using LDAP searches.
* A TIO converter builds Tagged Index Objects of these servers,
which have to include knowledge for referrals (Base-URI) in
the MIME wrapper.
* A TIO transporter passes them on to a country
level referral index server (TIO/LDAP Referral Server), using
one of the CIP defined transport protocols (e.g., HTTP).
* The referral index server stores the index objects.
* The TIO transporter distributes the country index
objects to a root referral index server.
* The TIO transporter distributes the index objects of the root
referral index server back to the country level referral
index servers.
* An LDAP client (dedicated client, web browser, mail agent,
etc.) sends an LDAP search to a country level LDAP index
server (native protocol server).
* The country level LDAP index server fetches LDAP referral(s)
from the country referral index server which refer to the
data matching the search.
* The country level LDAP index server gives back the
referral(s) to the Client.
* The Client interprets the referral(s) and retrieves the data
from the original LDAP server.
6. Security Considerations
6.1 Personal data and privacy legislation
Since white pages directories contain personal data (e.g.
name, email address, telephone number), it is important
to conform to European privacy legislation. Even if all the
data are public data and published in the directory with the
consent of the affected persons, it is against that
legislation to make available a bulk of such data. While
being transferred from one server to another, the index
objects are vulnerable to theft by commercial data brokers and
spammers. It is therefore necessary to protect the index
object data while transferring them on the net.
6.2 Encryption of the index objects
To secure the index object distribution process the data
should be encrypted. Since CIP data are MIME encoded, a MIME
compatible encryption method is preferable, because then the
security feature is independent of the transport protocol,
be it HTTP, FTP or email. The CIP authors advise
the use of PGP-based MIME security as defined in [7]. PGP has
a variety of advantages:
* It is commonly used in the Internet.
* It is easy to include in a MIME application.
* It provides means for public key asymmetrical encryption.
* It provides means for symmetrical encryption as well.
* In addition it provides a means of signing the data in a
way that even one missing byte in the data makes the
signature invalid.
* All PGP functionality can be activated by a program without
human intervention.
* If implemented with care, the passphrase that has to be
entered into the PGP program can be securely stored and used
without the possibility of snooping from outside.
6.3 Authentication between servers
All servers included in the indexing system are known due to a
registration process. The maintainer of the data servers can
define which data are to be included into the index. The
index servers and the crawlers that take part in the index
object gathering and distribution are also known. To prevent
wrong index objects from being included in the index server,
index object supplying programs should authenticate themselves.
Servers could provide special application entries with
passwords to bind to before sending the data. A better method
of authentication would be the signing of the data via a
digital signature. This again could be implemented with a
public key infrastructure like PGP.
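The effect of signing an index object before transfer can be
illustrated with a small sketch. Note the substitution: this
is not PGP and not a public key signature. Since the Python
standard library contains no PGP implementation, a keyed hash
(HMAC-SHA256) with a shared secret is used here purely as a
stand-in to show that a receiver can reject an index object
that was altered or truncated on the way. The function names
and the key value are assumptions made for this example.

```python
import hashlib
import hmac


def sign(data: bytes, key: bytes) -> str:
    """Return a hex-encoded HMAC-SHA256 tag over the index object bytes."""
    return hmac.new(key, data, hashlib.sha256).hexdigest()


def verify(data: bytes, key: bytes, signature: str) -> bool:
    """Check the tag in constant time; any changed or missing byte fails."""
    return hmac.compare_digest(sign(data, key), signature)
```

An index server would call verify() on every received index
object and store it only if the check succeeds. With a public
key system such as PGP the sender would sign with a private
key instead of a shared secret, so that the receiving server
only needs the public key of the sender.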
7. Acknowledgement
Work on this specification was supported by the European
Commission and by DANTE, Cambridge as part of the EC Project
DESIRE II.
8. References
[1] Allen, J., Mealling, M., "The Architecture of the Common
Indexing Protocol (CIP)", draft-ietf-find-cip-arch-02.txt
(work in progress), November 1998.
[2] Wahl, M., Howes, T. and S. Kille, "Lightweight Directory
Access Protocol (v3)", RFC 2251, December 1997.
[3] Postel, J., Anderson, C., "White Pages Meeting Report",
RFC 1588, February 1994.
[4] Hedberg, R., Greenblatt, B., Moats, R. and M. Wahl, "A
Tagged Index Object for use in the Common Indexing
Protocol", draft-ietf-find-cip-tagged-07.txt (work in
progress), March 1998.
[5] Allen, J., Mealling, M., "MIME Object Definitions for the
Common Indexing Protocol (CIP)",
draft-ietf-find-cip-mime-03.txt (work in progress),
November 1998.
[6] Allen, J., Leach, P. J., "CIP Transport Protocols",
draft-ietf-find-cip-trans-01.txt (work in progress),
April 1999.
[7] Elkins, M., "MIME Security with Pretty Good Privacy
(PGP)", RFC 2015, October 1996.
9. Authors' Addresses
Peter Gietz
ZDV, Universitaet Tuebingen
Waechterstr.76
D-72074 Tuebingen
Germany
Phone: +49 7073 2970336
Email: peter.gietz@directory.dfn.de
Peter Valkenburg
SURFnet
Postbus 19035
NL-3501 DA Utrecht
The Netherlands
Phone: +31 30 2305305
Email: Peter.Valkenburg@SURFnet.nl
Henny Bekker
SURFnet Expertise Centrum
Postbus 19035
NL-3501 DA Utrecht
The Netherlands
Phone: +31 30 2305305
Email: Henny.Bekker@sec.nl