Internet-Draft                                           P. Gietz
Category: Informational                   University of Tuebingen
<draft-gietz-ldapindex-00.txt>                      P. Valkenburg
Expires: December 25, 1999                                SURFnet
                                                        H. Bekker
                                                          SURFnet


                   Requirements and overview for a 
                    European LDAP indexing service
                    

Status of this Memo


   This document is an Internet-Draft and is in full conformance 
   with all provisions of Section 10 of RFC2026.
   
   This memo provides information for the Internet community. 
   This memo does not specify an Internet standard of any kind. 
   Distribution of this memo is unlimited.
 
   Internet-Drafts are working documents of the Internet 
   Engineering Task Force (IETF), its areas, and its working 
   groups. Note that other groups may also distribute working 
   documents as Internet-Drafts.
   
   Internet-Drafts are draft documents valid for a maximum of 
   six months and may be updated, replaced, or obsoleted by 
   other documents at any time. It is inappropriate to use 
   Internet-Drafts as reference material or to cite them other 
   than as "work in progress."
   
   The list of current Internet-Drafts can be accessed at
   http://www.ietf.org/ietf/1id-abstracts.txt
                                                                  
   The list of Internet-Draft Shadow Directories can be 
   accessed at http://www.ietf.org/shadow.html.
   
   This Internet-Draft will expire on December 25, 1999.


Abstract

   This document describes the overall concept of a distributed 
   indexing system based on the Common Indexing Protocol (CIP)
   [1], as it will be implemented in the EC co-funded 
   DESIRE II project and afterwards maintained as a service by 
   the European academic community. Although the system is
   designed for multi-purpose usage and is applicable to other
   indexing problems as well, the main focus of this document
   lies on its application as an LDAPv3 [2] Directory white pages
   index.


   NOTE:   
   This document will be accompanied by a more technical document
   about the server side of the indexing system.

1. Introduction

   The main aim of this document is to define a directory indexing 
   system that is deployable in a European context, making 
   directory information available to the research community of 
   all participating countries. This indexing system fulfills a
   need that was already postulated in [3].

   To implement an indexing system on such a large scale,
   hierarchical index creation and distribution is necessary 
   for reasons of overall performance and scalability. In such a
   model, index servers located at higher levels of the hierarchy
   gather the index objects of servers located at lower levels of
   the hierarchy. For example, the index server of an organisation
   collects the index objects of all departmental directory
   servers, and the index server of a country collects all index
   objects of the organisational index servers. This ends in 
   one root index server that holds the index objects of all
   country level index servers that are part of the indexing
   system. 
   
   Since it is not advisable to have one single point of 
   information retrieval to which all clients that want to 
   retrieve index information would have to connect, the 
   collection of index objects has to be redistributed down
   the same hierarchy. Since the management of such a big 
   collection of index objects requires a considerable amount of 
   hardware resources, it will not be distributed down to the
   single server, but might only reach country level. 
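
   The gathering and redistribution described above can be
   sketched as follows. This is a minimal illustration in Python,
   not part of CIP or of the planned implementation. The class
   IndexServer, its fields, the level names and the DSI strings
   are hypothetical and chosen for this example only. Index
   objects are passed along unchanged and are keyed by their DSI.

```python
from dataclasses import dataclass, field


@dataclass
class IndexServer:
    """One node of the index hierarchy (hypothetical model)."""
    name: str
    level: str                                       # "root", "country", "organisation"
    own_objects: dict = field(default_factory=dict)  # DSI -> index object
    children: list = field(default_factory=list)
    collection: dict = field(default_factory=dict)   # objects this server can search

    def gather(self) -> dict:
        """Collect the index objects of this server and of all servers below it."""
        gathered = dict(self.own_objects)
        for child in self.children:
            gathered.update(child.gather())
        self.collection = gathered
        return gathered

    def redistribute(self, full_collection: dict, stop_level: str = "country") -> None:
        """Push the complete collection down, but no further than stop_level."""
        self.collection = dict(full_collection)
        if self.level == stop_level:
            return
        for child in self.children:
            child.redistribute(full_collection, stop_level)


# Made-up hierarchy: two organisations in two countries below one root.
org_a = IndexServer("org-a", "organisation", {"dsi-a": "index object A"})
org_b = IndexServer("org-b", "organisation", {"dsi-b": "index object B"})
country_x = IndexServer("country-x", "country", children=[org_a])
country_y = IndexServer("country-y", "country", children=[org_b])
root = IndexServer("root", "root", children=[country_x, country_y])

root.redistribute(root.gather())
```

   After this run each country level server holds the complete
   collection, while the organisational servers still hold only
   their own index objects.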

   The mechanisms proposed in this document can be described as 
   a subset of the Common Indexing Protocol (CIP), which is seen
   as the future standard for indexing in the Internet. Though not 
   all features defined in CIP are planned to be implemented, the 
   overall structure should be compatible with this standard.

   The whole indexing system of gathering, distributing and 
   searching index objects should be managed by the server side. 
   Clients should not need special features for retrieving
   the index information, which means that an index server has to 
   respond to a client the same way a normal server would if it 
   does not hold the requested data: it just gives back a 
   referral to a server that might hold them. In the case of an 
   index server the probability that the referral points 
   to a server that has the data is very high. That is the only 
   difference for the client. Although every client capable of 
   chasing referrals could be used in the proposed indexing 
   system, a client that includes special index related features
   is preferable because of problems specific to index queries,
   such as the possibly very large number of referrals that have
   to be dealt with. An index aware client can also provide a
   better user interface that gives index specific information. 

2. Gathering of Index objects

   The atomic entities of the indexing system in its first stage 
   are the index objects of the single servers that are included 
   in the indexing system. 

   The format of the index should be the Tagged Index Object
   (TIO), as defined in [4]. The advantage of the TIO is that 
   all the indexed attributes of one directory entry can be
   identified, so that search filters including more than one
   attribute can be used. The Data Set Identifier (DSI) 
   should be used to uniquely identify a given data set among all
   data sets indexed. All index objects should be stored together
   with the DSI and the base-URI(s), which are crucial for
   generating referrals to the complete data of an indexed entry.
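
   To illustrate why tags allow filters over more than one
   attribute, the following Python sketch models a stored index
   object in memory. It is a simplification and does not
   reproduce the TIO syntax of [4]. The class name, the field
   names, the DSI value and the base-URI are assumptions made for
   this example. Each tag stands for one directory entry, so a
   filter on several attributes reduces to intersecting tag sets.

```python
from dataclasses import dataclass, field


@dataclass
class StoredIndexObject:
    """An index object as kept by an index server (simplified model)."""
    dsi: str        # Data Set Identifier of the indexed data set
    base_uris: list  # base-URI(s) used to build referrals
    # attribute name -> token -> set of tags; one tag stands for one entry
    tokens: dict = field(default_factory=dict)

    def matching_tags(self, conditions: dict) -> set:
        """Return the tags of entries that satisfy every attribute/token pair."""
        result = None
        for attribute, token in conditions.items():
            tags = self.tokens.get(attribute, {}).get(token.lower(), set())
            result = set(tags) if result is None else result & tags
        return result if result is not None else set()


tio = StoredIndexObject(
    dsi="example-dsi-1",                                   # made-up value
    base_uris=["ldap://ldap.example.org/o=Example,c=DE"],  # made-up value
    tokens={
        "cn": {"peter": {1, 2}, "anna": {3}},
        "ou": {"zdv": {1, 3}, "physics": {2}},
    },
)

# Only the entry with tag 1 has both cn=peter and ou=zdv.
hits = tio.matching_tags({"cn": "peter", "ou": "zdv"})
```

   Without tags the index would only show that "peter" and "zdv"
   each occur somewhere in the data set, not that they occur in
   the same entry.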

   These index objects will not be modified in their content
   during their transport up and down the hierarchy; they 
   will not be aggregated into bigger index objects. Although 
   such an aggregation is defined in 3.2.3 of [1], in combination
   with the TIO it produces problems that are hard to manage.
   Through aggregation the tags of the TIO would change, which
   makes retrieval more difficult. Since the index object includes
   information about the data server in its MIME transport header
   [5] (the DSI and one or more base-URIs), retrieval would
   have to retrace the steps of aggregation to finally reach
   the LDAP server. The update of index objects would likewise be
   difficult in terms of finding the right index entries in 
   the right index objects, where again the whole aggregation 
   path has to be followed. If, as proposed here, the index
   objects are not changed, the case of an update is quite
   straightforward: a new index object is produced and the
   old index object just has to be replaced in the index object
   collections. 
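
   Because index objects are kept unaggregated, a full update
   reduces to a replacement keyed by DSI. A minimal Python sketch,
   assuming a collection held as a mapping from DSI to index
   object; the function name, the dictionary layout and the
   "version" field are hypothetical:

```python
def replace_index_object(collection: dict, new_object: dict):
    """Store new_object under its DSI and return the object it replaced, if any."""
    old_object = collection.get(new_object["dsi"])
    collection[new_object["dsi"]] = new_object
    return old_object


# Made-up collection with two data sets.
collection = {
    "example-dsi-1": {"dsi": "example-dsi-1", "version": 1},
    "example-dsi-2": {"dsi": "example-dsi-2", "version": 1},
}
replaced = replace_index_object(collection, {"dsi": "example-dsi-1", "version": 2})
```

   No other index object in the collection has to be read or
   rewritten, which is the simplification that aggregation
   would give up.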

   The DSI provides a reliable means of identifying the
   index object to be replaced. Incremental update of single 
   index objects is included in the TIO definition, which allows
   data blocks to be specified for add, delete and update
   operations. To unambiguously identify the record for the 
   delete and update operations, a unique identifier of the entry
   must be included in the index object. In the case of LDAP
   directories this identifier would be the whole untokenized DN.

   In a first approach the DESIRE II index system will not use
   this feature of incremental updates.
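
   Although incremental updates are not planned for the first
   stage, the role of the untokenized DN as record identifier can
   be sketched as follows. The Python function below is an
   illustration only. The change format (operation, DN,
   attributes) is an assumption of this example and not the block
   syntax of [4], and an update is assumed to replace all indexed
   attributes of the entry.

```python
def apply_changes(entries: dict, changes: list) -> dict:
    """Apply (operation, dn, attributes) changes to a copy of entries."""
    updated = {dn: dict(attrs) for dn, attrs in entries.items()}
    for operation, dn, attributes in changes:
        if operation == "add":
            if dn in updated:
                raise ValueError(f"entry already indexed: {dn}")
            updated[dn] = dict(attributes)
        elif operation == "update":
            if dn not in updated:
                raise KeyError(dn)
            updated[dn] = dict(attributes)
        elif operation == "delete":
            del updated[dn]  # raises KeyError if the DN is unknown
        else:
            raise ValueError(f"unknown operation: {operation}")
    return updated


# Made-up entries, keyed by their complete DN.
indexed = {
    "cn=Anna Example,o=Example,c=DE": {"cn": "Anna Example"},
    "cn=Old Entry,o=Example,c=DE": {"cn": "Old Entry"},
}
result = apply_changes(indexed, [
    ("add", "cn=New Entry,o=Example,c=DE", {"cn": "New Entry"}),
    ("update", "cn=Anna Example,o=Example,c=DE", {"cn": "Anna B. Example"}),
    ("delete", "cn=Old Entry,o=Example,c=DE", None),
])
```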

   The index objects can be built by dedicated crawlers that 
   crawl through the DIT subtree of one server to collect the 
   data. In a second step, a TIO converter can then produce the 
   index object from those data. The decision which entries 
   to crawl and which attribute values to collect has to be made 
   by each participating organisation, or by the maintainer of the
   single server. These definitions should be made
   via crawler access policies stored in the directory itself and
   understood by the crawler. A separate document will define
   the mechanisms and the storage model for such a crawler access 
   policy. To make sure that only crawlers compliant with this
   policy mechanism are able to get the data, the crawler 
   has to authenticate itself. In a first stage, the crawler could
   be controlled via the access control mechanisms inherent in the 
   Directories. With such a mechanism in place it becomes 
   irrelevant in terms of privacy issues who maintains and 
   runs such a crawler. It could be the organisation itself, 
   the National Research Network for all or a subset of 
   organisations in a country, or even the maintainer of the 
   central index objects at the root of the system.
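
   The effect of a crawler access policy on the crawled data can
   be sketched as follows. Since the policy format is left to a
   separate document, the representation used here (a list of
   attributes released for indexing and a list of excluded
   subtrees) is purely hypothetical, as are the function name and
   the sample entries. Entries are given as plain Python data
   rather than fetched via LDAP, to keep the example
   self-contained.

```python
def select_for_index(entries, allowed_attributes, excluded_suffixes=()):
    """Keep only the entries and attributes that the policy releases for indexing."""
    allowed = {name.lower() for name in allowed_attributes}
    excluded = [suffix.lower() for suffix in excluded_suffixes]
    selected = []
    for dn, attributes in entries:
        # Naive string comparison; real DN matching needs proper normalisation.
        if any(dn.lower().endswith(suffix) for suffix in excluded):
            continue
        kept = {k: v for k, v in attributes.items() if k.lower() in allowed}
        if kept:
            selected.append((dn, kept))
    return selected


# Made-up crawl result: (DN, attributes) pairs.
crawled = [
    ("cn=Anna Example,ou=Physics,o=Example,c=DE",
     {"cn": ["Anna Example"], "mail": ["anna@example.org"], "homePhone": ["12345"]}),
    ("cn=Secret Person,ou=Board,o=Example,c=DE",
     {"cn": ["Secret Person"], "mail": ["secret@example.org"]}),
]
released = select_for_index(crawled, ["cn", "mail"], ["ou=Board,o=Example,c=DE"])
```

   Only the first entry is passed on to the TIO converter, and
   without its homePhone attribute.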

   The single servers that are part of the index system will be 
   registered. Registered servers will be put in a list, which
   will be accessed by the crawler or the maintainer of the
   crawler to retrieve the host and port of each server. The 
   details of the registration process are outside the scope of
   this document.

3. Distribution of the index objects

   To prevent a single index entry point to which all the world's
   clients would connect, the gathered index objects (TIO 
   collections) have to be distributed downwards again. Every 
   country level should provide an index server for the complete
   TIO collection. If appropriate, this index could be distributed
   to several index servers at different locations in the
   respective country. 

   The downward distribution of the indices, as well as the 
   upward sending of the indices to be gathered, can be 
   performed via simple FTP transfer for a proof of concept. 
   More advanced transport mechanisms defined in the CIP 
   Transport Protocols draft [6] can be used instead later.

4. Query routing

   Clients should not have to provide special features for
   using the index system. A client connects to an index server in
   the same way it would connect to any other directory server.
   The access protocol is plain LDAP (v3). The server should then
   perform the following algorithm:

   Perform a search in the locally stored data set, and return 
   the data if found. If no data matched the search filter, the
   server should consult the index server to search for
   appropriate entries and return the referrals to the entries,
   based on the base-URI found in the index. 
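
   A minimal Python sketch of this algorithm, assuming that the
   local data set and the index are available as in-memory
   structures. The function name, the dictionary layout and the
   sample URIs are hypothetical, and only a single equality match
   on one attribute is handled.

```python
def route_query(local_entries: dict, index_objects: list,
                attribute: str, value: str) -> dict:
    """Answer from local data if possible, otherwise return referrals."""
    wanted = value.lower()
    # Step 1: search the locally stored data set.
    found = [
        dn for dn, attrs in local_entries.items()
        if wanted in (v.lower() for v in attrs.get(attribute, []))
    ]
    if found:
        return {"entries": found, "referrals": []}
    # Step 2: nothing local, so consult the index and collect base-URIs.
    referrals = []
    for index_object in index_objects:
        if wanted in index_object["tokens"].get(attribute, {}):
            for uri in index_object["base_uris"]:
                if uri not in referrals:
                    referrals.append(uri)
    return {"entries": [], "referrals": referrals}


# Made-up local data and one made-up index object.
local = {"cn=Anna Example,o=Example,c=DE": {"cn": ["Anna Example"]}}
index = [{
    "base_uris": ["ldap://ldap.example.nl/o=Example,c=NL"],
    "tokens": {"cn": {"peter"}},
}]

local_hit = route_query(local, index, "cn", "Anna Example")
referred = route_query(local, index, "cn", "Peter")
```

   The first query is answered from local data. The second one
   yields a referral built from the base-URI stored with the
   matching index object, which the client then follows.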

   The user could influence this algorithm by adding a base DN
   which defines the entry point and limits the search. The user
   can thereby, e.g., start the search from the root level or
   from any other level in the hierarchy. In any case the client
   does not have to know anything about the indexing system 
   except the hostname and port number of one nearby server
   that is part of the index system.


5. The overall concept

   * A crawler collects the data to be indexed from standard
     organisational LDAP servers using LDAP searches.
   * A TIO converter builds Tagged Index Objects of these servers,
     which have to include knowledge for referrals (Base-URI) in
     the MIME wrapper.
   * A TIO transporter passes them on to a country 
     level referral index server (TIO/LDAP Referral Server), using
     one of the CIP defined transport protocols (e.g., HTTP).
   * The referral index server stores the index objects. 
   * The TIO transporter distributes the country index
     objects to a root referral index server.
   * The TIO transporter distributes the index objects of the root
     referral index server back to the country level referral
     index servers.
   * An LDAP client (dedicated client, web browser, mail agent,
     etc.) sends an LDAP search to a country level LDAP index
     server (native protocol server).
   * The country level LDAP index server fetches LDAP referral(s)
     from the country referral index server which refer to the
     data matching the search.
   * The country level LDAP index server gives back the
     referral(s) to the Client.
   * The Client interprets the referral(s) and retrieves the data
     from the original LDAP server.


6. Security Considerations

6.1 Personal data and privacy legislation

   Since white pages directories contain personal data (e.g. 
   name, email address, telephone number), it is important 
   to conform to European privacy legislation. Even if all the 
   data are public data and published in the directory with the
   consent of the affected persons, it is against that 
   legislation to make such data available in bulk. While
   being transferred from one server to another, the index objects
   are vulnerable to theft by commercial data brokers and 
   spammers. It is therefore necessary to protect the index 
   object data while transferring them over the net.

6.2 Encryption of the index objects

   To secure the index object distribution process the data 
   should be encrypted. Since CIP data are MIME encoded, a MIME 
   compatible encryption method is preferable, because then the 
   security feature is independent of the transport protocol, 
   be it HTTP, FTP or email. The CIP authors advise 
   using PGP-based MIME security (PGP/MIME) as defined in [7].
   PGP has a variety of advantages:

   * It is commonly used in the Internet.
   * It is easy to include in a MIME application.
   * It provides means for public key (asymmetric) encryption.
   * It provides means for symmetric encryption as well.
   * In addition it provides a means of signing the data in a 
     way that even one missing byte in the data makes the 
     signature invalid.
   * All PGP functionality can be activated by a program without
     human intervention.
   * If implemented with care, the passphrase that has to be
     supplied to the PGP program can be securely stored and used 
     without the possibility of snooping from outside.


6.3 Authentication between servers

   All servers included in the indexing system are known through a
   registration process. The maintainer of a data server can 
   define which data are to be included in the index. The 
   index servers and the crawlers that take part in the index
   object gathering and distribution are also known. To prevent
   wrong index objects from being included in the index server, 
   index object supplying programs should authenticate themselves.
   Servers could provide special application entries with
   passwords to bind to before sending the data. A better method
   of authentication would be the signing of the data via a 
   digital signature. This again could be implemented with a 
   public key system such as PGP.
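
   The acceptance check an index server performs on incoming
   index objects can be sketched as follows. The Python standard
   library offers no PGP support, so this example substitutes a
   keyed message digest (HMAC with SHA-256) over a shared secret
   for the public key signature discussed above. It is therefore
   a stand-in that shows the check itself, not the PGP mechanism.
   The function names, the secret and the sample data are
   assumptions of this example.

```python
import hashlib
import hmac


def sign(data: bytes, secret: bytes) -> str:
    """Return a keyed digest of data (stand-in for a digital signature)."""
    return hmac.new(secret, data, hashlib.sha256).hexdigest()


def is_authentic(data: bytes, signature: str, secret: bytes) -> bool:
    """Accept data only if the signature matches; compared in constant time."""
    return hmac.compare_digest(sign(data, secret), signature)


secret = b"made-up shared secret"        # hypothetical, agreed at registration
index_object = b"made-up index object payload"
signature = sign(index_object, secret)

accepted = is_authentic(index_object, signature, secret)
# Dropping a single byte invalidates the signature.
truncated_accepted = is_authentic(index_object[:-1], signature, secret)
```

   An index server would store or forward an index object only
   if this check succeeds.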


7. Acknowledgement

   Work on this specification was supported by the European
   Commission and by DANTE, Cambridge as part of the EC Project 
   DESIRE II.


8. References

   [1]  Allen, J. and M. Mealling, "The Architecture of the Common
       Indexing Protocol (CIP)", draft-ietf-find-cip-arch-02.txt
       (work in progress), November 1998.

   [2]  Wahl, M., Howes, T. and S. Kille, "Lightweight Directory
       Access Protocol (v3)", RFC 2251, December 1997.

   [3]  Postel, J. and C. Anderson, "White Pages Meeting Report", 
       RFC 1588, February 1994.

   [4]  Hedberg, R., Greenblatt, B., Moats, R. and M. Wahl, "A
       Tagged Index Object for use in the Common Indexing
       Protocol", draft-ietf-find-cip-tagged-07.txt (work in 
       progress), March 1998.

   [5]  Allen, J. and M. Mealling, "MIME Object Definitions for the
       Common Indexing Protocol (CIP)",
       draft-ietf-find-cip-mime-03.txt (work in progress),
       November 1998.

   [6]  Allen, J. and P. J. Leach, "CIP Transport Protocols", 
       draft-ietf-find-cip-trans-01.txt (work in progress), 
       April 1999.

   [7]  Elkins, M., "MIME Security with Pretty Good Privacy 
       (PGP)", RFC 2015, October 1996.


9. Authors' Addresses

   Peter Gietz
   ZDV, Universitaet Tuebingen
   Waechterstr.76
   D-72074 Tuebingen
   Germany

   Phone: +49 7073 2970336
   Email: peter.gietz@directory.dfn.de

   Peter Valkenburg
   SURFnet
   Postbus 19035
   NL-3501 DA Utrecht
   The Netherlands

   Phone: +31 30 2305305
   Email: Peter.Valkenburg@SURFnet.nl

   Henny Bekker
   SURFnet Expertise Centrum
   Postbus 19035
   NL-3501 DA Utrecht
   The Netherlands

   Phone: +31 30 2305305
   Email: Henny.Bekker@sec.nl