Internet-Draft                                                  P. Gietz
Category: Informational                          University of Tuebingen
                                                           P. Valkenburg
Expires: December 25, 1999                                       SURFnet
                                                               H. Bekker
                                                                 SURFnet

Requirements and overview for an European LDAP indexing service

Status of this Memo

This document is an Internet-Draft and is in full conformance with all provisions of Section 10 of RFC2026.

This memo provides information for the Internet community. This memo does not specify an Internet standard of any kind. Distribution of this memo is unlimited.

Internet-Drafts are working documents of the Internet Engineering Task Force (IETF), its areas, and its working groups. Note that other groups may also distribute working documents as Internet-Drafts.

Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress."

The list of current Internet-Drafts can be accessed at http://www.ietf.org/ietf/1id-abstracts.txt

The list of Internet-Draft Shadow Directories can be accessed at http://www.ietf.org/shadow.html.

This Internet-Draft will expire on December 25, 1999.

Abstract

This document describes the overall concept of a distributed indexing system based on the Common Indexing Protocol (CIP) [1], as it will be implemented in the EC co-funded DESIRE II project and afterwards maintained as a service by the European academic community. Although the system is designed to be applicable to other indexing problems as well, the main focus of this document is its application as an LDAPv3 [2] directory white pages index.

NOTE: This document will be accompanied by a more technical document about the server side of the indexing system.

1. Introduction

The main aim of this document is to define a directory indexing system that is deployable in a European context, making directory information available to the research community of all participating countries. This indexing system fulfills a need that was already postulated in [3].

To implement an indexing system on such a large scale, hierarchical index creation and distribution is necessary for reasons of overall performance and scalability. In such a model, index servers located at higher levels of the hierarchy gather the index objects of servers located at lower levels. For example, the index server of an organisation collects the index objects of all departmental directory servers, and the index server of a country collects the index objects of all organisational index servers. This ends in one root index server that holds the index objects of all country level index servers that are part of the indexing system.

Since it is not advisable to have one single point of information retrieval to which all clients that want to retrieve index information would have to connect, the collection of index objects has to be redistributed down the same hierarchy. Since the management of such a big collection of index objects requires a considerable amount of hardware power, the collection will not be distributed down to each single server, but might only reach the country level.

The mechanisms proposed in this document can be described as a subset of the Common Indexing Protocol (CIP), which is seen as the future standard for indexing in the Internet. Though not all features defined in CIP are planned to be implemented, the overall structure should be compatible with this standard.

The whole indexing system of gathering, distributing and searching index objects should be managed by the server side.
Clients should not need special features for retrieving the index information. This means that an index server has to respond to a client in the same way a normal server would when it does not hold the requested data: it simply returns a referral to a server that might hold it. In the case of an index server, the probability that the referral points to a server that has the data is very high. That is the only difference for the client.

Although every client capable of chasing referrals could be used in the proposed indexing system, a client that includes special index related features is preferable because of problems specific to index queries, such as the large number of referrals that might have to be dealt with. An index aware client can also provide a better user interface that presents index specific information.

2. Gathering of Index objects

The atomic entities of the indexing system in its first stage are the index objects of the single servers that are included in the indexing system. The format of the index should be the Tagged Index Object (TIO), as defined in [4]. The advantage of the TIO is that all the indexed attributes of one directory entry can be identified, so that search filters including more than one attribute can be used. The Data Set Identifier (DSI) should be used to uniquely identify a given data set among all data sets indexed. All index objects should be stored together with the DSI and the base-URI(s), which are crucial for generating referrals to the complete data of an indexed entry.

These index objects will not be modified in their content during their transport up and down the hierarchy; they will not be aggregated into bigger index objects. Although such an aggregation is defined in 3.2.3 of [1], in combination with the TIO it produces problems that are hard to manage. Through aggregation the tags of the TIO would change, which makes retrieval more difficult.
Since the index object includes information about the data server in its MIME transport header [5] (the DSI and one or more base-URIs), retrieval would have to retrace the steps of aggregation to finally reach the LDAP server. Updating index objects would likewise be difficult, because finding the right index entries in the right index objects again requires following the whole aggregation path.

If, as proposed here, the index objects are not changed, an update is quite straightforward: a new index object is produced and the old index object is simply replaced in the index object collections. The DSI provides an unambiguous means of identifying the index object to be replaced.

Incremental update of single index objects is included in the TIO definition, which allows data blocks to be specified for add, delete and update operations. To unambiguously identify the record for the delete and update operations, a unique identifier of the entry must be included in the index object. In the case of LDAP directories this identifier would be the whole untokenized DN. In a first approach the DESIRE II index system will not use this feature of incremental updates.

The index objects can be built by dedicated crawlers that crawl through the DIT subtree of one server to collect the data. In a second step, a TIO converter can then produce the index object from those data. The decision as to which entries to crawl and which attribute values to collect has to be made by each participating organisation or by the maintainer of the individual server. These definitions should be made via crawler access policies stored in the directory itself and understood by the crawler. A separate document will define the mechanisms and the storage model for such a crawler access policy. To make sure that only crawlers compliant with this policy mechanism are able to get the data, the crawler has to authenticate itself.
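
The replace-on-update rule described in this section (a new index object simply replaces the old one carrying the same DSI) can be illustrated with a minimal sketch. It is not part of the DESIRE II design: the names IndexObject, IndexCollection and store, as well as the example DSI value, URI and payload strings, are assumptions made purely for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IndexObject:
    """One unaggregated index object as received from a single server (hypothetical type)."""
    dsi: str          # Data Set Identifier; the value used below is a made-up example
    base_uris: tuple  # base-URI(s) needed to generate referrals
    payload: str      # opaque TIO body, never modified during transport


class IndexCollection:
    """Index objects held by one index server, keyed by DSI (hypothetical type)."""

    def __init__(self):
        self._objects = {}

    def store(self, obj):
        """Store obj, replacing any object with the same DSI.

        Returns the replaced object, or None if the DSI was new.
        """
        old = self._objects.get(obj.dsi)
        self._objects[obj.dsi] = obj
        return old

    def get(self, dsi):
        return self._objects[dsi]

    def __len__(self):
        return len(self._objects)


# Example values are assumptions, not taken from any real deployment.
DSI = "1.3.6.1.4.1.99999.1"
URIS = ("ldap://ldap.example.org/o=Example,c=NL",)

collection = IndexCollection()
first = IndexObject(DSI, URIS, "tio-v1")
second = IndexObject(DSI, URIS, "tio-v2")

collection.store(first)
replaced = collection.store(second)  # same DSI, so the old object is replaced
```

Because the objects are never aggregated, the update touches exactly one entry of the collection and no aggregation path has to be followed.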
In a first stage, the crawler could be directed via access control mechanisms inherent in the directories. With such a mechanism in place it becomes irrelevant, in terms of privacy issues, who maintains and runs such a crawler. It could be the organisation itself, the National Research Network for all or a subset of the organisations in a country, or even the maintainer of the central index objects at the root of the system.

The single servers that are part of the index system will be registered. Registered servers will be put in a list, which will be accessed by the crawler or the maintainer of the crawler to obtain the host and port of each server. The details of the registration process are outside the scope of this document.

3. Distribution of the index objects

To avoid a single index entry point to which all the world's clients would connect, the gathered index objects (TIO collections) have to be distributed downwards again. Each country should provide an index server for the complete TIO collection. If appropriate, this index could be distributed to several index servers at different locations in the respective country.

For a proof of concept, the downward distribution of the indices, as well as the upward sending of the indices to be gathered, can be performed via simple FTP transfer. More advanced transport mechanisms defined in the CIP Transport Protocols draft [6] can be used later instead.

4. Query routing

A client should not have to provide special features for using the index system. It connects to an index server in the same way it would connect to any other directory server. The access protocol is plain LDAP (v3). The server should then perform the following algorithm:

Perform a search in the locally stored data set, and return the data if found.
If no data matched the search filter, the server should consult the index server to search for appropriate entries and return referrals to those entries, based on the base-URI found in the index.

The user could influence this algorithm by adding a base DN, which defines the entry point and limits the search. In this way the user can, for example, start the search from the root level or from any other level in the hierarchy.

In any case the client does not have to know anything about the indexing system except the hostname and port number of one nearby server that is part of the index system.

5. The overall concept

* A crawler collects the data to be indexed from standard organisational LDAP servers using LDAP searches.

* A TIO converter builds Tagged Index Objects for these servers, which have to include the knowledge needed for referrals (Base-URI) in the MIME wrapper.

* A TIO transporter passes them on to a country level referral index server (TIO/LDAP Referral Server), using one of the CIP defined transport protocols (e.g., HTTP).

* The referral index server stores the index objects.

* The TIO transporter distributes the country index objects to a root referral index server.

* The TIO transporter distributes the index objects of the root referral index server back to the country level referral index servers.

* An LDAP client (dedicated client, web browser, mail agent, etc.) sends an LDAP search to a country level LDAP index server (native protocol server).

* The country level LDAP index server fetches from the country referral index server the LDAP referral(s) that refer to the data matching the search.

* The country level LDAP index server returns the referral(s) to the client.

* The client interprets the referral(s) and retrieves the data from the original LDAP server.

6. Security Considerations

6.1 Personal data and privacy legislation

Since white pages directories contain personal data (e.g.
name, email address, telephone number), it is important to conform to European privacy legislation. Even if all the data are public and published in the directory with the consent of the persons concerned, it is against that legislation to make such data available in bulk. While being transferred from one server to another, the index objects are vulnerable to theft by commercial data brokers and spammers. It is therefore necessary to protect the index object data while they are transferred over the net.

6.2 Encryption of the index objects

To secure the index object distribution process, the data should be encrypted. Since CIP data are MIME encoded, a MIME compatible encryption method is preferable, because the security feature is then independent of the transport protocol, whether HTTP, FTP or email. The CIP authors advise the use of PGP-based MIME security (PGP/MIME) as defined in [7]. PGP has a variety of advantages:

* It is commonly used in the Internet.

* It is easy to include in a MIME application.

* It provides a means of public key (asymmetric) encryption.

* It provides a means of symmetric encryption as well.

* In addition, it provides a means of signing the data in such a way that even one missing byte in the data makes the signature invalid.

* All PGP functionality can be activated by a program without human intervention.

* If implemented with care, the passphrase that has to be supplied to the PGP program can be stored securely and used without the possibility of snooping from outside.

6.3 Authentication between servers

All servers included in the indexing system are known through a registration process. The maintainers of the data servers can define which data are to be included in the index. The index servers and the crawlers that take part in the gathering and distribution of index objects are also known.

To prevent wrong index objects from being included in the index server, programs that supply index objects should authenticate themselves.
Servers could provide special application entries with passwords, to which a supplying program binds before sending the data. A better method of authentication would be to sign the data with a digital signature. This again could be implemented with a public key infrastructure like PGP.

7. Acknowledgement

Work on this specification was supported by the European Commission and by DANTE, Cambridge as part of the EC Project DESIRE II.

8. References

[1] Allen, J. and M. Mealling, "The Architecture of the Common Indexing Protocol (CIP)", draft-ietf-find-cip-arch-02.txt (work in progress), November 1998.

[2] Wahl, M., Howes, T. and S. Kille, "Lightweight Directory Access Protocol (v3)", RFC 2251, December 1997.

[3] Postel, J. and C. Anderson, "White Pages Meeting Report", RFC 1588, February 1994.

[4] Hedberg, R., Greenblatt, B., Moats, R. and M. Wahl, "A Tagged Index Object for use in the Common Indexing Protocol", draft-ietf-find-cip-tagged-07.txt (work in progress), March 1998.

[5] Allen, J. and M. Mealling, "MIME Object Definitions for the Common Indexing Protocol (CIP)", draft-ietf-find-cip-mime-03.txt (work in progress), November 1998.

[6] Allen, J. and P. J. Leach, "CIP Transport Protocols", draft-ietf-find-cip-trans-01.txt (work in progress), April 1999.

[7] Elkins, M., "MIME Security with Pretty Good Privacy (PGP)", RFC 2015, October 1996.

9. Authors' Addresses

Peter Gietz
ZDV, Universitaet Tuebingen
Waechterstr.76
D-72074 Tuebingen
Germany
Phone: +49 7073 2970336
Email: peter.gietz@directory.dfn.de

Peter Valkenburg
SURFnet
Postbus 19035
NL-3501 DA Utrecht
The Netherlands
Phone: +31 30 2305305
Email: Peter.Valkenburg@SURFnet.nl

Henny Bekker
SURFnet Expertise Centrum
Postbus 19035
NL-3501 DA Utrecht
The Netherlands
Phone: +31 30 2305305
Email: Henny.Bekker@sec.nl