Trill WG R. Parameswaran, INTERNET-DRAFT Brocade Communications, Inc. Intended Status:Informational August 25, 2016 Expires: February 22, 2017 TRILL: Parent node Shifts in Tree Construction, Mitigation. Abstract This draft documents a known problem in the Trill tree construction mechanism and offers two different approaches to solve the problem. Status of This Memo This Internet-Draft is submitted to IETF in full conformance with the provisions of BCP 78 and BCP 79. Distribution of this document is unlimited. Comments should be sent to the authors or the TRILL working group mailing list: trill@ietf.org. Internet-Drafts are working documents of the Internet Engineering Task Force (IETF), its areas, and its working groups. Note that other groups may also distribute working documents as Internet- Drafts. Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress." The list of current Internet-Drafts can be accessed at http://www.ietf.org/1id-abstracts.html. The list of Internet-Draft Shadow Directories can be accessed at http://www.ietf.org/shadow.html. Terminology and Acronyms. The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in [RFC2119]. 1. Introduction. Trill is a data center technology that uses link-state routing mechanisms in a layer 2 setting, and serves as a replacement for spanning-tree. Trill uses trees rooted at pre-determined nodes as a way to distribute multi-destination traffic. Multi-destination traffic includes traffic such as layer-2 broadcast frames, unknown unicast flood frames, and layer 2 traffic with multicast MAC addresses (collectively referred to as BUM traffic). Multi-destination traffic is typically hashed onto one of the available trees and sent over the tree, potentially reaching all nodes in the network (hosts behind which may own/need the packet in question). 2. Tree construction in Trill and the associated parent-shift issue. Tree construction in Trill is defined by [RFC6325], with additional corrections defined in [RFC7780]. A-- --B / \ \/ /\ / \/\ _/_ \ /__ _/\ / \\ // \/ \\ 1 2 3 \ | / \ | / \ | / \ | / \ | / \ | / \ |/ C The tree construction mechanism used in Trill codifies certain tree construction steps which make the resultant trees very brittle. Specifically, the parent selection mechanism in Trill causes problems in case of node failures. Trill uses the following rule - when constructing an SPF tree, if there are multiple possible parents for a given node (i.e. if multiple upstream nodes can potentially pull in a given node during SPF, all at the same cumulative cost, then the parent selection is imposed in the following manner): [RFC6325]: "When building the tree number j, remember all possible equal cost parents for node N. After calculating the entire 'tree' (actually, directed graph), for each node N, if N has 'p' parents, then order the parents in ascending order according to the 7-octet IS-IS ID considered as an unsigned integer, and number them starting at zero. For tree j, choose N's parent as choice j mod p." There is an additional correction posted to this in [RFC7780]: [RFC7780], Section 3.4: "Section 4.5.1 of [RFC6325] specifies that, when building distribution tree number j, node (RBridge) N that has multiple possible parents in the tree is attached to possible parent number j mod p. Trees are numbered starting with 1, but possible parents are numbered starting with 0. As a result, if there are two trees and two possible parents, then in tree 1 parent 1 will be selected, and in tree 2 parent 0 will be selected. This is changed so that the selected parent MUST be (j-1) mod p. As a result, in the case above, tree 1 will select parent 0, and tree 2 will select parent 1. This change is not backward compatible with [RFC6325]. If all RBridges in a campus do not determine distribution trees in the same way, then for most topologies, the RPFC will drop many multi-destination packets before they have been properly delivered." 3. Issues with the Trill tree construction algorithm. With this tree construction mechanism in mind, let's look at the Spine-Leaf topology presented in section 2 and consider the calculation of Tree number 1 in Trill. Consider the spine-leaf network presented in section 2, with spine nodes A, B and leaf nodes 1,2,3. A-- --B / \ \/ /\ / \/\ _/_ \ /__ _/\ / \\ // \/ \\ 1 2 3 \ | / \ | / \ | / \ | / \ | / \ | / \ |/ C Assume that in the above topology, when ordered by 7-octet ISIS-id, 1 < 2 < 3 holds and that the root for Tree number 2 is A. Given the ordered set {1, 2, 3} , these nodes have the following indices (with a starting index of 0): Node Index 1 0 2 1 3 2 Given the SPF constraint and that the tree root is A, the parent for nodes 1,2, and 3 will be A. However, when the SPF algorithm tries to pull B or C into the tree, we have a choice of parents, namely 1, 2, or 3. Given that this is tree 2, the parent will be the one with index (2-1) mod 3 (which is equal to 1). Hence the parent for node B will be node 2. The tree that results from Trill's tree calculation looks like the following. A /|\ / | \ / | \ 1 2 3 /\ / \ B C However, due to Trill's parent selection algorithm, the sub-tree rooted at Node 2 will be impacted even if Node 1 or Node 3 go down. Take the case where Node 1 goes down. Tree 1 must now be re-computed (this is normal) - but now, when the SPF computation is underway, when the SPF process tries to pull in B, the list of potential parents for B now are {2 and 3}. So, after ordering these by ISIS-Id as {2, 3} (where 2 is considered to be at index of 0 and 3 is considered to be at index 1), for tree 1, we apply Trill's formula of: Parent's index = (TreeNumber-1) mod Number_of_parents. = (2-1) mod 2 = 1 mod 2 = 1 (which is the index of Node 3) The re-calculated tree now looks as shown below. The shift in parent nodes (for B) causes lot of disruption to live traffic in the network, and is unnecessary in absolute terms because the existing parent for node B, node 2, was not perturbed in any way. This churn could simply be avoided with a better algorithm/approach. A / \ / \ / \ 2 3 /\ / \ B C The parent shift issue noted above can be solved in at least two different ways: a) by means of the affinity TLV. b) by means of a network wide policy based parent selection mechanism. While the techniques identified in this draft have an immediate benefit when applied to spine/leaf networks popular in data-center designs, nothing in either of the approaches outlined below assumes a spine-leaf network. The techniques outlined below will work on any connected graph. Furthermore, no symmetry in link-cost is assumed. 4. Alternative A - Using the Affinity sub-TLV. At a high level, this problem can be solved by having the affected parent send out an affinity sub-TLV identifying the children for which it wants to preserve the parent child relationship, subject to network events which may change the structure of the tree. It would be sufficient to have a local configuration option (e.g. a CLI) at one of the nodes which would be the preferred parent, and this may very well be the node that would be the node that would be the parent under the normal trill tree construction method. In such case, the following steps may provide a way to implement this proposal: a. The operator locally configures the current parent to indicate its stickiness in tree construction for a specific tree number and tree root via the affinity sub-TLV. This can also be done before tree construction if the operator consults the 7 octet ISIS-ID relative ordering of the concerned nodes and infers upfront which of the potential parent nodes would become the parent node for a given set of children on that tree number under the Trill tree calculation mechanism. b. The configured node advertises its preference via the affinity sub-TLV when it completes a normal Trill specified tree calculation, and finds itself the parent of multiple child nodes per the calculation. The affinity sub-TLV must reflect the concerned tree number and the child nodes for which the concerned node is a parent node. c. When any change event happens in the network, one which forces a tree re-calculation for the concerned tree, the configured node should run thru the normal trill tree calculation agnostic of the fact that it has published an affinity sub-TLV. During the tree calculation, if it turns out to be on the list of potential parents for some or all of the child nodes for which it was publishing the affinity sub-TLV (disregarding its possible exclusion as parent due to 7 octet ISIS ID ordering logic), then it should react in the following manner: i. if it is still a potential parent for some of the children in the previous iteration of the tree, it should send out an updated affinity sub-TLV identifying the correct sub-set of children for which the node aspires to continue the parent relationship. ii. if the tree structure changes such that it is no longer a potential parent for any of the child nodes in the advertised affinity sub-TLV, then it must retract the affinity sub-TLV. d. Situations where the node advertising the Affinity sub-TLV dies or restarts should be handled using the normal handling for such scenarios relating to the parent Router Capability TLV, and as specified in [RFC4971]. e. Conflicts in the Affinity sub-TLV from different originators for the same tree number and child node should be handled as specified in [RFC7783]. If the node is forced to retract its affinity sub-TLV, it can then repeat these steps in a subsequent tree construction, if the same node becomes a parent again. In terms of nodes that do not support this algorithm, they are expected to seamlessly inter-operate with this scheme, so long as they understand and honor the affinity TLV. 4. Alternative B - Using a variant of the Djikstra SPF algorithm. There is another way in which the same goal can be achieved - the classic SPF algorithm may be modified to address the above problem. The approach requires all nodes in the network to use the same algorithm and same tie-break policy. The algorithm models the SPF algorithm in traditional terms, referencing two sets of vertices, PATH, and TENT. PATH contains vertices known to be reachable at optimal cost from the root vertex v. TENT contains vertices being examined for optimal reachability. TENT contains satellite information for each vertex, which is the immediate parent that caused the vertex to be moved into TENT, and the cost of the link between them. The algorithm is presented in a pseudocode that is similar in syntax to the 'C' language. The modified algorithm is presented below , and differs from the classic SPF algorithm in the following way: When pulling vertices from TENT to PATH, if there is more than one vertex at the same minimum cost, it considers them all simultaneously and equally. Additionally, the SPF algorithm allows a policy filter to be placed during the parent selection process. Guide-lines are presented below on how the policy may be specified so as to achieve the same effect as using the affinity TLV in the alternative approach above. policy_parent_selection_optimized_spf_run(vertex root, PolicyFunction PF, Tree PreviousTree) /* representation of the * same tree as computed * in the previous iteration, * as a reference for the * policy tie-breaker * function below. */ { /* L, LL are lists of tuples of the following form */ List<(vertex, parent, link(parent, vertex), link-cost(parent, vertex))> L, LL; /* Heap is ordered in ascending order of link-costs */ HEAP tmpHEAP; /* * map_vertex2parents_heap maps a vertex to a HEAP of tuples, each * identifying a parent relationship. The heap is ordered in * ascending order of link-cost. It also serves as a representation * of the set TENT. */ Mapping map_vertex2parents_heap{vertex -> HEAP of tuples of type (vertex, parent, link(parent,vertex) link-cost(parent, vertex)) }; initialize_to_empty(L); initialize_to_empty(LL); initialize_to_empty(map_vertex2parents_heap); for each vertex in graph { if vertex is not root then { set distance_from_root(vertex) = infinity; vertex.in_PATH = FALSE; } } /* Implicit: distance_from_root(root) = 0 */ for each link (root,w) that is attached to root { if (distance_from_root(w) > link-cost(root, w)) { map_vertex2parents_heap{w}.add_to_heap( (w, root, link(root,w), link-cost(root,w))); } } /* * Iterate while there are vertices (tuples) in TENT - TENT is * realized via map_vertex2parents_heap which is a mapping table * with each entry of the table representing a (per-vertex) HEAP * of tuples. Each tuple represents a link i.e. a parent-child * relationship along with the identity of the link and its * associated cost. */ while (not_empty(map_vertex2parents_heap)) { /* There may be more than one vertex in TENT at * minimal cost, and this modified SPF algorithm considers all * of them simultaneously, processing them as a list, in terms * of their addition to PATH, and in terms of equally * considering them parents of as-yet undiscovered vertices * that may be reachable at the same total cost from members * of the list. * priority_heap_minimum_multiple_dequeue_as_list takes the * heap and returns a list of all the vertexs that are at * minimum cost from v. The list is returned as a list of * tuples, where each tuple is of the form: * * (vertex, parent, link(parent, vertex), * link-cost(vertex, parent)) */ initialize_to_empty(tmpHEAP); /* * Iterate over the vertices comprising the key-space of * map_vertex2parents_heap. The minimum from TENT is * derived in two steps - find the minimum cost for each vertex * from its potential parents (find the tuples), and then find * the minimum of the above minimums using a temporary HEAP. */ foreach (vertex v in valid-keys(map_vertex2parents_heap)) { /* tuples of the type * (vertex, parent,link(vertex, parent), * link-cost(vertex, parent)) at a minimum cost per hash * table vertex entry are inserted to tmpHEAP. Each entry * in tmpHEAP is a tuple. */ if (is-empty(map_vertex2parents_heap{v}) continue; /* Move on to the next vertex */ /* * heap_dequeue_multiple_minimum_as_list() is assumed to * dequeue multiple entries at the top of the heap, if * they are all at the same minimum value of the heap * metric (cumulative reachability cost for this * application). */ tmpHEAP.add_to_heap(heap_dequeue_multiple_minimum_as_list( map_vertex2parents_heap{v})); } L = heap_dequeue_multiple_minimum_as_list(tmpHEAP); /* * L is a list of tuples. * L could potentially have only a single tuple. Iterate over * tuples in L. Tuples in L are optimally reachable from nodes * already in PATH. */ foreach tuple (y, y_parent, link(y_parent, y), link-cost(y_parent, y)) in L { /* * Identify tuples relating to y's potential parents, * all leading to optimal reachability for y, collect * them in LL. */ LL = heap_dequeue_multiple_minimum_as_list( map_vertex2parents_heap{y}); /* * Policy function below takes a vertex, and its list of * eligible parents, and chooses one parent for the vertex. * It can be simply thought of as a tie-breaker that * chooses one parent out of the list of eligible parents * of y, but based on the consideration specified by the * policy, and being able to do so in a way that * minimizes tree churn. * * ASSUMPTION: Policy Function needs to be able * to deal with trivial case of the list of eligible * parents having only one parent, it must select that one * parent unconditionally in that case for the given child * node. * * PolicyFunction is not explicitly defined here, it may * take other parameters such as a representation of the * previous computation of the tree to help it make a * selection that helps preserve links in the tree as they * existed prior to this computation. */ if (y_parent == PolicyFunction(y, LL, PreviousTree, TreeNumber)) { /* * if y is allowed by the policy function, it can get * pulled into PATH here. However, if y is excluded * at this stage, then it must have another potential * instance in TENT (but see assumption above), which * will pull it in thru a different parent - * correctness of the SPF algorithm guarantees this, * and this other instance with other parent must be * in the list L. */ y.in_PATH = TRUE; distance_from_root(y) = distance_from_root(y_parent) + link-cost(y_parent,y); if (y_parent == root) /* directly connected */ nexthop_link_at_root(y) = link (root ,y); else nexthop_link_at_root(y) = nexthop_link_at_root(y_parent); /* y is now in PATH. * identify_vertex_as_parent() is defined further * down below, it marks y as parent for its * downstream children, by updating the * map_vertex2parents_heap entry for nodes * immediately downstream of y. Note * that we do not need to iterate over all instances * of y in LL - this can be done by simply examining * all links on y and pulling in those not in PATH. */ identify_vertex_as_parent(y, map_vertex2parents_heap); } } } for each vertex w in graph { if (w is not root) { print (distance_from_root(w), nexthop_link_at_root(w)); } } } identify_vertex_as_parent(vertex y, Mapping map_vertex2parents_heap) { for each link (y,x) that is attached to y { if (x.in_PATH == FALSE) // Ignore nodes already in PATH. { /* * Routine is only called on vertices y that are already * in PATH, so child x will be reachable optimally from y. * * y is a potential parent of x, because x is optimally * reachable from y. But we need to see below if there are * other potential parents of x at the same optimal cost. * */ if (distance_from_root(x) > (distance_from_root(y) + cost(y,x)) { map_vertex2parents_heap{x}.add_to_heap(x, y, (y,x), distance_from_root(y)+cost(y,x)); } } } } 4.1 Choice of the policy function. The exact form of the policy function is left unspecified here, but the choice of the right policy function is important in being able to solve the parent selection issue described in the draft. In order to solve the discussed problem, the policy function must preserve parent-child links as they existed in previous computations of the tree. It can do this by consulting a static list of tree-links from the previous tree computation, or by consulting a representation of the previous computation of the SPF tree as input, to identify and preserve parent-child relationships in the prior SPF tree. The policy is applied on a best-effort basis. The policy is consulted during the normal SPF tree construction mechanism if a given child node can be pulled in from a choice of multiple parents, and if one of those parents had pulled in that child node in a previous computation of the tree. When used in this manner, if a parent node in the previous tree computation is no longer alive/reachable, it should not be used as a reference in preserving any parent-child relationships. A secondary policy or fall-back tie-break may be needed to identify a parent for a given child in this situation, and this is detailed below. Also, as noted in the code section, in the case where a child node has only one parent, the policy function must unconditionally select that parent node to pull in the child to the SPF tree. The policy function may also take the tree number as an input parameter to determine a fallback tie-break. When the policy function is set up in the manner described above, it helps maintain parent affinity by explicitly breaking ties preferentially so that prior parent relationships are maintained in a new computation of the tree. However, the challenge of synchronizing the application of the policy across rBridges in the network needs to be addressed. In order to do this, the following rules may need to be adhered to when using this approach: a. The first tree computation is carried out using the Trill specified tree construction mechanism. b. So long as the root node for that tree number does not change physically, the second and subsequent tree computations for the same tree can use this algorithm and try to feed in a representation of the previous tree as guide to the policy function, subject to the following caveats: i. if a non-root node that was a parent in a previous computation is no longer reachable/alive, then the computation should fall back to the default Trill parent selection rule for the set of remaining parent candidates and the associated children. Note that if a sibling of the parent node goes away or changes its link connectivity, it does not impact the concerned parent's child relationships. ii. If a link between a parent node and child node (in the previous computation of that tree) goes down, then that child alone should be pulled in to the SPF tree using Trill specified tie-break rule. Note that this may result in some children being pulled in through the old parent and some pulled in thru a different parent, selected by the Trill tie-break rule. c. If the root node for that tree number changes to a physically different node, then the first tree computation for that tree number with that new tree root should be carried out using the default Trill parent-selection rules. The second and subsequent computations can leverage this approach, each feeding in a representation of the previous tree as a reference for the current computation. d. There may be cases during tree construction with this approach where more than one parent finds a match in the representation of the previous tree - in this case the tie should be broken according to the default Trill parent selection rule. This can happen, when a node that was a parent in a previous computation becomes a child node of its former child in the current tree, due to a link going down, for example. e. The policy function must always be fed the representation of the most recent prior tree computation for that tree number. In essence, the above rules mean that when pulling in a particular node, from a given choice of parent nodes during the SPF tree construction, the policy function first tries to identify if one of the potential parents during the current SPF run had pulled in that child during the previous tree computation. If yes, then that parent is preferred and the parent-child relationship is preserved during the current tree computation. If, however, no reference could be found with the prior computation, the parent selection falls back to the default Trill parent selection rules. 5. Network wide selection of the tree computation algorithm. For either of the two alternative mechanisms specified here, this draft takes no position on how rBridges in the network may negotiate and select either of these algorithms as the preferred tree computation mechanism at this time. 6. Benefits. The advantage to using either of the approaches outlined above, is that trees produced by this algorithm are not subject to the parent-shift issue identified in the case of Trill, so long as the policy function or network wide selection of the parent selects links in a predictable manner across independent SPF runs. It tries to protect against network events which can impact existing parent child relationships in a tree. 7. Security Considerations. The proposal primarily influences tree construction and tries to preserve parent-child relationships in the tree from prior computations of the same tree, without changing any of operational aspects of the protocol. Hence, no new security considerations for Trill are raised by this proposal. 8. IANA Considerations. No new registry entries are requested to be assigned by IANA. The Affinity Sub-TLV has been defined in [RFC7176], and this proposal does not change its semantics in any way. 9. Informative References. [RFC2119] Bradner, S., "Key words for use in RFCs to Indicate Requirement Levels", BCP 14, RFC 2119, DOI 10.17487/RFC2119, March 1997, . [RFC6325] Perlman, R., Eastlake 3rd, D., Dutt, D., Gai, S., and A. Ghanwani, "Routing Bridges (RBridges): Base Protocol Specification", RFC 6325, DOI 10.17487/RFC6325, July 2011, . [RFC7780] - Eastlake 3rd, D., Zhang, M., Perlman, R., Banerjee, A., Ghanwani, A., and S. Gupta, "Transparent Interconnection of Lots of Links (TRILL): Clarifications, Corrections, and Updates", RFC 7780, DOI 10.17487/RFC7780, February 2016, . [RFC7783] Senevirathne, T., Pathangi, J., Hudson, J., "Coordinated Multicast Trees (CMT) for Transparent Interconnection of Lots of Links (TRILL)", RFC 7783, February 2016, [RFC4971] Vasseur, JP., Shen, N., Aggarwal, R., "Intermediate System to Intermediate System (IS-IS) Extensions for Advertising Router Information", RFC 4971, July 2007, Author's Address: R. Parameswaran, Brocade Communications, Inc. 120 Holger Way, San Jose, CA 95134. Email: parameswaran.r7@gmail.com Copyright and IPR Provisions Copyright (c) 2016 IETF Trust and the persons identified as the document authors. All rights reserved. This document is subject to BCP 78 and the IETF Trust's Legal Provisions Relating to IETF Documents (http://trustee.ietf.org/license-info) in effect on the date of publication of this document. Please review these documents carefully, as they describe your rights and restrictions with respect to this document. Code Components extracted from this document must include Simplified BSD License text as described in Section 4.e of the Trust Legal Provisions and are provided without warranty as described in the Simplified BSD License. The definitive version of an IETF Document is that published by, or under the auspices of, the IETF. Versions of IETF Documents that are published by third parties, including those that are translated into other languages, should not be considered to be definitive versions of IETF Documents. The definitive version of these Legal Provisions is that published by, or under the auspices of, the IETF. Versions of these Legal Provisions that are published by third parties, including those that are translated into other languages, should not be considered to be definitive versions of these Legal Provisions. For the avoidance of doubt, each Contributor to the IETF Standards Process licenses each Contribution that he or she makes as part of the IETF Standards Process to the IETF Trust pursuant to the provisions of RFC 5378. No language to the contrary, or terms, conditions or rights that differ from or are inconsistent with the rights and licenses granted under RFC 5378, shall have any effect and shall be null and void, whether published or posted by such Contributor, or included with or in such Contribution.