NVO3 WG                                                      E. Nordmark
Internet-Draft                                                C. Appanna
Intended status: Standards Track                                   A. Lo
Expires: January 2, 2017                                 Arista Networks
                                                              S. Boutros
                                                                A. Dubey
                                                                  VMware
                                                                Jul 2016


     Layer-Transcending Traceroute for Overlay Networks like VXLAN
             draft-nordmark-nvo3-transcending-traceroute-03

Abstract

   Tools like traceroute have been very valuable for the operation of
   the Internet.  Part of that value comes from being able to display
   information about routers and paths over which the user of the tool
   has no control, but the traceroute output can be passed along to
   someone else who can further investigate or fix the problem.

   In overlay networks such as VXLAN and NVGRE the prevailing view is
   that since the overlay network has no control of the underlay,
   special tools and agreements are needed to enable extracting traces
   from the underlay.  We argue that enabling visibility into the
   underlay and using existing tools like traceroute has been
   overlooked and would add value in many deployments of overlay
   networks.

   This document specifies an approach that can be used to make
   traceroute transcend layers of encapsulation, including details for
   how to apply this to VXLAN.  The technique can be applied to other
   encapsulations used for overlay networks.  It can also be
   implemented using current commercial silicon.

Status of this Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF).  Note that other groups may also distribute
   working documents as Internet-Drafts.  The list of current
   Internet-Drafts is at http://datatracker.ietf.org/drafts/current/.

   Internet-Drafts are draft documents valid for a maximum of six
   months and may be updated, replaced, or obsoleted by other documents
   at any time.  It is inappropriate to use Internet-Drafts as
   reference material or to cite them other than as "work in progress."

Nordmark, et al.
Expires January 2, 2017                                         [Page 1]
Internet-Draft                    LTTON                         Jul 2016

   This Internet-Draft will expire on January 2, 2017.

Copyright Notice

   Copyright (c) 2016 IETF Trust and the persons identified as the
   document authors.  All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (http://trustee.ietf.org/license-info) in effect on the date of
   publication of this document.  Please review these documents
   carefully, as they describe your rights and restrictions with
   respect to this document.  Code Components extracted from this
   document must include Simplified BSD License text as described in
   Section 4.e of the Trust Legal Provisions and are provided without
   warranty as described in the Simplified BSD License.

Table of Contents

   1.  Introduction
   2.  Solution Overview
   3.  Goals and Requirements
   4.  Definition Of Terms
   5.  Example Topologies
   6.  Controlling and selecting ttl behavior
   7.  Introducing a ttl copyin flag in the encapsulation header
   8.  Encapsulation Behavior
   9.  Decapsulating Behavior
   10. Other ICMP errors
   11. Downstream Egress Paths Object
   12. Security Considerations
   13. IANA Considerations
   14. Acknowledgements
   15. References
     15.1.  Normative References
     15.2.  Informative References
   Authors' Addresses

1.  Introduction

   Tools like traceroute have been very valuable for the operation of
   the Internet.  Part of that value comes from being able to display
   information about routers and paths over which the user of the tool
   has no control, but the traceroute output can be passed along to
   someone else who can further investigate or fix the problem.  The
   output of traceroute can be included in an email or a trouble ticket
   to report the problem.  This provides a lot more information than
   the mere indication that A can't communicate with B, in particular
   when the failures are transient.  The ping tool provides some of the
   same benefits in being able to return ICMP errors such as host
   unreachable messages.

   This document shows how those tools can be used to gather
   information for both the overlay and underlay parts of an end-to-end
   path by providing the option to have some packets use a uniform
   time-to-live (ttl) model for the tunnels, and associated ICMP error
   handling.  These changes are limited to the tunnel ingress and
   egress points.

   The desire to make traceroute provide useful information for overlay
   networks is not an argument against also using a layered approach
   for OAM as specified in, e.g., [I-D.tissa-lime-yang-oam-model].
   Such approaches are quite appropriate for continuous monitoring at
   different layers and across different domains.  A layer-transcending
   traceroute complements the ability to do layered and/or continuous
   monitoring.

   The traceroute tool relies on receiving ICMP errors [RFC0792] in
   combination with using different IP time-to-live values.  That
   results in the packet making it further and further towards the
   destination with ICMP ttl exceeded errors being received from each
   hop.
   That shows the user the working part of the path even if the packets
   are eventually black holed, and also reports any errors such as ICMP
   host unreachable.  The fundamental assumption is that the ttl is
   decremented for each hop and that the resulting ICMP ttl exceeded
   errors are delivered back to the host.

   When some encapsulation is used to tunnel packets there is an
   architectural question of how those tunnels should be viewed from
   the rest of the network.  Different models were described first for
   diffserv in [RFC2983], then applied to MPLS in [RFC3270] and
   expanded to MPLS ttl handling in [RFC3443].  Those models apply to
   other forms of direct or indirect IP in IP tunnels.  Those RFCs
   define two models for ttl that are of interest to us:

   o  A pipe model, where the tunnel is invisible to the rest of the
      network in that it looks like a direct connection between the
      tunnel ingress and egress.

   o  A uniform model, where the ttl decrements uniformly for hops
      outside and inside the tunnel.

   The tunneling mechanisms discussed in NVO3 (such as VXLAN [RFC7348],
   NVGRE [I-D.sridharan-virtualization-nvgre], GENEVE
   [I-D.gross-geneve], and GUE [I-D.herbert-gue]) have either been
   specified to provide the pipe model of a tunnel or are silent on the
   setting of the outer ttl.  Those protocols can be extended to have
   an optional uniform tunnel model when the payload is IP, following
   the same model as in [RFC3443].  Note that these encapsulations
   carry Ethernet frames and hence are not even aware that the payload
   is IP.  However, IP is the bulk of what is carried over such tunnels
   and the ingress NVE can inspect the IP part of the Ethernet frame.

   For general application traffic the pipe model is fine and might
   even be expected by some applications.  In general, when the source
   and destination IP are in the same IP subnet the ttl should not be
   decremented.
   Thus it makes sense to have a way to selectively enable the uniform
   model, perhaps based on some method to identify packets associated
   with traceroute or on some marker in the packet itself that the
   traceroute tool can set.

2.  Solution Overview

   The pieces needed to accomplish this are:

   o  One or more ways to select the uniform model packets at the
      tunnel ingress.

   o  Tunnel ingress copying out the original ttl from a selected
      packet to the outer IP header, and then doing a check and
      decrement of that ttl.

   o  If that ttl check results in ttl expiry at the tunnel ingress,
      then deliver an ICMP ttl exceeded packet back to the host.

   o  A mechanism by which the tunnel egress knows which packets should
      have uniform model, for instance a bit in the encapsulation
      header.

   o  The tunnel egress copying in the ttl (for identified packets)
      from the outer header to the inner IP header, then doing a check
      and decrement of that ttl.

   o  If the ttl check results in ttl expiry at the tunnel egress, then
      deliver an ICMP error back to the original host (or, perhaps
      better, to the tunnel ingress the same way as underlay routers
      do).

   o  IP routers in the underlay will deliver any ICMP errors to the
      source IP address of the packet.  For tunneled packets that will
      be the tunnel ingress.  Hence the tunnel ingress needs to be able
      to take such ICMP errors and form corresponding ICMP errors that
      are sent back to the host.

   The requirement in [RFC1812] ensures that the ICMP errors will
   contain enough headers to form such an ICMP error.  It has been
   noted that there are routers in the Internet which, decades later,
   still fail to conform to that aspect of [RFC1812].

   The idea to reflect (some) ICMP errors from inside a tunnel back to
   the original source goes back to IPv6 in IPv4 encapsulation as
   specified in [RFC1933] and [RFC2473].
   However, those specifications did not advocate using a uniform ttl
   model for the tunnels but did handle ICMP packet too big and other
   unreachable messages.  They specify how to reflect ICMP errors
   received from underlay routers to ICMP errors sent to the original
   host.  The addition of handling ICMP ttl exceeded errors for the
   uniform tunnel model is straightforward.

   The information carried in the ICMP errors is quite limited: the
   original packet plus an ICMP type and code.  However, there are
   extension mechanisms specified in [RFC4884] and used for MPLS in
   [RFC4950] which include TLVs with additional information.  If there
   is additional information to include for overlay networks, that
   information could be added by defining new ICMP Extension Objects
   based on [RFC4884].  An example of such an extension for ECMP
   information is included in this document.

3.  Goals and Requirements

   The following goals and requirements apply:

   o  No changes needed in the underlay.

   o  Optional changes on the decapsulating end.

   o  ECMP friendly.  If the underlay employs equal cost multipath
      routing then one should be able to use this mechanism to trace
      the same path as a given TCP or UDP flow is using.  In addition,
      one should be able to explore different ECMP paths by varying the
      IP addresses and port numbers in the packets originated by
      traceroute on the host.

   o  Provide output which makes it possible to compare a regular
      overlay traceroute with the layer-transcending output.

4.  Definition Of Terms

   The keywords "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
   document are to be interpreted as described in [RFC2119].

   Terms such as NVE and TS are used as specified in [RFC7365]:

   o  Network Virtualization Edge (NVE): An NVE is the network entity
      that sits at the edge of an underlay network and implements L2
      and/or L3 network virtualization functions.

   o  Tenant System (TS): A physical or virtual system that can play
      the role of a host or a forwarding element such as a router,
      switch, firewall, etc.

   o  Virtual Access Points (VAPs): A logical connection point on the
      NVE for connecting a Tenant System to a virtual network.

   o  Virtual Network (VN): A VN is a logical abstraction of a physical
      network that provides L2 or L3 network services to a set of
      Tenant Systems.

   o  Virtual Network Context (VN Context) Identifier: Field in an
      overlay encapsulation header that identifies the specific VN the
      packet belongs to.

   We use the VTEP term in [RFC7348] as synonymous with NVE, and VNI as
   synonymous with VN Context Identifier.

5.  Example Topologies

   The following example topologies illustrate different cases where we
   want a tracing capability.  The examples are for overlay
   technologies such as VXLAN which provide a layer 2 overlay on IP.
   The cases for layer 3 overlay on top of IP are simpler and not shown
   in this document.  The VXLAN term VTEP is used as synonymous with
   NVO3's NVE term.

   -----------               -----------
   |   H1    |               |   H2    |
   | 1.0.1.1 |               | 1.0.1.2 |
   |         |               |         |
   -----------               -----------
        |                         |
        |                         |
   -----------  -----------  -----------
   |  VtepA  |  |   R1    |  |  VtepB  |
   | 2.0.1.1 |--| 2.0.1.2 |  | 2.0.2.1 |
   |         |  | 2.0.2.2 |--|         |
   -----------  -----------  -----------

                Simple L2 overlay

   The figure above shows two hosts connected using an underlay which
   provides a layer two service.  Thus H1 and H2 are in the same subnet
   and unaware of the existence of the underlay.  Thus a normal ping or
   traceroute would not be able to provide any information about the
   nature of a failure; either packets get through or they do not.
   When the packets get through, traceroute would output something
   like:

   traceroute to 1.0.1.2 (1.0.1.2), 30 hops max, 60 byte packets
    1  1.0.1.2 (1.0.1.2)  1.104 ms  1.235 ms  1.729 ms

   In this case it would be desirable to be able to traceroute from H1
   to H2 (and vice versa) and observe VtepA, R1, VtepB and H2.  Thus in
   the case of packets getting through, traceroute would output:

   traceroute to 1.0.1.2 (1.0.1.2), 30 hops max, 60 byte packets
    1  2.0.1.1 (2.0.1.1)  1.104 ms  1.235 ms  1.729 ms
    2  2.0.1.2 (2.0.1.2)  2.106 ms  2.007 ms  2.156 ms
    3  2.0.2.1 (2.0.2.1)  35.034 ms  24.490 ms  21.626 ms
    4  1.0.1.2 (1.0.1.2)  40.830 ms  44.694 ms  75.620 ms

   Note that the underlay and overlay might exist in completely
   separate addressing domains.  Thus H1 might not be able to reach any
   of the underlay addresses.  The underlay IP addresses might also
   overlap the overlay IP addresses.  For example, it would be
   completely valid to see VtepA having the same IP address as H1.  The
   user of this tool needs to understand that the utility of the
   traceroute output is to determine whether the issue is in the
   underlay or the overlay, and to be able to pass the underlay
   information to the operator of the underlay.

   In overlay networks without any ARP/ND optimizations, ARP/ND packets
   would be flooded between the tunnel endpoints.  Thus if there is
   some communication failure between H1 and H2, then H1 above might
   not have an ARP entry for H2.  This results in traceroute not being
   able to output any data.  This implies that in order to use
   traceroute to troubleshoot the issue one would need some workaround,
   such as installing some temporary ARP entries on the hosts.

   -----------               -----------  -----------  -----------
   |   H1    |               |   R2    |  |   R3    |  |   H4    |
   | 1.0.1.1 |               | 1.0.2.2 |--| 1.0.2.3 |  |         |
   |         |               | 1.0.1.2 |  | 1.0.3.3 |--| 1.0.3.4 |
   -----------               -----------  -----------  -----------
        |                         |
        |                         |
   -----------  -----------  -----------
   |  VtepA  |  |   R1    |  |  VtepB  |
   | 2.0.1.1 |--| 2.0.1.2 |  | 2.0.2.1 |
   |         |  | 2.0.2.2 |--|         |
   -----------  -----------  -----------

          L2 overlay as part of larger network

   The figure above has an overlay router as the next hop seen by H1.
   In this case a normal overlay traceroute would be able to display
   the overlay path, i.e.,

   traceroute to H4, 30 hops max, 60 byte packets
    1  R2
    2  R3
    3  H4

   The layer-transcending traceroute would show the combination of the
   underlay and overlay paths, i.e.,

   traceroute to H4, 30 hops max, 60 byte packets
    1  VtepA
    2  R1
    3  VtepB
    4  R2
    5  R3
    6  H4

   -----------             -------------------             -----------
   |   H1    |             |       R5        |             |   H6    |
   | 1.0.1.1 |             |                 |             |         |
   |         |             | 1.0.1.2 1.0.5.5 |             | 1.0.5.6 |
   -----------             |-----------------|             -----------
        |                  |                 |                  |
        |                  |                 |                  |
        |                  |                 |                  |
   ----------- ----------- |-----------------| ----------- -----------
   |  VtepA  | |   R1    | |  VtepB   VtepC  | |   R6    | |  VtepD  |
   | 2.0.1.1 |-| 2.0.1.2 | | 2.0.2.1 3.0.1.1 |-| 3.0.1.2 | |         |
   |         | | 2.0.2.2 |-|                 | | 3.0.2.2 |-| 3.0.3.1 |
   ----------- ----------- ------------------- ----------- -----------

                      Multiple L2 overlays in path

   The figure above has multiple overlay network segments that are
   connected in one router, which provides the tunnel endpoints for
   both overlay segments plus routing for the overlay.  A more general
   picture would be to have an overlay routed path between the two
   NVEs, e.g., VtepB and VtepC connected to different routers in the
   overlay.  However, such a drawing in ASCII art doesn't fit on the
   page.

   A normal overlay traceroute in the above topology would show the
   overlay router, i.e.,

   traceroute to H6, 30 hops max, 60 byte packets
    1  R5
    2  H6

   The layer-transcending traceroute would show the combination of the
   underlay and overlay paths, i.e.,

   traceroute to H6, 30 hops max, 60 byte packets
    1  VtepA
    2  R1
    3  VtepB
    4  R5
    5  VtepC
    6  R6
    7  VtepD
    8  H6

   Note that the R5 device, which includes VtepB and VtepC, appears as
   three hops in the traceroute output.  That is needed to be able to
   correlate the output with the overlay output, which has R5.  That
   correlation would be hard if the R5 device only appeared as VtepB in
   the LTTON output.  The three-hop representation also stays invariant
   whether the NVEs and overlay router are implemented by a single
   device or by multiple devices.

6.  Controlling and selecting ttl behavior

   The network admin needs to be able to control who can use the
   layer-transcending traceroute, since the operator might not want to
   disclose the underlay topology to all its users all the time.  There
   are different approaches for this, such as designating particular
   ports (Virtual Access Points in NVO3 terminology) on an NVE to have
   the uniform ttl tunnel model.  We have found it useful to be able to
   enable this capability on a per port and/or virtual network basis,
   in addition to having a global setting per NVE.

   When enabled on the NVEs, the user on the TS needs to be able to
   control which traffic is subject to which tunnel model.  The normal
   traffic would use the pipe ttl tunnel model and only explicit trace
   applications are likely to want to use the uniform ttl tunnel model.
   Hence it makes sense to use some marker in the packets sent by the
   TS to select those packets for uniform model on the NVE.  Such a
   mechanism should be usable so that the user can perform both a
   regular traceroute and an LTTON.

   Potentially different fields in the packets originated by traceroute
   on the TS can be used to mark the packets for the uniform ttl tunnel
   model.  However, many of those fields, such as source and
   destination port numbers and protocol, might be used in hashing for
   ECMP.  The marking that can be used without impacting ECMP is the
   DSCP field in the packet.  That field can be set with an option
   (--tos) in at least some existing traceroute implementations.

   Note that when DSCP is used for such marking it is a configured
   choice subject to agreement between the operator of the TS and NVE.
   The matching on the NVE should ignore the ECN bits so as not to
   interfere with ECN.

   However, the DSCP value used in the overlay might have an impact on
   the forwarding of the packets.  In such a case one can use an
   alternative selector such as the UDP source port number.  That has
   the downside of affecting the hash values used for ECMP and link
   aggregation port selection.

7.  Introducing a ttl copyin flag in the encapsulation header

   When this approach is applied to VXLAN [RFC7348] the decapsulating
   NVE has to be able to identify packets that have to be processed in
   the uniform ttl tunnel model way.  For that purpose we define a new
   flag which is sent by the encapsulating NVE on selected packets, and
   is used by the decapsulating NVE to perform the ttl copyin,
   decrement and check.

   In addition to the one I-flag defined in [RFC7348] we define a new
   T-flag to capture the trace behavior at the decapsulating tunnel
   endpoint.

       0                   1                   2                   3
       0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
      |R|R|R|R|I|R|R|T|            Reserved                           |
      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
      |                VXLAN Network Identifier (VNI) |   Reserved    |
      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

   New fields:

   T-flag:  When set, indicates that the decapsulator should take the
      outer ttl and copy it to the inner ttl, and then check and
      decrement the resulting ttl.

8.  Encapsulation Behavior

   If the uniform ttl model is enabled for the input, and the received
   unencapsulated packet matches the selector, then the ingress NVE
   will perform these additional operations as part of encapsulating an
   IPv4 or IPv6 packet:

   o  Examine the IPv4 TTL (or IPv6 hop limit, respectively) on receipt
      and if it is 1 or less, then drop the packet and send an ICMPv4
      (or ICMPv6) ttl exceeded back to the original host.  Since the
      NVE is operating on an L2 packet, it might not have any layer 3
      interfaces or routes for the originating host.  Thus it sends the
      ICMP error to the source L2 address of the packet, out the
      ingress port, without any IP address lookup.

   o  If the ttl did not expire, then decrement the above ttl/hop limit
      and place it in the outer IP header.  Encapsulate and send the
      packet as normal.

   o  If some other error prevents sending the packet (such as an
      unknown VN Context Id or no flood list configured), then the NVE
      SHOULD send an ICMP host unreachable back to the host.

   The ingress NVE will receive ICMP errors from underlay routers and
   the egress NVE, whether due to ttl exceeded or to underlay issues
   such as host unreachable or packet too big errors.  The NVE should
   take such errors and, in addition to any local syslog etc, generate
   an ICMP error sent back to the host.  The principle for this is
   specified in [RFC1933] and [RFC2473].
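
   The encapsulation steps in this section can be sketched as follows.
   This is only an illustration under stated assumptions, not a
   normative algorithm: the names (ingress_uniform_ttl, trace_dscp,
   pipe_outer_ttl) are hypothetical, the pipe-model outer ttl of 64 is
   an arbitrary local choice, and the T-flag bit value is read off the
   header figure in Section 7 (last bit of the flags octet).

```python
from dataclasses import dataclass

VXLAN_FLAG_I = 0x08  # I-flag (valid VNI) from RFC 7348, bit 4 of the flags octet
VXLAN_FLAG_T = 0x01  # T-flag, bit 7 of the flags octet per the Section 7 figure


@dataclass
class IngressResult:
    action: str           # "encapsulate" or "icmp_ttl_exceeded"
    outer_ttl: int = 0    # ttl/hop limit to place in the outer IP header
    vxlan_flags: int = 0  # flags octet of the VXLAN header


def matches_selector(tos_byte: int, trace_dscp: int) -> bool:
    """Compare only the 6 DSCP bits; the 2 ECN bits are ignored."""
    return (tos_byte >> 2) == trace_dscp


def ingress_uniform_ttl(inner_ttl: int, tos_byte: int, trace_dscp: int,
                        pipe_outer_ttl: int = 64) -> IngressResult:
    """Hypothetical ingress NVE decision for one IPv4/IPv6 packet."""
    if not matches_selector(tos_byte, trace_dscp):
        # Pipe model: outer ttl is independent of the inner ttl, T-flag clear.
        return IngressResult("encapsulate", pipe_outer_ttl, VXLAN_FLAG_I)
    if inner_ttl <= 1:
        # ttl expired at the ingress NVE: drop and report to the host.
        return IngressResult("icmp_ttl_exceeded")
    # Uniform model: copy out the decremented ttl and mark the packet.
    return IngressResult("encapsulate", inner_ttl - 1,
                         VXLAN_FLAG_I | VXLAN_FLAG_T)
```

   With a configured trace DSCP of 46, a probe carrying TOS octet 0xB9
   (DSCP 46 with an ECN bit set) and ttl 5 would be encapsulated with
   outer ttl 4 and both the I-flag and T-flag set, while a probe with
   ttl 1 would trigger the ICMP ttl exceeded toward the host.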

   As in [RFC1933] and [RFC2473], the inner and outer IP headers could
   be of different versions.  A common case of that might be an IPv6
   overlay with an IPv4 underlay.  That case requires some changes in
   the ICMP type and code values in addition to recreating the packets.
   The place where LTTON differs from those specifications is that
   there is an NVO3 header and (for L2 over L3) an L2 header in the
   packet.

   The figures below show an example of ICMP header re-generation at
   VtepA for the case of IPv6 overlay with IPv4 underlay.  The case of
   IPv4 over IPv4 is similar and simpler since the ICMP header is the
   same for both overlay and underlay.  The example uses VXLAN
   encapsulation to provide the concrete details, but the approach
   applies to other NVO3 proposals.

                +--------------+
                | IPv4 Header  |
                | src = R1     |
                | dst = VtepA  |
                +--------------+
                |    ICMPv4    |
                |    Header    |
                | type = X     |
                | code = Y     |
         - -    +--------------+
                | IPv4 Header  |
                | src = VtepA  |
        IPv4    | dst = VtepB  |
                +--------------+
        Packet  |     UDP      |
                | dst = VXLAN  |
         in     +--------------+
                |   Ethernet   |
        Error   | DA = H2 mac  |
                | SA = H1 mac  |
                +--------------+   - -
                |     IPv6     |
                | src = H1 ipv6|
                | dst = H2 ipv6|   Original IPv6
                +--------------+   Packet.
                |  Transport   |   Used to
                |    Header    |   generate an
                +--------------+   ICMPv6
                |              |   error message
                ~     Data     ~   back to the source.
                |              |
         - -    +--------------+   - -

        ICMPv4 Error Message Returned to Encapsulating Node

   The above underlay ICMPv4 error is used to form an overlay ICMPv6
   packet by taking the Ethernet DA from the inner Ethernet SA, and
   forming an IPv6 header where the source address is based on the
   source address of the ICMPv4 error.  The ICMPv6 type and code values
   are set based on the ICMPv4 type and code values.
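
   As a rough illustration of this re-generation step, the sketch below
   maps a few ICMPv4 type/code pairs to ICMPv6 and derives the IPv6
   source address as 96 zero bits followed by the IPv4 address of the
   underlay node.  This document does not list the type and code
   mapping; the values shown are an assumption borrowed from the
   IP/ICMP translation rules of RFC 7915, and the function names are
   hypothetical.  Adjusting the MTU carried in packet too big messages
   is deliberately left out.

```python
import ipaddress

# ICMPv4 (type, code) -> ICMPv6 (type, code).
# Assumed mapping in the style of RFC 7915; not defined by this document.
ICMP4_TO_ICMP6 = {
    (11, 0): (3, 0),  # time exceeded in transit -> hop limit exceeded
    (3, 0): (1, 0),   # net unreachable          -> no route to destination
    (3, 1): (1, 0),   # host unreachable         -> no route to destination
    (3, 4): (2, 0),   # fragmentation needed     -> packet too big
}


def map_icmp(v4_type: int, v4_code: int):
    """Return the ICMPv6 (type, code), or None if no mapping is known."""
    return ICMP4_TO_ICMP6.get((v4_type, v4_code))


def underlay_to_overlay_source(v4_addr: str) -> ipaddress.IPv6Address:
    """Build the IPv6 source: 96 zero bits followed by the IPv4 address."""
    return ipaddress.IPv6Address(int(ipaddress.IPv4Address(v4_addr)))
```

   For example, an ICMPv4 ttl exceeded (type 11, code 0) received from
   2.0.1.2 would be reported to the host as an ICMPv6 type 3, code 0
   error whose source address has the underlay IPv4 address in its low
   32 bits.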

                +--------------+
                |   Ethernet   |
                | DA = H1 mac  |   From ICMPv4 packet
                | SA = VtepA   |   in error
                +--------------+
                | IPv6 Header  |
                | src = ::R1   |   96 zeros + IPv4 address
                | dst = H1 ipv6|
                +--------------+
                |    ICMPv6    |
                |    Header    |
                | type = X'    |   Type and code mapped
                | code = Y'    |   from v4 to v6 values
         - -    +--------------+   - -
                |     IPv6     |
        IPv6    | src = H1 ipv6|
                | dst = H2 ipv6|   Unmodified from
        Packet  +--------------+   ICMPv4 error
                |  Transport   |
         in     |    Header    |
                +--------------+
        Error   |              |
                ~     Data     ~
                |              |
         - -    +--------------+   - -

        Generated ICMPv6 Error Message for Overlay Source

   In the case of IPv6 over IPv4 the above example setting of the IPv6
   source address results in this type of traceroute output:

   traceroute to 2000:0:0:40::2, 30 hops max, 80 byte packets
    1  ::2.0.1.1 (::2.0.1.1)  1.231 ms  1.004 ms  1.126 ms
    2  ::2.0.1.2 (::2.0.1.2)  1.994 ms  2.301 ms  2.016 ms
    3  ::2.0.2.1 (::2.0.2.1)  18.846 ms  30.582 ms  19.776 ms
    4  2000:0:0:40::2 (2000:0:0:40::2)  48.964 ms  60.131 ms  53.895 ms

9.  Decapsulating Behavior

   If this uniform ttl model is enabled on the decapsulating NVE, and
   the overlay header indicates that the uniform ttl model applies (the
   T-flag in the case of VXLAN), then the NVE will perform these
   additional operations as part of decapsulating a packet where the
   inner packet is an IPv4 or IPv6 packet:

   o  Examine the outer IPv4 TTL (or outer IPv6 hop limit,
      respectively) on receipt and if it is 1 or less, then drop the
      packet and send an outer ICMPv4 (or ICMPv6) ttl exceeded back to
      the source of the outer packet, i.e., the ingress NVE.  This ICMP
      packet should look the same as an ICMP error generated by an
      underlay router, and the requirement in [RFC1812] on the size of
      the packet in error applies.

   o  If the ttl did not expire, then decrement the above ttl/hop limit
      and place it in the inner IP header.  If the inner IP header is
      IPv4 then update the IPv4 header checksum.
      Then decapsulate and send the packet as for other decapsulated
      packets.

   o  If some other error prevents sending the packet (such as an
      unknown VN Context Id), then the NVE SHOULD send an ICMP host
      unreachable instead of a ttl exceeded error.

10.  Other ICMP errors

   The technique for selecting ttl behavior specified in this draft can
   also be used to trigger other ICMPv4 and ICMPv6 errors.  For
   example, [RFC1933] specifies how ICMP packet too big errors from
   underlay routers can be used to report overlay ICMP packet too big
   errors to the original source.

   Other errors that are more specific to the overlay protocol might
   also be useful, such as not being able to find a VNI for the
   incoming port/VLAN, or not being able to flood the packet if the
   packet is a Broadcast, Unknown unicast, or Multicast packet.

11.  Downstream Egress Paths Object

   The Downstream Egress Paths Object MAY be appended to the ICMP Time
   Exceeded and Destination Unreachable messages.  A single instance of
   the Downstream Egress Paths Object represents the egress paths at
   the router that sends the ICMP message.  The Downstream Egress Paths
   Object must be preceded by an ICMP Extension Structure Header and an
   ICMP Object Header.  Both are defined in [RFC4884].  The format
   closely follows [RFC4379] with some generalizations for Multipath
   types.

      Class-Num = TBA by IANA, Downstream Egress Paths Class
      C-Type = 1

   If the replying router is the destination of the echo request, then
   a Downstream Egress Paths Object SHOULD NOT be included in the ICMP
   Error message.  Otherwise the replying router MAY append a
   Downstream Egress Paths Object for all interfaces over which the
   echo request packet could be forwarded.

   The Object Length is K*N + M*N, where N is the number of egress
   paths, K is the length of the fixed part of one path entry, and M is
   the Multipath Length for that egress path.  M may not be the same
   for different paths, in which case the length is the sum of K + M
   over all N paths.  Values for K are found in the description of
   Address Type below.
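
   As a worked example of the Object Length computation, the sketch
   below sums K + M over the egress paths.  The K values are copied
   from the Address Type table in this section; the function name and
   the representation of a path as an (address type, multipath length)
   pair are assumptions made for illustration only.

```python
# K octets per Address Type value, copied from the Address Type table.
K_OCTETS = {
    1: 16,  # IPv4 Numbered
    2: 16,  # IPv4 Unnumbered
    3: 40,  # IPv6 Numbered
    4: 28,  # IPv6 Unnumbered
}


def object_length(paths):
    """Sum K + M over all paths.

    paths: iterable of (address_type, multipath_length) tuples,
    one tuple per egress path.
    """
    return sum(K_OCTETS[addr_type] + mp_len for addr_type, mp_len in paths)
```

   Two IPv4 Numbered paths that each carry 4 octets of Multipath
   Information give 16*2 + 4*2 = 40 octets, which matches K*N + M*N.
   One IPv4 Numbered path without multipath information plus one IPv6
   Numbered path with 8 octets gives 16 + 48 = 64 octets.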

   The Downstream Egress Paths Object has the following format:

       0                   1                   2                   3
       0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
      |          Path-1 MTU           | Address Type  |   Reserved    |
      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
      |            Downstream IP Address (4 or 16 octets)             |
      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
      |         Downstream Interface Address (4 or 16 octets)         |
      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
      | MultipathType |       Multipath Length        |   Reserved    |
      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
      .                                                               .
      .                    (Multipath Information)                    .
      .                                                               .
      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
      ~                                                               ~
      ~                                                               ~
      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
      |          Path-N MTU           | Address Type  |   Reserved    |
      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
      |            Downstream IP Address (4 or 16 octets)             |
      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
      |         Downstream Interface Address (4 or 16 octets)         |
      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
      | MultipathType |       Multipath Length        |   Reserved    |
      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
      .                                                               .
      .                    (Multipath Information)                    .
      .                                                               .
      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

                     Downstream Egress Paths Object

   Maximum Transmission Unit (MTU):

      The MTU is the size in octets of the largest IP frame that fits
      on the downstream interface.

   Address Type:

      The Address Type indicates if the interface is numbered or
      unnumbered.  It also determines the length of the Downstream IP
      Address and Downstream Interface Address fields.  The resulting
      total length of the fixed part of one path entry in the
      Downstream Egress Paths Object is listed in the table below as
      "K Octets".
      The Address Type is set to one of the following values:

         Type #        Address Type           K Octets
         ------        ------------           --------
              1        IPv4 Numbered                16
              2        IPv4 Unnumbered              16
              3        IPv6 Numbered                40
              4        IPv6 Unnumbered              28

   Downstream IP Address and Downstream Interface Address:

      IPv4 addresses and interface indices are encoded in 4 octets;
      IPv6 addresses are encoded in 16 octets.

      If the interface to the downstream router has a unique IP address
      (e.g., it is numbered and not a LAG), then the Address Type MUST
      be set to IPv4 Numbered or IPv6 Numbered, the Downstream IP
      Address MUST be set to either the downstream router's Router ID
      or the interface address of the downstream router, and the
      Downstream Interface Address MUST be set to the downstream
      router's interface address.

      If the interface to the downstream router does not have a unique
      IP address (e.g., it is unnumbered or a LAG), the Address Type
      MUST be IPv4 Unnumbered or IPv6 Unnumbered, the Downstream IP
      Address MUST be the downstream router's Router ID or the
      interface address of the downstream router, and the Downstream
      Interface Address MUST be set to the index assigned by the
      upstream router to the interface.

   Multipath Type:

      The following Multipath Types are defined:

         Key   Type                 Multipath Information
         ---   ----------------     ---------------------
          0    no multipath         Empty (Multipath Length = 0)
          1    MAC SA/DA            Inner MAC in tunnel payload
          2    IP Src/Dest          Inner IP src/dest in tunnel payload
          3    L4 src port          L4 src ports in tunnel payload
          4    L4 src port range    low/high L4 src port pairs

      Type 0 indicates that all packets will be forwarded out this one
      interface.

      Types 1 through 4 specify that the supplied Multipath Information
      will serve to exercise this path.

   Multipath Length:

      The length in octets of the Multipath Information.

   Multipath Information:

      The Multipath Information encodes the values, such as L4 source
      ports, that will exercise this path.  The Multipath Information
      depends on the Multipath Type.
The contents of the field are shown in the table above. For Type 4, ranges indicated by L4 source port pairs MUST NOT overlap and MUST be in ascending sequence. 12. Security Considerations The considerations in [I-D.ietf-nvo3-security-requirements] apply. In addition, the use of the uniform ttl tunnel model will result in ICMP errors being generated by underlay routers and consumed by NVEs. That resents an attack vector which does not exist in a pipe ttl tunnel model. However, ICMP errors should be rate limited [RFC1812]. Implementations should also take appropriate measures in rate limiting the input rate for ICMP errors that are processed by limited CPU resources. Some implementations might handle the trace packets (with uniform ttl model) in software while the pipe ttl model packets can be handled in hardware. In such a case the implementation should have mechanisms to avoid starvation of limited CPU resources due to these packets. 13. IANA Considerations TBD 14. Acknowledgements The authors acknowledge the helpful comments from David Black and Diego Garcia del Rio. 15. References Nordmark, et al. Expires January 2, 2017 [Page 19] Internet-Draft LTTON Jul 2016 15.1. Normative References [RFC0792] Postel, J., "Internet Control Message Protocol", STD 5, RFC 792, DOI 10.17487/RFC0792, September 1981, . [RFC1812] Baker, F., Ed., "Requirements for IP Version 4 Routers", RFC 1812, DOI 10.17487/RFC1812, June 1995, . [RFC2119] Bradner, S., "Key words for use in RFCs to Indicate Requirement Levels", BCP 14, RFC 2119, DOI 10.17487/ RFC2119, March 1997, . [RFC7348] Mahalingam, M., Dutt, D., Duda, K., Agarwal, P., Kreeger, L., Sridhar, T., Bursell, M., and C. Wright, "Virtual eXtensible Local Area Network (VXLAN): A Framework for Overlaying Virtualized Layer 2 Networks over Layer 3 Networks", RFC 7348, DOI 10.17487/RFC7348, August 2014, . [RFC7365] Lasserre, M., Balus, F., Morin, T., Bitar, N., and Y. 
              Rekhter, "Framework for Data Center (DC) Network
              Virtualization", RFC 7365, DOI 10.17487/RFC7365,
              October 2014.

15.2.  Informative References

   [I-D.gross-geneve]
              Gross, J., Sridhar, T., Garg, P., Wright, C., Ganga, I.,
              Agarwal, P., Duda, K., Dutt, D., and J. Hudson, "Geneve:
              Generic Network Virtualization Encapsulation",
              draft-gross-geneve-02 (work in progress), October 2014.

   [I-D.herbert-gue]
              Herbert, T., Yong, L., and O. Zia, "Generic UDP
              Encapsulation", draft-herbert-gue-03 (work in progress),
              March 2015.

   [I-D.ietf-nvo3-security-requirements]
              Hartman, S., Zhang, D., Wasserman, M., Qiang, Z., and M.
              Zhang, "Security Requirements of NVO3",
              draft-ietf-nvo3-security-requirements-07 (work in
              progress), June 2016.

   [I-D.sridharan-virtualization-nvgre]
              Garg, P. and Y. Wang, "NVGRE: Network Virtualization
              using Generic Routing Encapsulation",
              draft-sridharan-virtualization-nvgre-08 (work in
              progress), April 2015.

   [I-D.tissa-lime-yang-oam-model]
              Senevirathne, T., Finn, N., Kumar, D., Salam, S., Wu, Q.,
              and Z. Wang, "Generic YANG Data Model for Operations,
              Administration, and Maintenance (OAM)",
              draft-tissa-lime-yang-oam-model-06 (work in progress),
              August 2015.

   [RFC1933]  Gilligan, R. and E. Nordmark, "Transition Mechanisms for
              IPv6 Hosts and Routers", RFC 1933, DOI 10.17487/RFC1933,
              April 1996.

   [RFC2473]  Conta, A. and S. Deering, "Generic Packet Tunneling in
              IPv6 Specification", RFC 2473, DOI 10.17487/RFC2473,
              December 1998.

   [RFC2983]  Black, D., "Differentiated Services and Tunnels",
              RFC 2983, DOI 10.17487/RFC2983, October 2000.

   [RFC3270]  Le Faucheur, F., Wu, L., Davie, B., Davari, S., Vaananen,
              P., Krishnan, R., Cheval, P., and J. Heinanen,
              "Multi-Protocol Label Switching (MPLS) Support of
              Differentiated Services", RFC 3270,
              DOI 10.17487/RFC3270, May 2002.

   [RFC3443]  Agarwal, P. and B.
              Akyol, "Time To Live (TTL) Processing in Multi-Protocol
              Label Switching (MPLS) Networks", RFC 3443,
              DOI 10.17487/RFC3443, January 2003.

   [RFC4379]  Kompella, K. and G. Swallow, "Detecting Multi-Protocol
              Label Switched (MPLS) Data Plane Failures", RFC 4379,
              DOI 10.17487/RFC4379, February 2006.

   [RFC4884]  Bonica, R., Gan, D., Tappan, D., and C. Pignataro,
              "Extended ICMP to Support Multi-Part Messages", RFC 4884,
              DOI 10.17487/RFC4884, April 2007.

   [RFC4950]  Bonica, R., Gan, D., Tappan, D., and C. Pignataro, "ICMP
              Extensions for Multiprotocol Label Switching", RFC 4950,
              DOI 10.17487/RFC4950, August 2007.

Authors' Addresses

   Erik Nordmark
   Arista Networks
   Santa Clara, CA
   USA

   Email: nordmark@arista.com


   Chandra Appanna
   Arista Networks
   Santa Clara, CA
   USA

   Email: achandra@arista.com


   Alton Lo
   Arista Networks
   Santa Clara, CA
   USA

   Email: altonlo@arista.com


   Sami Boutros
   VMware

   Email: sboutros@vmware.com


   Ankur Dubey
   VMware

   Email: adubey@vmware.com