Network Working Group                                          H. Zheng
Internet Draft                                               CEIE, BJTU
Intended status: Experimental                                   C. Qiao
Expires: December 28, 2016                                         SUNY
                                                                K. Chen
                                                                Y. Zhao
                                                             CEIE, BJTU
                                                          June 25, 2016


      An Effective Approach to Preventing TCP Incast Throughput
                  Collapse for Data Center Networks
                     draft-zheng-tcpincast-00.txt

Status of this Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF), its areas, and its working groups.  Note that
   other groups may also distribute working documents as Internet-
   Drafts.

   Internet-Drafts are draft documents valid for a maximum of six
   months and may be updated, replaced, or obsoleted by other documents
   at any time.  It is inappropriate to use Internet-Drafts as
   reference material or to cite them other than as "work in progress."

   The list of current Internet-Drafts can be accessed at
   http://www.ietf.org/ietf/1id-abstracts.txt

   The list of Internet-Draft Shadow Directories can be accessed at
   http://www.ietf.org/shadow.html

   This Internet-Draft will expire on December 28, 2016.

Copyright Notice

   Copyright (c) 2016 IETF Trust and the persons identified as the
   document authors.  All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (http://trustee.ietf.org/license-info) in effect on the date of
   publication of this document.  Please review these documents
   carefully, as they describe your rights and restrictions with
   respect to this document.

Abstract

   This document presents an effective solution to the known TCP incast
   problem in data center networks.  The incast problem refers to a
   drastic drop in TCP throughput when the number of servers
   synchronously sending data to the same receiver becomes too large.
   Our idea is to avoid packet losses before TCP incast happens, by
   limiting the number of concurrent senders so that the link is filled
   as fully as possible without packet losses.  In this document we
   examine the condition under which the link can be saturated without
   packet losses.  Based on this condition, we then propose an approach
   to estimating a reasonable number of concurrent senders.  Our
   approach does not modify the TCP protocol itself and can thus be
   applied to any TCP variant, and it works regardless of the data
   center network topology and throughput limitation.  Analysis and
   simulation results show that our approach eliminates the incast
   problem and noticeably improves TCP throughput.

Table of Contents

   1. Introduction
   2. Model
      2.1. TCP Incast Model
      2.2. TCP rabbits in data centers
   3. The Condition SBNPL
      3.1. Why limit the number of concurrent senders
      3.2. Assumptions and notations
      3.3. The condition SBNPL
   4. Performance evaluation
      4.1. Simulation Configuration
      4.2. Throughput performance
      4.3. The effect of the buffer size limitation on throughput
      4.4. The effect of BDP on throughput
   5. Related Work
   6. Conclusions
   7. Security Considerations
   8. IANA Considerations
   9. References
   10. Acknowledgments

1. Introduction

   Data centers have become one of the hottest topics in both the
   research community and the IT industry.  It is attractive to build
   data centers atop standard TCP/IP and Ethernet because of their
   technological advantages and economies of scale.  Although TCP is
   widely used in the Internet and works well there, it does not work
   well in data center networks.  A reported open problem is TCP incast
   [1-4].

   TCP incast, which results in gross under-utilization of link
   capacity, occurs in synchronized many-to-one communication patterns.
   In such a communication pattern, a receiver issues data requests to
   multiple senders.  The senders respond to the request and return an
   amount of data to the receiver.  The data from all senders pass
   through a bottleneck link in a many-to-one pattern to the receiver.
   When the number of synchronized senders increases, the throughput
   observed at the receiver drops to one or two orders of magnitude
   below the link capacity.

   Several factors, including the high bandwidth and low latency of a
   data center network, lead to TCP incast [3].  The barrier-
   synchronized many-to-one communication pattern triggers multiple
   servers to transmit packets to a receiver concurrently, so that a
   potentially large amount of traffic is poured into the network
   simultaneously.  Due to the limited buffer space at the switches,
   such traffic can overload a buffer, resulting in packet losses.  TCP
   recovers from most of these packet losses through timeout
   retransmission.  The timeout duration is at least hundreds of
   milliseconds, which is orders of magnitude greater than a typical
   round-trip time in a data center network.  A server that suffers a
   timeout is stalled, even though the other servers could use the
   available bandwidth to complete their transmissions.  Because of the
   synchronized communication, however, the receiver has to wait for
   the slowest server, namely the one that suffers the timeout.  During
   such a waiting period the bottleneck link may be completely idle.
   This results in under-utilization of the link and performance
   collapse.

   This document addresses the above TCP throughput collapse problem.
   We propose an approach to preventing TCP throughput collapse.  We
   focus on avoiding packet losses before TCP incast happens.  The
   basic idea is to restrict the number of servers allowed to transmit
   data to the receiver at the same time to a reasonable value, so as
   to avoid TCP incast.

   For a given limited amount of data required by the receiver, too few
   servers sending at a time may not fully utilize the bandwidth of the
   link.  Too many servers sending to the receiver simultaneously,
   however, will overfill the limited switch buffer, resulting in TCP
   timeouts and then TCP incast.  Accordingly, the number of servers
   that are permitted to send data at the same time should be carefully
   determined.
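
   To make the cost of such timeouts concrete, the short back-of-
   envelope sketch below compares the ideal transfer time of one data
   block with a typical minimum retransmission timeout.  The 1 MB block
   size matches the experiment in Section 3.1; the 1 Gbps link rate and
   the 200 ms minimum RTO are assumed example values, not figures taken
   from this draft's simulations.

      # Back-of-envelope comparison (an illustrative sketch, not part
      # of this draft): how long one timeout stalls a round, relative
      # to the time it would ideally take to transfer a whole block.
      # The 1 MB block size matches Section 3.1; the link rate and the
      # 200 ms minimum RTO are assumed example values.

      BLOCK_BITS  = 1 * 1024 * 1024 * 8   # one 1 MB data block, in bits
      LINK_BPS    = 1e9                   # assumed 1 Gbps bottleneck
      RTO_MIN_SEC = 0.200                 # assumed minimum RTO

      block_time = BLOCK_BITS / LINK_BPS  # ideal block transfer time
      print("ideal block transfer time: %.1f ms" % (block_time * 1e3))
      print("a single timeout stalls the round ~%.0f times longer"
            % (RTO_MIN_SEC / block_time))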

   Our objective is to find a reasonable value for the number of co-
   existing senders that ensures the aggregate bandwidth is utilized
   efficiently but without packet losses.  Our approach is to calculate
   a reasonable number of servers allowed to send data at the same
   time.  In this document, we give the condition under which the
   bottleneck link can be saturated but no packet losses occur (SBNPL).
   We show that if there is a sufficient amount of traffic on the
   bottleneck link while the backlog created at the switch buffer never
   exceeds the buffer size, the bottleneck link can be utilized
   effectively without losses.  Based on this condition, we then
   propose an approach to calculating a reasonable number of concurrent
   TCP senders.  The simulation results and analysis show that our
   approach noticeably improves throughput performance.

   The approach is an application-layer mechanism.  It can be realized
   as a global scheduling application, or as an application that
   staggers requests to a limited number of senders.  It can also be
   implemented on the senders, which can coordinate to skew their
   responses.  The advantage of our approach is that it does not need
   to modify the TCP protocol or the switch.

   The remainder of this document is organized as follows.  In section
   2, we describe the model discussed in this document, including the
   TCP incast model and TCP flow characteristics in data centers.  We
   examine the condition under which the bottleneck link can be
   saturated without packet losses in section 3.  Simulation results
   for evaluating performance are presented in section 4.  Section 5
   discusses related work, and in section 6 we conclude this document
   and discuss future work.

2. Model

2.1. TCP Incast Model

   TCP incast occurs in synchronized many-to-one communication
   patterns.  Research has shown that many applications in data centers
   use this many-to-one communication pattern [3,5,6].  A simplified
   model was originally presented in [1].  It models a distributed
   storage system in which a file is split into data blocks.  Each data
   block is striped across N servers, such that each server stores a
   chunk of that particular data block, denoted a Sender Request Unit
   (SRU).  We call the number of servers across which the file is
   striped the striping width.  The receiver requests a particular
   block from the servers; each server responds and returns the chunk,
   i.e., the SRU, stored on it.  Only after the receiver has received
   all chunks of the current block does it request the next block.
   Essentially, this is the Bulk Synchronous Parallel (BSP) model [7].

   Built atop TCP/IP and Gigabit Ethernet, the underlying network
   provides high bandwidth (1~10 Gbps, or even higher) and low latency
   (an RTT of 10~100 microseconds).  The link connecting the receiver
   and the switch is the only bottleneck.

   In our model, each sender opens only one TCP connection for
   transmission.  TCP connections are kept open until the transfer of
   the whole file completes, or until the receiver does not request any
   more data.  This is because it is not necessary to open an
   individual connection for transferring each data block of the file.

   The transfer of a whole file consists of 'rounds'.  A round begins
   when the receiver requests a data block, and ends when the receiver
   has received all chunks of that block.  In any round, each TCP
   connection transfers a limited amount of data, equal to the size of
   an SRU, according to the TCP congestion control algorithm.
   Consequently, the transfer of an SRU in each round can be viewed as
   a succession of 'sub-rounds'.
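
   Before turning to the sub-round dynamics below, the following
   minimal sketch (in Python) illustrates this round structure from the
   receiver's side, together with the application-layer idea from the
   Introduction of staggering requests to a limited number of senders.
   The function names, the request helper, and the use of a thread pool
   are assumptions made purely for illustration; they are not part of
   this draft's specification.

      # A minimal, illustrative receiver-side model of rounds: fetch
      # one block at a time, with at most m requests outstanding.
      # 'request_sru(server, block_id)' is a hypothetical blocking call
      # that returns the SRU stored on 'server' for the given block.

      from concurrent.futures import ThreadPoolExecutor

      def fetch_block(block_id, servers, m, request_sru):
          """Fetch one data block striped across 'servers', allowing at
          most m senders to be transmitting at the same time."""
          chunks = {}
          # max_workers=m caps the number of concurrently active senders.
          with ThreadPoolExecutor(max_workers=m) as pool:
              futures = {pool.submit(request_sru, s, block_id): s
                         for s in servers}
              for fut, server in futures.items():
                  chunks[server] = fut.result()  # barrier: wait for all SRUs
          return chunks

      def fetch_file(num_blocks, servers, m, request_sru):
          # Rounds are strictly sequential: the next block is requested
          # only after the slowest sender has delivered its SRU.
          return [fetch_block(b, servers, m, request_sru)
                  for b in range(num_blocks)]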

   At the beginning of any sub-round, each TCP sender transmits SEND_W
   packets back-to-back, where SEND_W is the current size of the TCP
   send window.  These are the so-called 'outstanding packets': packets
   that have been sent but whose ACKs have not yet been received by the
   sender.  If there are no packet losses, each outstanding packet must
   either be in the buffer queue or in the delay pipe (i.e., the packet
   is in the pipe from the sender to the receiver, or its associated
   ACK packet is in the pipe in the other direction).  Once all packets
   falling within the current send window have been transmitted, no
   other packets are sent until the first ACK is received for one of
   these packets.  The reception of this ACK signals the end of the
   current sub-round and the beginning of the next sub-round, so the
   duration of a sub-round is one RTT.  At the beginning of the next
   sub-round, a group of NEW_SEND_W new packets is sent, where
   NEW_SEND_W is the new size of the TCP send window.  When all packets
   associated with that particular SRU have been acknowledged, the
   transfer of that particular block chunk completes.  The next round
   starts when the slowest TCP connection completes the transfer of its
   corresponding SRU.  Note that when a new round begins, the TCP
   congestion window keeps the value it had when the last round ended.

2.2. TCP rabbits in data centers

   TCP has unique characteristics in the context of data centers.  More
   specifically, a TCP flow is generally referred to as either a TCP
   mouse, if it is short (so short that it never exits the slow-start
   phase), or a TCP elephant, if it has a bulk of data to transfer and
   lasts very long.  From the discussion above, we observe that the
   size of a TCP flow in a data center falls between that of a mouse
   and that of an elephant.  In particular, a TCP flow may enter both
   the slow-start phase and the congestion avoidance phase during data
   transmission.  We call such a TCP flow a "rabbit".

   Note that a TCP rabbit is elusive: it may exit slow start and enter
   congestion avoidance in the first round, during which the first data
   block is transmitted (in the case of narrow striping), or in later
   rounds, during which subsequent data blocks are sent (in the case of
   wider striping).  While one may use different equations to model a
   TCP rabbit's behavior in different rounds, or for different striping
   widths, doing so greatly complicates the implementation.  Moreover,
   when there are multiple co-active rabbits, the interaction between
   them makes it much harder to model and predict their behavior.  Our
   objective is therefore to find an approach that is easy to
   implement; works regardless of the TCP phase, the TCP parameter
   configuration, and the scale of the data center; and does not need
   to modify the TCP protocol itself.

3. The Condition SBNPL

   Denote by m the reasonable number of servers allowed to transmit
   simultaneously as senders.  In this section we first explain why
   limiting the number of concurrent senders can guarantee the
   throughput of the receiver.  We then deduce the condition under
   which the link can be saturated but no packet losses occur.

3.1. Why limit the number of concurrent senders

   Given the striping width, we observe the throughput of the receiver
   while varying the number of concurrent TCP senders, i.e., the value
   of m.  Our experiment (for a block of 1 MB striped across 256
   servers, with a buffer size of 64 KB) shows that throughput first
   increases and then drops as m increases.

   Because the volume of data sent by each TCP sender is limited to
   just 4 KB (i.e., the SRU size in this experiment), when there are
   few concurrent TCP senders the total amount of data transmitted by
   these senders is not sufficient to saturate the link, so the
   throughput of the receiver is low.  As the value of m increases,
   more concurrent TCP senders pour data onto the link, and throughput
   increases until it reaches a peak.  When the value of m becomes
   larger still, the potentially large amount of traffic contributed by
   these m concurrent senders exhausts the buffering capacity of the
   link, which leads to packet losses; the throughput then falls from
   the peak instead of rising.  We call this peak throughput the
   optimal throughput, and the value of m corresponding to the optimal
   throughput the optimal number of concurrent senders.

   If the number of concurrent TCP senders exactly equals the optimal
   value, the receiver can obtain the optimal throughput.  The
   simulation results also show that even if the number of concurrent
   senders is not optimal, there exists a range of values of m over
   which the receiver obtains good throughput performance.  By "good"
   throughput performance, we mean that the receiver achieves 98
   percent or more of the optimal throughput.  Hence we view the values
   of m within this range as reasonable numbers of concurrent senders.
   It may be hard to find the exact optimal value of m at which the
   receiver achieves the optimal throughput, whereas finding a
   reasonable value of m is more feasible.  In order to find a
   reasonable value of m, we deduce the condition SBNPL in the
   following subsections.

3.2. Assumptions and notations

   Ignoring the time it takes to read data from disks or caches, each
   sender responds and returns its corresponding chunk as soon as it
   receives a request from the receiver.  Thus at the beginning of any
   sub-round, all senders inject data into the network simultaneously.
   Because the TCP connections are synchronized, an assumption also
   adopted in [8], they share both the buffer and the bottleneck link
   bandwidth quite equitably.  That is, each TCP connection obtains a
   buffer share of B/m and a bandwidth share of C/m, where B is the
   buffer size and C is the bandwidth of the bottleneck link.  Assume
   that both the TCP send and receive socket buffer sizes are large
   enough that the TCP send window SEND_W is limited by the congestion
   window CON_W, i.e., SEND_W = CON_W.  All links have the same
   bandwidth C and propagation delay d.  The switch has buffer capacity
   B and manages its queue with the Drop Tail algorithm.

3.3. The condition SBNPL

   It is enough to analyze the transfer within a single round, since in
   every round the data block has the same size and is transferred
   according to the TCP congestion control algorithm.  It is worth
   noting that at the beginning of a round, the TCP send window size is
   the one the connection had when the last round ended.  Due to
   synchronization, the window evolutions of all TCP connections are
   synchronized; at any sub-round, the total amount of data poured onto
   the bottleneck, M, is simply the sum of the amounts of data
   contributed by the individual connections at the current time,
   denoted SUM_M.

   As mentioned above, if measured in units of packets, these M/P
   outstanding packets, where P is the size of a packet, must either be
   in the buffer queue or in the delay pipe, assuming no packet losses.

   If M is not less than the bandwidth-delay product of the bottleneck,
   the bottleneck can be saturated.  When these m TCP connections share
   the bottleneck, each connection's bandwidth-delay product equals its
   share of the link bandwidth times the RTT it experiences across the
   bottleneck.  Because the connections are synchronized, each
   connection's share of the bottleneck is equitable (i.e., C/m) and
   their RTTs are similar, so the total bandwidth-delay product is the
   sum of the m connections' bandwidth-delay products, which equals the
   bottleneck bandwidth times the average RTT, i.e., C*RTT_AVR [9].
   Ignoring the transmission time on the links between the senders (or
   the receiver) and the switch, as well as the processing time at the
   switch, the senders, and the receiver, RTT_AVR is four times the
   propagation delay of the bottleneck link.

   It was shown in [10] that when there is only one TCP connection on
   the bottleneck link, the link can be saturated with no packet losses
   as long as, at any time, the amount of data M poured onto the link
   satisfies the condition

      C*RTT_AVR <= M <= C*RTT_AVR + B

   A simple extension of this condition to the case where m concurrent
   synchronized TCP connections share the link is

      C*RTT_AVR <= SUM_M <= C*RTT_AVR + B

   However, for the TCP incast problem, this condition does not hold
   because of the highly bursty traffic.  At the beginning of a round,
   it is possible that the window of each TCP connection has grown
   large enough that SEND_W packets are transmitted back-to-back,
   resulting in a flash crowd at the buffer.  Such traffic burstiness
   can create a backlog at the buffer even when M <= C*RTT_AVR and B
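
   As a rough illustration of the condition above (a back-of-envelope
   sketch only, not the procedure specified by this draft), the
   following Python fragment evaluates the bounds on SUM_M for one set
   of example parameters and translates them into an approximate range
   for m.  The 64 KB buffer matches the value used in Section 3.1; the
   link speed, propagation delay, per-sender window W, and packet size
   P are assumed example values.

      # Illustrative evaluation of  C*RTT_AVR <= SUM_M <= C*RTT_AVR + B
      # under the assumptions of Section 3.2.  All numbers are example
      # values; the estimate of m is a back-of-envelope approximation,
      # not the draft's algorithm.

      C = 1e9                 # assumed bottleneck bandwidth, bits/second
      d = 25e-6               # assumed per-link propagation delay, seconds
      B = 64 * 1024 * 8       # 64 KB switch buffer, in bits (Section 3.1)
      RTT_AVR = 4 * d         # Section 3.3: four propagation delays

      low  = C * RTT_AVR      # minimum SUM_M needed to saturate the link
      high = C * RTT_AVR + B  # maximum SUM_M that avoids buffer overflow

      # If each of the m synchronized senders contributes about W packets
      # of P bytes per sub-round, then SUM_M is roughly m * W * P bytes.
      W, P = 4, 1500                       # assumed window (packets), MSS
      per_sender_bits = W * P * 8
      print("SUM_M must lie in [%.1f KB, %.1f KB]" % (low/8192, high/8192))
      print("=> roughly m between %d and %d concurrent senders"
            % (-(-low // per_sender_bits), high // per_sender_bits))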