kernel_xiaomi_cepheus/include/uapi/linux/inet_diag.h
Neal Cardwell 2f9c0900ea BACKPORT: FROMGIT: net-tcp_bbr: BBRv2 for Linux TCP
BBR v2 is an enhancement to the BBR v1 algorithm. It's designed to aim for lower
queues, lower loss, and better Reno/CUBIC coexistence than BBR v1.

BBR v2 maintains the core of BBR v1: an explicit model of the network path that
is two-dimensional, adapting to estimate the (a) maximum available bandwidth
and (b) maximum safe volume of data a flow can keep in-flight in the
network. It maintains the estimated BDP as a core guide for estimating an
appropriate level of in-flight data.

BBR v2 makes several key enhancements:

o Its bandwidth-probing time scale is adapted, within bounds, to allow improved
coexistence with Reno and CUBIC. The bandwidth-probing time scale is (a)
extended dynamically based on estimated BDP to improve coexistence with
Reno/CUBIC; (b) bounded by an interactive wall-clock time-scale to be more
scalable and responsive than Reno and CUBIC.

o Rather than being largely agnostic to loss and ECN marks, it explicitly uses
loss and (DCTCP-style) ECN signals to maintain its model.

o It aims for lower losses than v1 by adjusting its model to attempt to stay
within loss rate and ECN mark rate bounds (loss_thresh and ecn_thresh,
respectively).

o It adapts to loss/ECN signals even when the application is running out of
data ("application-limited"), in case the "application-limited" flow is also
"network-limited" (the bw and/or inflight available to this flow is lower than
previously estimated when the flow ran out of data).

o It has a three-part model: the model explicitly tracks three operating points,
where an operating point is a tuple: (bandwidth, inflight). The three operating
points are:

  o latest:        the latest measurement from the current round trip
  o upper bound:   robust, optimistic, long-term upper bound
  o lower bound:   robust, conservative, short-term lower bound

These are stored in the following state variables:

  o latest:  bw_latest, inflight_latest
  o lo:      bw_lo,     inflight_lo
  o hi:      bw_hi[2],  inflight_hi

To gain intuition about the meaning of the three operating points, it
may help to consider the analogs in CUBIC, which has a somewhat
analogous three-part model used by its probing state machine:

  BBR param     CUBIC param
  -----------   -------------
  latest     ~  cwnd
  lo         ~  ssthresh
  hi         ~  last_max_cwnd

The analogy is only a loose one, though, since the BBR operating
points are calculated differently, and are 2-dimensional (bw,inflight)
rather than CUBIC's one-dimensional notion of operating point
(inflight).
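
As a rough illustration of how the three operating points might be held
and combined, here is a minimal sketch. The struct and helper names
(bbr2_model, model_max_bw, model_bw, model_inflight_cap) are hypothetical
and are not taken from tcp_bbr.c; only the field names come from the list
above. Treating bw_hi[2] as a two-slot max filter, and capping the long-term
bound by the short-term bound, are assumptions made for this sketch.

```c
#include <stdint.h>

/* Hypothetical container mirroring the state variables listed above.
 * Units are left abstract; the kernel's internal units differ.
 */
struct bbr2_model {
	uint32_t bw_latest, inflight_latest;	/* latest round-trip sample */
	uint32_t bw_lo, inflight_lo;		/* short-term, conservative */
	uint32_t bw_hi[2], inflight_hi;		/* long-term, optimistic */
};

static uint32_t min_u32(uint32_t a, uint32_t b) { return a < b ? a : b; }
static uint32_t max_u32(uint32_t a, uint32_t b) { return a > b ? a : b; }

/* Assumption: bw_hi[] is a two-slot max filter, so take the larger slot. */
static uint32_t model_max_bw(const struct bbr2_model *m)
{
	return max_u32(m->bw_hi[0], m->bw_hi[1]);
}

/* Assumption: the usable bw is the long-term bound capped by bw_lo. */
static uint32_t model_bw(const struct bbr2_model *m)
{
	return min_u32(model_max_bw(m), m->bw_lo);
}

/* Assumption: inflight is capped by the tighter of the two bounds. */
static uint32_t model_inflight_cap(const struct bbr2_model *m)
{
	return min_u32(m->inflight_hi, m->inflight_lo);
}
```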

o It uses the three-part model to adapt the magnitude of its bandwidth
probing to match the estimated space available in the buffer, rather than (as
in BBR v1) assuming that it was always acceptable to place 0.25*BDP in
the bottleneck buffer when probing (commodity datacenter switches
commonly do not have that much buffer for WAN flows). When BBR v2
estimates it hit a buffer limit during probing, its bandwidth probing
then starts gently in case little space is still available in the
buffer, and then accelerates, slowly at first and then rapidly if it
can grow inflight without seeing congestion signals. In such cases,
probing is bounded by inflight_hi + inflight_probe, where
inflight_probe grows as: [0, 1, 2, 4, 8, 16,...]. This allows BBR to
keep losses low and bounded if a bottleneck remains congested, while
rapidly/scalably utilizing free bandwidth when it becomes available.
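
The probing bound described in this item can be sketched as follows. This
is only an illustration of the growth sequence [0, 1, 2, 4, 8, 16,...] and
the inflight_hi + inflight_probe bound stated above; the helper names
next_inflight_probe() and probe_bound() are hypothetical, and the saturating
arithmetic is an assumption added to keep the sketch well-defined.

```c
#include <stdint.h>

/* Hypothetical helper: next value in the sequence 0, 1, 2, 4, 8, 16, ... */
static uint32_t next_inflight_probe(uint32_t cur)
{
	if (cur == 0)
		return 1;
	if (cur > UINT32_MAX / 2)
		return UINT32_MAX;	/* saturate instead of wrapping */
	return cur * 2;
}

/* Bound on inflight while probing: inflight_hi + inflight_probe. */
static uint32_t probe_bound(uint32_t inflight_hi, uint32_t inflight_probe)
{
	uint64_t sum = (uint64_t)inflight_hi + inflight_probe;

	return sum > UINT32_MAX ? UINT32_MAX : (uint32_t)sum;
}
```

Starting from inflight_probe = 0, the first probe adds nothing beyond
inflight_hi; each later step without congestion signals doubles the extra
volume being probed.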

o It has a slightly revised state machine, to achieve the goals above.
    BBR_BW_PROBE_UP:     push up inflight to probe for bw/vol
    BBR_BW_PROBE_DOWN:   drain excess inflight from the queue
    BBR_BW_PROBE_CRUISE: use pipe, w/ headroom in queue/pipe
    BBR_BW_PROBE_REFILL: try to refill the pipe again to 100%, leaving queue empty

o The estimated BDP: BBR v2 continues to maintain an estimate of the
path's two-way propagation delay, by tracking a windowed min_rtt, and
coordinating (on an as-needed basis) to try to expose the two-way
propagation delay by draining the bottleneck queue.

BBR v2 continues to use its min_rtt and (currently-applicable) bandwidth
estimate to estimate the current bandwidth-delay product. The estimated BDP
still provides one important guideline for bounding inflight data. However,
because any min-filtered RTT and max-filtered bw inherently tend to both
overestimate, the estimated BDP is often too high; in this case loss or ECN
marks can ensue, in which case BBR v2 adjusts inflight_hi and inflight_lo to
adapt its sending rate and inflight down to match the available capacity of the
path.
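
For intuition, the BDP estimate is just bandwidth multiplied by min_rtt.
The sketch below uses the units exported to userspace in struct tcp_bbr_info
(bytes per second and microseconds) rather than the kernel's internal
pkts/uS << 24 representation; estimate_bdp_bytes() is a hypothetical helper
name, not a function from tcp_bbr.c.

```c
#include <stdint.h>

#define USEC_PER_SEC 1000000ULL

/* Hypothetical helper: BDP in bytes from bw (bytes/sec) and min_rtt (usec).
 * The bandwidth is split into whole and fractional megabytes-per-second
 * parts so the intermediate products stay within 64 bits for realistic
 * inputs.
 */
static uint64_t estimate_bdp_bytes(uint64_t bw_bytes_per_sec,
				   uint32_t min_rtt_us)
{
	uint64_t whole = bw_bytes_per_sec / USEC_PER_SEC;
	uint64_t frac = bw_bytes_per_sec % USEC_PER_SEC;

	return whole * min_rtt_us + frac * min_rtt_us / USEC_PER_SEC;
}
```

For example, a 100 Mbit/s path (12,500,000 bytes/sec) with a 40 ms min_rtt
gives 12,500,000 * 0.040 = 500,000 bytes.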

o Space: Note that ICSK_CA_PRIV_SIZE increased. This is because BBR v2
adds 17 more u32 to struct bbr. However, there are 11 u32 fields from
BBR v1 that we can remove after switching to BBR v2:

        struct minmax bw;       /* Max recent delivery rate in pkts/uS << 24 */
        u32     rtt_cnt;            /* count of packet-timed rounds elapsed */
                ...
                packet_conservation:1,  /* use packet conservation? */
                ...
                lt_is_sampling:1,    /* taking long-term ("LT") samples now? */
                lt_rtt_cnt:7,        /* round trips in long-term interval */
                lt_use_bw:1;         /* use lt_bw as our bw estimate? */
        u32     lt_bw;               /* LT est delivery rate in pkts/uS << 24 */
        u32     lt_last_delivered;   /* LT intvl start: tp->delivered */
        u32     lt_last_stamp;       /* LT intvl start: tp->delivered_mstamp */
        u32     lt_last_lost;        /* LT intvl start: tp->lost */

  So ultimately BBR v2 uses 17-11 = 6 more u32 fields than v1.

o Code: BBR v2 reuses many pieces from BBR v1. But it omits the following
  significant pieces:

  o "packet conservation" (bbr_set_cwnd_to_recover_or_restore(),
    bbr_can_grow_inflight())
  o long-term bandwidth estimator ("policer mode")

  The code layout tries to keep BBR v2 code near the bottom of the
  file, so that v1-applicable code in the top does not accidentally
  refer to v2 code.

o Docs:
  See the following docs for more details and diagrams describing the BBR v2
  algorithm:
    https://datatracker.ietf.org/meeting/104/materials/slides-104-iccrg-an-update-on-bbr-00
    https://datatracker.ietf.org/meeting/102/materials/slides-102-iccrg-an-update-on-bbr-work-at-google-00

o Internal notes:
  For this upstream rebase, Neal started from:
    git show 6f6734c1c3c4:net/ipv4/tcp_bbr.c > net/ipv4/tcp_bbr.c
  then removed dev instrumentation (dynamic get/set for parameters)
  and code that was only used by BBRv1

Effort: net-tcp_bbr
Origin-9xx-SHA1: 2c84098e60bed6d67dde23cd7538c51dee273102
Change-Id: I125cf26ba2a7a686f2fa5e87f4c2afceb65f7a05
(cherry-picked from 90e22aa359)
[kdrag0n: Backported to k4.14 by removing reord_seen from bbr_debug's
          output as it's not mandatory and 4.14 doesn't have it]
Signed-off-by: Danny Lin <danny@kdrag0n.dev>
Signed-off-by: UtsavisGreat <utsavbalar1231@gmail.com>
Signed-off-by: Mrinal Ghosh <mg712702@gmail.com>
2021-08-17 14:23:26 +00:00

237 lines
5.5 KiB
C

/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
#ifndef _UAPI_INET_DIAG_H_
#define _UAPI_INET_DIAG_H_

#include <linux/types.h>

/* Just some random number */
#define TCPDIAG_GETSOCK 18
#define DCCPDIAG_GETSOCK 19

#define INET_DIAG_GETSOCK_MAX 24

/* Socket identity */
struct inet_diag_sockid {
	__be16	idiag_sport;
	__be16	idiag_dport;
	__be32	idiag_src[4];
	__be32	idiag_dst[4];
	__u32	idiag_if;
	__u32	idiag_cookie[2];
#define INET_DIAG_NOCOOKIE (~0U)
};

/* Request structure */
struct inet_diag_req {
	__u8	idiag_family;		/* Family of addresses. */
	__u8	idiag_src_len;
	__u8	idiag_dst_len;
	__u8	idiag_ext;		/* Query extended information */

	struct inet_diag_sockid id;

	__u32	idiag_states;		/* States to dump */
	__u32	idiag_dbs;		/* Tables to dump (NI) */
};

struct inet_diag_req_v2 {
	__u8	sdiag_family;
	__u8	sdiag_protocol;
	__u8	idiag_ext;
	__u8	pad;
	__u32	idiag_states;
	struct inet_diag_sockid id;
};

/*
 * SOCK_RAW sockets require the underlying protocol to be
 * additionally specified so we can use @pad member for
 * this, but we can't rename it because userspace programs
 * still may depend on this name. Instead let's use another
 * structure definition as an alias for struct
 * @inet_diag_req_v2.
 */
struct inet_diag_req_raw {
	__u8	sdiag_family;
	__u8	sdiag_protocol;
	__u8	idiag_ext;
	__u8	sdiag_raw_protocol;
	__u32	idiag_states;
	struct inet_diag_sockid id;
};

enum {
	INET_DIAG_REQ_NONE,
	INET_DIAG_REQ_BYTECODE,
};

#define INET_DIAG_REQ_MAX INET_DIAG_REQ_BYTECODE

/* Bytecode is sequence of 4 byte commands followed by variable arguments.
 * All the commands identified by "code" are conditional jumps forward:
 * to offset cc+"yes" or to offset cc+"no". "yes" is supposed to be
 * length of the command and its arguments.
 */
struct inet_diag_bc_op {
	unsigned char	code;
	unsigned char	yes;
	unsigned short	no;
};

enum {
	INET_DIAG_BC_NOP,
	INET_DIAG_BC_JMP,
	INET_DIAG_BC_S_GE,
	INET_DIAG_BC_S_LE,
	INET_DIAG_BC_D_GE,
	INET_DIAG_BC_D_LE,
	INET_DIAG_BC_AUTO,
	INET_DIAG_BC_S_COND,
	INET_DIAG_BC_D_COND,
	INET_DIAG_BC_DEV_COND,	/* u32 ifindex */
	INET_DIAG_BC_MARK_COND,
};

struct inet_diag_hostcond {
	__u8	family;
	__u8	prefix_len;
	int	port;
	__be32	addr[0];
};

struct inet_diag_markcond {
	__u32	mark;
	__u32	mask;
};

/* Base info structure. It contains socket identity (addrs/ports/cookie)
 * and, alas, the information shown by netstat. */
struct inet_diag_msg {
	__u8	idiag_family;
	__u8	idiag_state;
	__u8	idiag_timer;
	__u8	idiag_retrans;

	struct inet_diag_sockid id;

	__u32	idiag_expires;
	__u32	idiag_rqueue;
	__u32	idiag_wqueue;
	__u32	idiag_uid;
	__u32	idiag_inode;
};

/* Extensions */
enum {
	INET_DIAG_NONE,
	INET_DIAG_MEMINFO,
	INET_DIAG_INFO,
	INET_DIAG_VEGASINFO,
	INET_DIAG_CONG,
	INET_DIAG_TOS,
	INET_DIAG_TCLASS,
	INET_DIAG_SKMEMINFO,
	INET_DIAG_SHUTDOWN,

	/*
	 * Next extensions cannot be requested in struct inet_diag_req_v2:
	 * its field idiag_ext has only 8 bits.
	 */

	INET_DIAG_DCTCPINFO,	/* request as INET_DIAG_VEGASINFO */
	INET_DIAG_PROTOCOL,	/* response attribute only */
	INET_DIAG_SKV6ONLY,
	INET_DIAG_LOCALS,
	INET_DIAG_PEERS,
	INET_DIAG_PAD,
	INET_DIAG_MARK,		/* only with CAP_NET_ADMIN */
	INET_DIAG_BBRINFO,	/* request as INET_DIAG_VEGASINFO */
	INET_DIAG_CLASS_ID,	/* request as INET_DIAG_TCLASS */
	INET_DIAG_MD5SIG,
	__INET_DIAG_MAX,
};

#define INET_DIAG_MAX (__INET_DIAG_MAX - 1)

/* INET_DIAG_MEMINFO */
struct inet_diag_meminfo {
	__u32	idiag_rmem;
	__u32	idiag_wmem;
	__u32	idiag_fmem;
	__u32	idiag_tmem;
};

/* INET_DIAG_VEGASINFO */
struct tcpvegas_info {
	__u32	tcpv_enabled;
	__u32	tcpv_rttcnt;
	__u32	tcpv_rtt;
	__u32	tcpv_minrtt;
};

/* INET_DIAG_DCTCPINFO */
struct tcp_dctcp_info {
	__u16	dctcp_enabled;
	__u16	dctcp_ce_state;
	__u32	dctcp_alpha;
	__u32	dctcp_ab_ecn;
	__u32	dctcp_ab_tot;
};

/* INET_DIAG_BBRINFO */
struct tcp_bbr_info {
	/* u64 bw: max-filtered BW (app throughput) estimate in Byte per sec: */
	__u32	bbr_bw_lo;		/* lower 32 bits of bw */
	__u32	bbr_bw_hi;		/* upper 32 bits of bw */
	__u32	bbr_min_rtt;		/* min-filtered RTT in uSec */
	__u32	bbr_pacing_gain;	/* pacing gain shifted left 8 bits */
	__u32	bbr_cwnd_gain;		/* cwnd gain shifted left 8 bits */
};

/* Phase as reported in netlink/ss stats. */
enum tcp_bbr2_phase {
	BBR2_PHASE_INVALID		= 0,
	BBR2_PHASE_STARTUP		= 1,
	BBR2_PHASE_DRAIN		= 2,
	BBR2_PHASE_PROBE_RTT		= 3,
	BBR2_PHASE_PROBE_BW_UP		= 4,
	BBR2_PHASE_PROBE_BW_DOWN	= 5,
	BBR2_PHASE_PROBE_BW_CRUISE	= 6,
	BBR2_PHASE_PROBE_BW_REFILL	= 7
};

struct tcp_bbr2_info {
	/* u64 bw: bandwidth (app throughput) estimate in Byte per sec: */
	__u32	bbr_bw_lsb;		/* lower 32 bits of bw */
	__u32	bbr_bw_msb;		/* upper 32 bits of bw */
	__u32	bbr_min_rtt;		/* min-filtered RTT in uSec */
	__u32	bbr_pacing_gain;	/* pacing gain shifted left 8 bits */
	__u32	bbr_cwnd_gain;		/* cwnd gain shifted left 8 bits */
	__u32	bbr_bw_hi_lsb;		/* lower 32 bits of bw_hi */
	__u32	bbr_bw_hi_msb;		/* upper 32 bits of bw_hi */
	__u32	bbr_bw_lo_lsb;		/* lower 32 bits of bw_lo */
	__u32	bbr_bw_lo_msb;		/* upper 32 bits of bw_lo */
	__u8	bbr_mode;		/* current bbr_mode in state machine */
	__u8	bbr_phase;		/* current state machine phase */
	__u8	unused1;		/* alignment padding; not used yet */
	__u8	bbr_version;		/* MUST be at this offset in struct */
	__u32	bbr_inflight_lo;	/* lower/short-term data volume bound */
	__u32	bbr_inflight_hi;	/* higher/long-term data volume bound */
	__u32	bbr_extra_acked;	/* max excess packets ACKed in epoch */
};

union tcp_cc_info {
	struct tcpvegas_info	vegas;
	struct tcp_dctcp_info	dctcp;
	struct tcp_bbr_info	bbr;
	struct tcp_bbr2_info	bbr2;
};
#endif /* _UAPI_INET_DIAG_H_ */
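
A userspace consumer of struct tcp_bbr2_info has to reassemble the split
64-bit bandwidth fields and undo the 8-bit fixed-point shift on the gains.
The sketch below shows one way that could look. It defines a local mirror of
the struct with <stdint.h> types so that it builds without this tree's
headers; the mirror type and the helpers join_u64(), gain_to_double() and
bbr2_phase_name() are hypothetical names chosen for this example, not part
of the header or of any existing tool.

```c
#include <stddef.h>
#include <stdint.h>

/* Local mirror of struct tcp_bbr2_info, field for field. */
struct bbr2_info_mirror {
	uint32_t bbr_bw_lsb, bbr_bw_msb;	/* bw, bytes per second */
	uint32_t bbr_min_rtt;			/* microseconds */
	uint32_t bbr_pacing_gain;		/* fixed point, << 8 */
	uint32_t bbr_cwnd_gain;			/* fixed point, << 8 */
	uint32_t bbr_bw_hi_lsb, bbr_bw_hi_msb;
	uint32_t bbr_bw_lo_lsb, bbr_bw_lo_msb;
	uint8_t  bbr_mode;
	uint8_t  bbr_phase;
	uint8_t  unused1;
	uint8_t  bbr_version;
	uint32_t bbr_inflight_lo;
	uint32_t bbr_inflight_hi;
	uint32_t bbr_extra_acked;
};

/* Rebuild a 64-bit value from its lower and upper 32-bit halves. */
static uint64_t join_u64(uint32_t lsb, uint32_t msb)
{
	return ((uint64_t)msb << 32) | lsb;
}

/* Gains are "shifted left 8 bits", so a raw value of 256 means 1.0. */
static double gain_to_double(uint32_t gain)
{
	return (double)gain / 256.0;
}

/* Map a bbr_phase value to a short name, following enum tcp_bbr2_phase. */
static const char *bbr2_phase_name(uint8_t phase)
{
	switch (phase) {
	case 1: return "STARTUP";
	case 2: return "DRAIN";
	case 3: return "PROBE_RTT";
	case 4: return "PROBE_BW_UP";
	case 5: return "PROBE_BW_DOWN";
	case 6: return "PROBE_BW_CRUISE";
	case 7: return "PROBE_BW_REFILL";
	default: return "INVALID";
	}
}
```

With this layout, bbr_version sits at byte offset 39, which is past the end
of the 20-byte struct tcp_bbr_info.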