Changes in 5.15.58
pinctrl: stm32: fix optional IRQ support to gpios
riscv: add as-options for modules with assembly compontents
mlxsw: spectrum_router: Fix IPv4 nexthop gateway indication
lockdown: Fix kexec lockdown bypass with ima policy
drm/ttm: fix locking in vmap/vunmap TTM GEM helpers
bus: mhi: host: pci_generic: add Telit FN980 v1 hardware revision
bus: mhi: host: pci_generic: add Telit FN990
Revert "selftest/vm: verify remap destination address in mremap_test"
Revert "selftest/vm: verify mmap addr in mremap_test"
PCI: hv: Fix multi-MSI to allow more than one MSI vector
PCI: hv: Fix hv_arch_irq_unmask() for multi-MSI
PCI: hv: Reuse existing IRTE allocation in compose_msi_msg()
PCI: hv: Fix interrupt mapping for multi-MSI
serial: mvebu-uart: correctly report configured baudrate value
batman-adv: Use netif_rx_any_context() any.
Revert "mt76: mt7921: Fix the error handling path of mt7921_pci_probe()"
Revert "mt76: mt7921e: fix possible probe failure after reboot"
mt76: mt7921: use physical addr to unify register access
mt76: mt7921e: fix possible probe failure after reboot
mt76: mt7921: Fix the error handling path of mt7921_pci_probe()
xfs: fix maxlevels comparisons in the btree staging code
xfs: fold perag loop iteration logic into helper function
xfs: rename the next_agno perag iteration variable
xfs: terminate perag iteration reliably on agcount
xfs: fix perag reference leak on iteration race with growfs
xfs: prevent a WARN_ONCE() in xfs_ioc_attr_list()
r8152: fix a WOL issue
ip: Fix data-races around sysctl_ip_default_ttl.
xfrm: xfrm_policy: fix a possible double xfrm_pols_put() in xfrm_bundle_lookup()
power/reset: arm-versatile: Fix refcount leak in versatile_reboot_probe
RDMA/irdma: Do not advertise 1GB page size for x722
RDMA/irdma: Fix sleep from invalid context BUG
pinctrl: ralink: rename MT7628(an) functions to MT76X8
pinctrl: ralink: rename pinctrl-rt2880 to pinctrl-ralink
pinctrl: ralink: Check for null return of devm_kcalloc
perf/core: Fix data race between perf_event_set_output() and perf_mmap_close()
ipv4/tcp: do not use per netns ctl sockets
net: tun: split run_ebpf_filter() and pskb_trim() into different "if statement"
mm/pagealloc: sysctl: change watermark_scale_factor max limit to 30%
sysctl: move some boundary constants from sysctl.c to sysctl_vals
tcp: Fix data-races around sysctl_tcp_ecn.
drm/amd/display: Support for DMUB HPD interrupt handling
drm/amd/display: Add option to defer works of hpd_rx_irq
drm/amd/display: Fork thread to offload work of hpd_rx_irq
drm/amdgpu/display: add quirk handling for stutter mode
drm/amd/display: Ignore First MST Sideband Message Return Error
scsi: megaraid: Clear READ queue map's nr_queues
scsi: ufs: core: Drop loglevel of WriteBoost message
nvme: check for duplicate identifiers earlier
nvme: fix block device naming collision
e1000e: Enable GPT clock before sending message to CSME
Revert "e1000e: Fix possible HW unit hang after an s0ix exit"
igc: Reinstate IGC_REMOVED logic and implement it properly
ip: Fix data-races around sysctl_ip_no_pmtu_disc.
ip: Fix data-races around sysctl_ip_fwd_use_pmtu.
ip: Fix data-races around sysctl_ip_fwd_update_priority.
ip: Fix data-races around sysctl_ip_nonlocal_bind.
ip: Fix a data-race around sysctl_ip_autobind_reuse.
ip: Fix a data-race around sysctl_fwmark_reflect.
tcp/dccp: Fix a data-race around sysctl_tcp_fwmark_accept.
tcp: sk->sk_bound_dev_if once in inet_request_bound_dev_if()
tcp: Fix data-races around sysctl_tcp_l3mdev_accept.
tcp: Fix data-races around sysctl_tcp_mtu_probing.
tcp: Fix data-races around sysctl_tcp_base_mss.
tcp: Fix data-races around sysctl_tcp_min_snd_mss.
tcp: Fix a data-race around sysctl_tcp_mtu_probe_floor.
tcp: Fix a data-race around sysctl_tcp_probe_threshold.
tcp: Fix a data-race around sysctl_tcp_probe_interval.
net: stmmac: fix pm runtime issue in stmmac_dvr_remove()
net: stmmac: fix unbalanced ptp clock issue in suspend/resume flow
mtd: rawnand: gpmi: validate controller clock rate
mtd: rawnand: gpmi: Set WAIT_FOR_READY timeout based on program/erase times
net: dsa: microchip: ksz_common: Fix refcount leak bug
net: skb: introduce kfree_skb_reason()
net: skb: use kfree_skb_reason() in tcp_v4_rcv()
net: skb: use kfree_skb_reason() in __udp4_lib_rcv()
net: socket: rename SKB_DROP_REASON_SOCKET_FILTER
net: skb_drop_reason: add document for drop reasons
net: netfilter: use kfree_drop_reason() for NF_DROP
net: ipv4: use kfree_skb_reason() in ip_rcv_core()
net: ipv4: use kfree_skb_reason() in ip_rcv_finish_core()
i2c: mlxcpld: Fix register setting for 400KHz frequency
i2c: cadence: Change large transfer count reset logic to be unconditional
perf tests: Fix Convert perf time to TSC test for hybrid
net: stmmac: fix dma queue left shift overflow issue
net/tls: Fix race in TLS device down flow
igmp: Fix data-races around sysctl_igmp_llm_reports.
igmp: Fix a data-race around sysctl_igmp_max_memberships.
igmp: Fix data-races around sysctl_igmp_max_msf.
tcp: Fix data-races around keepalive sysctl knobs.
tcp: Fix data-races around sysctl_tcp_syn(ack)?_retries.
tcp: Fix data-races around sysctl_tcp_syncookies.
tcp: Fix data-races around sysctl_tcp_migrate_req.
tcp: Fix data-races around sysctl_tcp_reordering.
tcp: Fix data-races around some timeout sysctl knobs.
tcp: Fix a data-race around sysctl_tcp_notsent_lowat.
tcp: Fix a data-race around sysctl_tcp_tw_reuse.
tcp: Fix data-races around sysctl_max_syn_backlog.
tcp: Fix data-races around sysctl_tcp_fastopen.
tcp: Fix data-races around sysctl_tcp_fastopen_blackhole_timeout.
iavf: Fix handling of dummy receive descriptors
pinctrl: armada-37xx: Use temporary variable for struct device
pinctrl: armada-37xx: Make use of the devm_platform_ioremap_resource()
pinctrl: armada-37xx: Convert to use dev_err_probe()
pinctrl: armada-37xx: use raw spinlocks for regmap to avoid invalid wait context
i40e: Fix erroneous adapter reinitialization during recovery process
ixgbe: Add locking to prevent panic when setting sriov_numvfs to zero
net: stmmac: remove redunctant disable xPCS EEE call
gpio: pca953x: only use single read/write for No AI mode
gpio: pca953x: use the correct range when do regmap sync
gpio: pca953x: use the correct register address when regcache sync during init
be2net: Fix buffer overflow in be_get_module_eeprom
net: dsa: sja1105: silent spi_device_id warnings
net: dsa: vitesse-vsc73xx: silent spi_device_id warnings
drm/imx/dcss: Add missing of_node_put() in fail path
ipv4: Fix a data-race around sysctl_fib_multipath_use_neigh.
ipv4: Fix data-races around sysctl_fib_multipath_hash_policy.
ipv4: Fix data-races around sysctl_fib_multipath_hash_fields.
ip: Fix data-races around sysctl_ip_prot_sock.
udp: Fix a data-race around sysctl_udp_l3mdev_accept.
tcp: Fix data-races around sysctl knobs related to SYN option.
tcp: Fix a data-race around sysctl_tcp_early_retrans.
tcp: Fix data-races around sysctl_tcp_recovery.
tcp: Fix a data-race around sysctl_tcp_thin_linear_timeouts.
tcp: Fix data-races around sysctl_tcp_slow_start_after_idle.
tcp: Fix a data-race around sysctl_tcp_retrans_collapse.
tcp: Fix a data-race around sysctl_tcp_stdurg.
tcp: Fix a data-race around sysctl_tcp_rfc1337.
tcp: Fix a data-race around sysctl_tcp_abort_on_overflow.
tcp: Fix data-races around sysctl_tcp_max_reordering.
gpio: gpio-xilinx: Fix integer overflow
KVM: selftests: Fix target thread to be migrated in rseq_test
spi: bcm2835: bcm2835_spi_handle_err(): fix NULL pointer deref for non DMA transfers
KVM: Don't null dereference ops->destroy
mm/mempolicy: fix uninit-value in mpol_rebind_policy()
bpf: Make sure mac_header was set before using it
sched/deadline: Fix BUG_ON condition for deboosted tasks
x86/bugs: Warn when "ibrs" mitigation is selected on Enhanced IBRS parts
dlm: fix pending remove if msg allocation fails
x86/uaccess: Implement macros for CMPXCHG on user addresses
x86/extable: Tidy up redundant handler functions
x86/extable: Get rid of redundant macros
x86/mce: Deduplicate exception handling
x86/extable: Rework the exception table mechanics
x86/extable: Provide EX_TYPE_DEFAULT_MCE_SAFE and EX_TYPE_FAULT_MCE_SAFE
bitfield.h: Fix "type of reg too small for mask" test
x86/entry_32: Remove .fixup usage
x86/extable: Extend extable functionality
x86/msr: Remove .fixup usage
x86/futex: Remove .fixup usage
KVM: x86: Use __try_cmpxchg_user() to emulate atomic accesses
xhci: dbc: refactor xhci_dbc_init()
xhci: dbc: create and remove dbc structure in dbgtty driver.
xhci: dbc: Rename xhci_dbc_init and xhci_dbc_exit
xhci: Set HCD flag to defer primary roothub registration
mt76: fix use-after-free by removing a non-RCU wcid pointer
iwlwifi: fw: uefi: add missing include guards
crypto: qat - set to zero DH parameters before free
crypto: qat - use pre-allocated buffers in datapath
crypto: qat - refactor submission logic
crypto: qat - add backlog mechanism
crypto: qat - fix memory leak in RSA
crypto: qat - remove dma_free_coherent() for RSA
crypto: qat - remove dma_free_coherent() for DH
crypto: qat - add param check for RSA
crypto: qat - add param check for DH
crypto: qat - re-enable registration of algorithms
exfat: fix referencing wrong parent directory information after renaming
tracing: Have event format check not flag %p* on __get_dynamic_array()
tracing: Place trace_pid_list logic into abstract functions
tracing: Fix return value of trace_pid_write()
um: virtio_uml: Allow probing from devicetree
um: virtio_uml: Fix broken device handling in time-travel
Bluetooth: Add bt_skb_sendmsg helper
Bluetooth: Add bt_skb_sendmmsg helper
Bluetooth: SCO: Replace use of memcpy_from_msg with bt_skb_sendmsg
Bluetooth: RFCOMM: Replace use of memcpy_from_msg with bt_skb_sendmmsg
Bluetooth: Fix passing NULL to PTR_ERR
Bluetooth: SCO: Fix sco_send_frame returning skb->len
Bluetooth: Fix bt_skb_sendmmsg not allocating partial chunks
exfat: use updated exfat_chain directly during renaming
drm/amd/display: Reset DMCUB before HW init
drm/amd/display: Optimize bandwidth on following fast update
drm/amd/display: Fix surface optimization regression on Carrizo
x86/amd: Use IBPB for firmware calls
x86/alternative: Report missing return thunk details
watchqueue: make sure to serialize 'wqueue->defunct' properly
tty: drivers/tty/, stop using tty_schedule_flip()
tty: the rest, stop using tty_schedule_flip()
tty: drop tty_schedule_flip()
tty: extract tty_flip_buffer_commit() from tty_flip_buffer_push()
tty: use new tty_insert_flip_string_and_push_buffer() in pty_write()
net: usb: ax88179_178a needs FLAG_SEND_ZLP
watch-queue: remove spurious double semicolon
drm/amd/display: Don't lock connection_mutex for DMUB HPD
drm/amd/display: invalid parameter check in dmub_hpd_callback
x86/extable: Prefer local labels in .set directives
KVM: x86: fix typo in __try_cmpxchg_user causing non-atomicness
x86: drop bogus "cc" clobber from __try_cmpxchg_user_asm()
drm/amdgpu: Off by one in dm_dmub_outbox1_low_irq()
x86/entry_32: Fix segment exceptions
drm/amd/display: Fix wrong format specifier in amdgpu_dm.c
Linux 5.15.58
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
Change-Id: Id5d721199f6acc2bdaf37ef7d133904993d9160f
[ Upstream commit 9b55c20f83369dd54541d9ddbe3a018a8377f451 ]
sysctl_ip_prot_sock is accessed concurrently, and there is always a chance
of data-race. So, all readers and writers need some basic protection to
avoid load/store-tearing.
Fixes: 4548b683b7 ("Introduce a sysctl that modifies the value of PROT_SOCK.")
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 85d0b4dbd74b95cc492b1f4e34497d3f894f5d9a ]
While reading sysctl_fwmark_reflect, it can be changed concurrently.
Thus, we need to add READ_ONCE() to its reader.
Fixes: e110861f86 ("net: add a sysctl to reflect the fwmark on replies")
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 60c158dc7b1f0558f6cadd5b50d0386da0000d50 ]
While reading sysctl_ip_fwd_use_pmtu, it can be changed concurrently.
Thus, we need to add READ_ONCE() to its readers.
Fixes: f87c10a8aa ("ipv4: introduce ip_dst_mtu_maybe_forward and protect forwarding path against pmtu spoofing")
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit e6175a2ed1f18bf2f649625bf725e07adcfa6a28 ]
In IPv4 setting the "disable_policy" flag on a device means no policy
should be enforced for traffic originating from the device. This was
implemented by seting the DST_NOPOLICY flag in the dst based on the
originating device.
However, dsts are cached in nexthops regardless of the originating
devices, in which case, the DST_NOPOLICY flag value may be incorrect.
Consider the following setup:
+------------------------------+
| ROUTER |
+-------------+ | +-----------------+ |
| ipsec src |----|-|ipsec0 | |
+-------------+ | |disable_policy=0 | +----+ |
| +-----------------+ |eth1|-|-----
+-------------+ | +-----------------+ +----+ |
| noipsec src |----|-|eth0 | |
+-------------+ | |disable_policy=1 | |
| +-----------------+ |
+------------------------------+
Where ROUTER has a default route towards eth1.
dst entries for traffic arriving from eth0 would have DST_NOPOLICY
and would be cached and therefore can be reused by traffic originating
from ipsec0, skipping policy check.
Fix by setting a IPSKB_NOPOLICY flag in IPCB and observing it instead
of the DST in IN/FWD IPv4 policy checks.
Fixes: 1da177e4c3 ("Linux-2.6.12-rc2")
Reported-by: Shmulik Ladkani <shmulik.ladkani@gmail.com>
Signed-off-by: Eyal Birger <eyal.birger@gmail.com>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
(cherry picked from commit 952c246496)
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
Change-Id: Ib99071ce487325fd9c3646ca77dd682128feeb7c
[ Upstream commit e6175a2ed1f18bf2f649625bf725e07adcfa6a28 ]
In IPv4 setting the "disable_policy" flag on a device means no policy
should be enforced for traffic originating from the device. This was
implemented by seting the DST_NOPOLICY flag in the dst based on the
originating device.
However, dsts are cached in nexthops regardless of the originating
devices, in which case, the DST_NOPOLICY flag value may be incorrect.
Consider the following setup:
+------------------------------+
| ROUTER |
+-------------+ | +-----------------+ |
| ipsec src |----|-|ipsec0 | |
+-------------+ | |disable_policy=0 | +----+ |
| +-----------------+ |eth1|-|-----
+-------------+ | +-----------------+ +----+ |
| noipsec src |----|-|eth0 | |
+-------------+ | |disable_policy=1 | |
| +-----------------+ |
+------------------------------+
Where ROUTER has a default route towards eth1.
dst entries for traffic arriving from eth0 would have DST_NOPOLICY
and would be cached and therefore can be reused by traffic originating
from ipsec0, skipping policy check.
Fix by setting a IPSKB_NOPOLICY flag in IPCB and observing it instead
of the DST in IN/FWD IPv4 policy checks.
Fixes: 1da177e4c3 ("Linux-2.6.12-rc2")
Reported-by: Shmulik Ladkani <shmulik.ladkani@gmail.com>
Signed-off-by: Eyal Birger <eyal.birger@gmail.com>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
and associated inet_is_local_unbindable_port() helper function:
use it to make explicitly binding to an unbindable port return
-EPERM 'Operation not permitted'.
Autobind doesn't honour this new sysctl since:
(a) you can simply set both if that's the behaviour you desire
(b) there could be a use for preventing explicit while allowing auto
(c) it's faster in the relatively critical path of doing port selection
during connect() to only check one bitmap instead of both
Various ports may have special use cases which are not suitable for
use by general userspace applications. Currently, ports specified in
ip_local_reserved_ports sysctl will not be returned only in case of
automatic port assignment, but nothing prevents you from explicitly
binding to them - even from an entirely unprivileged process.
In certain cases it is desirable to prevent the host from assigning the
ports even in case of explicit binds, even from superuser processes.
Example use cases might be:
- a port being stolen by the nic for remote serial console, remote
power management or some other sort of debugging functionality
(crash collection, gdb, direct access to some other microcontroller
on the nic or motherboard, remote management of the nic itself).
- a transparent proxy where packets are being redirected: in case
a socket matches this connection, packets from this application
would be incorrectly sent to one of the endpoints.
Initially I wanted to solve this problem via the simple one line:
static inline bool inet_port_requires_bind_service(struct net *net, unsigned short port) {
- return port < net->ipv4.sysctl_ip_prot_sock;
+ return port < net->ipv4.sysctl_ip_prot_sock || inet_is_local_reserved_port(net, port);
}
However, this doesn't work for two reasons:
(a) it changes userspace visible behaviour of the existing local
reserved ports sysctl, and there appears to be enough documentation
on the internet talking about setting it to make this a bad idea
(b) it doesn't prevent privileged apps from using these ports,
CAP_BIND_SERVICE is relatively likely to be available to, for example,
a recursive DNS server so it can listed on port 53, which also needs
to do src port randomization for outgoing queries due to security
reasons (and it thus does manual port binding).
If we *know* that certain ports are simply unusable, then it's better
nothing even gets the opportunity to try to use them. This way we at
least get a quick failure, instead of some sort of timeout (or possibly
even corruption of the data stream of the non-kernel based use case).
Test:
vm:~# cat /proc/sys/net/ipv4/ip_local_unbindable_ports
vm:~# python -c 'import socket; s = socket.socket(socket.AF_INET6, socket.SOCK_STREAM, 0); s.bind(("::", 3967))'
vm:~# python -c 'import socket; s = socket.socket(socket.AF_INET6, socket.SOCK_DGRAM, 0); s.bind(("::", 3967))'
vm:~# echo 3967 > /proc/sys/net/ipv4/ip_local_unbindable_ports
vm:~# cat /proc/sys/net/ipv4/ip_local_unbindable_ports
3967
vm:~# python -c 'import socket; s = socket.socket(socket.AF_INET6, socket.SOCK_STREAM, 0); s.bind(("::", 3967))'
socket.error: (1, 'Operation not permitted')
vm:~# python -c 'import socket; s = socket.socket(socket.AF_INET6, socket.SOCK_DGRAM, 0); s.bind(("::", 3967))'
socket.error: (1, 'Operation not permitted')
Cc: Sean Tranchetti <stranche@codeaurora.org>
Cc: Subash Abhinov Kasiviswanathan <subashab@codeaurora.org>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Linux SCTP <linux-sctp@vger.kernel.org>
Signed-off-by: Maciej Żenczykowski <maze@google.com>
Bug: 140404597
Change-Id: Ie96207bea90ae1345adf7b45724d0caf4d6e52c2
Signed-off-by: Todd Kjos <tkjos@google.com>
Signed-off-by: Subash Abhinov Kasiviswanathan <subashab@codeaurora.org>
(cherry picked from commit 8a4b8ea595a515e6360dfccdb99cec2a1fed08c0)
commit 23f57406b82de51809d5812afd96f210f8b627f3 upstream.
ip_select_ident_segs() has been very conservative about using
the connected socket private generator only for packets with IP_DF
set, claiming it was needed for some VJ compression implementations.
As mentioned in this referenced document, this can be abused.
(Ref: Off-Path TCP Exploits of the Mixed IPID Assignment)
Before switching to pure random IPID generation and possibly hurt
some workloads, lets use the private inet socket generator.
Not only this will remove one vulnerability, this will also
improve performance of TCP flows using pmtudisc==IP_PMTUDISC_DONT
Fixes: 73f156a6e8 ("inetpeer: get rid of ip_id_count")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Reported-by: Ray Che <xijiache@gmail.com>
Cc: Willy Tarreau <w@1wt.eu>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Consolidate IPv4 MTU code the same way it is done in IPv6 to have code
aligned in both address families
Signed-off-by: Vadim Fedorenko <vfedorenko@novek.ru>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Commit 14972cbd34 ("net: lwtunnel: Handle fragmentation") moved
fragmentation logic away from lwtunnel by carry encap headroom and
use it in output MTU calculation. But the forwarding part was not
covered and created difference in MTU for output and forwarding and
further to silent drops on ipv4 forwarding path. Fix it by taking
into account lwtunnel encap headroom.
The same commit also introduced difference in how to treat RTAX_MTU
in IPv4 and IPv6 where latter explicitly removes lwtunnel encap
headroom from route MTU. Make IPv4 version do the same.
Fixes: 14972cbd34 ("net: lwtunnel: Handle fragmentation")
Suggested-by: David Ahern <dsahern@gmail.com>
Signed-off-by: Vadim Fedorenko <vfedorenko@novek.ru>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
inet_sdif() does not modify the skb.
This will permit propagating the const qualifier in
udp{4|6}_lib_lookup_skb() functions.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Alexander Lobakin <alobakin@pm.me>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Rejecting non-native endian BTF overlapped with the addition
of support for it.
The rest were more simple overlapping changes, except the
renesas ravb binding update, which had to follow a file
move as well as a YAML conversion.
Signed-off-by: David S. Miller <davem@davemloft.net>
Documentation/networking/ip-sysctl.txt:46 says:
ip_forward_use_pmtu - BOOLEAN
By default we don't trust protocol path MTUs while forwarding
because they could be easily forged and can lead to unwanted
fragmentation by the router.
You only need to enable this if you have user-space software
which tries to discover path mtus by itself and depends on the
kernel honoring this information. This is normally not the case.
Default: 0 (disabled)
Possible values:
0 - disabled
1 - enabled
Which makes it pretty clear that setting it to 1 is a potential
security/safety/DoS issue, and yet it is entirely reasonable to want
forwarded traffic to honour explicitly administrator configured
route mtus (instead of defaulting to device mtu).
Indeed, I can't think of a single reason why you wouldn't want to.
Since you configured a route mtu you probably know better...
It is pretty common to have a higher device mtu to allow receiving
large (jumbo) frames, while having some routes via that interface
(potentially including the default route to the internet) specify
a lower mtu.
Note that ipv6 forwarding uses device mtu unless the route is locked
(in which case it will use the route mtu).
This approach is not usable for IPv4 where an 'mtu lock' on a route
also has the side effect of disabling TCP path mtu discovery via
disabling the IPv4 DF (don't frag) bit on all outgoing frames.
I'm not aware of a way to lock a route from an IPv6 RA, so that also
potentially seems wrong.
Signed-off-by: Maciej Żenczykowski <maze@google.com>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Willem de Bruijn <willemb@google.com>
Cc: Lorenzo Colitti <lorenzo@google.com>
Cc: Sunmeet Gill (Sunny) <sgill@quicinc.com>
Cc: Vinay Paradkar <vparadka@qti.qualcomm.com>
Cc: Tyler Wear <twear@quicinc.com>
Cc: David Ahern <dsahern@kernel.org>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This commit adds tos as a new passed in parameter to
ip_build_and_send_pkt() which will be used in the later commit.
This is a pure restructure and does not have any functional change.
Signed-off-by: Wei Wang <weiwan@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Rework the remaining setsockopt code to pass a sockptr_t instead of a
plain user pointer. This removes the last remaining set_fs(KERNEL_DS)
outside of architecture specific code.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Stefan Schmidt <stefan@datenfreihafen.org> [ieee802154]
Acked-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Handle the few cases that need special treatment in-line using
in_compat_syscall().
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
Mitigate RETPOLINE costs in __tcp_transmit_skb()
by using INDIRECT_CALL_INET() wrapper.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add a helper to directly set the IP_PKTINFO sockopt from kernel
space without going through a fake uaccess.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add a helper to directly set the IP_MTU_DISCOVER sockopt from kernel
space without going through a fake uaccess.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Howells <dhowells@redhat.com> [rxrpc bits]
Signed-off-by: David S. Miller <davem@davemloft.net>
Add a helper to directly set the IP_RECVERR sockopt from kernel space
without going through a fake uaccess.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Howells <dhowells@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add a helper to directly set the IP_FREEBIND sockopt from kernel space
without going through a fake uaccess.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add a helper to directly set the IP_TOS sockopt from kernel space without
going through a fake uaccess.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: David S. Miller <davem@davemloft.net>
Any argument outside of that range would result in an out of bound
memory access, since the accessed array is 65536 bits long.
Signed-off-by: Maciej Żenczykowski <maze@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Note that the sysctl write accessor functions guarantee that:
net->ipv4.sysctl_ip_prot_sock <= net->ipv4.ip_local_ports.range[0]
invariant is maintained, and as such the max() in selinux hooks is actually spurious.
ie. even though
if (snum < max(inet_prot_sock(sock_net(sk)), low) || snum > high) {
per logic is the same as
if ((snum < inet_prot_sock(sock_net(sk)) && snum < low) || snum > high) {
it is actually functionally equivalent to:
if (snum < low || snum > high) {
which is equivalent to:
if (snum < inet_prot_sock(sock_net(sk)) || snum < low || snum > high) {
even though the first clause is spurious.
But we want to hold on to it in case we ever want to change what what
inet_port_requires_bind_service() means (for example by changing
it from a, by default, [0..1024) range to some sort of set).
Test: builds, git 'grep inet_prot_sock' finds no other references
Cc: Eric Dumazet <edumazet@google.com>
Signed-off-by: Maciej Żenczykowski <maze@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch removes the iph field from the state structure, which is not
properly initialized. Instead, add a new field to make the "do we want
to set DF" be the state bit and move the code to set the DF flag from
ip_frag_next().
Joint work with Pablo and Linus.
Fixes: 19c3401a91 ("net: ipv4: place control buffer handling away from fragmentation iterators")
Reported-by: Patrick Schönthaler <patrick@notvads.ovh>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Enable setting skb->mark for UDP and RAW sockets using cmsg.
This is analogous to existing support for TOS, TTL, txtime, etc.
Packet sockets already support this as of commit c7d39e3263
("packet: support per-packet fwmark for af_packet sendmsg").
Similar to other fields, implement by
1. initialize the sockcm_cookie.mark from socket option sk_mark
2. optionally overwrite this in ip_cmsg_send/ip6_datagram_send_ctl
3. initialize inet_cork.mark from sockcm_cookie.mark
4. initialize each (usually just one) skb->mark from inet_cork.mark
Step 1 is handled in one location for most protocols by ipcm_init_sk
as of commit 351782067b ("ipv4: ipcm_cookie initializers").
Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
If we want to set a EDT time for the skb we want to send
via ip_send_unicast_reply(), we have to pass a new parameter
and initialize ipc.sockc.transmit_time with it.
This fixes the EDT time for ACK/RST packets sent on behalf of
a TIME_WAIT socket.
Fixes: a842fe1425 ("tcp: add optional per socket transmit delay")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Some ISDN files that got removed in net-next had some changes
done in mainline, take the removals.
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch exposes a new API to refragment a skbuff. This allows you to
split either a linear skbuff or to force the refragmentation of an
existing fraglist using a different mtu. The API consists of:
* ip_frag_init(), that initializes the internal state of the transformer.
* ip_frag_next(), that allows you to fetch the next fragment. This function
internally allocates the skbuff that represents the fragment, it pushes
the IPv4 header, and it also copies the payload for each fragment.
The ip_frag_state object stores the internal state of the splitter.
This code has been extracted from ip_do_fragment(). Symbols are also
exported to allow to reuse this iterator from the bridge codepath to
build its own refragmentation routine by reusing the existing codebase.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch adds the skbuff fraglist splitter. This API provides an
iterator to transform the fraglist into single skbuff objects, it
consists of:
* ip_fraglist_init(), that initializes the internal state of the
fraglist splitter.
* ip_fraglist_prepare(), that restores the IPv4 header on the
fragments.
* ip_fraglist_next(), that retrieves the fragment from the fraglist and
it updates the internal state of the splitter to point to the next
fragment skbuff in the fraglist.
The ip_fraglist_iter object stores the internal state of the iterator.
This code has been extracted from ip_do_fragment(). Symbols are also
exported to allow to reuse this iterator from the bridge codepath to
build its own refragmentation routine by reusing the existing codebase.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Based on 1 normalized pattern(s):
this program is free software you can redistribute it and or modify
it under the terms of the gnu general public license as published by
the free software foundation either version 2 of the license or at
your option any later version
extracted by the scancode license scanner the SPDX license identifier
GPL-2.0-or-later
has been chosen to replace the boilerplate/reference in 3029 file(s).
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Allison Randal <allison@lohutok.net>
Cc: linux-spdx@vger.kernel.org
Link: https://lkml.kernel.org/r/20190527070032.746973796@linutronix.de
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Minor comment merge conflict in mlx5.
Staging driver has a fixup due to the skb->xmit_more changes
in 'net-next', but was removed in 'net'.
Signed-off-by: David S. Miller <davem@davemloft.net>
Configuration check to accept source route IP options should be made on
the incoming netdevice when the skb->dev is an l3mdev master. The route
lookup for the source route next hop also needs the incoming netdev.
v2->v3:
- Simplify by passing the original netdevice down the stack (per David
Ahern).
Signed-off-by: Stephen Suryaputra <ssuryaextr@gmail.com>
Reviewed-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
fib_trie implementation calls synchronize_rcu when a certain amount of
pages are dirty from freed entries. The number of pages was determined
experimentally in 2009 (commit c3059477fc).
At the current setting, synchronize_rcu is called often -- 51 times in a
second in one test with an average of an 8 msec delay adding a fib entry.
The total impact is a lot of slow down modifying the fib. This is seen
in the output of 'time' - the difference between real time and sys+user.
For example, using 720,022 single path routes and 'ip -batch'[1]:
$ time ./ip -batch ipv4/routes-1-hops
real 0m14.214s
user 0m2.513s
sys 0m6.783s
So roughly 35% of the actual time to install the routes is from the ip
command getting scheduled out, most notably due to synchronize_rcu (this
is observed using 'perf sched timehist').
This patch makes the amount of dirty memory configurable between 64k where
the synchronize_rcu is called often (small, low end systems that are memory
sensitive) to 64M where synchronize_rcu is called rarely during a large
FIB change (for high end systems with lots of memory). The default is 512kB
which corresponds to the current setting of 128 pages with a 4kB page size.
As an example, at 16MB the worst interval shows 4 calls to synchronize_rcu
in a second blocking for up to 30 msec in a single instance, and a total
of almost 100 msec across the 4 calls in the second. The trade off is
allowing FIB entries to consume more memory in a given time window but
but with much better fib insertion rates (~30% increase in prefixes/sec).
With this patch and net.ipv4.fib_sync_mem set to 16MB, the same batch
file runs in:
$ time ./ip -batch ipv4/routes-1-hops
real 0m9.692s
user 0m2.491s
sys 0m6.769s
So the dead time is reduced to about 1/2 second or <5% of the real time.
[1] 'ip' modified to not request ACK messages which improves route
insertion times by about 20%
Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
For ip rules, we need to use 'ipproto ipv6-icmp' to match ICMPv6 headers.
But for ip -6 route, currently we only support tcp, udp and icmp.
Add ICMPv6 support so we can match ipv6-icmp rules for route lookup.
v2: As David Ahern and Sabrina Dubroca suggested, Add an argument to
rtm_getroute_parse_ip_proto() to handle ICMP/ICMPv6 with different family.
Reported-by: Jianlin Shi <jishi@redhat.com>
Fixes: eacb9384a3 ("ipv6: support sport, dport and ip_proto in RTM_GETROUTE")
Signed-off-by: Hangbin Liu <liuhangbin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
So that we can re-use it at the UDP level in a later patch
rfc v3 -> v1
- add the helper declaration into the ip header
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add extack argument to ip_fib_metrics_init and add messages for invalid
metrics.
Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Move the refcounting and potential free of dst metrics associated
for ipv4 and ipv6 to a common helper.
Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
ipv4 and ipv6 both use refcounted metrics if FIB entries have metrics set.
Move the common initialization code to a helper and use for both protocols.
Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Move the refcounting and potential free of dst metrics associated
with a fib entry to a helper and use it in both ipv4 and ipv6.
Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Consolidate initialization of ipv4 and ipv6 metrics when fib entries
are created into a single helper, ip_fib_metrics_init, that handles
the call to ip_metrics_convert.
If no metrics are defined for the fib entry, then the metrics is set
to dst_default_metrics.
Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
skb_shinfo(skb)->tx_flags is derived from sk->sk_tsflags, possibly
after modification by __sock_cmsg_send, by calling sock_tx_timestamp.
The IPv4 and IPv6 paths do this conversion differently. In IPv4, the
individual protocols that support tx timestamps call this function
and store the result in ipc.tx_flags. In IPv6, sock_tx_timestamp is
called in __ip6_append_data.
There is no need to store both tx_flags and ts_flags in the cookie
as one is derived from the other. Convert when setting up the cork
and remove the redundant field. This is similar to IPv6, only have
the conversion happen only once per datagram, in ip(6)_setup_cork.
Also change __ip6_append_data to match __ip_append_data. Only update
tskey if timestamping is enabled with OPT_ID. The SOCK_.. test is
redundant: only valid protocols can have non-zero cork->tx_flags.
After this change the IPv4 and IPv6 logic is the same.
Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Initialize the cookie in one location to reduce code duplication and
avoid bugs from inconsistent initialization, such as that fixed in
commit 9887cba199 ("ip: limit use of gso_size to udp").
Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Also involved adding a way to run a netfilter hook over a list of packets.
Rather than attempting to make netfilter know about lists (which would be
a major project in itself) we just let it call the regular okfn (in this
case ip_rcv_finish()) for any packets it steals, and have it give us back
a list of packets it's synchronously accepted (which normally NF_HOOK
would automatically call okfn() on, but we want to be able to potentially
pass the list to a listified version of okfn().)
The netfilter hooks themselves are indirect calls that still happen per-
packet (see nf_hook_entry_hookfn()), but again, changing that can be left
for future work.
There is potential for out-of-order receives if the netfilter hook ends up
synchronously stealing packets, as they will be processed before any
accepts earlier in the list. However, it was already possible for an
asynchronous accept to cause out-of-order receives, so presumably this is
considered OK.
Signed-off-by: Edward Cree <ecree@solarflare.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch introduces __ip_queue_xmit(), through which the callers
can pass tos param into it without having to set inet->tos. For
ipv6, ip6_xmit() already allows passing tclass parameter.
It's needed when some transport protocol doesn't use inet->tos,
like sctp's per transport dscp, which will be added in next patch.
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>