path: root/net
Age | Commit message | Author
2022-09-28 | net: Fix incorrect address comparison when searching for a bind2 bucket | Martin KaFai Lau
The v6_rcv_saddr and rcv_saddr are inside a union in the 'struct inet_bind2_bucket'. When searching a bucket by following the bhash2 hashtable chain, eg. inet_bind2_bucket_match, it is only using the sk->sk_family and there is no way to check if the inet_bind2_bucket has a v6 or v4 address in the union. This leads to an uninit-value KMSAN report in [0] and also potentially incorrect matches. This patch fixes it by adding a family member to the inet_bind2_bucket and then tests 'sk->sk_family != tb->family' before matching the sk's address to the tb's address. Cc: Joanne Koong <joannelkoong@gmail.com> Fixes: 28044fc1d495 ("net: Add a bhash2 table hashed by port and address") Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org> Reviewed-by: Eric Dumazet <edumazet@google.com> Tested-by: Alexander Potapenko <glider@google.com> Link: https://lore.kernel.org/r/20220927002544.3381205-1-kafai@fb.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
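The family check described above can be sketched in userspace C. This is only an illustration under stated assumptions: `demo_bind2_bucket`, `demo_sock` and `demo_bucket_match()` are hypothetical stand-ins, not the kernel's structures or its matching function. The point is that a union of a v4 and a v6 address is only safe to compare once a family tag says which member is valid.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>
#include <sys/socket.h> /* AF_INET, AF_INET6 */

/* Hypothetical stand-in for the bucket: the union is only meaningful
 * together with the family tag that says which member was written. */
struct demo_bind2_bucket {
    unsigned short family; /* AF_INET or AF_INET6 */
    unsigned short port;
    union {
        uint8_t v6_rcv_saddr[16];
        uint32_t rcv_saddr;
    };
};

/* Hypothetical stand-in for the socket side of the comparison. */
struct demo_sock {
    unsigned short family;
    unsigned short port;
    union {
        uint8_t v6_rcv_saddr[16];
        uint32_t rcv_saddr;
    };
};

static bool demo_bucket_match(const struct demo_bind2_bucket *tb,
                              const struct demo_sock *sk)
{
    assert(tb != NULL && sk != NULL);

    /* Reject first on family, so the union is never read through the
     * wrong member (the source of the uninit-value report). */
    if (sk->family != tb->family)
        return false;
    if (sk->port != tb->port)
        return false;
    if (sk->family == AF_INET6)
        return memcmp(tb->v6_rcv_saddr, sk->v6_rcv_saddr, 16) == 0;
    return tb->rcv_saddr == sk->rcv_saddr;
}
```

Without the family test, a v6 socket whose first four address bytes happen to equal a v4 bucket's address could be compared against bytes the bucket never initialised.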
2022-09-28 | mptcp: poll allow write call before actual connect | Benjamin Hesmans
If fastopen is used, poll must allow a first write that will trigger the SYN+data Similar to what is done in tcp_poll(). Acked-by: Paolo Abeni <pabeni@redhat.com> Reviewed-by: Matthieu Baerts <matthieu.baerts@tessares.net> Signed-off-by: Benjamin Hesmans <benjamin.hesmans@tessares.net> Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-09-28 | mptcp: handle defer connect in mptcp_sendmsg | Dmytro Shytyi
When TCP_FASTOPEN_CONNECT has been set on the socket before a connect, the defer flag is set and must be handled when sendmsg is called. This is similar to what is done in tcp_sendmsg_locked(). Acked-by: Paolo Abeni <pabeni@redhat.com> Reviewed-by: Matthieu Baerts <matthieu.baerts@tessares.net> Co-developed-by: Benjamin Hesmans <benjamin.hesmans@tessares.net> Signed-off-by: Benjamin Hesmans <benjamin.hesmans@tessares.net> Signed-off-by: Dmytro Shytyi <dmytro@shytyi.net> Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-09-28 | tcp: export tcp_sendmsg_fastopen | Benjamin Hesmans
It will be used to support TCP FastOpen with MPTCP in the following commit. Acked-by: Paolo Abeni <pabeni@redhat.com> Reviewed-by: Matthieu Baerts <matthieu.baerts@tessares.net> Co-developed-by: Dmytro Shytyi <dmytro@shytyi.net> Signed-off-by: Dmytro Shytyi <dmytro@shytyi.net> Signed-off-by: Benjamin Hesmans <benjamin.hesmans@tessares.net> Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-09-28 | mptcp: add TCP_FASTOPEN_CONNECT socket option | Benjamin Hesmans
Set the option for the first subflow only. For the other subflows TFO can't be used because a mapping would be needed to cover the data in the SYN. Acked-by: Paolo Abeni <pabeni@redhat.com> Reviewed-by: Matthieu Baerts <matthieu.baerts@tessares.net> Signed-off-by: Benjamin Hesmans <benjamin.hesmans@tessares.net> Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-09-28 | net: shrink struct ubuf_info | Pavel Begunkov
We can benefit from a smaller struct ubuf_info, so leave only mandatory fields and let users decide how they want to extend it. Convert MSG_ZEROCOPY to struct ubuf_info_msgzc and remove duplicated fields. This reduces the size from 48 bytes to just 16. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Acked-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
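The "slim base, user-defined extension" pattern mentioned above can be sketched as follows. All names and fields here (`demo_ubuf_info`, `demo_ubuf_info_msgzc`, `demo_to_msgzc()`) are hypothetical and chosen for illustration; the real kernel layouts may differ. The sketch shows a small base struct embedded in a larger per-user struct, with the usual container-of conversion to get back from a base pointer.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical slim base: only what every user needs. */
struct demo_ubuf_info {
    void (*callback)(struct demo_ubuf_info *uarg);
    unsigned int refcnt;
    unsigned char flags;
};

/* Hypothetical extension for one user. The base is embedded so that
 * code holding only a base pointer can recover the full object. */
struct demo_ubuf_info_msgzc {
    struct demo_ubuf_info ubuf;
    unsigned long desc;
    unsigned int id;
    unsigned short len;
    unsigned int bytelen;
};

#define demo_container_of(ptr, type, member) \
    ((type *)((char *)(ptr) - offsetof(type, member)))

static struct demo_ubuf_info_msgzc *demo_to_msgzc(struct demo_ubuf_info *uarg)
{
    assert(uarg != NULL);
    return demo_container_of(uarg, struct demo_ubuf_info_msgzc, ubuf);
}
```

Code that only needs the callback, refcount and flags pays for the small struct; the zerocopy path carries its extra bookkeeping in the extension.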
2022-09-28 | netfilter: nft_fib: Fix for rpath check with VRF devices | Phil Sutter
Analogous to commit b575b24b8eee3 ("netfilter: Fix rpfilter dropping vrf packets by mistake") but for nftables fib expression: Add special treatment of VRF devices so that typical reverse path filtering via 'fib saddr . iif oif' expression works as expected. Fixes: f6d0cbcf09c50 ("netfilter: nf_tables: add fib expression") Signed-off-by: Phil Sutter <phil@nwl.cc> Signed-off-by: Florian Westphal <fw@strlen.de>
2022-09-28 | net: sched: act_bpf: simplify code logic in tcf_bpf_init() | Zhengchao Shao
Both is_bpf and is_ebpf are boolean types, so (!is_bpf && !is_ebpf) || (is_bpf && is_ebpf) can be reduced to is_bpf == is_ebpf in tcf_bpf_init(). Signed-off-by: Zhengchao Shao <shaozhengchao@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>
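The equivalence claimed above is easy to confirm over the whole truth table. The helper names below are hypothetical and exist only for this check; they are not taken from tcf_bpf_init().

```c
#include <assert.h>
#include <stdbool.h>

/* Long form quoted in the commit message. */
static bool demo_invalid_long(bool is_bpf, bool is_ebpf)
{
    return (!is_bpf && !is_ebpf) || (is_bpf && is_ebpf);
}

/* Reduced form: "neither or both" is the same as "equal". */
static bool demo_invalid_short(bool is_bpf, bool is_ebpf)
{
    return is_bpf == is_ebpf;
}

/* Walks all four input combinations and asserts the forms agree. */
static void demo_check_equivalence(void)
{
    for (int a = 0; a <= 1; a++)
        for (int b = 0; b <= 1; b++)
            assert(demo_invalid_long(a, b) == demo_invalid_short(a, b));
}
```

The reduction relies on both operands being real booleans; with plain integers that can hold values other than 0 and 1, `==` would not be equivalent.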
2022-09-27 | Add skb drop reasons to IPv6 UDP receive path | Donald Hunter
Enumerate the skb drop reasons in the receive path for IPv6 UDP packets. Signed-off-by: Donald Hunter <donald.hunter@redhat.com> Link: https://lore.kernel.org/r/20220926120350.14928-1-donald.hunter@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-09-27 | net: tls: Add ARIA-GCM algorithm | Taehee Yoo
RFC 6209 describes ARIA for TLS 1.2. ARIA-128-GCM and ARIA-256-GCM are defined in RFC 6209. This patch would offer a performance improvement and an opportunity for hardware offload.

Benchmark results (iperf-ssl; CPU: Intel i3-12100):

TLS(openssl-3.0-dev)
[ 3] 0.0- 1.0 sec 185 MBytes 1.55 Gbits/sec
[ 3] 1.0- 2.0 sec 186 MBytes 1.56 Gbits/sec
[ 3] 2.0- 3.0 sec 186 MBytes 1.56 Gbits/sec
[ 3] 3.0- 4.0 sec 186 MBytes 1.56 Gbits/sec
[ 3] 4.0- 5.0 sec 186 MBytes 1.56 Gbits/sec
[ 3] 0.0- 5.0 sec 927 MBytes 1.56 Gbits/sec

kTLS(aria-generic)
[ 3] 0.0- 1.0 sec 198 MBytes 1.66 Gbits/sec
[ 3] 1.0- 2.0 sec 194 MBytes 1.62 Gbits/sec
[ 3] 2.0- 3.0 sec 194 MBytes 1.63 Gbits/sec
[ 3] 3.0- 4.0 sec 194 MBytes 1.63 Gbits/sec
[ 3] 4.0- 5.0 sec 194 MBytes 1.62 Gbits/sec
[ 3] 0.0- 5.0 sec 974 MBytes 1.63 Gbits/sec

kTLS(aria-avx with GFNI)
[ 3] 0.0- 1.0 sec 632 MBytes 5.30 Gbits/sec
[ 3] 1.0- 2.0 sec 657 MBytes 5.51 Gbits/sec
[ 3] 2.0- 3.0 sec 657 MBytes 5.51 Gbits/sec
[ 3] 3.0- 4.0 sec 656 MBytes 5.50 Gbits/sec
[ 3] 4.0- 5.0 sec 656 MBytes 5.50 Gbits/sec
[ 3] 0.0- 5.0 sec 3.18 GBytes 5.47 Gbits/sec

Signed-off-by: Taehee Yoo <ap420073@gmail.com> Reviewed-by: Vadim Fedorenko <vfedorenko@novek.ru> Link: https://lore.kernel.org/r/20220925150033.24615-1-ap420073@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-09-27 | NFC: hci: Split memcpy() of struct hcp_message flexible array | Kees Cook
To work around a misbehavior of the compiler's ability to see into composite flexible array structs (as detailed in the coming memcpy() hardening series[1]), split the memcpy() of the header and the payload so no false positive run-time overflow warning will be generated. This split already existed for the "firstfrag" case, so just generalize the logic further. [1] https://lore.kernel.org/linux-hardening/20220901065914.1417829-2-keescook@chromium.org/ Cc: Eric Dumazet <edumazet@google.com> Cc: Paolo Abeni <pabeni@redhat.com> Reported-by: "Gustavo A. R. Silva" <gustavoars@kernel.org> Signed-off-by: Kees Cook <keescook@chromium.org> Reviewed-by: Gustavo A. R. Silva <gustavoars@kernel.org> Reviewed-by: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org> Link: https://lore.kernel.org/r/20220924040835.3364912-1-keescook@chromium.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-09-27 | net: openvswitch: allow conntrack in non-initial user namespace | Michael Weiß
Similar to the previous commit, the Netlink interface of the OVS conntrack module was restricted to global CAP_NET_ADMIN by using GENL_ADMIN_PERM. This is changed to GENL_UNS_ADMIN_PERM to support unprivileged containers in non-initial user namespace. Signed-off-by: Michael Weiß <michael.weiss@aisec.fraunhofer.de> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2022-09-27 | net: openvswitch: allow metering in non-initial user namespace | Michael Weiß
The Netlink interface for metering was restricted to global CAP_NET_ADMIN by using GENL_ADMIN_PERM. To allow metering in a non-initial user namespace, e.g., a container, this is changed to GENL_UNS_ADMIN_PERM. Signed-off-by: Michael Weiß <michael.weiss@aisec.fraunhofer.de> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2022-09-27 | net/smc: Support SO_REUSEPORT | Tony Lu
This enables SO_REUSEPORT [1] for clcsock when it is set on the smc socket, so that applications which use it can be transparently replaced with SMC. It also helps improve load distribution.

Here is a simple test of NGINX + wrk with SMC. The CPU usage is collected on the NGINX (server) side as below.

Disable SO_REUSEPORT:
05:15:33 PM CPU %usr %nice %sys %iowait %irq %soft %steal %guest %gnice %idle
05:15:34 PM all 7.02 0.00 11.86 0.00 2.04 8.93 0.00 0.00 0.00 70.15
05:15:34 PM 0 0.00 0.00 0.00 0.00 16.00 70.00 0.00 0.00 0.00 14.00
05:15:34 PM 1 11.58 0.00 22.11 0.00 0.00 0.00 0.00 0.00 0.00 66.32
05:15:34 PM 2 1.00 0.00 1.00 0.00 0.00 0.00 0.00 0.00 0.00 98.00
05:15:34 PM 3 16.84 0.00 30.53 0.00 0.00 0.00 0.00 0.00 0.00 52.63
05:15:34 PM 4 28.72 0.00 44.68 0.00 0.00 0.00 0.00 0.00 0.00 26.60
05:15:34 PM 5 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 100.00
05:15:34 PM 6 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 100.00
05:15:34 PM 7 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 100.00

Enable SO_REUSEPORT:
05:15:20 PM CPU %usr %nice %sys %iowait %irq %soft %steal %guest %gnice %idle
05:15:21 PM all 8.56 0.00 14.40 0.00 2.20 9.86 0.00 0.00 0.00 64.98
05:15:21 PM 0 0.00 0.00 4.08 0.00 14.29 76.53 0.00 0.00 0.00 5.10
05:15:21 PM 1 9.09 0.00 16.16 0.00 1.01 0.00 0.00 0.00 0.00 73.74
05:15:21 PM 2 9.38 0.00 16.67 0.00 1.04 0.00 0.00 0.00 0.00 72.92
05:15:21 PM 3 10.42 0.00 17.71 0.00 1.04 0.00 0.00 0.00 0.00 70.83
05:15:21 PM 4 9.57 0.00 15.96 0.00 0.00 0.00 0.00 0.00 0.00 74.47
05:15:21 PM 5 9.18 0.00 15.31 0.00 0.00 1.02 0.00 0.00 0.00 74.49
05:15:21 PM 6 8.60 0.00 15.05 0.00 0.00 0.00 0.00 0.00 0.00 76.34
05:15:21 PM 7 12.37 0.00 14.43 0.00 0.00 0.00 0.00 0.00 0.00 73.20

Using SO_REUSEPORT helps the load distribution of NGINX be more balanced.

[1] https://man7.org/linux/man-pages/man7/socket.7.html

Signed-off-by: Tony Lu <tonylu@linux.alibaba.com> Acked-by: Wenjia Zhang <wenjia@linux.ibm.com> Link: https://lore.kernel.org/r/20220922121906.72406-1-tonylu@linux.alibaba.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2022-09-26 | net/sched: taprio: simplify list iteration in taprio_dev_notifier() | Vladimir Oltean
taprio_dev_notifier() subscribes to netdev state changes in order to determine whether interfaces which have a taprio root qdisc have changed their link speed, so the internal calculations can be adapted properly. The 'qdev' temporary variable serves no purpose, because we just use it only once, and can just as well use qdisc_dev(q->root) directly (or the "dev" that comes from the netdev notifier; this is because qdev is only interesting if it was the subject of the state change, _and_ its root qdisc belongs in the taprio list). The 'found' variable also doesn't really serve too much of a purpose either; we can just call taprio_set_picos_per_byte() within the loop, and exit immediately afterwards. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Reviewed-by: Vinicius Costa Gomes <vinicius.gomes@intel.com> Link: https://lore.kernel.org/r/20220923145921.3038904-1-vladimir.oltean@nxp.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-09-26 | net: dsa: make user ports return to init_net on netns deletion | Vladimir Oltean
As pointed out during review, currently the following set of commands crashes the kernel:

$ ip netns add ns0
$ ip link set swp0 netns ns0
$ ip netns del ns0
WARNING: CPU: 1 PID: 27 at net/core/dev.c:10884 unregister_netdevice_many+0xaa4/0xaec
Workqueue: netns cleanup_net
pstate: 20000005 (nzCv daif -PAN -UAO -TCO -DIT -SSBS BTYPE=--)
pc : unregister_netdevice_many+0xaa4/0xaec
lr : unregister_netdevice_many+0x700/0xaec
Call trace:
 unregister_netdevice_many+0xaa4/0xaec
 default_device_exit_batch+0x294/0x340
 ops_exit_list+0xac/0xc4
 cleanup_net+0x2e4/0x544
 process_one_work+0x4ec/0xb40
---[ end trace 0000000000000000 ]---
unregister_netdevice: waiting for swp0 to become free. Usage count = 2

This is because DSA user ports, since they started populating dev->rtnl_link_ops in the blamed commit, gained a different treatment from default_device_exit_net(), which thinks these interfaces can now be unregistered. They can't; so set netns_refund = true to restore the behavior prior to populating dev->rtnl_link_ops.

Fixes: 95f510d0b792 ("net: dsa: allow the DSA master to be seen and changed through rtnetlink") Suggested-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Link: https://lore.kernel.org/r/20220921185428.1767001-1-vladimir.oltean@nxp.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-09-26 | xdp: improve page_pool xdp_return performance | Jesper Dangaard Brouer
During LPC2022 I met up with my page_pool co-maintainer Ilias. When discussing page_pool code we realised/remembered certain optimizations had not been fully utilised. Since commit c07aea3ef4d4 ("mm: add a signature in struct page") struct page has a direct pointer to the page_pool object this page was allocated from. Thus, with this info it is possible to skip the rhashtable_lookup to find the page_pool object in __xdp_return(). The rcu_read_lock can be removed as it was tied to xdp_mem_allocator. The page_pool object is still safe to access as it tracks inflight pages and (potentially) schedules final release from a work queue. Created a micro benchmark of XDP redirecting from mlx5 into veth with XDP_DROP bpf-prog on the peer veth device. This increased performance 6.5% from approx 8.45Mpps to 9Mpps corresponding to using 7 nanosec (27 cycles at 3.8GHz) less per packet. Suggested-by: Ilias Apalodimas <ilias.apalodimas@linaro.org> Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com> Acked-by: Toke Høiland-Jørgensen <toke@redhat.com> Reviewed-by: Ilias Apalodimas <ilias.apalodimas@linaro.org> Link: https://lore.kernel.org/r/166377993287.1737053.10258297257583703949.stgit@firesoul Signed-off-by: Jakub Kicinski <kuba@kernel.org>
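The per-packet numbers quoted above follow directly from the two packet rates. The small helpers below (hypothetical names, written only to reproduce the arithmetic) convert a rate in million packets per second to nanoseconds per packet and derive the saving.

```c
#include <assert.h>

/* Nanoseconds spent per packet at a rate given in Mpps:
 * 1 s / (mpps * 1e6 packets) = 1000 / mpps nanoseconds. */
static double demo_ns_per_packet(double mpps)
{
    assert(mpps > 0.0);
    return 1000.0 / mpps;
}

/* Time saved per packet when the rate rises from before to after. */
static double demo_ns_saved(double before_mpps, double after_mpps)
{
    return demo_ns_per_packet(before_mpps) - demo_ns_per_packet(after_mpps);
}

/* The same saving expressed in CPU cycles at a clock given in GHz. */
static double demo_cycles_saved(double before_mpps, double after_mpps,
                                double ghz)
{
    return demo_ns_saved(before_mpps, after_mpps) * ghz;
}

/* Relative throughput gain in percent. */
static double demo_percent_gain(double before_mpps, double after_mpps)
{
    assert(before_mpps > 0.0);
    return (after_mpps / before_mpps - 1.0) * 100.0;
}
```

With 8.45 and 9.0 Mpps this gives roughly 118.3 ns versus 111.1 ns per packet, a saving of about 7.2 ns, which at 3.8 GHz is about 27.5 cycles, and a gain of about 6.5%, in line with the rounded figures in the commit message.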
2022-09-26 | af_unix: Refactor unix_read_skb() | Peilin Ye
Similar to udp_read_skb(), delete the unnecessary while loop in unix_read_skb() for readability. Since recv_actor() cannot return a value greater than skb->len (see sk_psock_verdict_recv()), remove the redundant check. Suggested-by: Cong Wang <cong.wang@bytedance.com> Signed-off-by: Peilin Ye <peilin.ye@bytedance.com> Link: https://lore.kernel.org/r/7009141683ad6cd3785daced3e4a80ba0eb773b5.1663909008.git.peilin.ye@bytedance.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-09-26 | udp: Refactor udp_read_skb() | Peilin Ye
Delete the unnecessary while loop in udp_read_skb() for readability. Additionally, since recv_actor() cannot return a value greater than skb->len (see sk_psock_verdict_recv()), remove the redundant check. Suggested-by: Cong Wang <cong.wang@bytedance.com> Signed-off-by: Peilin Ye <peilin.ye@bytedance.com> Link: https://lore.kernel.org/r/343b5d8090a3eb764068e9f1d392939e2b423747.1663909008.git.peilin.ye@bytedance.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-09-23 | ipv6: tcp: send consistent autoflowlabel in RST packets | Eric Dumazet
Blamed commit added a txhash parameter to tcp_v6_send_response() but forgot to update tcp_v6_send_reset() accordingly. Fixes: aa51b80e1af4 ("ipv6: tcp: send consistent autoflowlabel in SYN_RECV state") Signed-off-by: Eric Dumazet <edumazet@google.com> Link: https://lore.kernel.org/r/20220922165036.1795862-1-eric.dumazet@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-09-23 | Merge tag 'linux-can-next-for-6.1-20220923' of git://git.kernel.org/pub/scm/linux/kernel/git/mkl/linux-can-next | Jakub Kicinski
Marc Kleine-Budde says:

====================
pull-request: can-next 2022-09-23

The first 2 patches are by Ziyang Xuan and optimize registration and the sending in the CAN BCM protocol a bit.

The next 8 patches target the gs_usb driver. 7 are by me and first fix the hardware timestamping support (added during this net-next cycle), rename a variable, convert the usb_control_msg + manual kmalloc()/kfree() to usb_control_msg_{send,recv}(), clean up the error handling and add switchable termination support. The patch by Rhett Aultman and Vasanth Sadhasivan converts the driver from usb_alloc_coherent()/usb_free_coherent() to kmalloc()/URB_FREE_BUFFER.

The last patch is by Shang XiaoJing and removes an unneeded call to dev_err() from the ctucanfd driver.

* tag 'linux-can-next-for-6.1-20220923' of git://git.kernel.org/pub/scm/linux/kernel/git/mkl/linux-can-next:
  can: ctucanfd: Remove redundant dev_err call
  can: gs_usb: remove dma allocations
  can: gs_usb: add switchable termination support
  can: gs_usb: gs_make_candev(): clean up error handling
  can: gs_usb: convert from usb_control_msg() to usb_control_msg_{send,recv}()
  can: gs_usb: gs_cmd_reset(): rename variable holding struct gs_can pointer to dev
  can: gs_usb: gs_can_open(): initialize time counter before starting device
  can: gs_usb: add missing lock to protect struct timecounter::cycle_last
  can: gs_usb: gs_usb_get_timestamp(): fix endpoint parameter for usb_control_msg_recv()
  can: bcm: check the result of can_send() in bcm_can_tx()
  can: bcm: registration process optimization in bcm_module_init()
====================

Link: https://lore.kernel.org/r/20220923120859.740577-1-mkl@pengutronix.de Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-09-23 | can: bcm: check the result of can_send() in bcm_can_tx() | Ziyang Xuan
If can_send() fails, bcm_can_tx() should not update the frames_abs counter. Add the result check for can_send() in bcm_can_tx(). Suggested-by: Marc Kleine-Budde <mkl@pengutronix.de> Suggested-by: Oliver Hartkopp <socketcan@hartkopp.net> Signed-off-by: Ziyang Xuan <william.xuanziyang@huawei.com> Link: https://lore.kernel.org/all/9851878e74d6d37aee2f1ee76d68361a46f89458.1663206163.git.william.xuanziyang@huawei.com Acked-by: Oliver Hartkopp <socketcan@hartkopp.net> Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
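The rule stated above, count a frame only when the send succeeded, can be sketched with stand-ins. `demo_op`, `demo_tx()` and the two send callbacks are hypothetical and only model the counter logic; they are not the CAN BCM code.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the per-operation state. */
struct demo_op {
    unsigned long frames_abs; /* frames actually handed to the lower layer */
};

typedef int (*demo_send_fn)(void);

static int demo_send_ok(void)   { return 0; }
static int demo_send_fail(void) { return -1; }

/* Sends one frame and bumps the absolute frame counter only when the
 * send reported success. The error is passed back to the caller. */
static int demo_tx(struct demo_op *op, demo_send_fn send)
{
    int err;

    assert(op != NULL && send != NULL);

    err = send();
    if (!err)
        op->frames_abs++;
    return err;
}
```

Incrementing unconditionally would make the counter drift above the number of frames that really left the socket whenever the lower layer rejects one.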
2022-09-23 | can: bcm: registration process optimization in bcm_module_init() | Ziyang Xuan
Currently, register_netdevice_notifier() and register_pernet_subsys() are both called after can_proto_register(). A CAN_BCM socket can be created and used as soon as can_proto_register() succeeds, so a notifier event or a proc node creation can be missed because the notifier is not registered or the bcm proc directory is not created yet. Although this is a low probability scenario, it is not impossible. Move register_pernet_subsys() and register_netdevice_notifier() ahead of can_proto_register(). In addition, register_pernet_subsys() and register_netdevice_notifier() may fail, so checking their results is necessary. Signed-off-by: Ziyang Xuan <william.xuanziyang@huawei.com> Link: https://lore.kernel.org/all/823cff0ebec33fa9389eeaf8b8ded3217c32cb38.1663206163.git.william.xuanziyang@huawei.com Acked-by: Oliver Hartkopp <socketcan@hartkopp.net> Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
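The ordering argument above is the usual "register dependencies first, expose the entry point last, unwind in reverse on failure" pattern. The sketch below models it with hypothetical helpers (`demo_register()`, `demo_module_init()`) and a `fail_step` parameter that simulates a failing registration; none of these names come from the kernel.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical record of which pieces are currently registered. */
struct demo_state {
    bool pernet;
    bool notifier;
    bool proto;
};

static int demo_register(bool *slot, bool fail)
{
    if (fail)
        return -1;
    *slot = true;
    return 0;
}

static void demo_unregister(bool *slot)
{
    *slot = false;
}

/* Registers the supporting pieces before the protocol that makes
 * sockets creatable, checks every result, and unwinds in reverse
 * order. fail_step (1..3) simulates a failure; 0 means no failure. */
static int demo_module_init(struct demo_state *st, int fail_step)
{
    int err;

    assert(st != NULL);

    err = demo_register(&st->pernet, fail_step == 1);
    if (err)
        return err;

    err = demo_register(&st->notifier, fail_step == 2);
    if (err)
        goto out_pernet;

    /* Last: from here on user space can create sockets. */
    err = demo_register(&st->proto, fail_step == 3);
    if (err)
        goto out_notifier;

    return 0;

out_notifier:
    demo_unregister(&st->notifier);
out_pernet:
    demo_unregister(&st->pernet);
    return err;
}
```

Because the protocol is registered last, no socket can exist while the notifier or per-namespace state is still missing, and a failure at any step leaves nothing registered.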
2022-09-23 | net: phy: Add support for rate matching | Sean Anderson
This adds support for rate matching (also known as rate adaptation) to the phy subsystem. The general idea is that the phy interface runs at one speed, and the MAC throttles the rate at which it sends packets to the link speed. There's a good overview of several techniques for achieving this at [1]. This patch adds support for three: pause-frame based (such as in Aquantia phys), CRS-based (such as in 10PASS-TS and 2BASE-TL), and open-loop-based (such as in 10GBASE-W).

This patch makes a few assumptions and a few non assumptions about the types of rate matching available. First, it assumes that different phys may use different forms of rate matching. Second, it assumes that phys can use rate matching for any of their supported link speeds (e.g. if a phy supports 10BASE-T and XGMII, then it can adapt XGMII to 10BASE-T). Third, it does not assume that all interface modes will use the same form of rate matching. Fourth, it does not assume that all phy devices will support rate matching (even if some do). Relaxing or strengthening these (non-)assumptions could result in a different API. For example, if all interface modes were assumed to use the same form of rate matching, then a bitmask of interface modes supporting rate matching would suffice.

For some better visibility into the process, the current rate matching mode is exposed as part of the ethtool ksettings. For the moment, only read access is supported. I'm not sure what userspace might want to configure yet (disable it altogether, disable just one mode, specify the mode to use, etc.). For the moment, since only pause-based rate adaptation support is added in the next few commits, rate matching can be disabled altogether by adjusting the advertisement.

802.3 calls this feature "rate adaptation" in clause 49 (10GBASE-R) and "rate matching" in clause 61 (10PASS-TS and 2BASE-TL). Aquantia also calls this feature "rate adaptation". I chose "rate matching" because it is shorter, and because Russell doesn't think "adaptation" is correct in this context.

Signed-off-by: Sean Anderson <sean.anderson@seco.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-09-22 | ethtool: tunnels: check the return value of nla_nest_start() | Li Zhong
Check the return value of nla_nest_start(). When starting the entry level nested attributes, if the tailroom of socket buffer is insufficient to store the attribute header and payload, the return value will be NULL. There is, however, no real bug here since if the skb is full nla_put_be16() will fail as well and we'll error out. Signed-off-by: Li Zhong <floridsleeves@gmail.com> Link: https://lore.kernel.org/r/20220921181716.1629541-1-floridsleeves@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-09-22 | net/sched: use tc_qdisc_stats_dump() in qdisc | Zhengchao Shao
Use tc_qdisc_stats_dump() in qdisc. Signed-off-by: Zhengchao Shao <shaozhengchao@huawei.com> Reviewed-by: Victor Nogueira <victor@mojatatu.com> Tested-by: Victor Nogueira <victor@mojatatu.com> Acked-by: Toke Høiland-Jørgensen <toke@redhat.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-09-22 | net/sched: taprio: remove unnecessary taprio_list_lock | Vladimir Oltean
The 3 functions that want access to the taprio_list: taprio_dev_notifier(), taprio_destroy() and taprio_init() are all called with the rtnl_mutex held, therefore implicitly serialized with respect to each other. A spin lock serves no purpose. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Acked-by: Vinicius Costa Gomes <vinicius.gomes@intel.com> Link: https://lore.kernel.org/r/20220921095632.1379251-1-vladimir.oltean@nxp.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-09-22 | net/tls: Support 256 bit keys with TX device offload | Gal Pressman
Add the missing clause for 256 bit keys in tls_set_device_offload(), and the needed adjustments in tls_device_fallback.c. Reviewed-by: Tariq Toukan <tariqt@nvidia.com> Signed-off-by: Gal Pressman <gal@nvidia.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-09-22 | net/tls: Use cipher sizes structs | Gal Pressman
Use the newly introduced cipher sizes structs instead of the repeated switch cases churn. Reviewed-by: Tariq Toukan <tariqt@nvidia.com> Signed-off-by: Gal Pressman <gal@nvidia.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-09-22 | net/tls: Describe ciphers sizes by const structs | Tariq Toukan
Introduce a cipher sizes descriptor. It helps reduce code duplication and the repeated switch/cases that assign the proper sizes according to the cipher type. Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Signed-off-by: Gal Pressman <gal@nvidia.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
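The idea of replacing repeated switch/case blocks with a constant descriptor table can be sketched like this. The enum, struct and lookup helper are hypothetical names for illustration and are not the kernel's definitions; the byte sizes are the usual AES-GCM record parameters (8-byte explicit IV, 4-byte salt, 16-byte tag, 8-byte record sequence number).

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical cipher identifiers used as table indices. */
enum demo_cipher {
    DEMO_AES_GCM_128,
    DEMO_AES_GCM_256,
    DEMO_CIPHER_MAX
};

/* One row per cipher: every size a caller might otherwise look up
 * through its own switch statement. */
struct demo_cipher_size_desc {
    unsigned int iv;
    unsigned int key;
    unsigned int salt;
    unsigned int tag;
    unsigned int rec_seq;
};

static const struct demo_cipher_size_desc demo_sizes[] = {
    [DEMO_AES_GCM_128] = { .iv = 8, .key = 16, .salt = 4, .tag = 16, .rec_seq = 8 },
    [DEMO_AES_GCM_256] = { .iv = 8, .key = 32, .salt = 4, .tag = 16, .rec_seq = 8 },
};

/* Compile-time check that the table has exactly one row per cipher. */
static_assert(sizeof(demo_sizes) / sizeof(demo_sizes[0]) == DEMO_CIPHER_MAX,
              "one descriptor per cipher");

/* Returns the descriptor for a cipher, or NULL for an unknown value. */
static const struct demo_cipher_size_desc *demo_lookup(int cipher)
{
    if (cipher < 0 || cipher >= DEMO_CIPHER_MAX)
        return NULL;
    return &demo_sizes[cipher];
}
```

Adding a cipher then means adding one table row instead of touching every switch statement that needs a key, IV or tag length.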
2022-09-22 | Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net | Jakub Kicinski
drivers/net/ethernet/freescale/fec.h 7b15515fc1ca ("Revert "fec: Restart PPS after link state change"") 40c79ce13b03 ("net: fec: add stop mode support for imx8 platform") https://lore.kernel.org/all/20220921105337.62b41047@canb.auug.org.au/ drivers/pinctrl/pinctrl-ocelot.c c297561bc98a ("pinctrl: ocelot: Fix interrupt controller") 181f604b33cd ("pinctrl: ocelot: add ability to be used in a non-mmio configuration") https://lore.kernel.org/all/20220921110032.7cd28114@canb.auug.org.au/ tools/testing/selftests/drivers/net/bonding/Makefile bbb774d921e2 ("net: Add tests for bonding and team address list management") 152e8ec77640 ("selftests/bonding: add a test for bonding lladdr target") https://lore.kernel.org/all/20220921110437.5b7dbd82@canb.auug.org.au/ drivers/net/can/usb/gs_usb.c 5440428b3da6 ("can: gs_usb: gs_can_open(): fix race dev->can.state condition") 45dfa45f52e6 ("can: gs_usb: add RX and TX hardware timestamp support") https://lore.kernel.org/all/84f45a7d-92b6-4dc5-d7a1-072152fab6ff@tessares.net/ Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-09-22 | Merge tag 'net-6.0-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net | Linus Torvalds
Pull networking fixes from Jakub Kicinski:
"Including fixes from wifi, netfilter and can.

A handful of awaited fixes here - revert of the FEC changes, bluetooth fix, fixes for iwlwifi spew.

We added a warning in PHY/MDIO code which is triggering on a couple of platforms in a false-positive-ish way. If we can't iron that out over the week we'll drop it and re-add for 6.1.

I've added a new "follow up fixes" section for fixes to fixes in 6.0-rcs but it may actually give the false impression that those are problematic or that more testing time would have caught them. So likely a one time thing.

Follow up fixes:
 - nf_tables_addchain: fix nft_counters_enabled underflow
 - ebtables: fix memory leak when blob is malformed
 - nf_ct_ftp: fix deadlock when nat rewrite is needed

Current release - regressions:
 - Revert "fec: Restart PPS after link state change" and the related "net: fec: Use a spinlock to guard `fep->ptp_clk_on`"
 - Bluetooth: fix HCIGETDEVINFO regression
 - wifi: mt76: fix 5 GHz connection regression on mt76x0/mt76x2
 - mptcp: fix fwd memory accounting on coalesce
 - rwlock removal fall out:
    - ipmr: always call ip{,6}_mr_forward() from RCU read-side critical section
    - ipv6: fix crash when IPv6 is administratively disabled
 - tcp: read multiple skbs in tcp_read_skb()
 - mdio_bus_phy_resume state warning fallout:
    - eth: ravb: fix PHY state warning splat during system resume
    - eth: sh_eth: fix PHY state warning splat during system resume

Current release - new code bugs:
 - wifi: iwlwifi: don't spam logs with NSS>2 messages
 - eth: mtk_eth_soc: enable XDP support just for MT7986 SoC

Previous releases - regressions:
 - bonding: fix NULL deref in bond_rr_gen_slave_id
 - wifi: iwlwifi: mark IWLMEI as broken

Previous releases - always broken:
 - nf_conntrack helpers:
    - irc: tighten matching on DCC message
    - sip: fix ct_sip_walk_headers
 - osf: fix possible bogus match in nf_osf_find()
 - ipvlan: fix out-of-bound bugs caused by unset skb->mac_header
 - core: fix flow symmetric hash
 - bonding, team: unsync device addresses on ndo_stop
 - phy: micrel: fix shared interrupt on LAN8814"

* tag 'net-6.0-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (83 commits)
  selftests: forwarding: add shebang for sch_red.sh
  bnxt: prevent skb UAF after handing over to PTP worker
  net: marvell: Fix refcounting bugs in prestera_port_sfp_bind()
  net: sched: fix possible refcount leak in tc_new_tfilter()
  net: sunhme: Fix packet reception for len < RX_COPY_THRESHOLD
  udp: Use WARN_ON_ONCE() in udp_read_skb()
  selftests: bonding: cause oops in bond_rr_gen_slave_id
  bonding: fix NULL deref in bond_rr_gen_slave_id
  net: phy: micrel: fix shared interrupt on LAN8814
  net/smc: Stop the CLC flow if no link to map buffers on
  ice: Fix ice_xdp_xmit() when XDP TX queue number is not sufficient
  net: atlantic: fix potential memory leak in aq_ndev_close()
  can: gs_usb: gs_usb_set_phys_id(): return with error if identify is not supported
  can: gs_usb: gs_can_open(): fix race dev->can.state condition
  can: flexcan: flexcan_mailbox_read() fix return value for drop = true
  net: sh_eth: Fix PHY state warning splat during system resume
  net: ravb: Fix PHY state warning splat during system resume
  netfilter: nf_ct_ftp: fix deadlock when nat rewrite is needed
  netfilter: ebtables: fix memory leak when blob is malformed
  netfilter: nf_tables: fix percpu memory leak at nf_tables_addchain()
  ...
2022-09-22 | net: sched: fix possible refcount leak in tc_new_tfilter() | Hangyu Hua
tfilter_put needs to be called to put the refcount taken by tp->ops->get, to avoid a possible refcount leak when chain->tmplt_ops != NULL and chain->tmplt_ops != tp->ops. Fixes: 7d5509fa0d3d ("net: sched: extend proto ops with 'put' callback") Signed-off-by: Hangyu Hua <hbh25y@gmail.com> Reviewed-by: Vlad Buslov <vladbu@nvidia.com> Link: https://lore.kernel.org/r/20220921092734.31700-1-hbh25y@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-09-22 | udp: Use WARN_ON_ONCE() in udp_read_skb() | Peilin Ye
Prevent udp_read_skb() from flooding the syslog. Suggested-by: Jakub Sitnicki <jakub@cloudflare.com> Signed-off-by: Peilin Ye <peilin.ye@bytedance.com> Link: https://lore.kernel.org/r/20220921005915.2697-1-yepeilin.cs@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-09-22 | net/smc: Unbind r/w buffer size from clcsock and make them tunable | Tony Lu
Currently, SMC uses smc->sk.sk_{rcv|snd}buf to create buffers for send buffer and RMB. And the values of buffer size are from tcp_{w|r}mem in clcsock. The buffer size from TCP socket doesn't fit SMC well. Generally, buffers are usually larger than TCP for SMC-R/-D to get higher performance, for they are different underlay devices and paths. So this patch unbinds buffer size from TCP, and introduces two sysctl knobs to tune them independently. Also, these knobs are per net namespace and work for containers. Signed-off-by: Tony Lu <tonylu@linux.alibaba.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2022-09-22 | net/smc: Introduce a specific sysctl for TEST_LINK time | Wen Gu
SMC-R tests the viability of a link by sending out TEST_LINK LLC messages over the RoCE fabric when connections on the link have been idle for longer than the keepalive interval (testlink time). But using tcp_keepalive_time as the testlink time may not be suitable, because its default is no less than two hours[1], which is too long for a single link to find that its peer is dead. The active host will keep sending messages over the peer-dead link (QP), and can't find out until it gets IB_WC_RETRY_EXC_ERR error CQEs, which normally takes more time than the TEST_LINK timeout (SMC_LLC_WAIT_TIME). So this patch introduces an independent sysctl for SMC-R to set the link keepalive time, in order to detect link down in time. The default value is 30 seconds. [1] https://www.rfc-editor.org/rfc/rfc1122#page-101 Signed-off-by: Wen Gu <guwen@linux.alibaba.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2022-09-22 | net/smc: Stop the CLC flow if no link to map buffers on | Wen Gu
There might be a potential race between SMC-R buffer map and link group termination.

smc_smcr_terminate_all()     | smc_connect_rdma()
--------------------------------------------------------------
                             | smc_conn_create()
for links in smcibdev        |
        schedule links down  |
                             | smc_buf_create()
                             |  \- smcr_buf_map_usable_links()
                             |      \- no usable links found,
                             |         (rmb->mr = NULL)
                             |
                             | smc_clc_send_confirm()
                             |  \- access conn->rmb_desc->mr[]->rkey
                             |     (panic)

During reboot and IB device module remove, all links will be set down and no usable links remain in link groups. In such a situation smcr_buf_map_usable_links() should return an error and stop the CLC flow from accessing the uninitialized mr.

Fixes: b9247544c1bc ("net/smc: convert static link ID instances to support multiple links") Signed-off-by: Wen Gu <guwen@linux.alibaba.com> Link: https://lore.kernel.org/r/1663656189-32090-1-git-send-email-guwen@linux.alibaba.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2022-09-21 | Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf-next | Jakub Kicinski
Florian Westphal says:

====================
netfilter patches for net-next

Remove GPL license copypastry in uapi files, those have SPDX tags. From Christophe Jaillet.

Remove unused variable in rpfilter, from Guillaume Nault.

Rework gc resched delay computation in conntrack, from Antoine Tenart.

* 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf-next:
  netfilter: rpfilter: Remove unused variable 'ret'.
  headers: Remove some left-over license text in include/uapi/linux/netfilter/
  netfilter: conntrack: revisit the gc initial rescheduling bias
  netfilter: conntrack: fix the gc rescheduling delay
====================

Link: https://lore.kernel.org/r/20220921095000.29569-1-fw@strlen.de Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-09-21flow_dissector: Do not count vlan tags inside tunnel payloadQingqing Yang
We've met the problem that when there is a vlan tag inside GRE encapsulation, the match of num_of_vlans fails. This happens because the vlan tag inside the GRE payload is counted into num_of_vlans, which is not expected. One example packet is like this:

Ethernet II, Src: Broadcom_68:56:07 (00:10:18:68:56:07)
             Dst: Broadcom_68:56:08 (00:10:18:68:56:08)
802.1Q Virtual LAN, PRI: 0, DEI: 0, ID: 100
Internet Protocol Version 4, Src: 192.168.1.4, Dst: 192.168.1.200
Generic Routing Encapsulation (Transparent Ethernet bridging)
Ethernet II, Src: Broadcom_68:58:07 (00:10:18:68:58:07)
             Dst: Broadcom_68:58:08 (00:10:18:68:58:08)
802.1Q Virtual LAN, PRI: 0, DEI: 0, ID: 200
...

It should match the (num_of_vlans 1) rule, but it matches the (num_of_vlans 2) rule. The vlan tags inside the GRE or other tunnel encapsulated payload should not be counted into num_of_vlans. The fix is to stop counting vlan tags once the encapsulation bit is set.

Fixes: 34951fcf26c5 ("flow_dissector: Add number of vlan tags dissector") Signed-off-by: Qingqing Yang <qingqing.yang@broadcom.com> Reviewed-by: Boris Sukholitko <boris.sukholitko@broadcom.com> Link: https://lore.kernel.org/r/20220919074808.136640-1-qingqing.yang@broadcom.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
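The counting rule described in the flow_dissector message can be illustrated with a small user-space sketch. This is not the kernel's flow dissector: the enum hdr_kind type, the count_outer_vlans() helper and the local encap flag are hypothetical names chosen for illustration, and GRE stands in for any tunnel header.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical header kinds; these are not the kernel's real types. */
enum hdr_kind { HDR_ETH, HDR_VLAN, HDR_IPV4, HDR_GRE };

/*
 * Count VLAN tags in header order, but stop counting once a tunnel
 * header has been seen, so tags inside the encapsulated payload are
 * ignored.
 */
static int count_outer_vlans(const enum hdr_kind *hdrs, size_t n)
{
	int encap = 0;	/* stands in for the "encapsulation" bit */
	int vlans = 0;
	size_t i;

	assert(hdrs != NULL || n == 0);
	for (i = 0; i < n; i++) {
		if (hdrs[i] == HDR_GRE)
			encap = 1;
		else if (hdrs[i] == HDR_VLAN && !encap)
			vlans++;
	}
	return vlans;
}
```

For the example packet in the message (Ethernet, VLAN 100, IPv4, GRE, Ethernet, VLAN 200), this helper returns 1, which is the count the (num_of_vlans 1) rule expects.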
2022-09-21net: sched: remove unused tcf_result extensionJamal Hadi Salim
Added by: commit e5cf1baf92cb ("act_mirred: use TC_ACT_REINSERT when possible") but no longer useful. Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com> Link: https://lore.kernel.org/r/20220919130627.3551233-1-jhs@mojatatu.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-09-21net: sched: simplify code in mall_reoffloadWilliam Dean
An expression such as:

	if (err)
		return err;
	return 0;

can be simplified to:

	return err;

Signed-off-by: William Dean <williamsukatube@163.com> Link: https://lore.kernel.org/r/20220917063556.2673-1-williamsukatube@163.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-09-21net/af_packet: registration process optimization in packet_init()Ziyang Xuan
Currently, register_pernet_subsys() and register_netdevice_notifier() are both called after sock_register(). A PF_PACKET socket can be created and used as soon as sock_register() succeeds, so it is possible for a PF_PACKET socket to be created before register_pernet_subsys() and register_netdevice_notifier() have run. In that case net->packet.sklist_lock and net->packet.sklist are accessed without the initialization that is done in packet_net_init(), although this is a low probability scenario. Move register_pernet_subsys() and register_netdevice_notifier() to the front in packet_init(). Correspondingly, adjust the unregister process in packet_exit(). Signed-off-by: Ziyang Xuan <william.xuanziyang@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-09-21net: sched: act_ct: remove redundant variable errJinpeng Cui
Return value directly from pskb_trim_rcsum() instead of getting value from redundant variable err. Reported-by: Zeal Robot <zealci@zte.com.cn> Signed-off-by: Jinpeng Cui <cui.jinpeng2@zte.com.cn> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-09-21netfilter: rpfilter: Remove unused variable 'ret'.Guillaume Nault
Commit 91a178258aea ("netfilter: rpfilter: Convert rpfilter_lookup_reverse to new dev helper") removed the need for the 'ret' variable. This went unnoticed because of the __maybe_unused annotation. Signed-off-by: Guillaume Nault <gnault@redhat.com> Signed-off-by: Florian Westphal <fw@strlen.de>
2022-09-21netfilter: conntrack: revisit the gc initial rescheduling biasAntoine Tenart
The previous commit changed the way the rescheduling delay is computed, which has a side effect: the bias is now represented as much as the other entries in the rescheduling delay, which makes the logic kick in only with very large sets, as the initial interval is very large (INT_MAX). Revisit the GC initial bias to allow more frequent GC for smaller sets while still avoiding wakeups when a machine is mostly idle. We're moving from a large initial value to pretending we have 100 entries expiring at the upper bound. This way a few entries with a small timeout won't impact the rescheduling delay much, and non-idle machines will have enough entries to lower the delay when needed. This also improves readability, as the initial bias is now linked to what is computed instead of being an arbitrary large value. Fixes: 2cfadb761d3d ("netfilter: conntrack: revisit gc autotuning") Suggested-by: Florian Westphal <fw@strlen.de> Signed-off-by: Antoine Tenart <atenart@kernel.org> Signed-off-by: Florian Westphal <fw@strlen.de>
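A rough user-space sketch of the initial bias, assuming the incremental mean from the neighbouring "fix the gc rescheduling delay" message and times expressed in milliseconds. The struct gc_avg type, the gc_avg_init() and gc_avg_add() helpers, and the GC_UPPER_MS and GC_BIAS_COUNT macros are hypothetical names, not the kernel's; only the "100 entries at the upper bound" seed and the 60s bound come from the messages.

```c
#include <assert.h>

#define GC_UPPER_MS	60000L	/* upper bound of the 1-60s window, in ms */
#define GC_BIAS_COUNT	100L	/* pretend this many entries expire at the bound */

/* Hypothetical running-average state. */
struct gc_avg {
	long avg_ms;
	long count;
};

/* Seed the average as if GC_BIAS_COUNT entries expired at the upper bound. */
static void gc_avg_init(struct gc_avg *a)
{
	a->avg_ms = GC_UPPER_MS;
	a->count = GC_BIAS_COUNT;
}

/* Fold one scanned entry into the running mean. */
static void gc_avg_add(struct gc_avg *a, long expires_ms)
{
	assert(a->count >= 0);
	a->avg_ms += (expires_ms - a->avg_ms) / ++a->count;
}
```

With this seed, three entries that expire in 1000 ms only move the average from 60000 ms to 58283 ms, whereas thousands of such entries pull it down to a few seconds. That matches the stated intent: a handful of short timeouts on an idle machine should not trigger frequent wakeups, while a busy machine still lowers the delay.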
2022-09-21netfilter: conntrack: fix the gc rescheduling delayAntoine Tenart
Commit 2cfadb761d3d ("netfilter: conntrack: revisit gc autotuning") changed the eviction rescheduling to use the average expiry of scanned entries (within 1-60s) by doing:

	for (...) {
		expires = clamp(nf_ct_expires(tmp), ...);
		next_run += expires;
		next_run /= 2;
	}

The issue is that the above makes the average ('next_run' here) more dependent on the last expiration values than on the first ones (for sets > 2). Depending on the expiration values used to compute the average, the result can be quite different from what's expected. To fix this we can do the following:

	for (...) {
		expires = clamp(nf_ct_expires(tmp), ...);
		next_run += (expires - next_run) / ++count;
	}

Fixes: 2cfadb761d3d ("netfilter: conntrack: revisit gc autotuning") Cc: Florian Westphal <fw@strlen.de> Signed-off-by: Antoine Tenart <atenart@kernel.org> Signed-off-by: Florian Westphal <fw@strlen.de>
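The difference between the two loops can be reproduced in user space. The sketch below assumes expiry values in whole seconds, a starting next_run of 0 and a starting count of 0; those starting values, along with the clamp_long(), avg_by_halving() and avg_incremental() helpers, are assumptions for illustration and are not taken from the kernel source.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the kernel's clamp() macro. */
static long clamp_long(long v, long lo, long hi)
{
	return v < lo ? lo : (v > hi ? hi : v);
}

/* Old scheme: add, then halve. The newest sample always gets weight 1/2. */
static long avg_by_halving(const long *expires, size_t n)
{
	long next_run = 0;
	size_t i;

	assert(expires != NULL || n == 0);
	for (i = 0; i < n; i++) {
		next_run += clamp_long(expires[i], 1, 60);
		next_run /= 2;
	}
	return next_run;
}

/* Fixed scheme: incremental mean. Every sample ends up with equal weight. */
static long avg_incremental(const long *expires, size_t n)
{
	long next_run = 0;
	long count = 0;
	size_t i;

	assert(expires != NULL || n == 0);
	for (i = 0; i < n; i++) {
		long e = clamp_long(expires[i], 1, 60);

		next_run += (e - next_run) / ++count;
	}
	return next_run;
}
```

For the samples 60, 60, 60, 1 the true mean is 45.25. The halving loop yields 26 because the last sample dominates, while the incremental loop yields 46 (integer division truncates toward zero).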
2022-09-20net/sched: use tc_cls_stats_dump() in filterZhengchao Shao
use tc_cls_stats_dump() in filter. Signed-off-by: Zhengchao Shao <shaozhengchao@huawei.com> Reviewed-by: Jamal Hadi Salim <jhs@mojatatu.com> Reviewed-by: Victor Nogueira <victor@mojatatu.com> Tested-by: Victor Nogueira <victor@mojatatu.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-09-20netfilter: nf_ct_ftp: fix deadlock when nat rewrite is neededFlorian Westphal
We can't use ct->lock, this is already used by the seqadj internals. When using ftp helper + nat, seqadj will attempt to acquire ct->lock again. Revert back to a global lock for now. Fixes: c783a29c7e59 ("netfilter: nf_ct_ftp: prefer skb_linearize") Reported-by: Bruno de Paula Larini <bruno.larini@riosoft.com.br> Signed-off-by: Florian Westphal <fw@strlen.de>
2022-09-20netfilter: ebtables: fix memory leak when blob is malformedFlorian Westphal
The bug fix was incomplete, it "replaced" crash with a memory leak. The old code had an assignment to "ret" embedded into the conditional, restore this. Fixes: 7997eff82828 ("netfilter: ebtables: reject blobs that don't provide all entry points") Reported-and-tested-by: syzbot+a24c5252f3e3ab733464@syzkaller.appspotmail.com Signed-off-by: Florian Westphal <fw@strlen.de>
2022-09-20netfilter: nf_tables: fix percpu memory leak at nf_tables_addchain()Tetsuo Handa
It seems to me that percpu memory for chain stats started leaking since commit 3bc158f8d0330f0a ("netfilter: nf_tables: map basechain priority to hardware priority") when nft_chain_offload_priority() returned an error. Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Fixes: 3bc158f8d0330f0a ("netfilter: nf_tables: map basechain priority to hardware priority") Signed-off-by: Florian Westphal <fw@strlen.de>