summaryrefslogtreecommitdiff
path: root/net
AgeCommit message (Collapse)Author
2021-11-01net/smc: Introduce tracepoint for fallbackTony Lu
This introduces tracepoint for smc fallback to TCP, so that we can track which connection and why it fallbacks, and map the clcsocks' pointer with /proc/net/tcp to find more details about TCP connections. Compared with kprobe or other dynamic tracing, tracepoints are stable and easy to use. Signed-off-by: Tony Lu <tonylu@linux.alibaba.com> Reviewed-by: Wen Gu <guwen@linux.alibaba.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-11-01ethtool: don't drop the rtnl_lock half way thru the ioctlJakub Kicinski
devlink compat code needs to drop rtnl_lock to take devlink->lock to ensure correct lock ordering. This is problematic because we're not strictly guaranteed that the netdev will not disappear after we re-lock. It may open a possibility of nested ->begin / ->complete calls. Instead of calling into devlink under rtnl_lock take a ref on the devlink instance and make the call after we've dropped rtnl_lock. We (continue to) assume that netdevs have an implicit reference on the devlink returned from ndo_get_devlink_port Note that ndo_get_devlink_port will now get called under rtnl_lock. That should be fine since none of the drivers seem to be taking serious locks inside ndo_get_devlink_port. Signed-off-by: Jakub Kicinski <kuba@kernel.org> Reviewed-by: Leon Romanovsky <leonro@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-11-01devlink: expose get/put functionsJakub Kicinski
Allow those who hold implicit reference on a devlink instance to try to take a full ref on it. This will be used from netdev code which has an implicit ref because of driver call ordering. Note that after recent changes devlink_unregister() may happen before netdev unregister, but devlink_free() should still happen after, so we are safe to try, but we can't just refcount_inc() and assume it's not zero. Signed-off-by: Jakub Kicinski <kuba@kernel.org> Reviewed-by: Leon Romanovsky <leonro@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-11-01ethtool: handle info/flash data copying outside rtnl_lockJakub Kicinski
We need to increase the lifetime of the data for .get_info and .flash_update beyond their handlers inside rtnl_lock. Allocate a union on the heap and use it instead. Note that we now copy the ethcmd before we lookup dev, hopefully there is no crazy user space depending on error codes. Signed-off-by: Jakub Kicinski <kuba@kernel.org> Reviewed-by: Leon Romanovsky <leonro@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-11-01ethtool: push the rtnl_lock into dev_ethtool()Jakub Kicinski
Don't take the lock in net/core/dev_ioctl.c, we'll have things to do outside rtnl_lock soon. Signed-off-by: Jakub Kicinski <kuba@kernel.org> Reviewed-by: Leon Romanovsky <leonro@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-11-01net: dsa: populate supported_interfaces memberMarek Behún
Add a new DSA switch operation, phylink_get_interfaces, which should fill in which PHY_INTERFACE_MODE_* are supported by given port. Use this before phylink_create() to fill phylinks supported_interfaces member, allowing phylink to determine which PHY_INTERFACE_MODEs are supported. Signed-off-by: Marek Behún <kabel@kernel.org> [tweaked patch and description to add more complete support -- rmk] Signed-off-by: Russell King <rmk+kernel@armlinux.org.uk> Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-11-01Merge branch 'master' of ↵David S. Miller
git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec-next Steffen Klassert says: ==================== pull request (net-next): ipsec-next 2021-10-30 Just two minor changes this time: 1) Remove some superfluous header files from xfrm4_tunnel.c From Mianhan Liu. 2) Simplify some error checks in xfrm_input(). From luo penghao. Please pull or let me know if there are problems. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2021-11-01Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf-nextDavid S. Miller
Pablo Neira Ayuso says: ==================== Netfilter updates for net-next The following patchset contains Netfilter updates for net-next: 1) Use array_size() in ebtables, from Gustavo A. R. Silva. 2) Attach IPS_ASSURED to internal UDP stream state, reported by Maciej Zenczykowski. 3) Add NFT_META_IFTYPE to match on the interface type either from ingress or egress. 4) Generalize pktinfo->tprot_set to flags field. 5) Allow to match on inner headers / payload data. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2021-11-01netfilter: nft_payload: support for inner header matching / manglingPablo Neira Ayuso
Allow to match and mangle on inner headers / payload data after the transport header. There is a new field in the pktinfo structure that stores the inner header offset which is calculated only when requested. Only TCP and UDP supported at this stage. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2021-11-01netfilter: nf_tables: convert pktinfo->tprot_set to flags fieldPablo Neira Ayuso
Generalize boolean field to store more flags on the pktinfo structure. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2021-11-01netfilter: nft_meta: add NFT_META_IFTYPEPablo Neira Ayuso
Generalize NFT_META_IIFTYPE to NFT_META_IFTYPE which allows you to match on the interface type of the skb->dev field. This field is used by the netdev family to add an implicit dependency to skip non-ethernet packets when matching on layer 3 and 4 TCP/IP header fields. For backward compatibility, add the NFT_META_IIFTYPE alias to NFT_META_IFTYPE. Add __NFT_META_IIFTYPE, to be used by userspace in the future to match specifically on the iiftype. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2021-11-01netfilter: conntrack: set on IPS_ASSURED if flows enters internal stream statePablo Neira Ayuso
The internal stream state sets the timeout to 120 seconds 2 seconds after the creation of the flow, attach this internal stream state to the IPS_ASSURED flag for consistent event reporting. Before this patch: [NEW] udp 17 30 src=10.246.11.13 dst=216.239.35.0 sport=37282 dport=123 [UNREPLIED] src=216.239.35.0 dst=10.246.11.13 sport=123 dport=37282 [UPDATE] udp 17 30 src=10.246.11.13 dst=216.239.35.0 sport=37282 dport=123 src=216.239.35.0 dst=10.246.11.13 sport=123 dport=37282 [UPDATE] udp 17 30 src=10.246.11.13 dst=216.239.35.0 sport=37282 dport=123 src=216.239.35.0 dst=10.246.11.13 sport=123 dport=37282 [ASSURED] [DESTROY] udp 17 src=10.246.11.13 dst=216.239.35.0 sport=37282 dport=123 src=216.239.35.0 dst=10.246.11.13 sport=123 dport=37282 [ASSURED] Note IPS_ASSURED for the flow not yet in the internal stream state. after this update: [NEW] udp 17 30 src=10.246.11.13 dst=216.239.35.0 sport=37282 dport=123 [UNREPLIED] src=216.239.35.0 dst=10.246.11.13 sport=123 dport=37282 [UPDATE] udp 17 30 src=10.246.11.13 dst=216.239.35.0 sport=37282 dport=123 src=216.239.35.0 dst=10.246.11.13 sport=123 dport=37282 [UPDATE] udp 17 120 src=10.246.11.13 dst=216.239.35.0 sport=37282 dport=123 src=216.239.35.0 dst=10.246.11.13 sport=123 dport=37282 [ASSURED] [DESTROY] udp 17 src=10.246.11.13 dst=216.239.35.0 sport=37282 dport=123 src=216.239.35.0 dst=10.246.11.13 sport=123 dport=37282 [ASSURED] Before this patch, short-lived UDP flows never entered IPS_ASSURED, so they were already candidate flow to be deleted by early_drop under stress. Before this patch, IPS_ASSURED is set on regardless the internal stream state, attach this internal stream state to IPS_ASSURED. packet #1 (original direction) enters NEW state packet #2 (reply direction) enters ESTABLISHED state, sets on IPS_SEEN_REPLY paclet #3 (any direction) sets on IPS_ASSURED (if 2 seconds since the creation has passed by). Reported-by: Maciej Żenczykowski <zenczykowski@gmail.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2021-10-29net: bridge: switchdev: fix shim definition for br_switchdev_mdb_notifyVladimir Oltean
br_switchdev_mdb_notify() is conditionally compiled only when CONFIG_NET_SWITCHDEV=y and CONFIG_BRIDGE_IGMP_SNOOPING=y. It is called from br_mdb.c, which is conditionally compiled only when CONFIG_BRIDGE_IGMP_SNOOPING=y. The shim definition of br_switchdev_mdb_notify() is therefore needed for the case where CONFIG_NET_SWITCHDEV=n, however we mistakenly put it there for the case where CONFIG_BRIDGE_IGMP_SNOOPING=n. This results in build failures when CONFIG_BRIDGE_IGMP_SNOOPING=y and CONFIG_NET_SWITCHDEV=n. To fix this, put the shim definition right next to br_switchdev_fdb_notify(), which is properly guarded by NET_SWITCHDEV=n. Since this is called only from br_mdb.c, we need not take any extra safety precautions, when NET_SWITCHDEV=n and BRIDGE_IGMP_SNOOPING=n, this shim definition will be absent but nobody will be needing it. Fixes: 9776457c784f ("net: bridge: mdb: move all switchdev logic to br_switchdev.c") Reported-by: kernel test robot <lkp@intel.com> Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Link: https://lore.kernel.org/r/20211029223606.3450523-1-vladimir.oltean@nxp.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-10-29devlink: make all symbols GPL-onlyJakub Kicinski
devlink_alloc() and devlink_register() are both GPL. A non-GPL module won't get far, so for consistency we can make all symbols GPL without risking any real life breakage. Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-10-29mctp: Pass flow data & flow release events to driversJeremy Kerr
Now that we have an extension for MCTP data in skbs, populate the flow when a key has been created for the packet, and add a device driver operation to inform of flow destruction. Includes a fix for a warning with test builds: Reported-by: kernel test robot <lkp@intel.com> Signed-off-by: Jeremy Kerr <jk@codeconstruct.com.au> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-10-29mctp: Add flow extension to skbJeremy Kerr
This change adds a new skb extension for MCTP, to represent a request/response flow. The intention is to use this in a later change to allow i2c controllers to correctly configure a multiplexer over a flow. Since we have a cleanup function in the core path (if an extension is present), we'll need to make CONFIG_MCTP a bool, rather than a tristate. Includes a fix for a build warning with clang: Reported-by: kernel test robot <lkp@intel.com> Signed-off-by: Jeremy Kerr <jk@codeconstruct.com.au> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-10-29mctp: Return new key from mctp_alloc_local_tagJeremy Kerr
In a future change, we will want the key available for future use after allocating a new tag. This change returns the key from mctp_alloc_local_tag, rather than just key->tag. Signed-off-by: Jeremy Kerr <jk@codeconstruct.com.au> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-10-28net: bridge: switchdev: consistent function namingVladimir Oltean
Rename all recently imported functions in br_switchdev.c to start with a br_switchdev_* prefix. br_fdb_replay_one() -> br_switchdev_fdb_replay_one() br_fdb_replay() -> br_switchdev_fdb_replay() br_vlan_replay_one() -> br_switchdev_vlan_replay_one() br_vlan_replay() -> br_switchdev_vlan_replay() struct br_mdb_complete_info -> struct br_switchdev_mdb_complete_info br_mdb_complete() -> br_switchdev_mdb_complete() br_mdb_switchdev_host_port() -> br_switchdev_host_mdb_one() br_mdb_switchdev_host() -> br_switchdev_host_mdb() br_mdb_replay_one() -> br_switchdev_mdb_replay_one() br_mdb_replay() -> br_switchdev_mdb_replay() br_mdb_queue_one() -> br_switchdev_mdb_queue_one() Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Acked-by: Nikolay Aleksandrov <nikolay@nvidia.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-10-28net: bridge: mdb: move all switchdev logic to br_switchdev.cVladimir Oltean
The following functions: br_mdb_complete br_switchdev_mdb_populate br_mdb_replay_one br_mdb_queue_one br_mdb_replay br_mdb_switchdev_host_port br_mdb_switchdev_host br_switchdev_mdb_notify are only accessible from code paths where CONFIG_NET_SWITCHDEV is enabled. So move them to br_switchdev.c, in order for that code to be compiled out if that config option is disabled. Note that br_switchdev.c gets build regardless of whether CONFIG_BRIDGE_IGMP_SNOOPING is enabled or not, whereas br_mdb.c only got built when CONFIG_BRIDGE_IGMP_SNOOPING was enabled. So to preserve correct compilation with CONFIG_BRIDGE_IGMP_SNOOPING being disabled, we must now place an #ifdef around these functions in br_switchdev.c. The offending bridge data structures that need this are br->multicast_lock and br->mdb_list, these are also compiled out of struct net_bridge when CONFIG_BRIDGE_IGMP_SNOOPING is turned off. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Acked-by: Nikolay Aleksandrov <nikolay@nvidia.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-10-28net: bridge: split out the switchdev portion of br_mdb_notifyVladimir Oltean
Similar to fdb_notify() and br_switchdev_fdb_notify(), split the switchdev specific logic from br_mdb_notify() into a different function. This will be moved later in br_switchdev.c. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Acked-by: Nikolay Aleksandrov <nikolay@nvidia.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-10-28net: bridge: move br_vlan_replay to br_switchdev.cVladimir Oltean
br_vlan_replay() is relevant only if CONFIG_NET_SWITCHDEV is enabled, so move it to br_switchdev.c. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Acked-by: Nikolay Aleksandrov <nikolay@nvidia.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-10-28net: bridge: provide shim definition for br_vlan_flagsVladimir Oltean
br_vlan_replay() needs this, and we're preparing to move it to br_switchdev.c, which will be compiled regardless of whether or not CONFIG_BRIDGE_VLAN_FILTERING is enabled. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Acked-by: Nikolay Aleksandrov <nikolay@nvidia.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-10-28Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski
include/net/sock.h 7b50ecfcc6cd ("net: Rename ->stream_memory_read to ->sock_is_readable") 4c1e34c0dbff ("vsock: Enable y2038 safe timeval for timeout") drivers/net/ethernet/marvell/octeontx2/af/rvu_debugfs.c 0daa55d033b0 ("octeontx2-af: cn10k: debugfs for dumping LMTST map table") e77bcdd1f639 ("octeontx2-af: Display all enabled PF VF rsrc_alloc entries.") Adjacent code addition in both cases, keep both. Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-10-28mptcp: fix corrupt receiver key in MPC + data + checksumDavide Caratti
using packetdrill it's possible to observe that the receiver key contains random values when clients transmit MP_CAPABLE with data and checksum (as specified in RFC8684 §3.1). Fix the layout of mptcp_out_options, to avoid using the skb extension copy when writing the MP_CAPABLE sub-option. Fixes: d7b269083786 ("mptcp: shrink mptcp_out_options struct") Closes: https://github.com/multipath-tcp/mptcp_net-next/issues/233 Reported-by: Poorva Sonparote <psonparo@redhat.com> Signed-off-by: Davide Caratti <dcaratti@redhat.com> Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Link: https://lore.kernel.org/r/20211027203855.264600-1-mathew.j.martineau@linux.intel.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-10-28devlink: Simplify internal devlink params implementationLeon Romanovsky
Reduce extra indirection from devlink_params_*() API. Such change makes it clear that we can drop devlink->lock from these flows, because everything is executed when the devlink is not registered yet. Signed-off-by: Leon Romanovsky <leonro@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-10-28net/tls: Fix flipped sign in async_wait.err assignmentDaniel Jordan
sk->sk_err contains a positive number, yet async_wait.err wants the opposite. Fix the missed sign flip, which Jakub caught by inspection. Fixes: a42055e8d2c3 ("net/tls: Add support for async encryption of records for performance") Suggested-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Daniel Jordan <daniel.m.jordan@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-10-28net/tls: Fix flipped sign in tls_err_abort() callsDaniel Jordan
sk->sk_err appears to expect a positive value, a convention that ktls doesn't always follow and that leads to memory corruption in other code. For instance, [kworker] tls_encrypt_done(..., err=<negative error from crypto request>) tls_err_abort(.., err) sk->sk_err = err; [task] splice_from_pipe_feed ... tls_sw_do_sendpage if (sk->sk_err) { ret = -sk->sk_err; // ret is positive splice_from_pipe_feed (continued) ret = actor(...) // ret is still positive and interpreted as bytes // written, resulting in underflow of buf->len and // sd->len, leading to huge buf->offset and bogus // addresses computed in later calls to actor() Fix all tls_err_abort() callers to pass a negative error code consistently and centralize the error-prone sign flip there, throwing in a warning to catch future misuse and uninlining the function so it really does only warn once. Cc: stable@vger.kernel.org Fixes: c46234ebb4d1e ("tls: RX path for ktls") Reported-by: syzbot+b187b77c8474f9648fae@syzkaller.appspotmail.com Signed-off-by: Daniel Jordan <daniel.m.jordan@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-10-28net: ipconfig: Release the rtnl_lock while waiting for carrierMaxime Chevallier
While waiting for a carrier to come on one of the netdevices, some devices will require to take the rtnl lock at some point to fully initialize all parts of the link. That's the case for SFP, where the rtnl is taken when a module gets detected. This prevents mounting an NFS rootfs over an SFP link. This means that while ipconfig waits for carriers to be detected, no SFP modules can be detected in the meantime, it's only detected after ipconfig times out. This commit releases the rtnl_lock while waiting for the carrier to come up, and re-takes it to check the for the init device and carrier status. Signed-off-by: Maxime Chevallier <maxime.chevallier@bootlin.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-10-28sch_htb: Add extack messages for EOPNOTSUPP errorsMaxim Mikityanskiy
In order to make the "Operation not supported" message clearer to the user, add extack messages explaining why exactly adding offloaded HTB could be not supported in each case. Signed-off-by: Maxim Mikityanskiy <maximmi@nvidia.com> Reviewed-by: Tariq Toukan <tariqt@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-10-28net/smc: Correct spelling mistake to TCPF_SYN_RECVWen Gu
There should use TCPF_SYN_RECV instead of TCP_SYN_RECV. Signed-off-by: Wen Gu <guwen@linux.alibaba.com> Reviewed-by: Tony Lu <tonylu@linux.alibaba.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-10-28net/smc: Fix smc_link->llc_testlink_time overflowTony Lu
The value of llc_testlink_time is set to the value stored in net->ipv4.sysctl_tcp_keepalive_time when linkgroup init. The value of sysctl_tcp_keepalive_time is already jiffies, so we don't need to multiply by HZ, which would cause smc_link->llc_testlink_time overflow, and test_link send flood. Signed-off-by: Tony Lu <tonylu@linux.alibaba.com> Reviewed-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com> Reviewed-by: Wen Gu <guwen@linux.alibaba.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-10-28ipv6: enable net.ipv6.route.max_size sysctl in network namespaceAlexander Kuznetsov
We want to increase route cache size in network namespace created with user namespace. Currently ipv6 route settings are disabled for non-initial network namespaces. We can allow this sysctl and it will be safe since commit <6126891c6d4f> because route cache account to kmem, that is why users from user namespace can not DOS system. Signed-off-by: Alexander Kuznetsov <wwfq@yandex-team.ru> Acked-by: Dmitry Yakunin <zeil@yandex-team.ru> Acked-by: Dmitry Monakhov <dmtrmonakhov@yandex-team.ru> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-10-28tcp: do not clear TCP_SKB_CB(skb)->sacked if already zeroEric Dumazet
Freshly allocated skbs have zero in skb->cb[] already. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-10-28tcp: do not clear skb->csum if already zeroEric Dumazet
Freshly allocated skbs have their csum field cleared already. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-10-28tcp: factorize ip_summed settingEric Dumazet
Setting skb->ip_summed to CHECKSUM_PARTIAL can be centralized in tcp_stream_alloc_skb() and __mptcp_do_alloc_tx_skb() instead of being done multiple times. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-10-28tcp: no longer set skb->reserved_tailroomEric Dumazet
TCP/MPTCP sendmsg() no longer puts payload in skb->head, we can remove not needed code. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-10-28tcp: remove dead code from tcp_collapse_retrans()Eric Dumazet
TCP sendmsg() no longer puts payload in skb->head, remove some dead code from tcp_collapse_retrans(). Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-10-28tcp: cleanup tcp_remove_empty_skb() useEric Dumazet
All tcp_remove_empty_skb() callers now use tcp_write_queue_tail() for the skb argument, we can therefore factorize code. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-10-28tcp: remove dead code from tcp_sendmsg_locked()Eric Dumazet
TCP sendmsg() no longer puts payload in skb head, we can remove dead code. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-10-28xfrm: Remove redundant fields and related parenthesesluo penghao
The variable err is not necessary in such places. It should be revmoved for the simplicity of the code. This will cause the double parentheses to be redundant, and the inner parentheses should be deleted. The clang_analyzer complains as follows: net/xfrm/xfrm_input.c:533: warning: net/xfrm/xfrm_input.c:563: warning: Although the value stored to 'err' is used in the enclosing expression, the value is never actually read from 'err'. Changes in v2: Modify the title, because v2 removes the brackets. Remove extra parentheses. Reported-by: Zeal Robot <zealci@zte.com.cn> Signed-off-by: luo penghao <luo.penghao@zte.com.cn> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2021-10-27mptcp: drop unused sk in mptcp_push_releaseGeliang Tang
Since mptcp_set_timeout() had removed from mptcp_push_release() in commit 33d41c9cd74c5 ("mptcp: more accurate timeout"), the argument sk in mptcp_push_release() became useless. Let's drop it. Fixes: 33d41c9cd74c5 ("mptcp: more accurate timeout") Reviewed-by: Matthieu Baerts <matthieu.baerts@tessares.net> Signed-off-by: Geliang Tang <geliang.tang@suse.com> Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-10-27mptcp: allocate fwd memory separately on the rx and tx pathPaolo Abeni
All the mptcp receive path is protected by the msk socket spinlock. As consequences, the tx path has to play a few tricks to allocate the forward memory without acquiring the spinlock multiple times, making the overall TX path quite complex. This patch tries to clean-up a bit the tx path, using completely separated fwd memory allocation, for the rx and the tx path. The forward memory allocated in the rx path is now accounted in msk->rmem_fwd_alloc and is (still) protected by the msk socket spinlock. To cope with the above we provide a few MPTCP-specific variants for the helpers to charge, uncharge, reclaim and free the forward memory in the receive path. msk->sk_forward_alloc now accounts only the forward memory for the tx path, we can use the plain core sock helper to manipulate it and drop quite a bit of complexity. On memory pressure, both rx and tx fwd memories are reclaimed. Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-10-27net: introduce sk_forward_alloc_get()Paolo Abeni
A later patch will change the MPTCP memory accounting schema in such a way that MPTCP sockets will encode the total amount of forward allocated memory in two separate fields (one for tx and one for rx). MPTCP sockets will use their own helper to provide the accurate amount of fwd allocated memory. To allow the above, this patch adds a new, optional, sk method to fetch the fwd memory, wrap the call in a new helper and use it where it is appropriate. Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-10-27inet: remove races in inet{6}_getname()Eric Dumazet
syzbot reported data-races in inet_getname() multiple times, it is time we fix this instead of pretending applications should not trigger them. getsockname() and getpeername() are not really considered fast path. v2: added the missing BPF_CGROUP_RUN_SA_PROG() declaration needed when CONFIG_CGROUP_BPF=n, as reported by kernel test robot <lkp@intel.com> syzbot typical report: BUG: KCSAN: data-race in __inet_hash_connect / inet_getname write to 0xffff888136d66cf8 of 2 bytes by task 14374 on cpu 1: __inet_hash_connect+0x7ec/0x950 net/ipv4/inet_hashtables.c:831 inet_hash_connect+0x85/0x90 net/ipv4/inet_hashtables.c:853 tcp_v4_connect+0x782/0xbb0 net/ipv4/tcp_ipv4.c:275 __inet_stream_connect+0x156/0x6e0 net/ipv4/af_inet.c:664 inet_stream_connect+0x44/0x70 net/ipv4/af_inet.c:728 __sys_connect_file net/socket.c:1896 [inline] __sys_connect+0x254/0x290 net/socket.c:1913 __do_sys_connect net/socket.c:1923 [inline] __se_sys_connect net/socket.c:1920 [inline] __x64_sys_connect+0x3d/0x50 net/socket.c:1920 do_syscall_x64 arch/x86/entry/common.c:50 [inline] do_syscall_64+0x44/0xa0 arch/x86/entry/common.c:80 entry_SYSCALL_64_after_hwframe+0x44/0xae read to 0xffff888136d66cf8 of 2 bytes by task 14408 on cpu 0: inet_getname+0x11f/0x170 net/ipv4/af_inet.c:790 __sys_getsockname+0x11d/0x1b0 net/socket.c:1946 __do_sys_getsockname net/socket.c:1961 [inline] __se_sys_getsockname net/socket.c:1958 [inline] __x64_sys_getsockname+0x3e/0x50 net/socket.c:1958 do_syscall_x64 arch/x86/entry/common.c:50 [inline] do_syscall_64+0x44/0xa0 arch/x86/entry/common.c:80 entry_SYSCALL_64_after_hwframe+0x44/0xae value changed: 0x0000 -> 0xdee0 Reported by Kernel Concurrency Sanitizer on: CPU: 0 PID: 14408 Comm: syz-executor.3 Not tainted 5.15.0-rc3-syzkaller #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: syzbot <syzkaller@googlegroups.com> Link: https://lore.kernel.org/r/20211026213014.3026708-1-eric.dumazet@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-10-27xdp: Remove redundant warningYajun Deng
There is a warning in xdp_rxq_info_unreg_mem_model() when reg_state isn't equal to REG_STATE_REGISTERED, so the warning in xdp_rxq_info_unreg() is redundant. Signed-off-by: Yajun Deng <yajun.deng@linux.dev> Acked-by: Jesper Dangaard Brouer <brouer@redhat.com> Link: https://lore.kernel.org/r/20211027013856.1866-1-yajun.deng@linux.dev Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-10-27net: sch: simplify condtion for selecting mini_Qdisc_pair bufferSeth Forshee
The only valid values for a miniq pointer are NULL or a pointer to miniq1 or miniq2, so testing for miniq_old != &miniq1 is functionally equivalent to testing that it is NULL or equal to &miniq2. Suggested-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Seth Forshee <sforshee@digitalocean.com> Link: https://lore.kernel.org/r/20211026183721.137930-1-seth@forshee.me Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-10-27net: sch: eliminate unnecessary RCU waits in mini_qdisc_pair_swap()Seth Forshee
Currently rcu_barrier() is used to ensure that no readers of the inactive mini_Qdisc buffer remain before it is reused. This waits for any pending RCU callbacks to complete, when all that is actually required is to wait for one RCU grace period to elapse after the buffer was made inactive. This means that using rcu_barrier() may result in unnecessary waits. To improve this, store the current RCU state when a buffer is made inactive and use poll_state_synchronize_rcu() to check whether a full grace period has elapsed before reusing it. If a full grace period has not elapsed, wait for a grace period to elapse, and in the non-RT case use synchronize_rcu_expedited() to hasten it. Since this approach eliminates the RCU callback it is no longer necessary to synchronize_rcu() in the tp_head==NULL case. However, the RCU state should still be saved for the previously active buffer. Before this change I would typically see mini_qdisc_pair_swap() take tens of milliseconds to complete. After this change it typcially finishes in less than 1 ms, and often it takes just a few microseconds. Thanks to Paul for walking me through the options for improving this. Cc: "Paul E. McKenney" <paulmck@kernel.org> Signed-off-by: Seth Forshee <sforshee@digitalocean.com> Link: https://lore.kernel.org/r/20211026130700.121189-1-seth@forshee.me Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-10-27net: sched: gred: dynamically allocate tc_gred_qopt_offloadArnd Bergmann
The tc_gred_qopt_offload structure has grown too big to be on the stack for 32-bit architectures after recent changes. net/sched/sch_gred.c:903:13: error: stack frame size (1180) exceeds limit (1024) in 'gred_destroy' [-Werror,-Wframe-larger-than] net/sched/sch_gred.c:310:13: error: stack frame size (1212) exceeds limit (1024) in 'gred_offload' [-Werror,-Wframe-larger-than] Use dynamic allocation per qdisc to avoid this. Fixes: 50dc9a8572aa ("net: sched: Merge Qdisc::bstats and Qdisc::cpu_bstats data types") Fixes: 67c9e6270f30 ("net: sched: Protect Qdisc::bstats with u64_stats") Suggested-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Link: https://lore.kernel.org/r/20211026100711.nalhttf6mbe6sudx@linutronix.de Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-10-27Revert "devlink: Remove not-executed trap policer notifications"Leon Romanovsky
This reverts commit 22849b5ea5952d853547cc5e0651f34a246b2a4f as it revealed that mlxsw and netdevsim (copy/paste from mlxsw) reregisters devlink objects during another devlink user triggered command. Signed-off-by: Leon Romanovsky <leonro@nvidia.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-10-27Revert "devlink: Remove not-executed trap group notifications"Leon Romanovsky
This reverts commit 8bbeed4858239ac956a78e5cbaf778bd6f3baef8 as it revealed that mlxsw and netdevsim (copy/paste from mlxsw) reregisters devlink objects during another devlink user triggered command. Fixes: 22849b5ea595 ("devlink: Remove not-executed trap policer notifications") Reported-by: syzbot+93d5accfaefceedf43c1@syzkaller.appspotmail.com Signed-off-by: Leon Romanovsky <leonro@nvidia.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>