summaryrefslogtreecommitdiff
path: root/net/core
AgeCommit message (Collapse)Author
2016-12-01rtnetlink: return the correct error codeZhang Shengju
Before this patch, function ndo_dflt_fdb_dump() will always return code from uc fdb dump. The reture code of mc fdb dump is lost. Signed-off-by: Zhang Shengju <zhangshengju@cmss.chinamobile.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-30neigh: remove duplicate check for same neighZhang Shengju
Currently loop index 'idx' is used as the index in the neigh list of interest. It's increased only when the neigh is dumped. It's not the absolute index in the list. Because there is no info to record which neigh has already be scanned by previous loop. This will cause the filtered out neighs to be scanned mulitple times. This patch make idx as the absolute index in the list, it will increase no matter whether the neigh is filtered. This will prevent the above problem. And this is in line with other dump functions. v2: - take David Ahern's advice to do simple change Signed-off-by: Zhang Shengju <zhangshengju@cmss.chinamobile.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-30bpf, xdp: allow to pass flags to dev_change_xdp_fdDaniel Borkmann
Add an IFLA_XDP_FLAGS attribute that can be passed for setting up XDP along with IFLA_XDP_FD, which eventually allows user space to implement typical add/replace/delete logic for programs. Right now, calling into dev_change_xdp_fd() will always replace previous programs. When passed XDP_FLAGS_UPDATE_IF_NOEXIST, we can handle this more graceful when requested by returning -EBUSY in case we try to attach a new program, but we find that another one is already attached. This will be used by upcoming front-end for iproute2 as well. Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-30tcp: SOF_TIMESTAMPING_OPT_STATS option for SO_TIMESTAMPINGFrancis Yan
This patch exports the sender chronograph stats via the socket SO_TIMESTAMPING channel. Currently we can instrument how long a particular application unit of data was queued in TCP by tracking SOF_TIMESTAMPING_TX_SOFTWARE and SOF_TIMESTAMPING_TX_SCHED. Having these sender chronograph stats exported simultaneously along with these timestamps allow further breaking down the various sender limitation. For example, a video server can tell if a particular chunk of video on a connection takes a long time to deliver because TCP was experiencing small receive window. It is not possible to tell before this patch without packet traces. To prepare these stats, the user needs to set SOF_TIMESTAMPING_OPT_STATS and SOF_TIMESTAMPING_OPT_TSONLY flags while requesting other SOF_TIMESTAMPING TX timestamps. When the timestamps are available in the error queue, the stats are returned in a separate control message of type SCM_TIMESTAMPING_OPT_STATS, in a list of TLVs (struct nlattr) of types: TCP_NLA_BUSY_TIME, TCP_NLA_RWND_LIMITED, TCP_NLA_SNDBUF_LIMITED. Unit is microsecond. Signed-off-by: Francis Yan <francisyyan@gmail.com> Signed-off-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com> Acked-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-27bpf: reuse dev_is_mac_header_xmit for redirectDaniel Borkmann
Commit dcf800344a91 ("net/sched: act_mirred: Refactor detection whether dev needs xmit at mac header") added dev_is_mac_header_xmit(); since it's also useful elsewhere, move it to if_arp.h and reuse it for BPF. Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-27Merge branch 'master' of ↵David S. Miller
git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec Steffen Klassert says: ==================== pull request (net): ipsec 2016-11-25 1) Fix a refcount leak in vti6. From Nicolas Dichtel. 2) Fix a wrong if statement in xfrm_sk_policy_lookup. From Florian Westphal. 3) The flowcache watermarks are per cpu. Take this into account when comparing to the threshold where we refusing new allocations. From Miroslav Urbanek. Please pull or let me know if there are problems. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-26Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller
udplite conflict is resolved by taking what 'net-next' did which removed the backlog receive method assignment, since it is no longer necessary. Two entries were added to the non-priv ethtool operations switch statement, one in 'net' and one in 'net-next, so simple overlapping changes. Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-25net: ethtool: don't require CAP_NET_ADMIN for ETHTOOL_GLINKSETTINGSMiroslav Lichvar
The ETHTOOL_GLINKSETTINGS command is deprecating the ETHTOOL_GSET command and likewise it shouldn't require the CAP_NET_ADMIN capability. Signed-off-by: Miroslav Lichvar <mlichvar@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-25net: properly flush delay-freed skbsEric Dumazet
Typical NAPI drivers use napi_consume_skb(skb) at TX completion time. This put skb in a percpu special queue, napi_alloc_cache, to get bulk frees. It turns out the queue is not flushed and hits the NAPI_SKB_CACHE_SIZE limit quite often, with skbs that were queued hundreds of usec earlier. I measured this can take ~6000 nsec to perform one flush. __kfree_skb_flush() can be called from two points right now : 1) From net_tx_action(), but only for skbs that were queued to sd->completion_queue. -> Irrelevant for NAPI drivers in normal operation. 2) From net_rx_action(), but only under high stress or if RPS/RFS has a pending action. This patch changes net_rx_action() to perform the flush in all cases and after more urgent operations happened (like kicking remote CPUS for RPS/RFS). Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Jesper Dangaard Brouer <brouer@redhat.com> Cc: Alexander Duyck <alexander.h.duyck@intel.com> Acked-by: Alexander Duyck <alexander.h.duyck@intel.com> Acked-by: Jesper Dangaard Brouer <brouer@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-25net: filter: run cgroup eBPF ingress programsDaniel Mack
If the cgroup associated with the receiving socket has an eBPF programs installed, run them from sk_filter_trim_cap(). eBPF programs used in this context are expected to either return 1 to let the packet pass, or != 1 to drop them. The programs have access to the skb through bpf_skb_load_bytes(), and the payload starts at the network headers (L3). Note that cgroup_bpf_run_filter() is stubbed out as static inline nop for !CONFIG_CGROUP_BPF, and is otherwise guarded by a static key if the feature is unused. Signed-off-by: Daniel Mack <daniel@zonque.org> Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-25bpf: add new prog type for cgroup socket filteringDaniel Mack
This program type is similar to BPF_PROG_TYPE_SOCKET_FILTER, except that it does not allow BPF_LD_[ABS|IND] instructions and hooks up the bpf_skb_load_bytes() helper. Programs of this type will be attached to cgroups for network filtering and accounting. Signed-off-by: Daniel Mack <daniel@zonque.org> Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-24ethtool: Protect {get, set}_phy_tunable with PHY device mutexFlorian Fainelli
PHY drivers should be able to rely on the caller of {get,set}_tunable to have acquired the PHY device mutex, in order to both serialize against concurrent calls of these functions, but also against PHY state machine changes. All ethtool PHY-level functions do this, except {get,set}_tunable, so we make them consistent here as well. We need to update the Microsemi PHY driver in the same commit to avoid introducing either deadlocks, or lack of proper locking. Fixes: 968ad9da7e0e ("ethtool: Implements ETHTOOL_PHY_GTUNABLE/ETHTOOL_PHY_STUNABLE") Fixes: 310d9ad57ae0 ("net: phy: Add downshift get/set support in Microsemi PHYs driver") Signed-off-by: Florian Fainelli <f.fainelli@gmail.com> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Reviewed-by: Allan W. Nielsen <allan.nielsen@microsemi.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-24devlink: Add E-Switch inline mode controlRoi Dayan
Some HWs need the VF driver to put part of the packet headers on the TX descriptor so the e-switch can do proper matching and steering. The supported modes: none, link, network, transport. Signed-off-by: Roi Dayan <roid@mellanox.com> Reviewed-by: Or Gerlitz <ogerlitz@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-24net: Add net-device param to the get offloaded stats ndoOr Gerlitz
Some drivers would need to check few internal matters for that. To be used in downstream mlx5 commit. Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-24tcp: enhance tcp_collapse_retrans() with skb_shift()Eric Dumazet
In commit 2331ccc5b323 ("tcp: enhance tcp collapsing"), we made a first step allowing copying right skb to left skb head. Since all skbs in socket write queue are headless (but possibly the very first one), this strategy often does not work. This patch extends tcp_collapse_retrans() to perform frag shifting, thanks to skb_shift() helper. This helper needs to not BUG on non headless skbs, as callers are ok with that. Tested: Following packetdrill test now passes : 0.000 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3 +0 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0 +0 bind(3, ..., ...) = 0 +0 listen(3, 1) = 0 +0 < S 0:0(0) win 32792 <mss 1460,sackOK,nop,nop,nop,wscale 8> +0 > S. 0:0(0) ack 1 <mss 1460,nop,nop,sackOK,nop,wscale 8> +.100 < . 1:1(0) ack 1 win 257 +0 accept(3, ..., ...) = 4 +0 setsockopt(4, SOL_TCP, TCP_NODELAY, [1], 4) = 0 +0 write(4, ..., 200) = 200 +0 > P. 1:201(200) ack 1 +.001 write(4, ..., 200) = 200 +0 > P. 201:401(200) ack 1 +.001 write(4, ..., 200) = 200 +0 > P. 401:601(200) ack 1 +.001 write(4, ..., 200) = 200 +0 > P. 601:801(200) ack 1 +.001 write(4, ..., 200) = 200 +0 > P. 801:1001(200) ack 1 +.001 write(4, ..., 100) = 100 +0 > P. 1001:1101(100) ack 1 +.001 write(4, ..., 100) = 100 +0 > P. 1101:1201(100) ack 1 +.001 write(4, ..., 100) = 100 +0 > P. 1201:1301(100) ack 1 +.001 write(4, ..., 100) = 100 +0 > P. 1301:1401(100) ack 1 +.099 < . 1:1(0) ack 201 win 257 +.001 < . 1:1(0) ack 201 win 257 <nop,nop,sack 1001:1401> +0 > P. 201:1001(800) ack 1 Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Neal Cardwell <ncardwell@google.com> Cc: Yuchung Cheng <ycheng@google.com> Acked-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-23rtnetlink: fix the wrong minimal dump size getting from rtnl_calcit()Zhang Shengju
For RT netlink, calcit() function should return the minimal size for netlink dump message. This will make sure that dump message for every network device can be stored. Currently, rtnl_calcit() function doesn't account the size of header of netlink message, this patch will fix it. Signed-off-by: Zhang Shengju <zhangshengju@cmss.chinamobile.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-23flowcache: Increase threshold for refusing new allocationsMiroslav Urbanek
The threshold for OOM protection is too small for systems with large number of CPUs. Applications report ENOBUFs on connect() every 10 minutes. The problem is that the variable net->xfrm.flow_cache_gc_count is a global counter while the variable fc->high_watermark is a per-CPU constant. Take the number of CPUs into account as well. Fixes: 6ad3122a08e3 ("flowcache: Avoid OOM condition under preasure") Reported-by: Lukáš Koldrt <lk@excello.cz> Tested-by: Jan Hejl <jh@excello.cz> Signed-off-by: Miroslav Urbanek <mu@miroslavurbanek.com> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2016-11-22flow_dissect: call init_default_flow_dissectors() earlierEric Dumazet
Andre Noll reported panics after my recent fix (commit 34fad54c2537 "net: __skb_flow_dissect() must cap its return value") After some more headaches, Alexander root caused the problem to init_default_flow_dissectors() being called too late, in case a network driver like IGB is not a module and receives DHCP message very early. Fix is to call init_default_flow_dissectors() much earlier, as it is a core infrastructure and does not depend on another kernel service. Fixes: 06635a35d13d4 ("flow_dissect: use programable dissector in skb_flow_dissect and friends") Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: Andre Noll <maan@tuebingen.mpg.de> Diagnosed-by: Alexander Duyck <alexander.h.duyck@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-22Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller
All conflicts were simple overlapping changes except perhaps for the Thunder driver. That driver has a change_mtu method explicitly for sending a message to the hardware. If that fails it returns an error. Normally a driver doesn't need an ndo_change_mtu method becuase those are usually just range changes, which are now handled generically. But since this extra operation is needed in the Thunder driver, it has to stay. However, if the message send fails we have to restore the original MTU before the change because the entire call chain expects that if an error is thrown by ndo_change_mtu then the MTU did not change. Therefore code is added to nicvf_change_mtu to remember the original MTU, and to restore it upon nicvf_update_hw_max_frs() failue. Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-19rtnl: fix the loop index update error in rtnl_dump_ifinfo()Zhang Shengju
If the link is filtered out, loop index should also be updated. If not, loop index will not be correct. Fixes: dc599f76c22b0 ("net: Add support for filtering link dump by master device and kind") Signed-off-by: Zhang Shengju <zhangshengju@cmss.chinamobile.com> Acked-by: David Ahern <dsa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-19net: make struct napi_alloc_cache::skb_count unsigned intAlexey Dobriyan
size_t is way too much for an integer not exceeding 64. Space savings: 10 bytes! add/remove: 0/0 grow/shrink: 0/3 up/down: 0/-10 (-10) function old new delta napi_consume_skb 165 163 -2 __kfree_skb_flush 56 53 -3 __kfree_skb_defer 97 92 -5 Total: Before=154865639, After=154865629, chg -0.00% Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-18rtnetlink: fix FDB size computationSabrina Dubroca
Add missing NDA_VLAN attribute's size. Fixes: 1e53d5bb8878 ("net: Pass VLAN ID to rtnl_fdb_notify.") Signed-off-by: Sabrina Dubroca <sd@queasysnail.net> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-18ethtool: Core impl for ETHTOOL_PHY_DOWNSHIFT tunableRaju Lakkaraju
Adding validation support for the ETHTOOL_PHY_DOWNSHIFT. Functional implementation needs to be done in the individual PHY drivers. Signed-off-by: Raju Lakkaraju <Raju.Lakkaraju@microsemi.com> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Signed-off-by: Allan W. Nielsen <allan.nielsen@microsemi.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-18ethtool: Implements ETHTOOL_PHY_GTUNABLE/ETHTOOL_PHY_STUNABLERaju Lakkaraju
Adding get_tunable/set_tunable function pointer to the phy_driver structure, and uses these function pointers to implement the ETHTOOL_PHY_GTUNABLE/ETHTOOL_PHY_STUNABLE ioctls. Signed-off-by: Raju Lakkaraju <Raju.Lakkaraju@microsemi.com> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Signed-off-by: Allan W. Nielsen <allan.nielsen@microsemi.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-18netns: make struct pernet_operations::id unsigned intAlexey Dobriyan
Make struct pernet_operations::id unsigned. There are 2 reasons to do so: 1) This field is really an index into an zero based array and thus is unsigned entity. Using negative value is out-of-bound access by definition. 2) On x86_64 unsigned 32-bit data which are mixed with pointers via array indexing or offsets added or subtracted to pointers are preffered to signed 32-bit data. "int" being used as an array index needs to be sign-extended to 64-bit before being used. void f(long *p, int i) { g(p[i]); } roughly translates to movsx rsi, esi mov rdi, [rsi+...] call g MOVSX is 3 byte instruction which isn't necessary if the variable is unsigned because x86_64 is zero extending by default. Now, there is net_generic() function which, you guessed it right, uses "int" as an array index: static inline void *net_generic(const struct net *net, int id) { ... ptr = ng->ptr[id - 1]; ... } And this function is used a lot, so those sign extensions add up. Patch snipes ~1730 bytes on allyesconfig kernel (without all junk messing with code generation): add/remove: 0/0 grow/shrink: 70/598 up/down: 396/-2126 (-1730) Unfortunately some functions actually grow bigger. This is a semmingly random artefact of code generation with register allocator being used differently. gcc decides that some variable needs to live in new r8+ registers and every access now requires REX prefix. Or it is shifted into r12, so [r12+0] addressing mode has to be used which is longer than [r8] However, overall balance is in negative direction: add/remove: 0/0 grow/shrink: 70/598 up/down: 396/-2126 (-1730) function old new delta nfsd4_lock 3886 3959 +73 tipc_link_build_proto_msg 1096 1140 +44 mac80211_hwsim_new_radio 2776 2808 +32 tipc_mon_rcv 1032 1058 +26 svcauth_gss_legacy_init 1413 1429 +16 tipc_bcbase_select_primary 379 392 +13 nfsd4_exchange_id 1247 1260 +13 nfsd4_setclientid_confirm 782 793 +11 ... put_client_renew_locked 494 480 -14 ip_set_sockfn_get 730 716 -14 geneve_sock_add 829 813 -16 nfsd4_sequence_done 721 703 -18 nlmclnt_lookup_host 708 686 -22 nfsd4_lockt 1085 1063 -22 nfs_get_client 1077 1050 -27 tcf_bpf_init 1106 1076 -30 nfsd4_encode_fattr 5997 5930 -67 Total: Before=154856051, After=154854321, chg -0.00% Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-17net: check dead netns for peernet2id_alloc()WANG Cong
Andrei reports we still allocate netns ID from idr after we destroy it in cleanup_net(). cleanup_net(): ... idr_destroy(&net->netns_ids); ... list_for_each_entry_reverse(ops, &pernet_list, list) ops_exit_list(ops, &net_exit_list); -> rollback_registered_many() -> rtmsg_ifinfo_build_skb() -> rtnl_fill_ifinfo() -> peernet2id_alloc() After that point we should not even access net->netns_ids, we should check the death of the current netns as early as we can in peernet2id_alloc(). For net-next we can consider to avoid sending rtmsg totally, it is a good optimization for netns teardown path. Fixes: 0c7aecd4bde4 ("netns: add rtnl cmd to add and get peer netns ids") Reported-by: Andrei Vagin <avagin@gmail.com> Cc: Nicolas Dichtel <nicolas.dichtel@6wind.com> Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com> Acked-by: Andrei Vagin <avagin@openvz.org> Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-16netpoll: more efficient lockingEric Dumazet
Callers of netpoll_poll_lock() own NAPI_STATE_SCHED Callers of netpoll_poll_unlock() have BH blocked between the NAPI_STATE_SCHED being cleared and poll_lock is released. We can avoid the spinlock which has no contention, and use cmpxchg() on poll_owner which we need to set anyway. This removes a possible lockdep violation after the cited commit, since sk_busy_loop() re-enables BH before calling busy_poll_stop() Fixes: 217f69743681 ("net: busy-poll: allow preemption in sk_busy_loop()") Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-16net: busy-poll: return busypolling status to driversEric Dumazet
NAPI drivers use napi_complete_done() or napi_complete() when they drained RX ring and right before re-enabling device interrupts. In busy polling, we can avoid interrupts being delivered since we are polling RX ring in a controlled loop. Drivers can chose to use napi_complete_done() return value to reduce interrupts overhead while busy polling is active. This is optional, legacy drivers should work fine even if not updated. Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Willem de Bruijn <willemb@google.com> Cc: Adam Belay <abelay@google.com> Cc: Tariq Toukan <tariqt@mellanox.com> Cc: Yuval Mintz <Yuval.Mintz@cavium.com> Cc: Ariel Elior <ariel.elior@cavium.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-16net: busy-poll: allow preemption in sk_busy_loop()Eric Dumazet
After commit 4cd13c21b207 ("softirq: Let ksoftirqd do its job"), sk_busy_loop() needs a bit of care : softirqs might be delayed since we do not allow preemption yet. This patch adds preemptiom points in sk_busy_loop(), and makes sure no unnecessary cache line dirtying or atomic operations are done while looping. A new flag is added into napi->state : NAPI_STATE_IN_BUSY_POLL This prevents napi_complete_done() from clearing NAPIF_STATE_SCHED, so that sk_busy_loop() does not have to grab it again. Similarly, netpoll_poll_lock() is done one time. This gives about 10 to 20 % improvement in various busy polling tests, especially when many threads are busy polling in configurations with large number of NIC queues. This should allow experimenting with bigger delays without hurting overall latencies. Tested: On a 40Gb mlx4 NIC, 32 RX/TX queues. echo 70 >/proc/sys/net/core/busy_read for i in `seq 1 40`; do echo -n $i: ; ./super_netperf $i -H lpaa24 -t UDP_RR -- -N -n; done Before: After: 1: 90072 92819 2: 157289 184007 3: 235772 213504 4: 344074 357513 5: 394755 458267 6: 461151 487819 7: 549116 625963 8: 544423 716219 9: 720460 738446 10: 794686 837612 11: 915998 923960 12: 937507 925107 13: 1019677 971506 14: 1046831 1113650 15: 1114154 1148902 16: 1105221 1179263 17: 1266552 1299585 18: 1258454 1383817 19: 1341453 1312194 20: 1363557 1488487 21: 1387979 1501004 22: 1417552 1601683 23: 1550049 1642002 24: 1568876 1601915 25: 1560239 1683607 26: 1640207 1745211 27: 1706540 1723574 28: 1638518 1722036 29: 1734309 1757447 30: 1782007 1855436 31: 1724806 1888539 32: 1717716 1944297 33: 1778716 1869118 34: 1805738 1983466 35: 1815694 2020758 36: 1893059 2035632 37: 1843406 2034653 38: 1888830 2086580 39: 1972827 2143567 40: 1877729 2181851 Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Willem de Bruijn <willemb@google.com> Cc: Adam Belay <abelay@google.com> Cc: Tariq Toukan <tariqt@mellanox.com> Cc: Yuval Mintz <Yuval.Mintz@cavium.com> Cc: Ariel Elior <ariel.elior@cavium.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-15rtnetlink: fix rtnl message size computation for XDPSabrina Dubroca
rtnl_xdp_size() only considers the size of the actual payload attribute, and misses the space taken by the attribute used for nesting (IFLA_XDP). Fixes: d1fdd9138682 ("rtnl: add option for setting link xdp prog") Signed-off-by: Sabrina Dubroca <sd@queasysnail.net> Reviewed-by: Brenden Blanco <bblanco@plumgrid.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-15rtnetlink: fix rtnl_vfinfo_sizeSabrina Dubroca
The size reported by rtnl_vfinfo_size doesn't match the space used by rtnl_fill_vfinfo. rtnl_vfinfo_size currently doesn't account for the nest attributes used by statistics (added in commit 3b766cd83232), nor for struct ifla_vf_tx_rate (since commit ed616689a3d9, which added ifla_vf_rate to the dump without removing ifla_vf_tx_rate, but replaced ifla_vf_tx_rate with ifla_vf_rate in the size computation). Fixes: 3b766cd83232 ("net/core: Add reading VF statistics through the PF netdevice") Fixes: ed616689a3d9 ("net-next:v4: Add support to configure SR-IOV VF minimum and maximum Tx rate through ip tool") Signed-off-by: Sabrina Dubroca <sd@queasysnail.net> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-15Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller
Several cases of bug fixes in 'net' overlapping other changes in 'net-next-. Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-14net: fix sleeping for sk_wait_event()WANG Cong
Similar to commit 14135f30e33c ("inet: fix sleeping inside inet_wait_for_connect()"), sk_wait_event() needs to fix too, because release_sock() is blocking, it changes the process state back to running after sleep, which breaks the previous prepare_to_wait(). Switch to the new wait API. Cc: Eric Dumazet <eric.dumazet@gmail.com> Cc: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-12net: __skb_flow_dissect() must cap its return valueEric Dumazet
After Tom patch, thoff field could point past the end of the buffer, this could fool some callers. If an skb was provided, skb->len should be the upper limit. If not, hlen is supposed to be the upper limit. Fixes: a6e544b0a88b ("flow_dissector: Jump to exit code in __skb_flow_dissect") Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: Yibin Yang <yibyang@cisco.com Acked-by: Alexander Duyck <alexander.h.duyck@intel.com> Acked-by: Willem de Bruijn <willemb@google.com> Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-12bpf: Fix bpf_redirect to an ipip/ip6tnl devMartin KaFai Lau
If the bpf program calls bpf_redirect(dev, 0) and dev is an ipip/ip6tnl, it currently includes the mac header. e.g. If dev is ipip, the end result is IP-EthHdr-IP instead of IP-IP. The fix is to pull the mac header. At ingress, skb_postpull_rcsum() is not needed because the ethhdr should have been pulled once already and then got pushed back just before calling the bpf_prog. At egress, this patch calls skb_postpull_rcsum(). If bpf_redirect(dev, BPF_F_INGRESS) is called, it also fails now because it calls dev_forward_skb() which eventually calls eth_type_trans(skb, dev). The eth_type_trans() will set skb->type = PACKET_OTHERHOST because the mac address does not match the redirecting dev->dev_addr. The PACKET_OTHERHOST will eventually cause the ip_rcv() errors out. To fix this, ____dev_forward_skb() is added. Joint work with Daniel Borkmann. Fixes: cfc7381b3002 ("ip_tunnel: add collect_md mode to IPIP tunnel") Fixes: 8d79266bc48c ("ip6_tunnel: add collect_md mode to IPv6 tunnels") Acked-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Alexei Starovoitov <ast@fb.com> Signed-off-by: Martin KaFai Lau <kafai@fb.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-09net: napi_hash_add() is no longer exportedEric Dumazet
There are no more users except from net/core/dev.c napi_hash_add() can now be static. Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Michael Chan <michael.chan@broadcom.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-09ipv6: sr: add support for SRH encapsulation and injection with lwtunnelsDavid Lebrun
This patch creates a new type of interfaceless lightweight tunnel (SEG6), enabling the encapsulation and injection of SRH within locally emitted packets and forwarded packets. >From a configuration viewpoint, a seg6 tunnel would be configured as follows: ip -6 ro ad fc00::1/128 encap seg6 mode encap segs fc42::1,fc42::2,fc42::3 dev eth0 Any packet whose destination address is fc00::1 would thus be encapsulated within an outer IPv6 header containing the SRH with three segments, and would actually be routed to the first segment of the list. If `mode inline' was specified instead of `mode encap', then the SRH would be directly inserted after the IPv6 header without outer encapsulation. The inline mode is only available if CONFIG_IPV6_SEG6_INLINE is enabled. This feature was made configurable because direct header insertion may break several mechanisms such as PMTUD or IPSec AH. Signed-off-by: David Lebrun <david.lebrun@uclouvain.be> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-09rtnl: reset calcit fptr in rtnl_unregister()Mathias Krause
To avoid having dangling function pointers left behind, reset calcit in rtnl_unregister(), too. This is no issue so far, as only the rtnl core registers a netlink handler with a calcit hook which won't be unregistered, but may become one if new code makes use of the calcit hook. Fixes: c7ac8679bec9 ("rtnetlink: Compute and store minimum ifinfo...") Cc: Jeff Kirsher <jeffrey.t.kirsher@intel.com> Cc: Greg Rose <gregory.v.rose@intel.com> Signed-off-by: Mathias Krause <minipli@googlemail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-09net-gro: avoid reordersEric Dumazet
Receiving a GSO packet in dev_gro_receive() is not uncommon in stacked devices, or devices partially implementing LRO/GRO like bnx2x. GRO is implementing the aggregation the device was not able to do itself. Current code causes reorders, like in following case : For a given flow where sender sent 3 packets P1,P2,P3,P4 Receiver might receive P1 as a single packet, stored in GRO engine. Then P2-P4 are received as a single GSO packet, immediately given to upper stack, while P1 is held in GRO engine. This patch will make sure P1 is given to upper stack, then P2-P4 immediately after. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-09net/flowcache: Convert to hotplug state machineSebastian Andrzej Siewior
Install the callbacks via the state machine. Use multi state support to avoid custom list handling for the multiple instances. Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: Steffen Klassert <steffen.klassert@secunet.com> Cc: Herbert Xu <herbert@gondor.apana.org.au> Cc: netdev@vger.kernel.org Cc: rt@linutronix.de Cc: "David S. Miller" <davem@davemloft.net> Link: http://lkml.kernel.org/r/20161103145021.28528-10-bigeasy@linutronix.de Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2016-11-09net/dev: Convert to hotplug state machineSebastian Andrzej Siewior
Install the callbacks via the state machine. Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: netdev@vger.kernel.org Cc: "David S. Miller" <davem@davemloft.net> Cc: rt@linutronix.de Link: http://lkml.kernel.org/r/20161103145021.28528-9-bigeasy@linutronix.de Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2016-11-09net: core: add missing check for uid_range in rule_exists.Lorenzo Colitti
Without this check, it is not possible to create two rules that are identical except for their UID ranges. For example: root@net-test:/# ip rule add prio 1000 lookup 300 root@net-test:/# ip rule add prio 1000 uidrange 100-200 lookup 300 RTNETLINK answers: File exists root@net-test:/# ip rule add prio 1000 uidrange 100-199 lookup 100 root@net-test:/# ip rule add prio 1000 uidrange 200-299 lookup 200 root@net-test:/# ip rule add prio 1000 uidrange 300-399 lookup 100 RTNETLINK answers: File exists Tested: https://android-review.googlesource.com/#/c/299980/ Signed-off-by: Lorenzo Colitti <lorenzo@google.com> Acked-by: Maciej Żenczykowski <maze@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-07sock: do not set sk_err in sock_dequeue_err_skbSoheil Hassas Yeganeh
Do not set sk_err when dequeuing errors from the error queue. Doing so results in: a) Bugs: By overwriting existing sk_err values, it possibly hides legitimate errors. It is also incorrect when local errors are queued with ip_local_error. That happens in the context of a system call, which already returns the error code. b) Inconsistent behavior: When there are pending errors on the error queue, sk_err is sometimes 0 (e.g., for the first timestamp on the error queue) and sometimes set to an error code (after dequeuing the first timestamp). c) Suboptimality: Setting sk_err to ENOMSG on simple TX timestamps can abort parallel reads and writes. Removing this line doesn't break userspace. This is because userspace code cannot rely on sk_err for detecting whether there is something on the error queue. Except for ICMP messages received for UDP and RAW, sk_err is not set at enqueue time, and as a result sk_err can be 0 while there are plenty of errors on the error queue. For ICMP packets in UDP and RAW, sk_err is set when they are enqueued on the error queue, but that does not result in aborting reads and writes. For such cases, sk_err is only readable via getsockopt(SO_ERROR) which will reset the value of sk_err on its own. More importantly, prior to this patch, recvmsg(MSG_ERRQUEUE) has a race on setting sk_err (i.e., sk_err is set by sock_dequeue_err_skb without atomic ops or locks) which can store 0 in sk_err even when we have ICMP messages pending. Removing this line from sock_dequeue_err_skb eliminates that race. Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Willem de Bruijn <willemb@google.com> Signed-off-by: Neal Cardwell <ncardwell@google.com> Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-07net/qdisc: IFF_NO_QUEUE drivers should use consistent TX queue lenJesper Dangaard Brouer
The flag IFF_NO_QUEUE marks virtual device drivers that doesn't need a default qdisc attached, given they will be backed by physical device, that already have a qdisc attached for pushback. It is still supported to attach a qdisc to a IFF_NO_QUEUE device, as this can be useful for difference policy reasons (e.g. bandwidth limiting containers). For this to work, the tx_queue_len need to have a sane value, because some qdiscs inherit/copy the tx_queue_len (namely, pfifo, bfifo, gred, htb, plug and sfb). Commit a813104d9233 ("IFF_NO_QUEUE: Fix for drivers not calling ether_setup()") caught situations where some drivers didn't initialize tx_queue_len. The problem with the commit was choosing 1 as the fallback value. A qdisc queue length of 1 causes more harm than good, because it creates hard to debug situations for userspace. It gives userspace a false sense of a working config after attaching a qdisc. As low volume traffic (that doesn't activate the qdisc policy) works, like ping, while traffic that e.g. needs shaping cannot reach the configured policy levels, given the queue length is too small. This patch change the value to DEFAULT_TX_QUEUE_LEN, given other IFF_NO_QUEUE devices (that call ether_setup()) also use this value. Fixes: a813104d9233 ("IFF_NO_QUEUE: Fix for drivers not calling ether_setup()") Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-07udp: do fwd memory scheduling on dequeuePaolo Abeni
A new argument is added to __skb_recv_datagram to provide an explicit skb destructor, invoked under the receive queue lock. The UDP protocol uses such argument to perform memory reclaiming on dequeue, so that the UDP protocol does not set anymore skb->desctructor. Instead explicit memory reclaiming is performed at close() time and when skbs are removed from the receive queue. The in kernel UDP protocol users now need to call a skb_recv_udp() variant instead of skb_recv_datagram() to properly perform memory accounting on dequeue. Overall, this allows acquiring only once the receive queue lock on dequeue. Tested using pktgen with random src port, 64 bytes packet, wire-speed on a 10G link as sender and udp_sink as the receiver, using an l4 tuple rxhash to stress the contention, and one or more udp_sink instances with reuseport. nr sinks vanilla patched 1 440 560 3 2150 2300 6 3650 3800 9 4450 4600 12 6250 6450 v1 -> v2: - do rmem and allocated memory scheduling under the receive lock - do bulk scheduling in first_packet_length() and in udp_destruct_sock() - avoid the typdef for the dequeue callback Suggested-by: Eric Dumazet <edumazet@google.com> Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: Paolo Abeni <pabeni@redhat.com> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-04net: core: add UID to flows, rules, and routesLorenzo Colitti
- Define a new FIB rule attributes, FRA_UID_RANGE, to describe a range of UIDs. - Define a RTA_UID attribute for per-UID route lookups and dumps. - Support passing these attributes to and from userspace via rtnetlink. The value INVALID_UID indicates no UID was specified. - Add a UID field to the flow structures. Signed-off-by: Lorenzo Colitti <lorenzo@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-04net: core: Add a UID field to struct sock.Lorenzo Colitti
Protocol sockets (struct sock) don't have UIDs, but most of the time, they map 1:1 to userspace sockets (struct socket) which do. Various operations such as the iptables xt_owner match need access to the "UID of a socket", and do so by following the backpointer to the struct socket. This involves taking sk_callback_lock and doesn't work when there is no socket because userspace has already called close(). Simplify this by adding a sk_uid field to struct sock whose value matches the UID of the corresponding struct socket. The semantics are as follows: 1. Whenever sk_socket is non-null: sk_uid is the same as the UID in sk_socket, i.e., matches the return value of sock_i_uid. Specifically, the UID is set when userspace calls socket(), fchown(), or accept(). 2. When sk_socket is NULL, sk_uid is defined as follows: - For a socket that no longer has a sk_socket because userspace has called close(): the previous UID. - For a cloned socket (e.g., an incoming connection that is established but on which userspace has not yet called accept): the UID of the socket it was cloned from. - For a socket that has never had an sk_socket: UID 0 inside the user namespace corresponding to the network namespace the socket belongs to. Kernel sockets created by sock_create_kern are a special case of #1 and sk_uid is the user that created them. For kernel sockets created at network namespace creation time, such as the per-processor ICMP and TCP sockets, this is the user that created the network namespace. Signed-off-by: Lorenzo Colitti <lorenzo@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-03dccp: do not release listeners too soonEric Dumazet
Andrey Konovalov reported following error while fuzzing with syzkaller : IPv4: Attempt to release alive inet socket ffff880068e98940 kasan: CONFIG_KASAN_INLINE enabled kasan: GPF could be caused by NULL-ptr deref or user memory access general protection fault: 0000 [#1] SMP KASAN Modules linked in: CPU: 1 PID: 3905 Comm: a.out Not tainted 4.9.0-rc3+ #333 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011 task: ffff88006b9e0000 task.stack: ffff880068770000 RIP: 0010:[<ffffffff819ead5f>] [<ffffffff819ead5f>] selinux_socket_sock_rcv_skb+0xff/0x6a0 security/selinux/hooks.c:4639 RSP: 0018:ffff8800687771c8 EFLAGS: 00010202 RAX: ffff88006b9e0000 RBX: 1ffff1000d0eee3f RCX: 1ffff1000d1d312a RDX: 1ffff1000d1d31a6 RSI: dffffc0000000000 RDI: 0000000000000010 RBP: ffff880068777360 R08: 0000000000000000 R09: 0000000000000002 R10: dffffc0000000000 R11: 0000000000000006 R12: ffff880068e98940 R13: 0000000000000002 R14: ffff880068777338 R15: 0000000000000000 FS: 00007f00ff760700(0000) GS:ffff88006cd00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000000020008000 CR3: 000000006a308000 CR4: 00000000000006e0 Stack: ffff8800687771e0 ffffffff812508a5 ffff8800686f3168 0000000000000007 ffff88006ac8cdfc ffff8800665ea500 0000000041b58ab3 ffffffff847b5480 ffffffff819eac60 ffff88006b9e0860 ffff88006b9e0868 ffff88006b9e07f0 Call Trace: [<ffffffff819c8dd5>] security_sock_rcv_skb+0x75/0xb0 security/security.c:1317 [<ffffffff82c2a9e7>] sk_filter_trim_cap+0x67/0x10e0 net/core/filter.c:81 [<ffffffff82b81e60>] __sk_receive_skb+0x30/0xa00 net/core/sock.c:460 [<ffffffff838bbf12>] dccp_v4_rcv+0xdb2/0x1910 net/dccp/ipv4.c:873 [<ffffffff83069d22>] ip_local_deliver_finish+0x332/0xad0 net/ipv4/ip_input.c:216 [< inline >] NF_HOOK_THRESH ./include/linux/netfilter.h:232 [< inline >] NF_HOOK ./include/linux/netfilter.h:255 [<ffffffff8306abd2>] ip_local_deliver+0x1c2/0x4b0 net/ipv4/ip_input.c:257 [< inline >] dst_input ./include/net/dst.h:507 [<ffffffff83068500>] ip_rcv_finish+0x750/0x1c40 net/ipv4/ip_input.c:396 [< inline >] NF_HOOK_THRESH ./include/linux/netfilter.h:232 [< inline >] NF_HOOK ./include/linux/netfilter.h:255 [<ffffffff8306b82f>] ip_rcv+0x96f/0x12f0 net/ipv4/ip_input.c:487 [<ffffffff82bd9fb7>] __netif_receive_skb_core+0x1897/0x2a50 net/core/dev.c:4213 [<ffffffff82bdb19a>] __netif_receive_skb+0x2a/0x170 net/core/dev.c:4251 [<ffffffff82bdb493>] netif_receive_skb_internal+0x1b3/0x390 net/core/dev.c:4279 [<ffffffff82bdb6b8>] netif_receive_skb+0x48/0x250 net/core/dev.c:4303 [<ffffffff8241fc75>] tun_get_user+0xbd5/0x28a0 drivers/net/tun.c:1308 [<ffffffff82421b5a>] tun_chr_write_iter+0xda/0x190 drivers/net/tun.c:1332 [< inline >] new_sync_write fs/read_write.c:499 [<ffffffff8151bd44>] __vfs_write+0x334/0x570 fs/read_write.c:512 [<ffffffff8151f85b>] vfs_write+0x17b/0x500 fs/read_write.c:560 [< inline >] SYSC_write fs/read_write.c:607 [<ffffffff81523184>] SyS_write+0xd4/0x1a0 fs/read_write.c:599 [<ffffffff83fc02c1>] entry_SYSCALL_64_fastpath+0x1f/0xc2 It turns out DCCP calls __sk_receive_skb(), and this broke when lookups no longer took a reference on listeners. Fix this issue by adding a @refcounted parameter to __sk_receive_skb(), so that sock_put() is used only when needed. Fixes: 3b24d854cb35 ("tcp/dccp: do not touch listener sk_refcnt under synflood") Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: Andrey Konovalov <andreyknvl@google.com> Tested-by: Andrey Konovalov <andreyknvl@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-31net: mangle zero checksum in skb_checksum_help()Eric Dumazet
Sending zero checksum is ok for TCP, but not for UDP. UDPv6 receiver should by default drop a frame with a 0 checksum, and UDPv4 would not verify the checksum and might accept a corrupted packet. Simply replace such checksum by 0xffff, regardless of transport. This error was caught on SIT tunnels, but seems generic. Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Maciej Żenczykowski <maze@google.com> Cc: Willem de Bruijn <willemb@google.com> Acked-by: Maciej Żenczykowski <maze@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-31net: clear sk_err_soft in sk_clone_lock()Eric Dumazet
At accept() time, it is possible the parent has a non zero sk_err_soft, leftover from a prior error. Make sure we do not leave this value in the child, as it makes future getsockopt(SO_ERROR) calls quite unreliable. Signed-off-by: Eric Dumazet <edumazet@google.com> Acked-by: Soheil Hassas Yeganeh <soheil@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>