summaryrefslogtreecommitdiff
path: root/net/ipv4
AgeCommit message (Collapse)Author
2015-12-15net: Eliminate NETIF_F_GEN_CSUM and NETIF_F_V[46]_CSUMTom Herbert
These netif flags are unnecessary convolutions. It is more straightforward to just use NETIF_F_HW_CSUM, NETIF_F_IP_CSUM, and NETIF_F_IPV6_CSUM directly. This patch also: - Cleans up can_checksum_protocol - Simplifies netdev_intersect_features Signed-off-by: Tom Herbert <tom@herbertland.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-12-15net: Rename NETIF_F_ALL_CSUM to NETIF_F_CSUM_MASKTom Herbert
The name NETIF_F_ALL_CSUM is a misnomer. This does not correspond to the set of features for offloading all checksums. This is a mask of the checksum offload related features bits. It is incorrect to set both NETIF_F_HW_CSUM and NETIF_F_IP_CSUM or NETIF_F_IPV6 at the same time for features of a device. This patch: - Changes instances of NETIF_F_ALL_CSUM to NETIF_F_CSUM_MASK (where NETIF_F_ALL_CSUM is being used as a mask). - Changes bonding, sfc/efx, ipvlan, macvlan, vlan, and team drivers to use NEITF_F_HW_CSUM in features list instead of NETIF_F_ALL_CSUM. Signed-off-by: Tom Herbert <tom@herbertland.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-12-14net: fix IP early demux racesEric Dumazet
David Wilder reported crashes caused by dst reuse. <quote David> I am seeing a crash on a distro V4.2.3 kernel caused by a double release of a dst_entry. In ipv4_dst_destroy() the call to list_empty() finds a poisoned next pointer, indicating the dst_entry has already been removed from the list and freed. The crash occurs 18 to 24 hours into a run of a network stress exerciser. </quote> Thanks to his detailed report and analysis, we were able to understand the core issue. IP early demux can associate a dst to skb, after a lookup in TCP/UDP sockets. When socket cache is not properly set, we want to store into sk->sk_dst_cache the dst for future IP early demux lookups, by acquiring a stable refcount on the dst. Problem is this acquisition is simply using an atomic_inc(), which works well, unless the dst was queued for destruction from dst_release() noticing dst refcount went to zero, if DST_NOCACHE was set on dst. We need to make sure current refcount is not zero before incrementing it, or risk double free as David reported. This patch, being a stable candidate, adds two new helpers, and use them only from IP early demux problematic paths. It might be possible to merge in net-next skb_dst_force() and skb_dst_force_safe(), but I prefer having the smallest patch for stable kernels : Maybe some skb_dst_force() callers do not expect skb->dst can suddenly be cleared. Can probably be backported back to linux-3.6 kernels Reported-by: David J. Wilder <dwilder@us.ibm.com> Tested-by: David J. Wilder <dwilder@us.ibm.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-12-14net: add validation for the socket syscall protocol argumentHannes Frederic Sowa
郭永刚 reported that one could simply crash the kernel as root by using a simple program: int socket_fd; struct sockaddr_in addr; addr.sin_port = 0; addr.sin_addr.s_addr = INADDR_ANY; addr.sin_family = 10; socket_fd = socket(10,3,0x40000000); connect(socket_fd , &addr,16); AF_INET, AF_INET6 sockets actually only support 8-bit protocol identifiers. inet_sock's skc_protocol field thus is sized accordingly, thus larger protocol identifiers simply cut off the higher bits and store a zero in the protocol fields. This could lead to e.g. NULL function pointer because as a result of the cut off inet_num is zero and we call down to inet_autobind, which is NULL for raw sockets. kernel: Call Trace: kernel: [<ffffffff816db90e>] ? inet_autobind+0x2e/0x70 kernel: [<ffffffff816db9a4>] inet_dgram_connect+0x54/0x80 kernel: [<ffffffff81645069>] SYSC_connect+0xd9/0x110 kernel: [<ffffffff810ac51b>] ? ptrace_notify+0x5b/0x80 kernel: [<ffffffff810236d8>] ? syscall_trace_enter_phase2+0x108/0x200 kernel: [<ffffffff81645e0e>] SyS_connect+0xe/0x10 kernel: [<ffffffff81779515>] tracesys_phase2+0x84/0x89 I found no particular commit which introduced this problem. CVE: CVE-2015-8543 Cc: Cong Wang <cwang@twopensource.com> Reported-by: 郭永刚 <guoyonggang@360.cn> Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-12-14Merge branch 'master' of ↵Pablo Neira Ayuso
git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next Resolve conflict between commit 264640fc2c5f4f ("ipv6: distinguish frag queues by device for multicast and link-local packets") from the net tree and commit 029f7f3b8701c ("netfilter: ipv6: nf_defrag: avoid/free clone operations") from the nf-next tree. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Conflicts: net/ipv6/netfilter/nf_conntrack_reasm.c
2015-12-14Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nfDavid S. Miller
Pablo Neira Ayuso says: ==================== netfilter fixes for net The following patchset contains Netfilter fixes for you net tree, specifically for nf_tables and nfnetlink_queue, they are: 1) Avoid a compilation warning in nfnetlink_queue that was introduced in the previous merge window with the simplification of the conntrack integration, from Arnd Bergmann. 2) nfnetlink_queue is leaking the pernet subsystem registration from a failure path, patch from Nikolay Borisov. 3) Pass down netns pointer to batch callback in nfnetlink, this is the largest patch and it is not a bugfix but it is a dependency to resolve a splat in the correct way. 4) Fix a splat due to incorrect socket memory accounting with nfnetlink skbuff clones. 5) Add missing conntrack dependencies to NFT_DUP_IPV4 and NFT_DUP_IPV6. 6) Traverse the nftables commit list in reverse order from the commit path, otherwise we crash when the user applies an incremental update via 'nft -f' that deletes an object that was just introduced in this batch, from Xin Long. Regarding the compilation warning fix, many people have sent us (and keep sending us) patches to address this, that's why I'm including this batch even if this is not critical. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2015-12-13net: Flush local routes when device changes vrf associationDavid Ahern
The VRF driver cycles netdevs when an interface is enslaved or released: the down event is used to flush neighbor and route tables and the up event (if the interface was already up) effectively moves local and connected routes to the proper table. As of 4f823defdd5b the local route is left hanging around after a link down, so when a netdev is moved from one VRF to another (or released from a VRF altogether) local routes are left in the wrong table. Fix by handling the NETDEV_CHANGEUPPER event. When the upper dev is an L3mdev then call fib_disable_ip to flush all routes, local ones to. Fixes: 4f823defdd5b ("ipv4: fix to not remove local route on link down") Cc: Julian Anastasov <ja@ssi.bg> Signed-off-by: David Ahern <dsa@cumulusnetworks.com> Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-12-10netfilter: nf_dup: add missing dependencies with NF_CONNTRACKPablo Neira Ayuso
CONFIG_NF_CONNTRACK=m CONFIG_NF_DUP_IPV4=y results in: net/built-in.o: In function `nf_dup_ipv4': >> (.text+0xd434f): undefined reference to `nf_conntrack_untracked' Reported-by: kbuild test robot <fengguang.wu@intel.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2015-12-03Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller
Conflicts: drivers/net/ethernet/renesas/ravb_main.c kernel/bpf/syscall.c net/ipv4/ipmr.c All three conflicts were cases of overlapping changes. Signed-off-by: David S. Miller <davem@davemloft.net>
2015-12-03ipv4: igmp: Allow removing groups from a removed interfaceAndrew Lunn
When a multicast group is joined on a socket, a struct ip_mc_socklist is appended to the sockets mc_list containing information about the joined group. If the interface is hot unplugged, this entry becomes stale. Prior to commit 52ad353a5344f ("igmp: fix the problem when mc leave group") it was possible to remove the stale entry by performing a IP_DROP_MEMBERSHIP, passing either the old ifindex or ip address on the interface. However, this fix enforces that the interface must still exist. Thus with time, the number of stale entries grows, until sysctl_igmp_max_memberships is reached and then it is not possible to join and more groups. The previous patch fixes an issue where a IP_DROP_MEMBERSHIP is performed without specifying the interface, either by ifindex or ip address. However here we do supply one of these. So loosen the restriction on device existence to only apply when the interface has not been specified. This then restores the ability to clean up the stale entries. Signed-off-by: Andrew Lunn <andrew@lunn.ch> Fixes: 52ad353a5344f "(igmp: fix the problem when mc leave group") Signed-off-by: David S. Miller <davem@davemloft.net>
2015-12-02tcp: suppress too verbose messages in tcp_send_ack()Eric Dumazet
If tcp_send_ack() can not allocate skb, we properly handle this and setup a timer to try later. Use __GFP_NOWARN to avoid polluting syslog in the case host is under memory pressure, so that pertinent messages are not lost under a flood of useless information. sk_gfp_atomic() can use its gfp_mask argument (all callers currently were using GFP_ATOMIC before this patch) We rename sk_gfp_atomic() to sk_gfp_mask() to clearly express this function now takes into account its second argument (gfp_mask) Note that when tcp_transmit_skb() is called with clone_it set to false, we do not attempt memory allocations, so can pass a 0 gfp_mask, which most compilers can emit faster than a non zero or constant value. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-12-01net: rename SOCK_ASYNC_NOSPACE and SOCK_ASYNC_WAITDATAEric Dumazet
This patch is a cleanup to make following patch easier to review. Goal is to move SOCK_ASYNC_NOSPACE and SOCK_ASYNC_WAITDATA from (struct socket)->flags to a (struct socket_wq)->flags to benefit from RCU protection in sock_wake_async() To ease backports, we rename both constants. Two new helpers, sk_set_bit(int nr, struct sock *sk) and sk_clear_bit(int net, struct sock *sk) are added so that following patch can change their implementation. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-11-30tcp: initialize tp->copied_seq in case of cross SYN connectionEric Dumazet
Dmitry provided a syzkaller (http://github.com/google/syzkaller) generated program that triggers the WARNING at net/ipv4/tcp.c:1729 in tcp_recvmsg() : WARN_ON(tp->copied_seq != tp->rcv_nxt && !(flags & (MSG_PEEK | MSG_TRUNC))); His program is specifically attempting a Cross SYN TCP exchange, that we support (for the pleasure of hackers ?), but it looks we lack proper tcp->copied_seq initialization. Thanks again Dmitry for your report and testings. Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: Dmitry Vyukov <dvyukov@google.com> Tested-by: Dmitry Vyukov <dvyukov@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-11-30net: ipmr: add mfc newroute/delroute netlink supportNikolay Aleksandrov
This patch adds support to add and remove MFC entries. It uses the same attributes like the already present dump support in order to be consistent. There's one new entry - RTA_PREFSRC, it's used to denote an MFC_PROXY entry (see MRT_ADD_MFC vs MRT_ADD_MFC_PROXY). The already existing infrastructure is used to create and delete the entries, the netlink message gets converted internally to a struct mfcctl which is used with ipmr_mfc_add/delete. The other used attributes are: RTA_IIF - used for mfcc_parent (when adding it's required to be valid) RTA_SRC - used for mfcc_origin RTA_DST - used for mfcc_mcastgrp RTA_TABLE - the MRT table id RTA_MULTIPATH - the "oifs" ttl array Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-11-30net: ipmr: fix setsockopt error returnNikolay Aleksandrov
We can have both errors and we'll return the second one, fix it to return an error at a time as it's normal. I've overlooked this in my previous set. Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-11-30net: ipmr: move pimsm_enabled to pim.h and renameNikolay Aleksandrov
Move the inline pimsm_enabled() to pim.h and rename it to ipmr_pimsm_enabled to show it's for the ipv4 ipmr code since pim.h is used by IPv6 too. Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-11-30net: ipmr: move struct mr_table and VIF_EXISTS to mroute.hNikolay Aleksandrov
Move the definitions of VIF_EXISTS() and struct mr_table to mroute.h Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-11-30net: ipmr: remove unused MFC_NOTIFY flag and make the flags enumNikolay Aleksandrov
MFC_NOTIFY was introduced in kernel 2.1.68 but afaik it hasn't been used and I couldn't find any users currently so just remove it. Only MFC_STATIC is left, so move it into an enum, add a description and use BIT(). Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-11-30net: remove unnecessary mroute.h includesNikolay Aleksandrov
It looks like many files are including mroute.h unnecessarily, so remove the include. Most importantly remove it from ipv6. CC: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org> CC: Steffen Klassert <steffen.klassert@secunet.com> CC: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-11-24net: ipmr, ip6mr: fix vif/tunnel failure race conditionNikolay Aleksandrov
Since (at least) commit b17a7c179dd3 ("[NET]: Do sysfs registration as part of register_netdevice."), netdev_run_todo() deals only with unregistration, so we don't need to do the rtnl_unlock/lock cycle to finish registration when failing pimreg or dvmrp device creation. In fact that opens a race condition where someone can delete the device while rtnl is unlocked because it's fully registered. The problem gets worse when netlink support is introduced as there are more points of entry that can cause it and it also makes reusing that code correctly impossible. Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com> Reviewed-by: Cong Wang <cwang@twopensource.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-11-24net/ipv4/ipconfig: Rejoin broken lines in console outputGeert Uytterhoeven
Commit 09605cc12c078306 ("net ipv4: use preferred log methods") replaced a few calls of pr_cont() after a console print without a trailing newline by pr_info(), causing lines to be split during IP autoconfiguration, like: . , OK IP-Config: Got DHCP answer from 192.168.97.254, my address is 192.168.97.44 Convert these back to using pr_cont(), so it prints again: ., OK IP-Config: Got DHCP answer from 192.168.97.254, my address is 192.168.97.44 Absorb the printing of "my address ..." into the previous call to pr_info(), as there's no reason to use a continuation there. Convert one more pr_info() to print nameservers while we're at it. Fixes: 09605cc12c078306 ("net ipv4: use preferred log methods") Signed-off-by: Geert Uytterhoeven <geert+renesas@glider.be> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-11-23net: ipmr: factor out common vif init codeNikolay Aleksandrov
Factor out common vif init code used in both tunnel and pimreg initialization and create ipmr_init_vif_indev() function. Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-11-23net: ipmr: rearrange and cleanup setsockoptNikolay Aleksandrov
Take rtnl in the beginning unconditionally as most options already need it (one exception - MRT_DONE, see the comment inside), make the lock/unlock places central and move out the switch() local variables. Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-11-23net: ipmr: drop ip_mr_init() mrt_cachep null check as we'll panic if it failsNikolay Aleksandrov
It's not necessary to check for null as SLAB_PANIC is used and we'll panic if the alloc fails, so just drop it. Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-11-23net: ipmr: drop an instance of CONFIG_IP_MROUTE_MULTIPLE_TABLESNikolay Aleksandrov
Trivial replace of ifdef with IS_BUILTIN(). Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-11-23net: ipmr: make ip_mroute_getsockopt more understandableNikolay Aleksandrov
Use a switch to determine if optname is correct and set val accordingly. This produces a much more straight-forward and readable code. Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-11-23net: ipmr: fix code and comment styleNikolay Aleksandrov
Trivial code and comment style fixes, also removed some extra newlines, spaces and tabs. Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-11-23net: ipmr: remove some pimsm ifdefs and simplifyNikolay Aleksandrov
Add the helper pimsm_enabled() which replaces the old CONFIG_IP_PIMSM define and is used to check if any version of PIM-SM has been enabled. Use a single if defined(CONFIG_IP_PIMSM_V1) || defined(CONFIG_IP_PIMSM_V2) for the pim-sm shared code. This is okay w.r.t IGMPMSG_WHOLEPKT because only a VIFF_REGISTER device can send such packet, and it can't be created if pimsm_enabled() is false. Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-11-23net: ipmr: always define mroute_reg_vif_numNikolay Aleksandrov
Before mroute_reg_vif_num was defined only if any of the CONFIG_PIMSM_ options were set, but that's not really necessary as the size of the struct is the same in both cases (checked with pahole, both cases size is 3256 bytes) and we can remove some unnecessary ifdefs to simplify the code. Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-11-23net: ipmr: move the tbl id check in ipmr_new_tableNikolay Aleksandrov
Move the table id check in ipmr_new_table and make it return error pointer. We need this change for the upcoming netlink table manipulation support in order to avoid code duplication and a race condition. Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-11-23netfilter: remove duplicate includestephen hemminger
Signed-off-by: Stephen Hemminger <stephen@networkplumber.org> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2015-11-22net: ipmr: fix static mfc/dev leaks on table destructionNikolay Aleksandrov
When destroying an mrt table the static mfc entries and the static devices are kept, which leads to devices that can never be destroyed (because of refcnt taken) and leaked memory, for example: unreferenced object 0xffff880034c144c0 (size 192): comm "mfc-broken", pid 4777, jiffies 4320349055 (age 46001.964s) hex dump (first 32 bytes): 98 53 f0 34 00 88 ff ff 98 53 f0 34 00 88 ff ff .S.4.....S.4.... ef 0a 0a 14 01 02 03 04 00 00 00 00 01 00 00 00 ................ backtrace: [<ffffffff815c1b9e>] kmemleak_alloc+0x4e/0xb0 [<ffffffff811ea6e0>] kmem_cache_alloc+0x190/0x300 [<ffffffff815931cb>] ip_mroute_setsockopt+0x5cb/0x910 [<ffffffff8153d575>] do_ip_setsockopt.isra.11+0x105/0xff0 [<ffffffff8153e490>] ip_setsockopt+0x30/0xa0 [<ffffffff81564e13>] raw_setsockopt+0x33/0x90 [<ffffffff814d1e14>] sock_common_setsockopt+0x14/0x20 [<ffffffff814d0b51>] SyS_setsockopt+0x71/0xc0 [<ffffffff815cdbf6>] entry_SYSCALL_64_fastpath+0x16/0x7a [<ffffffffffffffff>] 0xffffffffffffffff Make sure that everything is cleaned on netns destruction. Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com> Reviewed-by: Cong Wang <cwang@twopensource.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-11-20tcp: fix potential huge kmalloc() calls in TCP_REPAIREric Dumazet
tcp_send_rcvq() is used for re-injecting data into tcp receive queue. Problems : - No check against size is performed, allowed user to fool kernel in attempting very large memory allocations, eventually triggering OOM when memory is fragmented. - In case of fault during the copy we do not return correct errno. Lets use alloc_skb_with_frags() to cook optimal skbs. Fixes: 292e8d8c8538 ("tcp: Move rcvq sending to tcp_input.c") Fixes: c0e88ff0f256 ("tcp: Repair socket queues") Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Pavel Emelyanov <xemul@parallels.com> Acked-by: Pavel Emelyanov <xemul@parallels.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-11-20tcp: fix Fast Open snmp over-counting bugYuchung Cheng
Fix incrementing TCPFastOpenActiveFailed snmp stats multiple times when the handshake experiences multiple SYN timeouts. Signed-off-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-11-20tcp: disable Fast Open on timeouts after handshakeYuchung Cheng
Some middle-boxes black-hole the data after the Fast Open handshake (https://www.ietf.org/proceedings/94/slides/slides-94-tcpm-13.pdf). The exact reason is unknown. The work-around is to disable Fast Open temporarily after multiple recurring timeouts with few or no data delivered in the established state. Signed-off-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: Christoph Paasch <cpaasch@apple.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-11-18tcp: md5: fix lockdep annotationEric Dumazet
When a passive TCP is created, we eventually call tcp_md5_do_add() with sk pointing to the child. It is not owner by the user yet (we will add this socket into listener accept queue a bit later anyway) But we do own the spinlock, so amend the lockdep annotation to avoid following splat : [ 8451.090932] net/ipv4/tcp_ipv4.c:923 suspicious rcu_dereference_protected() usage! [ 8451.090932] [ 8451.090932] other info that might help us debug this: [ 8451.090932] [ 8451.090934] [ 8451.090934] rcu_scheduler_active = 1, debug_locks = 1 [ 8451.090936] 3 locks held by socket_sockopt_/214795: [ 8451.090936] #0: (rcu_read_lock){.+.+..}, at: [<ffffffff855c6ac1>] __netif_receive_skb_core+0x151/0xe90 [ 8451.090947] #1: (rcu_read_lock){.+.+..}, at: [<ffffffff85618143>] ip_local_deliver_finish+0x43/0x2b0 [ 8451.090952] #2: (slock-AF_INET){+.-...}, at: [<ffffffff855acda5>] sk_clone_lock+0x1c5/0x500 [ 8451.090958] [ 8451.090958] stack backtrace: [ 8451.090960] CPU: 7 PID: 214795 Comm: socket_sockopt_ [ 8451.091215] Call Trace: [ 8451.091216] <IRQ> [<ffffffff856fb29c>] dump_stack+0x55/0x76 [ 8451.091229] [<ffffffff85123b5b>] lockdep_rcu_suspicious+0xeb/0x110 [ 8451.091235] [<ffffffff8564544f>] tcp_md5_do_add+0x1bf/0x1e0 [ 8451.091239] [<ffffffff85645751>] tcp_v4_syn_recv_sock+0x1f1/0x4c0 [ 8451.091242] [<ffffffff85642b27>] ? tcp_v4_md5_hash_skb+0x167/0x190 [ 8451.091246] [<ffffffff85647c78>] tcp_check_req+0x3c8/0x500 [ 8451.091249] [<ffffffff856451ae>] ? tcp_v4_inbound_md5_hash+0x11e/0x190 [ 8451.091253] [<ffffffff85647170>] tcp_v4_rcv+0x3c0/0x9f0 [ 8451.091256] [<ffffffff85618143>] ? ip_local_deliver_finish+0x43/0x2b0 [ 8451.091260] [<ffffffff856181b6>] ip_local_deliver_finish+0xb6/0x2b0 [ 8451.091263] [<ffffffff85618143>] ? ip_local_deliver_finish+0x43/0x2b0 [ 8451.091267] [<ffffffff85618d38>] ip_local_deliver+0x48/0x80 [ 8451.091270] [<ffffffff85618510>] ip_rcv_finish+0x160/0x700 [ 8451.091273] [<ffffffff8561900e>] ip_rcv+0x29e/0x3d0 [ 8451.091277] [<ffffffff855c74b7>] __netif_receive_skb_core+0xb47/0xe90 Fixes: a8afca0329988 ("tcp: md5: protects md5sig_info with RCU") Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: Willem de Bruijn <willemb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-11-18udp: remove duplicate includestephen hemminger
Signed-off-by: Stephen Hemminger <stephen@networkplumber.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-11-18net ipv4: use preferred log methodsBastian Stender
Replace printk calls with preferred unconditional log method calls to keep kernel messages clean. Added newline to "too small MTU" message. Signed-off-by: Bastian Stender <bst@pengutronix.de> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-11-16raw: increment correct SNMP counters for ICMP messagesBen Cartwright-Cox
Sending ICMP packets with raw sockets ends up in the SNMP counters logging the type as the first byte of the IPv4 header rather than the ICMP header. This is fixed by adding the IP Header Length to the casting into a icmphdr struct. Signed-off-by: Ben Cartwright-Cox <ben@benjojo.co.uk> Acked-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-11-15tcp: ensure proper barriers in lockless contextsEric Dumazet
Some functions access TCP sockets without holding a lock and might output non consistent data, depending on compiler and or architecture. tcp_diag_get_info(), tcp_get_info(), tcp_poll(), get_tcp4_sock() ... Introduce sk_state_load() and sk_state_store() to fix the issues, and more clearly document where this lack of locking is happening. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-11-09netfilter: Fix removal of GRE expectation entries created by PPTPAnthony Lineham
The uninitialized tuple structure caused incorrect hash calculation and the lookup failed. Link: https://bugzilla.kernel.org/show_bug.cgi?id=106441 Signed-off-by: Anthony Lineham <anthony.lineham@alliedtelesis.co.nz> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2015-11-05tcp: use correct req pointer in tcp_move_syn() callsEric Dumazet
I mistakenly took wrong request sock pointer when calling tcp_move_syn() @req_unhash is either a copy of @req, or a NULL value for FastOpen connexions (as we do not expect to unhash the temporary request sock from ehash table) Fixes: 805c4bc05705 ("tcp: fix req->saved_syn race") Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Ying Cai <ycai@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-11-05ipv4: use sk_fullsock() in ipv4_conntrack_defrag()Eric Dumazet
Before converting a 'socket pointer' into inet socket, use sk_fullsock() to detect timewait or request sockets. Fixes: ca6fb0651883 ("tcp: attach SYNACK messages to request sockets instead of listener") Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: Dmitry Vyukov <dvyukov@google.com> Tested-by: Dmitry Vyukov <dvyukov@google.com> Cc: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-11-05tcp: fix req->saved_syn raceEric Dumazet
For the reasons explained in commit ce1050089c96 ("tcp/dccp: fix ireq->pktopts race"), we need to make sure we do not access req->saved_syn unless we own the request sock. This fixes races for listeners using TCP_SAVE_SYN option. Fixes: e994b2f0fb92 ("tcp: do not lock listener to process SYN packets") Fixes: 079096f103fa ("tcp/dccp: install syn_recv requests into ehash table") Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: Ying Cai <ycai@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-11-04net: Fix prefsrc lookupsDavid Ahern
A bug report (https://bugzilla.kernel.org/show_bug.cgi?id=107071) noted that the follwoing ip command is failing with v4.3: $ ip route add 10.248.5.0/24 dev bond0.250 table vlan_250 src 10.248.5.154 RTNETLINK answers: Invalid argument 021dd3b8a142d changed the lookup of the given preferred source address to use the table id passed in, but this assumes the local entries are in the given table which is not necessarily true for non-VRF use cases. When validating the preferred source fallback to the local table on failure. Fixes: 021dd3b8a142d ("net: Add routes to the table associated with the device") Signed-off-by: David Ahern <dsa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-11-04ipv4: fix a potential deadlock in mcast getsockopt() pathWANG Cong
Sasha reported the following lockdep warning: Possible unsafe locking scenario: CPU0 CPU1 ---- ---- lock(sk_lock-AF_INET); lock(rtnl_mutex); lock(sk_lock-AF_INET); lock(rtnl_mutex); This is due to that for IP_MSFILTER and MCAST_MSFILTER, we take rtnl lock before the socket lock in setsockopt() path, but take the socket lock before rtnl lock in getsockopt() path. All the rest optnames are setsockopt()-only. Fix this by aligning the getsockopt() path with the setsockopt() path, so that all mcast socket path would be locked in the same order. Note, IPv6 part is different where rtnl lock is not held. Fixes: 54ff9ef36bdf ("ipv4, ipv6: kill ip_mc_{join, leave}_group and ipv6_sock_mc_{join, drop}") Reported-by: Sasha Levin <sasha.levin@oracle.com> Cc: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com> Reviewed-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-11-04ipv4: disable BH when changing ip local port rangeWANG Cong
This fixes the following lockdep warning: [ INFO: inconsistent lock state ] 4.3.0-rc7+ #1197 Not tainted --------------------------------- inconsistent {IN-SOFTIRQ-R} -> {SOFTIRQ-ON-W} usage. sysctl/1019 [HC0[0]:SC0[0]:HE1:SE1] takes: (&(&net->ipv4.ip_local_ports.lock)->seqcount){+.+-..}, at: [<ffffffff81921de7>] ipv4_local_port_range+0xb4/0x12a {IN-SOFTIRQ-R} state was registered at: [<ffffffff810bd682>] __lock_acquire+0x2f6/0xdf0 [<ffffffff810be6d5>] lock_acquire+0x11c/0x1a4 [<ffffffff818e599c>] inet_get_local_port_range+0x4e/0xae [<ffffffff8166e8e3>] udp_flow_src_port.constprop.40+0x23/0x116 [<ffffffff81671cb9>] vxlan_xmit_one+0x219/0xa6a [<ffffffff81672f75>] vxlan_xmit+0xa6b/0xaa5 [<ffffffff817f2deb>] dev_hard_start_xmit+0x2ae/0x465 [<ffffffff817f35ed>] __dev_queue_xmit+0x531/0x633 [<ffffffff817f3702>] dev_queue_xmit_sk+0x13/0x15 [<ffffffff818004a5>] neigh_resolve_output+0x12f/0x14d [<ffffffff81959cfa>] ip6_finish_output2+0x344/0x39f [<ffffffff8195bf58>] ip6_finish_output+0x88/0x8e [<ffffffff8195bfef>] ip6_output+0x91/0xe5 [<ffffffff819792ae>] dst_output_sk+0x47/0x4c [<ffffffff81979392>] NF_HOOK_THRESH.constprop.30+0x38/0x82 [<ffffffff8197981e>] mld_sendpack+0x189/0x266 [<ffffffff8197b28b>] mld_ifc_timer_expire+0x1ef/0x223 [<ffffffff810de581>] call_timer_fn+0xfb/0x28c [<ffffffff810ded1e>] run_timer_softirq+0x1c7/0x1f1 Fixes: b8f1a55639e6 ("udp: Add function to make source port for UDP tunnels") Cc: Tom Herbert <tom@herbertland.com> Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-11-03Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller
Minor overlapping changes in net/ipv4/ipmr.c, in 'net' we were fixing the "BH-ness" of the counter bumps whilst in 'net-next' the functions were modified to take an explicit 'net' parameter. Signed-off-by: David S. Miller <davem@davemloft.net>
2015-11-03xfrm: dst_entries_init() per-net dst_opsDan Streetman
Remove the dst_entries_init/destroy calls for xfrm4 and xfrm6 dst_ops templates; their dst_entries counters will never be used. Move the xfrm dst_ops initialization from the common xfrm/xfrm_policy.c to xfrm4/xfrm4_policy.c and xfrm6/xfrm6_policy.c, and call dst_entries_init and dst_entries_destroy for each net namespace. The ipv4 and ipv6 xfrms each create dst_ops template, and perform dst_entries_init on the templates. The template values are copied to each net namespace's xfrm.xfrm*_dst_ops. The problem there is the dst_ops pcpuc_entries field is a percpu counter and cannot be used correctly by simply copying it to another object. The result of this is a very subtle bug; changes to the dst entries counter from one net namespace may sometimes get applied to a different net namespace dst entries counter. This is because of how the percpu counter works; it has a main count field as well as a pointer to the percpu variables. Each net namespace maintains its own main count variable, but all point to one set of percpu variables. When any net namespace happens to change one of the percpu variables to outside its small batch range, its count is moved to the net namespace's main count variable. So with multiple net namespaces operating concurrently, the dst_ops entries counter can stray from the actual value that it should be; if counts are consistently moved from one net namespace to another (which my testing showed is likely), then one net namespace winds up with a negative dst_ops count while another winds up with a continually increasing count, eventually reaching its gc_thresh limit, which causes all new traffic on the net namespace to fail with -ENOBUFS. Signed-off-by: Dan Streetman <dan.streetman@canonical.com> Signed-off-by: Dan Streetman <ddstreet@ieee.org> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2015-11-02net: fix percpu memory leaksEric Dumazet
This patch fixes following problems : 1) percpu_counter_init() can return an error, therefore init_frag_mem_limit() must propagate this error so that inet_frags_init_net() can do the same up to its callers. 2) If ip[46]_frags_ns_ctl_register() fail, we must unwind properly and free the percpu_counter. Without this fix, we leave freed object in percpu_counters global list (if CONFIG_HOTPLUG_CPU) leading to crashes. This bug was detected by KASAN and syzkaller tool (http://github.com/google/syzkaller) Fixes: 6d7b857d541e ("net: use lib/percpu_counter API for fragmentation mem accounting") Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: Dmitry Vyukov <dvyukov@google.com> Cc: Hannes Frederic Sowa <hannes@stressinduktion.org> Cc: Jesper Dangaard Brouer <brouer@redhat.com> Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: David S. Miller <davem@davemloft.net>