summaryrefslogtreecommitdiff
path: root/net/ipv6
AgeCommit message (Collapse)Author
2017-02-07net: add dst_pending_confirm flag to skbuffJulian Anastasov
Add new skbuff flag to allow protocols to confirm neighbour. When same struct dst_entry can be used for many different neighbours we can not use it for pending confirmations. Add sock_confirm_neigh() helper to confirm the neighbour and use it for IPv4, IPv6 and VRF before dst_neigh_output. Signed-off-by: Julian Anastasov <ja@ssi.bg> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-02-07ipv6: sr: fix non static symbol warningsWei Yongjun
Fixes the following sparse warnings: net/ipv6/seg6_iptunnel.c:58:5: warning: symbol 'nla_put_srh' was not declared. Should it be static? net/ipv6/seg6_iptunnel.c:238:5: warning: symbol 'seg6_input' was not declared. Should it be static? net/ipv6/seg6_iptunnel.c:254:5: warning: symbol 'seg6_output' was not declared. Should it be static? Signed-off-by: Wei Yongjun <weiyongjun1@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-02-04net: ipv6: Use compressed IPv6 addresses showing route replace errorDavid Ahern
ip6_print_replace_route_err logs an error if a route replace fails with IPv6 addresses in the full format. e.g,: IPv6: IPV6: multipath route replace failed (check consistency of installed routes): 2001:0db8:0200:0000:0000:0000:0000:0000 nexthop 2001:0db8:0001:0000:0000:0000:0000:0016 ifi 0 Change the message to dump the addresses in the compressed format. Signed-off-by: David Ahern <dsa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-02-04net: ipv6: Change notifications for multipath delete to RTA_MULTIPATHDavid Ahern
If an entire multipath route is deleted using prefix and len (without any nexthops), send a single RTM_DELROUTE notification with the full route using RTA_MULTIPATH. This is done by generating the skb before the route delete when all of the sibling routes are still present but sending it after the route has been removed from the FIB. The skip_notify flag is used to tell the lower fib code not to send notifications for the individual nexthop routes. If a route is deleted using RTA_MULTIPATH for any nexthops or a single nexthop entry is deleted, then the nexthops are deleted one at a time with notifications sent as each hop is deleted. This is necessary given that IPv6 allows individual hops within a route to be deleted. Signed-off-by: David Ahern <dsa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-02-04net: ipv6: Change notifications for multipath add to RTA_MULTIPATHDavid Ahern
Change ip6_route_multipath_add to send one notifciation with the full route encoded with RTA_MULTIPATH instead of a series of individual routes. This is done by adding a skip_notify flag to the nl_info struct. The flag is used to skip sending of the notification in the fib code that actually inserts the route. Once the full route has been added, a notification is generated with all nexthops. ip6_route_multipath_add handles 3 use cases: new routes, route replace, and route append. The multipath notification generated needs to be consistent with the order of the nexthops and it should be consistent with the order in a FIB dump which means the route with the first nexthop needs to be used as the route reference. For the first 2 cases (new and replace), a reference to the route used to send the notification is obtained by saving the first route added. For the append case, the last route added is used to loop back to its first sibling route which is the first nexthop in the multipath route. Signed-off-by: David Ahern <dsa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-02-04net: ipv6: Add support to dump multipath routes via RTA_MULTIPATH attributeDavid Ahern
IPv6 returns multipath routes as a series of individual routes making their display and handling by userspace different and more complicated than IPv4, putting the burden on the user to see that a route is part of a multipath route and internally creating a multipath route if desired (e.g., libnl does this as of commit 29b71371e764). This patch addresses this difference, allowing multipath routes to be returned using the RTA_MULTIPATH attribute. The end result is that IPv6 multipath routes can be treated and displayed in a format similar to IPv4: $ ip -6 ro ls vrf red 2001:db8:1::/120 dev eth1 proto kernel metric 256 pref medium 2001:db8:2::/120 dev eth2 proto kernel metric 256 pref medium 2001:db8:200::/120 metric 1024 nexthop via 2001:db8:1::2 dev eth1 weight 1 nexthop via 2001:db8:2::2 dev eth2 weight 1 Signed-off-by: David Ahern <dsa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-02-04net: ipv6: Allow shorthand delete of all nexthops in multipath routeDavid Ahern
IPv4 allows multipath routes to be deleted using just the prefix and length. For example: $ ip ro ls vrf red unreachable default metric 8192 1.1.1.0/24 nexthop via 10.100.1.254 dev eth1 weight 1 nexthop via 10.11.200.2 dev eth11.200 weight 1 10.11.200.0/24 dev eth11.200 proto kernel scope link src 10.11.200.3 10.100.1.0/24 dev eth1 proto kernel scope link src 10.100.1.3 $ ip ro del 1.1.1.0/24 vrf red $ ip ro ls vrf red unreachable default metric 8192 10.11.200.0/24 dev eth11.200 proto kernel scope link src 10.11.200.3 10.100.1.0/24 dev eth1 proto kernel scope link src 10.100.1.3 The same notation does not work with IPv6 because of how multipath routes are implemented for IPv6. For IPv6 only the first nexthop of a multipath route is deleted if the request contains only a prefix and length. This leads to unnecessary complexity in userspace dealing with IPv6 multipath routes. This patch allows all nexthops to be deleted without specifying each one in the delete request. Internally, this is done by walking the sibling list of the route matching the specifications given (prefix, length, metric, protocol, etc). $ ip -6 ro ls vrf red 2001:db8:1::/120 dev eth1 proto kernel metric 256 pref medium 2001:db8:2::/120 dev eth2 proto kernel metric 256 pref medium 2001:db8:200::/120 via 2001:db8:1::2 dev eth1 metric 1024 pref medium 2001:db8:200::/120 via 2001:db8:2::2 dev eth2 metric 1024 pref medium ... $ ip -6 ro del vrf red 2001:db8:200::/120 $ ip -6 ro ls vrf red 2001:db8:1::/120 dev eth1 proto kernel metric 256 pref medium 2001:db8:2::/120 dev eth2 proto kernel metric 256 pref medium ... Because IPv6 allows individual nexthops to be deleted without deleting the entire route, the ip6_route_multipath_del and non-multipath code path (ip6_route_del) have to be discriminated so that all nexthops are only deleted for the latter case. This is done by making the existing fc_type in fib6_config a u16 and then adding a new u16 field with fc_delete_all_nh as the first bit. Suggested-by: Dinesh Dutt <ddutt@cumulusnetworks.com> Signed-off-by: David Ahern <dsa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-02-03Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf-nextDavid S. Miller
Pablo Neira Ayuso says: ==================== Netfilter updates for net-next The following patchset contains Netfilter updates for your net-next tree, they are: 1) Stash ctinfo 3-bit field into pointer to nf_conntrack object from sk_buff so we only access one single cacheline in the conntrack hotpath. Patchset from Florian Westphal. 2) Don't leak pointer to internal structures when exporting x_tables ruleset back to userspace, from Willem DeBruijn. This includes new helper functions to copy data to userspace such as xt_data_to_user() as well as conversions of our ip_tables, ip6_tables and arp_tables clients to use it. Not surprinsingly, ebtables requires an ad-hoc update. There is also a new field in x_tables extensions to indicate the amount of bytes that we copy to userspace. 3) Add nf_log_all_netns sysctl: This new knob allows you to enable logging via nf_log infrastructure for all existing netnamespaces. Given the effort to provide pernet syslog has been discontinued, let's provide a way to restore logging using netfilter kernel logging facilities in trusted environments. Patch from Michal Kubecek. 4) Validate SCTP checksum from conntrack helper, from Davide Caratti. 5) Merge UDPlite conntrack and NAT helpers into UDP, this was mostly a copy&paste from the original helper, from Florian Westphal. 6) Reset netfilter state when duplicating packets, also from Florian. 7) Remove unnecessary check for broadcast in IPv6 in pkttype match and nft_meta, from Liping Zhang. 8) Add missing code to deal with loopback packets from nft_meta when used by the netdev family, also from Liping. 9) Several cleanups on nf_tables, one to remove unnecessary check from the netlink control plane path to add table, set and stateful objects and code consolidation when unregister chain hooks, from Gao Feng. 10) Fix harmless reference counter underflow in IPVS that, however, results in problems with the introduction of the new refcount_t type, from David Windsor. 11) Enable LIBCRC32C from nf_ct_sctp instead of nf_nat_sctp, from Davide Caratti. 12) Missing documentation on nf_tables uapi header, from Liping Zhang. 13) Use rb_entry() helper in xt_connlimit, from Geliang Tang. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2017-02-03net: ipv6: Set protocol to kernel for local routesDavid Ahern
IPv6 stack does not set the protocol for local routes, so those routes show up with proto "none": $ ip -6 ro ls table local local ::1 dev lo proto none metric 0 pref medium local 2100:3:: dev lo proto none metric 0 pref medium local 2100:3::4 dev lo proto none metric 0 pref medium local fe80:: dev lo proto none metric 0 pref medium ... Set rt6i_protocol to RTPROT_KERNEL for consistency with IPv4. Now routes show up with proto "kernel": $ ip -6 ro ls table local local ::1 dev lo proto kernel metric 0 pref medium local 2100:3:: dev lo proto kernel metric 0 pref medium local 2100:3::4 dev lo proto kernel metric 0 pref medium local fe80:: dev lo proto kernel metric 0 pref medium ... Signed-off-by: David Ahern <dsa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-02-03tcp: add tcp_mss_clamp() helperEric Dumazet
Small cleanup factorizing code doing the TCP_MAXSEG clamping. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-02-02Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller
All merge conflicts were simple overlapping changes. Signed-off-by: David S. Miller <davem@davemloft.net>
2017-02-02netfilter: allow logging from non-init namespacesMichal Kubeček
Commit 69b34fb996b2 ("netfilter: xt_LOG: add net namespace support for xt_LOG") disabled logging packets using the LOG target from non-init namespaces. The motivation was to prevent containers from flooding kernel log of the host. The plan was to keep it that way until syslog namespace implementation allows containers to log in a safe way. However, the work on syslog namespace seems to have hit a dead end somewhere in 2013 and there are users who want to use xt_LOG in all network namespaces. This patch allows to do so by setting /proc/sys/net/netfilter/nf_log_all_netns to a nonzero value. This sysctl is only accessible from init_net so that one cannot switch the behaviour from inside a container. Signed-off-by: Michal Kubecek <mkubecek@suse.cz> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2017-02-02netfilter: merge ctinfo into nfct pointer storage areaFlorian Westphal
After this change conntrack operations (lookup, creation, matching from ruleset) only access one instead of two sk_buff cache lines. This works for normal conntracks because those are allocated from a slab that guarantees hw cacheline or 8byte alignment (whatever is larger) so the 3 bits needed for ctinfo won't overlap with nf_conn addresses. Template allocation now does manual address alignment (see previous change) on arches that don't have sufficent kmalloc min alignment. Some spots intentionally use skb->_nfct instead of skb_nfct() helpers, this is to avoid undoing the skb_nfct() use when we remove untracked conntrack object in the future. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2017-02-02netfilter: add and use nf_ct_set helperFlorian Westphal
Add a helper to assign a nf_conn entry and the ctinfo bits to an sk_buff. This avoids changing code in followup patch that merges skb->nfct and skb->nfctinfo into skb->_nfct. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2017-02-02skbuff: add and use skb_nfct helperFlorian Westphal
Followup patch renames skb->nfct and changes its type so add a helper to avoid intrusive rename change later. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2017-02-02netfilter: reset netfilter state when duplicating packetFlorian Westphal
We should also toss nf_bridge_info, if any -- packet is leaving via ip_local_out, also, this skb isn't bridged -- it is a locally generated copy. Also this avoids the need to touch this later when skb->nfct is replaced with 'unsigned long _nfct' in followup patch. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2017-02-02netfilter: conntrack: no need to pass ctinfo to error handlerFlorian Westphal
It is never accessed for reading and the only places that write to it are the icmp(6) handlers, which also set skb->nfct (and skb->nfctinfo). The conntrack core specifically checks for attached skb->nfct after ->error() invocation and returns early in this case. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2017-02-01ipv6: pointer math error in ip6_tnl_parse_tlv_enc_lim()Dan Carpenter
Casting is a high precedence operation but "off" and "i" are in terms of bytes so we need to have some parenthesis here. Fixes: fbfa743a9d2a ("ipv6: fix ip6_tnl_parse_tlv_enc_lim()") Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-02-01net: ipv6: add NLM_F_APPEND in notifications when applicableDavid Ahern
IPv6 does not set the NLM_F_APPEND flag in notifications to signal that a NEWROUTE is an append versus a new route or a replaced one. Add the flag if the request has it. Signed-off-by: David Ahern <dsa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-02-01Merge branch 'master' of ↵David S. Miller
git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec-next Steffen Klassert says: ==================== pull request (net-next): ipsec-next 2017-02-01 1) Some typo fixes, from Alexander Alemayhu. 2) Don't acquire state lock in get_mtu functions. The only rece against a dead state does not matter. From Florian Westphal. 3) Remove xfrm4_state_fini, it is unused for more than 10 years. From Florian Westphal. 4) Various rcu usage improvements. From Florian Westphal. 5) Properly handle crypto arrors in ah4/ah6. From Gilad Ben-Yossef. 6) Try to avoid skb linearization in esp4 and esp6. 7) The esp trailer is now set up in different places, add a helper for this. 8) With the upcomming usage of gro_cells in IPsec, a gro merged skb can have a secpath. Drop it before freeing or reusing the skb. 9) Add a xfrm dummy network device for napi. With this we can use gro_cells from within xfrm, it allows IPsec GRO without impact on the generic networking code. Please pull or let me know if there are problems. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-30ipv6: Paritially checksum full MTU framesVlad Yasevich
IPv6 will mark data that is smaller that mtu - headersize as CHECKSUM_PARTIAL, but if the data will completely fill the mtu, the packet checksum will be computed in software instead. Extend the conditional to include the data that fills the mtu as well. Signed-off-by: Vladislav Yasevich <vyasevic@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-30lwtunnel: remove device arg to lwtunnel_build_stateDavid Ahern
Nothing about lwt state requires a device reference, so remove the input argument. Signed-off-by: David Ahern <dsa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-30net: Avoid receiving packets with an l3mdev on unbound UDP socketsRobert Shearman
Packets arriving in a VRF currently are delivered to UDP sockets that aren't bound to any interface. TCP defaults to not delivering packets arriving in a VRF to unbound sockets. IP route lookup and socket transmit both assume that unbound means using the default table and UDP applications that haven't been changed to be aware of VRFs may not function correctly in this case since they may not be able to handle overlapping IP address ranges, or be able to send packets back to the original sender if required. So add a sysctl, udp_l3mdev_accept, to control this behaviour with it being analgous to the existing tcp_l3mdev_accept, namely to allow a process to have a VRF-global listen socket. Have this default to off as this is the behaviour that users will expect, given that there is no explicit mechanism to set unmodified VRF-unaware application into a default VRF. Signed-off-by: Robert Shearman <rshearma@brocade.com> Acked-by: David Ahern <dsa@cumulusnetworks.com> Tested-by: David Ahern <dsa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-28Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller
Two trivial overlapping changes conflicts in MPLS and mlx5. Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-27tcp: don't annotate mark on control socket from tcp_v6_send_response()Pablo Neira
Unlike ipv4, this control socket is shared by all cpus so we cannot use it as scratchpad area to annotate the mark that we pass to ip6_xmit(). Add a new parameter to ip6_xmit() to indicate the mark. The SCTP socket family caches the flowi6 structure in the sctp_transport structure, so we cannot use to carry the mark unless we later on reset it back, which I discarded since it looks ugly to me. Fixes: bf99b4ded5f8 ("tcp: fix mark propagation with fwmark_reflect enabled") Suggested-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-27net/ipv6: support more tunnel interfaces for EUI64 link-local generationFelix Jia
Signed-off-by: Felix Jia <felix.jia@alliedtelesis.co.nz> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-27net/ipv6: allow sysctl to change link-local address generation modeFelix Jia
The address generation mode for IPv6 link-local can only be configured by netlink messages. This patch adds the ability to change the address generation mode via sysctl. v1 -> v2 Removed the rtnl lock and switch to use RCU lock to iterate through the netdev list. v2 -> v3 Removed the addrgenmode variable from the idev structure and use the systcl storage for the flag. Simplifed the logic for sysctl handling by removing the supported for all operation. Added support for more types of tunnel interfaces for link-local address generation. Based the patches from net-next. v3 -> v4 Removed unnecessary whitespace changes. Signed-off-by: Felix Jia <felix.jia@alliedtelesis.co.nz> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-26net: ipv6: ignore null_entry on route dumpsDavid Ahern
lkp-robot reported a BUG: [ 10.151226] BUG: unable to handle kernel NULL pointer dereference at 00000198 [ 10.152525] IP: rt6_fill_node+0x164/0x4b8 [ 10.153307] *pdpt = 0000000012ee5001 *pde = 0000000000000000 [ 10.153309] [ 10.154492] Oops: 0000 [#1] [ 10.154987] CPU: 0 PID: 909 Comm: netifd Not tainted 4.10.0-rc4-00722-g41e8c70ee162-dirty #10 [ 10.156482] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.7.5-20140531_083030-gandalf 04/01/2014 [ 10.158254] task: d0deb000 task.stack: d0e0c000 [ 10.159059] EIP: rt6_fill_node+0x164/0x4b8 [ 10.159780] EFLAGS: 00010296 CPU: 0 [ 10.160404] EAX: 00000000 EBX: d10c2358 ECX: c1f7c6cc EDX: c1f6ff44 [ 10.161469] ESI: 00000000 EDI: c2059900 EBP: d0e0dc4c ESP: d0e0dbe4 [ 10.162534] DS: 007b ES: 007b FS: 0000 GS: 0033 SS: 0068 [ 10.163482] CR0: 80050033 CR2: 00000198 CR3: 10d94660 CR4: 000006b0 [ 10.164535] Call Trace: [ 10.164993] ? paravirt_sched_clock+0x9/0xd [ 10.165727] ? sched_clock+0x9/0xc [ 10.166329] ? sched_clock_cpu+0x19/0xe9 [ 10.166991] ? lock_release+0x13e/0x36c [ 10.167652] rt6_dump_route+0x4c/0x56 [ 10.168276] fib6_dump_node+0x1d/0x3d [ 10.168913] fib6_walk_continue+0xab/0x167 [ 10.169611] fib6_walk+0x2a/0x40 [ 10.170182] inet6_dump_fib+0xfb/0x1e0 [ 10.170855] netlink_dump+0xcd/0x21f This happens when the loopback device is set down and a ipv6 fib route dump is requested. ip6_null_entry is the root of all ipv6 fib tables making it integrated into the table and hence passed to the ipv6 route dump code. The null_entry route uses the loopback device for dst.dev but may not have rt6i_idev set because of the order in which initializations are done -- ip6_route_net_init is run before addrconf_init has initialized the loopback device. Fixing the initialization order is a much bigger problem with no obvious solution thus far. The BUG is triggered when the loopback is set down and the netif_running check added by a1a22c1206 fails. The fill_node descends to checking rt->rt6i_idev for ignore_routes_with_linkdown and since rt6i_idev is NULL it faults. The null_entry route should not be processed in a dump request. Catch and ignore. This check is done in rt6_dump_route as it is the highest place in the callchain with knowledge of both the route and the network namespace. Fixes: a1a22c1206("net: ipv6: Keep nexthop of multipath route on admin down") Signed-off-by: David Ahern <dsa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-26net: ipv6: remove skb_reserve in getrouteDavid Ahern
Remove skb_reserve and skb_reset_mac_header from inet6_rtm_getroute. The allocated skb is not passed through the routing engine (like it is for IPv4) and has not since the beginning of git time. Signed-off-by: David Ahern <dsa@cumulusnetworks.com> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-26Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nfDavid S. Miller
Pablo Neira Ayuso says: ==================== Netfilter fixes for net The following patchset contains a large batch with Netfilter fixes for your net tree, they are: 1) Two patches to solve conntrack garbage collector cpu hogging, one to remove GC_MAX_EVICTS and another to look at the ratio (scanned entries vs. evicted entries) to make a decision on whether to reduce or not the scanning interval. From Florian Westphal. 2) Two patches to fix incorrect set element counting if NLM_F_EXCL is is not set. Moreover, don't decrenent set->nelems from abort patch if -ENFILE which leaks a spare slot in the set. This includes a patch to deconstify the set walk callback to update set->ndeact. 3) Two fixes for the fwmark_reflect sysctl feature: Propagate mark to reply packets both from nf_reject and local stack, from Pau Espin Pedrol. 4) Fix incorrect handling of loopback traffic in rpfilter and nf_tables fib expression, from Liping Zhang. 5) Fix oops on stateful objects netlink dump, when no filter is specified. Also from Liping Zhang. 6) Fix a build error if proc is not available in ipt_CLUSTERIP, related to fix that was applied in the previous batch for net. From Arnd Bergmann. 7) Fix lack of string validation in table, chain, set and stateful object names in nf_tables, from Liping Zhang. Moreover, restrict maximum log prefix length to 127 bytes, otherwise explicitly bail out. 8) Two patches to fix spelling and typos in nf_tables uapi header file and Kconfig, patches from Alexander Alemayhu and William Breathitt Gray. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-25net/tcp-fastopen: Add new API supportWei Wang
This patch adds a new socket option, TCP_FASTOPEN_CONNECT, as an alternative way to perform Fast Open on the active side (client). Prior to this patch, a client needs to replace the connect() call with sendto(MSG_FASTOPEN). This can be cumbersome for applications who want to use Fast Open: these socket operations are often done in lower layer libraries used by many other applications. Changing these libraries and/or the socket call sequences are not trivial. A more convenient approach is to perform Fast Open by simply enabling a socket option when the socket is created w/o changing other socket calls sequence: s = socket() create a new socket setsockopt(s, IPPROTO_TCP, TCP_FASTOPEN_CONNECT …); newly introduced sockopt If set, new functionality described below will be used. Return ENOTSUPP if TFO is not supported or not enabled in the kernel. connect() With cookie present, return 0 immediately. With no cookie, initiate 3WHS with TFO cookie-request option and return -1 with errno = EINPROGRESS. write()/sendmsg() With cookie present, send out SYN with data and return the number of bytes buffered. With no cookie, and 3WHS not yet completed, return -1 with errno = EINPROGRESS. No MSG_FASTOPEN flag is needed. read() Return -1 with errno = EWOULDBLOCK/EAGAIN if connect() is called but write() is not called yet. Return -1 with errno = EWOULDBLOCK/EAGAIN if connection is established but no msg is received yet. Return number of bytes read if socket is established and there is msg received. The new API simplifies life for applications that always perform a write() immediately after a successful connect(). Such applications can now take advantage of Fast Open by merely making one new setsockopt() call at the time of creating the socket. Nothing else about the application's socket call sequence needs to change. Signed-off-by: Wei Wang <weiwan@google.com> Acked-by: Eric Dumazet <edumazet@google.com> Acked-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-25net: Remove __sk_dst_reset() in tcp_v6_connect()Wei Wang
Remove __sk_dst_reset() in the failure handling because __sk_dst_reset() will eventually get called when sk is released. No need to handle it in the protocol specific connect call. This is also to make the code path consistent with ipv4. Signed-off-by: Wei Wang <weiwan@google.com> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-24net: Specify the owning module for lwtunnel opsRobert Shearman
Modules implementing lwtunnel ops should not be allowed to unload while there is state alive using those ops, so specify the owning module for all lwtunnel ops. Signed-off-by: Robert Shearman <rshearma@brocade.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-24ipv6: fix ip6_tnl_parse_tlv_enc_lim()Eric Dumazet
This function suffers from multiple issues. First one is that pskb_may_pull() may reallocate skb->head, so the 'raw' pointer needs either to be reloaded or not used at all. Second issue is that NEXTHDR_DEST handling does not validate that the options are present in skb->data, so we might read garbage or access non existent memory. With help from Willem de Bruijn. Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: Dmitry Vyukov <dvyukov@google.com> Cc: Willem de Bruijn <willemb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-24ip6_tunnel: must reload ipv6h in ip6ip6_tnl_xmit()Eric Dumazet
Since ip6_tnl_parse_tlv_enc_lim() can call pskb_may_pull(), we must reload any pointer that was related to skb->head (or skb->data), or risk use after free. Fixes: c12b395a4664 ("gre: Support GRE over IPv6") Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Dmitry Kozlov <xeb@mail.ru> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-24Introduce a sysctl that modifies the value of PROT_SOCK.Krister Johansen
Add net.ipv4.ip_unprivileged_port_start, which is a per namespace sysctl that denotes the first unprivileged inet port in the namespace. To disable all privileged ports set this to zero. It also checks for overlap with the local port range. The privileged and local range may not overlap. The use case for this change is to allow containerized processes to bind to priviliged ports, but prevent them from ever being allowed to modify their container's network configuration. The latter is accomplished by ensuring that the network namespace is not a child of the user namespace. This modification was needed to allow the container manager to disable a namespace's priviliged port restrictions without exposing control of the network namespace to processes in the user namespace. Signed-off-by: Krister Johansen <kjlx@templeofstupid.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-22ipv6: add NUMA awareness to seg6_hmac_init_algo()Eric Dumazet
Since we allocate per cpu storage, let's also use NUMA hints. Signed-off-by: Eric Dumazet <edumazet@google.com> Acked-by: David Lebrun <david.lebrun@uclouvain.be> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-20gre6: Clean up unused struct ipv6_tel_txoption definitionJakub Sitnicki
Commit b05229f44228 ("gre6: Cleanup GREv6 transmit path, call common GRE functions") removed the ip6gre specific transmit function, but left the struct ipv6_tel_txoption definition. Clean it up. Signed-off-by: Jakub Sitnicki <jkbs@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-20ipv6: seg6_genl_set_tunsrc() must check kmemdup() return valueEric Dumazet
seg6_genl_get_tunsrc() and set_tun_src() do not handle tun_src being possibly NULL, so we must check kmemdup() return value and abort if it is NULL Fixes: 915d7e5e5930 ("ipv6: sr: add code base for control plane support of SR-IPv6") Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: David Lebrun <david.lebrun@uclouvain.be> Acked-by: David Lebrun <david.lebrun@uclouvain.be> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-19net: ipv6: Keep nexthop of multipath route on admin downDavid Ahern
IPv6 deletes route entries associated with multipath routes on an admin down where IPv4 does not. For example: $ ip ro ls vrf red unreachable default metric 8192 1.1.1.0/24 metric 64 nexthop via 10.100.1.254 dev eth1 weight 1 nexthop via 10.100.2.254 dev eth2 weight 1 10.100.1.0/24 dev eth1 proto kernel scope link src 10.100.1.4 10.100.2.0/24 dev eth2 proto kernel scope link src 10.100.2.4 $ ip -6 ro ls vrf red 2001:db8:1::/120 dev eth1 proto kernel metric 256 pref medium 2001:db8:2:: dev red proto none metric 0 pref medium 2001:db8:2::/120 dev eth2 proto kernel metric 256 pref medium 2001:db8:11::/120 via 2001:db8:1::16 dev eth1 metric 1024 pref medium 2001:db8:11::/120 via 2001:db8:2::17 dev eth2 metric 1024 pref medium ... Set link down: $ ip li set eth1 down IPv4 retains the multihop route but flags eth1 route as dead: $ ip ro ls vrf red unreachable default metric 8192 1.1.1.0/24 nexthop via 10.100.1.16 dev eth1 weight 1 dead linkdown nexthop via 10.100.2.16 dev eth2 weight 1 10.100.2.0/24 dev eth2 proto kernel scope link src 10.100.2.4 and IPv6 deletes the route as part of flushing all routes for the device: $ ip -6 ro ls vrf red 2001:db8:2:: dev red proto none metric 0 pref medium 2001:db8:2::/120 dev eth2 proto kernel metric 256 pref medium 2001:db8:11::/120 via 2001:db8:2::17 dev eth2 metric 1024 pref medium ... Worse, on admin up of the device the multipath route has to be deleted to get this leg of the route re-added. This patch keeps routes that are part of a multipath route if ignore_routes_with_linkdown is set with the dead and linkdown flags enabling consistency between IPv4 and IPv6: $ ip -6 ro ls vrf red 2001:db8:2:: dev red proto none metric 0 pref medium 2001:db8:2::/120 dev eth2 proto kernel metric 256 pref medium 2001:db8:11::/120 via 2001:db8:1::16 dev eth1 metric 1024 dead linkdown pref medium 2001:db8:11::/120 via 2001:db8:2::17 dev eth2 metric 1024 pref medium ... Signed-off-by: David Ahern <dsa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-19ipv6: addrconf: Avoid addrconf_disable_change() using RCU read-side lockKefeng Wang
Just like commit 4acd4945cd1e ("ipv6: addrconf: Avoid calling netdevice notifiers with RCU read-side lock"), it is unnecessary to make addrconf_disable_change() use RCU iteration over the netdev list, since it already holds the RTNL lock, or we may meet Illegal context switch in RCU read-side critical section. Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-18lwtunnel: fix autoload of lwt modulesDavid Ahern
Trying to add an mpls encap route when the MPLS modules are not loaded hangs. For example: CONFIG_MPLS=y CONFIG_NET_MPLS_GSO=m CONFIG_MPLS_ROUTING=m CONFIG_MPLS_IPTUNNEL=m $ ip route add 10.10.10.10/32 encap mpls 100 via inet 10.100.1.2 The ip command hangs: root 880 826 0 21:25 pts/0 00:00:00 ip route add 10.10.10.10/32 encap mpls 100 via inet 10.100.1.2 $ cat /proc/880/stack [<ffffffff81065a9b>] call_usermodehelper_exec+0xd6/0x134 [<ffffffff81065efc>] __request_module+0x27b/0x30a [<ffffffff814542f6>] lwtunnel_build_state+0xe4/0x178 [<ffffffff814aa1e4>] fib_create_info+0x47f/0xdd4 [<ffffffff814ae451>] fib_table_insert+0x90/0x41f [<ffffffff814a8010>] inet_rtm_newroute+0x4b/0x52 ... modprobe is trying to load rtnl-lwt-MPLS: root 881 5 0 21:25 ? 00:00:00 /sbin/modprobe -q -- rtnl-lwt-MPLS and it hangs after loading mpls_router: $ cat /proc/881/stack [<ffffffff81441537>] rtnl_lock+0x12/0x14 [<ffffffff8142ca2a>] register_netdevice_notifier+0x16/0x179 [<ffffffffa0033025>] mpls_init+0x25/0x1000 [mpls_router] [<ffffffff81000471>] do_one_initcall+0x8e/0x13f [<ffffffff81119961>] do_init_module+0x5a/0x1e5 [<ffffffff810bd070>] load_module+0x13bd/0x17d6 ... The problem is that lwtunnel_build_state is called with rtnl lock held preventing mpls_init from registering. Given the potential references held by the time lwtunnel_build_state it can not drop the rtnl lock to the load module. So, extract the module loading code from lwtunnel_build_state into a new function to validate the encap type. The new function is called while converting the user request into a fib_config which is well before any table, device or fib entries are examined. Fixes: 745041e2aaf1 ("lwtunnel: autoload of lwt modules") Signed-off-by: David Ahern <dsa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-18net: ipv6: remove prefix arg to rt6_fill_nodeDavid Ahern
The prefix arg to rt6_fill_node is non-0 in only 1 path - rt6_dump_route where a user is requesting a prefix only dump. Simplify rt6_fill_node by removing the prefix arg and moving the prefix check to rt6_dump_route. Signed-off-by: David Ahern <dsa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-18net: ipv6: remove nowait arg to rt6_fill_nodeDavid Ahern
All callers of rt6_fill_node pass 0 for nowait arg. Remove the arg and simplify rt6_fill_node accordingly. rt6_fill_node passes the nowait of 0 to ip6mr_get_route. Remove the nowait arg from it as well. Signed-off-by: David Ahern <dsa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-18inet: drop ->bind_conflictJosef Bacik
The only difference between inet6_csk_bind_conflict and inet_csk_bind_conflict is how they check the rcv_saddr, so delete this call back and simply change inet_csk_bind_conflict to call inet_rcv_saddr_equal. Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-18inet: collapse ipv4/v6 rcv_saddr_equal functions into oneJosef Bacik
We pass these per-protocol equal functions around in various places, but we can just have one function that checks the sk->sk_family and then do the right comparison function. I've also changed the ipv4 version to not cast to inet_sock since it is unneeded. Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-17Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller
2017-01-17esp: Introduce a helper to setup the trailerSteffen Klassert
We need to setup the trailer in two different cases, so add a helper to avoid code duplication. Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2017-01-17esp6: Avoid skb_cow_data whenever possibleSteffen Klassert
This patch tries to avoid skb_cow_data on esp6. On the encrypt side we add the IPsec tailbits to the linear part of the buffer if there is space on it. If there is no space on the linear part, we add a page fragment with the tailbits to the buffer and use separate src and dst scatterlists. On the decrypt side, we leave the buffer as it is if it is not cloned. With this, we can avoid a linearization of the buffer in most of the cases. Joint work with: Sowmini Varadhan <sowmini.varadhan@oracle.com> Ilan Tayari <ilant@mellanox.com> Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com> Signed-off-by: Ilan Tayari <ilant@mellanox.com> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2017-01-16ip6_tunnel: Account for tunnel header in tunnel MTUJakub Sitnicki
With ip6gre we have a tunnel header which also makes the tunnel MTU smaller. We need to reserve room for it. Previously we were using up space reserved for the Tunnel Encapsulation Limit option header (RFC 2473). Also, after commit b05229f44228 ("gre6: Cleanup GREv6 transmit path, call common GRE functions") our contract with the caller has changed. Now we check if the packet length exceeds the tunnel MTU after the tunnel header has been pushed, unlike before. This is reflected in the check where we look at the packet length minus the size of the tunnel header, which is already accounted for in tunnel MTU. Fixes: b05229f44228 ("gre6: Cleanup GREv6 transmit path, call common GRE functions") Signed-off-by: Jakub Sitnicki <jkbs@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>