Age | Commit message (Collapse) | Author |
|
git://git.kernel.org/pub/scm/linux/kernel/git/teigland/linux-dlm
Pull dlm updates from David Teigland:
"This set includes more low level communication layer cleanups.
The main change is the listening socket is no longer handled as a
special case of node connection sockets. There is one small fix for
checking the number of local connections"
* tag 'dlm-5.11' of git://git.kernel.org/pub/scm/linux/kernel/git/teigland/linux-dlm:
fs: dlm: check on existing node address
fs: dlm: constify addr_compare
fs: dlm: fix check for multi-homed hosts
fs: dlm: listen socket out of connection hash
fs: dlm: refactor sctp sock parameter
fs: dlm: move shutdown action to node creation
fs: dlm: move connect callback in node creation
fs: dlm: add helper for init connection
fs: dlm: handle non blocked connect event
fs: dlm: flush othercon at close
fs: dlm: add get buffer error handling
fs: dlm: define max send buffer
fs: dlm: fix proper srcu api call
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs updates from David Sterba:
"We have a mix of all kinds of changes, feature updates, core stuff,
performance improvements and lots of cleanups and preparatory changes.
User visible:
- export filesystem generation in sysfs
- new features for mount option 'rescue':
- what's currently supported is exported in sysfs
- 'ignorebadroots'/'ibadroots' - continue even if some essential
tree roots are not usable (extent, uuid, data reloc, device,
csum, free space)
- 'ignoredatacsums'/'idatacsums' - skip checksum verification on
data
- 'all' - now enables 'ignorebadroots' + 'ignoredatacsums' +
'nologreplay'
- export read mirror policy settings to sysfs, new policies will be
added in the future
- remove inode number cache feature (mount -o inode_cache), obsoleted
in 5.9
User visible fixes:
- async discard scheduling fixes on high loads
- update inode byte counter atomically so stat() does not report
wrong value in some cases
- free space tree fixes:
- correctly report status of v2 after remount
- clear v1 cache inodes when v2 is newly enabled after remount
Core:
- switch own tree lock implementation to standard rw semaphore:
- one-level lock nesting is not required anymore, the last use of
this was in free space that's now loaded asynchronously
- own implementation of adaptive spinning before taking mutex has
been part of rwsem
- performance seems to be better in general, much better (+tens
of percents) for some workloads
- lockdep does not complain
- finish direct IO conversion to iomap infrastructure, remove
temporary workaround for DSYNC after iomap API updates
- preparatory work to support data and metadata blocks smaller than
page:
- generalize code that assumes sectorsize == PAGE_SIZE, lots of
refactoring
- planned namely for 64K pages (eg. arm64, ppc64)
- scrub read-only support
- preparatory work for zoned allocation mode (SMR/ZBC/ZNS friendly):
- disable incompatible features
- round-robin superblock write
- free space cache (v1) is loaded asynchronously, remove tree path
recursion
- slightly improved time tacking for transaction kthread wake ups
Performance improvements (note that the numbers depend on load type or
other features and weren't run on the same machine):
- skip unnecessary work:
- do not start readahead for csum tree when scrubbing non-data
block groups
- do not start and wait for delalloc on snapshot roots on
transaction commit
- fix race when defragmenting leads to unnecessary IO
- dbench speedups (+throughput%/-max latency%):
- skip unnecessary searches for xattrs when logging an inode
(+10.8/-8.2)
- stop incrementing log batch when joining log transaction (1-2)
- unlock path before checking if extent is shared during nocow
writeback (+5.0/-20.5), on fio load +9.7% throughput/-9.8%
runtime
- several tree log improvements, eg. removing unnecessary
operations, fixing races that lead to additional work
(+12.7/-8.2)
- tree-checker error branches annotated with unlikely() (+3%
throughput)
Other:
- cleanups
- lockdep fixes
- more btrfs_inode conversions
- error variable cleanups"
* tag 'for-5.11-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: (198 commits)
btrfs: scrub: allow scrub to work with subpage sectorsize
btrfs: scrub: support subpage data scrub
btrfs: scrub: support subpage tree block scrub
btrfs: scrub: always allocate one full page for one sector for RAID56
btrfs: scrub: reduce width of extent_len/stripe_len from 64 to 32 bits
btrfs: refactor btrfs_lookup_bio_sums to handle out-of-order bvecs
btrfs: remove btrfs_find_ordered_sum call from btrfs_lookup_bio_sums
btrfs: handle sectorsize < PAGE_SIZE case for extent buffer accessors
btrfs: update num_extent_pages to support subpage sized extent buffer
btrfs: don't allow tree block to cross page boundary for subpage support
btrfs: calculate inline extent buffer page size based on page size
btrfs: factor out btree page submission code to a helper
btrfs: make btrfs_verify_data_csum follow sector size
btrfs: pass bio_offset to check_data_csum() directly
btrfs: rename bio_offset of extent_submit_bio_start_t to dio_file_offset
btrfs: fix lockdep warning when creating free space tree
btrfs: skip space_cache v1 setup when not using it
btrfs: remove free space items when disabling space cache v1
btrfs: warn when remount will not change the free space tree
btrfs: use superblock state to print space_cache mount option
...
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/jlayton/linux
Pull file locking fixes from Jeff Layton:
"A fix for some undefined integer overflow behavior, a typo in a
comment header, and a fix for a potential deadlock involving internal
senders of SIGIO/SIGURG"
* tag 'locks-v5.11' of git://git.kernel.org/pub/scm/linux/kernel/git/jlayton/linux:
fcntl: Fix potential deadlock in send_sig{io, urg}()
locks: fix a typo at a kernel-doc markup
locks: Fix UBSAN undefined behaviour in flock64_to_posix_lock
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core
Pull driver core updates from Greg KH:
"Here is the big driver core updates for 5.11-rc1
This time there was a lot of different work happening here for some
reason:
- redo of the fwnode link logic, speeding it up greatly
- auxiliary bus added (this was a tag that will be pulled in from
other trees/maintainers this merge window as well, as driver
subsystems started to rely on it)
- platform driver core cleanups on the way to fixing some long-time
api updates in future releases
- minor fixes and tweaks.
All have been in linux-next with no (finally) reported issues. Testing
there did helped in shaking issues out a lot :)"
* tag 'driver-core-5.11-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core: (39 commits)
driver core: platform: don't oops in platform_shutdown() on unbound devices
ACPI: Use fwnode_init() to set up fwnode
misc: pvpanic: Replace OF headers by mod_devicetable.h
misc: pvpanic: Combine ACPI and platform drivers
usb: host: sl811: Switch to use platform_get_mem_or_io()
vfio: platform: Switch to use platform_get_mem_or_io()
driver core: platform: Introduce platform_get_mem_or_io()
dyndbg: fix use before null check
soc: fix comment for freeing soc_dev_attr
driver core: platform: use bus_type functions
driver core: platform: change logic implementing platform_driver_probe
driver core: platform: reorder functions
driver core: make driver_probe_device() static
driver core: Fix a couple of typos
driver core: Reorder devices on successful probe
driver core: Delete pointless parameter in fwnode_operations.add_links
driver core: Refactor fw_devlink feature
efi: Update implementation of add_links() to create fwnode links
of: property: Update implementation of add_links() to create fwnode links
driver core: Use device's fwnode to check if it is waiting for suppliers
...
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next
Pull networking updates from Jakub Kicinski:
"Core:
- support "prefer busy polling" NAPI operation mode, where we defer
softirq for some time expecting applications to periodically busy
poll
- AF_XDP: improve efficiency by more batching and hindering the
adjacency cache prefetcher
- af_packet: make packet_fanout.arr size configurable up to 64K
- tcp: optimize TCP zero copy receive in presence of partial or
unaligned reads making zero copy a performance win for much smaller
messages
- XDP: add bulk APIs for returning / freeing frames
- sched: support fragmenting IP packets as they come out of conntrack
- net: allow virtual netdevs to forward UDP L4 and fraglist GSO skbs
BPF:
- BPF switch from crude rlimit-based to memcg-based memory accounting
- BPF type format information for kernel modules and related tracing
enhancements
- BPF implement task local storage for BPF LSM
- allow the FENTRY/FEXIT/RAW_TP tracing programs to use
bpf_sk_storage
Protocols:
- mptcp: improve multiple xmit streams support, memory accounting and
many smaller improvements
- TLS: support CHACHA20-POLY1305 cipher
- seg6: add support for SRv6 End.DT4/DT6 behavior
- sctp: Implement RFC 6951: UDP Encapsulation of SCTP
- ppp_generic: add ability to bridge channels directly
- bridge: Connectivity Fault Management (CFM) support as is defined
in IEEE 802.1Q section 12.14.
Drivers:
- mlx5: make use of the new auxiliary bus to organize the driver
internals
- mlx5: more accurate port TX timestamping support
- mlxsw:
- improve the efficiency of offloaded next hop updates by using
the new nexthop object API
- support blackhole nexthops
- support IEEE 802.1ad (Q-in-Q) bridging
- rtw88: major bluetooth co-existance improvements
- iwlwifi: support new 6 GHz frequency band
- ath11k: Fast Initial Link Setup (FILS)
- mt7915: dual band concurrent (DBDC) support
- net: ipa: add basic support for IPA v4.5
Refactor:
- a few pieces of in_interrupt() cleanup work from Sebastian Andrzej
Siewior
- phy: add support for shared interrupts; get rid of multiple driver
APIs and have the drivers write a full IRQ handler, slight growth
of driver code should be compensated by the simpler API which also
allows shared IRQs
- add common code for handling netdev per-cpu counters
- move TX packet re-allocation from Ethernet switch tag drivers to a
central place
- improve efficiency and rename nla_strlcpy
- number of W=1 warning cleanups as we now catch those in a patchwork
build bot
Old code removal:
- wan: delete the DLCI / SDLA drivers
- wimax: move to staging
- wifi: remove old WDS wifi bridging support"
* tag 'net-next-5.11' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (1922 commits)
net: hns3: fix expression that is currently always true
net: fix proc_fs init handling in af_packet and tls
nfc: pn533: convert comma to semicolon
af_vsock: Assign the vsock transport considering the vsock address flags
af_vsock: Set VMADDR_FLAG_TO_HOST flag on the receive path
vsock_addr: Check for supported flag values
vm_sockets: Add VMADDR_FLAG_TO_HOST vsock flag
vm_sockets: Add flags field in the vsock address data structure
net: Disable NETIF_F_HW_TLS_TX when HW_CSUM is disabled
tcp: Add logic to check for SYN w/ data in tcp_simple_retransmit
net: mscc: ocelot: install MAC addresses in .ndo_set_rx_mode from process context
nfc: s3fwrn5: Release the nfc firmware
net: vxget: clean up sparse warnings
mlxsw: spectrum_router: Use eXtended mezzanine to offload IPv4 router
mlxsw: spectrum: Set KVH XLT cache mode for Spectrum2/3
mlxsw: spectrum_router_xm: Introduce basic XM cache flushing
mlxsw: reg: Add Router LPM Cache Enable Register
mlxsw: reg: Add Router LPM Cache ML Delete Register
mlxsw: spectrum_router_xm: Implement L-value tracking for M-index
mlxsw: reg: Add XM Router M Table Register
...
|
|
Merge misc updates from Andrew Morton:
- a few random little subsystems
- almost all of the MM patches which are staged ahead of linux-next
material. I'll trickle to post-linux-next work in as the dependents
get merged up.
Subsystems affected by this patch series: kthread, kbuild, ide, ntfs,
ocfs2, arch, and mm (slab-generic, slab, slub, dax, debug, pagecache,
gup, swap, shmem, memcg, pagemap, mremap, hmm, vmalloc, documentation,
kasan, pagealloc, memory-failure, hugetlb, vmscan, z3fold, compaction,
oom-kill, migration, cma, page-poison, userfaultfd, zswap, zsmalloc,
uaccess, zram, and cleanups).
* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (200 commits)
mm: cleanup kstrto*() usage
mm: fix fall-through warnings for Clang
mm: slub: convert sysfs sprintf family to sysfs_emit/sysfs_emit_at
mm: shmem: convert shmem_enabled_show to use sysfs_emit_at
mm:backing-dev: use sysfs_emit in macro defining functions
mm: huge_memory: convert remaining use of sprintf to sysfs_emit and neatening
mm: use sysfs_emit for struct kobject * uses
mm: fix kernel-doc markups
zram: break the strict dependency from lzo
zram: add stat to gather incompressible pages since zram set up
zram: support page writeback
mm/process_vm_access: remove redundant initialization of iov_r
mm/zsmalloc.c: rework the list_add code in insert_zspage()
mm/zswap: move to use crypto_acomp API for hardware acceleration
mm/zswap: fix passing zero to 'PTR_ERR' warning
mm/zswap: make struct kernel_param_ops definitions const
userfaultfd/selftests: hint the test runner on required privilege
userfaultfd/selftests: fix retval check for userfaultfd_open()
userfaultfd/selftests: always dump something in modes
userfaultfd: selftests: make __{s,u}64 format specifiers portable
...
|
|
With this change, when the knob is set to 0, it allows unprivileged users
to call userfaultfd, like when it is set to 1, but with the restriction
that page faults from only user-mode can be handled. In this mode, an
unprivileged user (without SYS_CAP_PTRACE capability) must pass
UFFD_USER_MODE_ONLY to userfaultd or the API will fail with EPERM.
This enables administrators to reduce the likelihood that an attacker with
access to userfaultfd can delay faulting kernel code to widen timing
windows for other exploits.
The default value of this knob is changed to 0. This is required for
correct functioning of pipe mutex. However, this will fail postcopy live
migration, which will be unnoticeable to the VM guests. To avoid this,
set 'vm.userfault = 1' in /sys/sysctl.conf.
The main reason this change is desirable as in the short term is that the
Android userland will behave as with the sysctl set to zero. So without
this commit, any Linux binary using userfaultfd to manage its memory would
behave differently if run within the Android userland. For more details,
refer to Andrea's reply [1].
[1] https://lore.kernel.org/lkml/20200904033438.GI9411@redhat.com/
Link: https://lkml.kernel.org/r/20201120030411.2690816-3-lokeshgidra@google.com
Signed-off-by: Lokesh Gidra <lokeshgidra@google.com>
Reviewed-by: Andrea Arcangeli <aarcange@redhat.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Peter Xu <peterx@redhat.com>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Stephen Smalley <stephen.smalley.work@gmail.com>
Cc: Eric Biggers <ebiggers@kernel.org>
Cc: Daniel Colascione <dancol@dancol.org>
Cc: "Joel Fernandes (Google)" <joel@joelfernandes.org>
Cc: Kalesh Singh <kaleshsingh@google.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Jeff Vander Stoep <jeffv@google.com>
Cc: <calin@google.com>
Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
Cc: Shaohua Li <shli@fb.com>
Cc: Jerome Glisse <jglisse@redhat.com>
Cc: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Nitin Gupta <nigupta@nvidia.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Iurii Zaikin <yzaikin@google.com>
Cc: Luis Chamberlain <mcgrof@kernel.org>
Cc: Daniel Colascione <dancol@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Patch series "Control over userfaultfd kernel-fault handling", v6.
This patch series is split from [1]. The other series enables SELinux
support for userfaultfd file descriptors so that its creation and movement
can be controlled.
It has been demonstrated on various occasions that suspending kernel code
execution for an arbitrary amount of time at any access to userspace
memory (copy_from_user()/copy_to_user()/...) can be exploited to change
the intended behavior of the kernel. For instance, handling page faults
in kernel-mode using userfaultfd has been exploited in [2, 3]. Likewise,
FUSE, which is similar to userfaultfd in this respect, has been exploited
in [4, 5] for similar outcome.
This small patch series adds a new flag to userfaultfd(2) that allows
callers to give up the ability to handle kernel-mode faults with the
resulting UFFD file object. It then adds a 'user-mode only' option to the
unprivileged_userfaultfd sysctl knob to require unprivileged callers to
use this new flag.
The purpose of this new interface is to decrease the chance of an
unprivileged userfaultfd user taking advantage of userfaultfd to enhance
security vulnerabilities by lengthening the race window in kernel code.
[1] https://lore.kernel.org/lkml/20200211225547.235083-1-dancol@google.com/
[2] https://duasynt.com/blog/linux-kernel-heap-spray
[3] https://duasynt.com/blog/cve-2016-6187-heap-off-by-one-exploit
[4] https://googleprojectzero.blogspot.com/2016/06/exploiting-recursion-in-linux-kernel_20.html
[5] https://bugs.chromium.org/p/project-zero/issues/detail?id=808
This patch (of 2):
userfaultfd handles page faults from both user and kernel code. Add a new
UFFD_USER_MODE_ONLY flag for userfaultfd(2) that makes the resulting
userfaultfd object refuse to handle faults from kernel mode, treating
these faults as if SIGBUS were always raised, causing the kernel code to
fail with EFAULT.
A future patch adds a knob allowing administrators to give some processes
the ability to create userfaultfd file objects only if they pass
UFFD_USER_MODE_ONLY, reducing the likelihood that these processes will
exploit userfaultfd's ability to delay kernel page faults to open timing
windows for future exploits.
Link: https://lkml.kernel.org/r/20201120030411.2690816-1-lokeshgidra@google.com
Link: https://lkml.kernel.org/r/20201120030411.2690816-2-lokeshgidra@google.com
Signed-off-by: Daniel Colascione <dancol@google.com>
Signed-off-by: Lokesh Gidra <lokeshgidra@google.com>
Reviewed-by: Andrea Arcangeli <aarcange@redhat.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: <calin@google.com>
Cc: Daniel Colascione <dancol@dancol.org>
Cc: Eric Biggers <ebiggers@kernel.org>
Cc: Iurii Zaikin <yzaikin@google.com>
Cc: Jeff Vander Stoep <jeffv@google.com>
Cc: Jerome Glisse <jglisse@redhat.com>
Cc: "Joel Fernandes (Google)" <joel@joelfernandes.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Kalesh Singh <kaleshsingh@google.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Luis Chamberlain <mcgrof@kernel.org>
Cc: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
Cc: Nitin Gupta <nigupta@nvidia.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Shaohua Li <shli@fb.com>
Cc: Stephen Smalley <stephen.smalley.work@gmail.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
ARM is the only architecture that defines CONFIG_ARCH_HAS_HOLES_MEMORYMODEL
which in turn enables memmap_valid_within() function that is intended to
verify existence of struct page associated with a pfn when there are holes
in the memory map.
However, the ARCH_HAS_HOLES_MEMORYMODEL also enables HAVE_ARCH_PFN_VALID
and arch-specific pfn_valid() implementation that also deals with the holes
in the memory map.
The only two users of memmap_valid_within() call this function after
a call to pfn_valid() so the memmap_valid_within() check becomes redundant.
Remove CONFIG_ARCH_HAS_HOLES_MEMORYMODEL and memmap_valid_within() and rely
entirely on ARM's implementation of pfn_valid() that is now enabled
unconditionally.
Link: https://lkml.kernel.org/r/20201101170454.9567-9-rppt@kernel.org
Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Greg Ungerer <gerg@linux-m68k.org>
Cc: John Paul Adrian Glaubitz <glaubitz@physik.fu-berlin.de>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Matt Turner <mattst88@gmail.com>
Cc: Meelis Roos <mroos@linux.ee>
Cc: Michael Schmitz <schmitzmic@gmail.com>
Cc: Russell King <linux@armlinux.org.uk>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Vineet Gupta <vgupta@synopsys.com>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
As kernel expect to see only one of such mappings, any further operations
on the VMA-copy may be unexpected by the kernel. Maybe it's being on the
safe side, but there doesn't seem to be any expected use-case for this, so
restrict it now.
Link: https://lkml.kernel.org/r/20201013013416.390574-4-dima@arista.com
Fixes: commit e346b3813067 ("mm/mremap: add MREMAP_DONTUNMAP to mremap()")
Signed-off-by: Dmitry Safonov <dima@arista.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Brian Geffon <bgeffon@google.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Dan Carpenter <dan.carpenter@oracle.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Jiang <dave.jiang@intel.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Ralph Campbell <rcampbell@nvidia.com>
Cc: Russell King <linux@armlinux.org.uk>
Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vishal Verma <vishal.l.verma@intel.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
For many workloads, pagetable consumption is significant and it makes
sense to expose it in the memory.stat for the memory cgroups. However at
the moment, the pagetables are accounted per-zone. Converting them to
per-node and using the right interface will correctly account for the
memory cgroups as well.
[akpm@linux-foundation.org: export __mod_lruvec_page_state to modules for arch/mips/kvm/]
Link: https://lkml.kernel.org/r/20201130212541.2781790-3-shakeelb@google.com
Signed-off-by: Shakeel Butt <shakeelb@google.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Roman Gushchin <guro@fb.com>
Cc: Michal Hocko <mhocko@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Running stress-ng on ocfs2 completely fills the kernel log with 'max
lookup times reached, filesystem may have nested directories.'
Let's ratelimit this message as done with others in the code.
Test-case:
# mkfs.ocfs2 --mount local $DEV
# mount $DEV $MNT
# cd $MNT
# dmesg -C
# stress-ng --dirdeep 1 --dirdeep-ops 1000
# dmesg | grep -c 'max lookup times reached'
Before:
# dmesg -C
# stress-ng --dirdeep 1 --dirdeep-ops 1000
...
stress-ng: info: [11116] successful run completed in 3.03s
# dmesg | grep -c 'max lookup times reached'
967
After:
# dmesg -C
# stress-ng --dirdeep 1 --dirdeep-ops 1000
...
stress-ng: info: [739] successful run completed in 0.96s
# dmesg | grep -c 'max lookup times reached'
10
# dmesg
[ 259.086086] ocfs2_check_if_ancestor: 1990 callbacks suppressed
[ 259.086092] (stress-ng-dirde,740,1):ocfs2_check_if_ancestor:1091 max lookup times reached, filesystem may have nested directories, src inode: 18007, dest inode: 17940.
...
Link: https://lkml.kernel.org/r/20201001224417.478263-1-mfo@canonical.com
Signed-off-by: Mauricio Faria de Oliveira <mfo@canonical.com>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Cc: Mark Fasheh <mark@fasheh.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Changwei Ge <gechangwei@live.cn>
Cc: Gang He <ghe@suse.com>
Cc: Jun Piao <piaojun@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
A break is not needed if it is preceded by a goto
Link: https://lkml.kernel.org/r/20201019175216.2329-1-trix@redhat.com
Signed-off-by: Tom Rix <trix@redhat.com>
Acked-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Cc: Mark Fasheh <mark@fasheh.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Changwei Ge <gechangwei@live.cn>
Cc: Gang He <ghe@suse.com>
Cc: Jun Piao <piaojun@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
This variable isn't used anymore, remove it to skip W=1 warning:
fs/ntfs/inode.c:2350:6: warning: variable `attr_len' set but not used [-Wunused-but-set-variable]
Link: https://lkml.kernel.org/r/4194376f-898b-b602-81c3-210567712092@linux.alibaba.com
Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com>
Acked-by: Anton Altaparmakov <anton@tuxera.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
We actually don't use these varibles, so remove them to avoid gcc warning:
fs/ntfs/file.c:326:14: warning: variable `base_ni' set but not used [-Wunused-but-set-variable]
fs/ntfs/logfile.c:481:21: warning: variable `log_page_mask' set but not used [-Wunused-but-set-variable]
Link: https://lkml.kernel.org/r/1604821092-54631-1-git-send-email-alex.shi@linux.alibaba.com
Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com>
Acked-by: Anton Altaparmakov <anton@tuxera.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull kmap updates from Thomas Gleixner:
"The new preemtible kmap_local() implementation:
- Consolidate all kmap_atomic() internals into a generic
implementation which builds the base for the kmap_local() API and
make the kmap_atomic() interface wrappers which handle the
disabling/enabling of preemption and pagefaults.
- Switch the storage from per-CPU to per task and provide scheduler
support for clearing mapping when scheduling out and restoring them
when scheduling back in.
- Merge the migrate_disable/enable() code, which is also part of the
scheduler pull request. This was required to make the kmap_local()
interface available which does not disable preemption when a
mapping is established. It has to disable migration instead to
guarantee that the virtual address of the mapped slot is the same
across preemption.
- Provide better debug facilities: guard pages and enforced
utilization of the mapping mechanics on 64bit systems when the
architecture allows it.
- Provide the new kmap_local() API which can now be used to cleanup
the kmap_atomic() usage sites all over the place. Most of the usage
sites do not require the implicit disabling of preemption and
pagefaults so the penalty on 64bit and 32bit non-highmem systems is
removed and quite some of the code can be simplified. A wholesale
conversion is not possible because some usage depends on the
implicit side effects and some need to be cleaned up because they
work around these side effects.
The migrate disable side effect is only effective on highmem
systems and when enforced debugging is enabled. On 64bit and 32bit
non-highmem systems the overhead is completely avoided"
* tag 'core-mm-2020-12-14' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (33 commits)
ARM: highmem: Fix cache_is_vivt() reference
x86/crashdump/32: Simplify copy_oldmem_page()
io-mapping: Provide iomap_local variant
mm/highmem: Provide kmap_local*
sched: highmem: Store local kmaps in task struct
x86: Support kmap_local() forced debugging
mm/highmem: Provide CONFIG_DEBUG_KMAP_LOCAL_FORCE_MAP
mm/highmem: Provide and use CONFIG_DEBUG_KMAP_LOCAL
microblaze/mm/highmem: Add dropped #ifdef back
xtensa/mm/highmem: Make generic kmap_atomic() work correctly
mm/highmem: Take kmap_high_get() properly into account
highmem: High implementation details and document API
Documentation/io-mapping: Remove outdated blurb
io-mapping: Cleanup atomic iomap
mm/highmem: Remove the old kmap_atomic cruft
highmem: Get rid of kmap_types.h
xtensa/mm/highmem: Switch to generic kmap atomic
sparc/mm/highmem: Switch to generic kmap atomic
powerpc/mm/highmem: Switch to generic kmap atomic
nds32/mm/highmem: Switch to generic kmap atomic
...
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler updates from Thomas Gleixner:
- migrate_disable/enable() support which originates from the RT tree
and is now a prerequisite for the new preemptible kmap_local() API
which aims to replace kmap_atomic().
- A fair amount of topology and NUMA related improvements
- Improvements for the frequency invariant calculations
- Enhanced robustness for the global CPU priority tracking and decision
making
- The usual small fixes and enhancements all over the place
* tag 'sched-core-2020-12-14' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (61 commits)
sched/fair: Trivial correction of the newidle_balance() comment
sched/fair: Clear SMT siblings after determining the core is not idle
sched: Fix kernel-doc markup
x86: Print ratio freq_max/freq_base used in frequency invariance calculations
x86, sched: Use midpoint of max_boost and max_P for frequency invariance on AMD EPYC
x86, sched: Calculate frequency invariance for AMD systems
irq_work: Optimize irq_work_single()
smp: Cleanup smp_call_function*()
irq_work: Cleanup
sched: Limit the amount of NUMA imbalance that can exist at fork time
sched/numa: Allow a floating imbalance between NUMA nodes
sched: Avoid unnecessary calculation of load imbalance at clone time
sched/numa: Rename nr_running and break out the magic number
sched: Make migrate_disable/enable() independent of RT
sched/topology: Condition EAS enablement on FIE support
arm64: Rebuild sched domains on invariance status changes
sched/topology,schedutil: Wrap sched domains rebuild
sched/uclamp: Allow to reset a task uclamp constraint value
sched/core: Fix typos in comments
Documentation: scheduler: fix information on arch SD flags, sched_domain and sched_debug
...
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull core entry/exit updates from Thomas Gleixner:
"A set of updates for entry/exit handling:
- More generalization of entry/exit functionality
- The consolidation work to reclaim TIF flags on x86 and also for
non-x86 specific TIF flags which are solely relevant for syscall
related work and have been moved into their own storage space. The
x86 specific part had to be merged in to avoid a major conflict.
- The TIF_NOTIFY_SIGNAL work which replaces the inefficient signal
delivery mode of task work and results in an impressive performance
improvement for io_uring. The non-x86 consolidation of this is
going to come seperate via Jens.
- The selective syscall redirection facility which provides a clean
and efficient way to support the non-Linux syscalls of WINE by
catching them at syscall entry and redirecting them to the user
space emulation. This can be utilized for other purposes as well
and has been designed carefully to avoid overhead for the regular
fastpath. This includes the core changes and the x86 support code.
- Simplification of the context tracking entry/exit handling for the
users of the generic entry code which guarantee the proper ordering
and protection.
- Preparatory changes to make the generic entry code accomodate S390
specific requirements which are mostly related to their syscall
restart mechanism"
* tag 'core-entry-2020-12-14' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (36 commits)
entry: Add syscall_exit_to_user_mode_work()
entry: Add exit_to_user_mode() wrapper
entry_Add_enter_from_user_mode_wrapper
entry: Rename exit_to_user_mode()
entry: Rename enter_from_user_mode()
docs: Document Syscall User Dispatch
selftests: Add benchmark for syscall user dispatch
selftests: Add kselftest for syscall user dispatch
entry: Support Syscall User Dispatch on common syscall entry
kernel: Implement selective syscall userspace redirection
signal: Expose SYS_USER_DISPATCH si_code type
x86: vdso: Expose sigreturn address on vdso to the kernel
MAINTAINERS: Add entry for common entry code
entry: Fix boot for !CONFIG_GENERIC_ENTRY
x86: Support HAVE_CONTEXT_TRACKING_OFFSTACK
context_tracking: Only define schedule_user() on !HAVE_CONTEXT_TRACKING_OFFSTACK archs
sched: Detect call to schedule from critical entry code
context_tracking: Don't implement exception_enter/exit() on CONFIG_HAVE_CONTEXT_TRACKING_OFFSTACK
context_tracking: Introduce HAVE_CONTEXT_TRACKING_OFFSTACK
x86: Reclaim unused x86 TI flags
...
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux
Pull misc fixes from Christian Brauner:
"This contains several fixes which felt worth being combined into a
single branch:
- Use put_nsproxy() instead of open-coding it switch_task_namespaces()
- Kirill's work to unify lifecycle management for all namespaces. The
lifetime counters are used identically for all namespaces types.
Namespaces may of course have additional unrelated counters and
these are not altered. This work allows us to unify the type of the
counters and reduces maintenance cost by moving the counter in one
place and indicating that basic lifetime management is identical
for all namespaces.
- Peilin's fix adding three byte padding to Dmitry's
PTRACE_GET_SYSCALL_INFO uapi struct to prevent an info leak.
- Two smal patches to convert from the /* fall through */ comment
annotation to the fallthrough keyword annotation which I had taken
into my branch and into -next before df561f6688fe ("treewide: Use
fallthrough pseudo-keyword") made it upstream which fixed this
tree-wide.
Since I didn't want to invalidate all testing for other commits I
didn't rebase and kept them"
* tag 'fixes-v5.11' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux:
nsproxy: use put_nsproxy() in switch_task_namespaces()
sys: Convert to the new fallthrough notation
signal: Convert to the new fallthrough notation
time: Use generic ns_common::count
cgroup: Use generic ns_common::count
mnt: Use generic ns_common::count
user: Use generic ns_common::count
pid: Use generic ns_common::count
ipc: Use generic ns_common::count
uts: Use generic ns_common::count
net: Use generic ns_common::count
ns: Add a common refcount into ns_common
ptrace: Prevent kernel-infoleak in ptrace_get_syscall_info()
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux
Pull time namespace updates from Christian Brauner:
"When time namespaces were introduced we missed to virtualize the
'btime' field in /proc/stat. This confuses tasks which are in another
time namespace with a virtualized boottime which is common in some
container workloads. This contains Michael's series to fix 'btime'
which Thomas asked me to take through my tree.
To fix 'btime' virtualization we simply subtract the offset of the
time namespace's boottime from btime before printing the stats. Note
that since start_boottime of processes are seconds since boottime and
the boottime stamp is now shifted according to the time namespace's
offset, the offset of the time namespace also needs to be applied
before the process stats are given to userspace. This avoids that
processes shown by tools such as 'ps' appear as time travelers in the
corresponding time namespace.
Selftests are included to verify that btime virtualization in
/proc/stat works as expected"
* tag 'time-namespace-v5.11' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux:
namespace: make timens_on_fork() return nothing
selftests/timens: added selftest for /proc/stat btime
fs/proc: apply the time namespace offset to /proc/stat btime
timens: additional helper functions for boottime offset handling
|
|
Daniel Borkmann says:
====================
pull-request: bpf-next 2020-12-14
1) Expose bpf_sk_storage_*() helpers to iterator programs, from Florent Revest.
2) Add AF_XDP selftests based on veth devs to BPF selftests, from Weqaar Janjua.
3) Support for finding BTF based kernel attach targets through libbpf's
bpf_program__set_attach_target() API, from Andrii Nakryiko.
4) Permit pointers on stack for helper calls in the verifier, from Yonghong Song.
5) Fix overflows in hash map elem size after rlimit removal, from Eric Dumazet.
6) Get rid of direct invocation of llc in BPF selftests, from Andrew Delgadillo.
7) Fix xsk_recvmsg() to reorder socket state check before access, from Björn Töpel.
8) Add new libbpf API helper to retrieve ring buffer epoll fd, from Brendan Jackman.
9) Batch of minor BPF selftest improvements all over the place, from Florian Lehner,
KP Singh, Jiri Olsa and various others.
* https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (31 commits)
selftests/bpf: Add a test for ptr_to_map_value on stack for helper access
bpf: Permits pointers on stack for helper calls
libbpf: Expose libbpf ring_buffer epoll_fd
selftests/bpf: Add set_attach_target() API selftest for module target
libbpf: Support modules in bpf_program__set_attach_target() API
selftests/bpf: Silence ima_setup.sh when not running in verbose mode.
selftests/bpf: Drop the need for LLVM's llc
selftests/bpf: fix bpf_testmod.ko recompilation logic
samples/bpf: Fix possible hang in xdpsock with multiple threads
selftests/bpf: Make selftest compilation work on clang 11
selftests/bpf: Xsk selftests - adding xdpxceiver to .gitignore
selftests/bpf: Drop tcp-{client,server}.py from Makefile
selftests/bpf: Xsk selftests - Bi-directional Sockets - SKB, DRV
selftests/bpf: Xsk selftests - Socket Teardown - SKB, DRV
selftests/bpf: Xsk selftests - DRV POLL, NOPOLL
selftests/bpf: Xsk selftests - SKB POLL, NOPOLL
selftests/bpf: Xsk selftests framework
bpf: Only provide bpf_sock_from_file with CONFIG_NET
bpf: Return -ENOTSUPP when attaching to non-kernel BTF
xsk: Validate socket state in xsk_recvmsg, prior touching socket members
...
====================
Link: https://lore.kernel.org/r/20201214214316.20642-1-daniel@iogearbox.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 cleanups from Borislav Petkov:
"Another branch with a nicely negative diffstat, just the way I
like 'em:
- Remove all uses of TIF_IA32 and TIF_X32 and reclaim the two bits in
the end (Gabriel Krisman Bertazi)
- All kinds of minor cleanups all over the tree"
* tag 'x86_cleanups_for_v5.11' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (24 commits)
x86/ia32_signal: Propagate __user annotation properly
x86/alternative: Update text_poke_bp() kernel-doc comment
x86/PCI: Make a kernel-doc comment a normal one
x86/asm: Drop unused RDPID macro
x86/boot/compressed/64: Use TEST %reg,%reg instead of CMP $0,%reg
x86/head64: Remove duplicate include
x86/mm: Declare 'start' variable where it is used
x86/head/64: Remove unused GET_CR2_INTO() macro
x86/boot: Remove unused finalize_identity_maps()
x86/uaccess: Document copy_from_user_nmi()
x86/dumpstack: Make show_trace_log_lvl() static
x86/mtrr: Fix a kernel-doc markup
x86/setup: Remove unused MCA variables
x86, libnvdimm/test: Remove COPY_MC_TEST
x86: Reclaim TIF_IA32 and TIF_X32
x86/mm: Convert mmu context ia32_compat into a proper flags field
x86/elf: Use e_machine to check for x32/ia32 in setup_additional_pages()
elf: Expose ELF header on arch_setup_additional_pages()
x86/elf: Use e_machine to select start_thread for x32
elf: Expose ELF header in compat_start_thread()
...
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6
Pull crypto updates from Herbert Xu:
"API:
- Add speed testing on 1420-byte blocks for networking
Algorithms:
- Improve performance of chacha on ARM for network packets
- Improve performance of aegis128 on ARM for network packets
Drivers:
- Add support for Keem Bay OCS AES/SM4
- Add support for QAT 4xxx devices
- Enable crypto-engine retry mechanism in caam
- Enable support for crypto engine on sdm845 in qce
- Add HiSilicon PRNG driver support"
* 'linus' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6: (161 commits)
crypto: qat - add capability detection logic in qat_4xxx
crypto: qat - add AES-XTS support for QAT GEN4 devices
crypto: qat - add AES-CTR support for QAT GEN4 devices
crypto: atmel-i2c - select CONFIG_BITREVERSE
crypto: hisilicon/trng - replace atomic_add_return()
crypto: keembay - Add support for Keem Bay OCS AES/SM4
dt-bindings: Add Keem Bay OCS AES bindings
crypto: aegis128 - avoid spurious references crypto_aegis128_update_simd
crypto: seed - remove trailing semicolon in macro definition
crypto: x86/poly1305 - Use TEST %reg,%reg instead of CMP $0,%reg
crypto: x86/sha512 - Use TEST %reg,%reg instead of CMP $0,%reg
crypto: aesni - Use TEST %reg,%reg instead of CMP $0,%reg
crypto: cpt - Fix sparse warnings in cptpf
hwrng: ks-sa - Add dependency on IOMEM and OF
crypto: lib/blake2s - Move selftest prototype into header file
crypto: arm/aes-ce - work around Cortex-A57/A72 silion errata
crypto: ecdh - avoid unaligned accesses in ecdh_set_secret()
crypto: ccree - rework cache parameters handling
crypto: cavium - Use dma_set_mask_and_coherent to simplify code
crypto: marvell/octeontx - Use dma_set_mask_and_coherent to simplify code
...
|
|
git://git.kernel.org/pub/scm/fs/fscrypt/fscrypt
Pull fsverity updates from Eric Biggers:
"Some cleanups for fs-verity:
- Rename some names that have been causing confusion
- Move structs needed for file signing to the UAPI header"
* tag 'fsverity-for-linus' of git://git.kernel.org/pub/scm/fs/fscrypt/fscrypt:
fs-verity: move structs needed for file signing to UAPI header
fs-verity: rename "file measurement" to "file digest"
fs-verity: rename fsverity_signed_digest to fsverity_formatted_digest
fs-verity: remove filenames from file comments
|
|
Pull fscrypt updates from Eric Biggers:
"This release there are some fixes for longstanding problems, as well
as some cleanups:
- Fix a race condition where a duplicate filename could be created in
an encrypted directory if a syscall that creates a new filename
raced with the directory's encryption key being added.
- Allow deleting files that use an unsupported encryption policy.
- Simplify the locking for 'struct fscrypt_master_key'.
- Remove kernel-internal constants from the UAPI header.
As usual, all these patches have been in linux-next with no reported
issues, and I've tested them with xfstests"
* tag 'fscrypt-for-linus' of git://git.kernel.org/pub/scm/fs/fscrypt/fscrypt:
fscrypt: allow deleting files with unsupported encryption policy
fscrypt: unexport fscrypt_get_encryption_info()
fscrypt: move fscrypt_require_key() to fscrypt_private.h
fscrypt: move body of fscrypt_prepare_setattr() out-of-line
fscrypt: introduce fscrypt_prepare_readdir()
ext4: don't call fscrypt_get_encryption_info() from dx_show_leaf()
ubifs: remove ubifs_dir_open()
f2fs: remove f2fs_dir_open()
ext4: remove ext4_dir_open()
fscrypt: simplify master key locking
fscrypt: remove unnecessary calls to fscrypt_require_key()
ubifs: prevent creating duplicate encrypted filenames
f2fs: prevent creating duplicate encrypted filenames
ext4: prevent creating duplicate encrypted filenames
fscrypt: add fscrypt_is_nokey_name()
fscrypt: remove kernel-internal constants from UAPI header
|
|
Pull io_uring fixes from Jens Axboe:
"Two fixes in here, fixing issues introduced in this merge window"
* tag 'io_uring-5.10-2020-12-11' of git://git.kernel.dk/linux-block:
io_uring: fix file leak on error path of io ctx creation
io_uring: fix mis-seting personality's creds
|
|
xdp_return_frame_bulk() needs to pass a xdp_buff
to __xdp_return().
strlcpy got converted to strscpy but here it makes no
functional difference, so just keep the right code.
Conflicts:
net/netfilter/nf_tables_api.c
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/dlemoal/zonefs
Pull zonefs fix from Damien Le Moal:
"A single patch in this pull request to fix a BIO and page reference
leak when writing sequential zone files"
* tag 'zonefs-5.10-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/dlemoal/zonefs:
zonefs: fix page reference and BIO leak
|
|
When we try to visit the pagemap of a tagged userspace pointer, we find
that the start_vaddr is not correct because of the tag.
To fix it, we should untag the userspace pointers in pagemap_read().
I tested with 5.10-rc4 and the issue remains.
Explanation from Catalin in [1]:
"Arguably, that's a user-space bug since tagged file offsets were never
supported. In this case it's not even a tag at bit 56 as per the arm64
tagged address ABI but rather down to bit 47. You could say that the
problem is caused by the C library (malloc()) or whoever created the
tagged vaddr and passed it to this function. It's not a kernel
regression as we've never supported it.
Now, pagemap is a special case where the offset is usually not
generated as a classic file offset but rather derived by shifting a
user virtual address. I guess we can make a concession for pagemap
(only) and allow such offset with the tag at bit (56 - PAGE_SHIFT + 3)"
My test code is based on [2]:
A userspace pointer which has been tagged by 0xb4: 0xb400007662f541c8
userspace program:
uint64 OsLayer::VirtualToPhysical(void *vaddr) {
uint64 frame, paddr, pfnmask, pagemask;
int pagesize = sysconf(_SC_PAGESIZE);
off64_t off = ((uintptr_t)vaddr) / pagesize * 8; // off = 0xb400007662f541c8 / pagesize * 8 = 0x5a00003b317aa0
int fd = open(kPagemapPath, O_RDONLY);
...
if (lseek64(fd, off, SEEK_SET) != off || read(fd, &frame, 8) != 8) {
int err = errno;
string errtxt = ErrorString(err);
if (fd >= 0)
close(fd);
return 0;
}
...
}
kernel fs/proc/task_mmu.c:
static ssize_t pagemap_read(struct file *file, char __user *buf,
size_t count, loff_t *ppos)
{
...
src = *ppos;
svpfn = src / PM_ENTRY_BYTES; // svpfn == 0xb400007662f54
start_vaddr = svpfn << PAGE_SHIFT; // start_vaddr == 0xb400007662f54000
end_vaddr = mm->task_size;
/* watch out for wraparound */
// svpfn == 0xb400007662f54
// (mm->task_size >> PAGE) == 0x8000000
if (svpfn > mm->task_size >> PAGE_SHIFT) // the condition is true because of the tag 0xb4
start_vaddr = end_vaddr;
ret = 0;
while (count && (start_vaddr < end_vaddr)) { // we cannot visit correct entry because start_vaddr is set to end_vaddr
int len;
unsigned long end;
...
}
...
}
[1] https://lore.kernel.org/patchwork/patch/1343258/
[2] https://github.com/stressapptest/stressapptest/blob/master/src/os.cc#L158
Link: https://lkml.kernel.org/r/20201204024347.8295-1-miles.chen@mediatek.com
Signed-off-by: Miles Chen <miles.chen@mediatek.com>
Reviewed-by: Vincenzo Frascino <vincenzo.frascino@arm.com>
Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Andrey Konovalov <andreyknvl@google.com>
Cc: Alexander Potapenko <glider@google.com>
Cc: Vincenzo Frascino <vincenzo.frascino@arm.com>
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Marco Elver <elver@google.com>
Cc: Will Deacon <will@kernel.org>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: Song Bao Hua (Barry Song) <song.bao.hua@hisilicon.com>
Cc: <stable@vger.kernel.org> [5.4-]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Pull NFS client fixes from Anna Schumaker:
"Here are a handful more bugfixes for 5.10.
Unfortunately, we found some problems with the new READ_PLUS operation
that aren't easy to fix. We've decided to disable this codepath
through a Kconfig option for now, but a series of patches going into
5.11 will clean up the code and fix the issues at the same time. This
seemed like the best way to go about it.
Summary:
- Fix array overflow when flexfiles mirroring is enabled
- Fix rpcrdma_inline_fixup() crash with new LISTXATTRS
- Fix 5 second delay when doing inter-server copy
- Disable READ_PLUS by default"
* tag 'nfs-for-5.10-3' of git://git.linux-nfs.org/projects/anna/linux-nfs:
NFS: Disable READ_PLUS by default
NFSv4.2: Fix 5 seconds delay when doing inter server copy
NFS: Fix rpcrdma_inline_fixup() crash with new LISTXATTRS operation
pNFS/flexfiles: Fix array overflow when flexfiles mirroring is enabled
|
|
We've been seeing failures with xfstests generic/091 and generic/263
when using READ_PLUS. I've made some progress on these issues, and the
tests fail later on but still don't pass. Let's disable READ_PLUS by
default until we can work out what is going on.
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
|
|
Since commit b4868b44c5628 ("NFSv4: Wait for stateid updates after
CLOSE/OPEN_DOWNGRADE"), every inter server copy operation suffers 5
seconds delay regardless of the size of the copy. The delay is from
nfs_set_open_stateid_locked when the check by nfs_stateid_is_sequential
fails because the seqid in both nfs4_state and nfs4_stateid are 0.
Fix __nfs42_ssc_open to delay setting of NFS_OPEN_STATE in nfs4_state,
until after the call to update_open_stateid, to indicate this is the 1st
open. This fix is part of a 2 patches, the other patch is the fix in the
source server to return the stateid for COPY_NOTIFY request with seqid 1
instead of 0.
Fixes: ce0887ac96d3 ("NFSD add nfs4 inter ssc to nfsd4_copy")
Signed-off-by: Dai Ngo <dai.ngo@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
|
|
By switching to an XFS-backed export, I am able to reproduce the
ibcomp worker crash on my client with xfstests generic/013.
For the failing LISTXATTRS operation, xdr_inline_pages() is called
with page_len=12 and buflen=128.
- When ->send_request() is called, rpcrdma_marshal_req() does not
set up a Reply chunk because buflen is smaller than the inline
threshold. Thus rpcrdma_convert_iovs() does not get invoked at
all and the transport's XDRBUF_SPARSE_PAGES logic is not invoked
on the receive buffer.
- During reply processing, rpcrdma_inline_fixup() tries to copy
received data into rq_rcv_buf->pages because page_len is positive.
But there are no receive pages because rpcrdma_marshal_req() never
allocated them.
The result is that the ibcomp worker faults and dies. Sometimes that
causes a visible crash, and sometimes it results in a transport hang
without other symptoms.
RPC/RDMA's XDRBUF_SPARSE_PAGES support is not entirely correct, and
should eventually be fixed or replaced. However, my preference is
that upper-layer operations should explicitly allocate their receive
buffers (using GFP_KERNEL) when possible, rather than relying on
XDRBUF_SPARSE_PAGES.
Reported-by: Olga kornievskaia <kolga@netapp.com>
Suggested-by: Olga kornievskaia <kolga@netapp.com>
Fixes: c10a75145feb ("NFSv4.2: add the extended attribute proc functions.")
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-by: Olga kornievskaia <kolga@netapp.com>
Reviewed-by: Frank van der Linden <fllinden@amazon.com>
Tested-by: Olga kornievskaia <kolga@netapp.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
|
|
In zonefs_file_dio_append(), the pages obtained using
bio_iov_iter_get_pages() are not released on completion of the
REQ_OP_APPEND BIO, nor when bio_iov_iter_get_pages() fails.
Furthermore, a call to bio_put() is missing when
bio_iov_iter_get_pages() fails.
Fix these resource leaks by adding BIO resource release code (bio_put()i
and bio_release_pages()) at the end of the function after the BIO
execution and add a jump to this resource cleanup code in case of
bio_iov_iter_get_pages() failure.
While at it, also fix the call to task_io_account_write() to be passed
the correct BIO size instead of bio_iov_iter_get_pages() return value.
Reported-by: Christoph Hellwig <hch@lst.de>
Fixes: 02ef12a663c7 ("zonefs: use REQ_OP_ZONE_APPEND for sync DIO")
Cc: stable@vger.kernel.org
Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
|
|
Since btrfs scrub is utilizing its own infrastructure to submit
read/write, scrub is independent from all other routines.
This brings one very neat feature, allow us to read 4K data into offset
0 of a 64K page. So is the writeback routine.
This makes scrub on subpage sector size much easier to implement, and
thanks to previous commits which just changed the implementation to
always do scrub based on sector size, now scrub can handle subpage
filesystem without any problem.
This patch will just remove the restriction on
(sectorsize != PAGE_SIZE), to make scrub finally work on subpage
filesystems.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
Btrfs scrub is more flexible than buffered data write path, as we can
read an unaligned subpage data into page offset 0.
This ability makes subpage support much easier, we just need to check
each scrub_page::page_len and ensure we only calculate hash for [0,
page_len) of a page.
There is a small thing to notice: for subpage case, we still do sector
by sector scrub. This means we will submit a read bio for each sector
to scrub, resulting in the same amount of read bios, just like on the 4K
page systems.
This behavior can be considered as a good thing, if we want everything
to be the same as 4K page systems. But this also means, we're wasting
the possibility to submit larger bio using 64K page size. This is
another problem to consider in the future.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
To support subpage tree block scrub, scrub_checksum_tree_block() only
needs to learn 2 new tricks:
- Follow sector size
Now scrub_page only represents one sector, we need to follow it
properly.
- Run checksum on all sectors
Since scrub_page only represents one sector, we need to run checksum
on all sectors, not only (nodesize >> PAGE_SIZE).
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
For scrub_pages() and scrub_pages_for_parity(), we currently allocate
one scrub_page structure for one page.
This is fine if we only read/write one sector one time. But for cases
like scrubbing RAID56, we need to read/write the full stripe, which is
in 64K size for now.
For subpage size, we will submit the read in just one page, which is
normally a good thing, but for RAID56 case, it only expects to see one
sector, not the full stripe in its endio function.
This could lead to wrong parity checksum for RAID56 on subpage.
To make the existing code work well for subpage case, here we take a
shortcut by always allocating a full page for one sector.
This should provide the base to make RAID56 work for subpage case.
The cost is pretty obvious now, for one RAID56 stripe now we always need
16 pages. For support subpage situation (64K page size, 4K sector size),
this means we need full one megabyte to scrub just one RAID56 stripe.
And for data scrub, each 4K sector will also need one 64K page.
This is mostly just a workaround, the proper fix for this is a much
larger project, using scrub_block to replace scrub_page, and allow
scrub_block to handle multi pages, csums, and csum_bitmap to avoid
allocating one page for each sector.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
Btrfs on-disk format chose to use u64 for almost everything, but there
are a other restrictions that won't let us use more than u32 for things
like extent length (the maximum length is 128MiB for non-hole extents),
or stripe length (we have device number limit).
This means if we don't have extra handling to convert u64 to u32, we
will always have some questionable operations like
"u32 = u64 >> sectorsize_bits" in the code.
This patch will try to address the problem by reducing the width for the
following members/parameters:
- scrub_parity::stripe_len
- @len of scrub_pages()
- @extent_len of scrub_remap_extent()
- @len of scrub_parity_mark_sectors_error()
- @len of scrub_parity_mark_sectors_data()
- @len of scrub_extent()
- @len of scrub_pages_for_parity()
- @len of scrub_extent_for_parity()
For members extracted from on-disk structure, like map->stripe_len, they
will be kept as is. Since that modification would require on-disk format
change.
There will be cases like "u32 = u64 - u64" or "u32 = u64", for such call
sites, extra ASSERT() is added to be extra safe for debug builds.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
Refactor btrfs_lookup_bio_sums() by:
- Remove the @file_offset parameter
There are two factors making the @file_offset parameter useless:
* For csum lookup in csum tree, file offset makes no sense
We only need disk_bytenr, which is unrelated to file_offset
* page_offset (file offset) of each bvec is not contiguous.
Pages can be added to the same bio as long as their on-disk bytenr
is contiguous, meaning we could have pages at different file offsets
in the same bio.
Thus passing file_offset makes no sense any more.
The only user of file_offset is for data reloc inode, we will use
a new function, search_file_offset_in_bio(), to handle it.
- Extract the csum tree lookup into search_csum_tree()
The new function will handle the csum search in csum tree.
The return value is the same as btrfs_find_ordered_sum(), returning
the number of found sectors which have checksum.
- Change how we do the main loop
The only needed info from bio is:
* the on-disk bytenr
* the length
After extracting the above info, we can do the search without bio
at all, which makes the main loop much simpler:
for (cur_disk_bytenr = orig_disk_bytenr;
cur_disk_bytenr < orig_disk_bytenr + orig_len;
cur_disk_bytenr += count * sectorsize) {
/* Lookup csum tree */
count = search_csum_tree(fs_info, path, cur_disk_bytenr,
search_len, csum_dst);
if (!count) {
/* Csum hole handling */
}
}
- Use single variable as the source to calculate all other offsets
Instead of all different type of variables, we use only one main
variable, cur_disk_bytenr, which represents the current disk bytenr.
All involved values can be calculated from that variable, and
all those variable will only be visible in the inner loop.
The above refactoring makes btrfs_lookup_bio_sums() way more robust than
it used to be, especially related to the file offset lookup. Now
file_offset lookup is only related to data reloc inode, otherwise we
don't need to bother file_offset at all.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
The function btrfs_lookup_bio_sums() is only called for read bios.
While btrfs_find_ordered_sum() is to search ordered extent sums, which
is only for write path.
This means to read a page we either:
- Submit read bio if it's not uptodate
This means we only need to search csum tree for checksums.
- The page is already uptodate
It can be marked uptodate for previous read, or being marked dirty.
As we always mark page uptodate for dirty page.
In that case, we don't need to submit read bio at all, thus no need
to search any checksums.
Remove the btrfs_find_ordered_sum() call in btrfs_lookup_bio_sums().
And since btrfs_lookup_bio_sums() is the only caller for
btrfs_find_ordered_sum(), also remove the implementation.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
To support sectorsize < PAGE_SIZE case, we need to take extra care of
extent buffer accessors.
Since sectorsize is smaller than PAGE_SIZE, one page can contain
multiple tree blocks, we must use eb->start to determine the real offset
to read/write for extent buffer accessors.
This patch introduces two helpers to do this:
- get_eb_page_index()
This is to calculate the index to access extent_buffer::pages.
It's just a simple wrapper around "start >> PAGE_SHIFT".
For sectorsize == PAGE_SIZE case, nothing is changed.
For sectorsize < PAGE_SIZE case, we always get index as 0, and
the existing page shift also works.
- get_eb_offset_in_page()
This is to calculate the offset to access extent_buffer::pages.
This needs to take extent_buffer::start into consideration.
For sectorsize == PAGE_SIZE case, extent_buffer::start is always
aligned to PAGE_SIZE, thus adding extent_buffer::start to
offset_in_page() won't change the result.
For sectorsize < PAGE_SIZE case, adding extent_buffer::start gives
us the correct offset to access.
This patch will touch the following parts to cover all extent buffer
accessors:
- BTRFS_SETGET_HEADER_FUNCS()
- read_extent_buffer()
- read_extent_buffer_to_user()
- memcmp_extent_buffer()
- write_extent_buffer_chunk_tree_uuid()
- write_extent_buffer_fsid()
- write_extent_buffer()
- memzero_extent_buffer()
- copy_extent_buffer_full()
- copy_extent_buffer()
- memcpy_extent_buffer()
- memmove_extent_buffer()
- btrfs_get_token_##bits()
- btrfs_get_##bits()
- btrfs_set_token_##bits()
- btrfs_set_##bits()
- generic_bin_search()
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
For subpage sized extent buffer, we have ensured no extent buffer will
cross page boundary, thus we would only need one page for any extent
buffer.
Update function num_extent_pages to handle such case. Now
num_extent_pages() returns 1 for subpage sized extent buffer.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
As a preparation for subpage sector size support (allowing filesystem
with sector size smaller than page size to be mounted) if the sector
size is smaller than page size, we don't allow tree block to be read if
it crosses 64K(*) boundary.
The 64K is selected because:
- we are only going to support 64K page size for subpage for now
- 64K is also the maximum supported node size
This ensures that tree blocks are always contained in one page for a
system with 64K page size, which can greatly simplify the handling.
Otherwise we would have to do complex multi-page handling of tree
blocks. Currently there is no way to create such tree blocks.
In kernel we have avoided such tree blocks allocation even on 4K page
size, as it can lead to RAID56 stripe scrubbing.
While btrfs-progs have fixed its chunk allocator since 2016 for convert,
and has extra checks to do the same behavior as the kernel.
Just add such graceful checks in case of an ancient filesystem.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
Btrfs only support 64K as maximum node size, thus for 4K page system, we
would have at most 16 pages for one extent buffer.
For a system using 64K page size, we would really have just one page.
While we always use 16 pages for extent_buffer::pages, this means for
systems using 64K pages, we are wasting memory for 15 page pointers
which will never be used.
Calculate the array size based on page size and the node size maximum.
- for systems using 4K page size, it will stay 16 pages
- for systems using 64K page size, it will be 1 page
Move the definition of BTRFS_MAX_METADATA_BLOCKSIZE to btrfs_tree.h, to
avoid circular inclusion of ctree.h.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
In btree_write_cache_pages() we have a btree page submission routine
buried deeply in a nested loop.
This patch will extract that part of code into a helper function,
submit_eb_page(), to do the same work.
Since submit_eb_page() now can return >0 for successful extent
buffer submission, remove the "ASSERT(ret <= 0);" line.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
Currently btrfs_verify_data_csum() just passes the whole page to
check_data_csum(), which is fine since we only support sectorsize ==
PAGE_SIZE.
To support subpage, we need to properly honor per-sector
checksum verification, just like what we did in dio read path.
This patch will do the csum verification in a for loop, starts with
pg_off == start - page_offset(page), with sectorsize increase for
each loop.
For sectorsize == PAGE_SIZE case, the pg_off will always be 0, and we
will only loop once.
For subpage case, we do the iterate over each sector and if we found any
error, we return error.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
Parameter icsum for check_data_csum() is a little hard to understand.
So is the phy_offset for btrfs_verify_data_csum().
Both parameters are calculated values for csum lookup.
Instead of some calculated value, just pass bio_offset and let the
final and only user, check_data_csum(), calculate whatever it needs.
Since we are here, also make the bio_offset parameter and some related
variables to be u32 (unsigned int).
As bio size is limited by its bi_size, which is unsigned int, and has
extra size limit check during various bio operations.
Thus we are ensured that bio_offset won't overflow u32.
Thus for all involved functions, not only rename the parameter from
@phy_offset to @bio_offset, but also reduce its width to u32, so we
won't have suspicious "u32 = u64 >> sector_bits;" lines anymore.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
The parameter bio_offset of extent_submit_bio_start_t is very confusing.
If it's really bio_offset (offset to bio), then it should be u32. But
in fact, it's only utilized by dio read, and that member is used as file
offset, which must be u64.
Rename it to dio_file_offset since the only user uses it as file offset,
and add comment for who is using it.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
A lock dependency loop exists between the root tree lock, the extent tree
lock, and the free space tree lock.
The root tree lock depends on the free space tree lock because
btrfs_create_tree holds the new tree's lock while adding it to the root
tree.
The extent tree lock depends on the root tree lock because during
umount, we write out space cache v1, which writes inodes in the root
tree, which results in holding the root tree lock while doing a lookup
in the extent tree.
Finally, the free space tree depends on the extent tree because
populate_free_space_tree holds a locked path in the extent tree and then
does a lookup in the free space tree to add the new item.
The simplest of the three to break is the one during tree creation: we
unlock the leaf before inserting the tree node into the root tree, which
fixes the lockdep warning.
[30.480136] ======================================================
[30.480830] WARNING: possible circular locking dependency detected
[30.481457] 5.9.0-rc8+ #76 Not tainted
[30.481897] ------------------------------------------------------
[30.482500] mount/520 is trying to acquire lock:
[30.483064] ffff9babebe03908 (btrfs-free-space-00){++++}-{3:3}, at: __btrfs_tree_read_lock+0x39/0x180
[30.484054]
but task is already holding lock:
[30.484637] ffff9babebe24468 (btrfs-extent-01#2){++++}-{3:3}, at: __btrfs_tree_read_lock+0x39/0x180
[30.485581]
which lock already depends on the new lock.
[30.486397]
the existing dependency chain (in reverse order) is:
[30.487205]
-> #2 (btrfs-extent-01#2){++++}-{3:3}:
[30.487825] down_read_nested+0x43/0x150
[30.488306] __btrfs_tree_read_lock+0x39/0x180
[30.488868] __btrfs_read_lock_root_node+0x3a/0x50
[30.489477] btrfs_search_slot+0x464/0x9b0
[30.490009] check_committed_ref+0x59/0x1d0
[30.490603] btrfs_cross_ref_exist+0x65/0xb0
[30.491108] run_delalloc_nocow+0x405/0x930
[30.491651] btrfs_run_delalloc_range+0x60/0x6b0
[30.492203] writepage_delalloc+0xd4/0x150
[30.492688] __extent_writepage+0x18d/0x3a0
[30.493199] extent_write_cache_pages+0x2af/0x450
[30.493743] extent_writepages+0x34/0x70
[30.494231] do_writepages+0x31/0xd0
[30.494642] __filemap_fdatawrite_range+0xad/0xe0
[30.495194] btrfs_fdatawrite_range+0x1b/0x50
[30.495677] __btrfs_write_out_cache+0x40d/0x460
[30.496227] btrfs_write_out_cache+0x8b/0x110
[30.496716] btrfs_start_dirty_block_groups+0x211/0x4e0
[30.497317] btrfs_commit_transaction+0xc0/0xba0
[30.497861] sync_filesystem+0x71/0x90
[30.498303] btrfs_remount+0x81/0x433
[30.498767] reconfigure_super+0x9f/0x210
[30.499261] path_mount+0x9d1/0xa30
[30.499722] do_mount+0x55/0x70
[30.500158] __x64_sys_mount+0xc4/0xe0
[30.500616] do_syscall_64+0x33/0x40
[30.501091] entry_SYSCALL_64_after_hwframe+0x44/0xa9
[30.501629]
-> #1 (btrfs-root-00){++++}-{3:3}:
[30.502241] down_read_nested+0x43/0x150
[30.502727] __btrfs_tree_read_lock+0x39/0x180
[30.503291] __btrfs_read_lock_root_node+0x3a/0x50
[30.503903] btrfs_search_slot+0x464/0x9b0
[30.504405] btrfs_insert_empty_items+0x60/0xa0
[30.504973] btrfs_insert_item+0x60/0xd0
[30.505412] btrfs_create_tree+0x1b6/0x210
[30.505913] btrfs_create_free_space_tree+0x54/0x110
[30.506460] btrfs_mount_rw+0x15d/0x20f
[30.506937] btrfs_remount+0x356/0x433
[30.507369] reconfigure_super+0x9f/0x210
[30.507868] path_mount+0x9d1/0xa30
[30.508264] do_mount+0x55/0x70
[30.508668] __x64_sys_mount+0xc4/0xe0
[30.509186] do_syscall_64+0x33/0x40
[30.509652] entry_SYSCALL_64_after_hwframe+0x44/0xa9
[30.510271]
-> #0 (btrfs-free-space-00){++++}-{3:3}:
[30.510972] __lock_acquire+0x11ad/0x1b60
[30.511432] lock_acquire+0xa2/0x360
[30.511917] down_read_nested+0x43/0x150
[30.512383] __btrfs_tree_read_lock+0x39/0x180
[30.512947] __btrfs_read_lock_root_node+0x3a/0x50
[30.513455] btrfs_search_slot+0x464/0x9b0
[30.513947] search_free_space_info+0x45/0x90
[30.514465] __add_to_free_space_tree+0x92/0x39d
[30.515010] btrfs_create_free_space_tree.cold.22+0x1ee/0x45d
[30.515639] btrfs_mount_rw+0x15d/0x20f
[30.516142] btrfs_remount+0x356/0x433
[30.516538] reconfigure_super+0x9f/0x210
[30.517065] path_mount+0x9d1/0xa30
[30.517438] do_mount+0x55/0x70
[30.517824] __x64_sys_mount+0xc4/0xe0
[30.518293] do_syscall_64+0x33/0x40
[30.518776] entry_SYSCALL_64_after_hwframe+0x44/0xa9
[30.519335]
other info that might help us debug this:
[30.520210] Chain exists of:
btrfs-free-space-00 --> btrfs-root-00 --> btrfs-extent-01#2
[30.521407] Possible unsafe locking scenario:
[30.522037] CPU0 CPU1
[30.522456] ---- ----
[30.522941] lock(btrfs-extent-01#2);
[30.523311] lock(btrfs-root-00);
[30.523952] lock(btrfs-extent-01#2);
[30.524620] lock(btrfs-free-space-00);
[30.525068]
*** DEADLOCK ***
[30.525669] 5 locks held by mount/520:
[30.526116] #0: ffff9babebc520e0 (&type->s_umount_key#37){+.+.}-{3:3}, at: path_mount+0x7ef/0xa30
[30.527056] #1: ffff9babebc52640 (sb_internal#2){.+.+}-{0:0}, at: start_transaction+0x3d5/0x5c0
[30.527960] #2: ffff9babeae8f2e8 (&cache->free_space_lock#2){+.+.}-{3:3}, at: btrfs_create_free_space_tree.cold.22+0x101/0x45d
[30.529118] #3: ffff9babebe24468 (btrfs-extent-01#2){++++}-{3:3}, at: __btrfs_tree_read_lock+0x39/0x180
[30.530113] #4: ffff9babebd52eb8 (btrfs-extent-00){++++}-{3:3}, at: btrfs_try_tree_read_lock+0x16/0x100
[30.531124]
stack backtrace:
[30.531528] CPU: 0 PID: 520 Comm: mount Not tainted 5.9.0-rc8+ #76
[30.532166] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.11.1-4.module_el8.1.0+248+298dec18 04/01/2014
[30.533215] Call Trace:
[30.533452] dump_stack+0x8d/0xc0
[30.533797] check_noncircular+0x13c/0x150
[30.534233] __lock_acquire+0x11ad/0x1b60
[30.534667] lock_acquire+0xa2/0x360
[30.535063] ? __btrfs_tree_read_lock+0x39/0x180
[30.535525] down_read_nested+0x43/0x150
[30.535939] ? __btrfs_tree_read_lock+0x39/0x180
[30.536400] __btrfs_tree_read_lock+0x39/0x180
[30.536862] __btrfs_read_lock_root_node+0x3a/0x50
[30.537304] btrfs_search_slot+0x464/0x9b0
[30.537713] ? trace_hardirqs_on+0x1c/0xf0
[30.538148] search_free_space_info+0x45/0x90
[30.538572] __add_to_free_space_tree+0x92/0x39d
[30.539071] ? printk+0x48/0x4a
[30.539367] btrfs_create_free_space_tree.cold.22+0x1ee/0x45d
[30.539972] btrfs_mount_rw+0x15d/0x20f
[30.540350] btrfs_remount+0x356/0x433
[30.540773] ? shrink_dcache_sb+0xd9/0x100
[30.541203] reconfigure_super+0x9f/0x210
[30.541642] path_mount+0x9d1/0xa30
[30.542040] do_mount+0x55/0x70
[30.542366] __x64_sys_mount+0xc4/0xe0
[30.542822] do_syscall_64+0x33/0x40
[30.543197] entry_SYSCALL_64_after_hwframe+0x44/0xa9
[30.543691] RIP: 0033:0x7f109f7ab93a
[30.546042] RSP: 002b:00007ffc47c4f858 EFLAGS: 00000246 ORIG_RAX: 00000000000000a5
[30.546770] RAX: ffffffffffffffda RBX: 00007f109f8cf264 RCX: 00007f109f7ab93a
[30.547485] RDX: 0000557e6fc10770 RSI: 0000557e6fc19cf0 RDI: 0000557e6fc19cd0
[30.548185] RBP: 0000557e6fc10520 R08: 0000557e6fc18e30 R09: 0000557e6fc18cb0
[30.548911] R10: 0000000000200020 R11: 0000000000000246 R12: 0000000000000000
[30.549606] R13: 0000557e6fc19cd0 R14: 0000557e6fc10770 R15: 0000557e6fc10520
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Boris Burkov <boris@bur.io>
Signed-off-by: David Sterba <dsterba@suse.com>
|