Age  Commit message  Author
2018-04-05  mm/page_poison.c: make early_page_poison_param() __init  Dou Liyang
Functions registered with early_param() are only called during kernel initialization, so Linux marks them with the __init macro to save memory. early_page_poison_param() was missed, so mark it __init as well. Link: http://lkml.kernel.org/r/20180117034757.27024-1-douly.fnst@cn.fujitsu.com Signed-off-by: Dou Liyang <douly.fnst@cn.fujitsu.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Philippe Ombredanne <pombredanne@nexb.com> Cc: Kate Stewart <kstewart@linuxfoundation.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-04-05  mm/page_owner.c: make early_page_owner_param() __init  Dou Liyang
Functions registered with early_param() are only called during kernel initialization, so Linux marks them with the __init macro to save memory. early_page_owner_param() was missed, so mark it __init as well. Link: http://lkml.kernel.org/r/20180117034736.26963-1-douly.fnst@cn.fujitsu.com Signed-off-by: Dou Liyang <douly.fnst@cn.fujitsu.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Michal Hocko <mhocko@suse.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-04-05  mm/kmemleak.c: make kmemleak_boot_config() __init  Dou Liyang
Functions registered with early_param() are only called during kernel initialization, so Linux marks them with the __init macro to save memory. kmemleak_boot_config() was missed, so mark it __init as well. Link: http://lkml.kernel.org/r/20180117034720.26897-1-douly.fnst@cn.fujitsu.com Signed-off-by: Dou Liyang <douly.fnst@cn.fujitsu.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: Catalin Marinas <catalin.marinas@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-04-05  mm: swap: unify cluster-based and vma-based swap readahead  Minchan Kim
This patch makes do_swap_page() not need to be aware of two different swap readahead algorithms. Just unify cluster-based and vma-based readahead function call. Link: http://lkml.kernel.org/r/1509520520-32367-3-git-send-email-minchan@kernel.org Link: http://lkml.kernel.org/r/20180220085249.151400-3-minchan@kernel.org Signed-off-by: Minchan Kim <minchan@kernel.org> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: Hugh Dickins <hughd@google.com> Cc: Huang Ying <ying.huang@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-04-05  mm: swap: clean up swap readahead  Minchan Kim
Looking at the recent swap readahead changes, I am very unhappy with the current code structure, which diverges into two swap readahead algorithms in do_swap_page(). This patch cleans that up. The main motivation is that the fault handler doesn't need to be aware of readahead algorithms; it should just call swapin_readahead. As a first step, this patch cleans things up a little but not completely (it is split out to make review easier), so the next patch will complete the goal. [minchan@kernel.org: do not check readahead flag with THP anon] Link: http://lkml.kernel.org/r/874lm83zho.fsf@yhuang-dev.intel.com Link: http://lkml.kernel.org/r/20180227232611.169883-1-minchan@kernel.org Link: http://lkml.kernel.org/r/1509520520-32367-2-git-send-email-minchan@kernel.org Link: http://lkml.kernel.org/r/20180220085249.151400-2-minchan@kernel.org Signed-off-by: Minchan Kim <minchan@kernel.org> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: Hugh Dickins <hughd@google.com> Cc: Huang Ying <ying.huang@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-04-05  mm,vmscan: don't pretend forward progress upon shrinker_rwsem contention  Tetsuo Handa
Since we no longer use return value of shrink_slab() for normal reclaim, the comment is no longer true. If some do_shrink_slab() call takes unexpectedly long (root cause of stall is currently unknown) when register_shrinker()/unregister_shrinker() is pending, trying to drop caches via /proc/sys/vm/drop_caches could become infinite cond_resched() loop if many mem_cgroup are defined. For safety, let's not pretend forward progress. Link: http://lkml.kernel.org/r/201802202229.GGF26507.LVFtMSOOHFJOQF@I-love.SAKURA.ne.jp Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Acked-by: Michal Hocko <mhocko@suse.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: Dave Chinner <dchinner@redhat.com> Cc: Glauber Costa <glommer@gmail.com> Cc: Mel Gorman <mgorman@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
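The effect described in the commit above can be illustrated with a toy model. This is only a sketch: the retry loop shape, the threshold of 10 and the names `shrink_slab_model` and `drop_slab_model` are assumptions made for illustration, not kernel code. It shows why reporting one unit of "progress" per memcg while the lock is contended can keep a drop_caches-style caller looping once many memcgs exist.

```python
def shrink_slab_model(lock_available, pretend_progress):
    """Return the number of objects 'freed' for one memcg in this toy model.

    When the shrinker lock cannot be taken, the old behaviour is modelled as
    reporting 1 (pretended progress); the new behaviour reports 0.
    """
    if not lock_available:
        return 1 if pretend_progress else 0
    return 0  # nothing left to free in this toy model


def drop_slab_model(num_memcgs, pretend_progress, threshold=10, max_rounds=5):
    """Count how many rounds the caller loops while the lock stays contended.

    max_rounds is an artificial cap so the model terminates; the scenario in
    the commit message has no such cap.
    """
    rounds = 0
    while rounds < max_rounds:
        rounds += 1
        freed = sum(shrink_slab_model(False, pretend_progress)
                    for _ in range(num_memcgs))
        if freed <= threshold:  # caller stops once too little was freed
            break
    return rounds
```

With 100 modelled memcgs and pretended progress, the loop only stops at the artificial cap; with the lock contention reported as zero progress, it stops after one round.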
2018-04-05  z3fold: limit use of stale list for allocation  Vitaly Wool
Currently if z3fold couldn't find an unbuddied page it would first try to pull a page off the stale list. The problem with this approach is that we can't 100% guarantee that the page is not processed by the workqueue thread at the same time unless we run cancel_work_sync() on it, which we can't do if we're in an atomic context. So let's just limit stale list usage to non-atomic contexts only. Link: http://lkml.kernel.org/r/47ab51e7-e9c1-d30e-ab17-f734dbc3abce@gmail.com Signed-off-by: Vitaly Vul <vitaly.vul@sony.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: <Oleksiy.Avramchenko@sony.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-04-05  mm/huge_memory.c: reorder operations in __split_huge_page_tail()  Konstantin Khlebnikov
THP split changes tail page flags non-atomically. This is almost OK because tail pages are locked and isolated, but it breaks recent changes in page locking: a non-atomic operation could clear the PG_waiters bit. As a result, the concurrent sequence get_page_unless_zero() -> lock_page() might block forever, especially if the page was truncated later. The fix is trivial: clone flags before unfreezing the page reference counter. This race has existed since commit 62906027091f ("mm: add PageWaiters indicating tasks are waiting for a page bit"), while the unsafe unfreeze itself was added in commit 8df651c7059e ("thp: cleanup split_huge_page()"). clear_compound_head() must also be called before unfreezing the page reference, because a successful get_page_unless_zero() might be followed by put_page(), which needs a correct compound_head(). Also replace page_ref_inc()/page_ref_add() with page_ref_unfreeze(), which is made especially for that and has the semantics of smp_store_release(). Link: http://lkml.kernel.org/r/151844393341.210639.13162088407980624477.stgit@buzz Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-04-05  mm/page_ref: use atomic_set_release in page_ref_unfreeze  Konstantin Khlebnikov
page_ref_unfreeze() has exactly that semantic. No functional changes: just minus one barrier and proper handling of PPro errata. Link: http://lkml.kernel.org/r/151844393004.210639.4672319312617954272.stgit@buzz Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-04-05  mm: fix races between address_space dereference and free in page_evicatable  Huang Ying
When page_mapping() is called and the mapping is dereferenced in page_evictable() through shrink_active_list(), it is possible for the inode to be truncated and the embedded address space to be freed at the same time. This may lead to the following race.
    CPU1                                                CPU2
    truncate(inode)                                     shrink_active_list()
      ...                                                 page_evictable(page)
      truncate_inode_page(mapping, page);
        delete_from_page_cache(page)
          spin_lock_irqsave(&mapping->tree_lock, flags);
            __delete_from_page_cache(page, NULL)
              page_cache_tree_delete(..)
                ...                                         mapping = page_mapping(page);
                page->mapping = NULL;
                ...
          spin_unlock_irqrestore(&mapping->tree_lock, flags);
          page_cache_free_page(mapping, page)
            put_page(page)
              if (put_page_testzero(page)) -> false
    - inode now has no pages and can be freed including embedded address_space
                                                          mapping_unevictable(mapping)
                                                            test_bit(AS_UNEVICTABLE, &mapping->flags);
    - we've dereferenced mapping which is potentially already free.
A similar race exists between swap cache freeing and page_evictable() too. The address_space in the inode and in the swap cache is freed after an RCU grace period, so the races are fixed by enclosing the page_mapping() and address_space usage in rcu_read_lock()/rcu_read_unlock(). Some comments are added to the code to make clear what is protected by the RCU read lock. Link: http://lkml.kernel.org/r/20180212081227.1940-1-ying.huang@intel.com Signed-off-by: "Huang, Ying" <ying.huang@intel.com> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Minchan Kim <minchan@kernel.org> Cc: "Huang, Ying" <ying.huang@intel.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@suse.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-04-05  mm: reuse DEFINE_SHOW_ATTRIBUTE() macro  Andy Shevchenko
...instead of open coding file operations followed by custom ->open() callbacks per each attribute. [andriy.shevchenko@linux.intel.com: add tags, fix compilation issue] Link: http://lkml.kernel.org/r/20180217144253.58604-1-andriy.shevchenko@linux.intel.com Link: http://lkml.kernel.org/r/20180214154644.54505-1-andriy.shevchenko@linux.intel.com Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Reviewed-by: Matthew Wilcox <mawilcox@microsoft.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Acked-by: Christoph Lameter <cl@linux.com> Cc: Tejun Heo <tj@kernel.org> Cc: Dennis Zhou <dennisszhou@gmail.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Nitin Gupta <ngupta@vflare.org> Cc: Sergey Senozhatsky <sergey.senozhatsky.work@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-04-05  mm, page_alloc: move mirrored_kernelcore to __meminitdata  David Rientjes
mirrored_kernelcore can be in __meminitdata, so move it there. At the same time, fixup section specifiers to be after the name of the variable per checkpatch. Link: http://lkml.kernel.org/r/alpine.DEB.2.10.1802121623280.179479@chino.kir.corp.google.com Signed-off-by: David Rientjes <rientjes@google.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Mel Gorman <mgorman@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-04-05  mm, page_alloc: extend kernelcore and movablecore for percent  David Rientjes
Both kernelcore= and movablecore= can be used to define the amount of ZONE_NORMAL and ZONE_MOVABLE on a system, respectively. This requires the system memory capacity to be known when specifying the command line, however. This introduces the ability to define both kernelcore= and movablecore= as a percentage of total system memory. This is convenient for systems software that wants to define the amount of ZONE_MOVABLE, for example, as a proportion of a system's memory rather than a hardcoded byte value. To define the percentage, the final character of the parameter should be a '%'.
mhocko: "why is anyone using these options nowadays?"
rientjes:
:
: Fragmentation of non-__GFP_MOVABLE pages due to low on memory
: situations can pollute most pageblocks on the system, as much as 1GB of
: slab being fragmented over 128GB of memory, for example. When the
: amount of kernel memory is well bounded for certain systems, it is
: better to aggressively reclaim from existing MIGRATE_UNMOVABLE
: pageblocks rather than eagerly fallback to others.
:
: We have additional patches that help with this fragmentation if you're
: interested, specifically kcompactd compaction of MIGRATE_UNMOVABLE
: pageblocks triggered by fallback of non-__GFP_MOVABLE allocations and
: draining of pcp lists back to the zone free area to prevent stranding.
[rientjes@google.com: updates] Link: http://lkml.kernel.org/r/alpine.DEB.2.10.1802131700160.71590@chino.kir.corp.google.com Link: http://lkml.kernel.org/r/alpine.DEB.2.10.1802121622470.179479@chino.kir.corp.google.com Signed-off-by: David Rientjes <rientjes@google.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Mel Gorman <mgorman@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
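To make the percentage form concrete, here is a minimal Python sketch of how a value such as `25%` could be turned into a page count. It is a userspace model, not the kernel parser: `parse_core_param`, the 4K page size default and the K/M/G suffix handling for the non-percentage form are assumptions for illustration.

```python
def parse_core_param(value, total_pages, page_size=4096):
    """Return the requested number of pages for a kernelcore=/movablecore= style value.

    A trailing '%' means a percentage of total memory. Otherwise the value is
    treated as a byte count with an optional K/M/G suffix (assumed syntax).
    """
    value = value.strip()
    if value.endswith("%"):
        percent = int(value[:-1])
        return total_pages * percent // 100
    suffixes = {"K": 1 << 10, "M": 1 << 20, "G": 1 << 30}
    last = value[-1].upper()
    if last in suffixes:
        return int(value[:-1]) * suffixes[last] // page_size
    return int(value) // page_size
```

For example, on a modelled 128G machine with 4K pages, `parse_core_param("25%", (128 << 30) // 4096)` gives the same page count regardless of how the absolute size would have to be written, which is the convenience the commit message describes.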
2018-04-05  mm: hwpoison: disable memory error handling on 1GB hugepage  Naoya Horiguchi
Recently the following BUG was reported:
    Injecting memory failure for pfn 0x3c0000 at process virtual address 0x7fe300000000
    Memory failure: 0x3c0000: recovery action for huge page: Recovered
    BUG: unable to handle kernel paging request at ffff8dfcc0003000
    IP: gup_pgd_range+0x1f0/0xc20
    PGD 17ae72067 P4D 17ae72067 PUD 0
    Oops: 0000 [#1] SMP PTI
    ...
    CPU: 3 PID: 5467 Comm: hugetlb_1gb Not tainted 4.15.0-rc8-mm1-abc+ #3
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.9.3-1.fc25 04/01/2014
You can easily reproduce this by calling madvise(MADV_HWPOISON) twice on a 1GB hugepage. This happens because get_user_pages_fast() is not aware of a migration entry on pud that was created in the 1st madvise() event. I think that conversion to pud-aligned migration entry is working, but other MM code walking over page table isn't prepared for it. We need some time and effort to make all this work properly, so this patch avoids the reported bug by just disabling error handling for 1GB hugepage. [n-horiguchi@ah.jp.nec.com: v2] Link: http://lkml.kernel.org/r/1517284444-18149-1-git-send-email-n-horiguchi@ah.jp.nec.com Link: http://lkml.kernel.org/r/1517207283-15769-1-git-send-email-n-horiguchi@ah.jp.nec.com Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Acked-by: Michal Hocko <mhocko@suse.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com> Acked-by: Punit Agrawal <punit.agrawal@arm.com> Tested-by: Michael Ellerman <mpe@ellerman.id.au> Cc: Anshuman Khandual <khandual@linux.vnet.ibm.com> Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-04-05  mm/memory_hotplug: optimize memory hotplug  Pavel Tatashin
During memory hotplugging we traverse struct pages three times:
1. memset(0) in sparse_add_one_section()
2. loop in __add_section() to do set_page_node(page, nid) and SetPageReserved(page)
3. loop in memmap_init_zone() to call __init_single_pfn()
This patch removes the first two loops and leaves only loop 3. All struct pages are initialized in one place, the same as is done during boot. The benefits:
- We improve memory hotplug performance because we are not evicting the cache several times, and we also reduce loop branching overhead.
- Remove the condition from the hotpath in __init_single_pfn() that was added in order to fix the problem that was reported by Bharata in the above email thread, thus also improving performance during normal boot.
- Make memory hotplug more similar to the boot memory initialization path, because we zero and initialize struct pages in only one function.
- Simplify the memory hotplug struct page initialization code, and thus enable future improvements, such as multi-threading the initialization of struct pages in order to improve hotplug performance even further on larger machines.
[pasha.tatashin@oracle.com: v5] Link: http://lkml.kernel.org/r/20180228030308.1116-7-pasha.tatashin@oracle.com Link: http://lkml.kernel.org/r/20180215165920.8570-7-pasha.tatashin@oracle.com Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com> Reviewed-by: Ingo Molnar <mingo@kernel.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Baoquan He <bhe@redhat.com> Cc: Bharata B Rao <bharata@linux.vnet.ibm.com> Cc: Daniel Jordan <daniel.m.jordan@oracle.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Steven Sistare <steven.sistare@oracle.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-04-05  mm/memory_hotplug: don't read nid from struct page during hotplug  Pavel Tatashin
During memory hotplugging the probe routine will leave struct pages uninitialized, the same as it is currently done during boot. Therefore, we do not want to access the inside of struct pages before __init_single_page() is called during onlining. Because during hotplug we know that pages in one memory block belong to the same numa node, we can skip the checking. We should keep checking for the boot case. [pasha.tatashin@oracle.com: s/register_new_memory()/hotplug_memory_register()] Link: http://lkml.kernel.org/r/20180228030308.1116-6-pasha.tatashin@oracle.com Link: http://lkml.kernel.org/r/20180215165920.8570-6-pasha.tatashin@oracle.com Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com> Acked-by: Michal Hocko <mhocko@suse.com> Reviewed-by: Ingo Molnar <mingo@kernel.org> Cc: Baoquan He <bhe@redhat.com> Cc: Bharata B Rao <bharata@linux.vnet.ibm.com> Cc: Daniel Jordan <daniel.m.jordan@oracle.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Steven Sistare <steven.sistare@oracle.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-04-05  mm/memory_hotplug: optimize probe routine  Pavel Tatashin
When memory is hotplugged, pages_correctly_reserved() is called to verify that the added memory is present. This routine traverses every struct page and verifies that PageReserved() is set. This is a slow operation, especially if a large amount of memory is added. Instead of checking every page, it is enough to simply check that the section is present, has a mapping (the struct page array is allocated), and the mapping is online. In addition, we should not expect the probe routine to set flags in struct page, as the struct pages have not yet been initialized. The initialization should be done in __init_single_page(), the same as during boot. Link: http://lkml.kernel.org/r/20180215165920.8570-5-pasha.tatashin@oracle.com Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Ingo Molnar <mingo@kernel.org> Cc: Baoquan He <bhe@redhat.com> Cc: Bharata B Rao <bharata@linux.vnet.ibm.com> Cc: Daniel Jordan <daniel.m.jordan@oracle.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Steven Sistare <steven.sistare@oracle.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
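The per-section check described in the commit above can be modelled in a few lines of Python. This is a sketch only: the `Section` class, its field names and `block_sections_ready` are hypothetical stand-ins for the kernel's section state, chosen to mirror the three conditions the message lists.

```python
from dataclasses import dataclass


@dataclass
class Section:
    present: bool      # the section exists
    has_mapping: bool  # the struct page array is allocated
    online: bool       # the mapping is online


def block_sections_ready(sections):
    """Check one set of flags per section instead of walking every struct page."""
    return all(s.present and s.has_mapping and s.online for s in sections)
```

The point of the design is the cost: the check is proportional to the number of sections in the block, not to the number of pages, and it never touches struct page contents that have not been initialized yet.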
2018-04-05  mm: uninitialized struct page poisoning sanity checking  Pavel Tatashin
During boot we poison struct page memory in order to ensure that no one is accessing this memory until the struct pages are initialized in __init_single_page(). This patch adds more scrutiny to this checking by making sure that flags do not equal the poison pattern when they are accessed. The pattern is all ones. Since node id is also stored in struct page, and may be accessed quite early, we add this enforcement into page_to_nid() function as well. Note, this is applicable only when NODE_NOT_IN_PAGE_FLAGS=n [pasha.tatashin@oracle.com: v4] Link: http://lkml.kernel.org/r/20180215165920.8570-4-pasha.tatashin@oracle.com Link: http://lkml.kernel.org/r/20180213193159.14606-4-pasha.tatashin@oracle.com Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com> Reviewed-by: Ingo Molnar <mingo@kernel.org> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Baoquan He <bhe@redhat.com> Cc: Bharata B Rao <bharata@linux.vnet.ibm.com> Cc: Daniel Jordan <daniel.m.jordan@oracle.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Steven Sistare <steven.sistare@oracle.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
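A small Python model of the sanity check described above, assuming a 64-bit flags word and a made-up layout that keeps the node id in the top 8 bits. `PageModel` and its methods are hypothetical names; only the idea (flags equal to the all-ones poison pattern means "not initialized yet") comes from the commit message.

```python
PAGE_POISON_PATTERN = (1 << 64) - 1  # all ones; 64-bit width is an assumption


class PageModel:
    def __init__(self):
        # Memory is poisoned until the page is initialized.
        self._flags = PAGE_POISON_PATTERN

    def init_single_page(self, nid):
        # Hypothetical layout: node id stored in the top 8 bits of flags.
        self._flags = (nid & 0xFF) << 56

    def flags(self):
        # Every flags access checks for the poison pattern first.
        if self._flags == PAGE_POISON_PATTERN:
            raise RuntimeError("struct page accessed before initialization")
        return self._flags

    def page_to_nid(self):
        # The node id lives in flags, so it gets the same enforcement.
        return self.flags() >> 56
```

Reading the node id before `init_single_page()` raises in this model, which corresponds to the extra scrutiny the patch adds for early accesses.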
2018-04-05  x86/mm/memory_hotplug: determine block size based on the end of boot memory  Pavel Tatashin
Memory sections are combined into "memory block" chunks. These chunks are the units upon which memory can be added and removed. On x86, new memory may be added after the end of the boot memory; therefore, if the block size does not align with the end of boot memory, memory hot-plugging/hot-removing can be broken.
Currently, whenever a machine is booted with more than 64G, the block size is unconditionally increased to 2G from the base 128M. This is done in order to reduce the number of memory device files in sysfs: /sys/devices/system/memory/memoryXXX
We must use the largest allowed block size that aligns to the next address to be able to hotplug the next block of memory. So, when memory is larger than or equal to 64G, we check the end address and find the largest block size that is still a power of two but smaller than or equal to 2G.
Before the fix:
    Run qemu with: -m 64G,slots=2,maxmem=66G -object memory-backend-ram,id=mem1,size=2G
    (qemu) device_add pc-dimm,id=dimm1,memdev=mem1
    Block size [0x80000000] unaligned hotplug range: start 0x1040000000, size 0x80000000
    acpi PNP0C80:00: add_memory failed
    acpi PNP0C80:00: acpi_memory_enable_device() error
    acpi PNP0C80:00: Enumeration failure
With the fix, memory is added successfully as the block size is set to 1G, and therefore aligns with start address 0x1040000000.
[pasha.tatashin@oracle.com: v4] Link: http://lkml.kernel.org/r/20180215165920.8570-3-pasha.tatashin@oracle.com Link: http://lkml.kernel.org/r/20180213193159.14606-3-pasha.tatashin@oracle.com Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com> Reviewed-by: Ingo Molnar <mingo@kernel.org> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Steven Sistare <steven.sistare@oracle.com> Cc: Daniel Jordan <daniel.m.jordan@oracle.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Michal Hocko <mhocko@suse.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Bharata B Rao <bharata@linux.vnet.ibm.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Baoquan He <bhe@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
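The block size selection in the commit above (largest power of two, at most 2G, that the end of boot memory is aligned to, once memory reaches 64G) can be sketched as a short Python function. The constants come from the commit message; `probe_block_size` and the exact loop shape are assumptions for illustration, not the kernel's code.

```python
MIN_BLOCK_SIZE = 128 << 20  # base block size: 128M
MAX_BLOCK_SIZE = 2 << 30    # upper bound: 2G
LARGE_MEMORY = 64 << 30     # from 64G upwards, larger blocks are considered


def probe_block_size(boot_mem_end):
    """Pick the largest power-of-two block size that boot_mem_end is aligned to."""
    if boot_mem_end < LARGE_MEMORY:
        return MIN_BLOCK_SIZE
    size = MAX_BLOCK_SIZE
    while size > MIN_BLOCK_SIZE:
        if boot_mem_end % size == 0:
            break
        size >>= 1  # try the next smaller power of two
    return size
```

For the end address from the qemu example, 0x1040000000, the 2G candidate leaves a remainder of 1G, so the sketch settles on 1G, matching the block size the message reports after the fix.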
2018-04-05  mm/memory_hotplug: enforce block size aligned range check  Pavel Tatashin
Patch series "optimize memory hotplug", v3. This patchset:
- Improves hotplug performance by eliminating a number of struct page traverses during memory hotplug.
- Fixes some issues with hotplugging, where boundaries were not properly checked, and where on x86 the block size was not properly aligned with the end of memory.
- Also, potentially improves boot performance by eliminating a condition from __init_single_page().
- Adds robustness by verifying that struct pages are correctly poisoned when flags are accessed.
The following experiments were performed on Xeon(R) CPU E7-8895 v3 @ 2.60GHz with 1T RAM.
Booting in qemu with 960G of memory, time to initialize struct pages:
    no-kvm:
                TRY1          TRY2
        BEFORE: 39.433668     39.39705
        AFTER:  36.903781     36.989329
    with-kvm:
        BEFORE: 10.977447     11.103164
        AFTER:  10.929072     10.751885
Hotplug 896G memory:
    no-kvm:
                TRY1          TRY2
        BEFORE: 848.740000    846.910000
        AFTER:  783.070000    786.560000
    with-kvm:
                TRY1          TRY2
        BEFORE: 34.410000     33.57
        AFTER:  29.810000     29.580000
This patch (of 6):
Start qemu with the following arguments:
    -m 64G,slots=2,maxmem=66G -object memory-backend-ram,id=mem1,size=2G
This boots the machine with 64G and adds a device mem1 with 2G, which can be hotplugged later.
Also make sure that the config has the following turned on:
    CONFIG_MEMORY_HOTPLUG
    CONFIG_MEMORY_HOTPLUG_DEFAULT_ONLINE
    CONFIG_ACPI_HOTPLUG_MEMORY
Using the qemu monitor, hotplug the memory:
    (qemu) device_add pc-dimm,id=dimm1,memdev=mem1
The operation will fail with the following trace:
    WARNING: CPU: 0 PID: 91 at drivers/base/memory.c:205 pages_correctly_reserved+0xe6/0x110
    Modules linked in:
    CPU: 0 PID: 91 Comm: systemd-udevd Not tainted 4.16.0-rc1_pt_master #29
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.11.0-0-g63451fca13-prebuilt.qemu-project.org 04/01/2014
    RIP: 0010:pages_correctly_reserved+0xe6/0x110
    Call Trace:
     memory_subsys_online+0x44/0xa0
     device_online+0x51/0x80
     store_mem_state+0x5e/0xe0
     kernfs_fop_write+0xfa/0x170
     __vfs_write+0x2e/0x150
     vfs_write+0xa8/0x1a0
     SyS_write+0x4d/0xb0
     do_syscall_64+0x5d/0x110
     entry_SYSCALL_64_after_hwframe+0x21/0x86
    ---[ end trace 6203bc4f1a5d30e8 ]---
The problem is detected in:
    drivers/base/memory.c
    static bool pages_correctly_reserved(unsigned long start_pfn)
    205                 if (WARN_ON_ONCE(!pfn_valid(pfn)))
This function loops through every section in the newly added memory block and verifies that the first pfn is valid, meaning the section exists, has a mapping (struct page array), and is online.
The block size on x86 is usually 128M, but when the machine is booted with more than 64G of memory, the block size is changed to 2G:
    $ cat /sys/devices/system/memory/block_size_bytes
    80000000
or
    $ dmesg | grep "block size"
    [ 0.086469] x86/mm: Memory block size: 2048MB
During memory hotplug and hotremove we verify that the range is section size aligned, but we actually must verify that it is block size aligned, because that is the proper unit for hotplug operations. See: Documentation/memory-hotplug.txt
So, when the start_pfn of newly added memory is not block size aligned, we can get a memory block in which only part has properly populated sections.
In our case the start_pfn starts from the last_pfn (end of physical memory).
    $ dmesg | grep last_pfn
    [ 0.000000] e820: last_pfn = 0x1040000 max_arch_pfn = 0x400000000
0x1040000 == 65G, and so is not 2G aligned!
The fix is to enforce that memory that is hotplugged and hotremoved is block size aligned. With this fix, running the above sequence yields the following result:
    (qemu) device_add pc-dimm,id=dimm1,memdev=mem1
    Block size [0x80000000] unaligned hotplug range: start 0x1040000000, size 0x80000000
    acpi PNP0C80:00: add_memory failed
    acpi PNP0C80:00: acpi_memory_enable_device() error
    acpi PNP0C80:00: Enumeration failure
Link: http://lkml.kernel.org/r/20180213193159.14606-2-pasha.tatashin@oracle.com Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com> Reviewed-by: Ingo Molnar <mingo@kernel.org> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Baoquan He <bhe@redhat.com> Cc: Bharata B Rao <bharata@linux.vnet.ibm.com> Cc: Daniel Jordan <daniel.m.jordan@oracle.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Steven Sistare <steven.sistare@oracle.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
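The alignment arithmetic in the commit above is easy to verify by hand. The following Python sketch models the block-size-aligned range check; `check_hotplug_range` is a hypothetical name, and the 4K page size used to convert last_pfn to a byte address is an assumption.

```python
PAGE_SIZE = 4096  # assumed page size for converting a pfn to a byte address


def check_hotplug_range(start, size, block_size):
    """Return True if [start, start + size) consists of whole memory blocks."""
    return size > 0 and start % block_size == 0 and size % block_size == 0


# Values from the commit message: last_pfn 0x1040000, a 2G DIMM, 2G block size.
start = 0x1040000 * PAGE_SIZE
accepted_2g = check_hotplug_range(start, 0x80000000, 0x80000000)
accepted_1g = check_hotplug_range(start, 0x80000000, 0x40000000)
```

Here `start` works out to 0x1040000000, which is 65G: a multiple of 1G but not of 2G. The range is therefore rejected with a 2G block size and would pass with a 1G block size.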
2018-04-05  mm: thp: fix potential clearing to referenced flag in page_idle_clear_pte_refs_one()  Yang Shi
For a PTE-mapped THP, the compound THP has not been split into normal 4K pages yet, so the whole THP is considered referenced if any one of its subpages is referenced. When walking a PTE-mapped THP with pvmw, all relevant PTEs are checked to retrieve the referenced bit, but the current code just returns the result of the last PTE. If the last PTE is not referenced, the referenced flag is cleared. Just set referenced when ptep{pmdp}_clear_young_notify() returns true. Link: http://lkml.kernel.org/r/1518212451-87134-1-git-send-email-yang.shi@linux.alibaba.com Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com> Reported-by: Gang Deng <gavin.dg@linux.alibaba.com> Suggested-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
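The difference between "result of the last PTE" and "set when any PTE was young" can be shown with a tiny Python model. Both function names are hypothetical, and the list of booleans stands in for the per-PTE results of the clear-young step; this is a sketch of the logic, not of the page table walk.

```python
def referenced_last_only(young_bits):
    """Model of the old behaviour: each PTE overwrites the result."""
    referenced = False
    for young in young_bits:
        referenced = young
    return referenced


def referenced_any(young_bits):
    """Model of the fixed behaviour: only ever set the flag, never clear it."""
    referenced = False
    for young in young_bits:
        if young:
            referenced = True
    return referenced
```

With the first subpage referenced and the last one not, the first function reports the THP as unreferenced while the second keeps it referenced, which is the case the commit message describes.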
2018-04-05  mm: initialize pages on demand during boot  Pavel Tatashin
Deferred page initialization allows the boot cpu to initialize a small subset of the system's pages early in boot, with other cpus doing the rest later on. It is, however, problematic to know how many pages the kernel needs during boot. Different modules and kernel parameters may change the requirement, so the boot cpu either initializes too many pages or runs out of memory. To fix that, initialize early pages on demand. This ensures the kernel does the minimum amount of work to initialize pages during boot and leaves the rest to be divided in the multithreaded initialization path (deferred_init_memmap). The on-demand code is permanently disabled using static branching once deferred pages are initialized. After the static branch is changed to false, the overhead is up to two branch-always instructions if the zone watermark check fails or if rmqueue fails. Sergey Senozhatsky noticed that while deferred pages currently make sense only on NUMA machines (we start one thread per latency node), CONFIG_NUMA is not a requirement for CONFIG_DEFERRED_STRUCT_PAGE_INIT, so that must also be addressed in the patch. 
[akpm@linux-foundation.org: fix typo in comment, make deferred_pages static] [pasha.tatashin@oracle.com: fix min() type mismatch warning] Link: http://lkml.kernel.org/r/20180212164543.26592-1-pasha.tatashin@oracle.com [pasha.tatashin@oracle.com: use zone_to_nid() in deferred_grow_zone()] Link: http://lkml.kernel.org/r/20180214163343.21234-2-pasha.tatashin@oracle.com [pasha.tatashin@oracle.com: might_sleep warning] Link: http://lkml.kernel.org/r/20180306192022.28289-1-pasha.tatashin@oracle.com [akpm@linux-foundation.org: s/spin_lock/spin_lock_irq/ in page_alloc_init_late()] [pasha.tatashin@oracle.com: v5] Link: http://lkml.kernel.org/r/20180309220807.24961-3-pasha.tatashin@oracle.com [akpm@linux-foundation.org: tweak comments] [pasha.tatashin@oracle.com: v6] Link: http://lkml.kernel.org/r/20180313182355.17669-3-pasha.tatashin@oracle.com [akpm@linux-foundation.org: coding-style fixes] Link: http://lkml.kernel.org/r/20180209192216.20509-2-pasha.tatashin@oracle.com Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com> Reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com> Reviewed-by: Steven Sistare <steven.sistare@oracle.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Tested-by: Masayoshi Mizuma <m.mizuma@jp.fujitsu.com> Acked-by: Mel Gorman <mgorman@suse.de> Cc: Michal Hocko <mhocko@suse.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: AKASHI Takahiro <takahiro.akashi@linaro.org> Cc: Gioh Kim <gi-oh.kim@profitbricks.com> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Yaowei Bai <baiyaowei@cmss.chinamobile.com> Cc: Wei Yang <richard.weiyang@gmail.com> Cc: Paul Burton <paul.burton@mips.com> Cc: Miles Chen <miles.chen@mediatek.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-04-05 mm: disable interrupts while initializing deferred pages (Pavel Tatashin)
Vlastimil Babka reported a window, while deferred pages are being initialized and the current version of on-demand initialization has already finished, during which allocations may fail. While this is a highly unlikely scenario, since this kind of allocation request must be large and must come from an interrupt handler, we still want to cover it. We solve this by initializing deferred pages with interrupts disabled, and holding the node_size_lock spin lock while pages in the node are being initialized. The on-demand deferred page initialization that comes later will use the same lock, and thus synchronize with deferred_init_memmap(). It is unlikely for threads that initialize deferred pages to be interrupted. They run soon after smp_init(), but before modules are initialized, and long before user space programs. This is why there is no adverse effect of having these threads running with interrupts disabled. [pasha.tatashin@oracle.com: v6] Link: http://lkml.kernel.org/r/20180313182355.17669-2-pasha.tatashin@oracle.com Link: http://lkml.kernel.org/r/20180309220807.24961-2-pasha.tatashin@oracle.com Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: Steven Sistare <steven.sistare@oracle.com> Cc: Daniel Jordan <daniel.m.jordan@oracle.com> Cc: Masayoshi Mizuma <m.mizuma@jp.fujitsu.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: AKASHI Takahiro <takahiro.akashi@linaro.org> Cc: Gioh Kim <gi-oh.kim@profitbricks.com> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Yaowei Bai <baiyaowei@cmss.chinamobile.com> Cc: Wei Yang <richard.weiyang@gmail.com> Cc: Paul Burton <paul.burton@mips.com> Cc: Miles Chen <miles.chen@mediatek.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Mel Gorman <mgorman@suse.de> Cc: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-04-05 mm/swap_slots.c: use conditional compilation (Randy Dunlap)
For mm/swap_slots.c, use the traditional Linux method of conditional compilation and linking instead of always compiling it by using #ifdef CONFIG_SWAP and #endif for the entire source file (excluding header files). Link: http://lkml.kernel.org/r/c2a47015-0b5a-d0d9-8bc7-9984c049df20@infradead.org Signed-off-by: Randy Dunlap <rdunlap@infradead.org> Acked-by: Tim Chen <tim.c.chen@linux.intel.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-04-05 mm/migrate: rename migration reason MR_CMA to MR_CONTIG_RANGE (Anshuman Khandual)
alloc_contig_range() initiates compaction and eventual migration for the purpose of either CMA or HugeTLB allocations. At present, the reason code remains the same MR_CMA for either of these cases. Let's make it MR_CONTIG_RANGE which will appropriately reflect the reason code in both these cases. Link: http://lkml.kernel.org/r/20180202091518.18798-1-khandual@linux.vnet.ibm.com Signed-off-by: Anshuman Khandual <khandual@linux.vnet.ibm.com> Acked-by: Michal Hocko <mhocko@suse.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-04-05 mm: always print RLIMIT_DATA warning (David Woodhouse)
The documentation for ignore_rlimit_data says that it will print a warning at first misuse. Yet it doesn't seem to do that. Fix the code to print the warning even when we allow the process to continue. Link: http://lkml.kernel.org/r/1517935505-9321-1-git-send-email-dwmw@amazon.co.uk Signed-off-by: David Woodhouse <dwmw@amazon.co.uk> Acked-by: Konstantin Khlebnikov <koct9i@gmail.com> Cc: Cyrill Gorcunov <gorcunov@gmail.com> Cc: Vladimir Davydov <vdavydov@virtuozzo.com> Cc: Pavel Emelyanov <xemul@virtuozzo.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-04-05 mm/ksm.c: make stable_node_dup() static (Colin Ian King)
stable_node_dup() is local to the source and does not need to be in global scope, so make it static. Cleans up sparse warning: mm/ksm.c:1321:13: warning: symbol 'stable_node_dup' was not declared. Should it be static? Link: http://lkml.kernel.org/r/20180206221005.12642-1-colin.king@canonical.com Signed-off-by: Colin Ian King <colin.king@canonical.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-04-05 slab, slub: skip unnecessary kasan_cache_shutdown() (Shakeel Butt)
The kasan quarantine is designed to delay freeing slab objects to catch use-after-free. The quarantine can be large (several percent of machine memory size). When kmem_caches are deleted, related objects are flushed from the quarantine, but this requires scanning the entire quarantine, which can be very slow. We have seen the kernel busily working on this while holding slab_mutex and badly affecting cache_reaper, slabinfo readers and memcg kmem cache creations. It can easily be reproduced by the following script:

	yes . | head -1000000 | xargs stat > /dev/null
	for i in `seq 1 10`; do
		seq 500 | (cd /cg/memory && xargs mkdir)
		seq 500 | xargs -I{} sh -c 'echo $BASHPID > \
			/cg/memory/{}/tasks && exec stat .' > /dev/null
		seq 500 | (cd /cg/memory && xargs rmdir)
	done

The busy stack:

	kasan_cache_shutdown
	shutdown_cache
	memcg_destroy_kmem_caches
	mem_cgroup_css_free
	css_free_rwork_fn
	process_one_work
	worker_thread
	kthread
	ret_from_fork

This patch is based on the observation that if the kmem_cache to be destroyed is empty then there should not be any objects of this cache in the quarantine. Without the patch the script got stuck for a couple of hours. With the patch the script completed within a second. Link: http://lkml.kernel.org/r/20180327230603.54721-1-shakeelb@google.com Signed-off-by: Shakeel Butt <shakeelb@google.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Acked-by: Andrey Ryabinin <aryabinin@virtuozzo.com> Acked-by: Christoph Lameter <cl@linux.com> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Cc: Alexander Potapenko <glider@google.com> Cc: Greg Thelen <gthelen@google.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
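The observation behind this patch (an empty cache cannot have objects sitting in the quarantine, so the full scan can be skipped) can be modelled in user space. A minimal sketch, assuming that quarantined objects still count as allocated from the cache's point of view; `struct toy_cache`, `struct toy_quarantine` and both functions are hypothetical and do not reproduce the kernel's data structures.

```c
#include <stddef.h>

/* Toy cache: counts objects not yet returned to it, quarantined ones included. */
struct toy_cache {
	size_t allocated;
};

/* Toy quarantine: a flat array recording which cache owns each delayed object. */
struct toy_quarantine {
	struct toy_cache **owner;
	size_t len;
	size_t scans;	/* how many full scans were performed */
};

/* Expensive path: walk the whole quarantine and drop entries owned by c. */
static size_t quarantine_remove_cache(struct toy_quarantine *q,
				      struct toy_cache *c)
{
	size_t kept = 0, flushed = 0;

	for (size_t i = 0; i < q->len; i++) {
		if (q->owner[i] == c)
			flushed++;
		else
			q->owner[kept++] = q->owner[i];
	}
	q->len = kept;
	q->scans++;
	c->allocated -= flushed;
	return flushed;
}

/* Shutdown with the shortcut: an empty cache has nothing in the quarantine. */
static size_t toy_cache_shutdown(struct toy_quarantine *q, struct toy_cache *c)
{
	if (c->allocated == 0)
		return 0;	/* skip the full scan entirely */
	return quarantine_remove_cache(q, c);
}
```

The cost of the shortcut is a single comparison, while the scan it avoids is proportional to the size of the whole quarantine, which is why the reproducer above goes from hours to under a second.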
2018-04-05 mm/slab_common.c: remove test if cache name is accessible (Mikulas Patocka)
Since commit db265eca7700 ("mm/sl[aou]b: Move duping of slab name to slab_common.c"), the kernel always duplicates the slab cache name when creating a slab cache, so the test if the slab name is accessible is useless. Link: http://lkml.kernel.org/r/alpine.LRH.2.02.1803231133310.22626@file01.intranet.prod.int.rdu2.redhat.com Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Acked-by: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-04-05 slab, slub: remove size disparity on debug kernel (Shakeel Butt)
I have noticed that on a debug kernel with SLAB, the sizes of some non-root slabs were larger than those of their corresponding root slabs, e.g. for radix_tree_node:

	$cat /proc/slabinfo | grep radix
	name <active_objs> <num_objs> <objsize> <objperslab> <pagesperslab> ...
	radix_tree_node 15052 15075 4096 1 1 ...

	$cat /cgroup/memory/temp/memory.kmem.slabinfo | grep radix
	name <active_objs> <num_objs> <objsize> <objperslab> <pagesperslab> ...
	radix_tree_node 1581 158 4120 1 2 ...

However, for SLUB on a debug kernel, the sizes were the same. On further inspection it was found that SLUB always uses kmem_cache.object_size to measure kmem_cache.size, while SLAB uses the given kmem_cache.size. In the debug kernel the slab's size can be larger than its object_size. Thus, when creating a non-root slab, SLAB uses the root's size as the base to calculate the non-root slab's size, and so the non-root slab's size can be larger than the root slab's size. For SLUB, the non-root slab's size is measured based on the root's object_size, and thus the size remains the same for root and non-root slabs. This patch makes the slab's object_size the default base to measure the slab's size. Link: http://lkml.kernel.org/r/20180313165428.58699-1-shakeelb@google.com Fixes: 794b1248be4e ("memcg, slab: separate memcg vs root cache creation paths") Signed-off-by: Shakeel Butt <shakeelb@google.com> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
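The size disparity can be reproduced with simple arithmetic. The sketch below is an illustration only: `DEBUG_OVERHEAD` and the 576-byte payload are made-up values, and `struct toy_cache` and `toy_cache_create()` are hypothetical stand-ins for the real cache creation path.

```c
#include <assert.h>

#define DEBUG_OVERHEAD 24u	/* hypothetical per-object debug metadata, in bytes */

struct toy_cache {
	unsigned int object_size;	/* payload size the cache was created with */
	unsigned int size;		/* payload plus debug metadata */
};

/* Create a cache from a base size; debug metadata is always added on top. */
static struct toy_cache toy_cache_create(unsigned int base)
{
	struct toy_cache c;

	assert(base != 0);
	c.object_size = base;
	c.size = base + DEBUG_OVERHEAD;
	return c;
}
```

Creating a child with `toy_cache_create(root.size)` adds the debug overhead a second time, so the child ends up larger than its root. Creating it with `toy_cache_create(root.object_size)` yields the same size as the root, which is the behaviour the patch makes the default.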
2018-04-05 slab: use 32-bit arithmetic in freelist_randomize() (Alexey Dobriyan)
SLAB doesn't support 4GB+ of objects per slab, therefore randomization doesn't need size_t. Link: http://lkml.kernel.org/r/20180305200730.15812-25-adobriyan@gmail.com Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Acked-by: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-04-05 slub: make size_from_object() return unsigned int (Alexey Dobriyan)
Function returns size of the object without red zone which can't be negative. Link: http://lkml.kernel.org/r/20180305200730.15812-24-adobriyan@gmail.com Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Acked-by: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-04-05 slub: make struct kmem_cache_order_objects::x unsigned int (Alexey Dobriyan)
struct kmem_cache_order_objects is for mixing order and number of objects, and orders aren't big enough to warrant 64-bit width. Propagate unsignedness down so that everything fits. !!! Patch assumes that "PAGE_SIZE << order" doesn't overflow. !!! Link: http://lkml.kernel.org/r/20180305200730.15812-23-adobriyan@gmail.com Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
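To illustrate what "mixing order and number of objects" in one word means, here is a small user-space sketch. The 16-bit split (`OO_SHIFT`) and the names `struct order_objects`, `oo_pack()`, `oo_order()` and `oo_objects()` are assumptions made for this example, not a copy of the kernel definitions.

```c
#include <assert.h>

#define OO_SHIFT 16u				/* assumed split point between the two fields */
#define OO_MASK  ((1u << OO_SHIFT) - 1)		/* low bits hold the object count */

/* One unsigned int carries both the page order and the objects per slab. */
struct order_objects {
	unsigned int x;
};

/* Pack order into the high bits and the object count into the low bits. */
static struct order_objects oo_pack(unsigned int order, unsigned int objects)
{
	struct order_objects oo;

	assert(order <= OO_MASK);	/* keeps the shift within 32 bits */
	assert(objects <= OO_MASK);	/* count must fit in the low field */
	oo.x = (order << OO_SHIFT) | objects;
	return oo;
}

static unsigned int oo_order(struct order_objects oo)
{
	return oo.x >> OO_SHIFT;
}

static unsigned int oo_objects(struct order_objects oo)
{
	return oo.x & OO_MASK;
}
```

With both fields limited to 16 bits, a 32-bit `unsigned int` is wide enough, which is the point the commit message makes about not needing 64-bit width.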
2018-04-05 slub: make slab_index() return unsigned int (Alexey Dobriyan)
slab_index() returns index of an object within a slab which is at most u15 (or u16?). Iterators additionally guarantee that "p >= addr". Link: http://lkml.kernel.org/r/20180305200730.15812-22-adobriyan@gmail.com Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Acked-by: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
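The index computation and the role of the `p >= addr` guarantee can be sketched as follows. `object_index()` is a hypothetical user-space analogue written for this example; it only assumes that objects are laid out back to back at a fixed stride from the start of the slab.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Index of the object containing p within a slab that starts at addr,
 * with objects laid out every `size` bytes.
 */
static unsigned int object_index(const char *p, const char *addr,
				 unsigned int size)
{
	/* p >= addr makes the pointer difference non-negative. */
	assert(size != 0 && p >= addr);
	return (unsigned int)((size_t)(p - addr) / size);
}
```

Because the difference is never negative and the number of objects per slab is small, the result fits comfortably in an `unsigned int`.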
2018-04-05 slab: make usercopy region 32-bit (Alexey Dobriyan)
If kmem cache sizes are 32-bit, then the usercopy region should be too. Link: http://lkml.kernel.org/r/20180305200730.15812-21-adobriyan@gmail.com Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Cc: David Miller <davem@davemloft.net> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-04-05 kasan: make kasan_cache_create() work with 32-bit slab cache sizes (Alexey Dobriyan)
If SLAB doesn't support 4GB+ kmem caches (it never did), KASAN should not either. Link: http://lkml.kernel.org/r/20180305200730.15812-20-adobriyan@gmail.com Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Cc: Andrey Ryabinin <aryabinin@virtuozzo.com> Cc: Alexander Potapenko <glider@google.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-04-05 slab: make kmem_cache_flags accept 32-bit object size (Alexey Dobriyan)
Now that all sizes are properly typed, propagate "unsigned int" down the callgraph. Link: http://lkml.kernel.org/r/20180305200730.15812-19-adobriyan@gmail.com Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Acked-by: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-04-05 slub: make ->size unsigned int (Alexey Dobriyan)
Linux doesn't support negative length objects (including meta data). Link: http://lkml.kernel.org/r/20180305200730.15812-18-adobriyan@gmail.com Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Acked-by: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-04-05 slub: make ->object_size unsigned int (Alexey Dobriyan)
Linux doesn't support negative length objects. Link: http://lkml.kernel.org/r/20180305200730.15812-17-adobriyan@gmail.com Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Acked-by: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-04-05 slub: make ->offset unsigned int (Alexey Dobriyan)
->offset is free pointer offset from the start of the object, can't be negative. Link: http://lkml.kernel.org/r/20180305200730.15812-16-adobriyan@gmail.com Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Acked-by: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-04-05 slub: make ->cpu_partial unsigned int (Alexey Dobriyan)
/* * cpu_partial determined the maximum number of objects * kept in the per cpu partial lists of a processor. */ Can't be negative. Link: http://lkml.kernel.org/r/20180305200730.15812-15-adobriyan@gmail.com Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Acked-by: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-04-05 slub: make ->inuse unsigned int (Alexey Dobriyan)
->inuse is "the number of bytes in actual use by the object", can't be negative. Link: http://lkml.kernel.org/r/20180305200730.15812-14-adobriyan@gmail.com Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Acked-by: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-04-05 slub: make ->align unsigned int (Alexey Dobriyan)
Kmem cache alignment can't be negative. Link: http://lkml.kernel.org/r/20180305200730.15812-13-adobriyan@gmail.com Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Acked-by: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-04-05 slub: make ->reserved unsigned int (Alexey Dobriyan)
->reserved is either 0 or sizeof(struct rcu_head), can't be negative. Link: http://lkml.kernel.org/r/20180305200730.15812-12-adobriyan@gmail.com Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Acked-by: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-04-05 slub: make ->red_left_pad unsigned int (Alexey Dobriyan)
Padding length can't be negative. Link: http://lkml.kernel.org/r/20180305200730.15812-11-adobriyan@gmail.com Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Acked-by: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-04-05 slub: make ->max_attr_size unsigned int (Alexey Dobriyan)
->max_attr_size is maximum length of every SLAB memcg attribute ever written. VFS limits those to INT_MAX. Link: http://lkml.kernel.org/r/20180305200730.15812-10-adobriyan@gmail.com Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Acked-by: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-04-05 slub: make ->remote_node_defrag_ratio unsigned int (Alexey Dobriyan)
->remote_node_defrag_ratio is in range 0..1000. This also adds a check and modifies the behavior to return an error code. Before this patch invalid values were ignored. Link: http://lkml.kernel.org/r/20180305200730.15812-9-adobriyan@gmail.com Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Acked-by: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
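The behavioural change mentioned here (rejecting out-of-range input with an error code instead of silently ignoring it) can be sketched in user space. `store_ratio()` and `RATIO_MAX` are hypothetical names for this example; the upper bound is taken from the 0..1000 range stated in the message, and the parsing uses the C library rather than any kernel helper.

```c
#include <errno.h>
#include <stdlib.h>

#define RATIO_MAX 1000ul	/* upper bound of the range stated in the commit message */

/*
 * Hypothetical analogue of an attribute store handler: parse a decimal
 * string, reject invalid or out-of-range input, and update *ratio only
 * on success. Returns 0 or a negative errno value.
 */
static int store_ratio(const char *buf, unsigned int *ratio)
{
	char *end;
	unsigned long val;

	errno = 0;
	val = strtoul(buf, &end, 10);
	if (errno || end == buf || (*end != '\0' && *end != '\n'))
		return -EINVAL;		/* not a clean decimal number */
	if (val > RATIO_MAX)
		return -ERANGE;		/* report the error instead of ignoring it */

	*ratio = (unsigned int)val;
	return 0;
}
```

Returning an error lets the writer notice a bad value immediately, whereas the old behaviour left the previous setting in place without any indication.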
2018-04-05 slab: make size_index_elem() unsigned int (Alexey Dobriyan)
size_index_elem() always works with small sizes (kmalloc caches are 32-bit) and returns small indexes. Link: http://lkml.kernel.org/r/20180305200730.15812-8-adobriyan@gmail.com Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Acked-by: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-04-05 slab: make size_index[] array u8 (Alexey Dobriyan)
All those small numbers are reverse indexes into kmalloc caches array and can't be negative. On x86_64 "unsigned int = fls()" can drop CDQE instruction: add/remove: 0/0 grow/shrink: 0/1 up/down: 0/-2 (-2) Function old new delta kmalloc_slab 101 99 -2 Link: http://lkml.kernel.org/r/20180305200730.15812-7-adobriyan@gmail.com Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Acked-by: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-04-05 slab: make kmem_cache_create() work with 32-bit sizes (Alexey Dobriyan)
struct kmem_cache::size and ::align were always 32-bit. Out of curiosity I created a 4GB kmem_cache; it oopsed with division by 0. kmem_cache_create((1UL<<32)+1) created a 1-byte cache as expected. size_t doesn't work and never did. Link: http://lkml.kernel.org/r/20180305200730.15812-6-adobriyan@gmail.com Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
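The truncation behind both observations (the division by zero for a 4GB request and the 1-byte cache for 4GB plus one) can be shown in user space. A minimal sketch: `narrow_size()` is a hypothetical helper that mimics storing a 64-bit size request in a 32-bit field. Note the parentheses in the usage below, since in C `+` binds more tightly than `<<`.

```c
#include <stdint.h>

/*
 * Hypothetical helper: what a 64-bit size request becomes once it is
 * stored in a 32-bit field. Conversion to an unsigned type is defined
 * as reduction modulo 2^32, so the high bits are simply dropped.
 */
static uint32_t narrow_size(uint64_t requested)
{
	return (uint32_t)requested;
}
```

`narrow_size(UINT64_C(1) << 32)` yields 0, which explains a later division by zero, and `narrow_size((UINT64_C(1) << 32) + 1)` yields 1, matching the 1-byte cache described above.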