From 1a5a9906d4e8d1976b701f889d8f35d54b928f25 Mon Sep 17 00:00:00 2001
From: Andrea Arcangeli
Date: Wed, 21 Mar 2012 16:33:42 -0700
Subject: mm: thp: fix pmd_bad() triggering in code paths holding mmap_sem read mode

In some cases it may happen that pmd_none_or_clear_bad() is called with
the mmap_sem held in read mode.  In those cases the huge page faults can
allocate hugepmds under pmd_none_or_clear_bad() and that can trigger a
false positive from pmd_bad() that will not like to see a pmd
materializing as trans huge.

It's not khugepaged causing the problem, khugepaged holds the mmap_sem in
write mode (and all those sites must hold the mmap_sem in read mode to
prevent pagetables from going away from under them, during code review it
seems vm86 mode on 32bit kernels requires that too unless it's restricted
to 1 thread per process or UP builds).  The race is only with the huge
pagefaults that can convert a pmd_none() into a pmd_trans_huge().

Effectively all these pmd_none_or_clear_bad() sites running with mmap_sem
in read mode are somewhat speculative with the page faults, and the
result is always undefined when they run simultaneously.  This is
probably why it wasn't common to run into this.  For example if the
madvise(MADV_DONTNEED) runs zap_page_range() shortly before the page
fault, the hugepage will not be zapped, if the page fault runs first it
will be zapped.

Altering pmd_bad() not to error out if it finds hugepmds won't be enough
to fix this, because zap_pmd_range would then proceed to call
zap_pte_range (which would be incorrect if the pmd becomes a
pmd_trans_huge()).

The simplest way to fix this is to read the pmd in the local stack
(regardless of what we read, no need of actual CPU barriers, only
compiler barrier needed), and be sure it is not changing under the code
that computes its value.  Even if the real pmd is changing under the
value we hold on the stack, we don't care.
If we actually end up in zap_pte_range it means the pmd was not none
already and it was not huge, and it can't become huge from under us
(khugepaged locking explained above).

All we need is to enforce that there is no way anymore that in a code
path like below, pmd_trans_huge can be false, but pmd_none_or_clear_bad
can run into a hugepmd.  The overhead of a barrier() is just a compiler
tweak and should not be measurable (I only added it for THP builds).  I
don't exclude different compiler versions may have prevented the race too
by caching the value of *pmd on the stack (that hasn't been verified, but
it wouldn't be impossible considering pmd_none_or_clear_bad, pmd_bad,
pmd_trans_huge, pmd_none are all inlines and there's no external function
called in between pmd_trans_huge and pmd_none_or_clear_bad).

		if (pmd_trans_huge(*pmd)) {
			if (next-addr != HPAGE_PMD_SIZE) {
				VM_BUG_ON(!rwsem_is_locked(&tlb->mm->mmap_sem));
				split_huge_page_pmd(vma->vm_mm, pmd);
			} else if (zap_huge_pmd(tlb, vma, pmd, addr))
				continue;
			/* fall through */
		}
		if (pmd_none_or_clear_bad(pmd))

Because this race condition could be exercised without special
privileges this was reported in CVE-2012-1179.

The race was identified and fully explained by Ulrich who debugged it.
I'm quoting his accurate explanation below, for reference.

====== start quote =======
      mapcount 0 page_mapcount 1
      kernel BUG at mm/huge_memory.c:1384!

    At some point prior to the panic, a "bad pmd ..." message similar to the
    following is logged on the console:

      mm/memory.c:145: bad pmd ffff8800376e1f98(80000000314000e7).

    The "bad pmd ..." message is logged by pmd_clear_bad() before it clears
    the page's PMD table entry.

        143 void pmd_clear_bad(pmd_t *pmd)
        144 {
    ->  145         pmd_ERROR(*pmd);
        146         pmd_clear(pmd);
        147 }

    After the PMD table entry has been cleared, there is an inconsistency
    between the actual number of PMD table entries that are mapping the page
    and the page's map count (_mapcount field in struct page).
    When the page is subsequently reclaimed, __split_huge_page() detects
    this inconsistency.

       1381         if (mapcount != page_mapcount(page))
       1382                 printk(KERN_ERR "mapcount %d page_mapcount %d\n",
       1383                        mapcount, page_mapcount(page));
    -> 1384         BUG_ON(mapcount != page_mapcount(page));

    The root cause of the problem is a race of two threads in a multithreaded
    process.  Thread B incurs a page fault on a virtual address that has never
    been accessed (PMD entry is zero) while Thread A is executing an madvise()
    system call on a virtual address within the same 2 MB (huge page) range.

               virtual address space
              .---------------------.
              |                     |
              |                     |
            .-|---------------------|
            | |                     |
            | |                     |<-- B(fault)
            | |                     |
      2 MB  | |/////////////////////|-.
      huge <  |/////////////////////|  > A(range)
      page  | |/////////////////////|-'
            | |                     |
            | |                     |
            '-|---------------------|
              |                     |
              |                     |
              '---------------------'

    - Thread A is executing an madvise(..., MADV_DONTNEED) system call
      on the virtual address range "A(range)" shown in the picture.

    sys_madvise
      // Acquire the semaphore in shared mode.
      down_read(&current->mm->mmap_sem)
      ...
      madvise_vma
        switch (behavior)
        case MADV_DONTNEED:
             madvise_dontneed
               zap_page_range
                 unmap_vmas
                   unmap_page_range
                     zap_pud_range
                       zap_pmd_range
                         //
                         // Assume that this huge page has never been accessed.
                         // I.e. content of the PMD entry is zero (not mapped).
                         //
                         if (pmd_trans_huge(*pmd)) {
                             // We don't get here due to the above assumption.
                         }
                         //
                         // Assume that Thread B incurred a page fault and
             .---------> // sneaks in here as shown below.
             |           //
             |           if (pmd_none_or_clear_bad(pmd))
             |               {
             |                 if (unlikely(pmd_bad(*pmd)))
             |                     pmd_clear_bad
             |                     {
             |                       pmd_ERROR
             |                         // Log "bad pmd ..." message here.
             |                       pmd_clear
             |                         // Clear the page's PMD entry.
             |                         // Thread B incremented the map count
             |                         // in page_add_new_anon_rmap(), but
             |                         // now the page is no longer mapped
             |                         // by a PMD entry (-> inconsistency).
             |                     }
             |               }
             |
             v
    - Thread B is handling a page fault on virtual address "B(fault)" shown
      in the picture.

    ...
    do_page_fault
      __do_page_fault
        // Acquire the semaphore in shared mode.
        down_read_trylock(&mm->mmap_sem)
        ...
        handle_mm_fault
          if (pmd_none(*pmd) && transparent_hugepage_enabled(vma))
              // We get here due to the above assumption (PMD entry is zero).
              do_huge_pmd_anonymous_page
                alloc_hugepage_vma
                  // Allocate a new transparent huge page here.
                ...
                __do_huge_pmd_anonymous_page
                  ...
                  spin_lock(&mm->page_table_lock)
                  ...
                  page_add_new_anon_rmap
                    // Here we increment the page's map count (starts at -1).
                    atomic_set(&page->_mapcount, 0)
                  set_pmd_at
                    // Here we set the page's PMD entry which will be cleared
                    // when Thread A calls pmd_clear_bad().
                  ...
                  spin_unlock(&mm->page_table_lock)

    The mmap_sem does not prevent the race because both threads are acquiring
    it in shared mode (down_read).  Thread B holds the page_table_lock while
    the page's map count and PMD table entry are updated.  However, Thread A
    does not synchronize on that lock.

====== end quote =======

[akpm@linux-foundation.org: checkpatch fixes]
Reported-by: Ulrich Obergfell
Signed-off-by: Andrea Arcangeli
Acked-by: Johannes Weiner
Cc: Mel Gorman
Cc: Hugh Dickins
Cc: Dave Jones
Acked-by: Larry Woodman
Acked-by: Rik van Riel
Cc: [2.6.38+]
Cc: Mark Salter
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
---
 fs/proc/task_mmu.c | 9 +++++++++
 1 file changed, 9 insertions(+)

(limited to 'fs/proc')

diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index 7dcd2a25049..3efa7253523 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -409,6 +409,9 @@ static int smaps_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
 	} else {
 		spin_unlock(&walk->mm->page_table_lock);
 	}
+
+	if (pmd_trans_unstable(pmd))
+		return 0;
 	/*
 	 * The mmap_sem held all the way back in m_start() is what
 	 * keeps khugepaged out of here and from collapsing things
@@ -507,6 +510,8 @@ static int clear_refs_pte_range(pmd_t *pmd, unsigned long addr,
 	struct page *page;
 
 	split_huge_page_pmd(walk->mm, pmd);
+	if (pmd_trans_unstable(pmd))
+		return 0;
 
 	pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
 	for (; addr != end; pte++, addr += PAGE_SIZE) {
@@ -670,6 +675,8 @@ static int pagemap_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
 	int err = 0;
 
 	split_huge_page_pmd(walk->mm, pmd);
+	if (pmd_trans_unstable(pmd))
+		return 0;
 
 	/* find the first VMA at or above 'addr' */
 	vma = find_vma(walk->mm, addr);
@@ -961,6 +968,8 @@ static int gather_pte_stats(pmd_t *pmd, unsigned long addr,
 		spin_unlock(&walk->mm->page_table_lock);
 	}
 
+	if (pmd_trans_unstable(pmd))
+		return 0;
 	orig_pte = pte = pte_offset_map_lock(walk->mm, pmd, addr, &ptl);
 	do {
 		struct page *page = can_gather_numa_stats(*pte, md->vma, addr);
-- 
cgit v1.2.3

From 5aaabe831eb527e0d9284f0745d830a755f70393 Mon Sep 17 00:00:00 2001
From: Naoya Horiguchi
Date: Wed, 21 Mar 2012 16:33:57 -0700
Subject: pagemap: avoid splitting thp when reading /proc/pid/pagemap

Thp split is not necessary if we explicitly check whether pmds are
mapping thps or not.  This patch introduces this check and adds code to
generate pagemap entries for pmds mapping thps, which results in less
performance impact of pagemap on thp.
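
As a rough user-space illustration of the per-subpage entries this patch
generates: the index of a base page inside its huge page is
(addr & ~PMD_MASK) >> PAGE_SHIFT, and the frame reported for it is the
pmd's base pfn plus that index.  The sketch below assumes x86-64 geometry
(PAGE_SHIFT 12, PMD_SHIFT 21); thp_subpage_index() and thp_subpage_pfn()
are hypothetical names used only for illustration, not kernel functions.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed x86-64 geometry: 4 KiB base pages, 2 MiB huge pages. */
#define PAGE_SHIFT 12
#define PMD_SHIFT  21
#define PMD_SIZE   (UINT64_C(1) << PMD_SHIFT)
#define PMD_MASK   (~(PMD_SIZE - 1))

/* Hypothetical helper: index of the base page holding addr within its huge page. */
static uint64_t thp_subpage_index(uint64_t addr)
{
	return (addr & ~PMD_MASK) >> PAGE_SHIFT;
}

/* Hypothetical helper: pfn reported for addr when the huge page starts at head_pfn. */
static uint64_t thp_subpage_pfn(uint64_t head_pfn, uint64_t addr)
{
	return head_pfn + thp_subpage_index(addr);
}
```

Under these assumptions a 2 MiB huge page yields 512 consecutive entries,
one per 4 KiB step of addr, without the huge page ever being split.
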

Signed-off-by: Naoya Horiguchi
Reviewed-by: Andi Kleen
Reviewed-by: KAMEZAWA Hiroyuki
Cc: David Rientjes
Cc: Wu Fengguang
Cc: Andrea Arcangeli
Cc: KOSAKI Motohiro
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
---
 fs/proc/task_mmu.c | 52 ++++++++++++++++++++++++++++++++++++++++++++++++----
 1 file changed, 48 insertions(+), 4 deletions(-)

(limited to 'fs/proc')

diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index 3efa7253523..95264c0ef30 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -608,6 +608,9 @@ struct pagemapread {
 	u64 *buffer;
 };
 
+#define PAGEMAP_WALK_SIZE	(PMD_SIZE)
+#define PAGEMAP_WALK_MASK	(PMD_MASK)
+
 #define PM_ENTRY_BYTES      sizeof(u64)
 #define PM_STATUS_BITS      3
 #define PM_STATUS_OFFSET    (64 - PM_STATUS_BITS)
@@ -666,6 +669,27 @@ static u64 pte_to_pagemap_entry(pte_t pte)
 	return pme;
 }
 
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+static u64 thp_pmd_to_pagemap_entry(pmd_t pmd, int offset)
+{
+	u64 pme = 0;
+	/*
+	 * Currently pmd for thp is always present because thp can not be
+	 * swapped-out, migrated, or HWPOISONed (split in such cases instead.)
+	 * This if-check is just to prepare for future implementation.
+	 */
+	if (pmd_present(pmd))
+		pme = PM_PFRAME(pmd_pfn(pmd) + offset)
+			| PM_PSHIFT(PAGE_SHIFT) | PM_PRESENT;
+	return pme;
+}
+#else
+static inline u64 thp_pmd_to_pagemap_entry(pmd_t pmd, int offset)
+{
+	return 0;
+}
+#endif
+
 static int pagemap_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
 			     struct mm_walk *walk)
 {
@@ -673,15 +697,37 @@ static int pagemap_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
 	struct pagemapread *pm = walk->private;
 	pte_t *pte;
 	int err = 0;
+	u64 pfn = PM_NOT_PRESENT;
 
-	split_huge_page_pmd(walk->mm, pmd);
 	if (pmd_trans_unstable(pmd))
 		return 0;
 
 	/* find the first VMA at or above 'addr' */
 	vma = find_vma(walk->mm, addr);
+	spin_lock(&walk->mm->page_table_lock);
+	if (pmd_trans_huge(*pmd)) {
+		if (pmd_trans_splitting(*pmd)) {
+			spin_unlock(&walk->mm->page_table_lock);
+			wait_split_huge_page(vma->anon_vma, pmd);
+		} else {
+			for (; addr != end; addr += PAGE_SIZE) {
+				unsigned long offset;
+
+				offset = (addr & ~PAGEMAP_WALK_MASK) >>
+						PAGE_SHIFT;
+				pfn = thp_pmd_to_pagemap_entry(*pmd, offset);
+				err = add_to_pagemap(addr, pfn, pm);
+				if (err)
+					break;
+			}
+			spin_unlock(&walk->mm->page_table_lock);
+			return err;
+		}
+	} else {
+		spin_unlock(&walk->mm->page_table_lock);
+	}
+
 	for (; addr != end; addr += PAGE_SIZE) {
-		u64 pfn = PM_NOT_PRESENT;
 
 		/* check to see if we've left 'vma' behind
 		 * and need a new, higher one */
@@ -764,8 +810,6 @@ static int pagemap_hugetlb_range(pte_t *pte, unsigned long hmask,
  * determine which areas of memory are actually mapped and llseek to
  * skip over unmapped regions.
  */
-#define PAGEMAP_WALK_SIZE	(PMD_SIZE)
-#define PAGEMAP_WALK_MASK	(PMD_MASK)
 static ssize_t pagemap_read(struct file *file, char __user *buf,
 			    size_t count, loff_t *ppos)
 {
-- 
cgit v1.2.3

From 025c5b2451e42c9e8dfdecd6dc84956ce8f321b5 Mon Sep 17 00:00:00 2001
From: Naoya Horiguchi
Date: Wed, 21 Mar 2012 16:33:57 -0700
Subject: thp: optimize away unnecessary page table locking

Currently when we check if we can handle thp as it is or we need to split
it into regular sized pages, we hold page table lock prior to check
whether a given pmd is mapping thp or not.  Because of this, when it's
not "huge pmd" we suffer from unnecessary lock/unlock overhead.  To
remove it, this patch introduces an optimized check function and replaces
several similar pieces of logic with it.

[akpm@linux-foundation.org: checkpatch fixes]
Signed-off-by: Naoya Horiguchi
Cc: David Rientjes
Cc: Andi Kleen
Cc: Wu Fengguang
Cc: Andrea Arcangeli
Cc: KOSAKI Motohiro
Reviewed-by: KAMEZAWA Hiroyuki
Cc: Jiri Slaby
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
---
 fs/proc/task_mmu.c      |  73 ++++++++++------------------
 include/linux/huge_mm.h |  17 +++++++
 mm/huge_memory.c        | 125 +++++++++++++++++++++++-------------------------
 3 files changed, 101 insertions(+), 114 deletions(-)

(limited to 'fs/proc')

diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index 95264c0ef30..328843de6e9 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -394,20 +394,11 @@ static int smaps_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
 	pte_t *pte;
 	spinlock_t *ptl;
 
-	spin_lock(&walk->mm->page_table_lock);
-	if (pmd_trans_huge(*pmd)) {
-		if (pmd_trans_splitting(*pmd)) {
-			spin_unlock(&walk->mm->page_table_lock);
-			wait_split_huge_page(vma->anon_vma, pmd);
-		} else {
-			smaps_pte_entry(*(pte_t *)pmd, addr,
-					HPAGE_PMD_SIZE, walk);
-			spin_unlock(&walk->mm->page_table_lock);
-			mss->anonymous_thp += HPAGE_PMD_SIZE;
-			return 0;
-		}
-	} else {
+	if (pmd_trans_huge_lock(pmd, vma) == 1) {
+		smaps_pte_entry(*(pte_t *)pmd, addr, HPAGE_PMD_SIZE, walk);
 		spin_unlock(&walk->mm->page_table_lock);
+		mss->anonymous_thp += HPAGE_PMD_SIZE;
+		return 0;
 	}
 
 	if (pmd_trans_unstable(pmd))
@@ -705,26 +696,19 @@ static int pagemap_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
 	/* find the first VMA at or above 'addr' */
 	vma = find_vma(walk->mm, addr);
 	spin_lock(&walk->mm->page_table_lock);
-	if (pmd_trans_huge(*pmd)) {
-		if (pmd_trans_splitting(*pmd)) {
-			spin_unlock(&walk->mm->page_table_lock);
-			wait_split_huge_page(vma->anon_vma, pmd);
-		} else {
-			for (; addr != end; addr += PAGE_SIZE) {
-				unsigned long offset;
-
-				offset = (addr & ~PAGEMAP_WALK_MASK) >>
-						PAGE_SHIFT;
-				pfn = thp_pmd_to_pagemap_entry(*pmd, offset);
-				err = add_to_pagemap(addr, pfn, pm);
-				if (err)
-					break;
-			}
-			spin_unlock(&walk->mm->page_table_lock);
-			return err;
+	if (pmd_trans_huge_lock(pmd, vma) == 1) {
+		for (; addr != end; addr += PAGE_SIZE) {
+			unsigned long offset;
+
+			offset = (addr & ~PAGEMAP_WALK_MASK) >>
+					PAGE_SHIFT;
+			pfn = thp_pmd_to_pagemap_entry(*pmd, offset);
+			err = add_to_pagemap(addr, pfn, pm);
+			if (err)
+				break;
 		}
-	} else {
 		spin_unlock(&walk->mm->page_table_lock);
+		return err;
 	}
 
 	for (; addr != end; addr += PAGE_SIZE) {
@@ -992,24 +976,17 @@ static int gather_pte_stats(pmd_t *pmd, unsigned long addr,
 	pte_t *pte;
 
 	md = walk->private;
-	spin_lock(&walk->mm->page_table_lock);
-	if (pmd_trans_huge(*pmd)) {
-		if (pmd_trans_splitting(*pmd)) {
-			spin_unlock(&walk->mm->page_table_lock);
-			wait_split_huge_page(md->vma->anon_vma, pmd);
-		} else {
-			pte_t huge_pte = *(pte_t *)pmd;
-			struct page *page;
-
-			page = can_gather_numa_stats(huge_pte, md->vma, addr);
-			if (page)
-				gather_stats(page, md, pte_dirty(huge_pte),
-						HPAGE_PMD_SIZE/PAGE_SIZE);
-			spin_unlock(&walk->mm->page_table_lock);
-			return 0;
-		}
-	} else {
+
+	if (pmd_trans_huge_lock(pmd, md->vma) == 1) {
+		pte_t huge_pte = *(pte_t *)pmd;
+		struct page *page;
+
+		page = can_gather_numa_stats(huge_pte, md->vma, addr);
+		if (page)
+			gather_stats(page, md, pte_dirty(huge_pte),
+				     HPAGE_PMD_SIZE/PAGE_SIZE);
 		spin_unlock(&walk->mm->page_table_lock);
+		return 0;
 	}
 
 	if (pmd_trans_unstable(pmd))
diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index 1b921299abc..f56cacb4fec 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -113,6 +113,18 @@ extern void __vma_adjust_trans_huge(struct vm_area_struct *vma,
 				    unsigned long start,
 				    unsigned long end,
 				    long adjust_next);
+extern int __pmd_trans_huge_lock(pmd_t *pmd,
+				 struct vm_area_struct *vma);
+/* mmap_sem must be held on entry */
+static inline int pmd_trans_huge_lock(pmd_t *pmd,
+				      struct vm_area_struct *vma)
+{
+	VM_BUG_ON(!rwsem_is_locked(&vma->vm_mm->mmap_sem));
+	if (pmd_trans_huge(*pmd))
+		return __pmd_trans_huge_lock(pmd, vma);
+	else
+		return 0;
+}
 static inline void vma_adjust_trans_huge(struct vm_area_struct *vma,
 					 unsigned long start,
 					 unsigned long end,
@@ -176,6 +188,11 @@ static inline void vma_adjust_trans_huge(struct vm_area_struct *vma,
 					 long adjust_next)
 {
 }
+static inline int pmd_trans_huge_lock(pmd_t *pmd,
+				      struct vm_area_struct *vma)
+{
+	return 0;
+}
 #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
 
 #endif /* _LINUX_HUGE_MM_H */
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 8f7fc394f63..f0e5306eeb5 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1031,32 +1031,23 @@ int zap_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
 {
 	int ret = 0;
 
-	spin_lock(&tlb->mm->page_table_lock);
-	if (likely(pmd_trans_huge(*pmd))) {
-		if (unlikely(pmd_trans_splitting(*pmd))) {
-			spin_unlock(&tlb->mm->page_table_lock);
-			wait_split_huge_page(vma->anon_vma,
-					     pmd);
-		} else {
-			struct page *page;
-			pgtable_t pgtable;
-			pgtable = get_pmd_huge_pte(tlb->mm);
-			page = pmd_page(*pmd);
-			pmd_clear(pmd);
-			tlb_remove_pmd_tlb_entry(tlb, pmd, addr);
-			page_remove_rmap(page);
-			VM_BUG_ON(page_mapcount(page) < 0);
-			add_mm_counter(tlb->mm, MM_ANONPAGES, -HPAGE_PMD_NR);
-			VM_BUG_ON(!PageHead(page));
-			tlb->mm->nr_ptes--;
-			spin_unlock(&tlb->mm->page_table_lock);
-			tlb_remove_page(tlb, page);
-			pte_free(tlb->mm, pgtable);
-			ret = 1;
-		}
-	} else
+	if (__pmd_trans_huge_lock(pmd, vma) == 1) {
+		struct page *page;
+		pgtable_t pgtable;
+		pgtable = get_pmd_huge_pte(tlb->mm);
+		page = pmd_page(*pmd);
+		pmd_clear(pmd);
+		tlb_remove_pmd_tlb_entry(tlb, pmd, addr);
+		page_remove_rmap(page);
+		VM_BUG_ON(page_mapcount(page) < 0);
+		add_mm_counter(tlb->mm, MM_ANONPAGES, -HPAGE_PMD_NR);
+		VM_BUG_ON(!PageHead(page));
+		tlb->mm->nr_ptes--;
 		spin_unlock(&tlb->mm->page_table_lock);
-
+		tlb_remove_page(tlb, page);
+		pte_free(tlb->mm, pgtable);
+		ret = 1;
+	}
 	return ret;
 }
 
@@ -1066,21 +1057,15 @@ int mincore_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
 {
 	int ret = 0;
 
-	spin_lock(&vma->vm_mm->page_table_lock);
-	if (likely(pmd_trans_huge(*pmd))) {
-		ret = !pmd_trans_splitting(*pmd);
-		spin_unlock(&vma->vm_mm->page_table_lock);
-		if (unlikely(!ret))
-			wait_split_huge_page(vma->anon_vma, pmd);
-		else {
-			/*
-			 * All logical pages in the range are present
-			 * if backed by a huge page.
-			 */
-			memset(vec, 1, (end - addr) >> PAGE_SHIFT);
-		}
-	} else
+	if (__pmd_trans_huge_lock(pmd, vma) == 1) {
+		/*
+		 * All logical pages in the range are present
+		 * if backed by a huge page.
+		 */
 		spin_unlock(&vma->vm_mm->page_table_lock);
+		memset(vec, 1, (end - addr) >> PAGE_SHIFT);
+		ret = 1;
+	}
 
 	return ret;
 }
@@ -1110,20 +1095,11 @@ int move_huge_pmd(struct vm_area_struct *vma, struct vm_area_struct *new_vma,
 		goto out;
 	}
 
-	spin_lock(&mm->page_table_lock);
-	if (likely(pmd_trans_huge(*old_pmd))) {
-		if (pmd_trans_splitting(*old_pmd)) {
-			spin_unlock(&mm->page_table_lock);
-			wait_split_huge_page(vma->anon_vma, old_pmd);
-			ret = -1;
-		} else {
-			pmd = pmdp_get_and_clear(mm, old_addr, old_pmd);
-			VM_BUG_ON(!pmd_none(*new_pmd));
-			set_pmd_at(mm, new_addr, new_pmd, pmd);
-			spin_unlock(&mm->page_table_lock);
-			ret = 1;
-		}
-	} else {
+	ret = __pmd_trans_huge_lock(old_pmd, vma);
+	if (ret == 1) {
+		pmd = pmdp_get_and_clear(mm, old_addr, old_pmd);
+		VM_BUG_ON(!pmd_none(*new_pmd));
+		set_pmd_at(mm, new_addr, new_pmd, pmd);
 		spin_unlock(&mm->page_table_lock);
 	}
 out:
@@ -1136,24 +1112,41 @@ int change_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
 	struct mm_struct *mm = vma->vm_mm;
 	int ret = 0;
 
-	spin_lock(&mm->page_table_lock);
+	if (__pmd_trans_huge_lock(pmd, vma) == 1) {
+		pmd_t entry;
+		entry = pmdp_get_and_clear(mm, addr, pmd);
+		entry = pmd_modify(entry, newprot);
+		set_pmd_at(mm, addr, pmd, entry);
+		spin_unlock(&vma->vm_mm->page_table_lock);
+		ret = 1;
+	}
+
+	return ret;
+}
+
+/*
+ * Returns 1 if a given pmd maps a stable (not under splitting) thp.
+ * Returns -1 if it maps a thp under splitting. Returns 0 otherwise.
+ *
+ * Note that if it returns 1, this routine returns without unlocking page
+ * table locks. So callers must unlock them.
+ */
+int __pmd_trans_huge_lock(pmd_t *pmd, struct vm_area_struct *vma)
+{
+	spin_lock(&vma->vm_mm->page_table_lock);
 	if (likely(pmd_trans_huge(*pmd))) {
 		if (unlikely(pmd_trans_splitting(*pmd))) {
-			spin_unlock(&mm->page_table_lock);
+			spin_unlock(&vma->vm_mm->page_table_lock);
 			wait_split_huge_page(vma->anon_vma, pmd);
+			return -1;
 		} else {
-			pmd_t entry;
-
-			entry = pmdp_get_and_clear(mm, addr, pmd);
-			entry = pmd_modify(entry, newprot);
-			set_pmd_at(mm, addr, pmd, entry);
-			spin_unlock(&vma->vm_mm->page_table_lock);
-			ret = 1;
+			/* Thp mapped by 'pmd' is stable, so we can
+			 * handle it as it is. */
+			return 1;
 		}
-	} else
-		spin_unlock(&vma->vm_mm->page_table_lock);
-
-	return ret;
+	}
+	spin_unlock(&vma->vm_mm->page_table_lock);
+	return 0;
 }
 
 pmd_t *page_check_address_pmd(struct page *page,
-- 
cgit v1.2.3

From e873c49fbfdd595481976b915850e682441bcbec Mon Sep 17 00:00:00 2001
From: Naoya Horiguchi
Date: Wed, 21 Mar 2012 16:33:58 -0700
Subject: pagemap: export KPF_THP

This flag shows that a given page is a subpage of a transparent hugepage.
It helps us debug and test the kernel by showing physical address of thp.
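
For readers consuming the flag from user space, the check reduces to
testing one bit in a 64-bit flags word (/proc/kpageflags exposes one such
word per pfn).  The sketch below is only an illustration: KPF_THP is the
bit number added by this patch, while kpageflags_is_thp() is a
hypothetical helper name that does not exist in the kernel or in any
library.

```c
#include <assert.h>
#include <stdint.h>

#define KPF_THP 22	/* bit number introduced by this patch */

/* Hypothetical helper: is KPF_THP set in a 64-bit page-flags word? */
static int kpageflags_is_thp(uint64_t flags)
{
	return (int)((flags >> KPF_THP) & 1);
}
```

Shifting the word down rather than shifting 1 up keeps the test in 64-bit
arithmetic regardless of the width of int.
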

Signed-off-by: Naoya Horiguchi
Reviewed-by: Wu Fengguang
Reviewed-by: KAMEZAWA Hiroyuki
Acked-by: KOSAKI Motohiro
Cc: David Rientjes
Cc: Andi Kleen
Cc: Andrea Arcangeli
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
---
 fs/proc/page.c                    | 2 ++
 include/linux/kernel-page-flags.h | 1 +
 2 files changed, 3 insertions(+)

(limited to 'fs/proc')

diff --git a/fs/proc/page.c b/fs/proc/page.c
index 6d8e6a9e93a..7fcd0d60a96 100644
--- a/fs/proc/page.c
+++ b/fs/proc/page.c
@@ -115,6 +115,8 @@ u64 stable_page_flags(struct page *page)
 		u |= 1 << KPF_COMPOUND_TAIL;
 	if (PageHuge(page))
 		u |= 1 << KPF_HUGE;
+	else if (PageTransCompound(page))
+		u |= 1 << KPF_THP;
 
 	/*
 	 * Caveats on high order pages: page->_count will only be set
diff --git a/include/linux/kernel-page-flags.h b/include/linux/kernel-page-flags.h
index bd92a89f4b0..26a65711676 100644
--- a/include/linux/kernel-page-flags.h
+++ b/include/linux/kernel-page-flags.h
@@ -30,6 +30,7 @@
 
 #define KPF_NOPAGE		20
 #define KPF_KSM			21
+#define KPF_THP			22
 
 /* kernel hacking assistances
  * WARNING: subject to change, never rely on them!
-- 
cgit v1.2.3

From 092b50bacd1cdbffef2643b7a46f2a215407919c Mon Sep 17 00:00:00 2001
From: Naoya Horiguchi
Date: Wed, 21 Mar 2012 16:33:59 -0700
Subject: pagemap: introduce data structure for pagemap entry

Currently a local variable of pagemap entry in pagemap_pte_range() is
named pfn and typed with u64, but it's not correct (pfn should be
unsigned long.)  This patch introduces a special type for pagemap entries
and replaces code with it.
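
The benefit of a dedicated type can be seen in a small user-space sketch:
once the entry is wrapped in a struct, a bare integer such as a pfn can no
longer be passed where an entry is expected without going through the
constructor.  pagemap_entry_t and make_pme() follow the patch (with
uint64_t standing in for u64); store_entry() is a hypothetical,
simplified stand-in for add_to_pagemap(), not the kernel function.

```c
#include <assert.h>
#include <stdint.h>

/* Wrapper type as in the patch, with uint64_t in place of the kernel's u64. */
typedef struct {
	uint64_t pme;
} pagemap_entry_t;

static inline pagemap_entry_t make_pme(uint64_t val)
{
	return (pagemap_entry_t) { .pme = val };
}

/*
 * Hypothetical stand-in for add_to_pagemap(): copies one entry into buf
 * and returns 1 once the buffer is full, 0 otherwise.
 */
static int store_entry(pagemap_entry_t *buf, int *pos, int len,
		       const pagemap_entry_t *pme)
{
	buf[(*pos)++] = *pme;
	return *pos >= len;
}
```

Calling store_entry(buf, &pos, len, 42) does not compile cleanly, which is
the kind of pfn-versus-entry mix-up the new type is meant to catch.
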

Signed-off-by: Naoya Horiguchi
Cc: David Rientjes
Cc: Andi Kleen
Cc: Wu Fengguang
Cc: Andrea Arcangeli
Cc: KOSAKI Motohiro
Reviewed-by: KAMEZAWA Hiroyuki
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
---
 fs/proc/task_mmu.c | 69 ++++++++++++++++++++++++++++++------------------------
 1 file changed, 38 insertions(+), 31 deletions(-)

(limited to 'fs/proc')

diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index 328843de6e9..c7e3a163295 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -594,9 +594,13 @@ const struct file_operations proc_clear_refs_operations = {
 	.llseek		= noop_llseek,
 };
 
+typedef struct {
+	u64 pme;
+} pagemap_entry_t;
+
 struct pagemapread {
 	int pos, len;
-	u64 *buffer;
+	pagemap_entry_t *buffer;
 };
 
 #define PAGEMAP_WALK_SIZE	(PMD_SIZE)
@@ -619,10 +623,15 @@ struct pagemapread {
 #define PM_NOT_PRESENT      PM_PSHIFT(PAGE_SHIFT)
 #define PM_END_OF_BUFFER    1
 
-static int add_to_pagemap(unsigned long addr, u64 pfn,
+static inline pagemap_entry_t make_pme(u64 val)
+{
+	return (pagemap_entry_t) { .pme = val };
+}
+
+static int add_to_pagemap(unsigned long addr, pagemap_entry_t *pme,
 			  struct pagemapread *pm)
 {
-	pm->buffer[pm->pos++] = pfn;
+	pm->buffer[pm->pos++] = *pme;
 	if (pm->pos >= pm->len)
 		return PM_END_OF_BUFFER;
 	return 0;
@@ -634,8 +643,10 @@ static int pagemap_pte_hole(unsigned long start, unsigned long end,
 	struct pagemapread *pm = walk->private;
 	unsigned long addr;
 	int err = 0;
+	pagemap_entry_t pme = make_pme(PM_NOT_PRESENT);
+
 	for (addr = start; addr < end; addr += PAGE_SIZE) {
-		err = add_to_pagemap(addr, PM_NOT_PRESENT, pm);
+		err = add_to_pagemap(addr, &pme, pm);
 		if (err)
 			break;
 	}
@@ -648,36 +659,33 @@ static u64 swap_pte_to_pagemap_entry(pte_t pte)
 	return swp_type(e) | (swp_offset(e) << MAX_SWAPFILES_SHIFT);
 }
 
-static u64 pte_to_pagemap_entry(pte_t pte)
+static void pte_to_pagemap_entry(pagemap_entry_t *pme, pte_t pte)
 {
-	u64 pme = 0;
 	if (is_swap_pte(pte))
-		pme = PM_PFRAME(swap_pte_to_pagemap_entry(pte))
-			| PM_PSHIFT(PAGE_SHIFT) | PM_SWAP;
+		*pme = make_pme(PM_PFRAME(swap_pte_to_pagemap_entry(pte))
+				| PM_PSHIFT(PAGE_SHIFT) | PM_SWAP);
 	else if (pte_present(pte))
-		pme = PM_PFRAME(pte_pfn(pte))
-			| PM_PSHIFT(PAGE_SHIFT) | PM_PRESENT;
-	return pme;
+		*pme = make_pme(PM_PFRAME(pte_pfn(pte))
+				| PM_PSHIFT(PAGE_SHIFT) | PM_PRESENT);
 }
 
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
-static u64 thp_pmd_to_pagemap_entry(pmd_t pmd, int offset)
+static void thp_pmd_to_pagemap_entry(pagemap_entry_t *pme,
+					pmd_t pmd, int offset)
 {
-	u64 pme = 0;
 	/*
 	 * Currently pmd for thp is always present because thp can not be
 	 * swapped-out, migrated, or HWPOISONed (split in such cases instead.)
 	 * This if-check is just to prepare for future implementation.
 	 */
 	if (pmd_present(pmd))
-		pme = PM_PFRAME(pmd_pfn(pmd) + offset)
-			| PM_PSHIFT(PAGE_SHIFT) | PM_PRESENT;
-	return pme;
+		*pme = make_pme(PM_PFRAME(pmd_pfn(pmd) + offset)
+				| PM_PSHIFT(PAGE_SHIFT) | PM_PRESENT);
 }
 #else
-static inline u64 thp_pmd_to_pagemap_entry(pmd_t pmd, int offset)
+static inline void thp_pmd_to_pagemap_entry(pagemap_entry_t *pme,
+						pmd_t pmd, int offset)
 {
-	return 0;
 }
 #endif
 
@@ -688,7 +696,7 @@ static int pagemap_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
 	struct pagemapread *pm = walk->private;
 	pte_t *pte;
 	int err = 0;
-	u64 pfn = PM_NOT_PRESENT;
+	pagemap_entry_t pme = make_pme(PM_NOT_PRESENT);
 
 	if (pmd_trans_unstable(pmd))
 		return 0;
@@ -702,8 +710,8 @@ static int pagemap_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
 
 			offset = (addr & ~PAGEMAP_WALK_MASK) >>
 					PAGE_SHIFT;
-			pfn = thp_pmd_to_pagemap_entry(*pmd, offset);
-			err = add_to_pagemap(addr, pfn, pm);
+			thp_pmd_to_pagemap_entry(&pme, *pmd, offset);
+			err = add_to_pagemap(addr, &pme, pm);
 			if (err)
 				break;
 		}
@@ -723,11 +731,11 @@ static int pagemap_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
 		if (vma && (vma->vm_start <= addr) &&
 		    !is_vm_hugetlb_page(vma)) {
 			pte = pte_offset_map(pmd, addr);
-			pfn = pte_to_pagemap_entry(*pte);
+			pte_to_pagemap_entry(&pme, *pte);
 			/* unmap before userspace copy */
 			pte_unmap(pte);
 		}
-		err = add_to_pagemap(addr, pfn, pm);
+		err = add_to_pagemap(addr, &pme, pm);
 		if (err)
 			return err;
 	}
@@ -738,13 +746,12 @@ static int pagemap_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
 }
 
 #ifdef CONFIG_HUGETLB_PAGE
-static u64 huge_pte_to_pagemap_entry(pte_t pte, int offset)
+static void huge_pte_to_pagemap_entry(pagemap_entry_t *pme,
+					pte_t pte, int offset)
 {
-	u64 pme = 0;
 	if (pte_present(pte))
-		pme = PM_PFRAME(pte_pfn(pte) + offset)
-			| PM_PSHIFT(PAGE_SHIFT) | PM_PRESENT;
-	return pme;
+		*pme = make_pme(PM_PFRAME(pte_pfn(pte) + offset)
+				| PM_PSHIFT(PAGE_SHIFT) | PM_PRESENT);
 }
 
 /* This function walks within one hugetlb entry in the single call */
@@ -754,12 +761,12 @@ static int pagemap_hugetlb_range(pte_t *pte, unsigned long hmask,
 {
 	struct pagemapread *pm = walk->private;
 	int err = 0;
-	u64 pfn;
+	pagemap_entry_t pme = make_pme(PM_NOT_PRESENT);
 
 	for (; addr != end; addr += PAGE_SIZE) {
 		int offset = (addr & ~hmask) >> PAGE_SHIFT;
-		pfn = huge_pte_to_pagemap_entry(*pte, offset);
-		err = add_to_pagemap(addr, pfn, pm);
+		huge_pte_to_pagemap_entry(&pme, *pte, offset);
+		err = add_to_pagemap(addr, &pme, pm);
 		if (err)
 			return err;
 	}
-- 
cgit v1.2.3

From b76437579d1344b612cf1851ae610c636cec7db0 Mon Sep 17 00:00:00 2001
From: Siddhesh Poyarekar
Date: Wed, 21 Mar 2012 16:34:04 -0700
Subject: procfs: mark thread stack correctly in proc/<pid>/maps

Stack for a new thread is mapped by userspace code and passed via
sys_clone.  This memory is currently seen as anonymous in
/proc/<pid>/maps, which makes it difficult to ascertain which mappings
are being used for thread stacks.  This patch uses the individual task
stack pointers to determine which vmas are actually thread stacks.
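
The idea of matching stack pointers against vmas can be sketched in user
space as a simple range check.  Everything in the sketch is an
illustrative assumption: struct region and struct thread are simplified
stand-ins for the kernel's vma and task structures, stack_owner() is a
hypothetical name, and the real patch obtains the stack pointers from
kernel task state rather than from an array.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified stand-ins for a vma and a task. */
struct region { uint64_t start, end; };	/* covers [start, end) */
struct thread { int tid; uint64_t sp; };	/* sp: current stack pointer */

/*
 * Returns the tid of the first thread whose stack pointer lies inside r,
 * or 0 if no thread is using r as its stack.
 */
static int stack_owner(const struct region *r,
		       const struct thread *threads, size_t n)
{
	for (size_t i = 0; i < n; i++)
		if (threads[i].sp >= r->start && threads[i].sp < r->end)
			return threads[i].tid;
	return 0;
}
```

A region for which stack_owner() returns a non-zero tid is the one a
thread-group view would label as that thread's stack; regions with no
match stay anonymous.
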

For a multithreaded program like the following:

	#include <pthread.h>

	void *thread_main(void *foo)
	{
		while(1);
	}

	int main()
	{
		pthread_t t;
		pthread_create(&t, NULL, thread_main, NULL);
		pthread_join(t, NULL);
	}

/proc/PID/maps looks like the following:

    00400000-00401000 r-xp 00000000 fd:0a 3671804 /home/siddhesh/a.out
    00600000-00601000 rw-p 00000000 fd:0a 3671804 /home/siddhesh/a.out
    019ef000-01a10000 rw-p 00000000 00:00 0 [heap]
    7f8a44491000-7f8a44492000 ---p 00000000 00:00 0
    7f8a44492000-7f8a44c92000 rw-p 00000000 00:00 0
    7f8a44c92000-7f8a44e3d000 r-xp 00000000 fd:00 2097482 /lib64/libc-2.14.90.so
    7f8a44e3d000-7f8a4503d000 ---p 001ab000 fd:00 2097482 /lib64/libc-2.14.90.so
    7f8a4503d000-7f8a45041000 r--p 001ab000 fd:00 2097482 /lib64/libc-2.14.90.so
    7f8a45041000-7f8a45043000 rw-p 001af000 fd:00 2097482 /lib64/libc-2.14.90.so
    7f8a45043000-7f8a45048000 rw-p 00000000 00:00 0
    7f8a45048000-7f8a4505f000 r-xp 00000000 fd:00 2099938 /lib64/libpthread-2.14.90.so
    7f8a4505f000-7f8a4525e000 ---p 00017000 fd:00 2099938 /lib64/libpthread-2.14.90.so
    7f8a4525e000-7f8a4525f000 r--p 00016000 fd:00 2099938 /lib64/libpthread-2.14.90.so
    7f8a4525f000-7f8a45260000 rw-p 00017000 fd:00 2099938 /lib64/libpthread-2.14.90.so
    7f8a45260000-7f8a45264000 rw-p 00000000 00:00 0
    7f8a45264000-7f8a45286000 r-xp 00000000 fd:00 2097348 /lib64/ld-2.14.90.so
    7f8a45457000-7f8a4545a000 rw-p 00000000 00:00 0
    7f8a45484000-7f8a45485000 rw-p 00000000 00:00 0
    7f8a45485000-7f8a45486000 r--p 00021000 fd:00 2097348 /lib64/ld-2.14.90.so
    7f8a45486000-7f8a45487000 rw-p 00022000 fd:00 2097348 /lib64/ld-2.14.90.so
    7f8a45487000-7f8a45488000 rw-p 00000000 00:00 0
    7fff6273b000-7fff6275c000 rw-p 00000000 00:00 0 [stack]
    7fff627ff000-7fff62800000 r-xp 00000000 00:00 0 [vdso]
    ffffffffff600000-ffffffffff601000 r-xp 00000000 00:00 0 [vsyscall]

Here, one could guess that 7f8a44492000-7f8a44c92000 is a stack since the
vma just before it (7f8a44491000-7f8a44492000) has no permissions, but
that is not always a reliable way to find out which vma is a
thread stack. Also, /proc/PID/maps and /proc/PID/task/TID/maps has the same content. With this patch in place, /proc/PID/task/TID/maps are treated as 'maps as the task would see it' and hence, only the vma that that task uses as stack is marked as [stack]. All other 'stack' vmas are marked as anonymous memory. /proc/PID/maps acts as a thread group level view, where all thread stack vmas are marked as [stack:TID] where TID is the process ID of the task that uses that vma as stack, while the process stack is marked as [stack]. So /proc/PID/maps will look like this: 00400000-00401000 r-xp 00000000 fd:0a 3671804 /home/siddhesh/a.out 00600000-00601000 rw-p 00000000 fd:0a 3671804 /home/siddhesh/a.out 019ef000-01a10000 rw-p 00000000 00:00 0 [heap] 7f8a44491000-7f8a44492000 ---p 00000000 00:00 0 7f8a44492000-7f8a44c92000 rw-p 00000000 00:00 0 [stack:1442] 7f8a44c92000-7f8a44e3d000 r-xp 00000000 fd:00 2097482 /lib64/libc-2.14.90.so 7f8a44e3d000-7f8a4503d000 ---p 001ab000 fd:00 2097482 /lib64/libc-2.14.90.so 7f8a4503d000-7f8a45041000 r--p 001ab000 fd:00 2097482 /lib64/libc-2.14.90.so 7f8a45041000-7f8a45043000 rw-p 001af000 fd:00 2097482 /lib64/libc-2.14.90.so 7f8a45043000-7f8a45048000 rw-p 00000000 00:00 0 7f8a45048000-7f8a4505f000 r-xp 00000000 fd:00 2099938 /lib64/libpthread-2.14.90.so 7f8a4505f000-7f8a4525e000 ---p 00017000 fd:00 2099938 /lib64/libpthread-2.14.90.so 7f8a4525e000-7f8a4525f000 r--p 00016000 fd:00 2099938 /lib64/libpthread-2.14.90.so 7f8a4525f000-7f8a45260000 rw-p 00017000 fd:00 2099938 /lib64/libpthread-2.14.90.so 7f8a45260000-7f8a45264000 rw-p 00000000 00:00 0 7f8a45264000-7f8a45286000 r-xp 00000000 fd:00 2097348 /lib64/ld-2.14.90.so 7f8a45457000-7f8a4545a000 rw-p 00000000 00:00 0 7f8a45484000-7f8a45485000 rw-p 00000000 00:00 0 7f8a45485000-7f8a45486000 r--p 00021000 fd:00 2097348 /lib64/ld-2.14.90.so 7f8a45486000-7f8a45487000 rw-p 00022000 fd:00 2097348 /lib64/ld-2.14.90.so 7f8a45487000-7f8a45488000 rw-p 00000000 00:00 0 7fff6273b000-7fff6275c000 rw-p 
00000000 00:00 0 [stack] 7fff627ff000-7fff62800000 r-xp 00000000 00:00 0 [vdso] ffffffffff600000-ffffffffff601000 r-xp 00000000 00:00 0 [vsyscall] Thus marking all vmas that are used as stacks by the threads in the thread group along with the process stack. The task level maps will however like this: 00400000-00401000 r-xp 00000000 fd:0a 3671804 /home/siddhesh/a.out 00600000-00601000 rw-p 00000000 fd:0a 3671804 /home/siddhesh/a.out 019ef000-01a10000 rw-p 00000000 00:00 0 [heap] 7f8a44491000-7f8a44492000 ---p 00000000 00:00 0 7f8a44492000-7f8a44c92000 rw-p 00000000 00:00 0 [stack] 7f8a44c92000-7f8a44e3d000 r-xp 00000000 fd:00 2097482 /lib64/libc-2.14.90.so 7f8a44e3d000-7f8a4503d000 ---p 001ab000 fd:00 2097482 /lib64/libc-2.14.90.so 7f8a4503d000-7f8a45041000 r--p 001ab000 fd:00 2097482 /lib64/libc-2.14.90.so 7f8a45041000-7f8a45043000 rw-p 001af000 fd:00 2097482 /lib64/libc-2.14.90.so 7f8a45043000-7f8a45048000 rw-p 00000000 00:00 0 7f8a45048000-7f8a4505f000 r-xp 00000000 fd:00 2099938 /lib64/libpthread-2.14.90.so 7f8a4505f000-7f8a4525e000 ---p 00017000 fd:00 2099938 /lib64/libpthread-2.14.90.so 7f8a4525e000-7f8a4525f000 r--p 00016000 fd:00 2099938 /lib64/libpthread-2.14.90.so 7f8a4525f000-7f8a45260000 rw-p 00017000 fd:00 2099938 /lib64/libpthread-2.14.90.so 7f8a45260000-7f8a45264000 rw-p 00000000 00:00 0 7f8a45264000-7f8a45286000 r-xp 00000000 fd:00 2097348 /lib64/ld-2.14.90.so 7f8a45457000-7f8a4545a000 rw-p 00000000 00:00 0 7f8a45484000-7f8a45485000 rw-p 00000000 00:00 0 7f8a45485000-7f8a45486000 r--p 00021000 fd:00 2097348 /lib64/ld-2.14.90.so 7f8a45486000-7f8a45487000 rw-p 00022000 fd:00 2097348 /lib64/ld-2.14.90.so 7f8a45487000-7f8a45488000 rw-p 00000000 00:00 0 7fff6273b000-7fff6275c000 rw-p 00000000 00:00 0 7fff627ff000-7fff62800000 r-xp 00000000 00:00 0 [vdso] ffffffffff600000-ffffffffff601000 r-xp 00000000 00:00 0 [vsyscall] where only the vma that is being used as a stack by *that* task is marked as [stack]. 
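A consumer of these files could pick out the stack annotations with a few
lines of C. The sketch below is an illustration only and is not part of this
patch: parse_stack_line() and count_stack_vmas() are hypothetical helper
names, and the code assumes the [stack] / [stack:TID] format described above
and that no mapped file path happens to contain the text "[stack".

```c
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper: return 1 if a maps line carries a "[stack]" or
 * "[stack:TID]" annotation, else 0. On a match, *tid_out (if non-NULL)
 * receives the TID from "[stack:TID]", or 0 for a plain "[stack]".
 */
int parse_stack_line(const char *line, long *tid_out)
{
	const char *p = strstr(line, "[stack");
	long tid = 0;

	if (!p)
		return 0;

	p += strlen("[stack");
	if (*p == ':') {
		/* Thread stack as shown in the thread group view. */
		if (sscanf(p, ":%ld]", &tid) != 1)
			return 0;
	} else if (*p != ']') {
		/* Some other bracketed name, not a stack annotation. */
		return 0;
	}

	if (tid_out)
		*tid_out = tid;
	return 1;
}

/*
 * Hypothetical helper: count the stack vmas listed in one maps file.
 * Returns -1 if the file cannot be opened.
 */
int count_stack_vmas(const char *path)
{
	char line[512];
	FILE *f = fopen(path, "r");
	int n = 0;

	if (!f)
		return -1;

	while (fgets(line, sizeof(line), f))
		n += parse_stack_line(line, NULL);

	fclose(f);
	return n;
}
```

Calling count_stack_vmas() on /proc/PID/maps would count every annotated
stack in the thread group, while calling it on /proc/PID/task/TID/maps would
be expected to find only the stack of that one task on a kernel carrying this
patch.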
Analogous changes have been made to /proc/PID/smaps, /proc/PID/numa_maps,
/proc/PID/task/TID/smaps and /proc/PID/task/TID/numa_maps. Relevant snippets
from smaps and numa_maps:

    [siddhesh@localhost ~ ]$ pgrep a.out
    1441
    [siddhesh@localhost ~ ]$ cat /proc/1441/smaps | grep "\[stack"
    7f8a44492000-7f8a44c92000 rw-p 00000000 00:00 0 [stack:1442]
    7fff6273b000-7fff6275c000 rw-p 00000000 00:00 0 [stack]
    [siddhesh@localhost ~ ]$ cat /proc/1441/task/1442/smaps | grep "\[stack"
    7f8a44492000-7f8a44c92000 rw-p 00000000 00:00 0 [stack]
    [siddhesh@localhost ~ ]$ cat /proc/1441/task/1441/smaps | grep "\[stack"
    7fff6273b000-7fff6275c000 rw-p 00000000 00:00 0 [stack]
    [siddhesh@localhost ~ ]$ cat /proc/1441/numa_maps | grep "stack"
    7f8a44492000 default stack:1442 anon=2 dirty=2 N0=2
    7fff6273a000 default stack anon=3 dirty=3 N0=3
    [siddhesh@localhost ~ ]$ cat /proc/1441/task/1442/numa_maps | grep "stack"
    7f8a44492000 default stack anon=2 dirty=2 N0=2
    [siddhesh@localhost ~ ]$ cat /proc/1441/task/1441/numa_maps | grep "stack"
    7fff6273a000 default stack anon=3 dirty=3 N0=3

[akpm@linux-foundation.org: checkpatch fixes]
[akpm@linux-foundation.org: fix build]
Signed-off-by: Siddhesh Poyarekar
Cc: KOSAKI Motohiro
Cc: Alexander Viro
Cc: Jamie Lokier
Cc: Mike Frysinger
Cc: Alexey Dobriyan
Cc: Matt Mackall
Cc: Oleg Nesterov
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
---
 Documentation/filesystems/proc.txt |  32 +++++-
 fs/proc/base.c                     |  12 +--
 fs/proc/internal.h                 |   9 +-
 fs/proc/task_mmu.c                 | 210 +++++++++++++++++++++++++++++--------
 fs/proc/task_nommu.c               |  69 ++++++++++--
 include/linux/mm.h                 |   3 +
 mm/util.c                          |  41 ++++++++
 7 files changed, 313 insertions(+), 63 deletions(-)
(limited to 'fs/proc')

diff --git a/Documentation/filesystems/proc.txt b/Documentation/filesystems/proc.txt
index a76a26a1db8..b7413cb46dc 100644
--- a/Documentation/filesystems/proc.txt
+++ b/Documentation/filesystems/proc.txt
@@ -290,7 +290,7 @@ Table 1-4: Contents of the stat files (as of 2.6.30-rc7)
 rsslim current
limit in bytes on the rss start_code address above which program text can run end_code address below which program text can run - start_stack address of the start of the stack + start_stack address of the start of the main process stack esp current value of ESP eip current value of EIP pending bitmap of pending signals @@ -325,7 +325,7 @@ address perms offset dev inode pathname a7cb1000-a7cb2000 ---p 00000000 00:00 0 a7cb2000-a7eb2000 rw-p 00000000 00:00 0 a7eb2000-a7eb3000 ---p 00000000 00:00 0 -a7eb3000-a7ed5000 rw-p 00000000 00:00 0 +a7eb3000-a7ed5000 rw-p 00000000 00:00 0 [stack:1001] a7ed5000-a8008000 r-xp 00000000 03:00 4222 /lib/libc.so.6 a8008000-a800a000 r--p 00133000 03:00 4222 /lib/libc.so.6 a800a000-a800b000 rw-p 00135000 03:00 4222 /lib/libc.so.6 @@ -357,11 +357,39 @@ is not associated with a file: [heap] = the heap of the program [stack] = the stack of the main process + [stack:1001] = the stack of the thread with tid 1001 [vdso] = the "virtual dynamic shared object", the kernel system call handler or if empty, the mapping is anonymous. +The /proc/PID/task/TID/maps is a view of the virtual memory from the viewpoint +of the individual tasks of a process. In this file you will see a mapping marked +as [stack] if that task sees it as a stack. This is a key difference from the +content of /proc/PID/maps, where you will see all mappings that are being used +as stack by all of those tasks. Hence, for the example above, the task-level +map, i.e. 
/proc/PID/task/TID/maps for thread 1001 will look like this: + +08048000-08049000 r-xp 00000000 03:00 8312 /opt/test +08049000-0804a000 rw-p 00001000 03:00 8312 /opt/test +0804a000-0806b000 rw-p 00000000 00:00 0 [heap] +a7cb1000-a7cb2000 ---p 00000000 00:00 0 +a7cb2000-a7eb2000 rw-p 00000000 00:00 0 +a7eb2000-a7eb3000 ---p 00000000 00:00 0 +a7eb3000-a7ed5000 rw-p 00000000 00:00 0 [stack] +a7ed5000-a8008000 r-xp 00000000 03:00 4222 /lib/libc.so.6 +a8008000-a800a000 r--p 00133000 03:00 4222 /lib/libc.so.6 +a800a000-a800b000 rw-p 00135000 03:00 4222 /lib/libc.so.6 +a800b000-a800e000 rw-p 00000000 00:00 0 +a800e000-a8022000 r-xp 00000000 03:00 14462 /lib/libpthread.so.0 +a8022000-a8023000 r--p 00013000 03:00 14462 /lib/libpthread.so.0 +a8023000-a8024000 rw-p 00014000 03:00 14462 /lib/libpthread.so.0 +a8024000-a8027000 rw-p 00000000 00:00 0 +a8027000-a8043000 r-xp 00000000 03:00 8317 /lib/ld-linux.so.2 +a8043000-a8044000 r--p 0001b000 03:00 8317 /lib/ld-linux.so.2 +a8044000-a8045000 rw-p 0001c000 03:00 8317 /lib/ld-linux.so.2 +aff35000-aff4a000 rw-p 00000000 00:00 0 +ffffe000-fffff000 r-xp 00000000 00:00 0 [vdso] The /proc/PID/smaps is an extension based on maps, showing the memory consumption for each of the process's mappings. 
For each of mappings there diff --git a/fs/proc/base.c b/fs/proc/base.c index 965d4bde3a3..3b42c1418f3 100644 --- a/fs/proc/base.c +++ b/fs/proc/base.c @@ -2989,9 +2989,9 @@ static const struct pid_entry tgid_base_stuff[] = { INF("cmdline", S_IRUGO, proc_pid_cmdline), ONE("stat", S_IRUGO, proc_tgid_stat), ONE("statm", S_IRUGO, proc_pid_statm), - REG("maps", S_IRUGO, proc_maps_operations), + REG("maps", S_IRUGO, proc_pid_maps_operations), #ifdef CONFIG_NUMA - REG("numa_maps", S_IRUGO, proc_numa_maps_operations), + REG("numa_maps", S_IRUGO, proc_pid_numa_maps_operations), #endif REG("mem", S_IRUSR|S_IWUSR, proc_mem_operations), LNK("cwd", proc_cwd_link), @@ -3002,7 +3002,7 @@ static const struct pid_entry tgid_base_stuff[] = { REG("mountstats", S_IRUSR, proc_mountstats_operations), #ifdef CONFIG_PROC_PAGE_MONITOR REG("clear_refs", S_IWUSR, proc_clear_refs_operations), - REG("smaps", S_IRUGO, proc_smaps_operations), + REG("smaps", S_IRUGO, proc_pid_smaps_operations), REG("pagemap", S_IRUGO, proc_pagemap_operations), #endif #ifdef CONFIG_SECURITY @@ -3348,9 +3348,9 @@ static const struct pid_entry tid_base_stuff[] = { INF("cmdline", S_IRUGO, proc_pid_cmdline), ONE("stat", S_IRUGO, proc_tid_stat), ONE("statm", S_IRUGO, proc_pid_statm), - REG("maps", S_IRUGO, proc_maps_operations), + REG("maps", S_IRUGO, proc_tid_maps_operations), #ifdef CONFIG_NUMA - REG("numa_maps", S_IRUGO, proc_numa_maps_operations), + REG("numa_maps", S_IRUGO, proc_tid_numa_maps_operations), #endif REG("mem", S_IRUSR|S_IWUSR, proc_mem_operations), LNK("cwd", proc_cwd_link), @@ -3360,7 +3360,7 @@ static const struct pid_entry tid_base_stuff[] = { REG("mountinfo", S_IRUGO, proc_mountinfo_operations), #ifdef CONFIG_PROC_PAGE_MONITOR REG("clear_refs", S_IWUSR, proc_clear_refs_operations), - REG("smaps", S_IRUGO, proc_smaps_operations), + REG("smaps", S_IRUGO, proc_tid_smaps_operations), REG("pagemap", S_IRUGO, proc_pagemap_operations), #endif #ifdef CONFIG_SECURITY diff --git a/fs/proc/internal.h 
b/fs/proc/internal.h index 292577531ad..c44efe19798 100644 --- a/fs/proc/internal.h +++ b/fs/proc/internal.h @@ -53,9 +53,12 @@ extern int proc_pid_statm(struct seq_file *m, struct pid_namespace *ns, struct pid *pid, struct task_struct *task); extern loff_t mem_lseek(struct file *file, loff_t offset, int orig); -extern const struct file_operations proc_maps_operations; -extern const struct file_operations proc_numa_maps_operations; -extern const struct file_operations proc_smaps_operations; +extern const struct file_operations proc_pid_maps_operations; +extern const struct file_operations proc_tid_maps_operations; +extern const struct file_operations proc_pid_numa_maps_operations; +extern const struct file_operations proc_tid_numa_maps_operations; +extern const struct file_operations proc_pid_smaps_operations; +extern const struct file_operations proc_tid_smaps_operations; extern const struct file_operations proc_clear_refs_operations; extern const struct file_operations proc_pagemap_operations; extern const struct file_operations proc_net_operations; diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c index c7e3a163295..9694cc28351 100644 --- a/fs/proc/task_mmu.c +++ b/fs/proc/task_mmu.c @@ -209,16 +209,20 @@ static int do_maps_open(struct inode *inode, struct file *file, return ret; } -static void show_map_vma(struct seq_file *m, struct vm_area_struct *vma) +static void +show_map_vma(struct seq_file *m, struct vm_area_struct *vma, int is_pid) { struct mm_struct *mm = vma->vm_mm; struct file *file = vma->vm_file; + struct proc_maps_private *priv = m->private; + struct task_struct *task = priv->task; vm_flags_t flags = vma->vm_flags; unsigned long ino = 0; unsigned long long pgoff = 0; unsigned long start, end; dev_t dev = 0; int len; + const char *name = NULL; if (file) { struct inode *inode = vma->vm_file->f_path.dentry->d_inode; @@ -252,36 +256,57 @@ static void show_map_vma(struct seq_file *m, struct vm_area_struct *vma) if (file) { pad_len_spaces(m, len); 
seq_path(m, &file->f_path, "\n"); - } else { - const char *name = arch_vma_name(vma); - if (!name) { - if (mm) { - if (vma->vm_start <= mm->brk && - vma->vm_end >= mm->start_brk) { - name = "[heap]"; - } else if (vma->vm_start <= mm->start_stack && - vma->vm_end >= mm->start_stack) { - name = "[stack]"; - } + goto done; + } + + name = arch_vma_name(vma); + if (!name) { + pid_t tid; + + if (!mm) { + name = "[vdso]"; + goto done; + } + + if (vma->vm_start <= mm->brk && + vma->vm_end >= mm->start_brk) { + name = "[heap]"; + goto done; + } + + tid = vm_is_stack(task, vma, is_pid); + + if (tid != 0) { + /* + * Thread stack in /proc/PID/task/TID/maps or + * the main process stack. + */ + if (!is_pid || (vma->vm_start <= mm->start_stack && + vma->vm_end >= mm->start_stack)) { + name = "[stack]"; } else { - name = "[vdso]"; + /* Thread stack in /proc/PID/maps */ + pad_len_spaces(m, len); + seq_printf(m, "[stack:%d]", tid); } } - if (name) { - pad_len_spaces(m, len); - seq_puts(m, name); - } + } + +done: + if (name) { + pad_len_spaces(m, len); + seq_puts(m, name); } seq_putc(m, '\n'); } -static int show_map(struct seq_file *m, void *v) +static int show_map(struct seq_file *m, void *v, int is_pid) { struct vm_area_struct *vma = v; struct proc_maps_private *priv = m->private; struct task_struct *task = priv->task; - show_map_vma(m, vma); + show_map_vma(m, vma, is_pid); if (m->count < m->size) /* vma is copied successfully */ m->version = (vma != get_gate_vma(task->mm)) @@ -289,20 +314,49 @@ static int show_map(struct seq_file *m, void *v) return 0; } +static int show_pid_map(struct seq_file *m, void *v) +{ + return show_map(m, v, 1); +} + +static int show_tid_map(struct seq_file *m, void *v) +{ + return show_map(m, v, 0); +} + static const struct seq_operations proc_pid_maps_op = { .start = m_start, .next = m_next, .stop = m_stop, - .show = show_map + .show = show_pid_map +}; + +static const struct seq_operations proc_tid_maps_op = { + .start = m_start, + .next = m_next, + 
.stop = m_stop, + .show = show_tid_map }; -static int maps_open(struct inode *inode, struct file *file) +static int pid_maps_open(struct inode *inode, struct file *file) { return do_maps_open(inode, file, &proc_pid_maps_op); } -const struct file_operations proc_maps_operations = { - .open = maps_open, +static int tid_maps_open(struct inode *inode, struct file *file) +{ + return do_maps_open(inode, file, &proc_tid_maps_op); +} + +const struct file_operations proc_pid_maps_operations = { + .open = pid_maps_open, + .read = seq_read, + .llseek = seq_lseek, + .release = seq_release_private, +}; + +const struct file_operations proc_tid_maps_operations = { + .open = tid_maps_open, .read = seq_read, .llseek = seq_lseek, .release = seq_release_private, @@ -416,7 +470,7 @@ static int smaps_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end, return 0; } -static int show_smap(struct seq_file *m, void *v) +static int show_smap(struct seq_file *m, void *v, int is_pid) { struct proc_maps_private *priv = m->private; struct task_struct *task = priv->task; @@ -434,7 +488,7 @@ static int show_smap(struct seq_file *m, void *v) if (vma->vm_mm && !is_vm_hugetlb_page(vma)) walk_page_range(vma->vm_start, vma->vm_end, &smaps_walk); - show_map_vma(m, vma); + show_map_vma(m, vma, is_pid); seq_printf(m, "Size: %8lu kB\n" @@ -473,20 +527,49 @@ static int show_smap(struct seq_file *m, void *v) return 0; } +static int show_pid_smap(struct seq_file *m, void *v) +{ + return show_smap(m, v, 1); +} + +static int show_tid_smap(struct seq_file *m, void *v) +{ + return show_smap(m, v, 0); +} + static const struct seq_operations proc_pid_smaps_op = { .start = m_start, .next = m_next, .stop = m_stop, - .show = show_smap + .show = show_pid_smap +}; + +static const struct seq_operations proc_tid_smaps_op = { + .start = m_start, + .next = m_next, + .stop = m_stop, + .show = show_tid_smap }; -static int smaps_open(struct inode *inode, struct file *file) +static int pid_smaps_open(struct inode 
*inode, struct file *file) { return do_maps_open(inode, file, &proc_pid_smaps_op); } -const struct file_operations proc_smaps_operations = { - .open = smaps_open, +static int tid_smaps_open(struct inode *inode, struct file *file) +{ + return do_maps_open(inode, file, &proc_tid_smaps_op); +} + +const struct file_operations proc_pid_smaps_operations = { + .open = pid_smaps_open, + .read = seq_read, + .llseek = seq_lseek, + .release = seq_release_private, +}; + +const struct file_operations proc_tid_smaps_operations = { + .open = tid_smaps_open, .read = seq_read, .llseek = seq_lseek, .release = seq_release_private, @@ -1039,7 +1122,7 @@ static int gather_hugetbl_stats(pte_t *pte, unsigned long hmask, /* * Display pages allocated per node and memory policy via /proc. */ -static int show_numa_map(struct seq_file *m, void *v) +static int show_numa_map(struct seq_file *m, void *v, int is_pid) { struct numa_maps_private *numa_priv = m->private; struct proc_maps_private *proc_priv = &numa_priv->proc_maps; @@ -1076,9 +1159,19 @@ static int show_numa_map(struct seq_file *m, void *v) seq_path(m, &file->f_path, "\n\t= "); } else if (vma->vm_start <= mm->brk && vma->vm_end >= mm->start_brk) { seq_printf(m, " heap"); - } else if (vma->vm_start <= mm->start_stack && - vma->vm_end >= mm->start_stack) { - seq_printf(m, " stack"); + } else { + pid_t tid = vm_is_stack(proc_priv->task, vma, is_pid); + if (tid != 0) { + /* + * Thread stack in /proc/PID/task/TID/maps or + * the main process stack. 
+ */ + if (!is_pid || (vma->vm_start <= mm->start_stack && + vma->vm_end >= mm->start_stack)) + seq_printf(m, " stack"); + else + seq_printf(m, " stack:%d", tid); + } } if (is_vm_hugetlb_page(vma)) @@ -1121,21 +1214,39 @@ out: return 0; } +static int show_pid_numa_map(struct seq_file *m, void *v) +{ + return show_numa_map(m, v, 1); +} + +static int show_tid_numa_map(struct seq_file *m, void *v) +{ + return show_numa_map(m, v, 0); +} + static const struct seq_operations proc_pid_numa_maps_op = { - .start = m_start, - .next = m_next, - .stop = m_stop, - .show = show_numa_map, + .start = m_start, + .next = m_next, + .stop = m_stop, + .show = show_pid_numa_map, }; -static int numa_maps_open(struct inode *inode, struct file *file) +static const struct seq_operations proc_tid_numa_maps_op = { + .start = m_start, + .next = m_next, + .stop = m_stop, + .show = show_tid_numa_map, +}; + +static int numa_maps_open(struct inode *inode, struct file *file, + const struct seq_operations *ops) { struct numa_maps_private *priv; int ret = -ENOMEM; priv = kzalloc(sizeof(*priv), GFP_KERNEL); if (priv) { priv->proc_maps.pid = proc_pid(inode); - ret = seq_open(file, &proc_pid_numa_maps_op); + ret = seq_open(file, ops); if (!ret) { struct seq_file *m = file->private_data; m->private = priv; @@ -1146,8 +1257,25 @@ static int numa_maps_open(struct inode *inode, struct file *file) return ret; } -const struct file_operations proc_numa_maps_operations = { - .open = numa_maps_open, +static int pid_numa_maps_open(struct inode *inode, struct file *file) +{ + return numa_maps_open(inode, file, &proc_pid_numa_maps_op); +} + +static int tid_numa_maps_open(struct inode *inode, struct file *file) +{ + return numa_maps_open(inode, file, &proc_tid_numa_maps_op); +} + +const struct file_operations proc_pid_numa_maps_operations = { + .open = pid_numa_maps_open, + .read = seq_read, + .llseek = seq_lseek, + .release = seq_release_private, +}; + +const struct file_operations proc_tid_numa_maps_operations = { 
+ .open = tid_numa_maps_open, .read = seq_read, .llseek = seq_lseek, .release = seq_release_private, diff --git a/fs/proc/task_nommu.c b/fs/proc/task_nommu.c index 980de547c07..74fe164d1b2 100644 --- a/fs/proc/task_nommu.c +++ b/fs/proc/task_nommu.c @@ -134,9 +134,11 @@ static void pad_len_spaces(struct seq_file *m, int len) /* * display a single VMA to a sequenced file */ -static int nommu_vma_show(struct seq_file *m, struct vm_area_struct *vma) +static int nommu_vma_show(struct seq_file *m, struct vm_area_struct *vma, + int is_pid) { struct mm_struct *mm = vma->vm_mm; + struct proc_maps_private *priv = m->private; unsigned long ino = 0; struct file *file; dev_t dev = 0; @@ -168,10 +170,19 @@ static int nommu_vma_show(struct seq_file *m, struct vm_area_struct *vma) pad_len_spaces(m, len); seq_path(m, &file->f_path, ""); } else if (mm) { - if (vma->vm_start <= mm->start_stack && - vma->vm_end >= mm->start_stack) { + pid_t tid = vm_is_stack(priv->task, vma, is_pid); + + if (tid != 0) { pad_len_spaces(m, len); - seq_puts(m, "[stack]"); + /* + * Thread stack in /proc/PID/task/TID/maps or + * the main process stack. 
+ */ + if (!is_pid || (vma->vm_start <= mm->start_stack && + vma->vm_end >= mm->start_stack)) + seq_printf(m, "[stack]"); + else + seq_printf(m, "[stack:%d]", tid); } } @@ -182,11 +193,22 @@ static int nommu_vma_show(struct seq_file *m, struct vm_area_struct *vma) /* * display mapping lines for a particular process's /proc/pid/maps */ -static int show_map(struct seq_file *m, void *_p) +static int show_map(struct seq_file *m, void *_p, int is_pid) { struct rb_node *p = _p; - return nommu_vma_show(m, rb_entry(p, struct vm_area_struct, vm_rb)); + return nommu_vma_show(m, rb_entry(p, struct vm_area_struct, vm_rb), + is_pid); +} + +static int show_pid_map(struct seq_file *m, void *_p) +{ + return show_map(m, _p, 1); +} + +static int show_tid_map(struct seq_file *m, void *_p) +{ + return show_map(m, _p, 0); } static void *m_start(struct seq_file *m, loff_t *pos) @@ -240,10 +262,18 @@ static const struct seq_operations proc_pid_maps_ops = { .start = m_start, .next = m_next, .stop = m_stop, - .show = show_map + .show = show_pid_map +}; + +static const struct seq_operations proc_tid_maps_ops = { + .start = m_start, + .next = m_next, + .stop = m_stop, + .show = show_tid_map }; -static int maps_open(struct inode *inode, struct file *file) +static int maps_open(struct inode *inode, struct file *file, + const struct seq_operations *ops) { struct proc_maps_private *priv; int ret = -ENOMEM; @@ -251,7 +281,7 @@ static int maps_open(struct inode *inode, struct file *file) priv = kzalloc(sizeof(*priv), GFP_KERNEL); if (priv) { priv->pid = proc_pid(inode); - ret = seq_open(file, &proc_pid_maps_ops); + ret = seq_open(file, ops); if (!ret) { struct seq_file *m = file->private_data; m->private = priv; @@ -262,8 +292,25 @@ static int maps_open(struct inode *inode, struct file *file) return ret; } -const struct file_operations proc_maps_operations = { - .open = maps_open, +static int pid_maps_open(struct inode *inode, struct file *file) +{ + return maps_open(inode, file, 
&proc_pid_maps_ops); +} + +static int tid_maps_open(struct inode *inode, struct file *file) +{ + return maps_open(inode, file, &proc_tid_maps_ops); +} + +const struct file_operations proc_pid_maps_operations = { + .open = pid_maps_open, + .read = seq_read, + .llseek = seq_lseek, + .release = seq_release_private, +}; + +const struct file_operations proc_tid_maps_operations = { + .open = tid_maps_open, .read = seq_read, .llseek = seq_lseek, .release = seq_release_private, diff --git a/include/linux/mm.h b/include/linux/mm.h index 378bccebc26..df17ff23d50 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -1040,6 +1040,9 @@ static inline int stack_guard_page_end(struct vm_area_struct *vma, !vma_growsup(vma->vm_next, addr); } +extern pid_t +vm_is_stack(struct task_struct *task, struct vm_area_struct *vma, int in_group); + extern unsigned long move_page_tables(struct vm_area_struct *vma, unsigned long old_addr, struct vm_area_struct *new_vma, unsigned long new_addr, unsigned long len); diff --git a/mm/util.c b/mm/util.c index 136ac4f322b..ae962b31de8 100644 --- a/mm/util.c +++ b/mm/util.c @@ -239,6 +239,47 @@ void __vma_link_list(struct mm_struct *mm, struct vm_area_struct *vma, next->vm_prev = vma; } +/* Check if the vma is being used as a stack by this task */ +static int vm_is_stack_for_task(struct task_struct *t, + struct vm_area_struct *vma) +{ + return (vma->vm_start <= KSTK_ESP(t) && vma->vm_end >= KSTK_ESP(t)); +} + +/* + * Check if the vma is being used as a stack. + * If is_group is non-zero, check in the entire thread group or else + * just check in the current task. Returns the pid of the task that + * the vma is stack for. 
+ */ +pid_t vm_is_stack(struct task_struct *task, + struct vm_area_struct *vma, int in_group) +{ + pid_t ret = 0; + + if (vm_is_stack_for_task(task, vma)) + return task->pid; + + if (in_group) { + struct task_struct *t; + rcu_read_lock(); + if (!pid_alive(task)) + goto done; + + t = task; + do { + if (vm_is_stack_for_task(t, vma)) { + ret = t->pid; + goto done; + } + } while_each_thread(task, t); +done: + rcu_read_unlock(); + } + + return ret; +} + #if defined(CONFIG_MMU) && !defined(HAVE_ARCH_PICK_MMAP_LAYOUT) void arch_pick_mmap_layout(struct mm_struct *mm) { -- cgit v1.2.3