summaryrefslogtreecommitdiff
path: root/mm/memcontrol.c
AgeCommit message (Collapse)Author
2010-04-07memcg: fix race in file_mapped accountingKAMEZAWA Hiroyuki
Presently, memcg's FILE_MAPPED accounting has following race with move_account (happens at rmdir()). increment page->mapcount (rmap.c) mem_cgroup_update_file_mapped() move_account() lock_page_cgroup() check page_mapped() if page_mapped(page)>1 { FILE_MAPPED -1 from old memcg FILE_MAPPED +1 to old memcg } ..... overwrite pc->mem_cgroup unlock_page_cgroup() lock_page_cgroup() FILE_MAPPED + 1 to pc->mem_cgroup unlock_page_cgroup() Then, old memcg (-1 file mapped) new memcg (+2 file mapped) This happens because move_account see page_mapped() which is not guarded by lock_page_cgroup(). This patch adds FILE_MAPPED flag to page_cgroup and move account information based on it. Now, all checks are synchronous with lock_page_cgroup(). Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Reviewed-by: Balbir Singh <balbir@in.ibm.com> Reviewed-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Cc: Andrea Righi <arighi@develer.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-03-24memcontrol: fix potential null derefDan Carpenter
There was a potential null deref introduced in c62b1a3b31b5 ("memcg: use generic percpu instead of private implementation"). Signed-off-by: Dan Carpenter <error27@gmail.com> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-03-24memcg: disable move charge in no mmu caseDaisuke Nishimura
In commit 02491447 ("memcg: move charges of anonymous swap"), I tried to disable move charge feature in no mmu case by enclosing all the related functions with "#ifdef CONFIG_MMU", but the commit places these ifdefs in wrong place. (it seems that it's mangled while handling some fixes...) This patch fixes it up. Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Cc: Balbir Singh <balbir@in.ibm.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-03-12memcg: fix oom kill behaviorKAMEZAWA Hiroyuki
In current page-fault code, handle_mm_fault() -> ... -> mem_cgroup_charge() -> map page or handle error. -> check return code. If page fault's return code is VM_FAULT_OOM, page_fault_out_of_memory() is called. But if it's caused by memcg, OOM should have been already invoked. Then, I added a patch: a636b327f731143ccc544b966cfd8de6cb6d72c6. That patch records last_oom_jiffies for memcg's sub-hierarchy and prevents page_fault_out_of_memory from being invoked in near future. But Nishimura-san reported that check by jiffies is not enough when the system is terribly heavy. This patch changes memcg's oom logic as. * If memcg causes OOM-kill, continue to retry. * remove jiffies check which is used now. * add memcg-oom-lock which works like perzone oom lock. * If current is killed(as a process), bypass charge. Something more sophisticated can be added but this pactch does fundamental things. TODO: - add oom notifier - add permemcg disable-oom-kill flag and freezer at oom. - more chances for wake up oom waiter (when changing memory limit etc..) Reviewed-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Tested-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Balbir Singh <balbir@in.ibm.com> Cc: David Rientjes <rientjes@google.com> Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-03-12cgroups: remove events before destroying subsystem state objectsKirill A. Shutemov
Events should be removed after rmdir of cgroup directory, but before destroying subsystem state objects. Let's take reference to cgroup directory dentry to do that. Signed-off-by: Kirill A. Shutemov <kirill@shutemov.name> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hioryu@jp.fujitsu.com> Cc: Paul Menage <menage@google.com> Acked-by: Li Zefan <lizf@cn.fujitsu.com> Cc: Balbir Singh <balbir@linux.vnet.ibm.com> Cc: Pavel Emelyanov <xemul@openvz.org> Cc: Dan Malek <dan@embeddedalley.com> Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-03-12memcg : share event counter rather than duplicateKAMEZAWA Hiroyuki
Memcg has 2 eventcountes which counts "the same" event. Just usages are different from each other. This patch tries to reduce event counter. Now logic uses "only increment, no reset" counter and masks for each checks. Softlimit chesk was done per 1000 evetns. So, the similar check can be done by !(new_counter & 0x3ff). Threshold check was done per 100 events. So, the similar check can be done by (!new_counter & 0x7f) ALL event checks are done right after EVENT percpu counter is updated. Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Kirill A. Shutemov <kirill@shutemov.name> Cc: Balbir Singh <balbir@linux.vnet.ibm.com> Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-03-12memcg: update threshold and softlimit at commitKAMEZAWA Hiroyuki
Presently, move_task does "batched" precharge. Because res_counter or css's refcnt are not-scalable jobs for memcg, try_charge_().. tend to be done in batched manner if allowed. Now, softlimit and threshold check their event counter in try_charge, but the charge is not a per-page event. And event counter is not updated at charge(). Moreover, precharge doesn't pass "page" to try_charge() and softlimit tree will be never updated until uncharge() causes an event." So the best place to check the event counter is commit_charge(). This is per-page event by its nature. This patch move checks to there. Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Kirill A. Shutemov <kirill@shutemov.name> Cc: Balbir Singh <balbir@linux.vnet.ibm.com> Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-03-12memcg: use generic percpu instead of private implementationKAMEZAWA Hiroyuki
When per-cpu counter for memcg was implemneted, dynamic percpu allocator was not very good. But now, we have good one and useful macros. This patch replaces memcg's private percpu counter implementation with generic dynamic percpu allocator. The benefits are - We can remove private implementation. - The counters will be NUMA-aware. (Current one is not...) - This patch makes sizeof struct mem_cgroup smaller. Then, struct mem_cgroup may be fit in page size on small config. - About basic performance aspects, see below. [Before] # size mm/memcontrol.o text data bss dec hex filename 24373 2528 4132 31033 7939 mm/memcontrol.o [page-fault-throuput test on 8cpu/SMP in root cgroup] # /root/bin/perf stat -a -e page-faults,cache-misses --repeat 5 ./multi-fault-fork 8 Performance counter stats for './multi-fault-fork 8' (5 runs): 45878618 page-faults ( +- 0.110% ) 602635826 cache-misses ( +- 0.105% ) 61.005373262 seconds time elapsed ( +- 0.004% ) Then cache-miss/page fault = 13.14 [After] #size mm/memcontrol.o text data bss dec hex filename 23913 2528 4132 30573 776d mm/memcontrol.o # /root/bin/perf stat -a -e page-faults,cache-misses --repeat 5 ./multi-fault-fork 8 Performance counter stats for './multi-fault-fork 8' (5 runs): 48179400 page-faults ( +- 0.271% ) 588628407 cache-misses ( +- 0.136% ) 61.004615021 seconds time elapsed ( +- 0.004% ) Then cache-miss/page fault = 12.22 Text size is reduced. This performance improvement is not big and will be invisible in real world applications. But this result shows this patch has some good effect even on (small) SMP. Here is a test program I used. 1. fork() processes on each cpus. 2. do page fault repeatedly on each process. 3. after 60secs, kill all childredn and exit. (3 is necessary for getting stable data, this is improvement from previous one.) #define _GNU_SOURCE #include <stdio.h> #include <sched.h> #include <sys/mman.h> #include <sys/types.h> #include <sys/stat.h> #include <fcntl.h> #include <signal.h> #include <stdlib.h> /* * For avoiding contention in page table lock, FAULT area is * sparse. If FAULT_LENGTH is too large for your cpus, decrease it. */ #define FAULT_LENGTH (2 * 1024 * 1024) #define PAGE_SIZE 4096 #define MAXNUM (128) void alarm_handler(int sig) { } void *worker(int cpu, int ppid) { void *start, *end; char *c; cpu_set_t set; int i; CPU_ZERO(&set); CPU_SET(cpu, &set); sched_setaffinity(0, sizeof(set), &set); start = mmap(NULL, FAULT_LENGTH, PROT_READ|PROT_WRITE, MAP_PRIVATE | MAP_ANONYMOUS, 0, 0); if (start == MAP_FAILED) { perror("mmap"); exit(1); } end = start + FAULT_LENGTH; pause(); //fprintf(stderr, "run%d", cpu); while (1) { for (c = (char*)start; (void *)c < end; c += PAGE_SIZE) *c = 0; madvise(start, FAULT_LENGTH, MADV_DONTNEED); } return NULL; } int main(int argc, char *argv[]) { int num, i, ret, pid, status; int pids[MAXNUM]; if (argc < 2) return 0; setpgid(0, 0); signal(SIGALRM, alarm_handler); num = atoi(argv[1]); pid = getpid(); for (i = 0; i < num; ++i) { ret = fork(); if (!ret) { worker(i, pid); exit(0); } pids[i] = ret; } sleep(1); kill(-pid, SIGALRM); sleep(60); for (i = 0; i < num; i++) kill(pids[i], SIGKILL); for (i = 0; i < num; i++) waitpid(pids[i], &status, 0); return 0; } Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Cc: Balbir Singh <balbir@in.ibm.com> Cc: Pavel Emelyanov <xemul@openvz.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-03-12memcg: typo in comment to mem_cgroup_print_oom_info()Kirill A. Shutemov
s/mem_cgroup_print_mem_info/mem_cgroup_print_oom_info/ Signed-off-by: Kirill A. Shutemov <kirill@shutemov.name> Cc: Balbir Singh <balbir@linux.vnet.ibm.com> Cc: Pavel Emelyanov <xemul@openvz.org> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-03-12memcg: implement memory thresholdsKirill A. Shutemov
It allows to register multiple memory and memsw thresholds and gets notifications when it crosses. To register a threshold application need: - create an eventfd; - open memory.usage_in_bytes or memory.memsw.usage_in_bytes; - write string like "<event_fd> <memory.usage_in_bytes> <threshold>" to cgroup.event_control. Application will be notified through eventfd when memory usage crosses threshold in any direction. It's applicable for root and non-root cgroup. It uses stats to track memory usage, simmilar to soft limits. It checks if we need to send event to userspace on every 100 page in/out. I guess it's good compromise between performance and accuracy of thresholds. [akpm@linux-foundation.org: coding-style fixes] [nishimura@mxp.nes.nec.co.jp: fix documentation merge issue] Signed-off-by: Kirill A. Shutemov <kirill@shutemov.name> Cc: Li Zefan <lizf@cn.fujitsu.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Balbir Singh <balbir@linux.vnet.ibm.com> Cc: Pavel Emelyanov <xemul@openvz.org> Cc: Dan Malek <dan@embeddedalley.com> Cc: Vladislav Buzov <vbuzov@embeddedalley.com> Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Cc: Alexander Shishkin <virtuoso@slind.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-03-12memcg: rework usage of stats by soft limitKirill A. Shutemov
Instead of incrementing counter on each page in/out and comparing it with constant, we set counter to constant, decrement counter on each page in/out and compare it with zero. We want to make comparing as fast as possible. On many RISC systems (probably not only RISC) comparing with zero is more effective than comparing with a constant, since not every constant can be immediate operand for compare instruction. Also, I've renamed MEM_CGROUP_STAT_EVENTS to MEM_CGROUP_STAT_SOFTLIMIT, since really it's not a generic counter. Signed-off-by: Kirill A. Shutemov <kirill@shutemov.name> Cc: Li Zefan <lizf@cn.fujitsu.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Balbir Singh <balbir@linux.vnet.ibm.com> Cc: Pavel Emelyanov <xemul@openvz.org> Cc: Dan Malek <dan@embeddedalley.com> Cc: Vladislav Buzov <vbuzov@embeddedalley.com> Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Cc: Alexander Shishkin <virtuoso@slind.org> Cc: Davide Libenzi <davidel@xmailserver.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-03-12memcg: extract mem_group_usage() from mem_cgroup_read()Kirill A. Shutemov
Helper to get memory or mem+swap usage of the cgroup. Signed-off-by: Kirill A. Shutemov <kirill@shutemov.name> Acked-by: Balbir Singh <balbir@linux.vnet.ibm.com> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Li Zefan <lizf@cn.fujitsu.com> Cc: Pavel Emelyanov <xemul@openvz.org> Cc: Dan Malek <dan@embeddedalley.com> Cc: Vladislav Buzov <vbuzov@embeddedalley.com> Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Cc: Alexander Shishkin <virtuoso@slind.org> Cc: Davide Libenzi <davidel@xmailserver.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-03-12memcg: improve performance in moving swap chargeDaisuke Nishimura
Try to reduce overheads in moving swap charge by: - Adds a new function(__mem_cgroup_put), which takes "count" as a arg and decrement mem->refcnt by "count". - Removed res_counter_uncharge, css_put, and mem_cgroup_put from the path of moving swap account, and consolidate all of them into mem_cgroup_clear_mc. We cannot do that about mc.to->refcnt. These changes reduces the overhead from 1.35sec to 0.9sec to move charges of 1G anonymous memory(including 500MB swap) in my test environment. Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Cc: Balbir Singh <balbir@linux.vnet.ibm.com> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Li Zefan <lizf@cn.fujitsu.com> Cc: Paul Menage <menage@google.com> Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-03-12memcg: move charges of anonymous swapDaisuke Nishimura
This patch is another core part of this move-charge-at-task-migration feature. It enables moving charges of anonymous swaps. To move the charge of swap, we need to exchange swap_cgroup's record. In current implementation, swap_cgroup's record is protected by: - page lock: if the entry is on swap cache. - swap_lock: if the entry is not on swap cache. This works well in usual swap-in/out activity. But this behavior make the feature of moving swap charge check many conditions to exchange swap_cgroup's record safely. So I changed modification of swap_cgroup's recored(swap_cgroup_record()) to use xchg, and define a new function to cmpxchg swap_cgroup's record. This patch also enables moving charge of non pte_present but not uncharged swap caches, which can be exist on swap-out path, by getting the target pages via find_get_page() as do_mincore() does. [kosaki.motohiro@jp.fujitsu.com: fix ia64 build] [akpm@linux-foundation.org: fix typos] Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Cc: Balbir Singh <balbir@linux.vnet.ibm.com> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Li Zefan <lizf@cn.fujitsu.com> Cc: Paul Menage <menage@google.com> Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-03-12memcg: avoid oom during moving chargeDaisuke Nishimura
This move-charge-at-task-migration feature has extra charges on "to"(pre-charges) and "from"(left-over charges) during moving charge. This means unnecessary oom can happen. This patch tries to avoid such oom. Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Cc: Balbir Singh <balbir@linux.vnet.ibm.com> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Li Zefan <lizf@cn.fujitsu.com> Cc: Paul Menage <menage@google.com> Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-03-12memcg: improve performance in moving chargeDaisuke Nishimura
Try to reduce overheads in moving charge by: - Instead of calling res_counter_uncharge() against the old cgroup in __mem_cgroup_move_account() everytime, call res_counter_uncharge() at the end of task migration once. - removed css_get(&to->css) from __mem_cgroup_move_account() because callers should have already called css_get(). And removed css_put(&to->css) too, which was called by callers of move_account on success of move_account. - Instead of calling __mem_cgroup_try_charge(), i.e. res_counter_charge(), repeatedly, call res_counter_charge(PAGE_SIZE * count) in can_attach() if possible. - Instead of calling css_get()/css_put() repeatedly, make use of coalesce __css_get()/__css_put() if possible. These changes reduces the overhead from 1.7sec to 0.6sec to move charges of 1G anonymous memory in my test environment. Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Cc: Balbir Singh <balbir@linux.vnet.ibm.com> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Li Zefan <lizf@cn.fujitsu.com> Cc: Paul Menage <menage@google.com> Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-03-12memcg: move charges of anonymous pageDaisuke Nishimura
This patch is the core part of this move-charge-at-task-migration feature. It implements functions to move charges of anonymous pages mapped only by the target task. Implementation: - define struct move_charge_struct and a valuable of it(mc) to remember the count of pre-charges and other information. - At can_attach(), get anon_rss of the target mm, call __mem_cgroup_try_charge() repeatedly and count up mc.precharge. - At attach(), parse the page table, find a target page to be move, and call mem_cgroup_move_account() about the page. - Cancel all precharges if mc.precharge > 0 on failure or at the end of task move. [akpm@linux-foundation.org: a little simplification] Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Cc: Balbir Singh <balbir@linux.vnet.ibm.com> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Li Zefan <lizf@cn.fujitsu.com> Cc: Paul Menage <menage@google.com> Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-03-12memcg: add interface to move charge at task migrationDaisuke Nishimura
In current memcg, charges associated with a task aren't moved to the new cgroup at task migration. Some users feel this behavior to be strange. These patches are for this feature, that is, for charging to the new cgroup and, of course, uncharging from the old cgroup at task migration. This patch adds "memory.move_charge_at_immigrate" file, which is a flag file to determine whether charges should be moved to the new cgroup at task migration or not and what type of charges should be moved. This patch also adds read and write handlers of the file. This patch also adds no-op handlers for this feature. These handlers will be implemented in later patches. And you cannot write any values other than 0 to move_charge_at_immigrate yet. Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Cc: Balbir Singh <balbir@linux.vnet.ibm.com> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Li Zefan <lizf@cn.fujitsu.com> Cc: Paul Menage <menage@google.com> Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-03-06mm/memcontrol.c: fix "integer as NULL pointer" sparse warningThiago Farina
mm/memcontrol.c:2548:32: warning: Using plain integer as NULL pointer Signed-off-by: Thiago Farina <tfransosi@gmail.com> Acked-by: Balbir Singh <balbir@linux.vnet.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-01-16memcg: ensure list is empty at rmdirDaisuke Nishimura
Current mem_cgroup_force_empty() only ensures mem->res.usage == 0 on success. But this doesn't guarantee memcg's LRU is really empty, because there are some cases in which !PageCgrupUsed pages exist on memcg's LRU. For example: - Pages can be uncharged by its owner process while they are on LRU. - race between mem_cgroup_add_lru_list() and __mem_cgroup_uncharge_common(). So there can be a case in which the usage is zero but some of the LRUs are not empty. OTOH, mem_cgroup_del_lru_list(), which can be called asynchronously with rmdir, accesses the mem_cgroup, so this access can cause a problem if it races with rmdir because the mem_cgroup might have been freed by rmdir. Actually, I saw a bug which seems to be caused by this race. [1530745.949906] BUG: unable to handle kernel NULL pointer dereference at 0000000000000230 [1530745.950651] IP: [<ffffffff810fbc11>] mem_cgroup_del_lru_list+0x30/0x80 [1530745.950651] PGD 3863de067 PUD 3862c7067 PMD 0 [1530745.950651] Oops: 0002 [#1] SMP [1530745.950651] last sysfs file: /sys/devices/system/cpu/cpu7/cache/index1/shared_cpu_map [1530745.950651] CPU 3 [1530745.950651] Modules linked in: configs ipt_REJECT xt_tcpudp iptable_filter ip_tables x_tables bridge stp nfsd nfs_acl auth_rpcgss exportfs autofs4 hidp rfcomm l2cap crc16 bluetooth lockd sunrpc ib_iser rdma_cm ib_cm iw_cm ib_sa ib_mad ib_core ib_addr iscsi_tcp bnx2i cnic uio ipv6 cxgb3i cxgb3 mdio libiscsi_tcp libiscsi scsi_transport_iscsi dm_mirror dm_multipath scsi_dh video output sbs sbshc battery ac lp kvm_intel kvm sg ide_cd_mod cdrom serio_raw tpm_tis tpm tpm_bios acpi_memhotplug button parport_pc parport rtc_cmos rtc_core rtc_lib e1000 i2c_i801 i2c_core pcspkr dm_region_hash dm_log dm_mod ata_piix libata shpchp megaraid_mbox sd_mod scsi_mod megaraid_mm ext3 jbd uhci_hcd ohci_hcd ehci_hcd [last unloaded: freq_table] [1530745.950651] Pid: 19653, comm: shmem_test_02 Tainted: G M 2.6.32-mm1-00701-g2b04386 #3 Express5800/140Rd-4 [N8100-1065] [1530745.950651] RIP: 0010:[<ffffffff810fbc11>] [<ffffffff810fbc11>] mem_cgroup_del_lru_list+0x30/0x80 [1530745.950651] RSP: 0018:ffff8803863ddcb8 EFLAGS: 00010002 [1530745.950651] RAX: 00000000000001e0 RBX: ffff8803abc02238 RCX: 00000000000001e0 [1530745.950651] RDX: 0000000000000000 RSI: ffff88038611a000 RDI: ffff8803abc02238 [1530745.950651] RBP: ffff8803863ddcc8 R08: 0000000000000002 R09: ffff8803a04c8643 [1530745.950651] R10: 0000000000000000 R11: ffffffff810c7333 R12: 0000000000000000 [1530745.950651] R13: ffff880000017f00 R14: 0000000000000092 R15: ffff8800179d0310 [1530745.950651] FS: 0000000000000000(0000) GS:ffff880017800000(0000) knlGS:0000000000000000 [1530745.950651] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b [1530745.950651] CR2: 0000000000000230 CR3: 0000000379d87000 CR4: 00000000000006e0 [1530745.950651] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [1530745.950651] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 [1530745.950651] Process shmem_test_02 (pid: 19653, threadinfo ffff8803863dc000, task ffff88038612a8a0) [1530745.950651] Stack: [1530745.950651] ffffea00040c2fe8 0000000000000000 ffff8803863ddd98 ffffffff810c739a [1530745.950651] <0> 00000000863ddd18 000000000000000c 0000000000000000 0000000000000000 [1530745.950651] <0> 0000000000000002 0000000000000000 ffff8803863ddd68 0000000000000046 [1530745.950651] Call Trace: [1530745.950651] [<ffffffff810c739a>] release_pages+0x142/0x1e7 [1530745.950651] [<ffffffff810c778f>] ? pagevec_move_tail+0x6e/0x112 [1530745.950651] [<ffffffff810c781e>] pagevec_move_tail+0xfd/0x112 [1530745.950651] [<ffffffff810c78a9>] lru_add_drain+0x76/0x94 [1530745.950651] [<ffffffff810dba0c>] exit_mmap+0x6e/0x145 [1530745.950651] [<ffffffff8103f52d>] mmput+0x5e/0xcf [1530745.950651] [<ffffffff81043ea8>] exit_mm+0x11c/0x129 [1530745.950651] [<ffffffff8108fb29>] ? audit_free+0x196/0x1c9 [1530745.950651] [<ffffffff81045353>] do_exit+0x1f5/0x6b7 [1530745.950651] [<ffffffff8106133f>] ? up_read+0x2b/0x2f [1530745.950651] [<ffffffff8137d187>] ? lockdep_sys_exit_thunk+0x35/0x67 [1530745.950651] [<ffffffff81045898>] do_group_exit+0x83/0xb0 [1530745.950651] [<ffffffff810458dc>] sys_exit_group+0x17/0x1b [1530745.950651] [<ffffffff81002c1b>] system_call_fastpath+0x16/0x1b [1530745.950651] Code: 54 53 0f 1f 44 00 00 83 3d cc 29 7c 00 00 41 89 f4 75 63 eb 4e 48 83 7b 08 00 75 04 0f 0b eb fe 48 89 df e8 18 f3 ff ff 44 89 e2 <48> ff 4c d0 50 48 8b 05 2b 2d 7c 00 48 39 43 08 74 39 48 8b 4b [1530745.950651] RIP [<ffffffff810fbc11>] mem_cgroup_del_lru_list+0x30/0x80 [1530745.950651] RSP <ffff8803863ddcb8> [1530745.950651] CR2: 0000000000000230 [1530745.950651] ---[ end trace c3419c1bb8acc34f ]--- [1530745.950651] Fixing recursive fault but reboot is needed! The problem here is pages on LRU may contain pointer to stale memcg. To make res->usage to be 0, all pages on memcg must be uncharged or moved to another(parent) memcg. Moved page_cgroup have already removed from original LRU, but uncharged page_cgroup contains pointer to memcg withou PCG_USED bit. (This asynchronous LRU work is for improving performance.) If PCG_USED bit is not set, page_cgroup will never be added to memcg's LRU. So, about pages not on LRU, they never access stale pointer. Then, what we have to take care of is page_cgroup _on_ LRU list. This patch fixes this problem by making mem_cgroup_force_empty() visit all LRUs before exiting its loop and guarantee there are no pages on its LRU. Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Balbir Singh <balbir@linux.vnet.ibm.com> Cc: <stable@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-12-16Merge branch 'hwpoison' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/ak/linux-mce-2.6 * 'hwpoison' of git://git.kernel.org/pub/scm/linux/kernel/git/ak/linux-mce-2.6: (34 commits) HWPOISON: Remove stray phrase in a comment HWPOISON: Try to allocate migration page on the same node HWPOISON: Don't do early filtering if filter is disabled HWPOISON: Add a madvise() injector for soft page offlining HWPOISON: Add soft page offline support HWPOISON: Undefine short-hand macros after use to avoid namespace conflict HWPOISON: Use new shake_page in memory_failure HWPOISON: Use correct name for MADV_HWPOISON in documentation HWPOISON: mention HWPoison in Kconfig entry HWPOISON: Use get_user_page_fast in hwpoison madvise HWPOISON: add an interface to switch off/on all the page filters HWPOISON: add memory cgroup filter memcg: add accessor to mem_cgroup.css memcg: rename and export try_get_mem_cgroup_from_page() HWPOISON: add page flags filter mm: export stable page flags HWPOISON: limit hwpoison injector to known page types HWPOISON: add fs/device filters HWPOISON: return 0 to indicate success reliably HWPOISON: make semantics of IGNORED/DELAYED clear ...
2009-12-16memcg: code clean, remove unused variable in mem_cgroup_resize_limit()Bob Liu
Variable `progress' isn't used in mem_cgroup_resize_limit() any more. Remove it. [akpm@linux-foundation.org: cleanup] Signed-off-by: Bob Liu <lliubbo@gmail.com> Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Reviewed-by: Balbir Singh <balbir@linux.vnet.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-12-16memcg: remove memcg_tasklistDaisuke Nishimura
memcg_tasklist was introduced at commit 7f4d454d(memcg: avoid deadlock caused by race between oom and cpuset_attach) instead of cgroup_mutex to fix a deadlock problem. The cgroup_mutex, which was removed by the commit, in mem_cgroup_out_of_memory() was originally introduced at commit c7ba5c9e (Memory controller: OOM handling). IIUC, the intention of this cgroup_mutex was to prevent task move during select_bad_process() so that situations like below can be avoided. Assume cgroup "foo" has exceeded its limit and is about to trigger oom. 1. Process A, which has been in cgroup "baa" and uses large memory, is just moved to cgroup "foo". Process A can be the candidates for being killed. 2. Process B, which has been in cgroup "foo" and uses large memory, is just moved from cgroup "foo". Process B can be excluded from the candidates for being killed. But these race window exists anyway even if we hold a lock, because __mem_cgroup_try_charge() decides wether it should trigger oom or not outside of the lock. So the original cgroup_mutex in mem_cgroup_out_of_memory and thus current memcg_tasklist has no use. And IMHO, those races are not so critical for users. This patch removes it and make codes simpler. Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Cc: Balbir Singh <balbir@linux.vnet.ibm.com> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-12-16memcg: avoid oom-killing innocent task in case of use_hierarchyDaisuke Nishimura
task_in_mem_cgroup(), which is called by select_bad_process() to check whether a task can be a candidate for being oom-killed from memcg's limit, checks "curr->use_hierarchy"("curr" is the mem_cgroup the task belongs to). But this check return true(it's false positive) when: <some path>/aa use_hierarchy == 0 <- hitting limit <some path>/aa/00 use_hierarchy == 1 <- the task belongs to This leads to killing an innocent task in aa/00. This patch is a fix for this bug. And this patch also fixes the arg for mem_cgroup_print_oom_info(). We should print information of mem_cgroup which the task being killed, not current, belongs to. Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Acked-by: Balbir Singh <balbir@linux.vnet.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-12-16memcg: cleanup mem_cgroup_move_parent()Daisuke Nishimura
mem_cgroup_move_parent() calls try_charge first and cancel_charge on failure. IMHO, charge/uncharge(especially charge) is high cost operation, so we should avoid it as far as possible. This patch tries to delay try_charge in mem_cgroup_move_parent() by re-ordering checks it does. And this patch renames mem_cgroup_move_account() to __mem_cgroup_move_account(), changes the return value of __mem_cgroup_move_account() from int to void, and adds a new wrapper(mem_cgroup_move_account()), which checks whether a @pc is valid for moving account and calls __mem_cgroup_move_account(). This patch removes the last caller of trylock_page_cgroup(), so removes its definition too. Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Balbir Singh <balbir@linux.vnet.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-12-16memcg: add mem_cgroup_cancel_charge()Daisuke Nishimura
There are some places calling both res_counter_uncharge() and css_put() to cancel the charge and the refcnt we have got by mem_cgroup_tyr_charge(). This patch introduces mem_cgroup_cancel_charge() and call it in those places. Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Reviewed-by: Balbir Singh <balbir@linux.vnet.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-12-16memcg: make memcg's file mapped consistent with global VMKAMEZAWA Hiroyuki
In global VM, FILE_MAPPED is used but memcg uses MAPPED_FILE. This makes grep difficult. Replace memcg's MAPPED_FILE with FILE_MAPPED And in global VM, mapped shared memory is accounted into FILE_MAPPED. But memcg doesn't. fix it. Note: page_is_file_cache() just checks SwapBacked or not. So, we need to check PageAnon. Cc: Balbir Singh <balbir@in.ibm.com> Reviewed-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-12-16memcg: coalesce charging via percpu storageKAMEZAWA Hiroyuki
This is a patch for coalescing access to res_counter at charging by percpu caching. At charge, memcg charges 64pages and remember it in percpu cache. Because it's cache, drain/flush if necessary. This version uses public percpu area. 2 benefits for using public percpu area. 1. Sum of stocked charge in the system is limited to # of cpus not to the number of memcg. This shows better synchonization. 2. drain code for flush/cpuhotplug is very easy (and quick) The most important point of this patch is that we never touch res_counter in fast path. The res_counter is system-wide shared counter which is modified very frequently. We shouldn't touch it as far as we can for avoiding false sharing. On x86-64 8cpu server, I tested overheads of memcg at page fault by running a program which does map/fault/unmap in a loop. Running a task per a cpu by taskset and see sum of the number of page faults in 60secs. [without memcg config] 40156968 page-faults # 0.085 M/sec ( +- 0.046% ) 27.67 cache-miss/faults [root cgroup] 36659599 page-faults # 0.077 M/sec ( +- 0.247% ) 31.58 cache miss/faults [in a child cgroup] 18444157 page-faults # 0.039 M/sec ( +- 0.133% ) 69.96 cache miss/faults [ + coalescing uncharge patch] 27133719 page-faults # 0.057 M/sec ( +- 0.155% ) 47.16 cache miss/faults [ + coalescing uncharge patch + this patch ] 34224709 page-faults # 0.072 M/sec ( +- 0.173% ) 34.69 cache miss/faults Changelog (since Oct/2): - updated comments - replaced get_cpu_var() with __get_cpu_var() if possible. - removed mutex for system-wide drain. adds a counter instead of it. - removed CONFIG_HOTPLUG_CPU Changelog (old): - rebased onto the latest mmotm - moved charge size check before __GFP_WAIT check for avoiding unnecesary - added asynchronous flush routine. - fixed bugs pointed out by Nishimura-san. [akpm@linux-foundation.org: tweak comments] [nishimura@mxp.nes.nec.co.jp: don't do INIT_WORK() repeatedly against the same work_struct] Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Balbir Singh <balbir@in.ibm.com> Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-12-16memcg: coalesce uncharge during unmap/truncateKAMEZAWA Hiroyuki
In massive parallel enviroment, res_counter can be a performance bottleneck. One strong techinque to reduce lock contention is reducing calls by coalescing some amount of calls into one. Considering charge/uncharge chatacteristic, - charge is done one by one via demand-paging. - uncharge is done by - in chunk at munmap, truncate, exit, execve... - one by one via vmscan/paging. It seems we have a chance to coalesce uncharges for improving scalability at unmap/truncation. This patch is a for coalescing uncharge. For avoiding scattering memcg's structure to functions under /mm, this patch adds memcg batch uncharge information to the task. A reason for per-task batching is for making use of caller's context information. We do batched uncharge (deleyed uncharge) when truncation/unmap occurs but do direct uncharge when uncharge is called by memory reclaim (vmscan.c). The degree of coalescing depends on callers - at invalidate/trucate... pagevec size - at unmap ....ZAP_BLOCK_SIZE (memory itself will be freed in this degree.) Then, we'll not coalescing too much. On x86-64 8cpu server, I tested overheads of memcg at page fault by running a program which does map/fault/unmap in a loop. Running a task per a cpu by taskset and see sum of the number of page faults in 60secs. [without memcg config] 40156968 page-faults # 0.085 M/sec ( +- 0.046% ) 27.67 cache-miss/faults [root cgroup] 36659599 page-faults # 0.077 M/sec ( +- 0.247% ) 31.58 miss/faults [in a child cgroup] 18444157 page-faults # 0.039 M/sec ( +- 0.133% ) 69.96 miss/faults [child with this patch] 27133719 page-faults # 0.057 M/sec ( +- 0.155% ) 47.16 miss/faults We can see some amounts of improvement. (root cgroup doesn't affected by this patch) Another patch for "charge" will follow this and above will be improved more. Changelog(since 2009/10/02): - renamed filed of memcg_batch (as pages to bytes, memsw to memsw_bytes) - some clean up and commentary/description updates. - added initialize code to copy_process(). (possible bug fix) Changelog(old): - fixed !CONFIG_MEM_CGROUP case. - rebased onto the latest mmotm + softlimit fix patches. - unified patch for callers - added commetns. - make ->do_batch as bool. - removed css_get() at el. We don't need it. Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Balbir Singh <balbir@in.ibm.com> Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-12-16memcg: fix memory.memsw.usage_in_bytes for root cgroupKirill A. Shutemov
A memory cgroup has a memory.memsw.usage_in_bytes file. It shows the sum of the usage of pages and swapents in the cgroup. Presently the root cgroup's memsw.usage_in_bytes shows the wrong value - the number of swapents are not added. So take MEM_CGROUP_STAT_SWAPOUT into account. Signed-off-by: Kirill A. Shutemov <kirill@shutemov.name> Reviewed-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: <stable@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-12-16memcg: add accessor to mem_cgroup.cssWu Fengguang
So that an outside user can free the reference count grabbed by try_get_mem_cgroup_from_page(). CC: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> CC: Hugh Dickins <hugh.dickins@tiscali.co.uk> CC: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> CC: Balbir Singh <balbir@linux.vnet.ibm.com> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Wu Fengguang <fengguang.wu@intel.com> Signed-off-by: Andi Kleen <ak@linux.intel.com>
2009-12-16memcg: rename and export try_get_mem_cgroup_from_page()Wu Fengguang
So that the hwpoison injector can get mem_cgroup for arbitrary page and thus know whether it is owned by some mem_cgroup task(s). [AK: Merged with latest git tree] CC: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> CC: Hugh Dickins <hugh.dickins@tiscali.co.uk> CC: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> CC: Balbir Singh <balbir@linux.vnet.ibm.com> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Wu Fengguang <fengguang.wu@intel.com> Signed-off-by: Andi Kleen <ak@linux.intel.com>
2009-12-15ksm: mem cgroup charge swapin copyHugh Dickins
But ksm swapping does require one small change in mem cgroup handling. When do_swap_page()'s call to ksm_might_need_to_copy() does indeed substitute a duplicate page to accommodate a different anon_vma (or a the !PageSwapCache check in mem_cgroup_try_charge_swapin(). That was returning success without charging, on the assumption that pte_same() would fail after, which is not the case here. Originally I proposed that success, so that an unshrinkable mem cgroup at its limit would not fail unnecessarily; but that's a minor point, and there are plenty of other places where we may fail an overallocation which might later prove unnecessary. So just go ahead and do what all the other exceptions do: proceed to charge current mm. Signed-off-by: Hugh Dickins <hugh.dickins@tiscali.co.uk> Cc: Izik Eidus <ieidus@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Chris Wright <chrisw@redhat.com> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Acked-by: Balbir Singh <balbir@linux.vnet.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-12-04tree-wide: fix assorted typos all over the placeAndré Goddard Rosa
That is "success", "unknown", "through", "performance", "[re|un]mapping" , "access", "default", "reasonable", "[con]currently", "temperature" , "channel", "[un]used", "application", "example","hierarchy", "therefore" , "[over|under]flow", "contiguous", "threshold", "enough" and others. Signed-off-by: André Goddard Rosa <andre.goddard@gmail.com> Signed-off-by: Jiri Kosina <jkosina@suse.cz>
2009-11-09tree-wide: fix typos "aquire" -> "acquire", "cumsumed" -> "consumed"Uwe Kleine-König
This patch was generated by git grep -E -i -l '[Aa]quire' | xargs -r perl -p -i -e 's/([Aa])quire/$1cquire/' and the cumsumed was found by checking the diff for aquire. Signed-off-by: Uwe Kleine-König <u.kleine-koenig@pengutronix.de> Signed-off-by: Jiri Kosina <jkosina@suse.cz>
2009-10-01memcg: reduce check for softlimit excessKAMEZAWA Hiroyuki
In charge/uncharge/reclaim path, usage_in_excess is calculated repeatedly and it takes res_counter's spin_lock every time. This patch removes unnecessary calls for res_count_soft_limit_excess. Reviewed-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Paul Menage <menage@google.com> Cc: Li Zefan <lizf@cn.fujitsu.com> Cc: Balbir Singh <balbir@in.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-10-01memcg: some modification to softlimit under hierarchical memory reclaim.KAMEZAWA Hiroyuki
This patch clean up/fixes for memcg's uncharge soft limit path. Problems: Now, res_counter_charge()/uncharge() handles softlimit information at charge/uncharge and softlimit-check is done when event counter per memcg goes over limit. Now, event counter per memcg is updated only when memory usage is over soft limit. Here, considering hierarchical memcg management, ancesotors should be taken care of. Now, ancerstors(hierarchy) are handled in charge() but not in uncharge(). This is not good. Prolems: 1. memcg's event counter incremented only when softlimit hits. That's bad. It makes event counter hard to be reused for other purpose. 2. At uncharge, only the lowest level rescounter is handled. This is bug. Because ancesotor's event counter is not incremented, children should take care of them. 3. res_counter_uncharge()'s 3rd argument is NULL in most case. ops under res_counter->lock should be small. No "if" sentense is better. Fixes: * Removed soft_limit_xx poitner and checks in charge and uncharge. Do-check-only-when-necessary scheme works enough well without them. * make event-counter of memcg incremented at every charge/uncharge. (per-cpu area will be accessed soon anyway) * All ancestors are checked at soft-limit-check. This is necessary because ancesotor's event counter may never be modified. Then, they should be checked at the same time. Reviewed-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Paul Menage <menage@google.com> Cc: Li Zefan <lizf@cn.fujitsu.com> Cc: Balbir Singh <balbir@in.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-10-01memcg: fix refcnt going negativeKAMEZAWA Hiroyuki
__mem_cgroup_largest_soft_limit_node() returns a mem_cgroup_per_zone "mz" with incremnted mz->mem->css's refcnt. Then, the caller of this function has to call css_put(mz->mem->css). But, mz can be !NULL even if "not found" i.e. without css_get(). By this, css->refcnt will go down to minus. This may cause various things...one of results will be initite-loop in css_tryget() as this. INFO: RCU detected CPU 0 stall (t=10000 jiffies) sending NMI to all CPUs: NMI backtrace for cpu 0 CPU 0: <snip> <<EOE>> <IRQ> [<ffffffff810884bd>] trace_hardirqs_off+0xd/0x10 [<ffffffff8102a940>] flat_send_IPI_mask+0x90/0xb0 [<ffffffff8102a9c9>] flat_send_IPI_all+0x69/0x70 [<ffffffff81027372>] arch_trigger_all_cpu_backtrace+0x62/0xa0 [<ffffffff810bff8e>] __rcu_pending+0x7e/0x370 [<ffffffff810c02c7>] rcu_check_callbacks+0x47/0x130 [<ffffffff81063a26>] update_process_times+0x46/0x70 [<ffffffff81085930>] tick_sched_timer+0x60/0x160 [<ffffffff810858d0>] ? tick_sched_timer+0x0/0x160 [<ffffffff8107a03a>] __run_hrtimer+0xba/0x150 [<ffffffff8107a325>] hrtimer_interrupt+0xd5/0x1b0 [<ffffffff81426dfe>] ? trace_hardirqs_off_thunk+0x3a/0x3c [<ffffffff8142cacd>] smp_apic_timer_interrupt+0x6d/0x9b [<ffffffff8100cb33>] apic_timer_interrupt+0x13/0x20 <EOI> [<ffffffff811317b6>] ? mem_cgroup_walk_tree+0x156/0x180 [<ffffffff811316d3>] ? mem_cgroup_walk_tree+0x73/0x180 [<ffffffff81131692>] ? mem_cgroup_walk_tree+0x32/0x180 [<ffffffff81131a00>] ? mem_cgroup_get_local_stat+0x0/0x110 [<ffffffff81131d5b>] ? mem_control_stat_show+0x14b/0x330 [<ffffffff810a57fd>] ? cgroup_seqfile_show+0x3d/0x60 Above shows CPU0 caught in css_tryget()'s inifinite loop because of bad refcnt. This is a fix to set mz=NULL at the top of retry path. Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Acked-by: Paul Menage <menage@google.com> Cc: Li Zefan <lizf@cn.fujitsu.com> Cc: Balbir Singh <balbir@in.ibm.com> Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-09-24memcg: show swap usage in stat fileDaisuke Nishimura
We now count MEM_CGROUP_STAT_SWAPOUT, so we can show swap usage. It would be useful for users to show swap usage in memory.stat file, because they don't need calculate memsw.usage - res.usage to know swap usage. Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Reviewed-by: Balbir Singh <balbir@linux.vnet.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-09-24memcg: improve resource counter scalabilityBalbir Singh
Reduce the resource counter overhead (mostly spinlock) associated with the root cgroup. This is a part of the several patches to reduce mem cgroup overhead. I had posted other approaches earlier (including using percpu counters). Those patches will be a natural addition and will be added iteratively on top of these. The patch stops resource counter accounting for the root cgroup. The data for display is derived from the statisitcs we maintain via mem_cgroup_charge_statistics (which is more scalable). What happens today is that, we do double accounting, once using res_counter_charge() and once using memory_cgroup_charge_statistics(). For the root, since we don't implement limits any more, we don't need to track every charge via res_counter_charge() and check for limit being exceeded and reclaim. The main mem->res usage_in_bytes can be derived by summing the cache and rss usage data from memory statistics (MEM_CGROUP_STAT_RSS and MEM_CGROUP_STAT_CACHE). However, for memsw->res usage_in_bytes, we need additional data about swapped out memory. This patch adds a MEM_CGROUP_STAT_SWAPOUT and uses that along with MEM_CGROUP_STAT_RSS and MEM_CGROUP_STAT_CACHE to derive the memsw data. This data is computed recursively when hierarchy is enabled. The tests results I see on a 24 way show that 1. The lock contention disappears from /proc/lock_stats 2. The results of the test are comparable to running with cgroup_disable=memory. Here is a sample of my program runs Without Patch Performance counter stats for '/home/balbir/parallel_pagefault': 7192804.124144 task-clock-msecs # 23.937 CPUs 424691 context-switches # 0.000 M/sec 267 CPU-migrations # 0.000 M/sec 28498113 page-faults # 0.004 M/sec 5826093739340 cycles # 809.989 M/sec 408883496292 instructions # 0.070 IPC 7057079452 cache-references # 0.981 M/sec 3036086243 cache-misses # 0.422 M/sec 300.485365680 seconds time elapsed With cgroup_disable=memory Performance counter stats for '/home/balbir/parallel_pagefault': 7182183.546587 task-clock-msecs # 23.915 CPUs 425458 context-switches # 0.000 M/sec 203 CPU-migrations # 0.000 M/sec 92545093 page-faults # 0.013 M/sec 6034363609986 cycles # 840.185 M/sec 437204346785 instructions # 0.072 IPC 6636073192 cache-references # 0.924 M/sec 2358117732 cache-misses # 0.328 M/sec 300.320905827 seconds time elapsed With this patch applied Performance counter stats for '/home/balbir/parallel_pagefault': 7191619.223977 task-clock-msecs # 23.955 CPUs 422579 context-switches # 0.000 M/sec 88 CPU-migrations # 0.000 M/sec 91946060 page-faults # 0.013 M/sec 5957054385619 cycles # 828.333 M/sec 1058117350365 instructions # 0.178 IPC 9161776218 cache-references # 1.274 M/sec 1920494280 cache-misses # 0.267 M/sec 300.218764862 seconds time elapsed Data from Prarit (kernel compile with make -j64 on a 64 CPU/32G machine) For a single run Without patch real 27m8.988s user 87m24.916s sys 382m6.037s With patch real 4m18.607s user 84m58.943s sys 50m52.682s With config turned off real 4m54.972s user 90m13.456s sys 50m19.711s NOTE: The data looks counterintuitive due to the increased performance with the patch, even over the config being turned off. We probably need more runs, but so far all testing has shown that the patches definitely help. [akpm@linux-foundation.org: coding-style fixes] Signed-off-by: Balbir Singh <balbir@linux.vnet.ibm.com> Cc: Prarit Bhargava <prarit@redhat.com> Cc: Andi Kleen <andi@firstfloor.org> Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Reviewed-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Paul Menage <menage@google.com> Cc: Li Zefan <lizf@cn.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-09-24memory controller: soft limit reclaim on contentionBalbir Singh
Implement reclaim from groups over their soft limit Permit reclaim from memory cgroups on contention (via the direct reclaim path). memory cgroup soft limit reclaim finds the group that exceeds its soft limit by the largest number of pages and reclaims pages from it and then reinserts the cgroup into its correct place in the rbtree. Add additional checks to mem_cgroup_hierarchical_reclaim() to detect long loops in case all swap is turned off. The code has been refactored and the loop check (loop < 2) has been enhanced for soft limits. For soft limits, we try to do more targetted reclaim. Instead of bailing out after two loops, the routine now reclaims memory proportional to the size by which the soft limit is exceeded. The proportion has been empirically determined. [akpm@linux-foundation.org: build fix] [kamezawa.hiroyu@jp.fujitsu.com: fix softlimit css refcnt handling] [nishimura@mxp.nes.nec.co.jp: refcount of the "victim" should be decremented before exiting the loop] Signed-off-by: Balbir Singh <balbir@linux.vnet.ibm.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Li Zefan <lizf@cn.fujitsu.com> Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-09-24memory controller: soft limit refactor reclaim flagsBalbir Singh
Refactor mem_cgroup_hierarchical_reclaim() Refactor the arguments passed to mem_cgroup_hierarchical_reclaim() into flags, so that new parameters don't have to be passed as we make the reclaim routine more flexible Signed-off-by: Balbir Singh <balbir@linux.vnet.ibm.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Li Zefan <lizf@cn.fujitsu.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-09-24memory controller: soft limit organize cgroupsBalbir Singh
Organize cgroups over soft limit in a RB-Tree Introduce an RB-Tree for storing memory cgroups that are over their soft limit. The overall goal is to 1. Add a memory cgroup to the RB-Tree when the soft limit is exceeded. We are careful about updates, updates take place only after a particular time interval has passed 2. We remove the node from the RB-Tree when the usage goes below the soft limit The next set of patches will exploit the RB-Tree to get the group that is over its soft limit by the largest amount and reclaim from it, when we face memory contention. [hugh.dickins@tiscali.co.uk: CONFIG_CGROUP_MEM_RES_CTLR=y CONFIG_PREEMPT=y fails to boot] Signed-off-by: Balbir Singh <balbir@linux.vnet.ibm.com> Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Li Zefan <lizf@cn.fujitsu.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Hugh Dickins <hugh.dickins@tiscali.co.uk> Cc: Jiri Slaby <jirislaby@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-09-24memory controller: soft limit interfaceBalbir Singh
Add an interface to allow get/set of soft limits. Soft limits for memory plus swap controller (memsw) is currently not supported. Resource counters have been enhanced to support soft limits and new type RES_SOFT_LIMIT has been added. Unlike hard limits, soft limits can be directly set and do not need any reclaim or checks before setting them to a newer value. Kamezawa-San raised a question as to whether soft limit should belong to res_counter. Since all resources understand the basic concepts of hard and soft limits, it is justified to add soft limits here. Soft limits are a generic resource usage feature, even file system quotas support soft limits. Signed-off-by: Balbir Singh <balbir@linux.vnet.ibm.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Li Zefan <lizf@cn.fujitsu.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-09-24memcg: add comments explaining memory barriersKAMEZAWA Hiroyuki
Add comments for the reason of smp_wmb() in mem_cgroup_commit_charge(). [akpm@linux-foundation.org: coding-style fixes] Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Cc: Balbir Singh <balbir@linux.vnet.ibm.com> Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-09-24memcg: remove the overhead associated with the root cgroupBalbir Singh
Change the memory cgroup to remove the overhead associated with accounting all pages in the root cgroup. As a side-effect, we can no longer set a memory hard limit in the root cgroup. A new flag to track whether the page has been accounted or not has been added as well. Flags are now set atomically for page_cgroup, pcg_default_flags is now obsolete and removed. [akpm@linux-foundation.org: fix a few documentation glitches] Signed-off-by: Balbir Singh <balbir@linux.vnet.ibm.com> Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Cc: Li Zefan <lizf@cn.fujitsu.com> Cc: Paul Menage <menage@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-09-24cgroups: let ss->can_attach and ss->attach do whole threadgroups at a timeBen Blum
Alter the ss->can_attach and ss->attach functions to be able to deal with a whole threadgroup at a time, for use in cgroup_attach_proc. (This is a pre-patch to cgroup-procs-writable.patch.) Currently, new mode of the attach function can only tell the subsystem about the old cgroup of the threadgroup leader. No subsystem currently needs that information for each thread that's being moved, but if one were to be added (for example, one that counts tasks within a group) this bit would need to be reworked a bit to tell the subsystem the right information. [hidave.darkstar@gmail.com: fix build] Signed-off-by: Ben Blum <bblum@google.com> Signed-off-by: Paul Menage <menage@google.com> Acked-by: Li Zefan <lizf@cn.fujitsu.com> Reviewed-by: Matt Helsley <matthltc@us.ibm.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Ingo Molnar <mingo@elte.hu> Cc: Dave Young <hidave.darkstar@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-09-22mm: drop unneeded double negationsJohannes Weiner
Remove double negations where the operand is already boolean. Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Mel Gorman <mel@csn.ul.ie> Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-07-29cgroup avoid permanent sleep at rmdirKAMEZAWA Hiroyuki
After commit ec64f51545fffbc4cb968f0cea56341a4b07e85a ("cgroup: fix frequent -EBUSY at rmdir"), cgroup's rmdir (especially against memcg) doesn't return -EBUSY by temporary ref counts. That commit expects all refs after pre_destroy() is temporary but...it wasn't. Then, rmdir can wait permanently. This patch tries to fix that and change followings. - set CGRP_WAIT_ON_RMDIR flag before pre_destroy(). - clear CGRP_WAIT_ON_RMDIR flag when the subsys finds racy case. if there are sleeping ones, wakes them up. - rmdir() sleeps only when CGRP_WAIT_ON_RMDIR flag is set. Tested-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Reported-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Reviewed-by: Paul Menage <menage@google.com> Acked-by: Balbir Sigh <balbir@linux.vnet.ibm.com> Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-07-10Fix congestion_wait() sync/async vs read/write confusionJens Axboe
Commit 1faa16d22877f4839bd433547d770c676d1d964c accidentally broke the bdi congestion wait queue logic, causing us to wait on congestion for WRITE (== 1) when we really wanted BLK_RW_ASYNC (== 0) instead. Signed-off-by: Jens Axboe <jens.axboe@oracle.com>