From 08951e545918c1594434d000d88a7793e2452a9b Mon Sep 17 00:00:00 2001
From: Mel Gorman
Date: Fri, 8 Jul 2011 15:39:36 -0700
Subject: mm: vmscan: correct check for kswapd sleeping in sleeping_prematurely
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

During allocator-intensive workloads, kswapd will be woken frequently,
causing free memory to oscillate between the high and min watermark.
This is expected behaviour.

Unfortunately, if the highest zone is small, a problem occurs.  This
seems to happen most with recent Sandy Bridge laptops, but it is
probably a coincidence as some of these laptops just happen to have a
small Normal zone.  The reproduction case is almost always copying a
large file, during which kswapd pegs at 100% CPU until the file is
deleted or the cache is dropped.

The problem is mostly down to sleeping_prematurely() keeping kswapd
awake when the highest zone is small and unreclaimable, and is
compounded by the fact that we shrink slabs even when not shrinking
zones, causing a lot of time to be spent in shrinkers and a lot of
memory to be reclaimed.

Patch 1 corrects sleeping_prematurely() to check the zones matching the
classzone_idx instead of all zones.

Patch 2 avoids shrinking slab when we are not shrinking a zone.

Patch 3 notes that sleeping_prematurely() is checking lower zones
against a high classzone, which is not what allocators or
balance_pgdat() are doing, leading to an artificial belief that kswapd
should still be awake.

Patch 4 notes that when balance_pgdat() gives up on a high zone, the
decision is not communicated to sleeping_prematurely().

This problem affects 2.6.38.8 for certain and is expected to affect
2.6.39 and 3.0-rc4 as well.  If accepted, the patches need to go to
-stable to be picked up by distros; this series is against 3.0-rc4.
I've cc'd people that reported similar problems recently to see if they
still suffer from the problem and if this fixes it.

This patch: correct the check for kswapd sleeping in
sleeping_prematurely().

During allocator-intensive workloads, kswapd will be woken frequently,
causing free memory to oscillate between the high and min watermark.
This is expected behaviour.

A problem occurs if the highest zone is small.  balance_pgdat() only
considers unreclaimable zones when priority is DEF_PRIORITY, but
sleeping_prematurely() considers all zones.  It's possible for this
sequence to occur:

  1. kswapd wakes up and enters balance_pgdat()
  2. At DEF_PRIORITY, marks highest zone unreclaimable
  3. At DEF_PRIORITY-1, ignores highest zone setting end_zone
  4. At DEF_PRIORITY-1, calls shrink_slab freeing memory from highest
     zone, clearing all_unreclaimable.  Highest zone is still unbalanced
  5. kswapd returns and calls sleeping_prematurely
  6. sleeping_prematurely looks at *all* zones, not just the ones being
     considered by balance_pgdat.  The highest small zone has
     all_unreclaimable cleared but the zone is not balanced.
     all_zones_ok is false, so kswapd stays awake

This patch corrects the behaviour of sleeping_prematurely() to check
the zones balance_pgdat() checked.
Signed-off-by: Mel Gorman
Reported-by: Pádraig Brady
Tested-by: Pádraig Brady
Tested-by: Andrew Lutomirski
Acked-by: Rik van Riel
Reviewed-by: Minchan Kim
Reviewed-by: KOSAKI Motohiro
Cc: Johannes Weiner
Cc:
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
---
 mm/vmscan.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 4f49535d4cd..04c49fe781f 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2326,7 +2326,7 @@ static bool sleeping_prematurely(pg_data_t *pgdat, int order, long remaining,
                 return true;
 
         /* Check the watermark levels */
-        for (i = 0; i < pgdat->nr_zones; i++) {
+        for (i = 0; i <= classzone_idx; i++) {
                 struct zone *zone = pgdat->node_zones + i;
 
                 if (!populated_zone(zone))
--
cgit v1.2.3
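The whole fix above is the new bound on the zone loop.  As a minimal
userspace sketch of the idea (not kernel code; the toy_zone structure,
its fields and the balance test below are invented purely for
illustration), this shows why looping to pgdat->nr_zones can keep kswapd
awake on account of a zone that balance_pgdat() never tried to balance:

#include <stdbool.h>
#include <stdio.h>

/* Toy stand-in for struct zone: just enough state for a watermark test. */
struct toy_zone {
        bool populated;
        long free_pages;
        long high_wmark;
};

/*
 * Simplified sleeping_prematurely(): kswapd may sleep only if every
 * populated zone up to and including classzone_idx meets its high
 * watermark.  Zones above classzone_idx were not balanced by
 * balance_pgdat(), so they must not be consulted here.
 */
static bool may_sleep(struct toy_zone *zones, int nr_zones, int classzone_idx)
{
        for (int i = 0; i <= classzone_idx && i < nr_zones; i++) {
                if (!zones[i].populated)
                        continue;
                if (zones[i].free_pages < zones[i].high_wmark)
                        return false;   /* still unbalanced: stay awake */
        }
        return true;
}

int main(void)
{
        /* DMA and Normal are balanced; a tiny highest zone is not. */
        struct toy_zone zones[] = {
                { .populated = true, .free_pages = 500, .high_wmark = 100 },
                { .populated = true, .free_pages = 900, .high_wmark = 400 },
                { .populated = true, .free_pages = 10,  .high_wmark = 50  },
        };

        /* The wakeup was for classzone_idx == 1, so zone 2 is irrelevant. */
        printf("may sleep (classzone 1): %d\n", may_sleep(zones, 3, 1));
        /* Checking all zones, as the old code did, keeps kswapd awake. */
        printf("may sleep (all zones):   %d\n", may_sleep(zones, 3, 2));
        return 0;
}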
From d7868dae893c83c50c7824bc2bc75f93d114669f Mon Sep 17 00:00:00 2001
From: Mel Gorman
Date: Fri, 8 Jul 2011 15:39:38 -0700
Subject: mm: vmscan: do not apply pressure to slab if we are not applying pressure to zone
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

During allocator-intensive workloads, kswapd will be woken frequently,
causing free memory to oscillate between the high and min watermark.
This is expected behaviour.

When kswapd applies pressure to zones during node balancing, it checks
if the zone is above a high+balance_gap threshold.  If it is, it does
not apply pressure, but it still unconditionally shrinks slab on a
global basis, which is excessive.  In the event kswapd is being kept
awake due to a high small unreclaimable zone, it skips zone shrinking
but still calls shrink_slab().

Once pressure has been applied, the check for the zone being
unreclaimable is made before the check of whether all_unreclaimable
should be set.  This miss can cause has_under_min_watermark_zone to be
set due to an unreclaimable zone, preventing kswapd from backing off in
congestion_wait().

Signed-off-by: Mel Gorman
Reported-by: Pádraig Brady
Tested-by: Pádraig Brady
Tested-by: Andrew Lutomirski
Acked-by: Rik van Riel
Reviewed-by: Minchan Kim
Reviewed-by: KOSAKI Motohiro
Cc: Johannes Weiner
Cc:
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
---
 mm/vmscan.c | 23 +++++++++++++----------
 1 file changed, 13 insertions(+), 10 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 04c49fe781f..a0245861934 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2510,18 +2510,18 @@ loop_again:
                                         KSWAPD_ZONE_BALANCE_GAP_RATIO);
                 if (!zone_watermark_ok_safe(zone, order,
                                 high_wmark_pages(zone) + balance_gap,
-                                end_zone, 0))
+                                end_zone, 0)) {
                         shrink_zone(priority, zone, &sc);
-                reclaim_state->reclaimed_slab = 0;
-                nr_slab = shrink_slab(&shrink, sc.nr_scanned, lru_pages);
-                sc.nr_reclaimed += reclaim_state->reclaimed_slab;
-                total_scanned += sc.nr_scanned;
-                if (zone->all_unreclaimable)
-                        continue;
-                if (nr_slab == 0 &&
-                    !zone_reclaimable(zone))
-                        zone->all_unreclaimable = 1;
+                        reclaim_state->reclaimed_slab = 0;
+                        nr_slab = shrink_slab(&shrink, sc.nr_scanned, lru_pages);
+                        sc.nr_reclaimed += reclaim_state->reclaimed_slab;
+                        total_scanned += sc.nr_scanned;
+
+                        if (nr_slab == 0 && !zone_reclaimable(zone))
+                                zone->all_unreclaimable = 1;
+                }
+
                 /*
                  * If we've done a decent amount of scanning and
                  * the reclaim ratio is low, start doing writepage
@@ -2531,6 +2531,9 @@ loop_again:
                         total_scanned > sc.nr_reclaimed + sc.nr_reclaimed / 2)
                         sc.may_writepage = 1;
 
+                if (zone->all_unreclaimable)
+                        continue;
+
                 if (!zone_watermark_ok_safe(zone, order,
                                         high_wmark_pages(zone), end_zone, 0)) {
                         all_zones_ok = 0;
--
cgit v1.2.3
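The control-flow change above can be illustrated with a simplified
sketch (not the kernel function; the toy_* helpers below are
hypothetical stand-ins for shrink_zone(), shrink_slab() and
zone_reclaimable()): slab is only shrunk when the zone itself is below
the high+balance_gap threshold, and all_unreclaimable is decided inside
the same branch:

#include <stdbool.h>
#include <stdio.h>

struct toy_zone {
        long free_pages;
        long high_wmark;
        long balance_gap;
        bool all_unreclaimable;
};

/* Hypothetical stand-ins for shrink_zone()/shrink_slab(). */
static void toy_shrink_zone(struct toy_zone *z)      { (void)z; printf("shrink zone\n"); }
static long toy_shrink_slab(void)                    { printf("shrink slab\n"); return 1; }
static bool toy_zone_reclaimable(struct toy_zone *z) { return z->free_pages > 0; }

static void balance_one_zone(struct toy_zone *z)
{
        /* Only apply pressure if the zone is below high watermark + gap ... */
        if (z->free_pages < z->high_wmark + z->balance_gap) {
                long nr_slab;

                toy_shrink_zone(z);
                /* ... and only then pay for a global slab shrink as well. */
                nr_slab = toy_shrink_slab();

                if (nr_slab == 0 && !toy_zone_reclaimable(z))
                        z->all_unreclaimable = true;
        }
        /* A zone above the threshold no longer triggers shrink_slab(). */
}

int main(void)
{
        struct toy_zone low  = { .free_pages = 10,  .high_wmark = 50, .balance_gap = 5 };
        struct toy_zone high = { .free_pages = 900, .high_wmark = 50, .balance_gap = 5 };

        balance_one_zone(&low);         /* prints both messages */
        balance_one_zone(&high);        /* prints nothing */
        return 0;
}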
From da175d06b437093f93109ba9e5efbe44dfdf9409 Mon Sep 17 00:00:00 2001
From: Mel Gorman
Date: Fri, 8 Jul 2011 15:39:39 -0700
Subject: mm: vmscan: evaluate the watermarks against the correct classzone
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

When deciding if kswapd is sleeping prematurely, the classzone is taken
into account, but this is different from what balance_pgdat() and the
allocator are doing.  Specifically, the DMA zone will be checked based
on the classzone used when waking kswapd, which could be for a
GFP_KERNEL or GFP_HIGHMEM request.  The lowmem reserve limit kicks in,
the watermark is not met, and kswapd thinks it is sleeping prematurely,
keeping kswapd awake in error.

Signed-off-by: Mel Gorman
Reported-by: Pádraig Brady
Tested-by: Pádraig Brady
Tested-by: Andrew Lutomirski
Acked-by: Rik van Riel
Reviewed-by: Minchan Kim
Cc: KOSAKI Motohiro
Cc: Johannes Weiner
Cc:
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
---
 mm/vmscan.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index a0245861934..a51b3c9f05b 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2344,7 +2344,7 @@ static bool sleeping_prematurely(pg_data_t *pgdat, int order, long remaining,
                 }
 
                 if (!zone_watermark_ok_safe(zone, order, high_wmark_pages(zone),
-                                                        classzone_idx, 0))
+                                                        i, 0))
                         all_zones_ok = false;
                 else
                         balanced += zone->present_pages;
--
cgit v1.2.3
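A compact sketch of why the classzone matters (a userspace model; the
lowmem_reserve numbers and toy_zone fields below are made up for
illustration): the watermark test effectively adds the zone's lowmem
reserve for the classzone to the watermark, so judging a low zone
against a high classzone demands far more free memory than
balance_pgdat() ever aimed for.  Checking zone i against itself, as the
patch does, removes that mismatch:

#include <stdbool.h>
#include <stdio.h>

#define MAX_ZONES 3

struct toy_zone {
        long free_pages;
        long high_wmark;
        /* Pages this zone keeps back from allocations meant for higher zones. */
        long lowmem_reserve[MAX_ZONES];
};

/* Simplified zone_watermark_ok(): the reserve depends on the classzone index. */
static bool watermark_ok(const struct toy_zone *z, int classzone_idx)
{
        return z->free_pages >= z->high_wmark + z->lowmem_reserve[classzone_idx];
}

int main(void)
{
        /* A small DMA-like zone with a large reserve against highmem users. */
        struct toy_zone dma = {
                .free_pages = 300,
                .high_wmark = 100,
                .lowmem_reserve = { 0, 200, 800 },
        };

        /* Judged against itself (i == 0), the zone is balanced ... */
        printf("vs own classzone:  %d\n", watermark_ok(&dma, 0));
        /*
         * ... judged against the highest classzone it looks short, which
         * is what kept sleeping_prematurely() reporting "not balanced".
         */
        printf("vs high classzone: %d\n", watermark_ok(&dma, 2));
        return 0;
}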
From 215ddd6664ced067afca7eebd2d1eb83f064ff5a Mon Sep 17 00:00:00 2001
From: Mel Gorman
Date: Fri, 8 Jul 2011 15:39:40 -0700
Subject: mm: vmscan: only read new_classzone_idx from pgdat when reclaiming successfully
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

During allocator-intensive workloads, kswapd will be woken frequently,
causing free memory to oscillate between the high and min watermark.
This is expected behaviour.  Unfortunately, if the highest zone is
small, a problem occurs.

When balance_pgdat() returns, it may be at a lower classzone_idx than it
started at because the highest zone was unreclaimable.  Before checking
if it should go to sleep though, it checks pgdat->classzone_idx, which
when there is no other activity will be MAX_NR_ZONES-1.  It interprets
this as having been woken up while reclaiming, skips scheduling and
reclaims again.  As there is no useful reclaim work to do, it enters a
loop of shrinking slab, consuming loads of CPU until the highest zone
becomes reclaimable for a long period of time.

There are two problems here:

  1) If the returned classzone or order is lower, it'll continue
     reclaiming without scheduling.
  2) If the highest zone was marked unreclaimable but balance_pgdat()
     returns immediately at DEF_PRIORITY, the new lower classzone is not
     communicated back to kswapd() for sleeping.

This patch does two related things.  First, if the end_zone is
unreclaimable, this information is communicated back.  Second, if the
classzone or order was reduced due to failing to reclaim, new
information is not read from pgdat; instead an attempt is made to go to
sleep.  Because of this, it is also necessary that pgdat->classzone_idx
be initialised each time to pgdat->nr_zones - 1 to avoid re-reads being
interpreted as wakeups.

Signed-off-by: Mel Gorman
Reported-by: Pádraig Brady
Tested-by: Pádraig Brady
Tested-by: Andrew Lutomirski
Acked-by: Rik van Riel
Cc: Minchan Kim
Cc: KOSAKI Motohiro
Cc: Johannes Weiner
Cc:
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
---
 mm/vmscan.c | 34 +++++++++++++++++++++-------------
 1 file changed, 21 insertions(+), 13 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index a51b3c9f05b..5ed24b94c5e 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2451,7 +2451,6 @@ loop_again:
                         if (!zone_watermark_ok_safe(zone, order,
                                         high_wmark_pages(zone), 0, 0)) {
                                 end_zone = i;
-                                *classzone_idx = i;
                                 break;
                         }
                 }
@@ -2531,8 +2530,11 @@ loop_again:
                         total_scanned > sc.nr_reclaimed + sc.nr_reclaimed / 2)
                         sc.may_writepage = 1;
 
-                if (zone->all_unreclaimable)
+                if (zone->all_unreclaimable) {
+                        if (end_zone && end_zone == i)
+                                end_zone--;
                         continue;
+                }
 
                 if (!zone_watermark_ok_safe(zone, order,
                                         high_wmark_pages(zone), end_zone, 0)) {
                         all_zones_ok = 0;
@@ -2712,8 +2714,8 @@ static void kswapd_try_to_sleep(pg_data_t *pgdat, int order, int classzone_idx)
  */
 static int kswapd(void *p)
 {
-        unsigned long order;
-        int classzone_idx;
+        unsigned long order, new_order;
+        int classzone_idx, new_classzone_idx;
         pg_data_t *pgdat = (pg_data_t*)p;
         struct task_struct *tsk = current;
 
@@ -2743,17 +2745,23 @@ static int kswapd(void *p)
         tsk->flags |= PF_MEMALLOC | PF_SWAPWRITE | PF_KSWAPD;
         set_freezable();
 
-        order = 0;
-        classzone_idx = MAX_NR_ZONES - 1;
+        order = new_order = 0;
+        classzone_idx = new_classzone_idx = pgdat->nr_zones - 1;
         for ( ; ; ) {
-                unsigned long new_order;
-                int new_classzone_idx;
                 int ret;
 
-                new_order = pgdat->kswapd_max_order;
-                new_classzone_idx = pgdat->classzone_idx;
-                pgdat->kswapd_max_order = 0;
-                pgdat->classzone_idx = MAX_NR_ZONES - 1;
+                /*
+                 * If the last balance_pgdat was unsuccessful it's unlikely a
+                 * new request of a similar or harder type will succeed soon
+                 * so consider going to sleep on the basis we reclaimed at
+                 */
+                if (classzone_idx >= new_classzone_idx && order == new_order) {
+                        new_order = pgdat->kswapd_max_order;
+                        new_classzone_idx = pgdat->classzone_idx;
+                        pgdat->kswapd_max_order = 0;
+                        pgdat->classzone_idx = pgdat->nr_zones - 1;
+                }
+
                 if (order < new_order || classzone_idx > new_classzone_idx) {
                         /*
                          * Don't sleep if someone wants a larger 'order'
@@ -2766,7 +2774,7 @@ static int kswapd(void *p)
                         order = pgdat->kswapd_max_order;
                         classzone_idx = pgdat->classzone_idx;
                         pgdat->kswapd_max_order = 0;
-                        pgdat->classzone_idx = MAX_NR_ZONES - 1;
+                        pgdat->classzone_idx = pgdat->nr_zones - 1;
                 }
 
                 ret = try_to_freeze();
--
cgit v1.2.3
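The main-loop change can be modelled in a few lines (a sketch of the
decision only, with plain variables standing in for the pgdat fields;
names and values are invented): new wakeup information is consumed only
if the previous balancing pass achieved at least what was asked,
otherwise kswapd keeps the reduced order/classzone and tries to sleep:

#include <stdio.h>

/* Stand-ins for pgdat->kswapd_max_order / pgdat->classzone_idx. */
static int pending_order;
static int pending_classzone;
static const int nr_zones = 3;

/* Decide whether to pick up a new request or to try sleeping first. */
static void kswapd_iteration(int *order, int *classzone_idx,
                             int new_order, int new_classzone_idx)
{
        /*
         * Only re-read the pending request if the last balance_pgdat()
         * satisfied what was previously asked; a failed pass should lead
         * to sleep, not to an immediate retry at an even harder target.
         */
        if (*classzone_idx >= new_classzone_idx && *order == new_order) {
                new_order = pending_order;
                new_classzone_idx = pending_classzone;
                pending_order = 0;
                pending_classzone = nr_zones - 1;
        }

        if (*order < new_order || *classzone_idx > new_classzone_idx) {
                printf("harder request: reclaim again without sleeping\n");
                *order = new_order;
                *classzone_idx = new_classzone_idx;
        } else {
                printf("try to sleep\n");
        }
}

int main(void)
{
        int order = 0, classzone_idx = nr_zones - 1;

        pending_order = 0;
        pending_classzone = nr_zones - 1;
        /* Nothing new is pending and the last pass succeeded: go to sleep. */
        kswapd_iteration(&order, &classzone_idx, order, classzone_idx);
        return 0;
}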
From 0b43c3aab0137595335b08b340a3f3e5af9818a6 Mon Sep 17 00:00:00 2001
From: Shaohua Li
Date: Fri, 8 Jul 2011 15:39:41 -0700
Subject: mm: __tlb_remove_page() check the correct batch

__tlb_remove_page() switches to a new batch page, but still checks space
in the old batch.  This check always fails, and causes a forced tlb
flush.

Signed-off-by: Shaohua Li
Acked-by: Peter Zijlstra
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
---
 mm/memory.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/mm/memory.c b/mm/memory.c
index 40b7531ee8b..9b8a01d941c 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -305,6 +305,7 @@ int __tlb_remove_page(struct mmu_gather *tlb, struct page *page)
         if (batch->nr == batch->max) {
                 if (!tlb_next_batch(tlb))
                         return 0;
+                batch = tlb->active;
         }
         VM_BUG_ON(batch->nr > batch->max);
--
cgit v1.2.3
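The one-line fix is easier to see in a reduced model (not the real
mmu_gather API; the toy_* structures and sizes below are invented to
mirror the pattern): after switching to a fresh batch, the local pointer
must be refreshed before the capacity check and the store, otherwise the
full old batch keeps being used:

#include <stdbool.h>
#include <stdio.h>

#define BATCH_MAX 2

struct toy_batch {
        int nr;
        int max;
        void *pages[BATCH_MAX];
};

struct toy_gather {
        struct toy_batch *active;
        struct toy_batch first, second;
        bool used_second;
};

/* Switch tlb->active to a fresh batch; returns false if none is available. */
static bool toy_next_batch(struct toy_gather *tlb)
{
        if (tlb->used_second)
                return false;
        tlb->used_second = true;
        tlb->second.max = BATCH_MAX;
        tlb->active = &tlb->second;
        return true;
}

/* Returns 1 if the caller may keep batching, 0 if it must flush. */
static int toy_remove_page(struct toy_gather *tlb, void *page)
{
        struct toy_batch *batch = tlb->active;

        if (batch->nr == batch->max) {
                if (!toy_next_batch(tlb))
                        return 0;
                batch = tlb->active;    /* the fix: look at the new batch */
        }
        batch->pages[batch->nr++] = page;
        return batch->nr < batch->max;
}

int main(void)
{
        struct toy_gather tlb = { .active = &tlb.first };
        int dummy[4];

        tlb.first.max = BATCH_MAX;
        for (int i = 0; i < 4; i++)
                printf("page %d -> keep batching: %d\n", i,
                       toy_remove_page(&tlb, &dummy[i]));
        return 0;
}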
From 4d0c066d29f030d47d19678f8008933e67dd3b72 Mon Sep 17 00:00:00 2001
From: KAMEZAWA Hiroyuki
Date: Fri, 8 Jul 2011 15:39:42 -0700
Subject: memcg: fix reclaimable lru check in memcg

Now, in mem_cgroup_hierarchical_reclaim(), mem_cgroup_local_usage() is
used for checking whether the memcg contains reclaimable pages or not.
If there are no pages in it, the routine skips it.  But
mem_cgroup_local_usage() includes unevictable pages and cannot handle
the "noswap" condition correctly, so this does not work on a swapless
system.

This patch adds test_mem_cgroup_reclaimable() and replaces
mem_cgroup_local_usage().  test_mem_cgroup_reclaimable() looks at the
LRU counters and returns the correct answer to the caller.  This new
function has a "noswap" argument and can look at only the file LRU if
necessary.

[akpm@linux-foundation.org: coding-style fixes]
[akpm@linux-foundation.org: fix kerneldoc layout]
Signed-off-by: KAMEZAWA Hiroyuki
Cc: Michal Hocko
Cc: Ying Han
Cc: Johannes Weiner
Cc: Daisuke Nishimura
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
---
 mm/memcontrol.c | 107 ++++++++++++++++++++++++++++++++++++++++----------------
 1 file changed, 76 insertions(+), 31 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index ddffc74cdeb..a7a5cb1bf2c 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -577,15 +577,6 @@ static long mem_cgroup_read_stat(struct mem_cgroup *mem,
         return val;
 }
 
-static long mem_cgroup_local_usage(struct mem_cgroup *mem)
-{
-        long ret;
-
-        ret = mem_cgroup_read_stat(mem, MEM_CGROUP_STAT_RSS);
-        ret += mem_cgroup_read_stat(mem, MEM_CGROUP_STAT_CACHE);
-        return ret;
-}
-
 static void mem_cgroup_swap_statistics(struct mem_cgroup *mem,
                                          bool charge)
 {
@@ -1129,7 +1120,6 @@ unsigned long mem_cgroup_zone_nr_lru_pages(struct mem_cgroup *memcg,
         return MEM_CGROUP_ZSTAT(mz, lru);
 }
 
-#ifdef CONFIG_NUMA
 static unsigned long mem_cgroup_node_nr_file_lru_pages(struct mem_cgroup *memcg,
                                                         int nid)
 {
@@ -1141,6 +1131,17 @@ static unsigned long mem_cgroup_node_nr_file_lru_pages(struct mem_cgroup *memcg,
         return ret;
 }
 
+static unsigned long mem_cgroup_node_nr_anon_lru_pages(struct mem_cgroup *memcg,
+                                                        int nid)
+{
+        unsigned long ret;
+
+        ret = mem_cgroup_get_zonestat_node(memcg, nid, LRU_INACTIVE_ANON) +
+                mem_cgroup_get_zonestat_node(memcg, nid, LRU_ACTIVE_ANON);
+        return ret;
+}
+
+#if MAX_NUMNODES > 1
 static unsigned long mem_cgroup_nr_file_lru_pages(struct mem_cgroup *memcg)
 {
         u64 total = 0;
@@ -1152,17 +1153,6 @@ static unsigned long mem_cgroup_nr_file_lru_pages(struct mem_cgroup *memcg)
         return total;
 }
 
-static unsigned long mem_cgroup_node_nr_anon_lru_pages(struct mem_cgroup *memcg,
-                                                        int nid)
-{
-        unsigned long ret;
-
-        ret = mem_cgroup_get_zonestat_node(memcg, nid, LRU_INACTIVE_ANON) +
-                mem_cgroup_get_zonestat_node(memcg, nid, LRU_ACTIVE_ANON);
-
-        return ret;
-}
-
 static unsigned long mem_cgroup_nr_anon_lru_pages(struct mem_cgroup *memcg)
 {
         u64 total = 0;
@@ -1559,6 +1549,28 @@ mem_cgroup_select_victim(struct mem_cgroup *root_mem)
         return ret;
 }
 
+/**
+ * test_mem_cgroup_node_reclaimable
+ * @mem: the target memcg
+ * @nid: the node ID to be checked.
+ * @noswap : specify true here if the user wants flle only information.
+ *
+ * This function returns whether the specified memcg contains any
+ * reclaimable pages on a node. Returns true if there are any reclaimable
+ * pages in the node.
+ */
+static bool test_mem_cgroup_node_reclaimable(struct mem_cgroup *mem,
+                int nid, bool noswap)
+{
+        if (mem_cgroup_node_nr_file_lru_pages(mem, nid))
+                return true;
+        if (noswap || !total_swap_pages)
+                return false;
+        if (mem_cgroup_node_nr_anon_lru_pages(mem, nid))
+                return true;
+        return false;
+
+}
 #if MAX_NUMNODES > 1
 
 /*
@@ -1580,15 +1592,8 @@ static void mem_cgroup_may_update_nodemask(struct mem_cgroup *mem)
 
         for_each_node_mask(nid, node_states[N_HIGH_MEMORY]) {
 
-                if (mem_cgroup_get_zonestat_node(mem, nid, LRU_INACTIVE_FILE) ||
-                    mem_cgroup_get_zonestat_node(mem, nid, LRU_ACTIVE_FILE))
-                        continue;
-
-                if (total_swap_pages &&
-                    (mem_cgroup_get_zonestat_node(mem, nid, LRU_INACTIVE_ANON) ||
-                     mem_cgroup_get_zonestat_node(mem, nid, LRU_ACTIVE_ANON)))
-                        continue;
-                node_clear(nid, mem->scan_nodes);
+                if (!test_mem_cgroup_node_reclaimable(mem, nid, false))
+                        node_clear(nid, mem->scan_nodes);
         }
 }
 
@@ -1627,11 +1632,51 @@ int mem_cgroup_select_victim_node(struct mem_cgroup *mem)
         return node;
 }
 
+/*
+ * Check all nodes whether it contains reclaimable pages or not.
+ * For quick scan, we make use of scan_nodes. This will allow us to skip
+ * unused nodes. But scan_nodes is lazily updated and may not cotain
+ * enough new information. We need to do double check.
+ */
+bool mem_cgroup_reclaimable(struct mem_cgroup *mem, bool noswap)
+{
+        int nid;
+
+        /*
+         * quick check...making use of scan_node.
+         * We can skip unused nodes.
+         */
+        if (!nodes_empty(mem->scan_nodes)) {
+                for (nid = first_node(mem->scan_nodes);
+                     nid < MAX_NUMNODES;
+                     nid = next_node(nid, mem->scan_nodes)) {
+
+                        if (test_mem_cgroup_node_reclaimable(mem, nid, noswap))
+                                return true;
+                }
+        }
+        /*
+         * Check rest of nodes.
+         */
+        for_each_node_state(nid, N_HIGH_MEMORY) {
+                if (node_isset(nid, mem->scan_nodes))
+                        continue;
+                if (test_mem_cgroup_node_reclaimable(mem, nid, noswap))
+                        return true;
+        }
+        return false;
+}
+
 #else
 int mem_cgroup_select_victim_node(struct mem_cgroup *mem)
 {
         return 0;
 }
+
+bool mem_cgroup_reclaimable(struct mem_cgroup *mem, bool noswap)
+{
+        return test_mem_cgroup_node_reclaimable(mem, 0, noswap);
+}
 #endif
 
 /*
@@ -1702,7 +1747,7 @@ static int mem_cgroup_hierarchical_reclaim(struct mem_cgroup *root_mem,
                                 }
                         }
                 }
-                if (!mem_cgroup_local_usage(victim)) {
+                if (!mem_cgroup_reclaimable(victim, noswap)) {
                         /* this cgroup's local usage == 0 */
                         css_put(&victim->css);
                         continue;
--
cgit v1.2.3
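The replacement test can be sketched in plain C (a userspace model; the
toy_node_stat fields are simple counters rather than the kernel's
per-zone LRU statistics): file pages always count as reclaimable, anon
pages only when swap exists and the caller did not ask for a noswap
scan:

#include <stdbool.h>
#include <stdio.h>

/* Toy per-node LRU totals for one memcg. */
struct toy_node_stat {
        long nr_file;   /* inactive + active file */
        long nr_anon;   /* inactive + active anon */
};

static long total_swap_pages = 0;       /* pretend this is a swapless system */

/* Mirrors the shape of test_mem_cgroup_node_reclaimable(). */
static bool node_reclaimable(const struct toy_node_stat *st, bool noswap)
{
        if (st->nr_file)
                return true;
        if (noswap || !total_swap_pages)
                return false;   /* anon cannot be reclaimed without swap */
        if (st->nr_anon)
                return true;
        return false;
}

int main(void)
{
        struct toy_node_stat anon_only = { .nr_file = 0, .nr_anon = 1000 };

        /* The old usage-based check would have said "reclaimable" here. */
        printf("swapless, anon only: %d\n", node_reclaimable(&anon_only, false));

        total_swap_pages = 4096;
        printf("with swap:           %d\n", node_reclaimable(&anon_only, false));
        printf("with swap, noswap=1: %d\n", node_reclaimable(&anon_only, true));
        return 0;
}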
From 453a9bf347f1e22a5bb3605ced43b2366921221d Mon Sep 17 00:00:00 2001
From: KAMEZAWA Hiroyuki
Date: Fri, 8 Jul 2011 15:39:43 -0700
Subject: memcg: fix numa scan information update to be triggered by memory event

Commit 889976dbcb12 ("memcg: reclaim memory from nodes in round-robin
order") added a NUMA node round-robin for memcg, but the information is
only updated once per 10 seconds.

This patch changes the update trigger from jiffies to memcg's event
count.  After this patch, the NUMA scan information will be updated when
we see 1024 events of pagein/pageout under a memcg.

[akpm@linux-foundation.org: attempt to repair code layout]
Signed-off-by: KAMEZAWA Hiroyuki
Cc: Michal Hocko
Cc: Ying Han
Cc: Johannes Weiner
Cc: Daisuke Nishimura
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
---
 mm/memcontrol.c | 33 +++++++++++++++++++++++++++------
 1 file changed, 27 insertions(+), 6 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index a7a5cb1bf2c..e013b8e57d2 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -108,10 +108,12 @@ enum mem_cgroup_events_index {
 enum mem_cgroup_events_target {
         MEM_CGROUP_TARGET_THRESH,
         MEM_CGROUP_TARGET_SOFTLIMIT,
+        MEM_CGROUP_TARGET_NUMAINFO,
         MEM_CGROUP_NTARGETS,
 };
 #define THRESHOLDS_EVENTS_TARGET (128)
 #define SOFTLIMIT_EVENTS_TARGET (1024)
+#define NUMAINFO_EVENTS_TARGET (1024)
 
 struct mem_cgroup_stat_cpu {
         long count[MEM_CGROUP_STAT_NSTATS];
@@ -237,7 +239,8 @@ struct mem_cgroup {
         int last_scanned_node;
 #if MAX_NUMNODES > 1
         nodemask_t scan_nodes;
-        unsigned long next_scan_node_update;
+        atomic_t numainfo_events;
+        atomic_t numainfo_updating;
 #endif
         /*
          * Should the accounting and control be hierarchical, per subtree?
@@ -680,6 +683,9 @@ static void __mem_cgroup_target_update(struct mem_cgroup *mem, int target)
         case MEM_CGROUP_TARGET_SOFTLIMIT:
                 next = val + SOFTLIMIT_EVENTS_TARGET;
                 break;
+        case MEM_CGROUP_TARGET_NUMAINFO:
+                next = val + NUMAINFO_EVENTS_TARGET;
+                break;
         default:
                 return;
         }
@@ -698,11 +704,19 @@ static void memcg_check_events(struct mem_cgroup *mem, struct page *page)
                 mem_cgroup_threshold(mem);
                 __mem_cgroup_target_update(mem, MEM_CGROUP_TARGET_THRESH);
                 if (unlikely(__memcg_event_check(mem,
-                        MEM_CGROUP_TARGET_SOFTLIMIT))){
+                             MEM_CGROUP_TARGET_SOFTLIMIT))) {
                         mem_cgroup_update_tree(mem, page);
                         __mem_cgroup_target_update(mem,
-                                MEM_CGROUP_TARGET_SOFTLIMIT);
+                                                   MEM_CGROUP_TARGET_SOFTLIMIT);
+                }
+#if MAX_NUMNODES > 1
+                if (unlikely(__memcg_event_check(mem,
+                             MEM_CGROUP_TARGET_NUMAINFO))) {
+                        atomic_inc(&mem->numainfo_events);
+                        __mem_cgroup_target_update(mem,
+                                                   MEM_CGROUP_TARGET_NUMAINFO);
                 }
+#endif
         }
 }
@@ -1582,11 +1596,15 @@ static bool test_mem_cgroup_node_reclaimable(struct mem_cgroup *mem,
 static void mem_cgroup_may_update_nodemask(struct mem_cgroup *mem)
 {
         int nid;
-
-        if (time_after(mem->next_scan_node_update, jiffies))
+        /*
+         * numainfo_events > 0 means there was at least NUMAINFO_EVENTS_TARGET
+         * pagein/pageout changes since the last update.
+         */
+        if (!atomic_read(&mem->numainfo_events))
+                return;
+        if (atomic_inc_return(&mem->numainfo_updating) > 1)
                 return;
-        mem->next_scan_node_update = jiffies + 10*HZ;
 
         /* make a nodemask where this memcg uses memory from */
         mem->scan_nodes = node_states[N_HIGH_MEMORY];
@@ -1595,6 +1613,9 @@ static void mem_cgroup_may_update_nodemask(struct mem_cgroup *mem)
                 if (!test_mem_cgroup_node_reclaimable(mem, nid, false))
                         node_clear(nid, mem->scan_nodes);
         }
+
+        atomic_set(&mem->numainfo_events, 0);
+        atomic_set(&mem->numainfo_updating, 0);
 }
 
 /*
--
cgit v1.2.3
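The trigger change amounts to replacing a jiffies comparison with an
event counter plus a simple "already updating" guard.  A reduced sketch
(a userspace model using C11 atomics in place of the kernel's atomic_t;
function names are invented, the threshold mirrors
NUMAINFO_EVENTS_TARGET):

#include <stdatomic.h>
#include <stdio.h>

#define NUMAINFO_EVENTS_TARGET 1024

static atomic_int numainfo_events;      /* bumped after every 1024 page events */
static atomic_int numainfo_updating;    /* guards concurrent updates */

/* Called from the page-event accounting path once per event target. */
static void note_numainfo_event(void)
{
        atomic_fetch_add(&numainfo_events, 1);
}

/* Rebuild the scan nodemask only if enough events happened since last time. */
static void may_update_nodemask(void)
{
        if (atomic_load(&numainfo_events) == 0)
                return;                         /* nothing changed enough */
        if (atomic_fetch_add(&numainfo_updating, 1) > 0)
                return;                         /* someone else is updating */

        printf("rebuilding scan_nodes\n");      /* stands in for the real work */

        atomic_store(&numainfo_events, 0);
        atomic_store(&numainfo_updating, 0);
}

int main(void)
{
        may_update_nodemask();  /* no events yet: nothing happens */
        note_numainfo_event();  /* 1024 pagein/pageout events elapsed */
        may_update_nodemask();  /* now the nodemask is rebuilt */
        return 0;
}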
From 8f3b1327aa454bc8283e96bca7669c3c88b83f79 Mon Sep 17 00:00:00 2001
From: Bob Liu
Date: Fri, 8 Jul 2011 15:39:46 -0700
Subject: mm/nommu.c: fix remap_pfn_range()

remap_pfn_range() is supposed to map the physical address
pfn << PAGE_SHIFT to the user address addr.  The nommu implementation
instead did vma->vm_start = pfn << PAGE_SHIFT, which is wrong according
to the original meaning of this function, and a driver developer using
remap_pfn_range() with correct parameters will get an unexpected result
because vm_start is changed.  It should be implemented like
addr = pfn << PAGE_SHIFT, but that is meaningless on a nommu arch, so
this patch just makes it return.

The parameter names and the setting of vma->vm_flags are also fixed.

Signed-off-by: Bob Liu
Cc: Geert Uytterhoeven
Cc: David Howells
Acked-by: Greg Ungerer
Cc: Mike Frysinger
Cc: Bob Liu
Cc:
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
---
 mm/nommu.c | 9 ++++++---
 1 file changed, 6 insertions(+), 3 deletions(-)

diff --git a/mm/nommu.c b/mm/nommu.c
index 1fd0c51b10a..9edc897a397 100644
--- a/mm/nommu.c
+++ b/mm/nommu.c
@@ -1813,10 +1813,13 @@ struct page *follow_page(struct vm_area_struct *vma, unsigned long address,
         return NULL;
 }
 
-int remap_pfn_range(struct vm_area_struct *vma, unsigned long from,
-        unsigned long to, unsigned long size, pgprot_t prot)
+int remap_pfn_range(struct vm_area_struct *vma, unsigned long addr,
+                unsigned long pfn, unsigned long size, pgprot_t prot)
 {
-        vma->vm_start = vma->vm_pgoff << PAGE_SHIFT;
+        if (addr != (pfn << PAGE_SHIFT))
+                return -EINVAL;
+
+        vma->vm_flags |= VM_IO | VM_RESERVED | VM_PFNMAP;
         return 0;
 }
 EXPORT_SYMBOL(remap_pfn_range);
--
cgit v1.2.3
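The nommu rule the patch enforces can be shown in isolation (a sketch
with a minimal vma stand-in; the toy_* names, flag values and the
addresses in main() are invented, not the kernel prototypes): on a
machine without an MMU the user address must already equal the physical
address, so anything else is an error and nothing needs remapping:

#include <stdio.h>

#define PAGE_SHIFT 12

/* Minimal stand-ins for the flag bits and vma used in the illustration. */
#define VM_IO           0x1UL
#define VM_RESERVED     0x2UL
#define VM_PFNMAP       0x4UL

struct toy_vma {
        unsigned long vm_start;
        unsigned long vm_flags;
};

/* nommu remap_pfn_range(): addr must already be pfn << PAGE_SHIFT. */
static int toy_remap_pfn_range(struct toy_vma *vma, unsigned long addr,
                               unsigned long pfn, unsigned long size)
{
        (void)size;
        if (addr != (pfn << PAGE_SHIFT))
                return -22;     /* -EINVAL: nothing can be remapped */

        /* Record what kind of mapping this is, but never rewrite vm_start. */
        vma->vm_flags |= VM_IO | VM_RESERVED | VM_PFNMAP;
        return 0;
}

int main(void)
{
        struct toy_vma vma = { .vm_start = 0x100000, .vm_flags = 0 };

        printf("matching addr/pfn: %d\n",
               toy_remap_pfn_range(&vma, 0x100000, 0x100, 0x1000));
        printf("mismatched:        %d\n",
               toy_remap_pfn_range(&vma, 0x200000, 0x100, 0x1000));
        return 0;
}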