summaryrefslogtreecommitdiff
path: root/include
AgeCommit message (Collapse)Author
2011-01-14Merge git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi-post-merge-2.6Linus Torvalds
* git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi-post-merge-2.6: [SCSI] target: Add LIO target core v4.0.0-rc6 [SCSI] sd,sr: kill compat SDEV_MEDIA_CHANGE event [SCSI] sd: implement sd_check_events()
2011-01-14Revert "drm: Update fbdev fb_fix_screeninfo"Dave Airlie
This reverts commit dfe63bb0ad9810db13aab0058caba97866e0a681. This commit was causing nouveau not to work properly, for -rc1 I'd prefer it worked and we can look if this is useful for 2.6.39. Cc: James Simmons <jsimmons@infradead.org> Signed-off-by: Dave Airlie <airlied@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-01-14PCI / ACPI: Fix build issue in pci_root.c for !CONFIG_PCIEPORTBUSMarkus Trippelsdorf
The compilation of drivers/acpi/pci_root.c fails if CONFIG_PCIEPORTBUS is unset. Fix the problem. Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-01-14Merge branch 'for-linus' of git://git.kernel.dk/linux-2.6-blockLinus Torvalds
* 'for-linus' of git://git.kernel.dk/linux-2.6-block: block: restore multiple bd_link_disk_holder() support block cfq: compensate preempted queue even if it has no slice assigned block cfq: make queue preempt work for queues from different workload
2011-01-14Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-2.6Linus Torvalds
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-2.6: (47 commits) GRETH: resolve SMP issues and other problems GRETH: handle frame error interrupts GRETH: avoid writing bad speed/duplex when setting transfer mode GRETH: fixed skb buffer memory leak on frame errors GRETH: GBit transmit descriptor handling optimization GRETH: fix opening/closing GRETH: added raw AMBA vendor/device number to match against. cassini: Fix build bustage on x86. e1000e: consistent use of Rx/Tx vs. RX/TX/rx/tx in comments/logs e1000e: update Copyright for 2011 e1000: Avoid unhandled IRQ r8169: keep firmware in memory. netdev: tilepro: Use is_unicast_ether_addr helper etherdevice.h: Add is_unicast_ether_addr function ks8695net: Use default implementation of ethtool_ops::get_link ks8695net: Disable non-working ethtool operations USB CDC NCM: Don't deref NULL in cdc_ncm_rx_fixup() and don't use uninitialized variable. vxge: Remember to release firmware after upgrading firmware netdev: bfin_mac: Remove is_multicast_ether_addr use in netdev_for_each_mc_addr ipsec: update MAX_AH_AUTH_LEN to support sha512 ...
2011-01-14Merge branch 'for-2.6.38' of git://linux-nfs.org/~bfields/linuxLinus Torvalds
* 'for-2.6.38' of git://linux-nfs.org/~bfields/linux: (62 commits) nfsd4: fix callback restarting nfsd: break lease on unlink, link, and rename nfsd4: break lease on nfsd setattr nfsd: don't support msnfs export option nfsd4: initialize cb_per_client nfsd4: allow restarting callbacks nfsd4: simplify nfsd4_cb_prepare nfsd4: give out delegations more quickly in 4.1 case nfsd4: add helper function to run callbacks nfsd4: make sure sequence flags are set after destroy_session nfsd4: re-probe callback on connection loss nfsd4: set sequence flag when backchannel is down nfsd4: keep finer-grained callback status rpc: allow xprt_class->setup to return a preexisting xprt rpc: keep backchannel xprt as long as server connection rpc: move sk_bc_xprt to svc_xprt nfsd4: allow backchannel recovery nfsd4: support BIND_CONN_TO_SESSION nfsd4: modify session list under cl_lock Documentation: fl_mylease no longer exists ... Fix up conflicts in fs/nfsd/vfs.c with the vfs-scale work. The vfs-scale work touched some msnfs cases, and this merge removes support for that entirely, so the conflict was trivial to resolve.
2011-01-14block: restore multiple bd_link_disk_holder() supportTejun Heo
Commit e09b457b (block: simplify holder symlink handling) incorrectly assumed that there is only one link at maximum. dm may use multiple links and expects block layer to track reference count for each link, which is different from and unrelated to the exclusive device holder identified by @holder when the device is opened. Remove the single holder assumption and automatic removal of the link and revive the per-link reference count tracking. The code essentially behaves the same as before commit e09b457b sans the unnecessary kobject reference count dancing. While at it, note that this facility should not be used by anyone else than the current ones. Sysfs symlinks shouldn't be abused like this and the whole thing doesn't belong in the block layer at all. Signed-off-by: Tejun Heo <tj@kernel.org> Reported-by: Milan Broz <mbroz@redhat.com> Cc: Jun'ichi Nomura <j-nomura@ce.jp.nec.com> Cc: Neil Brown <neilb@suse.de> Cc: linux-raid@vger.kernel.org Cc: Kay Sievers <kay.sievers@vrfy.org> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2011-01-14Merge branch 'linux-next' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/jbarnes/pci-2.6 * 'linux-next' of git://git.kernel.org/pub/scm/linux/kernel/git/jbarnes/pci-2.6: PCI/PM: Report wakeup events before resuming devices PCI/PM: Use pm_wakeup_event() directly for reporting wakeup events PCI: sysfs: Update ROM to include default owner write access x86/PCI: make Broadcom CNB20LE driver EMBEDDED and EXPERIMENTAL x86/PCI: don't use native Broadcom CNB20LE driver when ACPI is available PCI/ACPI: Request _OSC control once for each root bridge (v3) PCI: enable pci=bfsort by default on future Dell systems PCI/PCIe: Clear Root PME Status bits early during system resume PCI: pci-stub: ignore zero-length id parameters x86/PCI: irq and pci_ids patch for Intel Patsburg PCI: Skip id checking if no id is passed PCI: fix __pci_device_probe kernel-doc warning PCI: make pci_restore_state return void PCI: Disable ASPM if BIOS asks us to PCI: Add mask bit definition for MSI-X table PCI: MSI: Move MSI-X entry definition to pci_regs.h Fix up trivial conflicts in drivers/net/{skge.c,sky2.c} that had in the meantime been converted to not use legacy PCI power management, and thus no longer use pci_restore_state() at all (and that caused trivial conflicts with the "make pci_restore_state return void" patch)
2011-01-14Merge git://git.infradead.org/battery-2.6Linus Torvalds
* git://git.infradead.org/battery-2.6: (21 commits) power_supply: Add MAX17042 Fuel Gauge Driver olpc_battery: Fix up XO-1.5 properties list olpc_battery: Add support for CURRENT_NOW and VOLTAGE_NOW olpc_battery: Add support for CHARGE_NOW olpc_battery: Add support for CHARGE_FULL_DESIGN olpc_battery: Ambient temperature is not available on XO-1.5 jz4740-battery: Should include linux/io.h s3c_adc_battery: Add gpio_inverted field to pdata power_supply: Don't use flush_scheduled_work() power_supply: Fix use after free and memory leak gpio-charger: Fix potential race between irq handler and probe/remove gpio-charger: Provide default name for the power_supply gpio-charger: Check result of kzalloc jz4740-battery: Check if platform_data is supplied isp1704_charger: Detect charger after probe isp1704_charger: Set isp->dev before anything needs it isp1704_charger: Detect HUB/Host chargers isp1704_charger: Correct length for storing model power_supply: Add gpio charger driver jz4740-battery: Protect against concurrent battery readings ...
2011-01-14Merge branch 'vfs-scale-working' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/npiggin/linux-npiggin * 'vfs-scale-working' of git://git.kernel.org/pub/scm/linux/kernel/git/npiggin/linux-npiggin: kernel: fix hlist_bl again cgroups: Fix a lockdep warning at cgroup removal fs: namei fix ->put_link on wrong inode in do_filp_open
2011-01-14Merge branch 'for-next' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/sameo/mfd-2.6 * 'for-next' of git://git.kernel.org/pub/scm/linux/kernel/git/sameo/mfd-2.6: (59 commits) mfd: ab8500-core chip version cut 2.0 support mfd: Flag WM831x /IRQ as a wake source mfd: Convert WM831x away from legacy I2C PM operations regulator: Support MAX8998/LP3974 DVS-GPIO mfd: Support LP3974 RTC i2c: Convert SCx200 driver from using raw PCI to platform device x86: OLPC: convert olpc-xo1 driver from pci device to platform device mfd: MAX8998/LP3974 hibernation support mfd/ab8500: remove spi support mfd: Remove ARCH_U8500 dependency from AB8500 misc: Make AB8500_PWM driver depend on U8500 due to PWM breakage mfd: Add __devexit annotation for vx855_remove mfd: twl6030 irq_data conversion. gpio: Fix cs5535 printk warnings misc: Fix cs5535 printk warnings mfd: Convert Wolfson MFD drivers to use irq_data accessor function mfd: Convert TWL4030 to new irq_ APIs mfd: Convert tps6586x driver to new irq_ API mfd: Convert tc6393xb driver to new irq_ APIs mfd: Convert t7166xb driver to new irq_ API ...
2011-01-14PCI/PM: Use pm_wakeup_event() directly for reporting wakeup eventsRafael J. Wysocki
After recent changes related to wakeup events pm_wakeup_event() automatically checks if the given device is configured to signal wakeup, so pci_wakeup_event() may be a static inline function calling pm_wakeup_event() directly. Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl> Signed-off-by: Jesse Barnes <jbarnes@virtuousgeek.org>
2011-01-14PCI/ACPI: Request _OSC control once for each root bridge (v3)Rafael J. Wysocki
Move the evaluation of acpi_pci_osc_control_set() (to request control of PCI Express native features) into acpi_pci_root_add() to avoid calling it many times for the same root complex with the same arguments. Additionally, check if all of the requisite _OSC support bits are set before calling acpi_pci_osc_control_set() for a given root complex. References: https://bugzilla.kernel.org/show_bug.cgi?id=20232 Reported-by: Ozan Caglayan <ozan@pardus.org.tr> Tested-by: Ozan Caglayan <ozan@pardus.org.tr> Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl> Signed-off-by: Jesse Barnes <jbarnes@virtuousgeek.org>
2011-01-14[SCSI] target: Add LIO target core v4.0.0-rc6Nicholas Bellinger
LIO target is a full featured in-kernel target framework with the following feature set: High-performance, non-blocking, multithreaded architecture with SIMD support. Advanced SCSI feature set: * Persistent Reservations (PRs) * Asymmetric Logical Unit Assignment (ALUA) * Protocol and intra-nexus multiplexing, load-balancing and failover (MC/S) * Full Error Recovery (ERL=0,1,2) * Active/active task migration and session continuation (ERL=2) * Thin LUN provisioning (UNMAP and WRITE_SAMExx) Multiprotocol target plugins Storage media independence: * Virtualization of all storage media; transparent mapping of IO to LUNs * No hard limits on number of LUNs per Target; maximum LUN size ~750 TB * Backstores: SATA, SAS, SCSI, BluRay, DVD, FLASH, USB, ramdisk, etc. Standards compliance: * Full compliance with IETF (RFC 3720) * Full implementation of SPC-4 PRs and ALUA Significant code cleanups done by Christoph Hellwig. [jejb: fix up for new block bdev exclusive interface. Minor fixes from Randy Dunlap and Dan Carpenter.] Signed-off-by: Nicholas A. Bellinger <nab@linux-iscsi.org> Signed-off-by: James Bottomley <James.Bottomley@suse.de>
2011-01-14include/gpio.h: remove remaining __must_check-annotiationsWolfram Sang
Commit 5f829e405ec4e96f711165a4a7b55c271d4363e2 (gpiolib: add missing functions to generic fallback) also introduced two. Signed-off-by: Wolfram Sang <w.sang@pengutronix.de> Cc: Greg KH <gregkh@suse.de> Cc: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-01-14Revert update for dirty_ratio for memcg.KAMEZAWA Hiroyuki
The flags added by commit db16d5ec1f87f17511599bc77857dd1662b5a22f has no user now. We believe we'll use it soon but considering patch reviewing, the change itself should be folded into incoming set of "dirty ratio for memcg" patches. So, it's better to drop this change from current mainline tree. Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Reviewed-by: Greg Thelen <gthelen@google.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-01-14power_supply: Add MAX17042 Fuel Gauge DriverMyungJoo Ham
The MAX17042 is a fuel gauge with an I2C interface for lithium-ion betteries. Unlike its predecessor MAX17040, MAX17042 uses 16bit registers. Besides, MAX17042 has much more features than MAX17040; e.g., a thermistor, current and current accumulation measurement, battery internal resistance estimate, average values of measurement, and others. This patch implements a driver for MAX17042. In this initial release, we have implemented the most basic features of a fuel gauge: measure the battery capacity and voltage. Signed-off-by: MyungJoo Ham <myungjoo.ham@samsung.com> Signed-off-by: Kyungmin Park <kyungmin.park@samsung.com> Signed-off-by: Anton Vorontsov <cbouatmailru@gmail.com>
2011-01-14kernel: fix hlist_bl againRussell King
__d_rehash is dereferencing an almost-NULL pointer on my ARM926. CONFIG_SMP=n and CONFIG_DEBUG_SPINLOCK=y. The faulting instruction is: strne r3, [r2, #4] and as can be seen from the register dump below, r2 is 0x00000001, hence the faulting 0x00000005 address. __d_rehash is essentially: spin_lock_bucket(b); entry->d_flags &= ~DCACHE_UNHASHED; hlist_bl_add_head_rcu(&entry->d_hash, &b->head); spin_unlock_bucket(b); which is: bit_spin_lock(0, (unsigned long *)&b->head.first); entry->d_flags &= ~DCACHE_UNHASHED; hlist_bl_add_head_rcu(&entry->d_hash, &b->head); __bit_spin_unlock(0, (unsigned long *)&b->head.first); bit_spin_lock(0, ptr) sets bit 0 of *ptr, in this case b->head.first if CONFIG_SMP or CONFIG_DEBUG_SPINLOCK is set: #if defined(CONFIG_SMP) || defined(CONFIG_DEBUG_SPINLOCK) while (unlikely(test_and_set_bit_lock(bitnum, addr))) { while (test_bit(bitnum, addr)) { preempt_enable(); cpu_relax(); preempt_disable(); } } #endif So, b->head.first starts off NULL, and becomes a non-NULL (address 1). hlist_bl_add_head_rcu() does this: static inline void hlist_bl_add_head_rcu(struct hlist_bl_node *n, struct hlist_bl_head *h) { first = hlist_bl_first(h); n->next = first; if (first) first->pprev = &n->next; It is the store to first->pprev which is faulting. hlist_bl_first(): static inline struct hlist_bl_node *hlist_bl_first(struct hlist_bl_head *h) { return (struct hlist_bl_node *) ((unsigned long)h->first & ~LIST_BL_LOCKMASK); } but: #if defined(CONFIG_SMP) #define LIST_BL_LOCKMASK 1UL #else #define LIST_BL_LOCKMASK 0UL #endif So, we have one piece of code which sets bit 0 of addresses, and another bit of code which doesn't clear it before dereferencing the pointer if !CONFIG_SMP && CONFIG_DEBUG_SPINLOCK. With the patch below, I can again sucessfully boot the kernel on my Versatile PB/926 platform. Signed-off-by: Russell King <rmk+kernel@arm.linux.org.uk>
2011-01-14mfd: ab8500-core chip version cut 2.0 supportMattias Wallin
This patch adds support for chip version 2.0 or cut 2.0. One new interrupt latch register - latch 12 - is introduced. Signed-off-by: Mattias Wallin <mattias.wallin@stericsson.com> Acked-by: Linus Walleij <linus.walleij@stericsson.com> Signed-off-by: Samuel Ortiz <sameo@linux.intel.com>
2011-01-14regulator: Support MAX8998/LP3974 DVS-GPIOMyungJoo Ham
The previous driver did not support BUCK1-DVS3, BUCK1-DVS4, and BUCK2-DVS2 modes. This patch adds such modes and an option to block setting buck1/2 voltages out of the preset values. Signed-off-by: MyungJoo Ham <myungjoo.ham@samsung.com> Signed-off-by: Kyungmin Park <kyungmin.park@samsung.com> Acked-by: Mark Brown <broonie@opensource.wolfsonmicro.com> Signed-off-by: Samuel Ortiz <sameo@linux.intel.com>
2011-01-14mfd: Support LP3974 RTCMyungJoo Ham
The first releases of LP3974 have a large delay in RTC registers, which requires 2 seconds of delay after writing to a rtc register (recommended by National Semiconductor's engineers) before reading it. If "rtc_delay" field of the platform data is true, the rtc driver assumes that such delays are required. Although we have not seen LP3974s without requiring such delays, we assume that such LP3974s will be released soon (or they have done so already) and they are supported by "lp3974" without setting "rtc_delay" at the platform data. This patch adds delays with msleep when writing values to RTC registers if the platform data has rtc_delay set. Signed-off-by: MyungJoo Ham <myungjoo.ham@samsung.com> Signed-off-by: Kyungmin Park <kyungmin.park@samsung.com> Reviewed-by: Mark Brown <broonie@opensource.wolfsonmicro.com> Signed-off-by: Samuel Ortiz <sameo@linux.intel.com>
2011-01-14mfd: MAX8998/LP3974 hibernation supportMyungJoo Ham
This patch makes the driver to save and restore register values for hibernation. Signed-off-by: MyungJoo Ham <myungjoo.ham@samsung.com> Signed-off-by: Kyungmin Park <kyungmin.park@samsung.com> Signed-off-by: Samuel Ortiz <sameo@linux.intel.com>
2011-01-14mfd: ab8500-core ioresources irq for subdrivers addedMattias Wallin
This patch adds the ioresources used by subdrivers to retrieve their interrupt. Signed-off-by: Mattias Wallin <mattias.wallin@stericsson.com> Signed-off-by: Samuel Ortiz <sameo@linux.intel.com>
2011-01-14mfd: Provide pm_runtime_no_callbacks flag in cell dataMark Brown
Allow MFD cells to have pm_runtime_no_callbacks() called on them during registration. This causes the runtime PM framework to ignore them, allowing use of runtime PM to suspend the device as a whole even if not all drivers for the MFD can usefully implement runtime PM. For example, RTCs are likely to run continuously regardless of the power state of the system. Signed-off-by: Mark Brown <broonie@opensource.wolfsonmicro.com> Signed-off-by: Samuel Ortiz <sameo@linux.intel.com>
2011-01-14mfd: Add WM8326 supportMark Brown
The WM8326 is a high performance variant of the WM832x series with no software visible differences. Signed-off-by: Mark Brown <broonie@opensource.wolfsonmicro.com> Signed-off-by: Samuel Ortiz <sameo@linux.intel.com>
2011-01-13etherdevice.h: Add is_unicast_ether_addr functionTobias Klauser
From a check for !is_multicast_ether_addr it is not always obvious that we're checking for a unicast address. So add this helper function to make those code paths easier to read. Signed-off-by: Tobias Klauser <tklauser@distanz.ch> Acked-by: Chris Metcalf <cmetcalf@tilera.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2011-01-13ipsec: update MAX_AH_AUTH_LEN to support sha512Nicolas Dichtel
icv_truncbits is set to 256 for sha512, so update MAX_AH_AUTH_LEN to 64. Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2011-01-13net: remove dev_txq_stats_fold()Eric Dumazet
After recent changes, (percpu stats on vlan/tunnels...), we dont need anymore per struct netdev_queue tx_bytes/tx_packets/tx_dropped counters. Only remaining users are ixgbe, sch_teql, gianfar & macvlan : 1) ixgbe can be converted to use existing tx_ring counters. 2) macvlan incremented txq->tx_dropped, it can use the dev->stats.tx_dropped counter. 3) sch_teql : almost revert ab35cd4b8f42 (Use net_device internal stats) Now we have ndo_get_stats64(), use it, even for "unsigned long" fields (No need to bring back a struct net_device_stats) 4) gianfar adds a stats structure per tx queue to hold tx_bytes/tx_packets This removes a lockdep warning (and possible lockup) in rndis gadget, calling dev_get_stats() from hard IRQ context. Ref: http://www.spinics.net/lists/netdev/msg149202.html Reported-by: Neil Jones <neiljay@gmail.com> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> CC: Jarek Poplawski <jarkao2@gmail.com> CC: Alexander Duyck <alexander.h.duyck@intel.com> CC: Jeff Kirsher <jeffrey.t.kirsher@intel.com> CC: Sandeep Gopalpet <sandeep.kumar@freescale.com> CC: Michal Nazarewicz <mina86@mina86.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2011-01-13Merge branch 'release' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/lenb/linux-acpi-2.6 * 'release' of git://git.kernel.org/pub/scm/linux/kernel/git/lenb/linux-acpi-2.6: (59 commits) ACPI / PM: Fix build problems for !CONFIG_ACPI related to NVS rework ACPI: fix resource check message ACPI / Battery: Update information on info notification and resume ACPI: Drop device flag wake_capable ACPI: Always check if _PRW is present before trying to evaluate it ACPI / PM: Check status of power resources under mutexes ACPI / PM: Rename acpi_power_off_device() ACPI / PM: Drop acpi_power_nocheck ACPI / PM: Drop acpi_bus_get_power() Platform / x86: Make fujitsu_laptop use acpi_bus_update_power() ACPI / Fan: Rework the handling of power resources ACPI / PM: Register power resource devices as soon as they are needed ACPI / PM: Register acpi_power_driver early ACPI / PM: Add function for updating device power state consistently ACPI / PM: Add function for device power state initialization ACPI / PM: Introduce __acpi_bus_get_power() ACPI / PM: Introduce function for refcounting device power resources ACPI / PM: Add functions for manipulating lists of power resources ACPI / PM: Prevent acpi_power_get_inferred_state() from making changes ACPICA: Update version to 20101209 ...
2011-01-13Merge branch 'idle-release' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/lenb/linux-idle-2.6 * 'idle-release' of git://git.kernel.org/pub/scm/linux/kernel/git/lenb/linux-idle-2.6: cpuidle/x86/perf: fix power:cpu_idle double end events and throw cpu_idle events from the cpuidle layer intel_idle: open broadcast clock event cpuidle: CPUIDLE_FLAG_CHECK_BM is omap3_idle specific cpuidle: CPUIDLE_FLAG_TLB_FLUSHED is specific to intel_idle cpuidle: delete unused CPUIDLE_FLAG_SHALLOW, BALANCED, DEEP definitions SH, cpuidle: delete use of NOP CPUIDLE_FLAGS_SHALLOW cpuidle: delete NOP CPUIDLE_FLAG_POLL ACPI: processor_idle: delete use of NOP CPUIDLE_FLAGs cpuidle: Rename X86 specific idle poll state[0] from C0 to POLL ACPI, intel_idle: Cleanup idle= internal variables cpuidle: Make cpuidle_enable_device() call poll_idle_init() intel_idle: update Sandy Bridge core C-state residency targets
2011-01-13Merge branch 'vfs-scale-working' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/npiggin/linux-npiggin * 'vfs-scale-working' of git://git.kernel.org/pub/scm/linux/kernel/git/npiggin/linux-npiggin: fs: fix do_last error case when need_reval_dot nfs: add missing rcu-walk check fs: hlist UP debug fixup fs: fix dropping of rcu-walk from force_reval_path fs: force_reval_path drop rcu-walk before d_invalidate fs: small rcu-walk documentation fixes Fixed up trivial conflicts in Documentation/filesystems/porting
2011-01-13Merge branch 'stable/gntdev' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/konrad/xen * 'stable/gntdev' of git://git.kernel.org/pub/scm/linux/kernel/git/konrad/xen: xen/p2m: Fix module linking error. xen p2m: clear the old pte when adding a page to m2p_override xen gntdev: use gnttab_map_refs and gnttab_unmap_refs xen: introduce gnttab_map_refs and gnttab_unmap_refs xen p2m: transparently change the p2m mappings in the m2p override xen/gntdev: Fix circular locking dependency xen/gntdev: stop using "token" argument xen: gntdev: move use of GNTMAP_contains_pte next to the map_op xen: add m2p override mechanism xen: move p2m handling to separate file xen/gntdev: add VM_PFNMAP to vma xen/gntdev: allow usermode to map granted pages xen: define gnttab_set_map_op/unmap_op Fix up trivial conflict in drivers/xen/Kconfig
2011-01-14fs: hlist UP debug fixupNick Piggin
Po-Yu Chuang <ratbert.chuang@gmail.com> noticed that hlist_bl_set_first could crash on a UP system when LIST_BL_LOCKMASK is 0, because LIST_BL_BUG_ON(!((unsigned long)h->first & LIST_BL_LOCKMASK)); always evaulates to true. Fix the expression, and also avoid a dependency between bit spinlock implementation and list bl code (list code shouldn't know anything except that bit 0 is set when adding and removing elements). Eventually if a good use case comes up, we might use this list to store 1 or more arbitrary bits of data, so it really shouldn't be tied to locking either, but for now they are helpful for debugging. Signed-off-by: Nick Piggin <npiggin@kernel.dk>
2011-01-13nfsd: don't support msnfs export optionJ. Bruce Fields
We've long had these pointless #ifdef MSNFS's sprinkled throughout the code--pointless because MSNFS is always defined (and we give no config option to make that easy to change). So we could just remove the ifdef's and compile the resulting code unconditionally. But as long as we're there: why not just rip out this code entirely? The only purpose is to implement the "msnfs" export option which turns on Windows-like behavior in some cases, and: - the export option isn't documented anywhere; - the userland utilities (which would need to be able to parse "msnfs" in an export file) don't support it; - I don't know how to maintain this, as I don't know what the proper behavior is; and - google shows no evidence that anyone has ever used this. Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2011-01-13memcg: fix memory migration of shmem swapcacheDaisuke Nishimura
In the current implementation mem_cgroup_end_migration() decides whether the page migration has succeeded or not by checking "oldpage->mapping". But if we are tring to migrate a shmem swapcache, the page->mapping of it is NULL from the begining, so the check would be invalid. As a result, mem_cgroup_end_migration() assumes the migration has succeeded even if it's not, so "newpage" would be freed while it's not uncharged. This patch fixes it by passing mem_cgroup_end_migration() the result of the page migration. Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Reviewed-by: Minchan Kim <minchan.kim@gmail.com> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Acked-by: Balbir Singh <balbir@linux.vnet.ibm.com> Cc: Minchan Kim <minchan.kim@gmail.com> Reviewed-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-01-13memcg: add lock to synchronize page accounting and migrationKAMEZAWA Hiroyuki
Introduce a new bit spin lock, PCG_MOVE_LOCK, to synchronize the page accounting and migration code. This reworks the locking scheme of _update_stat() and _move_account() by adding new lock bit PCG_MOVE_LOCK, which is always taken under IRQ disable. 1. If pages are being migrated from a memcg, then updates to that memcg page statistics are protected by grabbing PCG_MOVE_LOCK using move_lock_page_cgroup(). In an upcoming commit, memcg dirty page accounting will be updating memcg page accounting (specifically: num writeback pages) from IRQ context (softirq). Avoid a deadlocking nested spin lock attempt by disabling irq on the local processor when grabbing the PCG_MOVE_LOCK. 2. lock for update_page_stat is used only for avoiding race with move_account(). So, IRQ awareness of lock_page_cgroup() itself is not a problem. The problem is between mem_cgroup_update_page_stat() and mem_cgroup_move_account_page(). Trade-off: * Changing lock_page_cgroup() to always disable IRQ (or local_bh) has some impacts on performance and I think it's bad to disable IRQ when it's not necessary. * adding a new lock makes move_account() slower. Score is here. Performance Impact: moving a 8G anon process. Before: real 0m0.792s user 0m0.000s sys 0m0.780s After: real 0m0.854s user 0m0.000s sys 0m0.842s This score is bad but planned patches for optimization can reduce this impact. Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Greg Thelen <gthelen@google.com> Reviewed-by: Minchan Kim <minchan.kim@gmail.com> Acked-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Cc: Andrea Righi <arighi@develer.com> Cc: Balbir Singh <balbir@linux.vnet.ibm.com> Cc: Wu Fengguang <fengguang.wu@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-01-13memcg: create extensible page stat update routinesGreg Thelen
Replace usage of the mem_cgroup_update_file_mapped() memcg statistic update routine with two new routines: * mem_cgroup_inc_page_stat() * mem_cgroup_dec_page_stat() As before, only the file_mapped statistic is managed. However, these more general interfaces allow for new statistics to be more easily added. New statistics are added with memcg dirty page accounting. Signed-off-by: Greg Thelen <gthelen@google.com> Signed-off-by: Andrea Righi <arighi@develer.com> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Acked-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Cc: Balbir Singh <balbir@linux.vnet.ibm.com> Cc: Minchan Kim <minchan.kim@gmail.com> Cc: Wu Fengguang <fengguang.wu@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-01-13memcg: add page_cgroup flags for dirty page trackingGreg Thelen
This patchset provides the ability for each cgroup to have independent dirty page limits. Limiting dirty memory is like fixing the max amount of dirty (hard to reclaim) page cache used by a cgroup. So, in case of multiple cgroup writers, they will not be able to consume more than their designated share of dirty pages and will be forced to perform write-out if they cross that limit. The patches are based on a series proposed by Andrea Righi in Mar 2010. Overview: - Add page_cgroup flags to record when pages are dirty, in writeback, or nfs unstable. - Extend mem_cgroup to record the total number of pages in each of the interesting dirty states (dirty, writeback, unstable_nfs). - Add dirty parameters similar to the system-wide /proc/sys/vm/dirty_* limits to mem_cgroup. The mem_cgroup dirty parameters are accessible via cgroupfs control files. - Consider both system and per-memcg dirty limits in page writeback when deciding to queue background writeback or block for foreground writeback. Known shortcomings: - When a cgroup dirty limit is exceeded, then bdi writeback is employed to writeback dirty inodes. Bdi writeback considers inodes from any cgroup, not just inodes contributing dirty pages to the cgroup exceeding its limit. - When memory.use_hierarchy is set, then dirty limits are disabled. This is a implementation detail. An enhanced implementation is needed to check the chain of parents to ensure that no dirty limit is exceeded. Performance data: - A page fault microbenchmark workload was used to measure performance, which can be called in read or write mode: f = open(foo. $cpu) truncate(f, 4096) alarm(60) while (1) { p = mmap(f, 4096) if (write) *p = 1 else x = *p munmap(p) } - The workload was called for several points in the patch series in different modes: - s_read is a single threaded reader - s_write is a single threaded writer - p_read is a 16 thread reader, each operating on a different file - p_write is a 16 thread writer, each operating on a different file - Measurements were collected on a 16 core non-numa system using "perf stat --repeat 3". The -a option was used for parallel (p_*) runs. - All numbers are page fault rate (M/sec). Higher is better. - To compare the performance of a kernel without non-memcg compare the first and last rows, neither has memcg configured. The first row does not include any of these memcg patches. - To compare the performance of using memcg dirty limits, compare the baseline (2nd row titled "w/ memcg") with the the code and memcg enabled (2nd to last row titled "all patches"). root_cgroup child_cgroup s_read s_write p_read p_write s_read s_write p_read p_write mmotm w/o memcg 0.428 0.390 0.429 0.388 mmotm w/ memcg 0.411 0.378 0.391 0.362 0.412 0.377 0.385 0.363 all patches 0.384 0.360 0.370 0.348 0.381 0.363 0.368 0.347 all patches 0.431 0.402 0.427 0.395 w/o memcg This patch: Add additional flags to page_cgroup to track dirty pages within a mem_cgroup. Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Andrea Righi <arighi@develer.com> Signed-off-by: Greg Thelen <gthelen@google.com> Acked-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Cc: Balbir Singh <balbir@linux.vnet.ibm.com> Cc: Minchan Kim <minchan.kim@gmail.com> Cc: Wu Fengguang <fengguang.wu@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-01-13mm: migration: use rcu_dereference_protected when dereferencing the radix ↵Mel Gorman
tree slot during file page migration migrate_pages() -> unmap_and_move() only calls rcu_read_lock() for anonymous pages, as introduced by git commit 989f89c57e6361e7d16fbd9572b5da7d313b073d ("fix rcu_read_lock() in page migraton"). The point of the RCU protection there is part of getting a stable reference to anon_vma and is only held for anon pages as file pages are locked which is sufficient protection against freeing. However, while a file page's mapping is being migrated, the radix tree is double checked to ensure it is the expected page. This uses radix_tree_deref_slot() -> rcu_dereference() without the RCU lock held triggering the following warning. [ 173.674290] =================================================== [ 173.676016] [ INFO: suspicious rcu_dereference_check() usage. ] [ 173.676016] --------------------------------------------------- [ 173.676016] include/linux/radix-tree.h:145 invoked rcu_dereference_check() without protection! [ 173.676016] [ 173.676016] other info that might help us debug this: [ 173.676016] [ 173.676016] [ 173.676016] rcu_scheduler_active = 1, debug_locks = 0 [ 173.676016] 1 lock held by hugeadm/2899: [ 173.676016] #0: (&(&inode->i_data.tree_lock)->rlock){..-.-.}, at: [<c10e3d2b>] migrate_page_move_mapping+0x40/0x1ab [ 173.676016] [ 173.676016] stack backtrace: [ 173.676016] Pid: 2899, comm: hugeadm Not tainted 2.6.37-rc5-autobuild [ 173.676016] Call Trace: [ 173.676016] [<c128cc01>] ? printk+0x14/0x1b [ 173.676016] [<c1063502>] lockdep_rcu_dereference+0x7d/0x86 [ 173.676016] [<c10e3db5>] migrate_page_move_mapping+0xca/0x1ab [ 173.676016] [<c10e41ad>] migrate_page+0x23/0x39 [ 173.676016] [<c10e491b>] buffer_migrate_page+0x22/0x107 [ 173.676016] [<c10e48f9>] ? buffer_migrate_page+0x0/0x107 [ 173.676016] [<c10e425d>] move_to_new_page+0x9a/0x1ae [ 173.676016] [<c10e47e6>] migrate_pages+0x1e7/0x2fa This patch introduces radix_tree_deref_slot_protected() which calls rcu_dereference_protected(). Users of it must pass in the mapping->tree_lock that is protecting this dereference. Holding the tree lock protects against parallel updaters of the radix tree meaning that rcu_dereference_protected is allowable. [akpm@linux-foundation.org: remove unneeded casts] Signed-off-by: Mel Gorman <mel@csn.ul.ie> Cc: Minchan Kim <minchan.kim@gmail.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Milton Miller <miltonm@bga.com> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Cc: Wu Fengguang <fengguang.wu@intel.com> Cc: <stable@kernel.org> [2.6.37.early] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-01-13thp: add compound_trans_head() helperAndrea Arcangeli
Cleanup some code with common compound_trans_head helper. Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Johannes Weiner <jweiner@redhat.com> Cc: Marcelo Tosatti <mtosatti@redhat.com> Cc: Avi Kivity <avi@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-01-13thp: khugepaged: make khugepaged aware about madviseAndrea Arcangeli
MADV_HUGEPAGE and MADV_NOHUGEPAGE were fully effective only if run after mmap and before touching the memory. While this is enough for most usages, it's little effort to make madvise more dynamic at runtime on an existing mapping by making khugepaged aware about madvise. MADV_HUGEPAGE: register in khugepaged immediately without waiting a page fault (that may not ever happen if all pages are already mapped and the "enabled" knob was set to madvise during the initial page faults). MADV_NOHUGEPAGE: skip vmas marked VM_NOHUGEPAGE in khugepaged to stop collapsing pages where not needed. [akpm@linux-foundation.org: tweak comment] Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Cc: Michael Kerrisk <mtk.manpages@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-01-13thp: madvise(MADV_NOHUGEPAGE)Andrea Arcangeli
Add madvise MADV_NOHUGEPAGE to mark regions that are not important to be hugepage backed. Return -EINVAL if the vma is not of an anonymous type, or the feature isn't built into the kernel. Never silently return success. Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-01-13thp: mm: define MADV_NOHUGEPAGEAndrea Arcangeli
Define MADV_NOHUGEPAGE. Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-01-13thp: compound_trans_orderAndrea Arcangeli
Read compound_trans_order safe. Noop for CONFIG_TRANSPARENT_HUGEPAGE=n. Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-01-13thp: fix anon memory statistics with transparent hugepagesRik van Riel
Count each transparent hugepage as HPAGE_PMD_NR pages in the LRU statistics, so the Active(anon) and Inactive(anon) statistics in /proc/meminfo are correct. Signed-off-by: Rik van Riel <riel@redhat.com> Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-01-13thp: use compaction in kswapd for GFP_ATOMIC order > 0Andrea Arcangeli
This takes advantage of memory compaction to properly generate pages of order > 0 if regular page reclaim fails and priority level becomes more severe and we don't reach the proper watermarks. Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-01-13thp: mmu_notifier_test_youngAndrea Arcangeli
For GRU and EPT, we need gup-fast to set referenced bit too (this is why it's correct to return 0 when shadow_access_mask is zero, it requires gup-fast to set the referenced bit). qemu-kvm access already sets the young bit in the pte if it isn't zero-copy, if it's zero copy or a shadow paging EPT minor fault we relay on gup-fast to signal the page is in use... We also need to check the young bits on the secondary pagetables for NPT and not nested shadow mmu as the data may never get accessed again by the primary pte. Without this closer accuracy, we'd have to remove the heuristic that avoids collapsing hugepages in hugepage virtual regions that have not even a single subpage in use. ->test_young is full backwards compatible with GRU and other usages that don't have young bits in pagetables set by the hardware and that should nuke the secondary mmu mappings when ->clear_flush_young runs just like EPT does. Removing the heuristic that checks the young bit in khugepaged/collapse_huge_page completely isn't so bad either probably but I thought it was worth it and this makes it reliable. Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-01-13thp: avoid breaking huge pmd invariants in case of vma_adjust failuresAndrea Arcangeli
An huge pmd can only be mapped if the corresponding 2M virtual range is fully contained in the vma. At times the VM calls split_vma twice, if the first split_vma succeeds and the second fail, the first split_vma remains in effect and it's not rolled back. For split_vma or vma_adjust to fail an allocation failure is needed so it's a very unlikely event (the out of memory killer would normally fire before any allocation failure is visible to kernel and userland and if an out of memory condition happens it's unlikely to happen exactly here). Nevertheless it's safer to ensure that no huge pmd can be left around if the vma is adjusted in a way that can't fit hugepages anymore at the new vm_start/vm_end address. Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-01-13thp: add numa awareness to hugepage allocationsAndrea Arcangeli
It's mostly a matter of replacing alloc_pages with alloc_pages_vma after introducing alloc_pages_vma. khugepaged needs special handling as the allocation has to happen inside collapse_huge_page where the vma is known and an error has to be returned to the outer loop to sleep alloc_sleep_millisecs in case of failure. But it retains the more efficient logic of handling allocation failures in khugepaged in case of CONFIG_NUMA=n. Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-01-13thp: mprotect: transparent huge page supportJohannes Weiner
Natively handle huge pmds when changing page tables on behalf of mprotect(). I left out update_mmu_cache() because we do not need it on x86 anyway but more importantly the interface works on ptes, not pmds. Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Reviewed-by: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>