summaryrefslogtreecommitdiff
path: root/Documentation/filesystems
diff options
context:
space:
mode:
Diffstat (limited to 'Documentation/filesystems')
-rw-r--r--Documentation/filesystems/ext3.txt181
-rw-r--r--Documentation/filesystems/fuse.txt63
-rw-r--r--Documentation/filesystems/proc.txt19
-rw-r--r--Documentation/filesystems/ramfs-rootfs-initramfs.txt72
-rw-r--r--Documentation/filesystems/relayfs.txt126
-rw-r--r--Documentation/filesystems/spufs.txt521
-rw-r--r--Documentation/filesystems/sysfs-pci.txt21
-rw-r--r--Documentation/filesystems/tmpfs.txt12
8 files changed, 897 insertions, 118 deletions
diff --git a/Documentation/filesystems/ext3.txt b/Documentation/filesystems/ext3.txt
index 9840d5b8d5b..afb1335c05d 100644
--- a/Documentation/filesystems/ext3.txt
+++ b/Documentation/filesystems/ext3.txt
@@ -2,11 +2,11 @@
Ext3 Filesystem
===============
-ext3 was originally released in September 1999. Written by Stephen Tweedie
-for 2.2 branch, and ported to 2.4 kernels by Peter Braam, Andreas Dilger,
+Ext3 was originally released in September 1999. Written by Stephen Tweedie
+for the 2.2 branch, and ported to 2.4 kernels by Peter Braam, Andreas Dilger,
Andrew Morton, Alexander Viro, Ted Ts'o and Stephen Tweedie.
-ext3 is ext2 filesystem enhanced with journalling capabilities.
+Ext3 is the ext2 filesystem enhanced with journalling capabilities.
Options
=======
@@ -14,76 +14,81 @@ Options
When mounting an ext3 filesystem, the following option are accepted:
(*) == default
-jounal=update Update the ext3 file system's journal to the
- current format.
+journal=update Update the ext3 file system's journal to the current
+ format.
-journal=inum When a journal already exists, this option is
- ignored. Otherwise, it specifies the number of
- the inode which will represent the ext3 file
- system's journal file.
+journal=inum When a journal already exists, this option is ignored.
+ Otherwise, it specifies the number of the inode which
+ will represent the ext3 file system's journal file.
+
+journal_dev=devnum When the external journal device's major/minor numbers
+ have changed, this option allows the user to specify
+ the new journal location. The journal device is
+ identified through its new major/minor numbers encoded
+ in devnum.
noload Don't load the journal on mounting.
-data=journal All data are committed into the journal prior
- to being written into the main file system.
+data=journal All data are committed into the journal prior to being
+ written into the main file system.
data=ordered (*) All data are forced directly out to the main file
- system prior to its metadata being committed to
- the journal.
+ system prior to its metadata being committed to the
+ journal.
-data=writeback Data ordering is not preserved, data may be
- written into the main file system after its
- metadata has been committed to the journal.
+data=writeback Data ordering is not preserved, data may be written
+ into the main file system after its metadata has been
+ committed to the journal.
commit=nrsec (*) Ext3 can be told to sync all its data and metadata
every 'nrsec' seconds. The default value is 5 seconds.
- This means that if you lose your power, you will lose,
- as much, the latest 5 seconds of work (your filesystem
- will not be damaged though, thanks to journaling). This
- default value (or any low value) will hurt performance,
- but it's good for data-safety. Setting it to 0 will
- have the same effect than leaving the default 5 sec.
+ This means that if you lose your power, you will lose
+ as much as the latest 5 seconds of work (your
+ filesystem will not be damaged though, thanks to the
+ journaling). This default value (or any low value)
+ will hurt performance, but it's good for data-safety.
+ Setting it to 0 will have the same effect as leaving
+ it at the default (5 seconds).
Setting it to very large values will improve
performance.
-barrier=1 This enables/disables barriers. barrier=0 disables it,
- barrier=1 enables it.
+barrier=1 This enables/disables barriers. barrier=0 disables
+ it, barrier=1 enables it.
-orlov (*) This enables the new Orlov block allocator. It's enabled
- by default.
+orlov (*) This enables the new Orlov block allocator. It is
+ enabled by default.
-oldalloc This disables the Orlov block allocator and enables the
- old block allocator. Orlov should have better performance,
- we'd like to get some feedback if it's the contrary for
- you.
+oldalloc This disables the Orlov block allocator and enables
+ the old block allocator. Orlov should have better
+ performance - we'd like to get some feedback if it's
+ the contrary for you.
-user_xattr Enables Extended User Attributes. Additionally, you need
- to have extended attribute support enabled in the kernel
- configuration (CONFIG_EXT3_FS_XATTR). See the attr(5)
- manual page and http://acl.bestbits.at to learn more
- about extended attributes.
+user_xattr Enables Extended User Attributes. Additionally, you
+ need to have extended attribute support enabled in the
+ kernel configuration (CONFIG_EXT3_FS_XATTR). See the
+ attr(5) manual page and http://acl.bestbits.at/ to
+ learn more about extended attributes.
nouser_xattr Disables Extended User Attributes.
-acl Enables POSIX Access Control Lists support. Additionally,
- you need to have ACL support enabled in the kernel
- configuration (CONFIG_EXT3_FS_POSIX_ACL). See the acl(5)
- manual page and http://acl.bestbits.at for more
- information.
+acl Enables POSIX Access Control Lists support.
+ Additionally, you need to have ACL support enabled in
+ the kernel configuration (CONFIG_EXT3_FS_POSIX_ACL).
+ See the acl(5) manual page and http://acl.bestbits.at/
+ for more information.
-noacl This option disables POSIX Access Control List support.
+noacl This option disables POSIX Access Control List
+ support.
reservation
noreservation
-resize=
-
bsddf (*) Make 'df' act like BSD.
minixdf Make 'df' act like Minix.
check=none Don't do extra checking of bitmaps on mount.
-nocheck
+nocheck
debug Extra debugging information is sent to syslog.
@@ -92,7 +97,7 @@ errors=continue Keep going on a filesystem error.
errors=panic Panic and halt the machine if an error occurs.
grpid Give objects the same group ID as their creator.
-bsdgroups
+bsdgroups
nogrpid (*) New objects have the group ID of their creator.
sysvgroups
@@ -103,81 +108,83 @@ resuid=n The user ID which may use the reserved blocks.
sb=n Use alternate superblock at this location.
-quota Quota options are currently silently ignored.
-noquota (see fs/ext3/super.c, line 594)
+quota
+noquota
grpquota
usrquota
Specification
=============
-ext3 shares all disk implementation with ext2 filesystem, and add
-transactions capabilities to ext2. Journaling is done by the
-Journaling block device layer.
+Ext3 shares all disk implementation with the ext2 filesystem, and adds
+transactions capabilities to ext2. Journaling is done by the Journaling Block
+Device layer.
Journaling Block Device layer
-----------------------------
-The Journaling Block Device layer (JBD) isn't ext3 specific. It was
-design to add journaling capabilities on a block device. The ext3
-filesystem code will inform the JBD of modifications it is performing
-(Call a transaction). the journal support the transactions start and
-stop, and in case of crash, the journal can replayed the transactions
-to put the partition on a consistent state fastly.
+The Journaling Block Device layer (JBD) isn't ext3 specific. It was design to
+add journaling capabilities on a block device. The ext3 filesystem code will
+inform the JBD of modifications it is performing (called a transaction). The
+journal supports the transactions start and stop, and in case of crash, the
+journal can replayed the transactions to put the partition back in a
+consistent state fast.
-handles represent a single atomic update to a filesystem. JBD can
-handle external journal on a block device.
+Handles represent a single atomic update to a filesystem. JBD can handle an
+external journal on a block device.
Data Mode
---------
-There's 3 different data modes:
+There are 3 different data modes:
* writeback mode
-In data=writeback mode, ext3 does not journal data at all. This mode
-provides a similar level of journaling as XFS, JFS, and ReiserFS in its
-default mode - metadata journaling. A crash+recovery can cause
-incorrect data to appear in files which were written shortly before the
-crash. This mode will typically provide the best ext3 performance.
+In data=writeback mode, ext3 does not journal data at all. This mode provides
+a similar level of journaling as that of XFS, JFS, and ReiserFS in its default
+mode - metadata journaling. A crash+recovery can cause incorrect data to
+appear in files which were written shortly before the crash. This mode will
+typically provide the best ext3 performance.
* ordered mode
-In data=ordered mode, ext3 only officially journals metadata, but it
-logically groups metadata and data blocks into a single unit called a
-transaction. When it's time to write the new metadata out to disk, the
-associated data blocks are written first. In general, this mode
-perform slightly slower than writeback but significantly faster than
-journal mode.
+In data=ordered mode, ext3 only officially journals metadata, but it logically
+groups metadata and data blocks into a single unit called a transaction. When
+it's time to write the new metadata out to disk, the associated data blocks
+are written first. In general, this mode performs slightly slower than
+writeback but significantly faster than journal mode.
* journal mode
-data=journal mode provides full data and metadata journaling. All new
-data is written to the journal first, and then to its final location.
-In the event of a crash, the journal can be replayed, bringing both
-data and metadata into a consistent state. This mode is the slowest
-except when data needs to be read from and written to disk at the same
-time where it outperform all others mode.
+data=journal mode provides full data and metadata journaling. All new data is
+written to the journal first, and then to its final location.
+In the event of a crash, the journal can be replayed, bringing both data and
+metadata into a consistent state. This mode is the slowest except when data
+needs to be read from and written to disk at the same time where it
+outperforms all others modes.
Compatibility
-------------
Ext2 partitions can be easily convert to ext3, with `tune2fs -j <dev>`.
-Ext3 is fully compatible with Ext2. Ext3 partitions can easily be
-mounted as Ext2.
+Ext3 is fully compatible with Ext2. Ext3 partitions can easily be mounted as
+Ext2.
+
External Tools
==============
-see manual pages to know more.
+See manual pages to learn more.
+
+tune2fs: create a ext3 journal on a ext2 partition with the -j flag.
+mke2fs: create a ext3 partition with the -j flag.
+debugfs: ext2 and ext3 file system debugger.
+ext2online: online (mounted) ext2 and ext3 filesystem resizer
-tune2fs: create a ext3 journal on a ext2 partition with the -j flags
-mke2fs: create a ext3 partition with the -j flags
-debugfs: ext2 and ext3 file system debugger
References
==========
-kernel source: file:/usr/src/linux/fs/ext3
- file:/usr/src/linux/fs/jbd
+kernel source: <file:fs/ext3/>
+ <file:fs/jbd/>
-programs: http://e2fsprogs.sourceforge.net
+programs: http://e2fsprogs.sourceforge.net/
+ http://ext2resize.sourceforge.net
-useful link:
- http://www.zip.com.au/~akpm/linux/ext3/ext3-usage.html
+useful links: http://www.zip.com.au/~akpm/linux/ext3/ext3-usage.html
http://www-106.ibm.com/developerworks/linux/library/l-fs7/
http://www-106.ibm.com/developerworks/linux/library/l-fs8/
diff --git a/Documentation/filesystems/fuse.txt b/Documentation/filesystems/fuse.txt
index 6b5741e651a..33f74310d16 100644
--- a/Documentation/filesystems/fuse.txt
+++ b/Documentation/filesystems/fuse.txt
@@ -86,6 +86,62 @@ Mount options
The default is infinite. Note that the size of read requests is
limited anyway to 32 pages (which is 128kbyte on i386).
+Sysfs
+~~~~~
+
+FUSE sets up the following hierarchy in sysfs:
+
+ /sys/fs/fuse/connections/N/
+
+where N is an increasing number allocated to each new connection.
+
+For each connection the following attributes are defined:
+
+ 'waiting'
+
+ The number of requests which are waiting to be transfered to
+ userspace or being processed by the filesystem daemon. If there is
+ no filesystem activity and 'waiting' is non-zero, then the
+ filesystem is hung or deadlocked.
+
+ 'abort'
+
+ Writing anything into this file will abort the filesystem
+ connection. This means that all waiting requests will be aborted an
+ error returned for all aborted and new requests.
+
+Only a privileged user may read or write these attributes.
+
+Aborting a filesystem connection
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+It is possible to get into certain situations where the filesystem is
+not responding. Reasons for this may be:
+
+ a) Broken userspace filesystem implementation
+
+ b) Network connection down
+
+ c) Accidental deadlock
+
+ d) Malicious deadlock
+
+(For more on c) and d) see later sections)
+
+In either of these cases it may be useful to abort the connection to
+the filesystem. There are several ways to do this:
+
+ - Kill the filesystem daemon. Works in case of a) and b)
+
+ - Kill the filesystem daemon and all users of the filesystem. Works
+ in all cases except some malicious deadlocks
+
+ - Use forced umount (umount -f). Works in all cases but only if
+ filesystem is still attached (it hasn't been lazy unmounted)
+
+ - Abort filesystem through the sysfs interface. Most powerful
+ method, always works.
+
How do non-privileged mounts work?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
@@ -313,3 +369,10 @@ faulted with get_user_pages(). The 'req->locked' flag indicates
when the copy is taking place, and interruption is delayed until
this flag is unset.
+Scenario 3 - Tricky deadlock with asynchronous read
+---------------------------------------------------
+
+The same situation as above, except thread-1 will wait on page lock
+and hence it will be uninterruptible as well. The solution is to
+abort the connection with forced umount (if mount is attached) or
+through the abort attribute in sysfs.
diff --git a/Documentation/filesystems/proc.txt b/Documentation/filesystems/proc.txt
index d4773565ea2..944cf109a6f 100644
--- a/Documentation/filesystems/proc.txt
+++ b/Documentation/filesystems/proc.txt
@@ -418,7 +418,7 @@ VmallocChunk: 111088 kB
Dirty: Memory which is waiting to get written back to the disk
Writeback: Memory which is actively being written back to the disk
Mapped: files which have been mmaped, such as libraries
- Slab: in-kernel data structures cache
+ Slab: in-kernel data structures cache
CommitLimit: Based on the overcommit ratio ('vm.overcommit_ratio'),
this is the total amount of memory currently available to
be allocated on the system. This limit is only adhered to
@@ -1302,6 +1302,23 @@ VM has token based thrashing control mechanism and uses the token to prevent
unnecessary page faults in thrashing situation. The unit of the value is
second. The value would be useful to tune thrashing behavior.
+drop_caches
+-----------
+
+Writing to this will cause the kernel to drop clean caches, dentries and
+inodes from memory, causing that memory to become free.
+
+To free pagecache:
+ echo 1 > /proc/sys/vm/drop_caches
+To free dentries and inodes:
+ echo 2 > /proc/sys/vm/drop_caches
+To free pagecache, dentries and inodes:
+ echo 3 > /proc/sys/vm/drop_caches
+
+As this is a non-destructive operation and dirty objects are not freeable, the
+user should run `sync' first.
+
+
2.5 /proc/sys/dev - Device specific parameters
----------------------------------------------
diff --git a/Documentation/filesystems/ramfs-rootfs-initramfs.txt b/Documentation/filesystems/ramfs-rootfs-initramfs.txt
index b3404a03259..60ab61e54e8 100644
--- a/Documentation/filesystems/ramfs-rootfs-initramfs.txt
+++ b/Documentation/filesystems/ramfs-rootfs-initramfs.txt
@@ -143,12 +143,26 @@ as the following example:
dir /mnt 755 0 0
file /init initramfs/init.sh 755 0 0
+Run "usr/gen_init_cpio" (after the kernel build) to get a usage message
+documenting the above file format.
+
One advantage of the text file is that root access is not required to
set permissions or create device nodes in the new archive. (Note that those
two example "file" entries expect to find files named "init.sh" and "busybox" in
a directory called "initramfs", under the linux-2.6.* directory. See
Documentation/early-userspace/README for more details.)
+The kernel does not depend on external cpio tools, gen_init_cpio is created
+from usr/gen_init_cpio.c which is entirely self-contained, and the kernel's
+boot-time extractor is also (obviously) self-contained. However, if you _do_
+happen to have cpio installed, the following command line can extract the
+generated cpio image back into its component files:
+
+ cpio -i -d -H newc -F initramfs_data.cpio --no-absolute-filenames
+
+Contents of initramfs:
+----------------------
+
If you don't already understand what shared libraries, devices, and paths
you need to get a minimal root filesystem up and running, here are some
references:
@@ -161,13 +175,69 @@ designed to be a tiny C library to statically link early userspace
code against, along with some related utilities. It is BSD licensed.
I use uClibc (http://www.uclibc.org) and busybox (http://www.busybox.net)
-myself. These are LGPL and GPL, respectively.
+myself. These are LGPL and GPL, respectively. (A self-contained initramfs
+package is planned for the busybox 1.2 release.)
In theory you could use glibc, but that's not well suited for small embedded
uses like this. (A "hello world" program statically linked against glibc is
over 400k. With uClibc it's 7k. Also note that glibc dlopens libnss to do
name lookups, even when otherwise statically linked.)
+Why cpio rather than tar?
+-------------------------
+
+This decision was made back in December, 2001. The discussion started here:
+
+ http://www.uwsg.iu.edu/hypermail/linux/kernel/0112.2/1538.html
+
+And spawned a second thread (specifically on tar vs cpio), starting here:
+
+ http://www.uwsg.iu.edu/hypermail/linux/kernel/0112.2/1587.html
+
+The quick and dirty summary version (which is no substitute for reading
+the above threads) is:
+
+1) cpio is a standard. It's decades old (from the AT&T days), and already
+ widely used on Linux (inside RPM, Red Hat's device driver disks). Here's
+ a Linux Journal article about it from 1996:
+
+ http://www.linuxjournal.com/article/1213
+
+ It's not as popular as tar because the traditional cpio command line tools
+ require _truly_hideous_ command line arguments. But that says nothing
+ either way about the archive format, and there are alternative tools,
+ such as:
+
+ http://freshmeat.net/projects/afio/
+
+2) The cpio archive format chosen by the kernel is simpler and cleaner (and
+ thus easier to create and parse) than any of the (literally dozens of)
+ various tar archive formats. The complete initramfs archive format is
+ explained in buffer-format.txt, created in usr/gen_init_cpio.c, and
+ extracted in init/initramfs.c. All three together come to less than 26k
+ total of human-readable text.
+
+3) The GNU project standardizing on tar is approximately as relevant as
+ Windows standardizing on zip. Linux is not part of either, and is free
+ to make its own technical decisions.
+
+4) Since this is a kernel internal format, it could easily have been
+ something brand new. The kernel provides its own tools to create and
+ extract this format anyway. Using an existing standard was preferable,
+ but not essential.
+
+5) Al Viro made the decision (quote: "tar is ugly as hell and not going to be
+ supported on the kernel side"):
+
+ http://www.uwsg.iu.edu/hypermail/linux/kernel/0112.2/1540.html
+
+ explained his reasoning:
+
+ http://www.uwsg.iu.edu/hypermail/linux/kernel/0112.2/1550.html
+ http://www.uwsg.iu.edu/hypermail/linux/kernel/0112.2/1638.html
+
+ and, most importantly, designed and implemented the initramfs code.
+
Future directions:
------------------
diff --git a/Documentation/filesystems/relayfs.txt b/Documentation/filesystems/relayfs.txt
index d803abed29f..5832377b734 100644
--- a/Documentation/filesystems/relayfs.txt
+++ b/Documentation/filesystems/relayfs.txt
@@ -44,30 +44,41 @@ relayfs can operate in a mode where it will overwrite data not yet
collected by userspace, and not wait for it to consume it.
relayfs itself does not provide for communication of such data between
-userspace and kernel, allowing the kernel side to remain simple and not
-impose a single interface on userspace. It does provide a separate
-helper though, described below.
+userspace and kernel, allowing the kernel side to remain simple and
+not impose a single interface on userspace. It does provide a set of
+examples and a separate helper though, described below.
+
+klog and relay-apps example code
+================================
+
+relayfs itself is ready to use, but to make things easier, a couple
+simple utility functions and a set of examples are provided.
+
+The relay-apps example tarball, available on the relayfs sourceforge
+site, contains a set of self-contained examples, each consisting of a
+pair of .c files containing boilerplate code for each of the user and
+kernel sides of a relayfs application; combined these two sets of
+boilerplate code provide glue to easily stream data to disk, without
+having to bother with mundane housekeeping chores.
+
+The 'klog debugging functions' patch (klog.patch in the relay-apps
+tarball) provides a couple of high-level logging functions to the
+kernel which allow writing formatted text or raw data to a channel,
+regardless of whether a channel to write into exists or not, or
+whether relayfs is compiled into the kernel or is configured as a
+module. These functions allow you to put unconditional 'trace'
+statements anywhere in the kernel or kernel modules; only when there
+is a 'klog handler' registered will data actually be logged (see the
+klog and kleak examples for details).
+
+It is of course possible to use relayfs from scratch i.e. without
+using any of the relay-apps example code or klog, but you'll have to
+implement communication between userspace and kernel, allowing both to
+convey the state of buffers (full, empty, amount of padding).
+
+klog and the relay-apps examples can be found in the relay-apps
+tarball on http://relayfs.sourceforge.net
-klog, relay-app & librelay
-==========================
-
-relayfs itself is ready to use, but to make things easier, two
-additional systems are provided. klog is a simple wrapper to make
-writing formatted text or raw data to a channel simpler, regardless of
-whether a channel to write into exists or not, or whether relayfs is
-compiled into the kernel or is configured as a module. relay-app is
-the kernel counterpart of userspace librelay.c, combined these two
-files provide glue to easily stream data to disk, without having to
-bother with housekeeping. klog and relay-app can be used together,
-with klog providing high-level logging functions to the kernel and
-relay-app taking care of kernel-user control and disk-logging chores.
-
-It is possible to use relayfs without relay-app & librelay, but you'll
-have to implement communication between userspace and kernel, allowing
-both to convey the state of buffers (full, empty, amount of padding).
-
-klog, relay-app and librelay can be found in the relay-apps tarball on
-http://relayfs.sourceforge.net
The relayfs user space API
==========================
@@ -125,6 +136,8 @@ Here's a summary of the API relayfs provides to in-kernel clients:
relay_reset(chan)
relayfs_create_dir(name, parent)
relayfs_remove_dir(dentry)
+ relayfs_create_file(name, parent, mode, fops, data)
+ relayfs_remove_file(dentry)
channel management typically called on instigation of userspace:
@@ -141,6 +154,8 @@ Here's a summary of the API relayfs provides to in-kernel clients:
subbuf_start(buf, subbuf, prev_subbuf, prev_padding)
buf_mapped(buf, filp)
buf_unmapped(buf, filp)
+ create_buf_file(filename, parent, mode, buf, is_global)
+ remove_buf_file(dentry)
helper functions:
@@ -320,6 +335,71 @@ forces a sub-buffer switch on all the channel buffers, and can be used
to finalize and process the last sub-buffers before the channel is
closed.
+Creating non-relay files
+------------------------
+
+relay_open() automatically creates files in the relayfs filesystem to
+represent the per-cpu kernel buffers; it's often useful for
+applications to be able to create their own files alongside the relay
+files in the relayfs filesystem as well e.g. 'control' files much like
+those created in /proc or debugfs for similar purposes, used to
+communicate control information between the kernel and user sides of a
+relayfs application. For this purpose the relayfs_create_file() and
+relayfs_remove_file() API functions exist. For relayfs_create_file(),
+the caller passes in a set of user-defined file operations to be used
+for the file and an optional void * to a user-specified data item,
+which will be accessible via inode->u.generic_ip (see the relay-apps
+tarball for examples). The file_operations are a required parameter
+to relayfs_create_file() and thus the semantics of these files are
+completely defined by the caller.
+
+See the relay-apps tarball at http://relayfs.sourceforge.net for
+examples of how these non-relay files are meant to be used.
+
+Creating relay files in other filesystems
+-----------------------------------------
+
+By default of course, relay_open() creates relay files in the relayfs
+filesystem. Because relay_file_operations is exported, however, it's
+also possible to create and use relay files in other pseudo-filesytems
+such as debugfs.
+
+For this purpose, two callback functions are provided,
+create_buf_file() and remove_buf_file(). create_buf_file() is called
+once for each per-cpu buffer from relay_open() to allow the client to
+create a file to be used to represent the corresponding buffer; if
+this callback is not defined, the default implementation will create
+and return a file in the relayfs filesystem to represent the buffer.
+The callback should return the dentry of the file created to represent
+the relay buffer. Note that the parent directory passed to
+relay_open() (and passed along to the callback), if specified, must
+exist in the same filesystem the new relay file is created in. If
+create_buf_file() is defined, remove_buf_file() must also be defined;
+it's responsible for deleting the file(s) created in create_buf_file()
+and is called during relay_close().
+
+The create_buf_file() implementation can also be defined in such a way
+as to allow the creation of a single 'global' buffer instead of the
+default per-cpu set. This can be useful for applications interested
+mainly in seeing the relative ordering of system-wide events without
+the need to bother with saving explicit timestamps for the purpose of
+merging/sorting per-cpu files in a postprocessing step.
+
+To have relay_open() create a global buffer, the create_buf_file()
+implementation should set the value of the is_global outparam to a
+non-zero value in addition to creating the file that will be used to
+represent the single buffer. In the case of a global buffer,
+create_buf_file() and remove_buf_file() will be called only once. The
+normal channel-writing functions e.g. relay_write() can still be used
+- writes from any cpu will transparently end up in the global buffer -
+but since it is a global buffer, callers should make sure they use the
+proper locking for such a buffer, either by wrapping writes in a
+spinlock, or by copying a write function from relayfs_fs.h and
+creating a local version that internally does the proper locking.
+
+See the 'exported-relayfile' examples in the relay-apps tarball for
+examples of creating and using relay files in debugfs.
+
Misc
----
diff --git a/Documentation/filesystems/spufs.txt b/Documentation/filesystems/spufs.txt
new file mode 100644
index 00000000000..8edc3952eff
--- /dev/null
+++ b/Documentation/filesystems/spufs.txt
@@ -0,0 +1,521 @@
+SPUFS(2) Linux Programmer's Manual SPUFS(2)
+
+
+
+NAME
+ spufs - the SPU file system
+
+
+DESCRIPTION
+ The SPU file system is used on PowerPC machines that implement the Cell
+ Broadband Engine Architecture in order to access Synergistic Processor
+ Units (SPUs).
+
+ The file system provides a name space similar to posix shared memory or
+ message queues. Users that have write permissions on the file system
+ can use spu_create(2) to establish SPU contexts in the spufs root.
+
+ Every SPU context is represented by a directory containing a predefined
+ set of files. These files can be used for manipulating the state of the
+ logical SPU. Users can change permissions on those files, but not actu-
+ ally add or remove files.
+
+
+MOUNT OPTIONS
+ uid=<uid>
+ set the user owning the mount point, the default is 0 (root).
+
+ gid=<gid>
+ set the group owning the mount point, the default is 0 (root).
+
+
+FILES
+ The files in spufs mostly follow the standard behavior for regular sys-
+ tem calls like read(2) or write(2), but often support only a subset of
+ the operations supported on regular file systems. This list details the
+ supported operations and the deviations from the behaviour in the
+ respective man pages.
+
+ All files that support the read(2) operation also support readv(2) and
+ all files that support the write(2) operation also support writev(2).
+ All files support the access(2) and stat(2) family of operations, but
+ only the st_mode, st_nlink, st_uid and st_gid fields of struct stat
+ contain reliable information.
+
+ All files support the chmod(2)/fchmod(2) and chown(2)/fchown(2) opera-
+ tions, but will not be able to grant permissions that contradict the
+ possible operations, e.g. read access on the wbox file.
+
+ The current set of files is:
+
+
+ /mem
+ the contents of the local storage memory of the SPU. This can be
+ accessed like a regular shared memory file and contains both code and
+ data in the address space of the SPU. The possible operations on an
+ open mem file are:
+
+ read(2), pread(2), write(2), pwrite(2), lseek(2)
+ These operate as documented, with the exception that seek(2),
+ write(2) and pwrite(2) are not supported beyond the end of the
+ file. The file size is the size of the local storage of the SPU,
+ which normally is 256 kilobytes.
+
+ mmap(2)
+ Mapping mem into the process address space gives access to the
+ SPU local storage within the process address space. Only
+ MAP_SHARED mappings are allowed.
+
+
+ /mbox
+ The first SPU to CPU communication mailbox. This file is read-only and
+ can be read in units of 32 bits. The file can only be used in non-
+ blocking mode and it even poll() will not block on it. The possible
+ operations on an open mbox file are:
+
+ read(2)
+ If a count smaller than four is requested, read returns -1 and
+ sets errno to EINVAL. If there is no data available in the mail
+ box, the return value is set to -1 and errno becomes EAGAIN.
+ When data has been read successfully, four bytes are placed in
+ the data buffer and the value four is returned.
+
+
+ /ibox
+ The second SPU to CPU communication mailbox. This file is similar to
+ the first mailbox file, but can be read in blocking I/O mode, and the
+ poll familiy of system calls can be used to wait for it. The possible
+ operations on an open ibox file are:
+
+ read(2)
+ If a count smaller than four is requested, read returns -1 and
+ sets errno to EINVAL. If there is no data available in the mail
+ box and the file descriptor has been opened with O_NONBLOCK, the
+ return value is set to -1 and errno becomes EAGAIN.
+
+ If there is no data available in the mail box and the file
+ descriptor has been opened without O_NONBLOCK, the call will
+ block until the SPU writes to its interrupt mailbox channel.
+ When data has been read successfully, four bytes are placed in
+ the data buffer and the value four is returned.
+
+ poll(2)
+ Poll on the ibox file returns (POLLIN | POLLRDNORM) whenever
+ data is available for reading.
+
+
+ /wbox
+ The CPU to SPU communation mailbox. It is write-only can can be written
+ in units of 32 bits. If the mailbox is full, write() will block and
+ poll can be used to wait for it becoming empty again. The possible
+ operations on an open wbox file are: write(2) If a count smaller than
+ four is requested, write returns -1 and sets errno to EINVAL. If there
+ is no space available in the mail box and the file descriptor has been
+ opened with O_NONBLOCK, the return value is set to -1 and errno becomes
+ EAGAIN.
+
+ If there is no space available in the mail box and the file descriptor
+ has been opened without O_NONBLOCK, the call will block until the SPU
+ reads from its PPE mailbox channel. When data has been read success-
+ fully, four bytes are placed in the data buffer and the value four is
+ returned.
+
+ poll(2)
+ Poll on the ibox file returns (POLLOUT | POLLWRNORM) whenever
+ space is available for writing.
+
+
+ /mbox_stat
+ /ibox_stat
+ /wbox_stat
+ Read-only files that contain the length of the current queue, i.e. how
+ many words can be read from mbox or ibox or how many words can be
+ written to wbox without blocking. The files can be read only in 4-byte
+ units and return a big-endian binary integer number. The possible
+ operations on an open *box_stat file are:
+
+ read(2)
+ If a count smaller than four is requested, read returns -1 and
+ sets errno to EINVAL. Otherwise, a four byte value is placed in
+ the data buffer, containing the number of elements that can be
+ read from (for mbox_stat and ibox_stat) or written to (for
+ wbox_stat) the respective mail box without blocking or resulting
+ in EAGAIN.
+
+
+ /npc
+ /decr
+ /decr_status
+ /spu_tag_mask
+ /event_mask
+ /srr0
+ Internal registers of the SPU. The representation is an ASCII string
+ with the numeric value of the next instruction to be executed. These
+ can be used in read/write mode for debugging, but normal operation of
+ programs should not rely on them because access to any of them except
+ npc requires an SPU context save and is therefore very inefficient.
+
+ The contents of these files are:
+
+ npc Next Program Counter
+
+ decr SPU Decrementer
+
+ decr_status Decrementer Status
+
+ spu_tag_mask MFC tag mask for SPU DMA
+
+ event_mask Event mask for SPU interrupts
+
+ srr0 Interrupt Return address register
+
+
+ The possible operations on an open npc, decr, decr_status,
+ spu_tag_mask, event_mask or srr0 file are:
+
+ read(2)
+ When the count supplied to the read call is shorter than the
+ required length for the pointer value plus a newline character,
+ subsequent reads from the same file descriptor will result in
+ completing the string, regardless of changes to the register by
+ a running SPU task. When a complete string has been read, all
+ subsequent read operations will return zero bytes and a new file
+ descriptor needs to be opened to read the value again.
+
+ write(2)
+ A write operation on the file results in setting the register to
+ the value given in the string. The string is parsed from the
+ beginning to the first non-numeric character or the end of the
+ buffer. Subsequent writes to the same file descriptor overwrite
+ the previous setting.
+
+
+ /fpcr
+ This file gives access to the Floating Point Status and Control Regis-
+ ter as a four byte long file. The operations on the fpcr file are:
+
+ read(2)
+ If a count smaller than four is requested, read returns -1 and
+ sets errno to EINVAL. Otherwise, a four byte value is placed in
+ the data buffer, containing the current value of the fpcr regis-
+ ter.
+
+ write(2)
+ If a count smaller than four is requested, write returns -1 and
+ sets errno to EINVAL. Otherwise, a four byte value is copied
+ from the data buffer, updating the value of the fpcr register.
+
+
+ /signal1
+ /signal2
+ The two signal notification channels of an SPU. These are read-write
+ files that operate on a 32 bit word. Writing to one of these files
+ triggers an interrupt on the SPU. The value writting to the signal
+ files can be read from the SPU through a channel read or from host user
+ space through the file. After the value has been read by the SPU, it
+ is reset to zero. The possible operations on an open signal1 or sig-
+ nal2 file are:
+
+ read(2)
+ If a count smaller than four is requested, read returns -1 and
+ sets errno to EINVAL. Otherwise, a four byte value is placed in
+ the data buffer, containing the current value of the specified
+ signal notification register.
+
+ write(2)
+ If a count smaller than four is requested, write returns -1 and
+ sets errno to EINVAL. Otherwise, a four byte value is copied
+ from the data buffer, updating the value of the specified signal
+ notification register. The signal notification register will
+ either be replaced with the input data or will be updated to the
+ bitwise OR or the old value and the input data, depending on the
+ contents of the signal1_type, or signal2_type respectively,
+ file.
+
+
+ /signal1_type
+ /signal2_type
+ These two files change the behavior of the signal1 and signal2 notifi-
+ cation files. The contain a numerical ASCII string which is read as
+ either "1" or "0". In mode 0 (overwrite), the hardware replaces the
+ contents of the signal channel with the data that is written to it. in
+ mode 1 (logical OR), the hardware accumulates the bits that are subse-
+ quently written to it. The possible operations on an open signal1_type
+ or signal2_type file are:
+
+ read(2)
+ When the count supplied to the read call is shorter than the
+ required length for the digit plus a newline character, subse-
+ quent reads from the same file descriptor will result in com-
+ pleting the string. When a complete string has been read, all
+ subsequent read operations will return zero bytes and a new file
+ descriptor needs to be opened to read the value again.
+
+ write(2)
+ A write operation on the file results in setting the register to
+ the value given in the string. The string is parsed from the
+ beginning to the first non-numeric character or the end of the
+ buffer. Subsequent writes to the same file descriptor overwrite
+ the previous setting.
+
+
+EXAMPLES
+ /etc/fstab entry
+ none /spu spufs gid=spu 0 0
+
+
+AUTHORS
+ Arnd Bergmann <arndb@de.ibm.com>, Mark Nutter <mnutter@us.ibm.com>,
+ Ulrich Weigand <Ulrich.Weigand@de.ibm.com>
+
+SEE ALSO
+ capabilities(7), close(2), spu_create(2), spu_run(2), spufs(7)
+
+
+
+Linux 2005-09-28 SPUFS(2)
+
+------------------------------------------------------------------------------
+
+SPU_RUN(2) Linux Programmer's Manual SPU_RUN(2)
+
+
+
+NAME
+ spu_run - execute an spu context
+
+
+SYNOPSIS
+ #include <sys/spu.h>
+
+ int spu_run(int fd, unsigned int *npc, unsigned int *event);
+
+DESCRIPTION
+ The spu_run system call is used on PowerPC machines that implement the
+ Cell Broadband Engine Architecture in order to access Synergistic Pro-
+ cessor Units (SPUs). It uses the fd that was returned from spu_cre-
+ ate(2) to address a specific SPU context. When the context gets sched-
+ uled to a physical SPU, it starts execution at the instruction pointer
+ passed in npc.
+
+ Execution of SPU code happens synchronously, meaning that spu_run does
+ not return while the SPU is still running. If there is a need to exe-
+ cute SPU code in parallel with other code on either the main CPU or
+ other SPUs, you need to create a new thread of execution first, e.g.
+ using the pthread_create(3) call.
+
+ When spu_run returns, the current value of the SPU instruction pointer
+ is written back to npc, so you can call spu_run again without updating
+ the pointers.
+
+ event can be a NULL pointer or point to an extended status code that
+ gets filled when spu_run returns. It can be one of the following con-
+ stants:
+
+ SPE_EVENT_DMA_ALIGNMENT
+ A DMA alignment error
+
+ SPE_EVENT_SPE_DATA_SEGMENT
+ A DMA segmentation error
+
+ SPE_EVENT_SPE_DATA_STORAGE
+ A DMA storage error
+
+ If NULL is passed as the event argument, these errors will result in a
+ signal delivered to the calling process.
+
+RETURN VALUE
+ spu_run returns the value of the spu_status register or -1 to indicate
+ an error and set errno to one of the error codes listed below. The
+ spu_status register value contains a bit mask of status codes and
+ optionally a 14 bit code returned from the stop-and-signal instruction
+ on the SPU. The bit masks for the status codes are:
+
+ 0x02 SPU was stopped by stop-and-signal.
+
+ 0x04 SPU was stopped by halt.
+
+ 0x08 SPU is waiting for a channel.
+
+ 0x10 SPU is in single-step mode.
+
+ 0x20 SPU has tried to execute an invalid instruction.
+
+ 0x40 SPU has tried to access an invalid channel.
+
+ 0x3fff0000
+ The bits masked with this value contain the code returned from
+ stop-and-signal.
+
+ There are always one or more of the lower eight bits set or an error
+ code is returned from spu_run.
+
+ERRORS
+ EAGAIN or EWOULDBLOCK
+ fd is in non-blocking mode and spu_run would block.
+
+ EBADF fd is not a valid file descriptor.
+
+ EFAULT npc is not a valid pointer or status is neither NULL nor a valid
+ pointer.
+
+ EINTR A signal occured while spu_run was in progress. The npc value
+ has been updated to the new program counter value if necessary.
+
+ EINVAL fd is not a file descriptor returned from spu_create(2).
+
+ ENOMEM Insufficient memory was available to handle a page fault result-
+ ing from an MFC direct memory access.
+
+ ENOSYS the functionality is not provided by the current system, because
+ either the hardware does not provide SPUs or the spufs module is
+ not loaded.
+
+
+NOTES
+ spu_run is meant to be used from libraries that implement a more
+ abstract interface to SPUs, not to be used from regular applications.
+ See http://www.bsc.es/projects/deepcomputing/linuxoncell/ for the rec-
+ ommended libraries.
+
+
+CONFORMING TO
+ This call is Linux specific and only implemented by the ppc64 architec-
+ ture. Programs using this system call are not portable.
+
+
+BUGS
+ The code does not yet fully implement all features lined out here.
+
+
+AUTHOR
+ Arnd Bergmann <arndb@de.ibm.com>
+
+SEE ALSO
+ capabilities(7), close(2), spu_create(2), spufs(7)
+
+
+
+Linux 2005-09-28 SPU_RUN(2)
+
+------------------------------------------------------------------------------
+
+SPU_CREATE(2) Linux Programmer's Manual SPU_CREATE(2)
+
+
+
+NAME
+ spu_create - create a new spu context
+
+
+SYNOPSIS
+ #include <sys/types.h>
+ #include <sys/spu.h>
+
+ int spu_create(const char *pathname, int flags, mode_t mode);
+
+DESCRIPTION
+ The spu_create system call is used on PowerPC machines that implement
+ the Cell Broadband Engine Architecture in order to access Synergistic
+ Processor Units (SPUs). It creates a new logical context for an SPU in
+ pathname and returns a handle to associated with it. pathname must
+ point to a non-existing directory in the mount point of the SPU file
+ system (spufs). When spu_create is successful, a directory gets cre-
+ ated on pathname and it is populated with files.
+
+ The returned file handle can only be passed to spu_run(2) or closed,
+ other operations are not defined on it. When it is closed, all associ-
+ ated directory entries in spufs are removed. When the last file handle
+ pointing either inside of the context directory or to this file
+ descriptor is closed, the logical SPU context is destroyed.
+
+ The parameter flags can be zero or any bitwise or'd combination of the
+ following constants:
+
+ SPU_RAWIO
+ Allow mapping of some of the hardware registers of the SPU into
+ user space. This flag requires the CAP_SYS_RAWIO capability, see
+ capabilities(7).
+
+ The mode parameter specifies the permissions used for creating the new
+ directory in spufs. mode is modified with the user's umask(2) value
+ and then used for both the directory and the files contained in it. The
+ file permissions mask out some more bits of mode because they typically
+ support only read or write access. See stat(2) for a full list of the
+ possible mode values.
+
+
+RETURN VALUE
+ spu_create returns a new file descriptor. It may return -1 to indicate
+ an error condition and set errno to one of the error codes listed
+ below.
+
+
+ERRORS
+ EACCESS
+ The current user does not have write access on the spufs mount
+ point.
+
+ EEXIST An SPU context already exists at the given path name.
+
+ EFAULT pathname is not a valid string pointer in the current address
+ space.
+
+ EINVAL pathname is not a directory in the spufs mount point.
+
+ ELOOP Too many symlinks were found while resolving pathname.
+
+ EMFILE The process has reached its maximum open file limit.
+
+ ENAMETOOLONG
+ pathname was too long.
+
+ ENFILE The system has reached the global open file limit.
+
+ ENOENT Part of pathname could not be resolved.
+
+ ENOMEM The kernel could not allocate all resources required.
+
+ ENOSPC There are not enough SPU resources available to create a new
+ context or the user specific limit for the number of SPU con-
+ texts has been reached.
+
+ ENOSYS the functionality is not provided by the current system, because
+ either the hardware does not provide SPUs or the spufs module is
+ not loaded.
+
+ ENOTDIR
+ A part of pathname is not a directory.
+
+
+
+NOTES
+ spu_create is meant to be used from libraries that implement a more
+ abstract interface to SPUs, not to be used from regular applications.
+ See http://www.bsc.es/projects/deepcomputing/linuxoncell/ for the rec-
+ ommended libraries.
+
+
+FILES
+ pathname must point to a location beneath the mount point of spufs. By
+ convention, it gets mounted in /spu.
+
+
+CONFORMING TO
+ This call is Linux specific and only implemented by the ppc64 architec-
+ ture. Programs using this system call are not portable.
+
+
+BUGS
+ The code does not yet fully implement all features lined out here.
+
+
+AUTHOR
+ Arnd Bergmann <arndb@de.ibm.com>
+
+SEE ALSO
+ capabilities(7), close(2), spu_run(2), spufs(7)
+
+
+
+Linux 2005-09-28 SPU_CREATE(2)
diff --git a/Documentation/filesystems/sysfs-pci.txt b/Documentation/filesystems/sysfs-pci.txt
index 988a62fae11..7ba2baa165f 100644
--- a/Documentation/filesystems/sysfs-pci.txt
+++ b/Documentation/filesystems/sysfs-pci.txt
@@ -1,4 +1,5 @@
Accessing PCI device resources through sysfs
+--------------------------------------------
sysfs, usually mounted at /sys, provides access to PCI resources on platforms
that support it. For example, a given bus might look like this:
@@ -47,14 +48,21 @@ files, each with their own function.
binary - file contains binary data
cpumask - file contains a cpumask type
-The read only files are informational, writes to them will be ignored.
-Writable files can be used to perform actions on the device (e.g. changing
-config space, detaching a device). mmapable files are available via an
-mmap of the file at offset 0 and can be used to do actual device programming
-from userspace. Note that some platforms don't support mmapping of certain
-resources, so be sure to check the return value from any attempted mmap.
+The read only files are informational, writes to them will be ignored, with
+the exception of the 'rom' file. Writable files can be used to perform
+actions on the device (e.g. changing config space, detaching a device).
+mmapable files are available via an mmap of the file at offset 0 and can be
+used to do actual device programming from userspace. Note that some platforms
+don't support mmapping of certain resources, so be sure to check the return
+value from any attempted mmap.
+
+The 'rom' file is special in that it provides read-only access to the device's
+ROM file, if available. It's disabled by default, however, so applications
+should write the string "1" to the file to enable it before attempting a read
+call, and disable it following the access by writing "0" to the file.
Accessing legacy resources through sysfs
+----------------------------------------
Legacy I/O port and ISA memory resources are also provided in sysfs if the
underlying platform supports them. They're located in the PCI class heirarchy,
@@ -75,6 +83,7 @@ simply dereference the returned pointer (after checking for errors of course)
to access legacy memory space.
Supporting PCI access on new platforms
+--------------------------------------
In order to support PCI resource mapping as described above, Linux platform
code must define HAVE_PCI_MMAP and provide a pci_mmap_page_range function.
diff --git a/Documentation/filesystems/tmpfs.txt b/Documentation/filesystems/tmpfs.txt
index 0d783c504ea..dbe4d87d261 100644
--- a/Documentation/filesystems/tmpfs.txt
+++ b/Documentation/filesystems/tmpfs.txt
@@ -78,6 +78,18 @@ use up all the memory on the machine; but enhances the scalability of
that instance in a system with many cpus making intensive use of it.
+tmpfs has a mount option to set the NUMA memory allocation policy for
+all files in that instance:
+mpol=interleave prefers to allocate memory from each node in turn
+mpol=default prefers to allocate memory from the local node
+mpol=bind prefers to allocate from mpol_nodelist
+mpol=preferred prefers to allocate from first node in mpol_nodelist
+
+The following mount option is used in conjunction with mpol=interleave,
+mpol=bind or mpol=preferred:
+mpol_nodelist: nodelist suitable for parsing with nodelist_parse.
+
+
To specify the initial root directory you can use the following mount
options: