summaryrefslogtreecommitdiff
diff options
context:
space:
mode:
-rw-r--r--Documentation/DocBook/kernel-hacking.tmpl310
1 files changed, 144 insertions, 166 deletions
diff --git a/Documentation/DocBook/kernel-hacking.tmpl b/Documentation/DocBook/kernel-hacking.tmpl
index 49a9ef82d57..6367bba32d2 100644
--- a/Documentation/DocBook/kernel-hacking.tmpl
+++ b/Documentation/DocBook/kernel-hacking.tmpl
@@ -8,8 +8,7 @@
<authorgroup>
<author>
- <firstname>Paul</firstname>
- <othername>Rusty</othername>
+ <firstname>Rusty</firstname>
<surname>Russell</surname>
<affiliation>
<address>
@@ -20,7 +19,7 @@
</authorgroup>
<copyright>
- <year>2001</year>
+ <year>2005</year>
<holder>Rusty Russell</holder>
</copyright>
@@ -64,7 +63,7 @@
<chapter id="introduction">
<title>Introduction</title>
<para>
- Welcome, gentle reader, to Rusty's Unreliable Guide to Linux
+ Welcome, gentle reader, to Rusty's Remarkably Unreliable Guide to Linux
Kernel Hacking. This document describes the common routines and
general requirements for kernel code: its goal is to serve as a
primer for Linux kernel development for experienced C
@@ -96,13 +95,13 @@
<listitem>
<para>
- not associated with any process, serving a softirq, tasklet or bh;
+ not associated with any process, serving a softirq or tasklet;
</para>
</listitem>
<listitem>
<para>
- running in kernel space, associated with a process;
+ running in kernel space, associated with a process (user context);
</para>
</listitem>
@@ -114,11 +113,12 @@
</itemizedlist>
<para>
- There is a strict ordering between these: other than the last
- category (userspace) each can only be pre-empted by those above.
- For example, while a softirq is running on a CPU, no other
- softirq will pre-empt it, but a hardware interrupt can. However,
- any other CPUs in the system execute independently.
+ There is an ordering between these. The bottom two can preempt
+ each other, but above that is a strict hierarchy: each can only be
+ preempted by the ones above it. For example, while a softirq is
+ running on a CPU, no other softirq will preempt it, but a hardware
+ interrupt can. However, any other CPUs in the system execute
+ independently.
</para>
<para>
@@ -130,10 +130,10 @@
<title>User Context</title>
<para>
- User context is when you are coming in from a system call or
- other trap: you can sleep, and you own the CPU (except for
- interrupts) until you call <function>schedule()</function>.
- In other words, user context (unlike userspace) is not pre-emptable.
+ User context is when you are coming in from a system call or other
+ trap: like userspace, you can be preempted by more important tasks
+ and by interrupts. You can sleep, by calling
+ <function>schedule()</function>.
</para>
<note>
@@ -153,7 +153,7 @@
<caution>
<para>
- Beware that if you have interrupts or bottom halves disabled
+ Beware that if you have preemption or softirqs disabled
(see below), <function>in_interrupt()</function> will return a
false positive.
</para>
@@ -168,10 +168,10 @@
<hardware>keyboard</hardware> are examples of real
hardware which produce interrupts at any time. The kernel runs
interrupt handlers, which services the hardware. The kernel
- guarantees that this handler is never re-entered: if another
+ guarantees that this handler is never re-entered: if the same
interrupt arrives, it is queued (or dropped). Because it
disables interrupts, this handler has to be fast: frequently it
- simply acknowledges the interrupt, marks a `software interrupt'
+ simply acknowledges the interrupt, marks a 'software interrupt'
for execution and exits.
</para>
@@ -188,60 +188,52 @@
</sect1>
<sect1 id="basics-softirqs">
- <title>Software Interrupt Context: Bottom Halves, Tasklets, softirqs</title>
+ <title>Software Interrupt Context: Softirqs and Tasklets</title>
<para>
Whenever a system call is about to return to userspace, or a
- hardware interrupt handler exits, any `software interrupts'
+ hardware interrupt handler exits, any 'software interrupts'
which are marked pending (usually by hardware interrupts) are
run (<filename>kernel/softirq.c</filename>).
</para>
<para>
Much of the real interrupt handling work is done here. Early in
- the transition to <acronym>SMP</acronym>, there were only `bottom
+ the transition to <acronym>SMP</acronym>, there were only 'bottom
halves' (BHs), which didn't take advantage of multiple CPUs. Shortly
after we switched from wind-up computers made of match-sticks and snot,
- we abandoned this limitation.
+ we abandoned this limitation and switched to 'softirqs'.
</para>
<para>
<filename class="headerfile">include/linux/interrupt.h</filename> lists the
- different BH's. No matter how many CPUs you have, no two BHs will run at
- the same time. This made the transition to SMP simpler, but sucks hard for
- scalable performance. A very important bottom half is the timer
- BH (<filename class="headerfile">include/linux/timer.h</filename>): you
- can register to have it call functions for you in a given length of time.
+ different softirqs. A very important softirq is the
+ timer softirq (<filename
+ class="headerfile">include/linux/timer.h</filename>): you can
+ register to have it call functions for you in a given length of
+ time.
</para>
<para>
- 2.3.43 introduced softirqs, and re-implemented the (now
- deprecated) BHs underneath them. Softirqs are fully-SMP
- versions of BHs: they can run on as many CPUs at once as
- required. This means they need to deal with any races in shared
- data using their own locks. A bitmask is used to keep track of
- which are enabled, so the 32 available softirqs should not be
- used up lightly. (<emphasis>Yes</emphasis>, people will
- notice).
- </para>
-
- <para>
- tasklets (<filename class="headerfile">include/linux/interrupt.h</filename>)
- are like softirqs, except they are dynamically-registrable (meaning you
- can have as many as you want), and they also guarantee that any tasklet
- will only run on one CPU at any time, although different tasklets can
- run simultaneously (unlike different BHs).
+ Softirqs are often a pain to deal with, since the same softirq
+ will run simultaneously on more than one CPU. For this reason,
+ tasklets (<filename
+ class="headerfile">include/linux/interrupt.h</filename>) are more
+ often used: they are dynamically-registrable (meaning you can have
+ as many as you want), and they also guarantee that any tasklet
+ will only run on one CPU at any time, although different tasklets
+ can run simultaneously.
</para>
<caution>
<para>
- The name `tasklet' is misleading: they have nothing to do with `tasks',
+ The name 'tasklet' is misleading: they have nothing to do with 'tasks',
and probably more to do with some bad vodka Alexey Kuznetsov had at the
time.
</para>
</caution>
<para>
- You can tell you are in a softirq (or bottom half, or tasklet)
+ You can tell you are in a softirq (or tasklet)
using the <function>in_softirq()</function> macro
(<filename class="headerfile">include/linux/interrupt.h</filename>).
</para>
@@ -288,11 +280,10 @@
<term>A rigid stack limit</term>
<listitem>
<para>
- The kernel stack is about 6K in 2.2 (for most
- architectures: it's about 14K on the Alpha), and shared
- with interrupts so you can't use it all. Avoid deep
- recursion and huge local arrays on the stack (allocate
- them dynamically instead).
+ Depending on configuration options the kernel stack is about 3K to 6K for most 32-bit architectures: it's
+ about 14K on most 64-bit archs, and often shared with interrupts
+ so you can't use it all. Avoid deep recursion and huge local
+ arrays on the stack (allocate them dynamically instead).
</para>
</listitem>
</varlistentry>
@@ -339,7 +330,7 @@ asmlinkage long sys_mycall(int arg)
<para>
If all your routine does is read or write some parameter, consider
- implementing a <function>sysctl</function> interface instead.
+ implementing a <function>sysfs</function> interface instead.
</para>
<para>
@@ -417,7 +408,10 @@ cond_resched(); /* Will sleep */
</para>
<para>
- You will eventually lock up your box if you break these rules.
+ You should always compile your kernel
+ <symbol>CONFIG_DEBUG_SPINLOCK_SLEEP</symbol> on, and it will warn
+ you if you break these rules. If you <emphasis>do</emphasis> break
+ the rules, you will eventually lock up your box.
</para>
<para>
@@ -515,8 +509,7 @@ printk(KERN_INFO "my ip: %d.%d.%d.%d\n", NIPQUAD(ipaddress));
success).
</para>
</caution>
- [Yes, this moronic interface makes me cringe. Please submit a
- patch and become my hero --RR.]
+ [Yes, this moronic interface makes me cringe. The flamewar comes up every year or so. --RR.]
</para>
<para>
The functions may sleep implicitly. This should never be called
@@ -587,10 +580,11 @@ printk(KERN_INFO "my ip: %d.%d.%d.%d\n", NIPQUAD(ipaddress));
</variablelist>
<para>
- If you see a <errorname>kmem_grow: Called nonatomically from int
- </errorname> warning message you called a memory allocation function
- from interrupt context without <constant>GFP_ATOMIC</constant>.
- You should really fix that. Run, don't walk.
+ If you see a <errorname>sleeping function called from invalid
+ context</errorname> warning message, then maybe you called a
+ sleeping allocation function from interrupt context without
+ <constant>GFP_ATOMIC</constant>. You should really fix that.
+ Run, don't walk.
</para>
<para>
@@ -639,16 +633,16 @@ printk(KERN_INFO "my ip: %d.%d.%d.%d\n", NIPQUAD(ipaddress));
</sect1>
<sect1 id="routines-udelay">
- <title><function>udelay()</function>/<function>mdelay()</function>
+ <title><function>mdelay()</function>/<function>udelay()</function>
<filename class="headerfile">include/asm/delay.h</filename>
<filename class="headerfile">include/linux/delay.h</filename>
</title>
<para>
- The <function>udelay()</function> function can be used for small pauses.
- Do not use large values with <function>udelay()</function> as you risk
+ The <function>udelay()</function> and <function>ndelay()</function> functions can be used for small pauses.
+ Do not use large values with them as you risk
overflow - the helper function <function>mdelay()</function> is useful
- here, or even consider <function>schedule_timeout()</function>.
+ here, or consider <function>msleep()</function>.
</para>
</sect1>
@@ -698,8 +692,8 @@ printk(KERN_INFO "my ip: %d.%d.%d.%d\n", NIPQUAD(ipaddress));
These routines disable soft interrupts on the local CPU, and
restore them. They are reentrant; if soft interrupts were
disabled before, they will still be disabled after this pair
- of functions has been called. They prevent softirqs, tasklets
- and bottom halves from running on the current CPU.
+ of functions has been called. They prevent softirqs and tasklets
+ from running on the current CPU.
</para>
</sect1>
@@ -708,10 +702,16 @@ printk(KERN_INFO "my ip: %d.%d.%d.%d\n", NIPQUAD(ipaddress));
<filename class="headerfile">include/asm/smp.h</filename></title>
<para>
- <function>smp_processor_id()</function> returns the current
- processor number, between 0 and <symbol>NR_CPUS</symbol> (the
- maximum number of CPUs supported by Linux, currently 32). These
- values are not necessarily continuous.
+ <function>get_cpu()</function> disables preemption (so you won't
+ suddenly get moved to another CPU) and returns the current
+ processor number, between 0 and <symbol>NR_CPUS</symbol>. Note
+ that the CPU numbers are not necessarily continuous. You return
+ it again with <function>put_cpu()</function> when you are done.
+ </para>
+ <para>
+ If you know you cannot be preempted by another task (ie. you are
+ in interrupt context, or have preemption disabled) you can use
+ smp_processor_id().
</para>
</sect1>
@@ -722,19 +722,14 @@ printk(KERN_INFO "my ip: %d.%d.%d.%d\n", NIPQUAD(ipaddress));
<para>
After boot, the kernel frees up a special section; functions
marked with <type>__init</type> and data structures marked with
- <type>__initdata</type> are dropped after boot is complete (within
- modules this directive is currently ignored). <type>__exit</type>
+ <type>__initdata</type> are dropped after boot is complete: similarly
+ modules discard this memory after initialization. <type>__exit</type>
is used to declare a function which is only required on exit: the
function will be dropped if this file is not compiled as a module.
See the header file for use. Note that it makes no sense for a function
marked with <type>__init</type> to be exported to modules with
<function>EXPORT_SYMBOL()</function> - this will break.
</para>
- <para>
- Static data structures marked as <type>__initdata</type> must be initialised
- (as opposed to ordinary static data which is zeroed BSS) and cannot be
- <type>const</type>.
- </para>
</sect1>
@@ -762,9 +757,8 @@ printk(KERN_INFO "my ip: %d.%d.%d.%d\n", NIPQUAD(ipaddress));
<para>
The function can return a negative error number to cause
module loading to fail (unfortunately, this has no effect if
- the module is compiled into the kernel). For modules, this is
- called in user context, with interrupts enabled, and the
- kernel lock held, so it can sleep.
+ the module is compiled into the kernel). This function is
+ called in user context with interrupts enabled, so it can sleep.
</para>
</sect1>
@@ -779,6 +773,34 @@ printk(KERN_INFO "my ip: %d.%d.%d.%d\n", NIPQUAD(ipaddress));
reached zero. This function can also sleep, but cannot fail:
everything must be cleaned up by the time it returns.
</para>
+
+ <para>
+ Note that this macro is optional: if it is not present, your
+ module will not be removable (except for 'rmmod -f').
+ </para>
+ </sect1>
+
+ <sect1 id="routines-module-use-counters">
+ <title> <function>try_module_get()</function>/<function>module_put()</function>
+ <filename class="headerfile">include/linux/module.h</filename></title>
+
+ <para>
+ These manipulate the module usage count, to protect against
+ removal (a module also can't be removed if another module uses one
+ of its exported symbols: see below). Before calling into module
+ code, you should call <function>try_module_get()</function> on
+ that module: if it fails, then the module is being removed and you
+ should act as if it wasn't there. Otherwise, you can safely enter
+ the module, and call <function>module_put()</function> when you're
+ finished.
+ </para>
+
+ <para>
+ Most registerable structures have an
+ <structfield>owner</structfield> field, such as in the
+ <structname>file_operations</structname> structure. Set this field
+ to the macro <symbol>THIS_MODULE</symbol>.
+ </para>
</sect1>
<!-- add info on new-style module refcounting here -->
@@ -821,7 +843,7 @@ printk(KERN_INFO "my ip: %d.%d.%d.%d\n", NIPQUAD(ipaddress));
There is a macro to do this:
<function>wait_event_interruptible()</function>
- <filename class="headerfile">include/linux/sched.h</filename> The
+ <filename class="headerfile">include/linux/wait.h</filename> The
first argument is the wait queue head, and the second is an
expression which is evaluated; the macro returns
<returnvalue>0</returnvalue> when this expression is true, or
@@ -847,10 +869,11 @@ printk(KERN_INFO "my ip: %d.%d.%d.%d\n", NIPQUAD(ipaddress));
<para>
Call <function>wake_up()</function>
- <filename class="headerfile">include/linux/sched.h</filename>;,
+ <filename class="headerfile">include/linux/wait.h</filename>;,
which will wake up every process in the queue. The exception is
if one has <constant>TASK_EXCLUSIVE</constant> set, in which case
- the remainder of the queue will not be woken.
+ the remainder of the queue will not be woken. There are other variants
+ of this basic function available in the same header.
</para>
</sect1>
</chapter>
@@ -863,7 +886,7 @@ printk(KERN_INFO "my ip: %d.%d.%d.%d\n", NIPQUAD(ipaddress));
first class of operations work on <type>atomic_t</type>
<filename class="headerfile">include/asm/atomic.h</filename>; this
- contains a signed integer (at least 24 bits long), and you must use
+ contains a signed integer (at least 32 bits long), and you must use
these functions to manipulate or read atomic_t variables.
<function>atomic_read()</function> and
<function>atomic_set()</function> get and set the counter,
@@ -882,13 +905,12 @@ printk(KERN_INFO "my ip: %d.%d.%d.%d\n", NIPQUAD(ipaddress));
<para>
Note that these functions are slower than normal arithmetic, and
- so should not be used unnecessarily. On some platforms they
- are much slower, like 32-bit Sparc where they use a spinlock.
+ so should not be used unnecessarily.
</para>
<para>
- The second class of atomic operations is atomic bit operations on a
- <type>long</type>, defined in
+ The second class of atomic operations is atomic bit operations on an
+ <type>unsigned long</type>, defined in
<filename class="headerfile">include/linux/bitops.h</filename>. These
operations generally take a pointer to the bit pattern, and a bit
@@ -899,7 +921,7 @@ printk(KERN_INFO "my ip: %d.%d.%d.%d\n", NIPQUAD(ipaddress));
<function>test_and_clear_bit()</function> and
<function>test_and_change_bit()</function> do the same thing,
except return true if the bit was previously set; these are
- particularly useful for very simple locking.
+ particularly useful for atomically setting flags.
</para>
<para>
@@ -907,12 +929,6 @@ printk(KERN_INFO "my ip: %d.%d.%d.%d\n", NIPQUAD(ipaddress));
than BITS_PER_LONG. The resulting behavior is strange on big-endian
platforms though so it is a good idea not to do this.
</para>
-
- <para>
- Note that the order of bits depends on the architecture, and in
- particular, the bitfield passed to these operations must be at
- least as large as a <type>long</type>.
- </para>
</chapter>
<chapter id="symbols">
@@ -932,11 +948,8 @@ printk(KERN_INFO "my ip: %d.%d.%d.%d\n", NIPQUAD(ipaddress));
<filename class="headerfile">include/linux/module.h</filename></title>
<para>
- This is the classic method of exporting a symbol, and it works
- for both modules and non-modules. In the kernel all these
- declarations are often bundled into a single file to help
- genksyms (which searches source files for these declarations).
- See the comment on genksyms and Makefiles below.
+ This is the classic method of exporting a symbol: dynamically
+ loaded modules will be able to use the symbol as normal.
</para>
</sect1>
@@ -949,7 +962,8 @@ printk(KERN_INFO "my ip: %d.%d.%d.%d\n", NIPQUAD(ipaddress));
symbols exported by <function>EXPORT_SYMBOL_GPL()</function> can
only be seen by modules with a
<function>MODULE_LICENSE()</function> that specifies a GPL
- compatible license.
+ compatible license. It implies that the function is considered
+ an internal implementation issue, and not really an interface.
</para>
</sect1>
</chapter>
@@ -962,12 +976,13 @@ printk(KERN_INFO "my ip: %d.%d.%d.%d\n", NIPQUAD(ipaddress));
<filename class="headerfile">include/linux/list.h</filename></title>
<para>
- There are three sets of linked-list routines in the kernel
- headers, but this one seems to be winning out (and Linus has
- used it). If you don't have some particular pressing need for
- a single list, it's a good choice. In fact, I don't care
- whether it's a good choice or not, just use it so we can get
- rid of the others.
+ There used to be three sets of linked-list routines in the kernel
+ headers, but this one is the winner. If you don't have some
+ particular pressing need for a single list, it's a good choice.
+ </para>
+
+ <para>
+ In particular, <function>list_for_each_entry</function> is useful.
</para>
</sect1>
@@ -979,14 +994,13 @@ printk(KERN_INFO "my ip: %d.%d.%d.%d\n", NIPQUAD(ipaddress));
convention, and return <returnvalue>0</returnvalue> for success,
and a negative error number
(eg. <returnvalue>-EFAULT</returnvalue>) for failure. This can be
- unintuitive at first, but it's fairly widespread in the networking
- code, for example.
+ unintuitive at first, but it's fairly widespread in the kernel.
</para>
<para>
- The filesystem code uses <function>ERR_PTR()</function>
+ Using <function>ERR_PTR()</function>
- <filename class="headerfile">include/linux/fs.h</filename>; to
+ <filename class="headerfile">include/linux/err.h</filename>; to
encode a negative error number into a pointer, and
<function>IS_ERR()</function> and <function>PTR_ERR()</function>
to get it back out again: avoids a separate pointer parameter for
@@ -1040,7 +1054,7 @@ static struct block_device_operations opt_fops = {
supported, due to lack of general use, but the following are
considered standard (see the GCC info page section "C
Extensions" for more details - Yes, really the info page, the
- man page is only a short summary of the stuff in info):
+ man page is only a short summary of the stuff in info).
</para>
<itemizedlist>
<listitem>
@@ -1091,7 +1105,7 @@ static struct block_device_operations opt_fops = {
</listitem>
<listitem>
<para>
- Function names as strings (__FUNCTION__)
+ Function names as strings (__func__).
</para>
</listitem>
<listitem>
@@ -1164,63 +1178,35 @@ static struct block_device_operations opt_fops = {
<listitem>
<para>
Usually you want a configuration option for your kernel hack.
- Edit <filename>Config.in</filename> in the appropriate directory
- (but under <filename>arch/</filename> it's called
- <filename>config.in</filename>). The Config Language used is not
- bash, even though it looks like bash; the safe way is to use only
- the constructs that you already see in
- <filename>Config.in</filename> files (see
- <filename>Documentation/kbuild/kconfig-language.txt</filename>).
- It's good to run "make xconfig" at least once to test (because
- it's the only one with a static parser).
- </para>
-
- <para>
- Variables which can be Y or N use <type>bool</type> followed by a
- tagline and the config define name (which must start with
- CONFIG_). The <type>tristate</type> function is the same, but
- allows the answer M (which defines
- <symbol>CONFIG_foo_MODULE</symbol> in your source, instead of
- <symbol>CONFIG_FOO</symbol>) if <symbol>CONFIG_MODULES</symbol>
- is enabled.
+ Edit <filename>Kconfig</filename> in the appropriate directory.
+ The Config language is simple to use by cut and paste, and there's
+ complete documentation in
+ <filename>Documentation/kbuild/kconfig-language.txt</filename>.
</para>
<para>
You may well want to make your CONFIG option only visible if
<symbol>CONFIG_EXPERIMENTAL</symbol> is enabled: this serves as a
warning to users. There many other fancy things you can do: see
- the various <filename>Config.in</filename> files for ideas.
+ the various <filename>Kconfig</filename> files for ideas.
</para>
- </listitem>
- <listitem>
<para>
- Edit the <filename>Makefile</filename>: the CONFIG variables are
- exported here so you can conditionalize compilation with `ifeq'.
- If your file exports symbols then add the names to
- <varname>export-objs</varname> so that genksyms will find them.
- <caution>
- <para>
- There is a restriction on the kernel build system that objects
- which export symbols must have globally unique names.
- If your object does not have a globally unique name then the
- standard fix is to move the
- <function>EXPORT_SYMBOL()</function> statements to their own
- object with a unique name.
- This is why several systems have separate exporting objects,
- usually suffixed with ksyms.
- </para>
- </caution>
+ In your description of the option, make sure you address both the
+ expert user and the user who knows nothing about your feature. Mention
+ incompatibilities and issues here. <emphasis> Definitely
+ </emphasis> end your description with <quote> if in doubt, say N
+ </quote> (or, occasionally, `Y'); this is for people who have no
+ idea what you are talking about.
</para>
</listitem>
<listitem>
<para>
- Document your option in Documentation/Configure.help. Mention
- incompatibilities and issues here. <emphasis> Definitely
- </emphasis> end your description with <quote> if in doubt, say N
- </quote> (or, occasionally, `Y'); this is for people who have no
- idea what you are talking about.
+ Edit the <filename>Makefile</filename>: the CONFIG variables are
+ exported here so you can usually just add a "obj-$(CONFIG_xxx) +=
+ xxx.o" line. The syntax is documented in
+ <filename>Documentation/kbuild/makefiles.txt</filename>.
</para>
</listitem>
@@ -1253,20 +1239,12 @@ static struct block_device_operations opt_fops = {
</para>
<para>
- <filename>include/linux/brlock.h:</filename>
+ <filename>include/asm-i386/delay.h:</filename>
</para>
<programlisting>
-extern inline void br_read_lock (enum brlock_indices idx)
-{
- /*
- * This causes a link-time bug message if an
- * invalid index is used:
- */
- if (idx >= __BR_END)
- __br_lock_usage_bug();
-
- read_lock(&amp;__brlock_array[smp_processor_id()][idx]);
-}
+#define ndelay(n) (__builtin_constant_p(n) ? \
+ ((n) > 20000 ? __bad_ndelay() : __const_udelay((n) * 5ul)) : \
+ __ndelay(n))
</programlisting>
<para>