From 8ac270d1e29f0428228ab2b9a8ae5e1ed4a5cd84 Mon Sep 17 00:00:00 2001 From: Will Drewry Date: Thu, 12 Apr 2012 16:48:04 -0500 Subject: Documentation: prctl/seccomp_filter Documents how system call filtering using Berkeley Packet Filter programs works and how it may be used. Includes an example for x86 and a semi-generic example using a macro-based code generator. Acked-by: Eric Paris Signed-off-by: Will Drewry Acked-by: Kees Cook v18: - added acked by - update no new privs numbers v17: - remove @compat note and add Pitfalls section for arch checking (keescook@chromium.org) v16: - v15: - v14: - rebase/nochanges v13: - rebase on to 88ebdda6159ffc15699f204c33feb3e431bf9bdc v12: - comment on the ptrace_event use - update arch support comment - note the behavior of SECCOMP_RET_DATA when there are multiple filters (keescook@chromium.org) - lots of samples/ clean up incl 64-bit bpf-direct support (markus@chromium.org) - rebase to linux-next v11: - overhaul return value language, updates (keescook@chromium.org) - comment on do_exit(SIGSYS) v10: - update for SIGSYS - update for new seccomp_data layout - update for ptrace option use v9: - updated bpf-direct.c for SIGILL v8: - add PR_SET_NO_NEW_PRIVS to the samples. v7: - updated for all the new stuff in v7: TRAP, TRACE - only talk about PR_SET_SECCOMP now - fixed bad JLE32 check (coreyb@linux.vnet.ibm.com) - adds dropper.c: a simple system call disabler v6: - tweak the language to note the requirement of PR_SET_NO_NEW_PRIVS being called prior to use. (luto@mit.edu) v5: - update sample to use system call arguments - adds a "fancy" example using a macro-based generator - cleaned up bpf in the sample - update docs to mention arguments - fix prctl value (eparis@redhat.com) - language cleanup (rdunlap@xenotime.net) v4: - update for no_new_privs use - minor tweaks v3: - call out BPF <-> Berkeley Packet Filter (rdunlap@xenotime.net) - document use of tentative always-unprivileged - guard sample compilation for i386 and x86_64 v2: - move code to samples (corbet@lwn.net) Signed-off-by: James Morris --- Documentation/prctl/seccomp_filter.txt | 163 +++++++++++++++++++++++++++++++++ 1 file changed, 163 insertions(+) create mode 100644 Documentation/prctl/seccomp_filter.txt (limited to 'Documentation/prctl') diff --git a/Documentation/prctl/seccomp_filter.txt b/Documentation/prctl/seccomp_filter.txt new file mode 100644 index 000000000000..597c3c581375 --- /dev/null +++ b/Documentation/prctl/seccomp_filter.txt @@ -0,0 +1,163 @@ + SECure COMPuting with filters + ============================= + +Introduction +------------ + +A large number of system calls are exposed to every userland process +with many of them going unused for the entire lifetime of the process. +As system calls change and mature, bugs are found and eradicated. A +certain subset of userland applications benefit by having a reduced set +of available system calls. The resulting set reduces the total kernel +surface exposed to the application. System call filtering is meant for +use with those applications. + +Seccomp filtering provides a means for a process to specify a filter for +incoming system calls. The filter is expressed as a Berkeley Packet +Filter (BPF) program, as with socket filters, except that the data +operated on is related to the system call being made: system call +number and the system call arguments. This allows for expressive +filtering of system calls using a filter program language with a long +history of being exposed to userland and a straightforward data set. + +Additionally, BPF makes it impossible for users of seccomp to fall prey +to time-of-check-time-of-use (TOCTOU) attacks that are common in system +call interposition frameworks. BPF programs may not dereference +pointers which constrains all filters to solely evaluating the system +call arguments directly. + +What it isn't +------------- + +System call filtering isn't a sandbox. It provides a clearly defined +mechanism for minimizing the exposed kernel surface. It is meant to be +a tool for sandbox developers to use. Beyond that, policy for logical +behavior and information flow should be managed with a combination of +other system hardening techniques and, potentially, an LSM of your +choosing. Expressive, dynamic filters provide further options down this +path (avoiding pathological sizes or selecting which of the multiplexed +system calls in socketcall() is allowed, for instance) which could be +construed, incorrectly, as a more complete sandboxing solution. + +Usage +----- + +An additional seccomp mode is added and is enabled using the same +prctl(2) call as the strict seccomp. If the architecture has +CONFIG_HAVE_ARCH_SECCOMP_FILTER, then filters may be added as below: + +PR_SET_SECCOMP: + Now takes an additional argument which specifies a new filter + using a BPF program. + The BPF program will be executed over struct seccomp_data + reflecting the system call number, arguments, and other + metadata. The BPF program must then return one of the + acceptable values to inform the kernel which action should be + taken. + + Usage: + prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, prog); + + The 'prog' argument is a pointer to a struct sock_fprog which + will contain the filter program. If the program is invalid, the + call will return -1 and set errno to EINVAL. + + If fork/clone and execve are allowed by @prog, any child + processes will be constrained to the same filters and system + call ABI as the parent. + + Prior to use, the task must call prctl(PR_SET_NO_NEW_PRIVS, 1) or + run with CAP_SYS_ADMIN privileges in its namespace. If these are not + true, -EACCES will be returned. This requirement ensures that filter + programs cannot be applied to child processes with greater privileges + than the task that installed them. + + Additionally, if prctl(2) is allowed by the attached filter, + additional filters may be layered on which will increase evaluation + time, but allow for further decreasing the attack surface during + execution of a process. + +The above call returns 0 on success and non-zero on error. + +Return values +------------- +A seccomp filter may return any of the following values. If multiple +filters exist, the return value for the evaluation of a given system +call will always use the highest precedent value. (For example, +SECCOMP_RET_KILL will always take precedence.) + +In precedence order, they are: + +SECCOMP_RET_KILL: + Results in the task exiting immediately without executing the + system call. The exit status of the task (status & 0x7f) will + be SIGSYS, not SIGKILL. + +SECCOMP_RET_TRAP: + Results in the kernel sending a SIGSYS signal to the triggering + task without executing the system call. The kernel will + rollback the register state to just before the system call + entry such that a signal handler in the task will be able to + inspect the ucontext_t->uc_mcontext registers and emulate + system call success or failure upon return from the signal + handler. + + The SECCOMP_RET_DATA portion of the return value will be passed + as si_errno. + + SIGSYS triggered by seccomp will have a si_code of SYS_SECCOMP. + +SECCOMP_RET_ERRNO: + Results in the lower 16-bits of the return value being passed + to userland as the errno without executing the system call. + +SECCOMP_RET_TRACE: + When returned, this value will cause the kernel to attempt to + notify a ptrace()-based tracer prior to executing the system + call. If there is no tracer present, -ENOSYS is returned to + userland and the system call is not executed. + + A tracer will be notified if it requests PTRACE_O_TRACESECCOMP + using ptrace(PTRACE_SETOPTIONS). The tracer will be notified + of a PTRACE_EVENT_SECCOMP and the SECCOMP_RET_DATA portion of + the BPF program return value will be available to the tracer + via PTRACE_GETEVENTMSG. + +SECCOMP_RET_ALLOW: + Results in the system call being executed. + +If multiple filters exist, the return value for the evaluation of a +given system call will always use the highest precedent value. + +Precedence is only determined using the SECCOMP_RET_ACTION mask. When +multiple filters return values of the same precedence, only the +SECCOMP_RET_DATA from the most recently installed filter will be +returned. + +Pitfalls +-------- + +The biggest pitfall to avoid during use is filtering on system call +number without checking the architecture value. Why? On any +architecture that supports multiple system call invocation conventions, +the system call numbers may vary based on the specific invocation. If +the numbers in the different calling conventions overlap, then checks in +the filters may be abused. Always check the arch value! + +Example +------- + +The samples/seccomp/ directory contains both an x86-specific example +and a more generic example of a higher level macro interface for BPF +program generation. + + + +Adding architecture support +----------------------- + +See arch/Kconfig for the authoritative requirements. In general, if an +architecture supports both ptrace_event and seccomp, it will be able to +support seccomp filter with minor fixup: SIGSYS support and seccomp return +value checking. Then it must just add CONFIG_HAVE_ARCH_SECCOMP_FILTER +to its arch-specific Kconfig. -- cgit v1.2.3