The Unix Way ■ Episode 06
The Inheritance Problem
A process is compromised. A buffer overflow, a malformed packet, a dependency with quiet ambitions. The attacker now controls execution. What happens next depends entirely on what the process can reach, and on a stock Unix system, the answer has been the same since 1969: everything the user can touch. Every file. Every socket. Every device. The compromised process inherits the full ambient authority of the user who launched it.
This is not a bug. It is the original Unix security model, and it served admirably when "users" meant researchers at Bell Labs who could be trusted not to run hostile code from the internet. The internet, regrettably, had other plans.
Two operating systems decided to fix this. They chose opposite philosophies. One removed the doors from the room. The other hired a bouncer and handed him a clipboard.
FreeBSD: Capsicum
In 2010, Robert Watson and Jonathan Anderson at the University of Cambridge presented a paper that won Best Student Paper at USENIX Security. The core insight was disarmingly simple: rather than listing what a process may not do, remove everything and hand back precisely what it needs.
The result was Capsicum, compiled into FreeBSD 10.0 by default in 2014. The API is one function that matters: cap_enter(). One syscall. Irreversible.
#include <sys/capsicum.h>
#include <err.h>
#include <fcntl.h>

/* Capture loop, defined elsewhere */
void write_packets(int fd);

int main(void)
{
    int fd = open("/var/log/capture.pcap", O_WRONLY);
    if (fd < 0)
        err(1, "open");

    /* Restrict fd: write and seek only */
    cap_rights_t rights;
    cap_rights_init(&rights, CAP_WRITE, CAP_SEEK);
    if (cap_rights_limit(fd, &rights) < 0)
        err(1, "cap_rights_limit");

    /* Enter capability mode. No return. Ever. */
    if (cap_enter() < 0)
        err(1, "cap_enter");

    /*
     * The process loses all access to global namespaces.
     * No opening files by path. No connecting or binding by address.
     * No signalling other processes by PID.
     * What remains: only the file descriptors already held,
     * with rights explicitly granted above.
     */
    write_packets(fd);
    return 0;
}
There is no cap_exit(). There is no escalation path, no privilege restoration, no polite request form for processes that would like their authority back. The kernel sets a flag. The flag does not unset. The process enters a world that contains precisely its open file descriptors, restricted to the operations explicitly granted. The filesystem, the network, the process table: they do not merely become inaccessible. For the sandboxed process, they cease to exist.
The model is subtraction. Start with everything, remove everything, hand back precisely what is needed. The process cannot escape. Not because a filter stops it, but because the door no longer exists. One might observe that this is rather difficult to bypass.
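What "the door no longer exists" looks like from inside can be sketched with a small probe. This is an illustration under stated assumptions, not code from FreeBSD: the helper name capmode_open_errno() and its return convention are invented here, and the sandbox is entered in a forked child so the caller keeps its own authority. On FreeBSD the child should see open() fail with ECAPMODE; on any other system the helper simply reports that it could not run.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#ifdef __FreeBSD__
#include <sys/capsicum.h>
#endif

/*
 * Hypothetical probe (name and return convention are illustrative).
 * Forks a child, enters capability mode there, and tries to open a
 * file by path.
 *
 * Returns the errno the child saw from open(), 0 if open() unexpectedly
 * succeeded, or -1 if the probe could not run (no Capsicum, fork failed).
 */
int capmode_open_errno(const char *path)
{
    assert(path != NULL);
#ifdef __FreeBSD__
    pid_t pid = fork();
    if (pid < 0)
        return -1;

    if (pid == 0) {
        /* Child: give up the global namespaces for good. */
        if (cap_enter() != 0)
            _exit(255);

        int fd = open(path, O_RDONLY);
        /* errno values are small enough to travel in an exit status. */
        _exit(fd >= 0 ? 0 : errno);
    }

    int status;
    if (waitpid(pid, &status, 0) < 0 || !WIFEXITED(status))
        return -1;
    int code = WEXITSTATUS(status);
    return code == 255 ? -1 : code;
#else
    (void)path;
    return -1; /* Capsicum is FreeBSD-specific; nothing to probe here. */
#endif
}
```

Called as capmode_open_errno("/etc/passwd"), with the path chosen arbitrarily, the probe would be expected to return ECAPMODE on a Capsicum-enabled FreeBSD kernel. The parent process is unaffected, which is the only reason the result can be reported at all.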
Linux: seccomp-bpf
On Linux, the answer arrived in two stages. In 2005, Andrea Arcangeli added seccomp strict mode: four syscalls permitted (read, write, exit, sigreturn), everything else killed the process. Elegant, certainly. Also almost completely unusable for anything that needed to do actual work.
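Strict mode is easy to observe directly. The sketch below is an illustration, not kernel or libc code: the helper name strict_mode_fatal_signal() and its return convention are assumptions made for this example. It forks a child, switches the child to strict mode with prctl(), and has it attempt one syscall outside the permitted four. On Linux the parent should see the child die from SIGKILL; elsewhere the helper reports that it could not run.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <signal.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#ifdef __linux__
#include <linux/seccomp.h>
#include <sys/prctl.h>
#endif

/*
 * Hypothetical probe (name and return convention are illustrative).
 * Forks a child, puts it in seccomp strict mode, and has it attempt
 * one syscall from outside the permitted set.
 *
 * Returns the signal that ended the child, 0 if the child exited
 * normally, or -1 if the probe could not run.
 */
int strict_mode_fatal_signal(long forbidden_nr)
{
#ifdef __linux__
    /* The probe only makes sense for a call strict mode does not permit. */
    assert(forbidden_nr != SYS_read && forbidden_nr != SYS_write &&
           forbidden_nr != SYS_exit && forbidden_nr != SYS_rt_sigreturn);

    pid_t pid = fork();
    if (pid < 0)
        return -1;

    if (pid == 0) {
        if (prctl(PR_SET_SECCOMP, SECCOMP_MODE_STRICT) != 0)
            _exit(255);         /* strict mode is not active, so this is safe */
        syscall(forbidden_nr);  /* the kernel should kill the child here */
        syscall(SYS_exit, 0);   /* plain exit is permitted, exit_group is not */
    }

    int status;
    if (waitpid(pid, &status, 0) < 0)
        return -1;
    if (WIFSIGNALED(status))
        return WTERMSIG(status);
    if (WIFEXITED(status) && WEXITSTATUS(status) == 0)
        return 0;
    return -1;
#else
    (void)forbidden_nr;
    return -1; /* seccomp is Linux-specific */
#endif
}
```

A call such as strict_mode_fatal_signal(SYS_getpid) would be expected to return SIGKILL. Note the exit path: even the C library's usual _exit() is off the menu, because it goes through exit_group. That detail alone explains much of the "almost completely unusable".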
In 2012, Will Drewry introduced seccomp-bpf in Linux 3.5: a BPF programme that inspects every syscall at runtime and decides whether to allow, deny, or kill. This was genuinely useful. It was also a fundamentally different philosophy.
#include <fcntl.h>
#include <seccomp.h>

/* Capture loop, defined elsewhere */
void write_packets(int fd);

int main(void)
{
    /* Open the output file first: open() is not on the allowlist below */
    int fd = open("/var/log/capture.pcap", O_WRONLY);
    if (fd < 0)
        return 1;

    /* Default: kill on any syscall not explicitly allowed */
    scmp_filter_ctx ctx = seccomp_init(SCMP_ACT_KILL);
    if (ctx == NULL)
        return 1;

    /* Allowlist: only these syscalls permitted */
    seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(read), 0);
    seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(write), 0);
    seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(close), 0);
    seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(exit_group), 0);

    /* Load the filter */
    if (seccomp_load(ctx) < 0)
        return 1;

    /*
     * From here:
     * - Only listed syscalls work
     * - But: all existing FDs retain full rights
     * - An allowed read() can read ANY open FD
     * - These rules check the call, not the target
     */
    write_packets(fd);
    return 0;
}
The model is filtration. The process retains full ambient authority. A filter sits between the process and the kernel, checking each call against a list. The bouncer checks your name at the door. He does not check what you do once you are inside.
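In fairness to the bouncer, he can be taught to read a little more of the clipboard. A seccomp-bpf filter can compare integer syscall arguments, including the descriptor number, though it still sees only the number and not the object behind it. The following is a rough sketch using the raw prctl() interface rather than libseccomp; the helper name fd_filter_probe(), its bitmask result, and the choice to support only x86-64 and AArch64 are assumptions made for this illustration. It checks only the low 32 bits of the argument and assumes a little-endian machine.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#if defined(__linux__) && (defined(__x86_64__) || defined(__aarch64__))
#include <linux/audit.h>
#include <linux/filter.h>
#include <linux/seccomp.h>
#include <sys/prctl.h>
#include <sys/syscall.h>
#define FD_FILTER_SUPPORTED 1
#ifdef __x86_64__
#define NATIVE_AUDIT_ARCH AUDIT_ARCH_X86_64
#else
#define NATIVE_AUDIT_ARCH AUDIT_ARCH_AARCH64
#endif
#endif

/*
 * Hypothetical probe. In a forked child, installs a filter that allows
 * exit_group() and write() to allowed_fd only, then tries two writes.
 *
 * Result bits: 1 = write to allowed_fd succeeded,
 *              2 = write to other_fd was refused with EPERM.
 * Returns -1 if the probe could not run.
 */
int fd_filter_probe(int allowed_fd, int other_fd)
{
    assert(allowed_fd >= 0 && other_fd >= 0 && allowed_fd != other_fd);
#ifdef FD_FILTER_SUPPORTED
    pid_t pid = fork();
    if (pid < 0)
        return -1;

    if (pid == 0) {
        struct sock_filter filter[] = {
            /* Refuse anything not using the native syscall ABI. */
            BPF_STMT(BPF_LD | BPF_W | BPF_ABS, offsetof(struct seccomp_data, arch)),
            BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, NATIVE_AUDIT_ARCH, 1, 0),
            BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_KILL),
            /* exit_group is always allowed. */
            BPF_STMT(BPF_LD | BPF_W | BPF_ABS, offsetof(struct seccomp_data, nr)),
            BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, SYS_exit_group, 0, 1),
            BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
            /* Anything other than write is fatal. */
            BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, SYS_write, 1, 0),
            BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_KILL),
            /* write: compare the low word of argument 0, the descriptor. */
            BPF_STMT(BPF_LD | BPF_W | BPF_ABS, offsetof(struct seccomp_data, args[0])),
            BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, (unsigned int)allowed_fd, 0, 1),
            BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
            BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ERRNO | EPERM),
        };
        struct sock_fprog prog = {
            .len = (unsigned short)(sizeof(filter) / sizeof(filter[0])),
            .filter = filter,
        };
        if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0) != 0 ||
            prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &prog) != 0)
            _exit(255);

        /* Only write() and exit_group() get past the filter from here on. */
        int result = 0;
        if (write(allowed_fd, "x", 1) == 1)
            result |= 1;
        if (write(other_fd, "x", 1) == -1 && errno == EPERM)
            result |= 2;
        _exit(result);
    }

    int status;
    if (waitpid(pid, &status, 0) < 0 || !WIFEXITED(status))
        return -1;
    int code = WEXITSTATUS(status);
    return code == 255 ? -1 : code;
#else
    (void)allowed_fd;
    (void)other_fd;
    return -1; /* unsupported platform for this sketch */
#endif
}
```

Given the write ends of two pipes, the probe would be expected to return 3: one write allowed, one refused. The limitation is visible in the filter itself. It matches a number. If that number is later reused for a different file, the rule follows the number, not the file.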
Docker's default seccomp profile blocks approximately 44 of 300+ syscalls. The rest, more than 250 of them, pass through. The difference between an allowlist and a blocklist is the difference between "you may enter rooms 3 and 7" and "you may enter any room except 12 and 15." One of these gets more dangerous as the building adds floors.
The Same Tool, Two Philosophies
The most instructive comparison is not theoretical. It is tcpdump.
tcpdump captures network packets. It runs as root, because it must open a BPF device. It parses untrusted network data from the wire. It is precisely the kind of tool an attacker dreams of compromising: root privileges, network-facing, parsing arbitrary input. Both FreeBSD and Linux sandbox it. They chose opposite approaches.
On FreeBSD, tcpdump uses Capsicum. It opens the BPF capture device and the output file, restricts their rights with cap_rights_limit(), then calls cap_enter(). From that moment, only the already-opened BPF descriptor and the output file remain. The filesystem does not exist. The network does not exist. No new connections can be made. A compromised tcpdump on FreeBSD can read packets from one device and write to one file. Nothing else.
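The "write to one file" half of that sentence comes from the per-descriptor rights rather than from capability mode itself. A minimal sketch of the effect follows. It is not taken from tcpdump's source: the helper name write_only_read_errno() and its return convention are invented for this illustration. A descriptor opened read-write is limited to CAP_WRITE, after which a read on it should fail with ENOTCAPABLE on FreeBSD.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

#ifdef __FreeBSD__
#include <sys/capsicum.h>
#endif

/*
 * Hypothetical probe (not taken from tcpdump). Opens path read-write,
 * limits the descriptor to CAP_WRITE, then tries both directions.
 *
 * Returns the errno that read() reported after the limit was applied,
 * 0 if read() unexpectedly succeeded, or -1 if the probe could not run.
 */
int write_only_read_errno(const char *path)
{
    assert(path != NULL);
#ifdef __FreeBSD__
    int fd = open(path, O_RDWR);
    if (fd < 0)
        return -1;

    cap_rights_t rights;
    cap_rights_init(&rights, CAP_WRITE);
    if (cap_rights_limit(fd, &rights) < 0) {
        close(fd);
        return -1;
    }

    char byte = 'x';
    int result = -1;
    if (write(fd, &byte, 1) == 1) {
        /* Writing is still granted; reading is not. */
        result = (read(fd, &byte, 1) < 0) ? errno : 0;
    }
    close(fd);
    return result;
#else
    (void)path;
    return -1; /* Capsicum is FreeBSD-specific */
#endif
}
```

With a path such as /dev/null, chosen here only because it accepts both reads and writes, the expected result on FreeBSD is ENOTCAPABLE. The open mode said read-write. The capability said otherwise, and the capability wins.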
On Linux, tcpdump uses seccomp-bpf. A filter list decides which syscalls pass. The allowed calls retain full authority over every open file descriptor. An allowed read() can read any open FD. The filter checks the call, not the target. Same tool. Same threat model. One removes access. The other filters calls.
The structural difference matters when the kernel grows. Capsicum does not care how many syscalls the kernel adds. After cap_enter(), a new syscall that opens files does not work because the process is in capability mode. The restriction is structural, not enumerative. The kernel can gain a thousand new syscalls and the sandbox holds, because the sandbox is not a list of things you cannot do. It is the absence of the ability to do them.
The CVE That Proved the Point
In 2022, CVE-2022-30594 demonstrated precisely why the architectural difference matters. On Linux kernels before 5.17.2, PTRACE_SEIZE allowed local attackers to set PT_SUSPEND_SECCOMP, bypassing seccomp filters entirely. The filter was correct. Every rule was properly written. The mechanism to enforce it was not.
Capsicum's model is structurally immune to this class of attack. There is no filter to suspend. There is no enforcement layer that can be circumvented. The process is in capability mode. The global namespace does not exist. You cannot bypass a door that is not there, regardless of how creative your lock-picking skills might be.
A correct filter behind a fallible enforcement mechanism: this is the fundamental difference between subtraction and filtration.
The Epistemological Divide
The difference between Capsicum and seccomp is not a difference of implementation. It is a difference of epistemology, and it is worth spending a moment on why this matters beyond the technical detail.
seccomp asks: "What should this process not be allowed to call?" This requires you to know, in advance, every dangerous thing a process might do. Every new kernel version, every new syscall, every new attack vector must be anticipated and added to the filter. The Linux kernel had 335 syscalls in 2012. It has over 450 today. Each addition is a potential gap in every seccomp profile that uses a blocklist.
Capsicum asks: "What does this process actually need?" Then it removes everything else. You do not enumerate the threats. You enumerate the requirements. The set of things a process needs is small, knowable, and stable. The set of things a process might abuse grows with every kernel release. One of these sets is rather easier to manage than the other.
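The arithmetic of the two questions can be shown with a toy model. Nothing below touches a real kernel: the policy struct, the function names, and the use of the room numbers from the earlier analogy as stand-ins for syscall numbers are all invented for this sketch. It counts how many entries in a syscall table of a given size a policy lets through.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum policy_kind { POLICY_ALLOWLIST, POLICY_BLOCKLIST };

struct policy {
    enum policy_kind kind;
    const int *listed; /* syscall numbers the policy names explicitly */
    size_t count;
};

/* The two guest lists from the room analogy, reused as toy policies. */
static const int rooms_3_and_7[] = {3, 7};
static const int rooms_12_and_15[] = {12, 15};
const struct policy example_allowlist = {POLICY_ALLOWLIST, rooms_3_and_7, 2};
const struct policy example_blocklist = {POLICY_BLOCKLIST, rooms_12_and_15, 2};

/* Decide whether syscall number nr gets through the policy. */
bool policy_permits(const struct policy *p, int nr)
{
    assert(p != NULL);
    bool listed = false;
    for (size_t i = 0; i < p->count; i++) {
        if (p->listed[i] == nr) {
            listed = true;
            break;
        }
    }
    /* An allowlist passes what it names; a blocklist passes what it omits. */
    return p->kind == POLICY_ALLOWLIST ? listed : !listed;
}

/* Count the permitted syscalls in a table numbered 0 .. table_size - 1. */
size_t permitted_count(const struct policy *p, int table_size)
{
    size_t permitted = 0;
    for (int nr = 0; nr < table_size; nr++) {
        if (policy_permits(p, nr))
            permitted++;
    }
    return permitted;
}
```

With a table of 20 entries the blocklist permits 18 and the allowlist permits 2. Grow the table to 40 and the blocklist permits 38, while the allowlist still permits 2. Nobody edited either policy. Only one of them changed.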
The Practical Evidence
tcpdump is not alone. FreeBSD's base system quietly demonstrates what Capsicum makes possible across its most security-critical tools.
dhclient follows the same pattern. Open the socket, open the lease file, enter capability mode. The DHCP client, which runs as root and handles network input from untrusted sources, is sandboxed to precisely the resources it requires. On most Linux distributions, a compromised dhclient can read anything the dhclient user can read, which, given that it typically runs as root, is everything.
The full list of Capsicum-enabled tools in FreeBSD base is quietly instructive: tcpdump, dhclient, hastd, auditdistd, gzip, and OpenSSH. These are precisely the tools an attacker targets first: network-facing, parsing untrusted input, often running as root. On FreeBSD, they are precisely the tools that have the least to offer once compromised.
The Point
Both approaches improve security. Meaningfully. seccomp-bpf protects millions of containers, phones, and browsers every day. Dismissing it would be ignorant. But the question, as ever, is architectural: would you rather patch the filter, or remove what needs filtering?
Capsicum eliminates ambient authority. seccomp restricts it. One locks the door and removes it from the hinges. The other hires a bouncer and hopes the guest list is complete.
The door that does not exist cannot be opened. Rather reassuring, that.