Examining OpenSSH Sandboxing and Privilege Separation – Attack Surface Analysis
Examining the effectiveness of OpenSSH's security mechanisms

The recent OpenSSH double-free vulnerability, CVE-2023-25136, created a lot of interest and confusion regarding OpenSSH’s custom security mechanisms: Sandbox and Privilege Separation. Until now, both mechanisms went largely unnoticed and were only partially documented. The vulnerability raised interest among those who were affected and those who operate servers running OpenSSH.
This blog post provides an in-depth analysis of OpenSSH’s attack surface and security measures.
- How does OpenSSH implement Privilege Separation?
- OpenSSH Privilege Separation – In-Depth Analysis
- What is the OpenSSH Sandbox?
- OpenSSH Sandbox – In-Depth Analysis
- Conclusion – Don’t mess with the defaults!
- Stay up-to-date with JFrog Security Research
- Appendix A – OpenSSH sandbox full syscall lists
How does OpenSSH implement Privilege Separation?
OpenSSH’s privilege separation mechanism has been around since March 2002, more than 20 years.
The feature is designed to enhance the security of SSH servers by limiting the privileges of the SSH server process and separating it from the user’s authentication and session processes.
The goal of privilege separation is to make sure pre-authentication attacks cannot compromise the root account even though other parts of OpenSSH do run with root privileges.
Prior to the introduction of Privilege Separation, the OpenSSH server process had to run with elevated privileges to access system resources required for authentication and session management. This elevated privilege level made the server process a high-value target for attackers, who could potentially gain full control over a system by exploiting any vulnerability in the server process.
Any remote code execution vulnerability in the OpenSSH server process (sshd) could lead to an immediate remote root compromise if it happened before authentication, subsequently giving the attacker full control over the machine running OpenSSH.
Privilege Separation Mechanism
With Privilege Separation, the OpenSSH server process is split into two separate processes: one that runs with elevated privileges and performs the operations that require them, and another that runs with reduced privileges and handles the untrusted pre-authentication exchange with the client.
When a user initiates an SSH connection to an OpenSSH server with Privilege Separation enabled, the server spawns two separate processes to handle the incoming connection.
The first process, known as the privileged process, runs with elevated privileges and is responsible for the tasks that need them, such as validating credentials and managing pseudo-terminals.
The second process, known as the unprivileged process, runs with reduced privileges and handles the network exchange with the client during authentication. It is isolated from the privileged process and has limited access to system resources, such as file systems and network interfaces.
OpenSSH Privilege Separation – In-Depth Analysis
Privilege separation uses two processes: A privileged parent process monitors the progress of an unprivileged child process.
The child process is unprivileged. This is achieved by changing its uid/gid to an unused user (usually sshd) which has no login shell, and restricting its file system access via chroot() to /var/empty. The child process is the only process that handles network data.
The parent process determines whether the child process performed the authentication successfully.
Communication between the privileged and the unprivileged process goes over a local IPC channel. State that cannot otherwise be exported is kept in shared memory, and the child has to ask the privileged parent to determine whether authentication was successful.
If the child process gets corrupted and believes that the remote user has been authenticated, access will only be granted if the parent has reached the same decision.
During authentication, the child process communicates with the user and the authentication agent to obtain the necessary credentials for authentication. Once the child process has obtained the credentials, it sends them to the parent process for validation.
The parent process then performs the actual authentication by using the credentials provided by the child process to authenticate the user. If the authentication is successful, the parent process sends a message to the child process indicating that authentication has succeeded. If the authentication fails, the parent process sends a message to the child process indicating that authentication has failed.
The communication between the parent and child processes is done using Unix domain sockets, which are a form of inter-process communication (IPC) mechanism.
The parent and child processes each have their own Unix domain socket, and they use these sockets to communicate with each other.
By performing authentication in the parent process, OpenSSH is able to ensure that sensitive authentication data never leaves the privileged process, which provides an additional layer of security. Additionally, by using IPC to communicate between the parent and child processes, OpenSSH is able to maintain separation between the two processes and prevent the child process from interfering with the critical operations performed by the parent process.
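The division of labor described above can be sketched in a few lines of C. This is a minimal illustration, not OpenSSH’s monitor code: the function names privsep_authenticate and monitor_check_password, the one-byte Y/N reply, and the hard-coded password are all assumptions made for the example, and the child skips the chroot and uid change so that the sketch runs without root.

```c
#include <assert.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical stand-in for the real credential check done by the parent. */
static int monitor_check_password(const char *candidate)
{
    return strcmp(candidate, "correct horse") == 0;
}

/*
 * Fork a child that forwards the candidate password to the parent over a
 * socket pair. Returns 1 only if the parent itself accepted the password,
 * 0 if it rejected it, and -1 if the processes could not be set up.
 */
int privsep_authenticate(const char *candidate)
{
    int sv[2], ok = 0;
    char buf[128], verdict;
    ssize_t n;
    pid_t pid;

    assert(candidate != NULL && strlen(candidate) < sizeof(buf));
    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) == -1)
        return -1;

    pid = fork();
    if (pid == -1) {
        close(sv[0]);
        close(sv[1]);
        return -1;
    }
    if (pid == 0) {
        /* Child: a real daemon would chroot and drop privileges here. */
        close(sv[0]);
        /* Forward the untrusted input, then wait for the parent's verdict. */
        if (write(sv[1], candidate, strlen(candidate) + 1) == -1)
            _exit(2);
        if (read(sv[1], &verdict, 1) != 1)
            _exit(2);
        /* The child only reports what the parent decided. */
        _exit(verdict == 'Y' ? 0 : 1);
    }

    /* Parent: read the request, check it, and send the decision back. */
    close(sv[1]);
    n = read(sv[0], buf, sizeof(buf) - 1);
    if (n > 0) {
        buf[n] = '\0';
        ok = monitor_check_password(buf);
        verdict = ok ? 'Y' : 'N';
        if (write(sv[0], &verdict, 1) != 1)
            ok = 0; /* fail closed if the reply cannot be delivered */
    }
    close(sv[0]);
    waitpid(pid, NULL, 0);
    return ok;
}
```

The return value comes from the parent’s own check, so a corrupted child that merely claims success does not change the outcome.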
During the pre-authentication phase, sshd will chroot() to /var/empty and change its privileges to the sshd user and its primary group. sshd is a pseudo-account that is locked, is not used by other daemons, and does not have a valid shell.
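The order of these steps matters, because each one needs privileges that the next one removes. The following is a hedged sketch of the general chroot-then-drop pattern using plain POSIX-style calls; it is not taken from OpenSSH, and the function name enter_jail_and_drop and its refusal of uid or gid 0 are assumptions made for the example.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <errno.h>
#include <grp.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Confine the calling process to jail_dir and switch to an unprivileged
 * uid/gid. Returns 0 on success, -1 on failure with errno set.
 * Intended to be called while still running as root.
 */
int enter_jail_and_drop(const char *jail_dir, uid_t uid, gid_t gid)
{
    assert(jail_dir != NULL);

    /* Refuse to "drop" to root: that would not reduce privileges. */
    if (uid == 0 || gid == 0) {
        errno = EINVAL;
        return -1;
    }
    /* 1. Change the root directory while we still may, then move into it. */
    if (chroot(jail_dir) == -1 || chdir("/") == -1)
        return -1;
    /* 2. Reduce the supplementary groups to gid and set the gid as root. */
    if (setgroups(1, &gid) == -1 || setgid(gid) == -1)
        return -1;
    /* 3. Give up root last; the earlier steps would fail after this. */
    if (setuid(uid) == -1)
        return -1;
    /* 4. Verify that root cannot be regained. */
    if (setuid(0) != -1) {
        errno = EPERM;
        return -1;
    }
    return 0;
}
```

A caller would look up the sshd account with getpwnam() and pass its uid and gid together with /var/empty.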
Given the following process listing:
| UID | PID | PPID | C | STIME | TTY | TIME | COMMAND |
|---|---|---|---|---|---|---|---|
| root | 957 | 9 | 0 | 09:14 | ? | 00:00:00 | /usr/sbin/sshd -D [listener] 0 of 10-100 startups |
| root | 1015 | 957 | 0 | 09:14 | ? | 00:00:00 | sshd: [accepted] |
| sshd | 1016 | 1015 | 0 | 09:14 | ? | 00:00:00 | sshd: [net] |
- Process 957 is the sshd process listening for new connections.
- Process 1015 is the privileged monitor process.
- Process 1016 is the unprivileged authenticator-handler process.
The Privilege Separation mechanism is controlled by the UsePrivilegeSeparation configuration key. By default, the key is set to the most restrictive sandbox setting (even when the key is not specified), which means the pre-authentication unprivileged process is subject to additional restrictions, which we will cover in the next section. Note that OpenSSH 7.5 deprecated this key and made privilege separation mandatory, so the key only has an effect on older releases. The configuration file is typically located at /etc/ssh/sshd_config.
A sample sshd_config configuration file:
    # Connection
    Port 22
    Protocol 2
    UseDNS no
    Compression no

    # Authentication:
    PubkeyAuthentication yes
    PermitEmptyPasswords no
    UsePAM yes
    ChallengeResponseAuthentication yes
    LoginGraceTime 60
    UsePrivilegeSeparation sandbox # The relevant Privilege Separation config key
    …
What is the OpenSSH Sandbox?
The OpenSSH pre-authentication sandbox is a security mechanism first introduced in OpenSSH version 5.9 that aims to prevent attackers from fully compromising a system after exploiting vulnerabilities during the pre-authentication phase. It creates a restricted environment that limits the scope of potential vulnerabilities during the authentication phase of SSH connections.
It does so by launching an isolated environment using kernel security mechanisms, such as seccomp filtering on Linux, essentially restricting the process to only a few pre-approved system calls.
When a user initiates an SSH connection to an OpenSSH server with the sandbox feature enabled, the server spawns a new process that runs in a restricted environment, also known as the sandbox. The sandboxed process is created with limited privileges and restricted access to system resources, including file systems and network interfaces.
OpenSSH Sandbox – In-Depth Analysis
OpenSSH has 7(!) different sandbox styles that are determined by the platform you compile it for and its kernel capabilities.
All of the different sandbox styles are centered around the concept of system call restriction – meaning that the sandboxed process cannot use most of the system’s services, like opening files, communicating over the network, etc.
Linux Sandbox
The OpenSSH configure step (run before the compilation step) checks for seccomp compatibility by checking whether the kernel is configured with the SECCOMP_MODE_FILTER option.
This is what it looks like when configuring:
checking whether SECCOMP_MODE_FILTER is declared... yes
checking kernel for seccomp_filter support... yes
Checking the Ubuntu 22.04 LTS sshd daemon binary for the sandbox type, we see that it uses a seccomp filter, so we’ll focus on that:
> strings ./sshd | grep preparing
%s: preparing seccomp filter sandbox
Seccomp stands for secure computing mode and has been a feature of the Linux kernel since version 2.6.12, released in 2005; the filter mode used here was added later, in Linux 3.5. It is used to filter and restrict the system calls available to userland processes, which reduces the exposed kernel surface and thus the attack surface for a privilege escalation.
This is done by having only the essential system calls needed for the application to function properly. The filter is expressed as a Berkeley Packet Filter (BPF), as with socket filters, except that the data operated on is related to the system call being made: system call number and the system call arguments.
We examine the sandbox-seccomp-filter.c file and find that the seccomp filter is attached to the program in the ssh_sandbox_child() function. The seccomp profile used by OpenSSH is:
    /* Syscall filtering set for preauth. */
    static const struct sock_filter preauth_insns[] = {
        ……………………
        /* Syscalls to non-fatally deny */
    #ifdef __NR_lstat
        SC_DENY(__NR_lstat, EACCES),
    #endif
    #ifdef __NR_lstat64
        SC_DENY(__NR_lstat64, EACCES),
    #endif
    #ifdef __NR_fstat
        SC_DENY(__NR_fstat, EACCES),
    #endif
        ……………………
        /* Syscalls to permit */
    #ifdef __NR_brk
        SC_ALLOW(__NR_brk),
    #endif
    #ifdef __NR_clock_gettime
        SC_ALLOW(__NR_clock_gettime),
    #endif
    #ifdef __NR_clock_gettime64
        SC_ALLOW(__NR_clock_gettime64),
    #endif
    #ifdef __NR_close
        SC_ALLOW(__NR_close),
    #endif
    #ifdef __NR_exit
        SC_ALLOW(__NR_exit),
    #endif
    #ifdef __NR_mmap
        SC_ALLOW_ARG_MASK(__NR_mmap, 2, PROT_READ|PROT_WRITE|PROT_NONE),
    #endif
    #ifdef __NR_mmap2
        SC_ALLOW_ARG_MASK(__NR_mmap2, 2, PROT_READ|PROT_WRITE|PROT_NONE),
    #endif
    #ifdef __NR_mprotect
        SC_ALLOW_ARG_MASK(__NR_mprotect, 2, PROT_READ|PROT_WRITE|PROT_NONE),
    #endif
        /* Default deny */
        BPF_STMT(BPF_RET+BPF_K, SECCOMP_FILTER_FAIL),
It uses macros (SC_DENY/SC_ALLOW/SC_ALLOW_ARG_MASK) to create the BPF filters that are used as the Seccomp filter.
For example, lstat() is explicitly denied and fails non-fatally with EACCES, close() is explicitly allowed, and mprotect() is allowed only when its argument passes a mask that rejects unwanted values.
We can also see that the default for any syscall not listed is SECCOMP_FILTER_FAIL:
    /* Linux seccomp_filter sandbox */
    #define SECCOMP_FILTER_FAIL SECCOMP_RET_KILL
SECCOMP_RET_KILL makes the kernel kill the calling thread immediately, without executing the system call, which for the sandboxed child means the process dies.
This is what we observed in the previous blog post when trying to trigger the vulnerability: the sandboxed process was killed because writev() is not on the allow list, so the call falls through to SECCOMP_RET_KILL.
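To make the three outcomes concrete, here is a small user-space model of how such a rule list resolves a syscall number. It does not install a real filter and is not OpenSSH’s code: the verdict and rule types and the evaluate function are hypothetical names, and the three-entry policy is a tiny subset chosen for illustration.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/syscall.h>

enum verdict_kind { VERDICT_ALLOW, VERDICT_ERRNO, VERDICT_KILL };

struct verdict {
    enum verdict_kind kind;
    int err;                /* only meaningful for VERDICT_ERRNO */
};

struct rule {
    long nr;                /* syscall number to match */
    struct verdict v;       /* what happens on a match */
};

/* Illustrative subset: deny openat with EACCES, allow close and write. */
static const struct rule policy[] = {
    { SYS_openat, { VERDICT_ERRNO, EACCES } },
    { SYS_close,  { VERDICT_ALLOW, 0 } },
    { SYS_write,  { VERDICT_ALLOW, 0 } },
};

/* Return the verdict of the first matching rule, or KILL if none matches. */
struct verdict evaluate(long nr)
{
    struct verdict kill = { VERDICT_KILL, 0 };
    size_t i;

    assert(nr >= 0);
    for (i = 0; i < sizeof(policy) / sizeof(policy[0]); i++)
        if (policy[i].nr == nr)
            return policy[i].v;
    /* Default deny, mirroring the SECCOMP_FILTER_FAIL fallthrough. */
    return kill;
}
```

In this model, evaluate(SYS_writev) falls through to the kill verdict because writev has no rule, which matches the behavior described above.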
Some of the major silently-denied syscalls [1]:
open – used to open a file and obtain a file descriptor that can be used to read from or write to the file. Denying this syscall heavily decreases the attack surface since attackers won’t be able to open arbitrary files.
openat – similar to open, but works relative to a given directory.
Some of the major explicitly-allowed syscalls that check for arguments [2]:
mmap – used to map a region of memory into the calling process’s address space. Denying certain arguments will prevent attackers from creating dangerous memory maps, and block some ROP shellcodes (see next section).
mprotect – used to modify the access permissions for a range of memory pages.
Some of the major explicitly-allowed syscalls without arguments checking [3]:
close – used to release a file descriptor previously obtained by opening a file using the open or openat system calls.
madvise – used to advise the kernel about the intended usage of a range of memory. Allows a program to communicate to the operating system how it plans to use a particular region of memory, which can help the kernel optimize its management of that memory.
mremap – used to change the size or location of an existing memory mapping.
munmap – used to remove a memory mapping that was previously established using the mmap syscall. When a program no longer needs to access a memory mapping, it should call the munmap syscall to release the associated memory and remove the mapping.
write – used to write data from a buffer to a file descriptor.
We’ll first dive into the restriction shared by mmap and mprotect, shown here for mprotect:
    #ifdef __NR_mprotect
        SC_ALLOW_ARG_MASK(__NR_mprotect, 2, PROT_READ|PROT_WRITE|PROT_NONE),
    #endif
    /* Allow if syscall argument contains only values in mask */
    #define SC_ALLOW_ARG_MASK(_nr, _arg_nr, _arg_mask) \
        BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, (_nr), 0, 8), \
        /* load, mask and test syscall argument, low word */ \
        BPF_STMT(BPF_LD+BPF_W+BPF_ABS, \
            offsetof(struct seccomp_data, args[(_arg_nr)]) + ARG_LO_OFFSET), \
        BPF_STMT(BPF_ALU+BPF_AND+BPF_K, ~((_arg_mask) & 0xFFFFFFFF)), \
        BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, 0, 0, 4), \
        /* load, mask and test syscall argument, high word */ \
        BPF_STMT(BPF_LD+BPF_W+BPF_ABS, \
            offsetof(struct seccomp_data, args[(_arg_nr)]) + ARG_HI_OFFSET), \
        BPF_STMT(BPF_ALU+BPF_AND+BPF_K, \
            ~(((uint32_t)((uint64_t)(_arg_mask) >> 32)) & 0xFFFFFFFF)), \
        BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, 0, 0, 1), \
        BPF_STMT(BPF_RET+BPF_K, SECCOMP_RET_ALLOW), \
        /* reload syscall number; all rules expect it in accumulator */ \
        BPF_STMT(BPF_LD+BPF_W+BPF_ABS, \
            offsetof(struct seccomp_data, nr))
This verifies that the third argument (argument index 2, the prot value) of mmap and mprotect contains only bits from the following mask:
PROT_READ|PROT_WRITE|PROT_NONE
This prevents an attacker from creating an executable memory segment (PROT_EXEC), which makes it harder to bypass the DEP/NX exploit mitigation.
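The test that the macro encodes in BPF can be written out in ordinary C, which may make the low-word and high-word steps easier to follow. This is a sketch of the same bit test, not code from OpenSSH; the function names arg_within_mask and prot_allowed are hypothetical.

```c
#include <stdint.h>
#include <sys/mman.h>

/*
 * The 64-bit argument is split into two 32-bit words, since the filter
 * loads and tests one 32-bit word at a time. Each word must have no bits
 * set outside the corresponding half of the mask.
 * Returns 1 if the argument passes, 0 otherwise.
 */
int arg_within_mask(uint64_t arg, uint64_t mask)
{
    uint32_t arg_lo = (uint32_t)(arg & 0xFFFFFFFFu);
    uint32_t arg_hi = (uint32_t)(arg >> 32);
    uint32_t mask_lo = (uint32_t)(mask & 0xFFFFFFFFu);
    uint32_t mask_hi = (uint32_t)(mask >> 32);

    /* Low word: any bit outside the mask rejects the call. */
    if ((arg_lo & ~mask_lo) != 0)
        return 0;
    /* High word: same test against the upper half of the mask. */
    if ((arg_hi & ~mask_hi) != 0)
        return 0;
    return 1;
}

/* The mask used for the prot argument of mmap and mprotect. */
int prot_allowed(uint64_t prot)
{
    return arg_within_mask(prot, PROT_READ | PROT_WRITE | PROT_NONE);
}
```

Because the test rejects any bit outside the mask, a request such as PROT_READ combined with PROT_EXEC fails even though PROT_READ on its own is permitted.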
Also, most shellcodes use the open() system call to open new file descriptors, but the seccomp filter denies it. This rules out many local privilege escalation paths, since an attacker would have to find already-open file descriptors in order to exploit a privilege escalation.
macOS Sandbox
Seccomp is a Linux kernel feature, meaning it does not exist on machines running macOS.
Let’s inspect the sandbox-darwin.c file (Darwin is the core Unix system of macOS):
    void
    ssh_sandbox_child(struct ssh_sandbox *box)
    {
        char *errmsg;
        struct rlimit rl_zero;

        debug3("%s: starting Darwin sandbox", __func__);
        if (sandbox_init(kSBXProfilePureComputation, SANDBOX_NAMED,
            &errmsg) == -1)
            fatal("%s: sandbox_init: %s", __func__, errmsg);

        /*
         * The kSBXProfilePureComputation still allows sockets, so
         * we must disable these using rlimit.
         */
        rl_zero.rlim_cur = rl_zero.rlim_max = 0;
        if (setrlimit(RLIMIT_FSIZE, &rl_zero) == -1)
            fatal("%s: setrlimit(RLIMIT_FSIZE, { 0, 0 }): %s",
                __func__, strerror(errno));
        if (setrlimit(RLIMIT_NOFILE, &rl_zero) == -1)
            fatal("%s: setrlimit(RLIMIT_NOFILE, { 0, 0 }): %s",
                __func__, strerror(errno));
        if (setrlimit(RLIMIT_NPROC, &rl_zero) == -1)
            fatal("%s: setrlimit(RLIMIT_NPROC, { 0, 0 }): %s",
                __func__, strerror(errno));
    }
OS X has a feature called Seatbelt, its own sandbox kernel extension.
There are 5 documented profiles:
kSBXProfileNoInternet – TCP/IP networking is prohibited.
kSBXProfileNoNetwork – All sockets-based networking is prohibited.
kSBXProfileNoWrite – File system writes are prohibited.
kSBXProfileNoWriteExceptTemporary – File system writes are restricted to the temporary folder /var/tmp and the folder specified by the confstr(3) configuration variable _CS_DARWIN_USER_TEMP_DIR.
kSBXProfilePureComputation – All operating system services are prohibited.
kSBXProfilePureComputation is the most restrictive mode.
When an application is launched with this profile, it is limited to accessing only the following resources:
- The application’s own code and resources
- System libraries and frameworks required for computation
- Shared memory
- Unix signals
- The network loopback interface
The application is prevented from accessing the file system, other network interfaces, user data, hardware peripherals, or any other resources that could potentially be used to modify the system or interfere with other applications.
We can see OpenSSH’s sandbox uses this profile – essentially restricting all OS services and thus minimizing OpenSSH’s attack surface.
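The rlimit calls in the Darwin code above are portable, so their effect can be observed on its own. The sketch below is an illustration rather than OpenSSH code, and the function name open_blocked_after_rlimit is hypothetical. It sets RLIMIT_NOFILE to zero in a forked child, so the calling process keeps its own limits, and then checks that opening a new descriptor is refused.

```c
#include <errno.h>
#include <fcntl.h>
#include <sys/resource.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Fork a child that sets RLIMIT_NOFILE to zero and then tries to open a
 * file. Returns 1 if the open was refused with EMFILE, 0 if it succeeded
 * or failed for another reason, and -1 if the experiment could not run.
 */
int open_blocked_after_rlimit(void)
{
    struct rlimit rl_zero;
    int status;
    pid_t pid = fork();

    if (pid == -1)
        return -1;
    if (pid == 0) {
        rl_zero.rlim_cur = rl_zero.rlim_max = 0;
        if (setrlimit(RLIMIT_NOFILE, &rl_zero) == -1)
            _exit(2);
        /* Descriptors that are already open keep working; new ones fail. */
        if (open("/dev/null", O_RDONLY) == -1 && errno == EMFILE)
            _exit(0);
        _exit(1);
    }
    if (waitpid(pid, &status, 0) == -1 || !WIFEXITED(status))
        return -1;
    if (WEXITSTATUS(status) == 2)
        return -1;
    return WEXITSTATUS(status) == 0;
}
```

The same limit applies to every call that would create a new descriptor, including socket(), which is why the comment in the Darwin code relies on it to disable sockets.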
OpenBSD Sandbox
The OpenBSD operating system also has sandbox styles that are specific to it and cannot be used on other platforms.
The first one is systrace, which monitors and controls an application’s access to the system by enforcing access policies for system calls, much like a primitive seccomp.
It uses a pseudo-device, /dev/systrace, which allows userland processes to control the behavior of systrace through an ioctl interface.
It was deprecated in favor of pledge (originally named tame), which was released in 2015.
pledge has a concept named Promises, which are sets of permissions that a process can request in order to perform its operations.
A promise is a declaration made by a process to the system that it will only use a specific list of system calls, thus restricting its syscall access to a predefined set of operations.
Promises can be requested by a process using the pledge() system call, and each promise is identified by a string that represents a specific category of syscalls that the process is allowed to use.
Some examples of promises include rpath (which allows system calls that only have read-only effects on the file system) and inet (which permits network communication).
pledge can’t filter file system paths or internet addresses. For example, if you enable a category like inet, your process will be able to talk to any internet address.
We’ll inspect the sandbox-pledge.c file:
    void
    ssh_sandbox_child(struct ssh_sandbox *box)
    {
        if (pledge("stdio", NULL) == -1)
            fatal_f("pledge()");
    }
We can see that OpenSSH uses the promise stdio.
This promise grants access to standard input/output, threads, and benign system calls.
Some of the major explicitly-allowed syscalls[4]:
close – used to release a file descriptor previously obtained by opening a file using the open or openat system calls.
madvise – used to advise the kernel about the intended usage of a range of memory. Allows a program to communicate to the operating system how it plans to use a particular region of memory, which can help the kernel optimize its management of that memory.
mmap – used to map a region of memory into the calling process’s address space.
mprotect – used to modify the access permissions for a range of memory pages.
munmap – used to remove a memory mapping that was previously established using the mmap syscall. When a program no longer needs to access a memory mapping, it should call the munmap syscall to release the associated memory and remove the mapping.
pipe – used to create an inter-process communication channel, or “pipe”, between two related processes.
read – used to read data from a file or input stream.
recv – used to receive data from a connected socket.
send – used to send data over a connected socket.
write – used to write data from a buffer to a file descriptor.
This filter, much like the seccomp one, is very restrictive and denies any PROT_EXEC mappings and any call to open().
Conclusion – Don’t mess with the defaults!
OpenSSH’s security mechanisms, namely Privilege Separation and Sandboxing, provide a robust and effective solution for enhancing the security of the OpenSSH server. These mechanisms work together to minimize the attack surface and prevent privilege escalation attacks by isolating and restricting access to critical system resources.
The one point of failure of those security mechanisms is the user configuration.
OpenSSH will enable all the restrictions by default (on a supported system), but on releases that still honor the UsePrivilegeSeparation key, a user can also choose to partially enable the restriction mechanisms or disable them completely:
- To only use privilege separation (UsePrivilegeSeparation=yes) without sandboxing. This may allow network attackers to fully compromise a system with privilege escalation.
- Or to disable both the sandbox and privilege separation (UsePrivilegeSeparation=no).
This leads to an insecure system. Once code execution is achieved in the pre-authentication phase, attackers may fully compromise the system.
By running parts of the SSH daemon in a separate, unprivileged process and by confining it to a sandboxed environment, OpenSSH can prevent attackers from exploiting vulnerabilities in the SSH server (like CVE-2023-25136 Double-Free) to gain privileged access to the system.
Following the research above, it is safe to say that at this time, organizations can deploy OpenSSH with confidence, knowing that the risk of code execution and privilege escalation attacks has been considerably mitigated. However, be sure to stay up-to-date with the latest version of OpenSSH and all of the latest security findings.
Stay up-to-date with JFrog Security Research
The security research team’s findings and research play an important role in improving the JFrog Software Supply Chain Platform’s software security capabilities. This manifests in the form of enhanced CVE metadata and remediation advice for developers, DevOps and security teams in the JFrog Xray vulnerability database, and also as new security scanning capabilities used by JFrog Xray.
Follow the latest discoveries and technical updates from the JFrog Security Research team on our research website, in our security research blog posts, and on Twitter at @JFrogSecurity.
Appendix A – OpenSSH sandbox full syscall lists
[1] Linux Sandbox – silently-denied syscalls: lstat, lstat64, fstat, fstat64, fstatat64, open, openat, newfstatat, stat, stat64, shmget, shmat, shmdt, ipc, statx
[2] Linux Sandbox – explicitly-allowed syscalls that check for arguments: mmap, mmap2, mprotect, socketcall, ioctl (only on s390 architecture)
[3] Linux Sandbox – explicitly-allowed syscalls without arguments checking: brk, clock_gettime, clock_gettime64, close, exit, exit_group, futex, futex_time64, geteuid, geteuid32, getpgid, getpid, getrandom, gettid, gettimeofday, getuid, getuid32, madvise, mremap, munmap, nanosleep, clock_nanosleep, clock_nanosleep_time64, _newselect, ppoll, ppoll_time64, poll, pselect6, pselect6_time64, read, rt_sigprocmask, select, shutdown, sigprocmask, time, write
[4] OpenBSD Sandbox – explicitly-allowed syscalls: exit_group, close, dup, dup2, dup3, fchdir, fstat, fsync, fdatasync, ftruncate, getdents, getegid, getrandom, geteuid, getgid, getgroups, getitimer, getpgid, getpgrp, getpid, getppid, getresgid, getresuid, getrlimit, getsid, wait4, gettimeofday, getuid, lseek, madvise, brk, arch_prctl, uname, set_tid_address, clock_getres, clock_gettime, clock_nanosleep, mmap (PROT_EXEC and weird flags aren't allowed), mprotect (PROT_EXEC isn't allowed), msync, munmap, nanosleep, pipe, pipe2, read, readv, pread, recv, poll, recvfrom, preadv, write, writev, pwrite, pwritev, select, send, sendto (only if addr is null), setitimer, shutdown, sigaction (but SIGSYS is forbidden), sigaltstack, sigprocmask, sigreturn, sigsuspend, umask, socketpair, ioctl(FIONREAD), ioctl(FIONBIO), ioctl(FIOCLEX), ioctl(FIONCLEX), fcntl(F_GETFD), fcntl(F_SETFD), fcntl(F_GETFL), fcntl(F_SETFL)