62 points by pncnmnp 3 days ago | 7 comments
  • eqvinox 8 hours ago
    Using seccomp with a default-open filter is a terrible idea to begin with; it wasn't really designed for any of this. Seccomp in its most basic form didn't even have a filter list, it just allowed read() and write(). (And close() or something, don't quote me on the details, the point is it was a fixed list.) You're supposed to use it with a default-closed filter and fully enumerate what you need. (Yes, that's hard in a lot of cases, but still.)

    There have been other cases where syscalls got cloned, mostly to add new parameters, but either way seccomp with an "open" filter can only ever be defense-in-depth, not a critical line in itself.

    (Don't misunderstand, defense-in-depth is good, and keep using seccomp for it. But an open seccomp filter MUST be considered bypassable.)
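
    The difference between the two filter styles can be sketched with a toy Python model. This is not the seccomp API (real filters are BPF programs evaluated by the kernel); `default_open` and `default_closed` are hypothetical helpers, and `clone`/`clone3` stand in for any syscall that later gained a newer variant.

```python
# Toy model only: real seccomp filters are BPF programs run by the kernel.
KILL, ALLOW = "kill", "allow"

def default_open(denied):
    """Denylist: anything not explicitly denied is allowed."""
    return lambda syscall: KILL if syscall in denied else ALLOW

def default_closed(allowed):
    """Allowlist: anything not explicitly allowed is rejected."""
    return lambda syscall: ALLOW if syscall in allowed else KILL

# Both filters are assumed to predate the newer clone3 variant.
open_filter = default_open({"clone"})
closed_filter = default_closed({"read", "write", "close", "exit_group"})

for syscall in ("read", "clone", "clone3"):
    print(f"{syscall:8} open={open_filter(syscall)} closed={closed_filter(syscall)}")
```

    In this model the denylist lets `clone3` through because nobody listed it, while the allowlist rejects it without needing to know it exists.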

  • deathanatos 9 hours ago
    This seems like an instance of an anti-pattern I've seen, which is conflating "permission" and "API call" as the same thing.

    IIRC, AWS does this, where permission is by API call. As an example, you can have permission to call ssm:GetParameter n times, but if you try to combine those n API calls into a batch with GetParameters, that's a different IAM perm, even though exactly the same thing is occurring.
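
    A rough Python sketch of that per-call granularity follows. It is a toy matcher, not AWS's policy evaluation logic: `is_allowed` is a hypothetical helper, and only the two action names come from the example above. IAM action patterns do accept `*` wildcards, which is one way to cover both calls.

```python
from fnmatch import fnmatchcase

def is_allowed(policy_actions, requested_action):
    """Toy check: allowed if any policy pattern matches the requested action."""
    return any(fnmatchcase(requested_action, pattern) for pattern in policy_actions)

narrow = ["ssm:GetParameter"]     # grants the single-item call only
wildcard = ["ssm:GetParameter*"]  # pattern that also covers the batch call

print(is_allowed(narrow, "ssm:GetParameter"))     # True
print(is_allowed(narrow, "ssm:GetParameters"))    # False
print(is_allowed(wildcard, "ssm:GetParameters"))  # True
```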

    • thayne 2 hours ago
      I find that so frustrating. Another example is uploading an image to ECR (Elastic Container Registry). You need like four different permissions to do it, which I think correspond to individual HTTP requests, but it is usually just a single docker/podman/skopeo command, and I can't think of a situation where you would want to grant permission to initiate an upload but not complete it.

      Multipart uploads in S3 have a similar problem.

  • theamk 2 hours ago
    I was thinking about how one would change io_uring design to be compatible with seccomp, and came up with a very simple one:

    A new io_uring fd comes with all operations disabled by default. The user has to call "io_uring_register(fd, ENABLE_OP, op)" before an operation is used for the first time. Then a seccomp filter can easily filter the ENABLE_OP calls to prohibit certain operations.

    It could even be added now in a backward-compatible way: add a new feature flag to io_uring_setup that enables it. Then one could set a seccomp filter to accept only setup requests with this flag set and deny all others. Together, this should allow cooperating programs to pass the seccomp filter, while programs that won't register ops could not use io_uring at all.

  • cpuguy83 9 hours ago
    Both Docker and containerd have blocked io_uring in their default profiles for about a year now, due to too many security issues with it.
    • bri3d 6 hours ago
      And Google, in ChromeOS, Android, and purportedly, Google production servers, for around a year and a half, as well. For this reason it's also disabled in several of the kernelCTF configurations and in the ones where it remains (GKE), it only pays out at half-rate in bug bounty.
    • hinkley 8 hours ago
      Has anyone speculated yet about how much slower a secure io_uring has to be? Is it still a net win once you lock it down fully?
      • JackSlateur 5 hours ago
        As far as I know, io_uring is quite secure: a user cannot perform a syscall through it unless they have the privileges required to perform that syscall directly.

        I would gladly get more details about the exact purpose of seccomp in a container environment. Reading around a bit online, I find that docker "uses seccomp to block mount(2), which could be used to escape the container", which makes no sense to me because mount(2) requires CAP_SYS_ADMIN.

      • cpuguy83 4 hours ago
        That would be impossible to know. The main thing with io_uring is that you don't need to context switch (i.e., make system calls) to perform a number of operations.
  • FridgeSeal 5 hours ago
    Surely this is a seccomp shortcoming, or kernel auth shortcoming, rather than an io_uring problem?

    That is, seccomp is (apparently? I’ve never used it myself) capable of intercepting direct calls. Obviously, that design isn’t going to be able to handle “indirect” calls in its default implementation.

    Either seccomp needs a way to act on the buffer or intercept io_uring calls, or there’s a need for a new auth mechanism that’s capable of handling io_uring-style APIs.

    Torpedoing the whole API (à la GCP) feels like throwing the baby out with the bath water.

    • tptacek 3 hours ago
      That framing doesn't make sense. System calls and their arguments are an obvious security boundary and have been a sandboxing component for decades. io_uring blows that boundary apart. The "problem" is io_uring, not seccomp.

      If you want to make a case for io_uring being benign for security, the right argument is probably against all unmediated shared-kernel multitenancy (i.e., multitenancy either through virtualization or WASM/V8-type language runtimes, and nothing else). It doesn't make sense to say system call filters are flawed because someone came up with an omni-syscall that breaks those filters.

      • asveikau 2 hours ago
        The syscall implementations themselves do checks and return EPERM/EACCES when appropriate. The mechanism for doing the syscall can change. I mean, in the 90s it happened via int 0x80, then we got sysenter, then the vDSO. io_uring just moved part of it to user mode.

        It seems like a totally reasonable design to me to "just" put the right hooks into the filter mechanism and make it get called the same way regardless of the syscall mechanism.
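
        A toy Python model of that idea, under loud assumptions: the functions below are hypothetical, ring ops are mapped one-to-one onto syscall names for simplicity, and the `hook` parameter represents the hook being proposed here, not an existing kernel feature.

```python
DENIED = {"openat"}

def seccomp_filter(syscall):
    """Toy denylist filter: True means the call may proceed."""
    return syscall not in DENIED

def direct_syscall(name):
    # Direct path: the filter runs on every syscall entry.
    if not seccomp_filter(name):
        raise PermissionError(f"{name} blocked at syscall entry")
    return f"{name} done"

def ring_submit(ops, hook=None):
    # Ring path: the filter only ever sees the submitting syscall.
    if not seccomp_filter("io_uring_enter"):
        raise PermissionError("io_uring_enter blocked at syscall entry")
    results = []
    for op in ops:
        # Hypothetical hook: run the same filter on each queued operation.
        if hook is not None and not hook(op):
            raise PermissionError(f"{op} blocked by ring hook")
        results.append(f"{op} done")
    return results

print(ring_submit(["openat"]))  # no hook: the denied op goes through
for attempt in (lambda: direct_syscall("openat"),
                lambda: ring_submit(["openat"], hook=seccomp_filter)):
    try:
        attempt()
    except PermissionError as err:
        print(err)
```

        With the hook passed in, both paths reach the same decision; without it, the filter's view stops at `io_uring_enter`.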

  • leni536 10 hours ago
    > But if you've got a separation of duties where a sysadmin sets up seccomp filtering generically across applications

    Is this even possible, regardless of io_uring?

    • amarshall 9 hours ago
      Well the article brings up containers as an example. If the sysadmin controls “your” parent or root process (e.g. the login shell), they can just perform seccomp filtering there and it applies to everything within it (like any other sandbox).
      • 0x74696d 4 hours ago
        (author here) I'm one of the maintainers of HashiCorp's Nomad, so that example was likely inspired by the separation of duties that's part of our security model. In that environment, there's a subset of task (e.g., container) configuration that's controlled by the cluster admin and a subset that's controlled by the job author deploying onto the cluster.
    • klooney 3 hours ago
      Yes, systemd will let you do that, as well as docker/containerd/podman.
  • 0x74696d 4 hours ago
    Author here! The motivating example of this post is frankly pretty lousy in retrospect (and was even so soon after writing, given the friendly reminder from Giovanni Campagna that `socket` wasn't one of the io_uring opcodes). At best this is an interesting limitation of seccomp. Maybe relevant if you were using gVisor?