There have been other cases where syscalls got cloned, mostly to add new parameters, but either way seccomp with an "open" (default-allow) filter can only ever be defense-in-depth, not a critical line of defense in itself.
(Don't misunderstand, defense-in-depth is good, and keep using seccomp for it. But an open seccomp filter MUST be considered bypassable.)
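To make the point concrete, here is a minimal sketch in Python. It is a toy model, not real seccomp or BPF, and it assumes "open" means a default-allow (denylist) filter. The filter sets are made up for illustration; openat2 stands in for a syscall that was cloned after the denylist was written.

```python
# Toy model of seccomp-style filtering; not real seccomp or BPF.
DENIED = {"open", "openat"}            # denylist written before openat2 existed
ALLOWED = {"read", "write", "close"}   # allowlist of calls the program needs


def denylist_filter(syscall: str) -> bool:
    """Default-allow: permit anything not explicitly denied."""
    return syscall not in DENIED


def allowlist_filter(syscall: str) -> bool:
    """Default-deny: permit only what is explicitly allowed."""
    return syscall in ALLOWED


if __name__ == "__main__":
    for name in ("open", "openat", "openat2"):
        print(name, "denylist:", denylist_filter(name),
              "allowlist:", allowlist_filter(name))
```

The denylist lets openat2 through simply because nobody listed it, while the allowlist rejects it without needing to know it exists.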
IIRC, AWS does this: permissions are granted per API call. As an example, you can have permission to call ssm:GetParameter n times, but if you try to combine those n API calls into a batch with GetParameters, that's a different IAM permission, even though exactly the same thing is occurring.
Multipart uploads in S3 have a similar problem.
A new io_uring fd comes with all operations disabled by default. The user has to call "io_uring_register(fd, ENABLE_OP, op)" before an operation is used for the first time. Then a seccomp filter can easily filter ENABLE_OP calls to prohibit certain operations.
It could even be added now in a backward-compatible way: add a new feature to io_uring_setup that enables it. Then one could set a seccomp filter to only accept setup requests with this feature set, and deny all others. Together, this should allow cooperating programs to pass the seccomp filter, while programs that won't register ops could not use io_uring at all.
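A rough Python sketch of how that proposal could behave. This is a toy model of the idea only: OPS_DISABLED_FLAG, the Ring class, and the blocked operation names are all hypothetical, and nothing here is an existing kernel interface.

```python
# Toy model of the proposal; none of these names are real kernel APIs.
OPS_DISABLED_FLAG = 0x1              # hypothetical io_uring_setup feature flag
BLOCKED_OPS = {"OPENAT", "CONNECT"}  # example policy, chosen for illustration


class Denied(Exception):
    pass


def seccomp_filter(call, **args):
    """Stand-in for a seccomp filter that only sees syscall arguments."""
    if call == "io_uring_setup" and not args["flags"] & OPS_DISABLED_FLAG:
        raise Denied("setup without the ops-disabled feature")
    if call == "io_uring_register" and args["op"] in BLOCKED_OPS:
        raise Denied("ENABLE_OP " + args["op"])


class Ring:
    def __init__(self, flags):
        seccomp_filter("io_uring_setup", flags=flags)
        self.enabled = set()  # every operation starts disabled

    def enable_op(self, op):
        # The only path that enables an operation is a filterable syscall.
        seccomp_filter("io_uring_register", op=op)
        self.enabled.add(op)

    def submit(self, op):
        # Submissions never reach the filter; the ring enforces the set.
        if op not in self.enabled:
            raise Denied(op + " was never enabled")
        return op + " submitted"
```

In this model a cooperating program creates Ring(OPS_DISABLED_FLAG), enables READ, and submits it normally, while enabling OPENAT or creating Ring(0) raises Denied at the filterable syscall.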
I would gladly get more details about the exact purpose of seccomp in a container environment. Reading around a bit online, I find that Docker "uses seccomp to block mount(2), which could be used to escape the container", which makes no sense to me because mount(2) already requires CAP_SYS_ADMIN.
seccomp is used for defense in depth. If someone managed to escalate privileges through some means, the seccomp policy will still prevent them from doing nasty things or escalating further.
That is, seccomp is (apparently? I’ve never used it myself) capable of intercepting direct calls. Obviously, that design isn’t going to be able to handle “indirect” calls in its default implementation.
Either seccomp needs a way to act on the buffer or intercept io_uring calls, or there’s a need for a new auth mechanism that’s capable of handling io_uring-style APIs.
Torpedoing the whole API (a la GCP) feels like throwing the baby out with the bathwater.
If you want to make a case for io_uring being benign for security, the right argument is probably against all unmediated shared-kernel multitenancy (i.e., for multitenancy only through virtualization or WASM/V8-type language runtimes, and nothing else). It doesn't make sense to say system call filters are flawed because someone came up with an omni-syscall that breaks those filters.
It seems like a totally reasonable design to me to "just" put the right hooks into the filter mechanism and make it get called the same way regardless of the syscall mechanism.
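A minimal Python sketch of that design, assuming a single policy hook placed at the point where an operation is actually executed rather than at syscall entry. All names are illustrative and do not correspond to kernel interfaces.

```python
# Toy model; the names are illustrative, not kernel interfaces.
from typing import Callable, List

Policy = Callable[[str], bool]


def policy(op: str) -> bool:
    """Single policy consulted for every operation, whatever its origin."""
    return op not in {"openat", "connect"}


def execute(op: str, check: Policy) -> str:
    """Common execution point: the hook runs here, not at syscall entry."""
    if not check(op):
        return op + ": denied"
    return op + ": ok"


def direct_syscall(op: str) -> str:
    return execute(op, policy)


def ring_submit(queue: List[str]) -> List[str]:
    # Each queued entry goes through the same hook as a direct call.
    return [execute(op, policy) for op in queue]
```

Here direct_syscall("openat") and an "openat" entry queued through ring_submit get the same verdict, because both paths converge on execute before anything happens.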
Is this even possible, regardless of io_uring?