I/O Multiplexing (select vs. poll vs. epoll/kqueue) (nima101.github.io)

152 points by pykello 4 months ago | 13 comments

eqvinox 4 months ago
> epoll/kqueue are replacements for their deprecated counterparts poll and select.
Neither poll nor select is deprecated. They're just not good fits for particular use patterns. But even select() is fine if you just need to watch 2 FDs in a CLI tool.
In fact, due to its footguns, I'd highly advise against epoll (particularly edge triggering) unless you really need it.
- loeg 4 months ago
  > But even select() is fine if you just need to watch 2 FDs in a CLI tool.
  Only if those fds are below ~1024 or whatever. (If you're going to use one of the legacy interfaces, at least poll() doesn't have arbitrary limits on the numeric value of the fd.)
  - ahartmetz 4 months ago
    The Winsock version doesn't have this limitation. It's a very weak select() because it only works on network sockets, but it doesn't care about numeric values of file descriptors. As on POSIX, file descriptors are added and removed to the select set using macros, and these work on a vector or linked list (I forgot) instead of a bitset.
    loeg 4 months ago
    We’re talking about Linux/POSIX here.
  - cozzyd 4 months ago
    1024 is the default fd limit anyway, isn't it?
    saurik 4 months ago
    ...because of select, and how it became essentially impossible to ensure that a process wasn't using it anywhere, indirectly or incidentally.
- signa11 4 months ago
  or better yet, go with libevent (https://libevent.org), almost always better than 'naked' calls to low level routines, and cross-platform to boot.
  - gritzko 4 months ago
    libuv, libevent introduce a layer of abstraction, their own approach to buffer mgmt, etc. If poll() works, better stay with poll(). It is universally portable, things stay pretty clean and simple as a result.
    Right now I am working on JavaScript bindings for a project and doing it the node.js way (or Deno) is definitely a no-no. That would be one more layer in the architecture, if not two. Once you have more layers, you also have more layer interactions and that never stops.
    I mean, complexity begets complexity
    https://github.com/gritzko/librdx/blob/master/js/README.md
    Having more than 1000 conns per thread is a very specific usecase.
    HackerThemAll 4 months ago
    Very specific, but very common, such as a web server, forward proxy, reverse proxy, load balancer, etc. Used in high millions of instances. I'd say a toy CLI tool is a very specific usecase for a lot less people than the above.
    gritzko 4 months ago
    More like a chat server or a sync server. Normal Web server connections are short lived. Flash crowds from HN would give 100K visits a day maybe, that is like 1 or 2 at a time unless you WebSocket them.
    Imagine what sort of traffic you need to saturate 1000 conns with HTTP. Can your single thread app handle it? If you are not using nginx, that must be something less trivial than a reverse proxy. Nginx can do lots of things, by the way.
- oconnor663 4 months ago
  select() is at least kind of deprecated, in that its own man page says not to use it in new code.
  - toast0 4 months ago
    I don't see it in the man page?
    https://man.freebsd.org/cgi/man.cgi?select
    The man page also suggests how you might increase the FD limit if needed. I still use select for a small number of FDs where overhead isn't a real concern, and select is a good fit.
    hippo22 4 months ago
    https://man7.org/linux/man-pages/man2/select.2.html
    buckle8017 4 months ago
    In anything new you should use poll not select.
    They're basically identical apis but poll doesn't have a hard limit and works with high number fds.
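For a handful of descriptors, the poll() version of a small select() loop can be sketched roughly as below. This is a minimal illustration assuming a POSIX system; `wait_readable` and `MAX_WATCHED` are hypothetical names chosen for the example, not anything from the thread or from a library.

```c
#include <assert.h>
#include <poll.h>
#include <stddef.h>

/* Hypothetical cap on how many descriptors this sketch watches at once. */
#define MAX_WATCHED 16

/* Hypothetical helper: wait up to timeout_ms for any descriptor in fds[] to
 * become readable. Returns the index of the first ready descriptor, -1 on
 * timeout, or -2 on error. Nothing here depends on the numeric value of a
 * descriptor, so fd 2000 works the same way as fd 5. */
int wait_readable(const int *fds, size_t nfds, int timeout_ms)
{
    struct pollfd pfds[MAX_WATCHED];

    assert(nfds <= MAX_WATCHED);
    for (size_t i = 0; i < nfds; i++) {
        pfds[i].fd = fds[i];
        pfds[i].events = POLLIN;
        pfds[i].revents = 0;
    }

    int ready = poll(pfds, (nfds_t)nfds, timeout_ms);
    if (ready < 0)
        return -2;              /* errno holds the reason, e.g. EINTR */
    if (ready == 0)
        return -1;              /* timed out */

    for (size_t i = 0; i < nfds; i++) {
        if (pfds[i].revents & POLLNVAL)
            return -2;          /* descriptor was not open */
        /* POLLHUP and POLLERR are reported even when not requested; treat
         * them as ready so the caller's read() sees EOF or the error. */
        if (pfds[i].revents & (POLLIN | POLLHUP | POLLERR))
            return (int)i;
    }
    return -1;
}
```

A caller would pass something like `int fds[] = { STDIN_FILENO, sock };` and dispatch on the returned index.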
    toast0 4 months ago
    Doesn't seem super relevant when the program will only have 5 FDs. Safe signal handling in poll does seem handy though.
    naniwaduni 4 months ago
    Your program can be executed with fds 0-1023 already open.
- dataflow 4 months ago
  What's the footgun with edge triggering?
  - buckle8017 4 months ago
    The edge in epoll edge triggering is going from doesn't have data to has data.
    So the obvious loop using level triggering switched to edge will eventually lock up.
    You'll read 4092 bytes when there are 4093 bytes, leaving 1 behind, and then never get a signal again.
    HackerThemAll 4 months ago
    This is a blatant application bug, not an epoll issue, unless you prove otherwise.
    ahartmetz 4 months ago
    It is a very, very easy mistake to make though. Nothing except edge-triggered I/O multiplexing makes it a problem not to read everything you possibly could, and it is often convenient to read less and let some other part of the code handle the rest. Forgot to call that part somehow? Oops, I/O on that socket is now screwed forever.
    nly 4 months ago
    You just read() until it returns EWOULDBLOCK/EAGAIN before calling epoll_wait again. It's no less valid than level-triggering as a default mental model.
    ahartmetz 4 months ago
    It's valid, just very error-prone. An advantage of the readiness model of doing I/O over the IOCP/io_uring model is that you keep control over when, where and how much data you read. If you always have to read everything, that advantage is greatly reduced. You still have a little more control and easier memory management. Performance is generally worse vs IOCP - you should at least get some convenience for it!
- MomsAVoxell 4 months ago
  select() is great for embedded daemons and user space signals handling, and so on.
  Just don't try to solve the 10,000x problem with it, by putting it on the Internet.
  Or, if you do, build it out properly.
  Or use epoll or kqueue.
thasso 4 months ago
This part is bewildering to me:
> Now, if you try to watch file descriptor 2000, select will loop over fds from 0 to 1999 and will read garbage. The bigger issue is when it tries to set results for a file descriptor past 1024 and tries to set that bit field in say readfds, writefds or errorfds field. At this point it will write something random on the stack eventually crashing the process and making it very hard to debug what happened since your stack is randomized.
I'm not too literate on the Linux kernel code, but I checked, and it looks like the author is right [1].
It would have been so easy to introduce a size check on the array to make sure this can't happen. The man page reads like FD_SETSIZE differs between platforms. It states that FD_SETSIZE is 1024 in glibc, but no upper limit is imposed by the Linux kernel. My guess is that the Linux kernel doesn't want to assume a value of FD_SETSIZE so they leave it unbounded.
It's hard to imagine how anyone came up with this thinking it's a good design. Maybe 1024 FDs was so much at the time when this was designed that nobody considered what would happen if this limit is reached? Or they were working on a system where 1024 was the maximum number of FDs that a process can open?
[1]: The core_sys_select function checks the nfds argument passed to select(2) and modifies the fd_set structures that were passed to the system call. The function ensures that n <= max_fds (as the author of the post stated), but it doesn't compare n to the size of the fd_set structures. The set_fd_set function, which modifies the user-side fd_set structures, calls right into __copy_to_user without additional bounds checks. This means page faults will be caught and return -EFAULT, but out-of-bounds accesses that corrupt the user stack are possible.
- ajross 4 months ago
  You (and the author) are misunderstanding. These are all userspace pointers. If the process passes the kernel a buffer and tells it to access it past the end, the kernel will happily do so. It applies all the standard memory protection rules, which means that if your pointer is unmapped or unwritable, the kernel will signal the error (as a SIGSEGV) just as if the process had touched the memory itself.
  It's no different than creating a 1024 byte buffer and telling read() to read 2048 bytes into it.
  To be fair there's an API bug here in that "fd_set" is a fixed-size thing for historical compatibility reasons, while the kernel accepts arbitrarily large buffers now. So code cutting and pasting from historical examples will have an essentially needless 1024 FD limit.
  Stated differently: the POSIX select() has a fixed limit of file descriptors, the linux implementation is extensible. But no one uses the latter feature (because at that scale poll and epoll are much better fits) and there's no formal API for it in the glibc headers.
  - thasso 4 months ago
    I don't get where my misunderstanding lies. Didn't I point out that the __copy_to_user call returns EFAULT if the memory is unmapped or unwritable? The problem is that some parts of the user stack may be mapped and writable although they're past the end of the fd_set structure.
    > there's no formal API for it in the glibc headers
    The author claims you can pass nfds > 1024 to select(2). If you use the fd_set structure with a size of 1024, this may lead to memory corruption if an FD > 1023 becomes ready, if I understand correctly.
    ajross 4 months ago
    Once more, the kernel has never been responsible for managing userspace memory. If the userspace process directs the kernel to write to memory it didn't "intend" the kernel to write to, the kernel will happily do so. Think again on the example of the read() system call I mentioned. How do you propose to fix the problem there?
    The "problem", such as it is here, is that the POSIX behavior for select() (that it supports only a fixed size for fd_set) was extended in the Linux kernel[1] to allow for arbitrary file descriptor counts. But the POSIX API for select() was not equivalently extended, if you want to use this feature you need to call it with the Linux system call API and not the stuff you find in example code or glibc headers.
    [1] To be perfectly honest I don't know if this is unique to Linux. It's a pretty obvious feature, and I bet various BSDs or OS X or whatnot have probably done it too. But no one cares because at the 1024+ FD level System V poll() is a better API, and event-based polling is better still. It's just Unix history at this point and no one's going to fix it for you.
    thasso 4 months ago
    Your example on read(2) is a good one. There's no way to fix it purely by changing the API because, by nature, the user chooses the size of the buffer.
    The difference is that fd_set is a structure that's not defined by the user. If fd_set had a standard size, the kernel could verify that nfds is within the allowed range for the fd_set structure. The select(2) system call would be harder to misuse then, although misuse would still be possible by passing custom buffers instead of pointers to fd_set structures. In that sense, I think we agree on the "problem".
    It's indeed just a bit of Unix history, but I was surprised by it nonetheless.
    loeg 4 months ago
    I think ajross would argue that if anything, it is glibc's responsibility to check nfds <-> sizeof(fd_set), rather than the kernel.
- toast0 4 months ago
  > Maybe 1024 FDs was so much at the time when this was designed that nobody considered what would happen if this limit is reached? Or they were working on system where 1024 was the maximum number of FDs that a process can open?
  The article says select is from 1983. 1024 FDs is a lot for 1983. At least in current FreeBSD, it's easy to #define the setsize to be larger if you're writing an application that needs it larger. It's not so easy to manage if you're a library that might need to select larger FDs.
  Lots of socket syscalls include a size parameter, which would help with this kind of thing. But you still might buffer overflow with FD_SET in userspace.
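One way to avoid the userspace overflow is to range-check the descriptor before touching the set. The wrapper below is a minimal sketch, assuming a standard fixed-size fd_set; `fd_set_checked` is a hypothetical name, not a libc function.

```c
#include <sys/select.h>

/* Hypothetical wrapper around FD_SET: refuses descriptors that do not fit
 * in a standard fd_set instead of writing past the end of it.
 * Returns 0 on success, -1 if fd is out of range. */
int fd_set_checked(int fd, fd_set *set)
{
    if (fd < 0 || fd >= FD_SETSIZE)
        return -1;          /* FD_SET here would index outside the bitset */
    FD_SET(fd, set);
    return 0;
}
```

A caller that gets -1 back can fall back to poll() for that descriptor rather than corrupting whatever sits next to the fd_set on the stack.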
mort96 4 months ago
I wish the UNIXes had gone together and standardized a modern alternative to poll, maybe as part of POSIX. It sucks that any time I want to listen to IO events, I have to choose between old, low performance, cross-platform APIs and the new, higher-performance but Linux-only epoll.
- ninjin 4 months ago
  Which is why there is libevent [1]?
  [1]: https://libevent.org
  Unless I am mistaken, OpenBSD base even explicitly codes against the older libevent API internally and ships it with each release, despite at the very least supporting kqueue, and thus gains better portability for a number of their tools this way.
  Personally, I just go with Posix select for small programs where performance is not critical anyway.
  - eqvinox 4 months ago
    There are a whole bunch of these — libevent, libev, glib's main loop, Qt's main loop, Apache's modular event loop, …
    …which is why there is libverto, a 2nd order abstraction.
    It'd be funny if it weren't also sad.
    loeg 4 months ago
    libuv as well.
- usrnm 4 months ago
  Aren't there enough wrapper libraries for all programming languages that take care of this under the hood? You don't have to rely on libc only
  - mort96 4 months ago
    Sure, there are wrapper libraries. But then I'm met with the question: do I add some big heavy handed IO wrapper library, or ... do I just call poll
    Galanwe 4 months ago
    I wouldn't count uv/ev/etc as "big heavy IO wrapper library".
    mort96 4 months ago
    I would, especially when nothing else in the program uses it and you just introduce it for one small thing in place of calling poll(). It's over 40 000 loc, over 70 000 including tests.
    paulddraper 4 months ago
    I certainly would
- ahartmetz 4 months ago
  For sure. Though every platform does have its own high-performance alternative, with only kqueue shared by some less popular ones.
lukaslalinsky 4 months ago
The trouble with I/O multiplexing in a language like C is that the callbacks and state machines get quite complex as you need more functionality. In C++ you can at least do closures, so it's easier to manage. I recently wanted to add networking to my Zig project and decided to do some yak shaving and implemented a fiber runtime with async I/O to avoid the callback complexity. https://github.com/lalinsky/zio
- spacechild1 4 months ago
  In C++20 you can use asio + coroutines. I find it pretty nice to work with.
- qudat 4 months ago
  Wow nice! How does this compare to libxev?
  - jfadfwddas 4 months ago
    I was curious as well and looks like this abstracts over libxev: https://github.com/lalinsky/zio/blob/main/build.zig#L7
    lukaslalinsky 4 months ago
    Indeed, it's a translation of the callback-based libxev events to coroutines. I ended up temporarily forking libxev, to add support for vectored I/O and other small fixes, but all those changes will be upstreamed.
    jfadfwddas 4 months ago
    Great stuff. I will be using this if/when I go back to zigging :)
- nesarkvechnep 4 months ago
  As usual, no FreeBSD support.
sureglymop 4 months ago
Good read but I wish it included io_uring as well.
- marginalia_nu 4 months ago
  It's probably hard to include io_uring in something like this, without the article turning into an article mostly about io_uring. It's a cool API that can be incredibly fast, but it also comes with a very long list of caveats.
  - sureglymop 4 months ago
    When I learned about epoll it was at first entirely from man pages, then by looking at source code of async runtimes like tokio and libuv. I only learned about io_uring a few years after that. So, just mentioning that it exists may be interesting for readers. Nothing in-depth.
drewg123 4 months ago
> kqueue (on macOS)
Wish they'd give some credit to FreeBSD, where it originated.
lynx97 4 months ago
There is no mention of epoll in this other than the heading.
- lstodd 4 months ago
  It's because epoll === kqueue mostly.
  Besides kqueue grew from FreeBSD, not OSX. Such ignorance saddens me much more.
tarruda 4 months ago
I have implemented a simple asyncio compatible micro event loop library in python.
The goal was to understand the underlying mechanisms behind python's async/await and to help coworkers understand how event loops work under the hood.
The end result is somewhat interesting, as unlike traditional event loop libraries, it doesn't use callbacks as the scheduling primitive: https://gist.github.com/tarruda/5b8c19779c8ff4e8100f0b37eb59...
kcexn 4 months ago
poll is the POSIX specified I/O multiplexer so it has the advantage of being portable. Windows even supports a version of poll called WSAPoll.
If you must implement your own event loop and you want your application to be portable, poll is still a good place to begin.
O(N) demultiplexing time in the pollfd array is also not as brutal as it seems on modern hardware. The pollfd structure itself is only 8 bytes wide, so you can comfortably pack thousands of them into the L1 cache. Copying all of the elements that have an active event into a new smaller array before processing them is going to be fast enough for most cases.
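The compaction step mentioned above can be sketched as a single linear pass. This is a minimal illustration, assuming the array has just been filled in by poll(); `collect_ready` is a hypothetical helper name.

```c
#include <poll.h>
#include <stddef.h>

/* Hypothetical helper: copy every entry of in[0..n) whose revents is
 * nonzero into out[] and return how many were copied. out must have room
 * for n entries. The scan is sequential over small fixed-size structs, so
 * it is friendly to the cache and the prefetcher. */
size_t collect_ready(const struct pollfd *in, size_t n, struct pollfd *out)
{
    size_t count = 0;

    for (size_t i = 0; i < n; i++) {
        if (in[i].revents != 0)
            out[count++] = in[i];
    }
    return count;
}
```

The event handlers then iterate only over `out[0..count)`, so their cost scales with the number of active descriptors rather than the number of watched ones.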

quibono 4 months ago

I'm assuming epoll is covered implicitly by the section on kqueue. Are there any differences between the two besides the name?

toast0 4 months ago

epoll returns a single value for events, and kqueue returns a struct.

   typedef union epoll_data {
       void    *ptr;
       int      fd;
       uint32_t u32;
       uint64_t u64;
   } epoll_data_t;

   struct kevent {
       uintptr_t  ident;   /* identifier for this event */
       short      filter;  /* filter for event */
       u_short    flags;   /* action flags for kqueue */
       u_int      fflags;  /* filter flag value */
       int64_t    data;    /* filter data value */
       void      *udata;   /* opaque user data identifier */
       uint64_t   ext[4];  /* extensions */
   };

For read/write events, ident is the FD and data is the number of bytes that can be read or written.

Luker88 4 months ago
I have vague memories of OSX kqueue not supporting all the usecases that FreeBSD kqueue does from many years ago.
Have they reached feature parity?
- nesarkvechnep 4 months ago
  I doubt it because applications, using kqueue, written for OSX can’t easily be ported to FreeBSD. ghostty is one such app.
  - loeg 4 months ago
    Ghostty uses Mach ports on OS X in addition to kqueue. Source is here:
    https://github.com/mitchellh/libxev/blob/main/src/backend/kq...
khaledh 4 months ago
Needs "(2020)" in the title.
commandersaki 4 months ago
Nice article, though a few spelling mistakes that I thought were to distinguish it from AI slop, only to realise this was written a few years before the AI/GPT craze.