Show HN: C discrete event SIM w stackful coroutines runs 45x faster than SimPy(github.com)

41 pointsby ambonvik6 hours ago6 comments

jerf3 hours ago
While that speed increase is real, of course, you're really just looking at the general speed delta between Python and C there. To be honest I'm a bit surprised you didn't get another factor of 2 or 3.
"Cimba even processed more simulated events per second on a single CPU core than SimPy could do on all 64 cores"
One of the reasons I don't care in the slightest about Python "fixing" the GIL. When your language is already running at a speed where a compiled language can be quite reasonably expected to outdo your performance on 32 or 64 cores on a single core, who really cares if removing the GIL lets me get twice the speed of an unthreaded program in Python by running on 8 cores? If speed was important you shouldn't have been using pure Python.
(And let me underline that pure in "pure Python". There are many ways to be in the Python ecosystem but not be running Python. Those all have their own complicated cost/benefit tradeoffs on speed ranging all over the board. I'm talking about pure Python here.)
- ambonvik2 hours ago
  Good point. The profiler tells me that the context switch between coroutines is the most time-consuming part, even if I tried to keep it as light as possible, so I guess the explanation for "only" getting 45x speed improvement rather than 100x is that it is spending a significant part of the time moving register content to and from memory.
  Any ideas for how to speed up the context switches would be welcome, of course.
anematode3 hours ago
Looks really cool and I'm going to take a closer look tonight!
How do you do the context switching between coroutines? getcontext/setcontext, or something more architecture specific? I'm currently working on some stackful coroutine stuff and the swapcontext calls actually take a fair amount of time, so I'm planning on writing a custom one that doesn't preserve unused bits (signal mask and FPU state). So I'm curious about your findings there
- ambonvik3 hours ago
  Hi, it is hand-coded assembly. Pushing all necessary registers to the stack (including GS on Windows), swapping the stack pointer to/from memory, popping the registers, and off we go on the other stack. I save FPU flags, but not more FPU state than necessary (which again is a whole lot more on Windows than on Linux).
  Others have done this elsewhere, of course. There are links/references to several other examples in the code. I mention two in particular in the NOTICE file, not because I copied their code, but because I read it very closely and followed the outline of their examples. It would probably taken me forever to figure out the Windows TIB on my own.
  What I think is pretty cool (biased as I am) in my implementation is the «trampoline» that launches the coroutine function and waits silently in case it returns. If it does, it is intercepted and the proper coroutine exit() function gets called.
  - anematode2 hours ago
    Interesting. How does the trampoline work?
    I'm wondering whether we could further decrease the overhead of the switch on GCC/clang by marking the push function with `__attribute__((preserve_none))`. Then among GPRs we only need to save the base and stack pointers, and the callers will only save what they need to
    ambonvik2 hours ago
    It is an assembly function that does not get called from anywhere. I pre-load the stack image with its intended register content from C, including the trampoline function address as the "return address". On the first transfer to the newly created coroutine, that gets loaded, which in turn calls the coroutine function that suddenly is in one of its registers along with its arguments. If the coroutine function ever returns, that just continues the trampoline function, which proceeds to call the coroutine_exit() function, whose address also just happens to be stored in another handy register.
    https://github.com/ambonvik/cimba/blob/main/src/port/x86-64/...
    https://github.com/ambonvik/cimba/blob/main/src/port/x86-64/...
    anematode2 hours ago
    Ahhh ok. Cool!
    Do sanitizers (ASan/UBSan/valgrind) still work in this setting? Also I'm wondering if you'll need some special handling if Intel CET is enabled
    ambonvik2 hours ago
    They probably do, but I have not used them. My approach has been "offensive programming", to put in asserts for preconditions, invariants, and postconditions wherever possible. If anything starts to smell, I'd like to stop it cold and fix it rather than trying to figure it out later. With two levels of concurrency in a shared memory space it could get hairy fast if bugs were allowed to propagate to elsewhere before crashing something.
    Not familiar with the details of Intel CET, but this is basically just implementing what others call fibers or "green threads", so any such special handling should certainly be possible if necessary.
    anematode2 hours ago
    Cool. I have faith in thorough testing for the coroutines' correctness, but sanitizers would be convenient for people debugging their own code that leverages this library. I know that ASan doesn't support getcontext et al., but maybe this is different
sovande2 hours ago
Didn’t read the code yet, but stuff like this tend to be brittle. Do you do something clever around stack overflow, function return overwrite or would that just mess up all coroutines using the same stack?
- ambonvik2 hours ago
  Each coroutine is running on its own stack. They are fixed size stacks, at least for now, so that could be a tender point, but I place some sentinel values at the end to try to capture it in an assert() instead of just letting it crash. I did not think it worth the effort and speed penalty to implement growing stacks yet. However, I do catch any coroutine function returns safely instead of letting them fall off the end of their stack.
quibono4 hours ago
I don't know enough about event simulation to talk API design in depth but I find the stackful coroutine approach super interesting so I'll be taking a look at the code later!
Do you plan on accepting contributions or do you see the repo as being a read-only source?
- ambonvik4 hours ago
  I would be happy accepting contributions, especially for porting to additional architectures. I think the dependency is relatively well encapsulated (see src/port), but code for additional architectures needs to be well tested on the actual platform, and there are limits to how much hardware fits on my desk.
measurablefuncan hour ago
How does this compare to Mojo?
Edit: nevermind. I answered the question for myself w/ vibe coding: https://pastes.io/mojo-event.
Workers: 1 | 2 | 4 | 8
Time: 12.01s | 8.01s | 5.67s | 5.49s
Events/sec: 16.7M | 25.0M | 35.3M | 36.4M
Obviously just a first pass & not optimized b/c I'm not a mojo expert.
- ambonvik15 minutes ago
  I am not familiar with Mojo, so I do not know.
  Compared to the coroutine implementations I do know, none of them quite met what I as looking for. The «trampoline» has been mentioned. I also needed a calling convention that fit the higher-level process model with a self pointer and a context, and a signal return value from each context switch. It also has to be thread safe to survive the pthreads. Not very difficult to do, but needs to be designed in from the beginning.
  Same thing with random number generators. It will not work very well to keep state between calls in a static local variable or some global construction, needs to be kept thread local somewhere. Not difficult, but needs to be there from the start both for the basic generator and for the distributions on top of it.
  Quite a bit more here: https://cimba.readthedocs.io/en/latest/background.html#
qotgalaxy4 hours ago
[dead]