For loops are the "goto"s of the parallel programming era.
Ditch them and the rest can be handled by the programming language abstraction.
Why? Because they (1) enforce an order of execution and (2) allow breaking out of the computation after a certain number of iterations.
Not sure how “break” would be interpreted in this context. Maybe it should make the program crash, or it could be equivalent to “continue” (in the programming model, all of the iterations would be happening in parallel anyway).
I vaguely feel like “for” would actually have been the best English word for this construct, if we stripped out the existing programming context. I mean, if somebody gives you instructions like:
For each postcard, sign your name and put it in an envelope
You don’t expect there to be any non-trivial dependencies between iterations, right? Although, we don’t often give each other complex programs in English, so maybe the opportunity for non-trivial dependencies just doesn’t really arise anyway…
In math, usually when you encounter “for,” it is being applied to a whole set of things without any loop dependency implied (for all x in X, x has some property). But maybe that’s just an artifact of there being less of a procedural bias in math…
For example, if you need to respond to a request in 100ms and it depends on 100 service calls, you can make 100 calls with an 80ms timeout; get 90 quick responses, two of which are transient errors, and immediately retry those two; get eight more successful responses and two timeouts (and the two retries succeed); and then send the response within the SLA using the 98 successful responses you received.
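Roughly, as a sketch in Scala (made-up callService, illustrative budgets, and blocking Awaits just to keep it short; a real client would use a timer-based timeout instead):

    import scala.concurrent.{Await, ExecutionContext, Future}
    import scala.concurrent.duration._
    import scala.util.{Success, Try}

    object FanOut {
      // Hypothetical service call: succeeds quickly, fails fast with a
      // transient error, or hangs past the deadline.
      def callService(id: Int)(implicit ec: ExecutionContext): Future[String] = ???

      // Wait out a shared time budget for a batch of futures without
      // failing fast: each future gets whatever time is left.
      private def awaitBudget[A](fs: Seq[Future[A]], budget: FiniteDuration): Unit = {
        val deadline = budget.fromNow
        fs.foreach(f => Try(Await.ready(f, deadline.timeLeft max Duration.Zero)))
      }

      def respond(n: Int)(implicit ec: ExecutionContext): Seq[String] = {
        val firstWave = (1 to n).map(i => i -> callService(i))
        awaitBudget(firstWave.map(_._2), 80.millis)

        // Retry only the calls that already failed (the transient errors);
        // calls still in flight are left alone and may yet finish.
        val retries = firstWave.collect {
          case (i, f) if f.value.exists(_.isFailure) => callService(i)
        }
        awaitBudget(retries, 15.millis)

        // Build the response from every success collected so far;
        // the stragglers are simply dropped.
        (firstWave.map(_._2) ++ retries).flatMap(_.value).collect { case Success(v) => v }
      }
    }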
Even if you try to do it with heuristics, go ask Itanium how that worked out for them, and they were tackling a much simpler problem than what you’re proposing.
But it seems to me like this would be a safe space to experiment, with heuristics and pragmas as a fallback, because with the right approach the solutions would mostly be better than doing nothing at all.
And you could do it at runtime, when you know the size of the input.
And what about applying the logic to places where you can see that the loop will end?
I believe query planners in, for example, Trino/BigQuery already do this?
https://en.wikipedia.org/wiki/Itanium#Market_reception
> A compiler must attempt to find valid combinations of instructions that can be executed at the same time, effectively performing the instruction scheduling that conventional superscalar processors must do in hardware at runtime.
> In an interview, Donald Knuth said "The Itanium approach...was supposed to be so terrific—until it turned out that the wished-for compilers were basically impossible to write."
There were lots of people involved, but the gist of it is that scheduling auto-optimization is not really possible to do statically at compile time, and it adds too much overhead to attempt generically at run time. In Itanium's case it was instruction-level parallelism, but the same fundamental issues apply to this kind of automatic loop parallelization.
> I believe query planners in, for example, Trino/BigQuery already do this?
That is completely different from trying it on every for loop in a program. They do it at the broader task level to determine how much parallelism will help.
I am not convinced that this is not possible, just because a project from the early 2000s failed, and Knuth said it was a bad idea.
I am not talking about a general optimizer of poorly written arbitrary programs. Rather an optimizer for parts of the code written with a certain paradigm. Just like BigQuery does with SQL.
(Thank you for sticking with the thread)
I'm not saying it'll never happen, but applying it to every for loop is very different from a framework that speeds up a specific kind of pattern (which, as I said, is already a thing). The vast majority of loops don't benefit from this, which is why you don't see OpenMP pragmas on top of every for loop. The gains from doing it for every loop are likely minimal; you can introduce serious performance regressions when you guess wrong, because you don't know the data pattern within the loop (e.g. you introduce cache contention because multiple threads end up writing to the same cache line); and most loops don't benefit from parallelization in the first place, meaning you're doing a lot of work for no benefit. Maybe it's bad timing, but my hunch is that I'm unlikely to see something like this in my lifetime.
It may be a good idea to use a framework with explicitly stateless "tasks" and an orchestrator (parallel, distributed, or both). This is what Spark, TensorFlow, Beam and others do. They have a "parallel for" as well, but now, in addition to threads, you can use remote computers with a configuration change.
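A minimal Spark sketch of that idea (Scala, with a made-up process body): the same map runs on local threads or on a cluster depending only on the master setting.

    import org.apache.spark.sql.SparkSession

    object ParallelForSketch {
      // Hypothetical stateless "task" body: no shared mutable state,
      // so the orchestrator is free to run it anywhere.
      def process(record: Int): Int = record * 2

      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("parallel-for-sketch")
          .master("local[*]") // swap for a cluster URL to go distributed
          .getOrCreate()

        // The "parallel for": map a pure function over a partitioned dataset.
        val results = spark.sparkContext
          .parallelize(1 to 1000)
          .map(process)
          .collect()

        println(results.take(5).mkString(", "))
        spark.stop()
      }
    }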
Uhhh... we don't? It seems to me like we do. This is a solved problem. Depending on what you're trying to do, there's map, reduce, comprehensions, etc.
For example, choosing Scala on the JVM because that's what I know best, the language provides a rich set of maps, folds, etc., and the major libraries for different approaches to concurrency (futures, actors, effect systems) all provide ways to transform a collection of computations into a collection of concurrent operations.
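For instance, with plain standard-library futures (names made up):

    import scala.concurrent.ExecutionContext.Implicits.global
    import scala.concurrent.Future

    // Hypothetical lookup standing in for any independent computation.
    def fetchScore(userId: Long): Future[Int] = Future(userId.toInt % 100)

    val userIds = List(1L, 2L, 3L)

    // Future.traverse turns "a collection of computations" into
    // "a collection of concurrent operations" gathered into one Future.
    val scores: Future[List[Int]] = Future.traverse(userIds)(fetchScore)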
Curious if the poster who said "we don't have a really widely supported construct" works in a language that lacks a rich concurrency ecosystem or if they want support baked into their language.
"Break" is a dependency between iterations, and really only makes sense in a sequential iteration. In a parallel for loop, you can break from the current iteration, but the next is probably already running.
If you want any iteration to be able to cancel all others, they have to be linked somehow. Giving every task a shared cancellation token might be simplest. Or you turn your for loop into a sort of task pool that intelligently herds threads in the background and can consume and relay cancellation requests.
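A minimal sketch of the shared-token version (Scala, made-up names; real code would also want to handle errors and the "nothing found" case more carefully):

    import java.util.concurrent.atomic.AtomicBoolean
    import scala.concurrent.ExecutionContext.Implicits.global
    import scala.concurrent.Future

    // Shared cancellation token: any task can set it, every task checks it.
    final class CancelToken {
      private val flag = new AtomicBoolean(false)
      def cancel(): Unit = flag.set(true)
      def cancelled: Boolean = flag.get()
    }

    // A "parallel for with break": whoever finds the target cancels the rest.
    def parSearch(items: Seq[Int], target: Int): Future[Option[Int]] = {
      val token = new CancelToken
      val tasks = items.map { item =>
        Future {
          if (token.cancelled) None                         // cooperative "break"
          else if (item == target) { token.cancel(); Some(item) }
          else None
        }
      }
      Future.sequence(tasks).map(_.flatten.headOption)
    }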
But I agree, we need a new paradigm for parallel programming. For loops just don't cut it, despite being one of the most natural-feeling programming concepts.
C#'s Parallel.For and ForEach are a step in the right direction, but very unergonomic and unintuitive. I don't think we can get by with just bolting parallelism onto for loops; we need a fundamentally parallel concept. I assume it'd look something like CUDA programming, but I really don't know.
https://gfxcourses.stanford.edu/cs149/fall24/lecture/datapar...
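For reference, the "bolt parallelism onto the loop" style looks roughly like this with Scala's parallel collections (requires the scala-parallel-collections module on 2.13+): same loop shape, parallel execution, but still no real data-parallel model.

    // Requires the scala-parallel-collections module on Scala 2.13+.
    import scala.collection.parallel.CollectionConverters._

    // Same shape as a sequential map, executed on a thread pool.
    val squares: Long = (1 to 1000000).par.map(i => i.toLong * i).sum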