His example is:
sequence
.map(|x: T0| ...: T1)
.scan(|a: T1, b: T1| ...: T1)
.filter(|x: T1| ...: bool)
.flat_map(|x: T1| ...: sequence<T2>)
.collect()
It would be written in Futhark something like this:

sequence
|> map (\x -> ...)
|> scan (\x y -> ...)
|> filter (\x -> ...)
|> map (\x -> ...)
|> flatten

I haven't studied it in depth, but it's pretty readable.
But you're right, it would be interesting to see how the different approaches stack up to each other. The Pareas project linked above also includes an implementation using radix sort.
The example you showed is very much how I think about PRQL pipelines. Syntax is slightly different but semantics are very similar.
At first I thought that PRQL doesn't have scan but actually loop fulfills the same function. I'm going to look more into comparing those.
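To make the scan-vs-loop relationship concrete, here is a minimal sketch (plain Rust, purely illustrative; not PRQL and not how PRQL implements loop): a scan is just a fold/loop that also emits every intermediate accumulator value.

// Hedged illustration: a running-sum scan written as an ordinary loop.
fn scan_via_loop(xs: &[i64]) -> Vec<i64> {
    let mut acc = 0;
    let mut out = Vec::with_capacity(xs.len());
    for &x in xs {
        acc += x;      // the fold/loop step
        out.push(acc); // emitting intermediates turns the fold into a scan
    }
    out
}

fn main() {
    assert_eq!(scan_via_loop(&[1, 2, 3, 4]), vec![1, 3, 6, 10]);
}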
It is a joke, but an SQL engine can be massively parallel. You just don't know it; it just gives you what you want. And in many ways the operations resemble what you do, for example, in CUDA.
A CUDA backend for DuckDB or Trino would be one of my go-to projects if I were laid off.
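To make that resemblance concrete, here is a hedged sketch (Rust with the rayon crate, purely illustrative; not DuckDB or Trino code) of a GROUP BY-style aggregation written as partition, partial aggregate, merge, which is the same shape a GPU reduction takes: per-block partial results followed by a final combine.

// Illustrative only: a grouped sum (the core of GROUP BY key ... SUM(val))
// expressed as a data-parallel fold/reduce. Assumes the rayon crate.
use rayon::prelude::*;
use std::collections::HashMap;

fn group_sum(rows: &[(u32, i64)]) -> HashMap<u32, i64> {
    rows.par_iter()
        // Each worker builds a partial aggregate over its chunk of rows,
        // much like each GPU thread block producing a partial result.
        .fold(HashMap::new, |mut acc, &(key, val)| {
            *acc.entry(key).or_insert(0) += val;
            acc
        })
        // The partials are then merged, like a final device-wide reduction.
        .reduce(HashMap::new, |mut a, b| {
            for (k, v) in b {
                *a.entry(k).or_insert(0) += v;
            }
            a
        })
}

The point of the sketch is just the shape: per-chunk partials plus a merge step, which is also how GPU-side reductions are typically structured.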
What could be good is a relational + array model. I have some ideas at https://tablam.org, and I think building not just the language but also the optimizer in tandem will work out very nicely.
• Datalog is much, much better on these axes.
• Tutorial D is also better than SQL.
It solves all the warts of SQL while still being true to its declarative execution: trailing commas, the from statement coming first so queries read as a composable pipeline, temporary variables for expressions, intuitive grouping.
Sometimes when I have a problem, I just generate a bunch of "possible solutions" with a constraint solver (e.g. MiniZinc), which produces GBs of CSVs describing the candidate solutions, then let DuckDB analyze which ones are suitable. DuckDB is amazing.
Term rewriting languages probably work better for this than I would expect? It is kind of sad how little experience I have built up with that sort of thing, and I think I'm ahead of a large percentage of developers out there.
Raph is a super nice guy and a pleasure to talk to. I'm glad we have people like him around!
Hardware architectures like the Tera MTA were much more capable, but almost no one could write effective code for them even though the language was vanilla C++ with a couple of extra features. Then we learned how to write similar software architectures on standard CPUs. The same problem of people being bad at it remained.
The common thread in all of this is people. Humans as a group are terrible at reasoning about non-trivial parallelism. The tools almost don't matter. Reasoning effectively about parallelism involves manipulating a space that is quite evidently beyond most human cognitive abilities to reason about.
Parallelism was never about the language. Most people can't build the necessary mental model in any language.
To your point, we also didn't need a new language to adopt this paradigm. A library and a running system were enough (though, semantically, it did offer unique language-like characteristics).
Sure, it's a bit antiquated now that we have more sophisticated iterations for the subdomains it was most commonly used for, but it hit a kind of sweet spot between parallelism utility and complexity of knowledge or reasoning required of its users.
The syntax and semantics should constrain the kinds of programs that are easy to write in the language to ones that the compiler can figure out how to run in parallel correctly and efficiently.
That's how you end up with something like Erlang or Elixir.
Throwing InfiniBand or IP on top is really structurally more of the same.
Chapel definitely can target a single GPU.
Overall, it seems to be a really interesting problem!
Going the other direction, making channel runtimes run SIMD, is trivial.
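A hedged sketch of what that could look like in plain Rust (relying on compiler auto-vectorization rather than explicit SIMD intrinsics): the outer structure stays an ordinary channel runtime, while each worker's hot loop is a flat, branch-free pass over a contiguous buffer, i.e. the kind of loop that maps naturally onto SIMD lanes.

// Illustrative only: a channel-fed worker whose inner loop is elementwise
// arithmetic over a contiguous Vec<f32> -- the part a compiler can vectorize.
use std::sync::mpsc;
use std::thread;

fn main() {
    let (tx, rx) = mpsc::channel::<Vec<f32>>();

    let worker = thread::spawn(move || {
        let mut total = 0.0f32;
        for batch in rx {
            // Flat, branch-free loop over the batch: SIMD-friendly.
            total += batch.iter().map(|x| x * x).sum::<f32>();
        }
        total
    });

    for i in 0..4 {
        let chunk: Vec<f32> = (0..1024).map(|j| (i * 1024 + j) as f32).collect();
        tx.send(chunk).unwrap();
    }
    drop(tx); // closing the channel lets the worker's receive loop end

    println!("sum of squares = {}", worker.join().unwrap());
}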
Disclaimer: I have not watched the video yet.
Or basically a generic nestable `remote_parallel_map` for python functions over lists of objects.
I haven't had a chance to fully watch the video yet / I understand it focuses on lower levels of abstraction / GPU programming. But I'd love to know how this fits into what the speaker is looking for / what it's missing (other than it obviously not being a way to program GPUs) (also, full disclosure, I am a co-founder).
P.S. I'm joking, I do love Go, even though it's by no means a perfect language to write parallel applications with
Parallelism is trivial and front-and-center.
And no it's not a niche language. Don't listen to the army of Python technicians.
Nothing yet? Damn...
However, Erlang has very little to say about the parallelization of loops, or about the levels between a single loop and an HTTP request.
Nor would it be a good base for such things; if you're worried about getting maximum parallel performance out of your CPUs, you pretty much by necessity need to start from a base where single-threaded performance is already roughly optimal, such as C, C++, or Rust. Go at the very outside, and that's already a bit of a stretch in my opinion. BEAM does not have that level of single-threaded performance. There's no point in making BEAM fully utilize 8 CPUs on this sort of parallel workload when all that does is get you back to where a single thread of Rust can already run.
(I think this is an underappreciated aspect of trying to speed things up with multiple CPUs. There's no point straining to get 8 CPUs running in some sort of complicated perfect synchronization in your slow-ish language when you could just write the same thing in a compiled language and get it on one CPU. I particularly look at the people who think that GIL removal in Python is a big deal for performance and wonder what they're thinking... a 32-core machine parallelizing Python code perfectly, with no overhead, might still be outperformed by a single-core Go process and would almost certainly be beaten by a single-core Rust process. And perfect parallelization across 32 cores is a pipe dream. Unless you've already maxed out single-core performance, you don't need complicated parallelization, you need to write in a faster language to start with.)
The thing I would really like to see is some research on how to run the Erlang concurrency model on a GPU.
Some of the operations Erlang does, GPUs don't even want to do, including basic things like pattern matching. GPUs do not want that sort of code at all.
"Erlang" is being over specific here. No conventional CPU language makes sense on a GPU at all.
Erlang is a concurrency-oriented language, though its concurrency architecture (multicore/node/cluster/etc.) is different from that modeled by GPUs (vectorized/SIMD/SIMT/etc.). Since share-nothing Processes (the so-called Actor model) are at the heart of the Erlang Run-Time System (ERTS)/BEAM, it is easy to imagine a "group of Erlang processes" being mapped directly to a "group of threads in a warp on a GPU". Of course, the Erlang scheduler being different (it is reduction based, not time sliced), one would need to rethink some fundamental design decisions, but that should not be too out-of-the-way since the system as a whole is built for concurrency support. The other problem would be memory transfers between CPU and GPU (while still preserving immutability), but this is a more general one.
You can call out to CUDA/OpenCL/etc. from Erlang through its C interface (Kevin Smith did a presentation on this years ago), but I have seen no new research since then. However, there have been some new things in Elixir land, notably "Nx" (Numerical Elixir) and "GPotion" (a DSL for GPU programming in Elixir).
But note that none of the above is aimed at modifying the Erlang language/runtime concurrency model itself to map to GPU models, which is what I would very much like to see.