"1001 Ways to Write CUDA Kernels in Python"
https://www.youtube.com/watch?v=_XW6Yu6VBQE
"The CUDA Python Developer’s Toolbox"
https://www.nvidia.com/en-us/on-demand/session/gtc25-S72448/
"Accelerated Python: The Community and Ecosystem"
https://www.youtube.com/watch?v=6IcvKPfNXUw
"Tensor Core Programming in Python with CUTLASS 4.0"
https://www.linkedin.com/posts/nvidia-ai_python-cutlass-acti...
There is also Julia, the black swan that many outside the Python community have moved to, with much more mature tooling and first-tier Windows support, for those researchers who for whatever reason have Windows-issued work laptops.
https://info.juliahub.com/industries/case-studies
Mojo as a programming language seems interesting to a language nerd, but I think the jury is still out on whether this is going to be another Swift, or Swift for TensorFlow, in terms of market adoption, given the existing contenders.
So they are going after people who need to build low-latency, high-throughput inference systems.
Also as someone else pointed out, they also target all kinds of hardware, not just NVidia.
The primary advantage of Mojo seems to be GIL-free execution with syntax that is as close to Python as possible.
In Mojo it's pretty much the whole point of the language. If you're only using CPUs, then yeah, PyO3 is a good choice.
In the same way that LLVM allows CPU code to target more than one CPU architecture, MLIR/Mojo allows GPU code to target multiple vendor's GPUs.
There is some effort required to write the backend for a new GPU architecture, and Lattner has discussed it taking about two months for them to bring up H100 support.
They also don't support CPUs on Windows, except through WSL.
This may appeal to people wanting to run their code on different hardware brands for various reasons.
And for those that care, Julia is available today on different hardware brands, as there are other Python DSL JITs as well.
I agree they will get there, now the question is will they get there fast enough to matter, versus what the mainstream market cares about.
However, writing an LLVM backend for RISC-V sure did add support for a whole lot of different programming languages and the software you have access to through them in one fell swoop.
The same is true here.
Instead of rewriting all your GPU code every time you need to target a new GPU/TPU architecture, you just need a new backend.
Likewise, I would consider MIT and a few of the companies on that list relevant; it doesn't always need to be a FAANG.
He also discussed open sourcing Mojo and where the company expects to make its money.
FAQ:
> Why not make Julia better?
>
> We think Julia is a great language and it has a wonderful community, but Mojo is completely different. While Julia and Mojo might share some goals and look similar as an easy-to-use and high-performance alternative to Python, we're taking a completely different approach to building Mojo. Notably, Mojo is Python-first and doesn't require existing Python developers to learn a new syntax.
https://docs.modular.com/mojo/faq/#why-not-make-julia-better
Now:
> We oversold Mojo as a Python superset too early and realized that we should focus on what Mojo can do for people TODAY, not what it will grow into. As such, we currently explain Mojo as a language that's great for making stuff go fast on CPUs and GPUs.
> Julia: Julia is another great language with an open and active community. They are currently investing in machine learning techniques, and even have good interoperability with Python APIs.
Maybe I just misunderstand it from the presentation format.
Yes, many of the mainstream languages started as single-company products, but let's put it this way: would anyone be writing one of those languages today had they not been gatekept as the access path to a specific platform?
So outside of accessing MAX and its value proposition as a product enabler for XYZ, who would be rushing to write Mojo code instead of something else?
We have quite fantastic GPU compilation stuff too, and julia functions can be compiled to Nvidia, AMD, Intel, and Apple GPUs through their respective GPU compiler packages, and one can use KernelAbstractions.jl to write code that is GPU vendor agnostic and works on all of them.
We're also getting an (experimental) fully ahead-of-time compiler built into the language with v1.12 that spits out an executable or dylib.
Dylan was going to be the Newton's systems programming language, and while the language group lost to the C++ team (Apple had two competing teams for the Newton OS), it was still NewtonScript for everything in userspace, and it was getting a JIT by the time the project was canceled.
Objective-C is dynamically typed beyond the common subset with C, and was used even to write NeXTSTEP drivers.
I don't know how much of a chance Julia has against CUDA/ROCm/C++, especially now that everyone in the GPU space has decided to give Python feature parity on their hardware, via day-one bindings to the compute libraries and JIT DSLs, which gives Mojo even less of a chance than Julia has.
Julia has an established ecosystem, and a presence in the scientific community with ties to MIT.
Python is the champion, and most folks writing CUDA/ROCm/C++ are already using it.
So who would be reaching for Mojo, instead of Python JIT DSLs/bindings or Julia, when they have a Fortran/C/C++ allergy?
We for example built software that generates kernels on-demand that embed user functions for all 4 of these systems and showed it's much faster than just CUDA bindings for array functions for certain nonlinear systems (https://www.sciencedirect.com/science/article/abs/pii/S00457...)
Here is a link to the episode listing for that podcast, which might help.
I'm logged into YouTube; if you're not, perhaps it's something to do with that?
I liked mojo as a python superset. Wanted to be able to run arbitrary python through it and selectively change parts to use the new stuff.
A "pythonic language" sounds like that goal has been dropped, at which point the value prop is much less clear to me.
It was a highly aspirational goal, and practically speaking it's better right now to take inspiration from Python and have stronger integration hooks into the language (full disclosure: I work at Modular). We've specifically stopped using the "superset of Python" language to be more accurate about what the language is meant for right now.
Yep, that's right. `Int` behaving like a machine integer is very important for systems performance. Leaving the `int` namespace untouched allows us to have an object-based bigint in the future for compatibility with Python. As others have mentioned above, it is still a goal to be compatible with Python in time, just not a short-term priority.
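The practical difference is easy to show from the Python side; a minimal sketch, simulating 64-bit machine-integer wraparound with a mask (the helper name is made up for illustration):

```python
# CPython's int is an arbitrary-precision bigint: it never overflows.
big = (1 << 64) + 1  # perfectly fine in Python

# A machine-width unsigned 64-bit integer wraps around instead.
# Simulate that by masking to 64 bits after each operation.
MASK = (1 << 64) - 1

def machine_add(a: int, b: int) -> int:
    """Add two values with 64-bit unsigned wraparound semantics."""
    return (a + b) & MASK

print(machine_add(MASK, 1))  # wraps to 0, like a machine integer
print(MASK + 1)              # Python bigint just keeps growing: 2**64
```

The wraparound version maps directly onto a single machine instruction, which is exactly why a systems language wants `Int` to behave that way by default.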
Python is a hacky language that was never really designed, other than laying its eggs against the grain of what we seem to mostly agree is good, e.g. functional programming and composition.
Big tech spends a lot of money to avoid python in critical infra.
I'm just nervous how much VC funding they've raised and what kind of impacts that could have on their business model as they mature.
https://docs.modular.com/mojo/manual/get-started/
```
curl -fsSL https://pixi.sh/install.sh | sh

pixi init life \
  -c https://conda.modular.com/max-nightly/ -c conda-forge \
  && cd life
```
No, thanks. C++ is easier.

We've run a few events this year (disclosing again that I work for Modular, to put my comments into context), and we've had some great feedback from people who have never done GPU programming about how easy it was to get started using Mojo.
C++ is a powerful and mature language, but it also has a steep learning curve. There's a huge opportunity to make GPU (and other high-performance computing) programming easier, and platforms/languages like Triton and Julia are also exploring the space alongside Mojo. A big part of that opportunity is in abstracting away as much of the device-specific coding as you can.
I was a HPC C++ programmer for a long time, and I always found recompiling for new devices to be one of the most painful things about working in C++ (for example, the often necessary nightmare of cmake). Mojo offers a lot of affordances that improve on the programming experience at the language, compiler, and runtime levels.
It's unclear to me from this comment, what exactly your pushback is.
The space that I am interested in is execution-time-compiled programs. One use case is generating a perfect hash data structure. Say you have a config file that lists the keywords you want to find; then you dynamically generate the perfect hash data structure, compiled as if those keywords were compile-time values (because they are).
Or, if the number of keywords is too small, fall back to a linear search method. All done at compile time, without the cost of dynamic dispatch.
Of course, I am talking about Numba. But I think it is cursed by the fact that the host language is Python. Imagine if Python were more strongly typed; it would open up a whole new scale of optimization.
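A rough pure-Python sketch of that specialization idea (no Numba, and the names are invented): pick the lookup strategy once at startup from the config, then hand back a closure specialized to it:

```python
def build_matcher(keywords):
    """Return a lookup function specialized for this keyword set.

    Empty config: a no-op. Small sets get a linear scan over a tuple;
    larger sets get a hash-based membership test (a frozenset stands in
    for a generated perfect hash here).
    """
    if not keywords:
        return lambda word: False
    if len(keywords) <= 4:
        kws = tuple(keywords)
        return lambda word: word in kws   # linear search
    kws = frozenset(keywords)
    return lambda word: word in kws       # hash lookup

matcher = build_matcher(["if", "else", "while"])
print(matcher("while"), matcher("banana"))
```

A real execution-time compiler would go further and emit branch-free native code with the keywords baked in as constants; the closures here only capture the strategy choice, not the full specialization.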
Sadly the contenders in the corner get largely ignored, so we need to contend with special-cased JIT DSLs, or with writing native extensions, as in many cases CPython is the only implementation available.
Jitting like you mentioned is supported by the MAX graph API: https://docs.modular.com/max/tutorials/build-custom-ops. It could have a nicer syntax though to be more like Numba, I think you have an interesting idea there.
> The space that I am interested in is execution-time-compiled programs. One use case is generating a perfect hash data structure. Say you have a config file that lists the keywords you want to find; then you dynamically generate the perfect hash data structure, compiled as if those keywords were compile-time values (because they are).
I'm not sure I understand you correctly, but these two seem connected. If I were to do what you want to do here in Python I'd create a zig build-lib and use it with ctypes.
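For reference, the ctypes half of that approach looks roughly like this; libm stands in here for whatever shared library `zig build-lib` would produce, so the example runs without a build step (the library choice and signature are just for illustration):

```python
import ctypes
import ctypes.util

# Load a shared library. A `zig build-lib` artifact would be loaded the
# same way, e.g. ctypes.CDLL("./libkeywords.so"); libm is used here only
# because it is already present on the system.
libm = ctypes.CDLL(ctypes.util.find_library("m"))

# Declare the C signature so ctypes converts arguments and results
# correctly instead of defaulting everything to int.
libm.sqrt.argtypes = [ctypes.c_double]
libm.sqrt.restype = ctypes.c_double

print(libm.sqrt(2.0))
```

The same pattern works for any C-ABI function the library exports, which is what makes Zig (or C, or Rust with `extern "C"`) a convenient escape hatch from Python.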
```
python program.py --config <change this>
```
It is basically a recompilation of the whole program at every execution taking into account the config/machine combination.
So if the config contains no keywords for lookup, then the program should be able to be compiled into a no-op. Or if the config contains keywords that permit a simple perfect hash algorithm, then it should recompile itself to use that mechanism.
I don't think any of the typical systems programming languages allow this.
Is there a zero-copy interface for larger objects? How do object lifetimes work in that case? Especially if this is to be used for ML, you need to haul over huge matrices. And the GIL stuff is also a thing.
I wonder how Mojo handles all that.
That said, importing into Python this easily is a pretty big deal. I can see a lot of teams who just want to get unblocked by some performance thing, finding this insanely helpful!
This is not really true. Even though Mojo is adopting Python's syntax, it is a drastically different language under the hood. Mojo is innovating in many directions (e.g. MLIR integration, ownership model, comptime). The creators didn't feel the need to innovate on syntax in addition to all that.
I didn't mean to undermine the ambitious goals the project has. I still wish it was a little bolder on syntax though, Python is a large and complex language as is, so a superset of Python is inherently going to be a very complicated language.
So, the message is that it is possible to create nice Python bindings from Mojo code, but only if your Mojo code makes the effort to create an interface that uses PythonObject.
Useful, but I don’t see how that’s different from C code coding the same, as bindings go.
Both make it easier to gradually move Python code over to a compiled language.
Mojo presumably will have the advantage that porting from Python to Mojo is much closer to a copy paste job than porting Python to C is.
In all fairness, their website now reads: "Mojo is a pythonic language for blazing-fast CPU+GPU execution without CUDA. Optionally use it with MAX for insanely fast AI inference."
So I suppose it is now just a compiled language with superficially similar syntax and completely different semantics from Python?
That said, the upside is huge. If they can get to a point where Python programmers who need to add speed learn Mojo, because it feels more familiar and interops more easily than C/C++, that would be huge. And it's a much lower bar than a superset of Python.
I'd argue that I am not sure what kind of Python programmer is capable of learning things like comptime, borrow checking, generics but would struggle with different looking syntax. So to me this seemed like a deliberate misrepresentation of the actual challenges to generate hype and marketing.
Which fair enough, I suppose this is how things work. But it should be _fair_ to point out the obvious too.
To first order, today every programmer starts out as a Python programmer. Python is _the_ teaching language now. The jump from Python to C/C++ is pretty drastic; I don't think it's absurd that learning Mojo concepts step by step coming from Python is simpler than learning C. Not syntactically, but conceptually.
While I agree that using Mojo is much preferable to writing C or C++ native extensions, back in my day people learned to program in K&R C or C++ ARM in high school, kids around 12 years old, hardly something pretty drastic.
Get hold of Retro Gamer magazine for some of their stories.
I'm learning Rust and Zig in the hope that I'll never have to write a line of C in my career.
Just read K&R “The C programming language” book. It’s fairly small and it’s a very good introduction to C.
Getting to deployment is even harder. You can very easily end up writing exploitable, unsafe code in C.
If I were a Python programmer with little knowledge about how a computer works, I’d much prefer Go or Rust (in that order) to C.
But the term could also be used more generally to include stuff like pointer provenance, Rust's "stacked borrows" etc. In that case, Rust is more complicated than C-as-specified. But C-in-reality is much more complicated, e.g. see https://www.open-std.org/jtc1/sc22/wg14/www/docs/n2263.htm
C is simpler than Rust, but C is also _much_ simpler than Python. If I solve a problem in Python I have a good standard library of data types, and I use concepts like classes, iterators, generators, closures, etc... constantly. So if I move to Rust, I have access to similar high-level tools; I just have to learn a few additional concepts for resource management.

In comparison, C looks a lot more alien from that perspective, even starting with how you include library code from elsewhere.
And as tip for pointers, regardless of the programming language, pen and paper, drawing boxes and arrows, are great learning tools.
> I'd argue that I am not sure what kind of Python programmer is capable of learning things like comptime, borrow checking
One who previously wrote compiled languages ;-). It's not like you forget everything you know once you touch Python.
"... but would struggle with different looking syntax"
Although Python has some seriously Perl-esque YOLO moments, like `"#" * 3 == "###"`. This is admittedly useful, but funny nonetheless.
More broadly this is the same argument as whether overloading `+` for strings is a bad idea or not, and the associated points, e.g. the fact that this makes it non-commutative - the same all applies to `*` as well, and to lists as much as strings. At least Python is consistent here.
Although there is one particular aspect that is IMO just bad design: the way `x += y` and `x *= y` work. To remind, for lists these are not equivalent to `x = x + y` and `x = x * y` - instead of creating a new list, they mutate the existing one in place, so all the references observe the change. This is very surprising and inconsistent with the same operators for numbers, or indeed for strings and tuples.
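The list case is easy to demonstrate:

```python
a = [1, 2]
b = a            # b references the same list object as a
a += [3]         # in-place: calls list.__iadd__, mutating that list
print(b)         # [1, 2, 3] - the alias observes the change

c = (1, 2)
d = c
c += (3,)        # tuples are immutable: this rebinds c to a new tuple
print(d)         # (1, 2) - the alias is unaffected
```

So whether `x += y` is a mutation or a rebinding depends entirely on whether `type(x)` defines `__iadd__`, which is exactly the inconsistency being complained about.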
We cannot deny that Python has some interesting solutions, such as the std lib namedtuple implementation. It's basically a code template & exec().
I don't think these are necessarily bad, but they're definitely funny.
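For the curious, the same trick can be sketched in a few lines (this is not the stdlib's actual template, just the shape of it):

```python
def make_record_class(name, *fields):
    """Generate a tiny class from a source-code template, namedtuple-style."""
    args = ", ".join(fields)
    body = "".join(f"        self.{f} = {f}\n" for f in fields)
    source = (
        f"class {name}:\n"
        f"    def __init__(self, {args}):\n"
        f"{body}"
    )
    namespace = {}
    exec(source, namespace)   # compile and run the generated source
    return namespace[name]

Point = make_record_class("Point", "x", "y")
p = Point(3, 4)
print(p.x, p.y)
```

The stdlib does this (rather than, say, `__slots__` plus a loop) so that the generated `__init__`/`__new__` has ordinary positional parameters, good introspection, and no per-access indirection.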
What else was proclaimed just to raise VC money?
While much has changed since then, the architecture is effectively the same. Julia's native CUDA support simply boils down to compiling via the LLVM .ptx backend (Julia always generates LLVM IR, and the CUDA infrastructure "simply" retargets LLVM to .ptx, generates the binary, and then wraps that binary into a function which Julia calls), so it's really just a matter of the performance difference between the code generated by the LLVM .ptx backend vs the NVCC compiler.
I feel like that depends quite a lot on what exactly is in the non-subset part of the language. Being able to use a library from the superset in the subset requires being able to translate the features into something that can run in the subset, so if the superset is doing a lot of interesting things at runtime, that isn't necessarily going to be trivial.
(I have no idea exactly what features Mojo provides beyond what's already in Python, so maybe it's not much of an achievement in this case, but my point is that this has less to do with just being a superset and more to do with what exactly the extra stuff is, so I'm not sure I buy the argument that the marketing you mention is enough to conclude that this isn't much of an achievement.)
And so his improvements in Mojo, and now calling Mojo code from Python, are a lot more of a net positive for the community than being just some other AI infrastructure company.
So I do wish Mojo a lot of good luck. I have heard that Mojo isn't open source, but that there are plans to open it. I'd like to try it once if it's as fast as (or even a little slower than) Rust and as understandable as Python.
On the one hand, as the other poster noted, no one raises $100M+ for a programming language; programming languages have no ROI that would justify that kind of money. So to get it, they had to tell VCs a story about how they're going to revolutionize AI. It can't just be "Python superset with MLIR". That's not a $100M story.
On the other hand, they need to appeal to the dev community. Devs want open source, they want integration with their tools, and they don't want to be locked into an IP-encumbered ecosystem.
That's where the tension is. To raise money you need to pretend you're the next Oracle, but to get dev buy-in you have to promise you're not the next Oracle.
So the line they've decided to walk is: "We will be closed for now while we figure out the tech. Then later, once we have money coming in to make the VCs happy, we can try to make good on our promise to be open."
That last part is the thing people are having trouble believing. Because the story always goes: "While we had the best intentions to be open and free, that ultimately came secondary to our investors' goal of making money. Because our continued existence depends on more money, we have decided to abandon our goal of being open and free."
And that's what makes these VC-funded language plays so fraught for devs. Spend the time to learn this thing which may never even live up to its promises? Most people won't, and I think the Darklang group found that out pretty decisively.
> Further, we decided that the right long-term goal for Mojo is to adopt the syntax of Python (that is, to make Mojo compatible with existing Python programs) and to embrace the CPython implementation for long-tail ecosystem support
Which I don't think has changed.
But asking people to learn another language and migrate is a tall barrier.
Why not write python and transpile to mojo?
As part of this it has a stronger type/lifecycle/memory model than python.
Maybe you could write some level of transpiler, but so much of the optimization relies on things that Python does not expose (types), and there are things Python can do that are not supported.
```
3628800 Time taken: 3.0279159545898438e-05 seconds for mojo
3628800 Time taken: 5.0067901611328125e-06 seconds for python
```
[1] https://github.com/python/cpython/blob/0d9d48959e050b66cb37a...
[2] https://github.com/python/cpython/blob/0d9d48959e050b66cb37a...
If they can make calling Mojo from Python smooth it would be a great replacement for Cython. You also then get easy access to your GPU etc.
I’m always disappointed when I hear anything about mojo. I can’t even fully use it, to say nothing of examine it.
We all need money, and we like to have our incubators, but does the LLVM guy think like Jonathan Blow with Jai?
I don’t see the benefit of joining an exclusive club to learn exclusively-useful things. That sounds more like a religion or MLM than anything RMS ever said :p
I would not compare Chris Lattner with Jonathan Blow. Lattner is a person with a reputation for delivering programming languages and infrastructure; whereas for Blow, it seems like an ego issue. He's built a nice little cult of personality around his unreleased language, and releasing it will kill a lot of that magic (running a language project with actual users that has to make good on promises is much different than running a language project with followers and acolytes that can promise anything and never deliver on it).
Lattner has a record of actually delivering dev products people can download and use. Mojo is closed source to make raising money easier, but at least you can actually use it. Jai isn't even available for people to use, and after a decade of Blow dangling it in front of people, it's not clear it'll ever be available, because I'm not sure he wants it to be available.
Lol wut. For the life of me I cannot fathom what design decision in their cconv/ABI leads to this.
```
add_function(PythonObject)
add_function(PythonObject, PythonObject)
add_function(PythonObject, PythonObject, PythonObject)
```
There was a similar pattern in the Guava library years ago, where ImmutableList.of(…) would only support up to 20 arguments because there were 20 different instances of the method for each possible argument count.
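In a language with proper varargs the whole ladder collapses into one definition, which is presumably why the fixed-arity bridge stands out. A toy Python sketch of both styles (function names invented):

```python
# Fixed-arity style: one definition per argument count,
# as in the Guava ImmutableList.of(...) ladder.
def of1(a):        return (a,)
def of2(a, b):     return (a, b)
def of3(a, b, c):  return (a, b, c)

# Varargs style: a single definition covers every arity.
def of(*items):
    return tuple(items)

print(of3(1, 2, 3) == of(1, 2, 3))  # True
```

The fixed-arity ladder typically exists either because the ABI passes each argument in a known register slot, or (as in Guava's case) to avoid allocating a varargs array on every call.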
> As of February 2025, the Mojo compiler is closed source
And that's where the story begins and ends for me.
> Will Mojo be open-sourced?
> We have committed to open-sourcing Mojo in 2026. Mojo is still young, so we will continue to incubate it within Modular until more of its internal architecture is fleshed out.
Soon it might be though.
Has that stopped everyone before? Java, C#/.NET, Swift and probably more started out as closed-source languages/platforms, yet seemed to have been deployed to production environments before their eventual open-sourcing.
I guess the summary is that neither Java [at the time] nor .NET were profit centers for their owners, nor their only reason for existing
IDEs, and implementations for embedded and phones, were all paid products: the IDEs paid for by developers or their employers, the others by OEMs.
My point is about the early days (JCafe, Visual Age, Visual Studio, Forte), before free-beer IDEs for them became common.
Java side with Eclipse/Netbeans, .NET side with the Visual Studio Express editions.
What would be comparable driver for Mojo?