Reading further down the page it says you have to compile the python code using CPython, then generate binary code for its custom ISA. That's neat, but it doesn't "execute python directly" - it runs compiled binaries just like any other CPU. You'd use the same process to compile for x86, for example. It certainly doesn't "take regular python code and run it in silicon" as claimed.
A more realistic claim would be "A processor with a custom architecture designed to support python".
However, it doesn't seem like it does: the pyc still has to be further processed into machine code. So I also agree with the parent comment that this seems a bit misleading.
I could be convinced that that native code is sufficiently close to pyc that I don't feel misled. Would it be possible to write a boot loader which converts pyc to machine code at boot? If not, why not?
Anyway, the project is mega-cool, and very useful (in some specific applications). It's just that the title is a little bit confusing.
"runs directly on embedded hardware"
https://www.raspberrypi.com/documentation/microcontrollers/m...
I don't understand why they have the need to do this...
4o also ends many of its messages that way. It has to be related.
I'd love to read about the design process. I think the idea of taking bytecode aimed at the runtime of dynamic languages like Python or Ruby or even Lisp or Java and making custom processors for that is awesome and (recently) under-explored.
I'd be very interested to know why you chose to do this, why it was a good idea, and how you went about the implementation (in broad strokes if necessary).
There are definitely some limitations beyond just memory or OS interaction. Right now, PyXL supports a subset of real Python. Many features from CPython are not implemented yet — this early version is mainly to show that it's possible to run Python efficiently in hardware. I'd prefer to move forward based on clear use cases, rather than trying to reimplement everything blindly.
Also, some features (like heavy runtime reflection, dynamic loading, etc.) would probably never be supported, at least not in the traditional way, because the focus is on embedded and real-time applications.
As for the design process — I’d love to share more! I'm a bit overwhelmed at the moment preparing for PyCon, but I plan to post a more detailed blog post about the design and philosophy on my website after the conference.
Taking the concrete example of the `struct` module as a use-case, I'm curious if you have a plan for it and similar modules. The tricky part of course is that it is implemented in C.
Would you have to rewrite those stdlib modules in pure python?
CPython's struct module is just a shim importing the C implementations: https://github.com/python/cpython/blob/main/Lib/struct.py
Pypy's is a Python(-ish) implementation, leveraging primitives from its own rlib and pypy.interpreter spaces: https://github.com/pypy/pypy/blob/main/pypy/module/struct/in...
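To make the "rewrite in pure Python" question concrete, here is a minimal sketch of a tiny slice of `struct.pack` without the C `_struct` module. `pack_le` and `_SIZES` are hypothetical names; it handles only little-endian unsigned `B`, `H` and `I` with no padding, and it is not how PyPy or PyXL actually do it.

```python
import struct

# Byte widths for a tiny subset of struct format codes (unsigned only).
_SIZES = {"B": 1, "H": 2, "I": 4}

def pack_le(fmt, *values):
    """Pack unsigned integers little-endian without the C _struct module.

    Hypothetical helper: supports only the codes in _SIZES, no padding.
    """
    if len(fmt) != len(values):
        raise ValueError("format and values differ in length")
    out = bytearray()
    for code, value in zip(fmt, values):
        out += value.to_bytes(_SIZES[code], "little")
    return bytes(out)

# Cross-check against the C-backed implementation.
assert pack_le("HI", 1, 2) == struct.pack("<HI", 1, 2)
```

Even this toy leans on `int.to_bytes`, so whatever runs it still needs a fair number of built-in primitives.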
The Python stdlib has enormous surface area, and of course it's also a moving target.
It'll also be interesting to see how OP deals with things like dictionaries and lists.
In today's world this is generally not great.
Interpreted languages often include bytecode instructions that actually do very complex things and so do not nicely map to operations that can be sanely implemented in hardware. So you end up with all the usual boring alu, branch etc operations implemented in hardware, and anything else traps and runs a software handler.
Separately, interpreted language bytecode is often a poor fit for hardware execution; e.g. for dotnet (and python) bytecode many otherwise trivial operations do not explicitly encode information about types, and therefore the hardware must track type information in order to do the right thing (floating point addition looks very very different from integer addition!)
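The point about type information is easy to see from CPython itself. A small sketch using the standard `dis` module (the `add` function is just an illustration):

```python
import dis

def add(a, b):
    return a + b

# Names of the bytecode instructions CPython generated for add().
opnames = [ins.opname for ins in dis.get_instructions(add)]
# The addition is one generic instruction (BINARY_ADD up to 3.10,
# BINARY_OP from 3.11); nothing in it says "integer" or "float".
binary_ops = [name for name in opnames if name.startswith("BINARY_")]

# The same bytecode serves every operand type; choosing the right
# kind of addition happens at run time.
results = [add(1, 2), add(1.0, 2.5), add("py", "xl")]
print(binary_ops, results)
```

So whatever executes that instruction, in software or in silicon, has to inspect the operands before it knows which adder to use.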
A lot of effort has been spent on compiler optimisation for x86 and ARM code. JIT compilers benefit massively from this. Meanwhile, interpreted language bytecode is often very lightly optimised, where it is optimised at all (until relatively recently, explicit Python policy as set by Guido van Rossum was to never optimise!) Optimisation has the side effect of throwing away potentially valuable high level / semantic information; optimising at the bytecode level hinders debuggability for interpreted code (which is a primary goal in Python) and can also be detrimental to JIT output; and the results are underwhelming compared to JIT since your small team of plucky bytecode optimisers isn't really going to compete with decades of x86 compiler development; and so the incentive is to not do much of that.
So if you're running bytecode in hardware, on top of all the obvious costs, you are /running unoptimised code/. This is actually the thing that kills these projects - everything else can ultimately be solved by throwing more silicon at it, but this can only really be solved by JITting, and the existing JIT+x86 / JIT+ARM solution is cheap and battle tested.
https://en.m.wikipedia.org/wiki/Java_Card
To the point where most adult humans in the world probably own a Java-supported processor on a SIM card. Or at least an emulator (for eSIMs).
One example of a CPU arch used on JavaCard devices is the ARM926EJ-S, which I believe can execute Java bytecode.

In general I think the practical result is that x86 is like democracy. It’s not always efficient but there are other factors that make it the best choice.
The vast majority of computers in the world are not x86.
"Running a very small subset of python on an FPGA is possible with pyCPU. The Python Hardware Processsor (pyCPU) is a implementation of a Hardware CPU in Myhdl. The CPU can directly execute something very similar to python bytecode (but only a very restricted instruction set). The Programcode for the CPU can therefore be written directly in python (very restricted parts of python) ..."
I'm interested to see whether the final feature set will be larger than what you'd get by creating a type-safe language with a pythonic syntax and compiling that to native, rather than building custom hardware.
The background garbage collection thing is easier said than done, but I'm talking to someone who has already done something impressively difficult, so...
It almost sounds like you're asking for Nim ( https://nim-lang.org/ ); and there are some projects using it for microcontroller programming, since it compiles down to C (for ESP32, last I saw).
What are the fundamental physical limits here? Namely, timing precision, latency and jitter? How fast could PyXL bytecode react to an input?
For info, there is ARTIQ: vaguely similar thing that effectively executes Python code with 'embedded level' performance:
https://m-labs.hk/experiment-control/artiq/
ARTIQ is quite common in quantum physics labs. For that you need very precise and deterministic timing. Imagine you're interfering two photons as they reach a piece of glass, so that they can interact. It doesn't get faster than photons! That typically means nanosecond timing, sub-microsecond latency.
How ARTIQ does it is also interesting. The Python code is separate from the FPGA which actually executes the logic you want to do. In a hand-wavy way, you're then 'as fast' as the FPGA. How, though? The catch is, you have to get the Python code and FPGA gateware talking to each other, and that's technically difficult and has many gotchas. In comparison, although PyXL isn't as performant, if it makes it simpler for the user, that's a huge win for everyone.
Congrats once again!
- The vast majority of code generation would have to be dynamic dispatches, which would not be too different from CPython's bytecode.
- Types are dynamic; the methods on a type can change at runtime due to monkey patching. As a result, the compiler must be able to "recompile" a type at runtime (and thus, you cannot ship optimized target files).
- There are multiple ways every single operation in Python might be called; for instance `a.b` either does a __dict__ lookup or a descriptor lookup, and you don't know which method is used unless you know the type (and if that type is monkeypatched, then the method that is called might change).
A JIT compiler might be able to optimize some of these cases (observing what is the actual type used), but a JIT compiler can use the source file/be included in the CPython interpreter.
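A short illustration of the `a.b` point, assuming a made-up `Sensor` class: the same expression switches from a `__dict__` lookup to a descriptor call after the class is patched at run time.

```python
class Sensor:
    def __init__(self):
        self.value = 10          # stored in the instance __dict__

s = Sensor()
before = s.value                 # plain __dict__ lookup

# Monkeypatch the class at run time: `value` becomes a data descriptor.
# A property takes priority over the instance __dict__, so the very
# same expression `s.value` now calls a function instead.
Sensor.value = property(lambda self: 99)
after = s.value

print(before, after)
```

The old entry is still sitting in `s.__dict__`; it is just shadowed, which is exactly the kind of thing an ahead-of-time compiler cannot assume away.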
I'd add that even beyond types, late binding is fundamental to Python’s dynamism: Variables, functions, and classes are often only bound at runtime, and can be reassigned or modified dynamically.
So even if every object had a type annotation, you would still need to deal with names and behaviors changing during execution — which makes traditional static compilation very hard.
That’s why PyXL focuses more on efficient dynamic execution rather than trying to statically "lock down" Python like C++.
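A minimal sketch of the late binding described above (function names are illustrative): the callee is resolved by name on every call, so rebinding the name changes behaviour without touching the caller.

```python
def scale(x):
    return x * 2

def process(x):
    # `scale` is looked up by name each time process() runs,
    # not when process() is defined.
    return scale(x)

first = process(5)

def scale(x):          # rebinding the same name at run time
    return x * 10

second = process(5)
print(first, second)
```

Type annotations on `process` would not help here; the thing that changed is what the name refers to.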
PyPy and GraalPy are where the fun is; however, they are largely ignored outside their language research communities.
We are very far from having a full single user graphics workstation in CPython, even if those JITs aren't perfect.
Yes, there are a couple of ongoing attempts, while most in the community rather write C extensions.
I used those workstations back in the day—then rinsed and repeated with JITs and GCs for Self, Java, and on to finally Python in PyPy. They're fantastic! Love having them on-board. Many blessings to Deutsch, Ungar, et al. But for 40 years JIT's value has always been to optimize away the worst gaps, getting "close enough" to native to preserve "it's OK to use the highest level abstractions" for an interesting set of workloads. A solid success, but side by side with AOT compilation of closer-to-the-machine code? AOT regularly wins, then and now.
"Solved" should imply performance isn't a reason to utterly switch languages and abstractions. Yet witness the enthusiasm around Julia and Rust e.g. specifically to get more native-like performance. YMMV, but from this vantage, seeing so much intentional down-shift in abstraction level and ecosystem maturity "for performance" feels like JIT reduced but hardly eliminated the gap.
AFAIK there isn't an AOT compiler from JVM bytecode to native code that's competitive with either HotSpot or Graal, which are JIT compilers. But the JVM semantics are much less dynamic than Python or JS, whose JIT compilers don't perform nearly as well. Even Jython compiled to JVM bytecode and JITted with HotSpot is pretty slow.
However, LuaJIT does seem to be competitive with AOT-compiled C and with HotSpot, despite Lua being just as dynamic as Python and more so than JS.
AOT winning over JITs on micro benchmarks hardly counts as a win in a meaningful way for most business applications, especially when JIT caches and PGO data sharing across runs are part of the picture.
Sure, there are always going to be use cases that require AOT, and in most of them it is due to deployment constraints more than anything else.
Most mainstream devs don't even know how to use PGO tooling correctly from their AOT toolchains.
Heck, how many Electron apps do you have running right now?
Some years ago there was an attempt to create a Linux distribution including a Python userspace, called Snakeware, but the project has since gone inactive. See https://github.com/joshiemoore/snakeware
When type annotations are available, it's already possible to compile Python to improve performance, using Mypyc. See for example https://blog.glyph.im/2022/04/you-should-compile-your-python...
Those, at runtime (and, nowadays, optionally also at compile time), convert that to native code. Python doesn’t; it runs a bytecode interpreter.
Reason Python doesn’t do that is a mix of lack of engineering resources, desire to keep the implementation fairly simple, and the requirement of backwards compatibility of C code calling into Python to manipulate Python objects.
If you define it as trying to compile Python in such a way that you would get the ability to do optimizations and get performance boosts and such, you end up at PyPy. However that comes with its own set of tradeoffs to get that performance. It can be a good set of tradeoffs for a lot of projects but it isn't "free" speedup.
Optimized libraries (e.g. numpy, Pandas, Polars, lxml, ...) are the idiomatic way to speed up "the parts that don't need to be in pure Python." Python subsets and specializations (e.g. PyPy, Cython, Numba) fill in some more gaps. They often use much tighter, stricter memory packing to get their speedups.
For the most part, with the help of those lower-level accelerations, Python's fast enough. Those who don't find those optimizations enough tend to migrate to other languages/abstractions like Rust and Julia because you can't do full Python without the (high and constant) cost of memory access.
Even type annotations would have to be anointed with semantics, which (IIUC) they have none today (w/CPython AFAIK). They are just annotations for use by static checkers.
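For anyone who hasn't seen this in action, a quick sketch of annotations carrying no run-time semantics in CPython (`double` is an illustrative name):

```python
def double(x: int) -> int:
    return x * 2

# CPython stores the hints but never enforces them.
hints = double.__annotations__
result = double("ab")    # a str slips straight through the `int` hint
print(hints, result)
```

A compiler that wanted to trust `x: int` would have to add a check or a rule that the language itself does not have.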
Unless you can perform optimizations, the compilation can't make a whole bunch of progress beyond that bytecode.
* In fact, IIRC there was/is some "freeze" program that would do just that: compile your python program. Under the covers it would bundle libpython with your *.pyc bytecode.
I have seen people do that for closed-source software that is distributed to end-users, because it makes reverse engineering and modding (a bit) more complicated.
Where’s the AOT compiler that handles the whole Python language?
It’s not routine because it’s not even an option, and people who are concerned either use the tools that let them compile a subset of Python within a larger, otherwise-interpreted program, or use a different language.
For example if you compile x + y in C, you'll get a few clean instructions that add the data types of x and y. But if you compile this thing in some sort of Python compiler it would essentially have to include the entire Python interpreter; because it can't know what x and y are at compile time, there necessarily has to be some runtime logic that is executed to unwrap values, determine which "add" to call, and so forth.
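Roughly, the run-time logic behind `x + y` looks like the following. This is a simplified sketch, not CPython's actual implementation; `generic_add` is a hypothetical name, and it leaves out the rule that lets a right operand of a subclass type try `__radd__` first.

```python
def generic_add(x, y):
    """Simplified sketch of what `x + y` has to do at run time."""
    # 1. Ask the left operand's type.
    add = getattr(type(x), "__add__", None)
    if add is not None:
        result = add(x, y)
        if result is not NotImplemented:
            return result
    # 2. If the types differ, give the right operand's type a chance.
    if type(x) is not type(y):
        radd = getattr(type(y), "__radd__", None)
        if radd is not None:
            result = radd(y, x)
            if result is not NotImplemented:
                return result
    # 3. Nobody knew how to add these two.
    raise TypeError(
        f"unsupported operand type(s) for +: "
        f"{type(x).__name__!r} and {type(y).__name__!r}"
    )

print(generic_add(1, 2), generic_add(1, 2.0), generic_add("a", "b"))
```

For `1 + 2.0` the int side declines and the float side's `__radd__` produces the answer, all decided while the program runs.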
If you don't want to include the interpreter, then you'll have to add some sort of static type checker to Python, which is going to reduce the utility of the language and essentially bifurcate it into annotated code you can compile, and unannotated code that must remain interpreted at runtime that'll kill your overall performance anyway.
That's why projects like Mojo exist and go in a completely different direction. They are saying "we aren't going to even try to compile Python. Instead we will look like Python, and try to be compatible, but really we can't solve these ecosystem issues so we will create our own fast language that is completely different yet familiar enough to try to attract Python devs."
For most code, you don't need static typing for most overloaded operators to get decent performance, either. From my experience with Ur-Scheme, even a simple prediction that most arithmetic is on (small) integers with a runtime typecheck and conditional jump before inlining the integer version of each arithmetic operation performs remarkably well—not competitive with C but several times faster than CPython. It costs you an extra conditional branch in the case where the type is something else, but you need that check anyway if you are going to have unboxed integers, and it's smallish compared to the call and return you'll need once you find the correct overload to call. (I didn't implement overloading in Ur-Scheme, just exiting with an error message.)
Even concatenating strings is slow enough that checking the tag bits to see if you are adding integers won't make it much slower.
Where this approach really falls down is choosing between integer and floating point math. (Also, you really don't want to box your floats.)
And of course inline caches and PICs are well-known techniques for handling this kind of thing efficiently. They originated in JIT compilers, but you can use them in AOT compilers too; Ian Piumarta showed that.
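For readers who haven't met inline caches, a minimal sketch of a monomorphic one for a single `obj.method()` call site. `CallSite` is a hypothetical class; a real cache would also have to be invalidated when a class is monkeypatched, which this sketch does not do.

```python
class CallSite:
    """Monomorphic inline cache for one `obj.method()` call site."""

    def __init__(self, name):
        self.name = name
        self.cached_type = None
        self.cached_func = None
        self.misses = 0

    def call(self, obj, *args):
        tp = type(obj)
        if tp is not self.cached_type:        # cache miss: full lookup
            self.misses += 1
            self.cached_type = tp
            self.cached_func = getattr(tp, self.name)
        return self.cached_func(obj, *args)   # cache hit: direct call

site = CallSite("upper")
outputs = [site.call(s) for s in ("a", "b", "c")]
print(outputs, site.misses)
```

Three calls, one lookup: as long as the receiver type at this site stays the same, the expensive part runs once.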
Smaller binaries, faster execution, proper metaprogramming, actual type safety, and you don't need to bundle a whole interpreter just to say "hello world"
* Could you share the assembly language of the processor?
* What is the benefit of designing the processor and making a Python bytecode compiler for it, vs making a bytecode compiler for an existing processor such as ARM/x86/RISCV?
HDL: Verilog
Assembly: The processor executes a custom instruction set called PySM (Not very original name, I know :) ). It's inspired by CPython Bytecode — stack-based, dynamically typed — but streamlined to allow efficient hardware pipelining. Right now, I’m not sharing the full ISA publicly yet, but happy to describe the general structure: it includes instructions for stack manipulation, binary operations, comparisons, branching, function calling, and memory access.
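Since the ISA isn't public, here is a generic sketch of what a stack-based instruction set with those categories can look like. The opcode names below are invented for illustration and are not PySM instructions.

```python
def run(program):
    """Interpret a list of (opcode, operand) pairs on an operand stack."""
    stack, variables, pc = [], {}, 0
    while pc < len(program):
        op, arg = program[pc]
        pc += 1
        if op == "PUSH":
            stack.append(arg)
        elif op == "LOAD":
            stack.append(variables[arg])
        elif op == "STORE":
            variables[arg] = stack.pop()
        elif op == "ADD":
            rhs, lhs = stack.pop(), stack.pop()
            stack.append(lhs + rhs)
        elif op == "LT":
            rhs, lhs = stack.pop(), stack.pop()
            stack.append(lhs < rhs)
        elif op == "JUMP":
            pc = arg
        elif op == "JUMP_IF_FALSE":
            if not stack.pop():
                pc = arg
        else:
            raise ValueError(f"unknown opcode {op!r}")
    return variables

# total = 0; i = 0; while i < 5: total += i; i += 1
program = [
    ("PUSH", 0), ("STORE", "total"),                    # 0-1
    ("PUSH", 0), ("STORE", "i"),                        # 2-3
    ("LOAD", "i"), ("PUSH", 5), ("LT", None),           # 4-6
    ("JUMP_IF_FALSE", 17),                              # 7: exit loop
    ("LOAD", "total"), ("LOAD", "i"), ("ADD", None),    # 8-10
    ("STORE", "total"),                                 # 11
    ("LOAD", "i"), ("PUSH", 1), ("ADD", None),          # 12-14
    ("STORE", "i"),                                     # 15
    ("JUMP", 4),                                        # 16: loop back
]
print(run(program))
```

Every instruction only touches the top of the stack, which is presumably part of what makes this style attractive to pipeline in hardware.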
Why not ARM/X86/etc... Existing CPUs are optimized for static, register-based compiled languages like C/C++. Python’s dynamic nature — stack-based execution, runtime type handling, dynamic dispatch — maps very poorly onto conventional CPUs, resulting in a lot of wasted work (interpreter overhead, dynamic typing penalties, reference counting, poor cache locality, etc.).
Right now, PyXL runs fully in-order with no speculative execution. This is intentional for a couple of reasons: First, determinism is really important for real-time and embedded systems — avoiding speculative behavior makes timing predictable and eliminates a whole class of side-channel vulnerabilities. Second, PyXL is still at an early stage — the focus right now is on building a clean, efficient architecture that makes sense structurally, without adding complex optimizations like speculation just for the sake of performance.
In the future, if there's a clear real-world need, limited forms of prediction could be considered — but always very carefully to avoid breaking predictability or simplicity.
Your example contains some integer arithmetic, I'm curious if you've implemented any other Python data types like floats/strings/tuples yet. If you have, how does your ISA handle binary operations for two different types like `1 + 1.0`, is there some sort of dispatch table based on the types on the stack?
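To spell out what I mean by a dispatch table, here is a sketch in software. It is purely hypothetical (`ADD_TABLE` and `dispatch_add` are made-up names) and says nothing about how PyXL actually handles mixed types.

```python
import operator

# Hypothetical dispatch table keyed by the types of the two operands
# on top of the stack.
ADD_TABLE = {
    (int, int): operator.add,
    (float, float): operator.add,
    (int, float): lambda a, b: float(a) + b,   # promote the int first
    (float, int): lambda a, b: a + float(b),
    (str, str): operator.add,
}

def dispatch_add(a, b):
    handler = ADD_TABLE.get((type(a), type(b)))
    if handler is None:
        raise TypeError(
            f"no ADD handler for {type(a).__name__} and {type(b).__name__}"
        )
    return handler(a, b)

print(dispatch_add(1, 1), dispatch_add(1, 1.0), dispatch_add("a", "b"))
```

In hardware the key would presumably be a pair of type tags rather than Python type objects, but the lookup shape is the same.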
Edit: Just want to mention that this sounds like a super interesting project. I have to admit that I struggled to see where python was run on the hardware when mentioning custom toolchains and a compilation step. But the important aspect is that your hardware runs this similar to how a vm would run it with all dynamic aspects of the language included. I wonder similar to a parent comment if something similar for wasm would be worth having.
Perhaps they don't need to be interruptible if there's no virtual memory.
How does it allocate memory? Malloc and free are pretty complex to do in hardware.
The main thing that appealed to me about this idea is that it would require a two-dimensional program counter. As I recall from the original specification, skipping through blank space is supposed to take O(1) time, but I didn't plan on implementing that. I did, however, imagine a machine with 256x256 bytes of memory, where some 80x25 (or 24?) region was reserved as directly memory-mapped to a character display (and protected at boot by surrounding it with jump instructions).
I’ve been using .NET since 2001 so maybe I have it confused with something else, but at the same time a lot of the web from that era is just gone, so it’s possible something like this did exist but didn’t gain any traction and is now lost to the ether.
https://en.m.wikipedia.org/wiki/Azure_Linux?utm_source=chatg...
Every time I see a project that has a great implementation on an FPGA, I lament the fact that Tabula didn’t make it, a truly innovative and fast FPGA.
fun :-)
but did I get it right?
It's a custom stack-based hardware processor tailored for executing Python programs directly. Instead of traditional microcode, it uses a Python-specific instruction set (PySM) that hardware executes.
The toolchain compiles Python → CPython Bytecode → PySM Assembly → hardware binary.
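The first arrow in that chain is plain CPython and can be reproduced with the standard library; the later stages are PyXL-specific and not public, so they are not shown. A small sketch (the function `f` is just an example):

```python
import dis

source = "def f(a, b):\n    return a + b\n"

# Stage 1: CPython compiles source text to a code object holding bytecode.
module_code = compile(source, "<example>", "exec")
namespace = {}
exec(module_code, namespace)
f_code = namespace["f"].__code__

# The raw bytes a translator would consume, plus a decoded view of them.
raw = f_code.co_code
listing = [(ins.opname, ins.argrepr) for ins in dis.get_instructions(f_code)]
for opname, argrepr in listing:
    print(f"{opname:<20} {argrepr}")
```

The exact instruction names printed depend on the CPython version you run this on.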
- CPython Bytecode is far from stable; it changes every version, sometimes changing the behaviour of existing bytecodes. As a result, you are pinned to a specific version of Python unless you make multiple translators.
- CPython Bytecode is poorly documented, with some descriptions being misleading/incorrect.
- CPython Bytecode requires restoring the stack on exception, since it keeps a loop iterator on the stack instead of in a local variable.
I recommend instead doing CPython AST → PySM Assembly. CPython AST is significantly more stable.
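For comparison, this is what the AST route starts from, using the standard `ast` module (the statement parsed is arbitrary):

```python
import ast

tree = ast.parse("total = x + y")

# Node class names in traversal order; these track the language grammar,
# not the interpreter's instruction set.
node_names = [type(node).__name__ for node in ast.walk(tree)]
assign = tree.body[0]
print(node_names)
print(type(assign.value).__name__, type(assign.value.op).__name__)
```

Node types like `Assign`, `BinOp` and `Add` follow the syntax of the language, so a translator built on them only has to change when the syntax does.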
You're absolutely right that CPython bytecode changes over time and isn’t perfectly documented — I’ve also had to read the CPython source directly at times because of unclear docs.
That said, I intentionally chose to target bytecode instead of AST at this stage. Adhering to the AST would actually make me more vulnerable to changes in the Python language itself (new syntax, new constructs), whereas bytecode changes are usually contained to VM-level behavior. It also made it much easier early on, because the PyXL compiler behaves more like a simple transpiler — taking known bytecode and mapping it directly to PySM instructions — which made validation and iteration faster.
Either way, some adaptation will always be needed when Python evolves — but my goal is to eventually get to a point where only the compiler (the software part of PyXL) needs updates, while keeping the hardware stable.
- In Python 3.10, jumps changed from absolute indices to relative indices
- In Python 3.11, the cell variable index is calculated differently for cell variables corresponding to parameters and cell variables corresponding to local variables
- In Python 3.11, MAKE_FUNCTION has the code object at the TOS instead of the qualified name of the function
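You can see this kind of churn from inside the interpreter. A small sketch using only the standard library; it relies on one change I'm sure of, that `BINARY_ADD` was replaced by the generic `BINARY_OP` in 3.11:

```python
import dis
import sys
import importlib.util

# Opcode names and numbers are private to each CPython release.
version = sys.version_info[:2]
has_binary_add = "BINARY_ADD" in dis.opmap   # present up to 3.10
has_binary_op = "BINARY_OP" in dis.opmap     # introduced in 3.11

# .pyc files start with a magic number that changes whenever the
# bytecode format does, which is how CPython rejects stale bytecode.
magic = importlib.util.MAGIC_NUMBER
print(version, has_binary_add, has_binary_op, magic)
```

A bytecode translator effectively needs a table like `dis.opmap` per supported Python version.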
For what it's worth, I created a detailed description of the behaviour of each opcode (along with example Python sources) here: https://github.com/TimefoldAI/timefold-solver/blob/main/pyth... (for up to Python 3.11).
In case you weren't aware, they give you a 200 x 150 um tile on a shared chip. There is then some helper logic to mux between the various projects on the chip.
Can you give me the scoop on Python, the language? I see things like this project, and it seems very impressive, but being an outsider to the language, I don't "get" it. More specifically: I'm curious to hear thoughts on a) what made this difficult prior to now (with Python), b) why Python is useful for this, and c) what are your thoughts on Python itself?
To add some more context:
I know a lot of developers who work with Python (Flask); some love it, some hate it (as with any language). My experience has been mainly via homelab/OSS tools that all seem to embrace the language. And yet while the language itself seems very straightforward and easy to use, my experience with the Python _ecosystem_ (again, as an outsider) has been... difficult.
Python 2 vs 3, virtual environments, libraries for each version, etc. It feels as though anytime I've had to use it outside a pre-built Docker container, these issues result in throwing spaghetti at the wall trying to figure out how to even get it working at all. As a PHP/Go dev, it's one of the languages for which I could see myself having a real interest, but this has so far made me hesitant (and I don't want to be).
a) simple b) limited
The language really took off when developers took this simple limited language and pushed it to its very limits using C extensions. The data science explosion opened up the language to a very wide user base.
So to answer your 3 questions: a) Python is not a fast language by any means. There is a lot of overhead in every function call that makes it almost impossible for low latency/real-time use cases. b) I don't think Python is particularly the best language for this. This is just a demonstration of someone building their own custom toolchain to show what is possible with just pure Python. The author has highlighted why they think this is interesting on the website. c) I keep thinking Python will go away soon, and we will see a much better alternative. But the reality is Python is entrenched deeply just like JavaScript. A lot of smart people are putting in a lot of effort to make it better. Personally the ecosystem and packaging story does not annoy me much, but the lack of proper threading (GIL) has hurt my projects more than once.
For your particular pain point, the current community recommended solution is to use uv (https://github.com/astral-sh/uv). There were several detours (pip, pyenv, pipenv, poetry etc.) the community took before they got behind this.
Python is going in the right directions in terms of all the deployability and big issues but it should have been where it is now 7 years ago. Specifically, I sketched out a system that worked like uv but was written in pure Python, I didn't start on it for two reasons: (a) the bootstrapping problem that I couldn't ever stop devs from trashing the Python that it runs in, and (b) from lots of trying it didn't seem possible to convince most Pythoners that pip was broken or that it mattered... uv solved (a) by removing Python from the bootstrap and (b) by being crazy fast.
c) It’s a monstrous dumpster fire and getting worse over time, but so is everything else (in the same space). I like Go, but I can see how it’s not for everyone.
To answer your questions in order,
a) I haven't done much work with embedded Python, but like any dynamically-typed language that runs in a VM there's a lot of runtime infrastructure that adds latency, complexity, energy consumption, bundle size, etc. It sounds like this project aims to remove the vast majority of that. So take startup time, for instance: Normal Python takes ~50ms to fire up the interpreter and get into actual user code. If I'm understanding it correctly, with PyXL that would be vastly lower. Although I guess the ARM chip still has to load the code onto the FPGA, so maybe not, idk.
b) and c) are kind of the same question, to me - at least, "why use Python for embedded" is a subset of "why use Python at all."
For me, Python more than any other language is great at getting out of its own way, so that you can spend your precious brain energy on whatever problem you're solving and less on the tool you're using to solve it. This is maybe less true in recent years, as later Pythons have added a lot more complex features (like async/await, for instance, which I actually really like in Python but definitely adds complexity to the language).
Finally, I think a lot of it comes down to personal style/taste/chance (i.e. if Python is the first language you encounter, you're probably more likely to end up liking Python.) The Zen of Python[0], which you may have seen, does a good job of explaining the Python way of approaching problems, although like I said a few of those principles have been less-rigidly adhered to in recent years (like "there should be only one way to do it.")
If you hang out in Python circles, you'll probably come across the phrase "Python fits your brain." I'm not sure where it was originally coined but it very definitely describes my experience with Python: it (mostly) just works like I expect it to, whether that's with regard to syntax, semantics, stdlib, etc.
Not that it doesn't have its bad points, of course. Dependency management, as you mentioned, can be a bit hellish at times. A lot of it comes down to the fact that dependencies in Python were originally conceived as systemwide state, much like dynamically-loaded C libs on Linux. This works fine until you need to use two different, mutually-incompatible versions of the same lib, at which point all hell breaks loose. There have been various attempts to improve on this more recently, so far uv[1] looks pretty promising, but time will tell.
The one saving grace of Python dependencies is that it has a very rich standard library, so the average Python project tends to have way fewer total dependencies than the average project in, say, JS or Rust.
The typing story for Python is also a bit lacking. Yes, there are now optional type hints and things like MyPy to make use of them, but even if your own code is all completely typed, in my experience it's usually not long before you need to call out to something that isn't well-typed and then your whole house of cards starts to fall apart.
Anyway, just my rambling $0.02.
And yet for simple little standalone programs and notebooks, particularly for science, it is super simple and natural to turn to it.
1) Perl kind of shooting itself in the foot 20 years ago and Python becoming the de facto scripting language for Linux distributions that needed to do anything more complicated than was suitable for shell scripts but didn't require entirely new compiled software projects.
2) The above meant Python is almost always available and a good tool to have handy if you need to do something one-off and simple but more complicated than what you can do with a built-in calculator app. For instance, ever curious if you can pull the exponents off of x509 certificates and manually verify signatures by hand? Pretty easy to do in Python.
3) The C API and compiled modules made it possible to link against pre-existing BLAS implementations, and the extensible syntax and user-defined operators made it possible to mimic the style of MATLAB and R. Thus, Python became a popular choice as a lingua franca for engineers, scientists, and stats geeks who just wanted to do some data exploration or modeling and weren't trying to create shippable software.
4) MIT decided to make Python its primary teaching language in the early 2000s or so and a lot of CS programs in the US followed suit.
5) It became possible at some point to write Microsoft Office macros in Python, giving marginally technical business types a nice option to learn that was more broadly useful than VB script to automate their own workflows.
Why it ever became so popular among actual software developers I have a harder time answering, but for research, exploratory work, prototyping, scripting, workflow automation, it's as good as anything else you can come up with, usually already available, and it has an extremely "batteries included" standard library that means you probably don't need to worry about the kind of ecosystem dependency hell you're envisioning here.
Possibly some factors include the rise of LeetCode, as Python's "executable pseudocode" style means it is very easy to find or translate examples of algorithm implementations into Python solutions for learning, and the fact that a large trend of the post big data era is trying to turn exploratory data analysis pipelining tasks into real software, along with people who used to brand themselves as "data scientists" deciding to become software developers instead, and already knowing Python.
Python also gives you a pretty good first order approximation of a solution when you want to turn some researcher's data model into a service, provided your app is also written in Python. This has become far less important these days with data APIs, ML APIs, standardized formats for model serialization, but previously, a very popular solution to the so-called "two language problem" was just making Python fast enough to let it be both languages itself rather than trying to add web app frameworks to Julia.
The ecosystem is massive and the core team just keeps adding more and more dubious language features and syntax.
Realistically, Python should have been "done" after async/await and fixing str vs bytes.
Good question. In theory, you can compile anything Turing-complete to anything else — ARM and Python are both Turing-complete. But practically, Python's model (dynamic typing, deep use of the stack) doesn't map cleanly onto ARM's register-based, statically-typed instruction set. PySM is designed to match Python’s structure much more naturally — it keeps the system efficient, simpler to pipeline, and avoids needing lots of extra translation layers.
So, no using C libraries. That takes out a huge chunk of pip packages...
That said, in future designs, PyXL could work in tandem with a traditional CPU core (like ARM or RISC-V), where C libraries execute on the CPU side and interact with PyXL for control flow and Python-level logic.
There’s also a longer-term possibility of compiling C directly to PyXL’s instruction set by building an LLVM backend — allowing even tighter integration without a second CPU.
Right now the focus is on making native Python execution viable and efficient for real-time and embedded systems, but I definitely see broader hybrid models ahead.
> Runs a subset of Python
What's the advantage of using a new custom toolchain, custom instruction set and custom processor over existing tools that compile a subset of Python for existing CPUs? - e.g. Cython, Nuitka etc?
I'd definitely be interested in how this project progresses, particularly if it adds support for integration to the CPU. Some tie-in to the Pynq project could be super fun.
Having this, the next tempting step is to make the `print` function work, then the filesystem wrapper, etc.
btw, what I'm missing is clear information about the limitations. It's definitely not true that I can take any Python snippet and run it using PyXL (threads, for example, I suppose?)
Peripheral drivers (like UART, SPI, etc.) are definitely on the roadmap - They'd obviously be implemented in HW. You're absolutely right — once you have basic IO, you can make things like print() and filesystem access feel natural.
Regarding limitations: you're right again. PyXL currently focuses on running a subset of real Python — just enough to show it's real python and to prove the core concept, while keeping the system small and efficient for hardware execution. I'm intentionally holding off on implementing higher-level features until there's a real use case, because embedded needs can vary a lot, and I want to keep the system tight and purpose-driven.
Also, some features (like threads, heavy runtime reflection, etc.) will likely never be supported — at least not in the traditional way — because PyXL is fundamentally aimed at embedded and real-time applications, where simplicity and determinism matter most.
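To make the "subset" idea concrete, here is a rough sketch of how a toolchain could flag constructs it does not support before compiling anything. This is not PyXL's actual checker or its real list of limitations. The names `BANNED_IMPORTS`, `BANNED_CALLS` and `find_unsupported` are hypothetical, and the banned sets are only guesses based on the features mentioned above (threads, runtime reflection, dynamic loading).

```python
import ast

# Assumed for illustration only; not PyXL's actual unsupported-feature list.
BANNED_IMPORTS = {"threading", "importlib"}
BANNED_CALLS = {"eval", "exec", "__import__"}


def find_unsupported(source):
    """Return sorted (line, description) pairs for constructs outside the subset."""
    problems = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            for alias in node.names:
                if alias.name.split(".")[0] in BANNED_IMPORTS:
                    problems.append((node.lineno, f"import {alias.name}"))
        elif isinstance(node, ast.ImportFrom):
            if node.module and node.module.split(".")[0] in BANNED_IMPORTS:
                problems.append((node.lineno, f"from {node.module} import ..."))
        elif isinstance(node, ast.Call) and isinstance(node.func, ast.Name):
            if node.func.id in BANNED_CALLS:
                problems.append((node.lineno, f"call to {node.func.id}()"))
    return sorted(problems)


SAMPLE = "import threading\nx = eval('1 + 1')\ny = x + 1\n"
problems = find_unsupported(SAMPLE)
for line, what in problems:
    print(f"line {line}: {what}")
```

A check like this only catches the obvious static cases; dynamic behavior would still need to be documented separately.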
Do you plan to have AMBA or Wishbone Bus support?
PyXL already communicates with the ARM side over AXI today (Zynq platform).
Is it tied to a particular version of python?
Right now, PyXL is tied fairly closely to a specific CPython version's bytecode format (I'm targeting CPython 3.11 at the moment).
That said, the toolchain handles translation from Python source → CPython bytecode → PyXL Assembly → hardware binary, so in principle adapting to a new Python version would mainly involve adjusting the frontend — not reworking the hardware itself.
Longer term, the goal is to stabilize a consistent subset of Python behavior, so version drift becomes less painful.
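The first two stages of that pipeline (source to CPython bytecode) can be reproduced with the standard library alone, which also shows why the front end is version-sensitive. This is a sketch under that assumption only: `front_end` and `SOURCE` are hypothetical names, and the later PyXL-specific stages (assembly and hardware binary) are not public, so they are not shown.

```python
import dis
import sys

SOURCE = "x = 1\ny = x + 2\n"


def front_end(source):
    """Source text -> CPython code object -> list of (opname, argrepr) pairs."""
    code = compile(source, "<pyxl-demo>", "exec")
    return [(ins.opname, ins.argrepr) for ins in dis.get_instructions(code)]


listing = front_end(SOURCE)
print("CPython", sys.version_info[:2])
for opname, arg in listing:
    print(f"{opname:20} {arg}")
```

Running the same script under two different CPython releases will generally print different listings, which is the drift a version-specific front end has to absorb.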
Maybe not AWS Lambda specifically, but definitely server-side acceleration — especially for machine learning feature generation, backend control logic, and anywhere pure Python becomes a bottleneck.
It could definitely get there — but it would require building a full-scale deployment model and much broader library and dynamic feature support.
That said, the underlying potential is absolutely there.
What's missing before you could create a demo for VCs or the relevant companies, proving the potential of this as a competitive server-class core?
PyXL today is aimed more at embedded and real-time systems.
For server-class use, I'd need to mature heap management, add basic concurrency, a simple network stack, and gather real-world benchmarks (like requests/sec).
That said, I wouldn’t try to fully replicate CPython for servers — that's a very competitive space with a huge surface area.
I'd rather focus on specific use cases where deterministic, low-latency Python execution could offer a real advantage — like real-time data preprocessing or lightweight event-driven backends.
When I originally started this project, I was actually thinking about machine learning feature generation workloads — pure Python code (branches, loops, dynamic types) without heavy SIMD needs. PyXL is very well suited for that kind of structured, control-flow-heavy workload.
If I wanted to pitch PyXL to VCs, I wouldn’t aim for general-purpose servers right away. I'd first find a specific, focused use case where PyXL's strengths matter, and iterate on that to prove value before expanding more broadly.
Right now I'm doing this with a dsl with an fpga talking to a computer.
Does your python implementation let you run at speeds like that?
If yes, is there any overhead left for dsp - preferably fp based?
You're right that dynamic typing makes high-frequency execution tricky, and modern OoO cores are incredibly good at hiding latencies. But PyXL isn't trying to replace general-purpose CPUs — it's designed for efficient, predictable execution in embedded and real-time systems, where simplicity and determinism matter more than absolute throughput. Most embedded cores (like ARM Cortex-M and simple RISC-V) are in-order too — and deliver huge value by focusing on predictability and power efficiency. That said, there’s room for smart optimizations even in a simple core — like limited lookahead on types, hazard detection, and other techniques to smooth execution paths. I think embedded and real-time represent the purest core of the architecture — and once that's solid, there's a lot of room to iterate upward for higher-end acceleration later.
IMO JavaCard doesn't really make sense either. There's clearly space for another language here, though I suspect most people would much rather just use Rust than learn a new language.
>Standard_NP10s instance, 1x AMD Alveo U250 FPGA (64GB)
Would be curious to see how this benchmarks on a faster FPGA since I imagine clock frequency is the latency dictator - while memory and tile can determine how many instances can run in parallel.
To run PyXL on a server-class FPGA (like Azure instances), some adaptations would be needed — the system would need to repurpose the host CPU to act as the orchestrator, handling memory, IO, etc.
The question is: what's the actual use case of running on a server? Besides testing max frequency -- for which I could just run Vivado on a different target (would need license for it though)
For now, I'm focusing on validating the core architecture, not just chasing raw clock speeds.
I have a Parallella board here with a Zynq.
This is still an early-stage project — it's not completed yet, and fabricating a custom chip would involve huge costs.
I'm a solo developer working on this in my spare time, so FPGA was the most practical way to prove the core concepts and validate the architecture.
Longer term, I definitely see ASIC fabrication as the way to unlock PyXL’s full potential — but only once the use case is clear and the design is a little more mature.
I find the idea of a processor designed for a specific very high level language quite interesting. What made you choose python and do you think it's the "correct" language for such a project? It sure seems convenient as a language but I wouldn't have thought it is best suited for that task due to the very dynamic nature of it. Perhaps something like Nim which is similar but a little less dynamic would be a better choice?
Like if I could buy a Cortex board and write Python, hit compile, and have the thing run, this would be INSANELY useful to me, cause cortex chips have pretty great A/D converters for sensing.
It could probably be adapted into one.
But I ultimately decided to build it as its own clean design because I wanted the flexibility to rethink the entire execution model for Python — not just adapt an existing register-based architecture.
FPGA is for prototyping, although this could probably be used as a soft core. But looking forward, ASIC is definitely the way to go.
I have what may be a dumb question, but I've heard that Lua can be used in embedded contexts, and that it can be used without dynamic memory allocation and other such things you don't want in real time systems. How does this project compare to that? And like I said it's likely a dumb question because I haven't actually used Lua in an embedded context but I imagine if there's something there you've probably looked at it?
That said... awesome work! I wish I could get to PyCon this year to see your talk.
Are you planning to post your core so others can replicate your work?
It's still early days and there’s a lot more work ahead, but I'm very excited about the possibilities.
I definitely see areas like embedded ML and TinyML as a natural fit — Python execution on low-power devices opens up a lot of doors that weren't practical before.
In theory, you could build a CPU that directly interprets Python bytecode — but Python bytecode is quite high-level and irregular compared to typical CPU instructions. It would add a lot of complexity and make pipelining much harder, which would hurt performance, especially for real-time or embedded use.
By compiling the Python bytecode ahead of time into a simpler, stack-based ISA (what I call PySM), the CPU can stay clean, highly pipelined, and efficient. It also opens the door in the future to potentially supporting other languages that could target the same ISA!
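As a rough picture of what a "simpler, stack-based ISA" can look like, here is a toy stack machine. It is an illustration only: the opcodes (`LOAD`, `PUSH`, `ADD`, `MUL`, `RET`) and the fixed (opcode, operand) format are invented for this sketch and are not PySM's real instruction set, which has not been published.

```python
from typing import List, Tuple

# Every instruction has the same shape: (opcode, integer operand).
Instr = Tuple[str, int]


def run(program: List[Instr], args: List[int]) -> int:
    """Execute a toy stack program against a list of argument values."""
    stack: List[int] = []
    for op, operand in program:
        if op == "LOAD":        # push argument number `operand`
            stack.append(args[operand])
        elif op == "PUSH":      # push an immediate constant
            stack.append(operand)
        elif op == "ADD":       # pop two values, push their sum
            rhs = stack.pop()
            lhs = stack.pop()
            stack.append(lhs + rhs)
        elif op == "MUL":       # pop two values, push their product
            rhs = stack.pop()
            lhs = stack.pop()
            stack.append(lhs * rhs)
        elif op == "RET":       # return the top of the stack
            return stack.pop()
        else:
            raise ValueError(f"unknown opcode: {op}")
    raise RuntimeError("program ended without RET")


# Hand-written equivalent of: def f(a, b): return a * b + 1
PROGRAM = [("LOAD", 0), ("LOAD", 1), ("MUL", 0), ("PUSH", 1), ("ADD", 0), ("RET", 0)]

result = run(PROGRAM, [6, 7])
print(result)  # 43
```

The point of the fixed format is that each step does one small, predictable thing, which is the property that makes an instruction stream easier to pipeline than variable, high-level bytecode.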
Absolutely incredible.
This is amazing, great work!
I think building a CPU that can only do this is a really novel idea and am really interested in seeing when you eventually disclose more implementation details. My only complaint is that it isn't Lua :P
This is so cool, I have dreamt about doing this but wouldn't know where to start. Do you have a plan for releasing it? What is your background? Was there anything that was way more difficult than you thought it would be? Or anything that was easier than you expected?
Right now, the plan is to present it at PyCon first (next month) and then publish more about the internals afterward. Long-term, I'm keeping an open mind, not sure yet.
My background is in high-frequency trading (HFT), high-performance computing (HPC), systems programming, and networking. I didn't come from a HW background, or at least I hadn't when I started, but coming from the software side gave me a different perspective on how dynamic languages could be made much more efficient at the hardware level.
Difficult - adapting the Python execution model to my needs in a way that keeps it self-coherent, if that makes sense. This is still fluid and not finalized...
Easy - Not sure I'd categorize it as easy, but more surprising: the current implementation is rather simple and elegant (at least I think so :-) ), so there's still no special advanced CPU design stuff (branch prediction, superscalar, etc.). Even so, I'm getting a huge improvement over the CPython or MicroPython VMs in the known Python bottlenecks (branching, function calls, etc.).
Alright well those dots are begging me to ask what they mean, or at least one specific story for the nerds :-)
> Long-term, I'm keeping an open mind, not sure yet.
Well, please consider open source, even if you charge for access to your open source code. And even if you don't go open source, at least make it cheap enough that a solo developer could afford to build on it without thinking.
Do you have any open source code available for this yet?
Are you planning to release this as open source? If not, do you have a rough idea of how you plan to commercially license this tech?
I’m guessing due to the lack of JIT, it’s executed on the host?
If you refer to the ARM part as the host (did you?) it's just orchestrating the whole thing, it doesn't run the actual Python program
This is a great approach for many applications, but it doesn’t fit all use cases.
PyXL is a hardware solution — a custom processor designed specifically to run Python programs directly.
It's currently focused on embedded and real-time environments where JIT compilation isn't a viable option due to memory constraints, strict timing requirements, and the need for deterministic behavior.
> No VM, No C, No JIT. Just PyXL.
Is the main goal to achieve C-like performance with the ease of writing Python? Do you have a performance comparison against C? Is the main challenge the memory management?
> PyXL runs on a Zynq-7000 FPGA (Arty-Z7-20 dev board). The PyXL core runs at 100MHz. The ARM CPU on the board handles setup and memory, but the Python code itself is executed entirely in hardware. The toolchain is written in Python and runs on a standard development machine using unmodified CPython.
> PyXL skips all of that. The Python bytecode is executed directly in hardware, and GPIO access is physically wired to the processor — no interpreter, no function call, just native hardware execution.
Did you write some sort of emulation to enable testing it without the physical Arty board?
Performance comparison against C: I don't have a formal benchmark directly against C yet. The early GPIO benchmark (480ns toggle) is competitive with hand-written C on ARM microcontrollers — even when running at a lower clock speed. But a full systematic comparison (across different workloads) would definitely be interesting for the future.
Main challenge: Yes — memory management is one of the biggest challenges. Dynamic memory allocation and garbage collection are tricky to manage efficiently without breaking real-time guarantees. I have a roadmap for it, but would like to stick to a real use case before moving forward.
Software emulation: I am using Icarus (could use Verilator) for RTL simulation if that's what you meant. But hardware behavior (like GPIO timing) still needs to be tested on the real FPGA to capture true performance characteristics.
How? XLWings is not a similar name to pyxl. However, even so, the name is... Heavily overloaded:
https://pyxl.com/ (some kind of strategy/CRM/AI thing)
https://pyxl.ai/ (AI website builder)
https://www.pyxl.pro/ (AI image generator)
https://github.com/dropbox/pyxl (Inline HTML extension for Python)
https://openpyxl.readthedocs.io/en/stable/ (A Python library to read/write Excel files)
https://www.pyxll.com/ (Excel Add-in to support add-ins written in Python)
As for the future, I’m keeping an open mind. It would be exciting if it grew into something bigger, but my main focus for now is making sure the foundation is as solid and clean as possible.
That then makes me wonder if someone could implement Excel in hardware! (Or something like it)
- Lisp/lispm
- Ada/iAPX
- C/ARM
- Java/Jazelle
Most don't really take off or go in different directions as the language goes out of fashion.
Also there are no languages that reflect what modern CPUs are like, because modern CPUs obfuscate and hide much of the way they work. Not even assembly is that close to the metal anymore, and it even has undefined behavior these days. There was an attempt to make a more explicit version of the hardware with Itanium, and it was explicitly a failure for much the same reason that iAPX432 was a failure. So we kept the simpler scalar register machine around, because both compilers and programmers are mostly too stupid to work with that much complexity. C didn't do shit, human mental capacity just failed to evolve fast enough to keep up with our technology. Things like Rust are more the descendant of C than the modern design of a CPU.
Text files seem a bit too sequential in structure, maybe we can figure out a way to represent the dependency graphs directly.
The end result would look nothing like any other programming language and would die in obscurity, to be honest. But holy shit it would be really fucking cool.
By the way, my introduction to C was via RatC, with the complete listing on A Book on C, from 1988, bought in 1990.
Intel failures tend to be more political than technical, as root cause.
It depends on what you mean by that. The major changes in the PDP-11 dialect of B were more ergonomic handling of strings, which no longer required repacking cells, and pointers becoming byte-aligned rather than word-aligned. C adopted these changes from the PDP-11 dialect of B, but that's the extent of influence the PDP-11 ever had.[1] The compiler size restrictions imposed by the PDP-7 and the GE-635 were far more influential on the semantics of the family.
In this rhetoric, what I'll call the "Your computer is not a fast PDP-11" dialogue, I find that people will imply things like pointer arithmetic, granular availability of memory as a flat array, etc. were invented in 1973, as though these are special quirks of the PDP-11 that C thrust upon the programmer. They're just a normal part of computing, really. All the same criticisms leveled at C can be leveled at Forth, for example, which isn't even in this class of register machine.
> Intel failures tend to be more political than technical
In the case of Itanium and iAPX432? Absolutely not. Read through the manual of the latter for a lark[2], there was never any chance in hell this thing could have succeeded. You couldn't pay me to maintain code for such a machine, sufficiently smart compiler or not. Itanium was a repeat of the same blunder, only this time Intel didn't even try to base their design on any existing infrastructure.
[1] - https://web.archive.org/web/20150611114355/https://www.bell-...
[2] - http://www.bitsavers.org/components/intel/iAPX_432/171860-00...
https://mn416.github.io/reduceron-project/
These range from a few instructions to accelerate certain operations, to marking memory for the garbage collector, to much deeper efforts.
Historically their performance is underwhelming. Sometimes competitive on the first iteration, sometimes just mid. But generally they can't iterate quickly (insufficient resources, insufficient product demand) so they are quickly eclipsed by pure software implementations atop COTS hardware.
This particular Valley of Disappointment is so routine as to make "let's implement this in hardware!" an evergreen tarpit idea. There are a few stunning exceptions like GPU offload—but they are unicorns.
Right now the only reason why we don't have new generations of these eating the lunch of general purpose CPUs is that you'd need to organize a few billion transistors into something useful. That's something a bit beyond what just about everyone (including Intel now apparently) can manage.
They are the tar pit. Transistor counts skyrocket, but the principles and obstacles have not changed one iota in over 50 years.
A processor from 2015 is good enough for most daily tasks in 2025. Try saying that about one from 1985 to 1995.
The issue today isn't that by the time you get to market with SOTA manufacturing on a custom 10x design you only have two years before general purpose chips are just as fast.
It's getting to the market in the first place.
You're right that it can definitely be faster — there's real room for optimization.
When I have time, I may write a blog post that will explain where the cycles go, why it's different from raw assembler toggling, and how it could be improved.
Also, just to keep things in perspective — don't forget to compare apples to apples: On a Pyboard running MicroPython, a simple GPIO roundtrip takes about 14 microseconds. PyXL is already achieving 480 nanoseconds, so it’s a very different baseline.
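For a sense of scale, the figures quoted in this thread (480 ns roundtrip, 100 MHz core clock, roughly 14 microseconds on the Pyboard) can be turned into cycle counts with a few lines of arithmetic. This is back-of-the-envelope only; the 14 microsecond figure is approximate, and the variable names are just for this example.

```python
# Figures quoted in the thread.
CLOCK_HZ = 100_000_000             # PyXL core clock: 100 MHz
PYXL_ROUNDTRIP_NS = 480            # PyXL GPIO roundtrip
MICROPYTHON_ROUNDTRIP_NS = 14_000  # "about 14 microseconds" on a Pyboard

# One clock period in nanoseconds: 1e9 ns per second / 100e6 cycles per second.
cycle_ns = 1_000_000_000 // CLOCK_HZ

# Number of PyXL clock cycles spent in one GPIO roundtrip.
cycles = PYXL_ROUNDTRIP_NS // cycle_ns

# Ratio of the two roundtrip times.
speedup = MICROPYTHON_ROUNDTRIP_NS / PYXL_ROUNDTRIP_NS

print(f"{cycle_ns} ns per cycle, {cycles} cycles per roundtrip")
print(f"about {speedup:.1f}x faster than the MicroPython figure")
```

So the 480 ns roundtrip corresponds to 48 clock cycles at 100 MHz, and the gap to the quoted MicroPython number is a bit under 30x.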
Thanks for raising it — it's a very good point.
Clearly you know a lot about both low level Python internals and a fair amount about hardware design to pull this off.
Almost every question I had, you already answered in the comments. The only one remaining at the moment: How long exactly have you been working on PyXL?
Why are we not doing this for standard Python? I think LLVM is just for that, no?
Very cool project still