140 points by dmpetrov 7 hours ago | 20 comments
  • simonw 5 hours ago
    I got a WebAssembly build of this working and fired up a web playground for trying it out: https://simonw.github.io/research/monty-wasm-pyodide/demo.ht...

    It doesn't have class support yet!

    But it doesn't matter, because LLMs that try to use a class will get an error message and rewrite their code to not use classes instead.

    Notes on how I got the WASM build working here: https://simonwillison.net/2026/Feb/6/pydantic-monty/
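
    The dynamic can be sketched as a simple retry loop (hypothetical names, not Monty's actual API):

```python
# A rough sketch (hypothetical names, not Monty's actual API) of the retry
# dynamic: run the generated snippet; on an unsupported-feature error, hand
# the message back to the model so it rewrites without that feature.

def run_restricted(code: str) -> dict:
    """Stand-in for a partial interpreter that rejects class definitions."""
    if "class " in code:
        raise SyntaxError("classes are not supported")
    namespace: dict = {}
    exec(code, namespace)
    return namespace

def agent_loop(llm, prompt: str, max_turns: int = 3) -> dict:
    code = llm(prompt, error=None)
    for _ in range(max_turns):
        try:
            return run_restricted(code)
        except SyntaxError as exc:
            code = llm(prompt, error=str(exc))  # model sees the error, retries
    raise RuntimeError("model never produced runnable code")

def fake_llm(prompt: str, error=None) -> str:
    """Stub model: tries a class first, then rewrites without one."""
    if error is None:
        return "class Point:\n    pass\nresult = 1"
    return "result = 1"

outcome = agent_loop(fake_llm, "compute result")
```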

    • dhdjfhfjfn 38 minutes ago
      > But it doesn't matter, because LLMs that try to use a class will get an error message and rewrite their code to not use classes instead.

      It seems that your response to accusations of becoming a vapid propagandist is to lean into it so far that people start thinking you’re joking.

      Very odd strategy. Very odd way to manage a reputation that you built over 20 years, but I guess it’s a great lesson in how AI psychosis can affect the best of us.

      • simonw 29 minutes ago
        You're really stretching things here to classify me pointing out that LLMs can handle syntax errors caused by partial implementations of Python as "being a vapid propagandist".

        (This kind of extremely weak criticism often seems to come from newly created Hacker News accounts, which makes me wonder if it's mostly the same person using sockpuppets.)

    • issat982 an hour ago
      Why do Rust proponents apparently swat Rust critics like Rene Rebe? Do they even have souls?
  • avaer 5 hours ago
    This feels like the time I was a Mercurial user before I moved to Git.

    Everyone was using Git for reasons that seemed bandwagon-y to me, when Mercurial just had such a better UX and mental model.

    Now, everyone is writing agent `exec`s in Python, when I think TypeScript/JS is far better suited for the job (it was always fast + secure, not to mention more reliable and information dense b/c of typing).

    But I think I'm gonna lose this one too.

    • giancarlostoro an hour ago
      Having been doing Python and JavaScript for over a decade, I would pick Python any day of the week over JavaScript. JavaScript is beautiful and also the most horrific programming language all at once. It still feels incomplete; there are too many oddities I've run into over the years, like checking for null, empty, or undefined values being inconsistent all around because different libraries behave differently.
    • nine_k 2 hours ago
      For historical reasons (FFI), Python has access to excellent vector / tensor mathematics (numpy / scipy / pandas / polars) and ML / AI libraries, from OpenCV to PyTorch. Hence the prevalence of Python in science and research. "Everybody knows Python".

      I do like Typescript (not JS) better, because of its highly advanced type system, compared to Python's.

      TS/JS is not inherently fast; it just has a good JIT compiler, while Python still ships without one. Regarding security, each interpreter is about as permissive as the other, and both can be sealed off from the environment pretty securely.

    • shoeb00m 4 hours ago
      A big benefit of letting agents run code is they can process data without bloating their context.

      LLMs are really good at writing Python for data processing. I would suspect it's due to Python having a really good ecosystem around this niche.

      And the type safety/security issues can hopefully be mitigated by ty and Pyodide (the latter already used by Cloudflare's Python Workers)

      https://pyodide.org/en/stable/

      https://github.com/astral-sh/ty

      • DouweM 3 hours ago
        (Pydantic AI lead here) That’s exactly what we built this for: we’re implementing Code Mode in https://github.com/pydantic/pydantic-ai/pull/4153 which will use Monty by default, with abstractions to use other runtimes / sandboxes.

        Monty’s overhead is so low that, assuming we get the security/capabilities tradeoff right (Samuel can comment on this more), you could have it always enabled on your agents with basically no downsides, which can’t be said for many other code execution sandboxes, which are often overkill for the code mode use case anyway.

        For those not familiar with the concept, the idea is that in “traditional” LLM tool calling, the entire (MCP) tool result is sent back to the LLM, even if it just needs a few fields, or is going to pass the return value into another tool without needing to see (all of) the intermediate value. Every step that depends on results from an earlier step requires a new LLM turn, limiting parallelism and adding a lot of overhead, expensive token usage, and context window bloat.

        With code mode, the LLM can chain tool calls, pull out specific fields, and run entire algorithms using tools with only the necessary parts of the result (or errors) going back to the LLM.

        These posts by Cloudflare: https://blog.cloudflare.com/code-mode/ and Anthropic: https://platform.claude.com/docs/en/agents-and-tools/tool-us... explain the concept and its advantages in more detail.
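
        A toy sketch of the difference (invented tool names, not the Pydantic AI API): in code mode, one generated snippet chains the tools and only a small final value goes back into the context window.

```python
# Toy sketch of code mode (invented tool names, not the Pydantic AI API).
# Traditional tool calling would ship every order, items and all, back to
# the model; here a generated snippet chains the tools and returns only
# the small value the model actually needs.

def get_orders(user: str) -> list[dict]:  # pretend MCP tool with bulky output
    return [{"id": 1, "total": 30.0, "items": ["..."] * 50},
            {"id": 2, "total": 12.5, "items": ["..."] * 50}]

def refund(order_id: int) -> dict:  # pretend MCP tool
    return {"order_id": order_id, "status": "refunded"}

# The LLM-written snippet: chain calls, keep only the fields that matter.
generated = """
statuses = [refund(o["id"])["status"]
            for o in get_orders("alice") if o["total"] > 20]
"""
sandbox = {"get_orders": get_orders, "refund": refund}
exec(generated, sandbox)
result = sandbox["statuses"]  # only this goes back to the model
```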

        • solidasparagus 20 minutes ago
          Why do you think python without access to the library ecosystem is a good approach? I think you will end up with small tool call subgraphs (i.e. more round trips) or having to generate substantially more utility code.
        • 4b11b4 2 hours ago
          "But MCP is still useful, because it is uniform"

          Yes, I was also thinking: why MCP, then?

          But even my simple class project reveals this. You actually do want a simple tool wrapper layer (abstraction) over every API. It doesn't even need to be an API. It can be a calculator that doesn't reach out anywhere.

          as the article puts it: "MCP makes tools uniform"
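
          A minimal sketch of that wrapper layer (invented names, not MCP itself): every tool, whether it wraps an API or a local calculator, exposes the same uniform shape.

```python
# Minimal sketch of a uniform tool layer (invented names, not MCP itself):
# every tool, remote API or local calculator, exposes the same shape.
from dataclasses import dataclass
from typing import Any, Callable

@dataclass
class Tool:
    name: str
    description: str
    fn: Callable[..., Any]

    def call(self, **kwargs: Any) -> Any:
        # One entry point, regardless of what the tool does underneath.
        return self.fn(**kwargs)

def add(a: float, b: float) -> float:
    return a + b  # no network needed; still a perfectly good tool

registry = {t.name: t for t in [Tool("add", "Add two numbers.", add)]}
result = registry["add"].call(a=2, b=3)
```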

        • 4b11b4 2 hours ago
          lol "agents are better at writing code that calls MCP than at using MCP itself"

          In hindsight, it's pretty funny and obvious

    • rzerowan 4 hours ago
      Tangentially, I wonder if the recent GIL changes will percolate to Mercurial as any improvements.

      Yep, still using good old hg for personal repos; interop for outside projects defaults to git, since almost all the hg hosts withered.

    • piskov 4 hours ago
      Can we please make as little js as possible?

      Why one would drag this godforsaken abomination onto the server side is beyond me.

      Even effing C# nowadays can be run in a script-like manner from a single file.

      Even the latest Codex UI app is Electron. The one that is supposed to write itself with AI wonders, but couldn’t manage native SwiftUI, WinUI, or Qt or whatever is on Linux these days.

      • aryonoco 4 hours ago
        My favourite languages are F# and OCaml, and from my perspective, TypeScript is a far better language than C#.

        Typescript’s types are far more adaptable and malleable, even with the latest C# 15 which is belatedly adding Sum Types. If I set TypeScript to its most strict settings, I can even make it mimic a poor man’s Haskell and write existential types or monoids.

        And JS/TS have by far the best libraries and utilities for JSON and XML parsing and string manipulation this side of Perl (the difference being that the TypeScript version is actually readable), and maybe Nushell, but I’ve never used Nushell in production.

        Recently I wrote a Linux CLI tool for managing podman/Quadlet containers, and I wrote it in TypeScript, and it was a joy to use. The Effect library gave me proper Error types and immutable data types, and the Bun Shell makes writing shell commands in TS nearly as easy as Bash. And I got it to compile to a single self-contained binary which I can run on any server, with a lower memory footprint and faster startup time than any equivalent .NET code I’ve ever written.

        And yes, had I written it in Rust it would have been faster and probably even safer, but for a quick and dirty tool development speed matters, and I can tell you that I really appreciated not having to think about ownership and fight the borrow checker the whole time.

        TypeScript might not be perfect, but it is a surprisingly good language for many domains and is still undervalued IMO given what it provides.

      • IshKebab 4 hours ago
        I would say the same about Python, a language that has clearly got far too big for its boots.
  • imfing 3 hours ago
    This is a really interesting take on the sandboxing problem. It reminds me of an experiment I worked on a while back (https://github.com/imfing/jsrun), which embedded V8 into Python to allow running JavaScript with tightly controlled access to the host environment. Similar in its goal of running untrusted code from Python.

    I’m especially curious about where the Pydantic team wants to take Monty. The minimal-interpreter approach feels like a good starting point for AI workloads, but the long tail of Python semantics is brutal. There is a trade-off between keeping the surface area small (for security and predictability) and providing sufficient language capability to handle the non-trivial snippets that LLMs generate for complex tasks.

    • scolvin 3 hours ago
      Can't be sure where this might end, but the primary goal is to enable codemode/programmatic tool calling, using the external function call mechanism for anything more complicated.

      I think in the near term we'll add support for classes, dataclasses, datetime, json. I think that should be enough for many use cases.

    • ushakov 3 hours ago
      there’s no way around VMs for secure, untrusted workloads. everything else, like Monty, has too many tradeoffs that make it non-viable for any real workloads

      disclaimer: i work at E2B, opinions my own

      • scolvin 3 hours ago
        As discussed on twitter, v8 shows that's not true.

        But to be clear, we're not even targeting the same "computer use" use case I think e2b, daytona, cloudflare, modal, fly.io, deno, google, aws are going after - we're aiming to support programmatic tool calling with minimal latency and complexity - it's a fundamentally different offering.

        Chill, e2b has its use case, at least for now.

        • ushakov 2 hours ago
          we’re not disagreeing here - i meant that for the general use-case VMs are better, while for some application-specific calls Monty might suffice.

          although you’d still need another boundary to run your app in to prevent breaking out to other tenants.

  • zahlman 19 hours ago
    > Instead, it let's you run safely run Python code written by an LLM embedded in your agent, with startup times measured in single digit microseconds not hundreds of milliseconds.

    Perhaps if the interpreter is in turn embedded in the executable and runs in-process, but even a do-nothing `uv` invocation takes ~10ms on my system.
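
    A quick way to measure that baseline yourself (plain CPython, nothing Monty-specific):

```python
# Time a do-nothing interpreter launch: the cold-start cost that an
# in-process interpreter avoids entirely. Numbers vary by machine.
import subprocess
import sys
import time

start = time.perf_counter()
subprocess.run([sys.executable, "-c", "pass"], check=True)
elapsed_ms = (time.perf_counter() - start) * 1000
print(f"cold interpreter start: {elapsed_ms:.1f} ms")  # typically tens of ms
```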

    I like the idea of a minimal implementation like this, though. I hadn't even considered it from an AI sandboxing perspective; I just liked the idea of a stdlib-less alternative upon which better-thought-out "core" libraries could be stacked, with less disk footprint.

    Have to say I didn't expect it to come out of Pydantic.

    • preciousoo 5 hours ago
      Pydantic + FastAPI are my two favorite Python shops right now, they’re always dropping fun new projects
    • Cyphase 4 hours ago
      uv is written in Rust, not Python.
  • wewewedxfgdf an hour ago
    If I say my code is secure, does that make it secure?

    Or is all Rust code secure unquestionably?

  • JoshPurtell 3 hours ago
    Monty is the missing link that's made me ship my rust-based RLM implementation - and I'm certain it'll come in handy in plenty of other contexts.

    Just beware of panics!

    • JoshPurtell 3 hours ago
      • scolvin 2 hours ago
        Please report any panics, we'll fix them!
        • IhateAI 34 minutes ago
          Why do SWEs build tools in the open that are openly hostile to their own trade? I can understand someone selfishly building tools for themselves, but by contributing to these efforts you're basically donating free software tools to companies, tools that will only be used to shrink their engineering teams by making LLMs more capable/efficient.

          While I think all LLMs are shit, they probably eventually will not be shit, and it will be because people like you contributed to their progress. Nothing good will come of it for you or your peers. The billionaires who own everything will kick you to the curb as soon as you train your replacement that doesn't sleep, eat, or complain. Have some class solidarity.

  • _joel 5 hours ago
    Well I love the name, so definitely trying this out later, but first...

    And now for something completely different.

  • SafeDusk 2 hours ago
    Sandboxing is going to be of growing interest as more agents go “code mode”.

    Will explore this for https://toolkami.com/, which allows plug and play advanced “code mode” for AI agents.

  • c2xlZXB5 3 hours ago
    Maybe a dumb question, but couldn't you use seccomp to limit or deny the syscalls the Python interpreter has access to? For example, if you don't want it messing with your host filesystem, you could just deny it any filesystem-related system calls. What is the benefit of using a completely separate interpreter?
    • oofbey 3 hours ago
      Yours is a valid approach, but you always gotta wonder if there’s some way around it. Starting from a runtime that has ways of accessing every aspect of your system, there are a lot of ways an attacker might try to defeat the blocks you put in place. The point of starting with something super minimal is that the attack surface is tiny; it's really hard to see how anything could break out.
      • ushakov 3 hours ago
        agree. you still need a secure boundary like VM to isolate the tenants in case the model breaks out of the sandbox.

        everything that you don’t want your agent to access should live outside of the sandbox.

  • geysersam 3 hours ago
    Is AI running regular Python really a problem? I see that in principle there is an issue, but in practice I don't know anyone who's had security issues from this. Have you?
    • scolvin 2 hours ago
      No one is going to let an LLM that gets prompted by end users write Python code I just run on my server; there's no real debate on that.
      • ushakov 2 hours ago
        i think there’s confusion around what use-case Monty is solving (i was confused as well). it seems to isolate a scope of execution, like function calls, not entire Python applications
  • rienbdj 5 hours ago
    If we’re going to have LLMs write the code, why not something more performant? Like pages and pages of Java maybe?
    • scolvin 4 hours ago
      this is pretty performant for short scripts if you measure time "from code to rust", which can be as low as 1µs.

      Of course it's slow for complex numerical calculations, but that's not the primary use case.

      I think the consensus is that LLMs are very good at writing python and ts/js, generally not quite as good at writing other languages, at least in one shot. So there's an advantage to using python/js/ts.

      • catlifeonmars 3 hours ago
        Seems like we should fix the LLMs instead of bending over backwards no?
        • redman257 minutes ago
          They’re good at it because they’ve learned from the existing mountains of python and javascript.
  • kodablah 14 hours ago
    I'm of the mind that it will be better to construct more strict/structured languages for AI use than to reuse existing ones.

    My reasoning is 1) AIs can comprehend specs easily, especially if simple, 2) it is only valuable to "meet developers where they are" if really needing the developers' history/experience which I'd argue LLMs don't need as much (or only need because lang is so flexible/loose), and 3) human languages were developed to provide extreme human subjectivity which is way too much wiggle-room/flexibility (and is why people have to keep writing projects like these to reduce it).

    We should be writing languages that are super-strict by default (e.g. down to the literal ordering/alphabetizing of constructs, exact spacing expectations) and only having opt-in loose modes for humans and tooling to format. I admit I am toying w/ such a lang myself, but in general we can ask more of AI code generations than we can of ourselves.
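
    As a toy illustration of that kind of strictness (my own sketch, unrelated to Monty): a checker that rejects, rather than auto-fixes, non-canonical ordering.

```python
# Toy sketch of a "strict by default" rule: imports must be alphabetized,
# and the checker rejects violations instead of reformatting them.
import ast

def strict_errors(source: str) -> list[str]:
    errors = []
    tree = ast.parse(source)
    names = [node.names[0].name
             for node in tree.body if isinstance(node, ast.Import)]
    if names != sorted(names):
        errors.append("imports must be alphabetized")
    return errors

# A generator (human or model) must emit the one canonical form:
assert strict_errors("import os\nimport json") == ["imports must be alphabetized"]
assert strict_errors("import json\nimport os") == []
```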

    • bityard 4 hours ago
      I think the hard part about that is you first have to train the model on a BUTT TON of that new language, because that's the only way they "learn" anything. They already know a lot of Python, so telling them to write restricted and sandboxed Python ("you can only call _these_ functions") is a lot easier.

      But I'd be interested to see what you come up with.

  • Retr0id 3 hours ago
    I'm enjoying watching the battle for where to draw the sandbox boundaries (and I don't have any answers, either!)
    • ushakov 3 hours ago
      best answer is probably to have a layered approach - use this to limit what the generated code can do, wrap it in a secure VM to prevent leaking out to other tenants.
  • krick 3 hours ago
    I don't quite understand the purpose. Yes, it's clearly stated, but what do you mean by "a reasonable subset of Python code" that "cannot use the standard library"? 99.9% of the Python I write for anything ever uses the standard library and then some (requests?). What do you expect your LLM agent to write without that? A pseudo-code sorting algorithm sketch? Why would you even want to run that?
    • impulser_ 3 hours ago
      They plan to use it for "Code Mode", which means the LLM will use this to run Python code that it writes to call tools, instead of having to load the tools up front into the LLM context window.
      • DouweM 3 hours ago
        (Pydantic AI lead here) We’re implementing Code Mode in https://github.com/pydantic/pydantic-ai/pull/4153 with support for Monty and abstractions to use other runtimes / sandboxes.

        The idea is that in “traditional” LLM tool calling, the entire (MCP) tool result is sent back to the LLM, even if it just needs a few fields, or is going to pass the return value into another tool without needing to see the intermediate value. Every step that depends on results from an earlier step also requires a new LLM turn, limiting parallelism and adding a lot of overhead.

        With code mode, the LLM can chain tool calls, pull out specific fields, and run entire algorithms using tools with only the necessary parts of the result (or errors) going back to the LLM.

        These posts by Cloudflare: https://blog.cloudflare.com/code-mode/ and Anthropic: https://platform.claude.com/docs/en/agents-and-tools/tool-us... explain the concept and its advantages in more detail.

    • notepad0x90 3 hours ago
      It's Pydantic; they're verifying types and syntax, and those don't require the stdlib. Type hints, syntax checks, likely logical issues, etc. Static type checking is good at that, but LLMs can take it to the next level, where they analyze the intended data flow and find logical bugs, or code with valid syntax and typing that nonetheless isn't what was intended.

      For example, incorrect levels of indentation:

          for key, val in mydict.items():
              if key == "operation":
                  logging.info("Executing operation %s", val)
              if val == "drop_table":
                  self.drop_table()

      This uses valid syntax, and the logging call is part of the stdlib, so I assume it would be ignored or replaced with dummy code? That shouldn't prevent it from analyzing the loop and determining that the second if-block was probably intended to be nested under the first; as written, the key check isn't applied to it.

      In other words, if you want to validate not proper stdlib/module usage but proper __Python__ usage, this makes sense. Although I'm speculating on exactly what they're trying to do.

      EDIT: I think my speculation was wrong; it looks like they might have developed this to write code for pydantic-ai: https://github.com/pydantic/pydantic-ai . I'll leave the comment above as-is though, since I think it would still be cool to have that capability in Pydantic.

  • dmpetrov 7 hours ago
    I like the idea a lot but it's still unclear from the docs what the hard security boundary is once you start calling LLMs - can it avoid "breaking out" into the host env in practice?
  • spacedatum an hour ago
    There is no reason to continue writing Python in 2026. Tell Claude to write Rust a priori. Your future self will thank you.
  • falcor84 4 hours ago
    Wow, a startup latency of 0.06ms
  • OutOfHere 5 hours ago
    It is absurd for any user to use a half-baked Python interpreter, one that will always lag far behind CPython in its support. I advise sandboxing CPython instead, using OS features.
    • simonw 3 hours ago
      How do I sandbox CPython using OS features?

      (Genuine question, I've been trying to find reliable, well documented, robust patterns for doing this for years! I need it across macOS and Linux and ideally Windows too. Preferably without having to run anything as root.)

      • OutOfHere an hour ago
        Docker and other container runners allow it. https://containers.dev/ allows it too.

        https://github.com/microsoft/litebox might somehow allow it too if a tool can be built on top of it, but there is no documentation.

        • simonw 7 minutes ago
          Every time I use Docker as a sandbox people warn me to watch out for "container escapes".

          I trust Firecracker more because it was built by AWS specifically to sandbox Lambdas, but it doesn't work on macOS and is pretty fiddly to run on Linux.

    • bityard 3 hours ago
      Python already has a lot of half-baked (all the way up to nearly-fully-baked) interpreters; what's one more?

      https://en.wikipedia.org/wiki/List_of_Python_software#Python...

    • avaer 5 hours ago
      The repo does make a case for this, namely speed, which does make sense.
      • sd2k 4 hours ago
        True, but while CPython does have a reputation for slow startup, completely re-implementing isn't the only way to work around it. E.g. with eryx [1] I've managed to pre-initialize and snapshot the Wasm and pre-compile it, to get real CPython starting in ~15ms, without compromising on language features. It's doable!

        [1] https://github.com/eryx-org/eryx

      • OutOfHere an hour ago
        Speed is not a feature if there isn't even syntax parity with CPython.