What was even more impressive is the 0.6B model, which made the sub-1B class actually useful for non-trivial tasks.
Overall very impressed. I am evaluating how it can integrate with my current setup and will probably report somewhere about that.
Which I find even more impressive, considering the 3060 is the most used GPU (on Steam) and that M4 Air and future SoCs are/will be commonplace too.
(Q4_K_M with filesize=18GB)
Conversely, the 4B model actually seemed to work really well and gave results comparable to Gemini 2.0 Flash (at least in my simple tests).
I haven't evaled these tasks so YMMV. I'm exploring other possibilities as well. I suspect it might be decent at autocomplete, and it's small enough one could consider finetuning it on a codebase.
The /think and /no_think commands are very convenient.
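For reference, the soft switch is just appended to the prompt text. A minimal sketch, assuming the `ollama` Python package and a pulled Qwen3 model (the model tag is an example):

```python
# Qwen3 "soft switch": appending /no_think to a user message disables the
# thinking block for that turn; /think re-enables it.
# Assumes `pip install ollama` and `ollama pull qwen3:30b` (tag is an example).
import ollama

resp = ollama.chat(
    model="qwen3:30b",
    messages=[{"role": "user", "content": "Summarize what a trie is. /no_think"}],
)
print(resp["message"]["content"])
```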
Here’s the LM Studio docs on it: https://lmstudio.ai/docs/app/advanced/speculative-decoding
The quality of the output is decent; just keep in mind it is only a 30B model. It also translates really well from French to German and vice versa, much better than Google Translate.
Edit: for comparison, Qwen2.5-coder 32B q4 is around 12-14 t/s on this M1, which is too slow for me. I usually used the Qwen2.5-coder 17B at around 30 t/s for simple tasks. Qwen3 30B is imho better and faster.
[1] parameters for Qwen3: https://huggingface.co/Qwen/Qwen3-30B-A3B
[2] unsloth quant: https://huggingface.co/unsloth/Qwen3-30B-A3B-GGUF
[3] llama.cpp: https://github.com/ggml-org/llama.cpp
I'm running Q4 and it's taking 17.94 GB VRAM with 4k context window, 20GB with 32k tokens.
As a Python person I've found uv + MLX to be pretty painless on a Mac too.
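For what it's worth, a minimal sketch of that path, assuming `uv pip install mlx-lm` and that the mlx-community 8-bit conversion is published under this name:

```python
# Load a quantized Qwen3 conversion with mlx-lm and generate a completion.
# The repo id below is an assumption; swap in whatever MLX conversion you use.
from mlx_lm import load, generate

model, tokenizer = load("mlx-community/Qwen3-30B-A3B-8bit")
generate(model, tokenizer, prompt="Write a haiku about unified memory.", verbose=True)
```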
Soon some AMD Ryzen AI Max PCs will be available, with unified memory as well. For example the Framework Desktop with up to 128 GB, shared with the iGPU:
- Product: https://frame.work/us/en/desktop?tab=overview
- Video, discussing 70B LLMs at around 3m:50s : https://youtu.be/zI6ZQls54Ms
edit: ok.. i am excited.
The latter is super easy. Just download the model (thru the GUI) and go.
This part, yes. I assume that setting up a complete environment is a little more involved than the four commands the sibling comment refers to.
E.g., I go a little bit overboard compared to the average macOS user:
- custom system- and app-specific keyboard mappings (ultra-modifier on caps-lock; custom tabbing-key-modifier) via Karabiner Elements
- custom trackpad mappings via BetterTouchTool
- custom Time Machine schedule and backup logic; you can vibe-code your install script once and re-use it in the future; just make it idempotent (see the sketch after this list)
- custom quake-like Terminal via iTerm
- shell customizations
- custom Alfred workflows
- etc.
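For the install-script bullet above, a tiny sketch of the idempotent idea (the Homebrew formula/cask names are only examples; adapt to your own list):

```python
# Idempotent setup sketch: skip anything Homebrew already has installed so the
# script can be re-run safely after every macOS (re)install or upgrade.
import subprocess

FORMULAE = ["git", "ollama"]
CASKS = ["iterm2", "karabiner-elements"]

def brew_list(kind_flag: str) -> set[str]:
    out = subprocess.run(["brew", "list", kind_flag],
                         capture_output=True, text=True, check=True).stdout
    return set(out.split())

have_formulae, have_casks = brew_list("--formula"), brew_list("--cask")

for name in FORMULAE:
    if name not in have_formulae:
        subprocess.run(["brew", "install", name], check=True)

for name in CASKS:
    if name not in have_casks:
        subprocess.run(["brew", "install", "--cask", name], check=True)
```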
If all you need is just a sensible package manager and the terminal to get started, just set up Time Machine with default settings, Homebrew, your shell, and optionally iTerm2, and you're good to go. Other noteworthy power-user tools:
- Hammerspoon
- Syncthing / Resilio Sync
- Arq. Naturally, the usual backup tools also run on macOS: Borg, Kopia, etc.
- Affinity suite for image processing
- Keyshape for animations on web, mobile, etc.
As a Windows/macOS/Linux dweller, Kinto is a godsend: it lets me have the macOS keyboard layout (but you could default to Linux or Windows instead) on all OSes. https://kinto.sh/
It's using 20GB of memory according to ollama.
I doubt it can perform well with actual autonomous tasks like reading multiple files, navigating directories and figuring out where to make edits. That’s at least what I would understand by “vibe coding”.
It is a viable way of making software. People have made working software with it. It will likely only become more prevalent, though it might eventually just be called plain old making apps.
ollama-mcp-bridge: A TypeScript implementation that "connects local LLMs (via Ollama) to Model Context Protocol (MCP) servers. This bridge allows open-source models to use the same tools and capabilities as Claude, enabling powerful local AI assistants"
simple-mcp-ollama-bridge: A more lightweight bridge connecting "Model Context Protocol (MCP) servers to OpenAI-compatible LLMs like Ollama"
rawveg/ollama-mcp: "An MCP server for Ollama that enables seamless integration between Ollama's local LLM models and MCP-compatible applications like Claude Desktop"
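The pattern all three implement is roughly the same: list the tools an MCP server exposes, hand them to the local model as function definitions, and execute whatever tool calls come back. A rough sketch, assuming the official `mcp` SDK and a recent `ollama` Python package; `server.py`, the model tag and the example question are placeholders:

```python
# Bridge sketch: MCP server tools -> Ollama function calling -> tool execution.
import asyncio
import ollama
from mcp import ClientSession, StdioServerParameters
from mcp.client.stdio import stdio_client

async def main() -> None:
    params = StdioServerParameters(command="python", args=["server.py"])  # placeholder server
    async with stdio_client(params) as (read, write):
        async with ClientSession(read, write) as session:
            await session.initialize()
            listed = await session.list_tools()

            # Translate MCP tool descriptions into Ollama's tool format.
            tools = [{
                "type": "function",
                "function": {
                    "name": t.name,
                    "description": t.description or "",
                    "parameters": t.inputSchema,
                },
            } for t in listed.tools]

            resp = ollama.chat(
                model="qwen3:30b",  # any local model with tool-calling support
                messages=[{"role": "user", "content": "How many words are in notes.txt?"}],
                tools=tools,
            )

            # Execute any tool calls the model requested against the MCP server.
            for call in resp.message.tool_calls or []:
                result = await session.call_tool(call.function.name, call.function.arguments)
                print(result.content)

asyncio.run(main())
```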
How you route would be an interesting challenge; presumably you could just tell it to use the MCP server for certain tasks, thereby offloading them locally.
I started by giving it a reference Python MCP server and asking it to modify the code to do that. Now I have 3-4 tools that give me reproducible results.
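For anyone curious what such a "reference" server looks like, here is a minimal sketch using the official Python SDK's FastMCP helper (the tool itself is just an example, not the tools mentioned above):

```python
# Minimal MCP server exposing one reproducible tool over stdio.
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("local-tools")

@mcp.tool()
def word_count(path: str) -> int:
    """Count the words in a local text file."""
    with open(path, encoding="utf-8") as f:
        return len(f.read().split())

if __name__ == "__main__":
    mcp.run()  # stdio transport by default
```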
"create a python decorator that uses a trie to do mqtt topic routing”
phi4-reasoning works, but I think the code is buggy
phi4-mini-reasoning freaks out
qwen3:30b starts looping and forgets about the decorator
mistral-small gets straight to the point and the code seems sane
https://mastodon.social/@rcarmo/114433075043021470
I regularly use Copilot models, and they can manage this without too many issues (Claude 3.7 and Gemini output usable code with tests), but local models seem to not have the ability to do it quite yet.
The prompt was:
"Create a Python decorator that registers functions as handlers for MQTT topic patterns (including + and # wildcards). Internally, use a trie to store the topic patterns and match incoming topic strings to the correct handlers. Provide an example showing how to register multiple handlers and dispatch a message to the correct one based on an incoming topic."
If you are using Ollama, try explicitly setting the `num_ctx` parameter in your request to something higher like 16k or 32k, and then see if you still encounter the looping. I haven't run into that behavior once with this model.
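If you're calling the API rather than the CLI, that's a per-request option. A minimal sketch with the `ollama` Python package (model tag and prompt are examples):

```python
# Ollama defaults to a fairly small context window (num_ctx); when a thinking
# model overruns it, the truncated context can show up as looping output.
import ollama

resp = ollama.chat(
    model="qwen3:30b",
    messages=[{"role": "user", "content": "Refactor this module and explain the changes."}],
    options={"num_ctx": 32768},
)
print(resp["message"]["content"])
```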
Create a Python decorator that registers functions as handlers for MQTT topic patterns (including + and # wildcards). Internally, use a trie to store the topic patterns and match incoming topic strings to the correct handlers. Provide an example showing how to register multiple handlers and dispatch a message to the correct one based on an incoming topic.
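For reference, here is one hand-rolled way the task can be solved (not any model's output), which makes it easier to judge what the models produce:

```python
# Trie of topic-pattern segments: "+" matches exactly one level, "#" matches
# the remaining levels. Handlers are stored at the node where a pattern ends.
from collections import defaultdict

class TopicTrie:
    def __init__(self):
        self.children = defaultdict(TopicTrie)
        self.handlers = []

    def insert(self, pattern, handler):
        node = self
        for segment in pattern.split("/"):
            node = node.children[segment]
        node.handlers.append(handler)

    def match(self, topic):
        return list(self._match(topic.split("/"), 0))

    def _match(self, segments, i):
        # A "#" child matches everything from this level downwards.
        if "#" in self.children:
            yield from self.children["#"].handlers
        if i == len(segments):
            yield from self.handlers
            return
        # Try the literal segment and the single-level "+" wildcard.
        for key in (segments[i], "+"):
            if key in self.children:
                yield from self.children[key]._match(segments, i + 1)

_trie = TopicTrie()

def on_topic(pattern):
    """Decorator registering a function as handler for an MQTT topic pattern."""
    def wrap(fn):
        _trie.insert(pattern, fn)
        return fn
    return wrap

def dispatch(topic, payload):
    for handler in _trie.match(topic):
        handler(topic, payload)

@on_topic("home/+/temperature")
def temp(topic, payload):
    print("temperature:", topic, payload)

@on_topic("home/#")
def everything(topic, payload):
    print("catch-all:", topic, payload)

dispatch("home/kitchen/temperature", "21.5")  # fires both handlers
```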
In the post I saw there’s gemma3 (multimodal) and qwen3 (not multimodal). Could they be used as above?
How does localforge know when to route a prompt to which agent?
Thank you
I do think you should disclose that Localforge is your own project though.
Even if I chill out a bit here, a refurbished Nvidia laptop would pay for itself within a year. I am a bit disappointed Ollama can't handle the full flow yet, i.e. it could be a single command:
ollama code qwen3
The problem is the best models barely clear the bar for some stuff in terms of coherence and reliability; anything else just isn't particularly usable.
From what I've seen most of them are good at writing new code from scratch.
Refactoring is very difficult.
You're generally correct, though, that from-scratch gets better results. This is a huge constraint of theirs: I don't want a model that will write something its own way. I've already gone through my design and settled on the style/principles/libraries I did for a reason; the bot working terribly with that is a major flaw, and I don't see "let the bot do things its preferred way" as a good answer. In some systems, things like latency matter, and the bot's way just isn't good enough.
The vast majority of man-hours are maintaining and extending code, not green-fielding new stuff. Vendors should be hyper-focused on this, on compliance with user directions, not with building something that makes a react todo-list app marginally faster or better than competitors.
I was trying to get Postgres working with a project the other day, and Claude decided that it was going to just replace it with SQLite when it couldn't get the build to work.
All I want is "I don't know how to do this." But now these tools would rather just do it wrong.
They also have a very very strong tendency to try and force unoptimized solutions. You'll have 3 classes that do the exact same thing with only minor variable differences. Something a human would do in one class.
For my latest project I'm strongly tempted to just suck it up and code the whole thing by hand.
Seems like a great way to roll out AI to a medium sized team where a very small team can coordinate access to the best available tools so the entire team doesn’t need to keep pace at the current break-neck speed.
I have the models on an external drive (because Apple), and through the Ollama server they interact really well with Cline or Roo Code, or even Bolt, though I found Bolt really doesn't work well.
Someone in another comment on this post mentioned using one of the micro models (Qwen 0.6B I think?) and having decent results. Maybe you can try that and then progressively move upwards?
EDIT: “Queen” -> “Qwen”
The vast majority of people run between 4-6 bit depending on system capability. The extra accuracy above 6 tends to not be worth it relative to the performance hit.
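The back-of-the-envelope math for a 30B-parameter model shows why (weights only; KV cache and runtime overhead come on top, and Q4_K_M lands around 4.8 effective bits per weight, which is roughly how the ~18 GB file mentioned above comes about):

```python
# Weight memory scales linearly with bits per parameter.
params = 30e9
for bits in (4, 6, 8, 16):
    print(f"{bits}-bit: ~{params * bits / 8 / 2**30:.0f} GiB")
# 4-bit: ~14 GiB, 6-bit: ~21 GiB, 8-bit: ~28 GiB, 16-bit: ~56 GiB
```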
I have 64GB and I can only just fit a bunch of Firefox and VS Code windows at the same time as running a 27B model.
And so people who are vibe coding are getting paid multiple six-figure salaries... that's not sustainable; anyone at any age and in any country can vibe code.
Looks like we are embracing the demise of our skill-sets, careers and livelihoods quickly!
Qwen2.5-32B, Cogito-32B and GLM-32B remain the best options for local coding agents, even though the recently released MiMo model is also quite good for its size.
If you're on Debian AFAIK AMD is paying someone to experience the pain in your place, so that is an option if you're building something from scratch, but my openSUSE Tumbleweed installation predates the existence of llama.cpp by a few years and i'm not subjecting myself to the horror that is Python projects (mis)managed by AI developers[1] :-P.
EDIT: my mistake, ROCm isn't needed (or actually, supported) by koboldcpp, it uses Vulkan. ROCm is available via a fork. Still, with Vulkan it is fast too.
[0] ...and more than once, as it might break after some OS upgrade, like mine did
[1] ok, i did it once, because recently i wanted to try out some tool someone wrote that relied on some AI stuff and i was too stubborn to give up - i had to install Python from source on a Debian docker container because some dependency 2-3 layers deep didn't compile with a newer minor version release of Python. It convinced me yet again to thank Georgi Gerganov for making AI-related tooling that enables people to stick with C++
llama.cpp can be built using Debian-supplied libraries with ROCm backend enabled.
I used to use Debian in the past but when i was about to install my current OS i already had the openSUSE Tumbleweed installer on a USB so i went with that. Ultimately i just needed "a Linux" and didn't care which. I do end up building more stuff from source than when i used Debian, but TBH the only time that annoyed me was with ROCm, because it is broken into 2983847283 pieces, many of them have their own flags for the same stuff, some claim they allow you to install them anywhere but in practice only work via the default in "/opt", and a bunch of them have their own special snowflake build process (including one that downloads some random stuff via a script during the build process - IIRC a Gentoo packager filed a bug report about removing the need to download stuff, but i'm not sure if it has been addressed or not).
If i was doing a fresh OS install i'd probably go with Gentoo - it packages ROCm like Debian, but AFAICT (i haven't tried it) it also provides some tools for you to make bespoke patches to packages you install that survive updates and i'd like to do some customizations on stuff i install.
My laptop has dual (and dueling) graphics chips, Intel and Quadro K1200M with 4 GB of RAM. I will need to learn more about LLM setup, so maybe I can torture myself getting the Nvidia driver working on Linux and experiment with that.
1. Sovereignty over data, your outputs can't be stolen or trained on
2. Just for fun / learning / experiment on
3. Avoid detection that you're using AI
4. No Internet connection, in the woods at your cabin or something
This Qwen3-30B-A3B-8bit - how much RAM under MLX?
My 16GB Apple Silicon Macs want to play.
I wonder how far this can go?
My hunch is that there's still some remaining optimization fruit to be harvested but I expect we may be nearing a plateau. I may have to upgrade from 64GB of RAM this year.
Why isn't using Localforge enough, since it ties into models?
Alternatively, LMStudio has MLX support you can use as well.
Q: who won the cricket worldcup 2023?
A: The *2023 ICC Cricket World Cup* was won by *Australia*. They defeated *New Zealand* in the final on *November 19, 2023*, at the *M. A. Chidambaram Stadium in Chennai, India*.
### Key Details:
- *Final Score*: Australia scored *334/5* in their 50 overs, while New Zealand reached *199/8* in their chase.
- *Man of the Match*: Pat Cummins (Australia) with 5 wickets for 34 runs.
- *Tournament Host*: India.
- *Notable Performance*: Australia's aggressive batting and disciplined bowling sealed the victory, marking their *sixth Cricket World Cup title* (tying them with West Indies for the most titles).
This was the 13th edition of the ICC Cricket World Cup (50-over format). The T20 World Cup is a separate tournament, with India winning the 2024 edition. Let me know if you need further details!
total duration:       34.911890735s
load duration:        65.034584ms
prompt eval count:    20 token(s)
prompt eval duration: 305.460537ms
prompt eval rate:     65.47 tokens/s
eval count:           511 token(s)
eval duration:        34.539660654s
eval rate:            14.79 tokens/s
Comment: Wrong - https://www.espncricinfo.com/series/icc-cricket-world-cup-20...
Q: give last 10 movies to win the oscar best picture award
Here is a clear and accurate list of the *last 10 Best Picture winners* at the Academy Awards, from *2014 to 2023*, including the ceremony year, the winning film, and the film's release year:
---
1. *2023 (95th Oscars)* - *Winner*: Everything Everywhere All at Once - *Release Year*: 2022
2. *2022 (94th Oscars)* - *Winner*: The Power of the Dog - *Release Year*: 2021
3. *2021 (93rd Oscars)* - *Winner*: Nomadland - *Release Year*: 2020
4. *2020 (92nd Oscars)* - *Winner*: Parasite - *Release Year*: 2019
5. *2019 (91st Oscars)* - *Winner*: Green Book - *Release Year*: 2018
6. *2018 (90th Oscars)* - *Winner*: The Shape of Water - *Release Year*: 2017
7. *2017 (89th Oscars)* - *Winner*: Moonlight - *Release Year*: 2016
8. *2016 (88th Oscars)* - *Winner*: Spotlight - *Release Year*: 2015
9. *2015 (87th Oscars)* - *Winner*: 12 Years a Slave - *Release Year*: 2013
10. *2014 (86th Oscars)* - *Winner*: Argo - *Release Year*: 2012
Comment: Answers number 2 and 9 are wrong. (https://en.wikipedia.org/wiki/Academy_Award_for_Best_Picture)
I would have expected it to get things which are such big events right at least.
https://en.m.wikipedia.org/wiki/Vibe_coding
> a programming paradigm dependent on artificial intelligence (AI), where a person describes a problem in a few sentences as a prompt to a large language model (LLM) tuned for coding.
> A key part of the definition of vibe coding is that the user accepts code without full understanding.