Can't wait for it to arrive and crank up LM Studio. It's literally the first install. I'm going to download it with Safari.
LM Studio is newish and its interface isn't perfect yet, but it's fantastic at what it does, which is bringing local LLMs to the masses without them having to know much.
There is another project that people should be aware of: https://github.com/exo-explore/exo
Exo is this radically cool tool that automatically clusters all the hosts on your network that are running Exo and uses their combined GPUs for increased throughput.
As with HPC environments, you're going to want ultra-fast interconnects, but it's all just IP-based.
Get the RTX Pro 6000 for $8.5k with double the bandwidth. It will be way better.
The whole point of spending that much money on them is to run massive models, like the full R1, which the Pro 6000 can't.
And you are never going to sit around waiting on anything larger than the 96+ GB of VRAM that the RTX Pro has.
If you're using it for background tasks and not coding, it's a different story.
Am I the only person that gives aider instructions and leaves it alone for a few hours? This doesn't seem that difficult to integrate into my workflow.
Probably not, but in my experience, if it takes longer than 10-15 minutes it's either stuck in a loop or down the wrong rabbit hole. I don't use it for vibe coding or anything "big scope" like that, though, more focused changes/refactors, so YMMV.
If the primary use case is input heavy, which is true of agentic tools, there’s a world where partial GPU offload with many channels of DDR5 system RAM leads to an overall better experience. A good GPU will process input many times faster, and with good RAM you might end up with decent output speed still. Seems like that would come in close to $12k?
And there would be no competition for models that do fit entirely inside that VRAM, for example Qwen3 32B.
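If anyone wants to experiment with that kind of split, here's a minimal sketch using llama-cpp-python; the model path, layer count, and context size are placeholders, and the right n_gpu_layers value depends on how much VRAM you actually have:

# Minimal sketch of partial GPU offload with llama-cpp-python.
# Model path and layer count are placeholders; tune n_gpu_layers
# until the model no longer spills out of VRAM.
from llama_cpp import Llama

llm = Llama(
    model_path="./qwen3-32b-q4_k_m.gguf",  # hypothetical local GGUF file
    n_gpu_layers=40,   # layers kept on the GPU; the rest run from system RAM
    n_ctx=8192,        # context window; bigger contexts need more memory
)

out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Summarize this repo's README."}],
    max_tokens=256,
)
print(out["choices"][0]["message"]["content"])

Input-heavy workloads mostly exercise prompt processing on the GPU, while output speed ends up bounded by how fast the CPU-side layers can stream weights out of system RAM.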
No, the Pro 6000 pulls a max of 600W; not sure where you got 1500W from, that's more than double the spec.
Besides, what are the tokens/second (or seconds/token) and prompt processing speed for running DeepSeek R1 671B at Q4 on a Mac Studio? Curious about those numbers, because I have a feeling they're very far apart.
You get around 6 tokens/second, which is not great but not terrible. If you use very long prompts, things get bad.
Probably should just use llama.cpp server/ollama and not waste a gig of memory on Electron, but I like GUIs.
https://www.pcgamer.com/apple-vp-says-8gb-ram-on-a-macbook-p...
Wow, that is probably analogous to 48GB on other systems then, if we were to ask an Apple VP?
Here is a nice article with some info about what memory compression is and how it works: https://arstechnica.com/gadgets/2013/10/os-x-10-9/#page-17
It was a hard technical problem, but it has pretty much been solved since its debut around 2012-2013.
Doesn't such a claim... need stronger evidence?
macOS virtual memory does a good job of swapping stuff in and out of the SSD.
Using local LLMs for this, I don't worry about the price at all; I could leave it doing three tries per "task" without tripling the cost if I wanted to.
It's true that there is an upfront cost, but it's way easier to get over that hump than on-demand/per-token costs, at least for me.
Alternatively, you could just set up your own (cheaper?) VPN relay on the tiniest VPS you can rent on AWS or IBM Cloud, right?
Oof you were NOT joking
https://www.techspot.com/news/106159-apple-m5-silicon-rumore...
LM Studio isn't FOSS though.
I did enjoy hooking up OpenWebUI to Firefox's experimental AI Chatbot (set browser.ml.chat.hideLocalhost to false and browser.ml.chat.provider to localhost:${openwebui-port}).
Of course, for folks used to terminals, daemons and so on it makes sense from the get go, but for others it seemingly doesn't, and it doesn't help that Ollama refuses to communicate what people should understand before trying to use it.
I'm interested in using models for code generation, but I'm not expecting much in that regard.
I'm planning to attempt fine tuning open source models on certain tool sets, especially MCP tools.
I haven’t been using it much. All it has on it is LM Studio, Ollama, and Stats.app.
> Can't wait for it to arrive and crank up LM Studio. It's literally the first install. I'm going to download it with safari.
lol, yup. same.
I'm considering ordering one of these today: https://www.newegg.com/p/N82E16816139451?Item=N82E1681613945...
It looks like it will hold 5 GPUs with a single slot open for InfiniBand.
Then local models might be lower quality, but it won't be slow! :)
Just wondering if Claude 3.7 has seemed different lately for anyone else? It was my go-to for several months, and I'm no fan of OpenAI, but o3 has been rock solid.
Prompts + tools matter.
What cards are you gonna put in that chassis?
I have one running locally with this config:
{
  "mcpServers": {
    "coderunner": {
      "url": "http://coderunner.local:8222/sse"
    }
  }
}
1. CodeRunner: https://github.com/BandarLabs/coderunner (I am one of the authors)
> "MCP Host": applications (like LM Studio or Claude Desktop) that can connect to MCP servers, and make their resources available to models.
I think everyone else is calling this an "MCP Client", so I'm not sure why they would want to call themselves a host - makes it sound like they are hosting MCP servers (definitely something that people are doing, even though often the server is run on the same machine as the client), when in fact they are just a client? Or am I confused?
Some more discussion on the confusion here https://github.com/modelcontextprotocol/modelcontextprotocol... where they acknowledge that most people call it a client and that that's ok unless the distinction is important.
I think "host" is a bad term for it, though, as it makes more intuitive sense for the host to host the server and the client to connect to it, especially for remote MCP servers, which are probably going to become the default way of using them.
The MCP standard seems like a mess; e.g., take this paragraph from here [1]:
> In the Streamable HTTP transport, the server operates as an independent process that can handle multiple client connections.
Yes, obviously, that is what servers do. Also, what is "Streamable HTTP"? Comet, HTTP/2, or even WebSockets? SSE could be a candidate, but it isn't, as it says "Streamable HTTP" replaces SSE.
> This transport uses HTTP POST and GET requests.
Guys, POST and GET are verbs of the HTTP protocol; TCP is the transport. I guess they could say that they use the HTTP protocol with only the POST and GET verbs (if that is the case).
> Server can optionally make use of Server-Sent Events (SSE) to stream multiple server messages.
This would make sense if there weren't the note "This replaces the HTTP+SSE transport" right below the title.
> This permits basic MCP servers, as well as more feature-rich servers supporting streaming and server-to-client notifications and requests.
Again, how is streaming implemented (what is "Streamable HTTP")? Also, "server-to-client ... requests"? SSE is unidirectional, so are those requests happening over secondary HTTP requests?
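For what it's worth, here is roughly how I ended up reading the transport section, as a Python sketch; the endpoint path, headers, and port are my guesses from the draft, not something I'd treat as authoritative:

# Rough sketch of my reading of "Streamable HTTP": the client POSTs a
# JSON-RPC message to a single MCP endpoint, and the server answers that
# POST either with a plain JSON body or with an SSE stream on the same
# response. Endpoint URL and header choices are assumptions, and a real
# client would do the initialize handshake before calling tools/list.
import json
import requests

MCP_ENDPOINT = "http://localhost:8222/mcp"  # hypothetical server endpoint

msg = {"jsonrpc": "2.0", "id": 1, "method": "tools/list", "params": {}}
resp = requests.post(
    MCP_ENDPOINT,
    json=msg,
    headers={"Accept": "application/json, text/event-stream"},
    stream=True,
)

if resp.headers.get("Content-Type", "").startswith("text/event-stream"):
    # Server chose to stream: each SSE "data:" line carries a JSON-RPC message.
    for line in resp.iter_lines(decode_unicode=True):
        if line and line.startswith("data:"):
            print(json.loads(line[len("data:"):].strip()))
else:
    # Server chose a single JSON response.
    print(resp.json())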
--
And then the 2.0.1 Security Warning seems like a blob of words about security, with no reference to, say, same-origin. Also, "for local servers bind to localhost and then implement proper authentication": are both of those ever required together? Is it even worth saying that servers should implement proper authentication?
Anyway, after reading the entire documentation, one might be able to put together a charitable version of the MCP puzzle that actually makes sense. But it does seem that it wasn't written by engineers, in which case I don't understand why, or for whom, it was written.
[1] https://modelcontextprotocol.io/specification/draft/basic/tr...
As far as I can tell, unsurprisingly, the MCP specification was written with the help of LLMs, and it seemingly hasn't been carefully reviewed, because, as you say, a bunch of the terms have straight-up wrong definitions.
https://modelcontextprotocol.io/specification/2025-03-26/arc...
I'm not bullish on MCP, but at the least this approach gives a good way to experiment with it for free.
You gotta help me out. What do you see holding it back?
For my 16 GB of VRAM, those models don't include anything that's good at coding, even when I provide the API documents via PDF upload (another thing that LM Studio makes easy).
So, not really, but LM Studio at least makes it easier to find that out.
Upon installing, the first model offered is google/gemma-3-12b, which in fairness is pretty decent compared to others.
It's not obvious how to show the right sidebar they're talking about: it's the flask icon, which turns into a collapse icon when you click it.
I set the MCP up with playwright, asked it to read the top headline from HN, and it got stuck in an infinite loop of navigating to Hacker News but doing nothing with the output.
I wanted to try it out with a few other models, but figuring out how to download new models isn't obvious either; it turned out to be the search icon. Anyway, other models didn't fare much better; some outright ignored the tools despite nominally supporting 'tool use'.
I also tried the recent deepseek 8b distill, but it was much worse for tool calling than qwen3 8b.
I'd love to learn more about your MCP implementation. Wanna chat?
Nice to have a local option, especially for some prompts.
I have a 48 GB MacBook Pro, and Gemma 3 (one of the abliterated ones) fits my non-code use case perfectly (generating crime stories where the reader tries to guess the killer).
For code, I still call Google to use Gemini.
Any suggestions?
[>_] -> [.* Settings] -> Serve on local network ( o)
Any OpenAI-compatible client app should work - use the IP address of the host machine as the API server address; the API key can be bogus or blank (minimal sketch below).
Just added the `Add to LM Studio` button to the anytype mcp server, looks nice: https://github.com/anyproto/anytype-mcp
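Here's that minimal sketch, with the OpenAI Python SDK pointed at LM Studio's local server; the host IP, port (1234 is what I believe the default is), and model name are placeholders for whatever your setup uses:

# Minimal sketch: any OpenAI-compatible client pointed at LM Studio's
# local network server. Host IP, port, and model name are placeholders.
from openai import OpenAI

client = OpenAI(
    base_url="http://192.168.1.50:1234/v1",  # IP of the machine running LM Studio
    api_key="not-needed",                    # LM Studio ignores the key
)

resp = client.chat.completions.create(
    model="google/gemma-3-12b",  # whatever model you have loaded
    messages=[{"role": "user", "content": "Say hi from the local network."}],
)
print(resp.choices[0].message.content)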
What I like about ollama is that it provides a self-hosted AI provider that can be used by a variety of things. LM Studio has that too, but you have to have the whole big chonky Electron UI running. Its UI is powerful but a lot less nice than e.g. BoltAI for casual use.
If you're just working as a single user via the OpenAI protocol, you might want to consider koboldcpp. It bundles a GUI launcher, then starts in text-only mode. You can also tell it to just run a saved configuration, bypassing the GUI; I've successfully run it as a system service on Windows using nssm.
https://github.com/LostRuins/koboldcpp/releases
Though there are a lot of roleplay-centric gimmicks in its feature set, its context-shifting feature is singular. It caches the intermediate state used by your last query, extending it to build the next one. As a result you save on generation time with large contexts, and also any conversation that has been pushed out of the context window still indirectly influences the current exchange.
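A rough sketch of the usage pattern that benefits from that, assuming koboldcpp's OpenAI-compatible endpoint on what I believe is its default port (5001): keep appending to the same message list so the cached prefix stays valid between turns.

# Rough sketch of a chat loop that plays nicely with context-shifting:
# the history only grows at the end, so the server can reuse the cached
# state from the previous turn. Port and model name are assumptions.
import requests

URL = "http://localhost:5001/v1/chat/completions"  # koboldcpp's OpenAI-compatible API
history = [{"role": "system", "content": "You are a helpful assistant."}]

for user_turn in ["Outline a mystery plot.", "Now add a red herring."]:
    history.append({"role": "user", "content": user_turn})
    resp = requests.post(URL, json={
        "model": "local",      # local servers generally don't care about this field
        "messages": history,
        "max_tokens": 300,
    })
    reply = resp.json()["choices"][0]["message"]["content"]
    history.append({"role": "assistant", "content": reply})
    print(reply)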
Worse, I'd say, considering what people use LM Studio for, is the VRAM it occupies even when the UI and everything is idle. Somehow it's using 500MB of VRAM while doing nothing, while Firefox with ~60 active tabs is using 480MB. gnome-shell itself also sits around 450MB and is responsible for quite a bit more than LM Studio.
Still, LM Studio is probably the best all-in-one GUI around for local LLM usage, unless you go the terminal route.
Are you sharing any of your revenue from that $79 license fee with the https://ollama.com/ project that your app builds on top of?