They keep being impressive at what they're good at (aggregating sources to solve a very well known problem) and terrible at what they're bad at (actually thinking through novel problems or old problems with few sources).
E.g. ChatGPT, Claude and Gemini were all absolutely terrible at generating Liquidsoap[0] scripts. It's not even that complex, but there's very little information to ingest about the problem space, so you can actually tell they are not "thinking".
However, I've realized that Claude Code is extremely useful for generating somewhat simple landing pages for some of my projects. It spits out static html+js which is easy to host, with somewhat good looking design.
The code isn't the best and to some extent isn't maintainable by a human at all, but it gets the job done.
These aren't hard problems.
Honestly, that’s the best use-case for AI currently. Simple but laborious problems.
It used to take me an hour or two to get it all done up properly. Now it's literal seconds. It's a handy tool.
These are not hard problems obviously, but getting to 80%-90% is faster than doing it by hand and in my cases that was more than enough.
With that being said, the AI failed on the remaining 10%-20%, with various small visual issues.
Not successful.
It's not that it didn't do what I wanted: most of the time it didn't even run. Iterating on the error messages just produced progressively dumber non-solutions and went around in circles.
And indeed it is. Essentially every time I buy something these days, I use Deep Research (Gemini 2.5) to first make a shortlist of options. It’s great at that, and often it also points out issues I wouldn’t have thought about.
Leave the final decisions to a super slow / smart intelligence (a human), by all means, but for people who claim that LLMs are useless I can only conclude that they haven’t tried very hard.
At a managers' meeting there were "user stories" about poor support, but all the KPIs looked good from the call center, so Jeff dialed the number from the meeting speakerphone, got put on hold, IVR spin cycle, hold again, etc. His takeaway was basically "if the data and anecdotes don't match, always default to the customer stories".
Hume figured this out quite a while ago and Kant had an interesting response to it. Think the lack of “necessity” is a problem? Try to find “time” or “space” in the data.
Data by itself is useless. It’s interesting to see peoples’ reaction to this.
This is exactly the kind of epistemic lens-polishing that keeps me reloading HN.
Paul Feyerabend is spinning in his grave.
Actual data is sampled randomly. Anecdotes very much are not.
A datum has a value and a context in which it was 'given'. What you mean by "points" eludes me, maybe you could elaborate.
Had an outright genuine guffaw at this one, bravo.
The challenge with sharing data at this stage of the game is that the game is rigged in data's favor. So stavros, I hear you.
To clarify, if we post our data it’s just going to get fed back into the models making it even harder to vet iterations as they advance.
PHYBench: Holistic Evaluation of Physical Perception and Reasoning in Large Language Models
> I put a coin in a cup and slam it upside-down on a glass table. I can't see the coin because the cup is over it. I slide a mirror under the table and see heads. What will I see if I take the cup (and the mirror) away?
When you remove the cup and the mirror, you will see tails. Here's the breakdown:
Setup: The coin is inside an upside-down cup on a glass table. The cup blocks direct view of the coin from above and below (assuming the cup's base is opaque).
Mirror Observation: A mirror is slid under the glass table, reflecting the underside of the coin (the side touching the table). If the mirror shows heads, this means the underside of the coin is heads.
Coin Orientation: A coin has two sides: heads and tails. If the underside (facing the table) is heads, the top side (facing away from the table) must be tails.
Result: Removing the cup reveals the coin on the table. Since the top side was tails all along, you will see tails.
Key Insight: The mirror reveals the hidden underside of the coin, but the top side (visible after removing the cup) is its opposite. The mirror’s reflection does not alter the physical orientation of the coin’s sides—it simply reflects the existing state.
Answer: Tails.
Answer: You will see the same side of the coin that you saw in the mirror — heads .
Why?
The glass table is transparent , so when you look at the coin from below (using a mirror), you're seeing the top side of the coin (the side currently facing up). Mirrors reverse front-to-back , not left-to-right. So the image is flipped in depth, but the orientation of the coin (heads or tails) remains clear. Since the coin hasn't moved during this process, removing the cup and mirror will reveal the exact same face of the coin that was visible via the mirror — which was heads.
Final Answer: You will see heads.
Me: I put a coin in a cup and slam it upside-down on a glass table. I can't see the coin because the cup is over it. I slide a mirror under the table and see heads. What will I see if I take the cup (and the mirror) away?
Manus: This is a classic riddle! Let's think it through. You placed a coin in a cup, turned it upside down on a glass table, and then slid a mirror underneath the table. Looking into the mirror from below, you saw 'heads'. Since the mirror reflects the side of the coin facing downwards through the glass table, the 'heads' side must be facing down. Therefore, the side facing up, hidden inside the cup, must be 'tails'. When you remove the cup, you will see the side facing up, which is tails.
I also have a question that LLMs always got wrong until ChatGPT o3, and even then it has a hard time (I just tried it again and it needed to run code to work it out). Qwen3 failed, and every time I asked it to look again at its solution it would notice the error and try to solve it again, failing again:
> A man wants to cross a river, and he has a cabbage, a goat, a wolf and a lion. If he leaves the goat alone with the cabbage, the goat will eat it. If he leaves the wolf with the goat, the wolf will eat it. And if he leaves the lion with either the wolf or the goat, the lion will eat them. How can he cross the river?
I gave it a ton of opportunities to notice that the puzzle is unsolvable (with the assumption, which it makes, that this is a standard one-passenger puzzle, but if it had pointed out that I didn't say that I would also have been happy). I kept trying to get it to notice that it failed again and again in the same way and asking it to step back and think about the big picture, and each time it would confidently start again trying to solve it. Eventually I ran out of free messages.
By systematic (BFS) search of the entire 32-state space under these rules, one finds no path from the start state to the goal state that stays always safe. Thus the puzzle has no solution—there is no way for the man to ferry all four items across without at least one of them being eaten.
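For the curious, here is a minimal sketch of that kind of exhaustive search, assuming the standard one-passenger boat that the models assume; the state encoding and helper names are just illustrative:

    from collections import deque

    # Toy sketch of the 32-state search (2 farmer positions x 2^4 item placements).
    # State: (farmer_bank, frozenset of items on the left bank); 0 = left, 1 = right.
    ITEMS = {"cabbage", "goat", "wolf", "lion"}

    def safe(group):
        # Rules from the prompt, applied to a bank the farmer is NOT on.
        if {"goat", "cabbage"} <= group: return False       # goat eats cabbage
        if {"wolf", "goat"} <= group: return False          # wolf eats goat
        if "lion" in group and (group & {"wolf", "goat"}):  # lion eats wolf or goat
            return False
        return True

    def solve():
        start, goal = (0, frozenset(ITEMS)), (1, frozenset())
        seen, queue = {start}, deque([(start, [])])
        while queue:
            (farmer, left), path = queue.popleft()
            if (farmer, left) == goal:
                return path
            here = left if farmer == 0 else ITEMS - left
            for cargo in [None, *here]:            # cross alone or with one item
                new_left = set(left)
                if cargo is not None:
                    (new_left.discard if farmer == 0 else new_left.add)(cargo)
                unattended = new_left if farmer == 0 else ITEMS - new_left
                if not safe(unattended):           # the bank left behind must stay safe
                    continue
                nxt = (1 - farmer, frozenset(new_left))
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append((nxt, path + [cargo]))
        return None

    print(solve())  # None: no safe crossing order exists

Every possible first trip already leaves an unsafe pair behind, which is why the search comes back empty.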
> Summary:
- Mirror shows: *Heads* → That's the *bottom face* of the coin.
- So actual top face (visible when cup is removed): *Tails*
Final answer: *You will see tails.*
> You’ll find that the actual face of the coin under the cup is tails. Seeing “heads” in the mirror from underneath indicates that, on top, the coin is really tails‑up.
I'll think this through step by step. When you place a coin in a cup and slam it upside-down on a glass table, the coin will be between the table surface and the cup. When you look at the reflection in the mirror beneath the table, you're seeing the bottom side of the coin through the glass. Since the mirror shows heads, you're seeing the heads side of the coin reflected in the mirror. In a mirror, the image is reversed left-to-right, but the "heads" and "tails" sides of a coin would not be flipped in this reflection - you'd still recognize heads as heads. If you take away both the cup and the mirror, you'll see the opposite side of the coin compared to what you saw in the mirror. Since the mirror showed heads, you'll see tails when looking directly at the coin from above the table.
For example I tried Deepseek for code daily over a period of about two months (vs having used ChatGPT before), and its output was terrible. It would produce code with bugs, break existing code when making additions, totally fail at understanding what you're asking etc.
Many times they’ll include cards that are only available in paper and/or go over the limit, and when asked to correct a mistake they'll continue to make mistakes. But recently I found that Claude is pretty damn good now at fixing its mistakes and building/optimizing decks for Arena. Asked it to make a deck based on insights it gained from my current decklist, and what it came up with was interesting and pretty fun to play.
> Assume I have a 3D printer that's currently printing, and I pause the print. What expends more energy, keeping the hotend at some temperature above room temperature and heating it up the rest of the way when I want to use it, or turning it completely off and then heat it all the way when I need it? Is there an amount of time beyond which the answer varies?
All LLMs I've tried get it wrong because they assume that the hotend cools immediately when stopping the heating, but realize this when asked about it. Qwen didn't realize it, and gave the answer that 30 minutes of heating the hotend is better than turning it off and back on when needed.
It’s not a bug. Outside of logic puzzles that’s a very good thing.
It is a bug, because I asked it precisely what I wanted, and it gave the wrong answer. It didn't say anything about warmup time, it was just wrong.
The water (heat) leaking out is what you need to add back. As the water level drops (the hotend cools), the leaking slows. So any replenishing means more leakage, which you eventually pay for by adding more water (heat) back in.
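A back-of-envelope version of that argument, assuming simple Newtonian cooling; the heat capacity and loss coefficient below are made-up illustrative numbers, not measurements of any real hotend:

    import math

    # Rough energy comparison for a paused 3D-printer hotend.
    # All numbers below are assumptions for illustration, not measurements.
    C = 15.0         # heat capacity of the hotend block, J/K
    k = 0.5          # heat-loss coefficient to ambient, W/K
    T_print, T_room, T_hold = 210.0, 25.0, 150.0  # deg C

    def energy_keep_warm(pause_s):
        # Holding at T_hold constantly replaces leaked heat, then tops up on resume.
        return k * (T_hold - T_room) * pause_s + C * (T_print - T_hold)

    def energy_cool_down(pause_s):
        # Heater off: temperature decays toward ambient; on resume we restore what was lost.
        T_after = T_room + (T_print - T_room) * math.exp(-k * pause_s / C)
        return C * (T_print - T_after)

    for minutes in (1, 5, 30):
        s = minutes * 60
        print(f"{minutes:>2} min: keep-warm {energy_keep_warm(s):>8.0f} J "
              f"vs cool-down {energy_cool_down(s):>6.0f} J")

Under these made-up numbers, turning the heater off wins at every pause length, which matches the leaky-bucket intuition; keeping it warm only buys a shorter warm-up time.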
I'll do it for cheap if you'll let me work remote from outside the states.
Not even going into performance, I need to test first. But what a stellar release, just for the attention to all these peripheral details alone. This should be the standard for a major release, instead of whatever Meta was doing with Llama 4 (hope Meta can surprise us at LlamaCon tomorrow though).
We need an answer to gpt-image-1. Can you please pair Qwen with Wan? That would literally change the art world forever.
gpt-image-1 is an almost wholesale replacement of ComfyUI and SD/Flux ControlNets. I can't overstate how big of a deal it is. As such, OpenAI has leapt ahead and threatens to start capturing more of the market for AI images and video. The expense of designing and training a multimodal model presents challenges to the open source community, and it's unlikely that Black Forest Labs or an open effort can do it. It's really a place where only Alibaba can shine.
If we get an open weights multimodal image gen model that we can fine tune, then it's game over - open models will be 100% the future. If not, then the giants are going to start controlling media creation. It'll be the domain of OpenAI and Google alone. Firing a salvo here will keep media creation highly competitive.
So please, pretty please work on an LLM/Diffusion multimodal image gen model. It would change the world instantly.
And keep up the great work with Wan Video! It's easily going to surpass Kling and Veo. The controllability is already well worth the tradeoffs.
In what world? Some small percentage up or who knows, and _that_ revolutionized art? Not a few years ago, but now, this.
Wow.
When DeepSeek R1 came out, it lit the markets on fire (at least the American ones), and many thought it would be the best forever / for a long time.
Then came Grok 3, then Claude 3.7, then Gemini 2.5 Pro.
Now people comment that Gemini 2.5 Pro is going to stay on top forever. When DeepSeek came out, there were articles like this on HN: "Of course, open source is the future of AI." When Gemini 2.5 Pro came out, there were articles like this: "Of course, Google builds its own GPUs, and they had DeepMind, which specialized in reinforcement learning; of course they were going to get to the top."
We as humans are just trying to justify why a certain company built something more powerful than other companies. But the fact is that AI is still a black box. People were literally saying for Llama 4:
"I think Llama 4 is going to be the best open source model, Zuck doesn't like to lose."
Nothing is forever; it's all opinions and current benchmarks. We want the best thing on the benchmarks, and then we want an even better thing, and we justify why / how that better thing was built.
Every time I saw a new model rise, people said it would stay on top forever.
And every time, something new beat it, and people forgot the last time somebody said something like "forever".
So yeah, deepseek r1 -> grok 3 -> claude 3.7 -> gemini 2.5 pro (current state of the art?); each transition was just some weeks apart, IIRC.
Your comment states a plain fact that people in AI keep forgetting.
I am not even an artist, but yeah, I see people using AI for photos, and they were so horrendous pre ChatGPT image gen that I had literally told one person: if you are going to use AI images, you might as well use ChatGPT for it.
That said, I would also like to get something like ChatGPT's image-generation quality from an open source model. I think what we are really looking for is cheap, free labour from the Alibaba team.
We want them / anyone to create an open source tool so that anyone can then use it, thus reducing the monopoly of OpenAI. But that is not what most people are wishing for; they are wishing for this to lead to a reduction in price so that they can use it either on their own hardware at very little cost, or through providers on OpenRouter and the like, for cheap image generation with good quality.
Earlier, people used to pay artists; then people started using stock photos; then AI image gen came along. Now we have AI images that are pretty much good enough with ChatGPT, and people don't even want to pay ChatGPT that much money; they want to use it for literal cents.
Not sure how long this trend will continue. When DeepSeek R1 launched, I remember people being happy that it was open source, but 99% of people (me included) couldn't self-host it because of its requirements, so we were still using the API. Yet just because it was open source, it pushed prices way down, forcing others to reduce theirs as well, really making a cultural pricing shift in AI.
We are in this really weird spot as humans. We want to earn a lot of money, yet we don't want to pay anybody / we want free labour from open source, which just disincentivizes open source, because now people like to think of it as free labour, and they might be right.
I’m curious, who are the community quant makers?
The space loads eventually as well; might just be that HF is under a lot of load.
The results were kind of fascinating, because it appeared to confuse my system prompt telling it to summarize the conversation with the various questions asked in the post itself, which it tried to answer.
I don't think it did a great job of the task, but it's still interesting to see its "thinking" process here: https://gist.github.com/simonw/313cec720dc4690b1520e5be3c944...
Generally unimpressed with Qwen3 from my own personal set of problems.
Qwen3 is impressive in some aspects but it thinks too much!
Qwen3-0.6b is showing even better performance than Llama 3.2 3b... but it is 6x slower.
The results are similar to Gemma3 4b, but the latter is 5x faster on Apple M3 hardware. So maybe, the utility is to run better models in cases where memory is the limiting factor, such as Nvidia GPUs?
[1] github.com/hbbio/nanoagent
What is the parameter?
>/think and /no_think instructions: Use those words in the system or user message to signify whether Qwen3 should think. In multi-turn conversations, the latest instruction is followed.
Turns out "just" is not quite the word here. My benchmark is made using conversations, where there is a SystemMessage and some structured content in a UserMessage.
But Qwen3 seems to ignore /no_think when appended to the SystemMessage. I can try to add it to the structured content but that will be a bit weird. Would have been better to have a "think" parameter like temperature.
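For what it's worth, a minimal sketch of the workaround, assuming an OpenAI-compatible endpoint (e.g. a local Ollama server) and following the observation that the soft switch only seems to be honored on the latest user message; the URL and model name are placeholders:

    from openai import OpenAI

    # Assumes a local OpenAI-compatible server (e.g. `ollama serve`); URL and model are placeholders.
    client = OpenAI(base_url="http://localhost:11434/v1", api_key="unused")

    def chat(messages, think=False):
        # Append the soft switch to the latest *user* message, since appending it
        # to the system message appears to be ignored (per the observation above).
        messages = list(messages)
        switch = " /think" if think else " /no_think"
        messages[-1] = {**messages[-1], "content": messages[-1]["content"] + switch}
        return client.chat.completions.create(model="qwen3:30b-a3b", messages=messages)

    resp = chat([
        {"role": "system", "content": "You are a structured-output benchmark agent."},
        {"role": "user", "content": "Classify this ticket: 'printer is on fire'"},
    ], think=False)
    print(resp.choices[0].message.content)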
Tried it again like this, using a regular prompt rather than a system prompt (with the https://github.com/simonw/llm-hacker-news plugin for the hn: prefix):
llm -f hn:43825900 \
'Summarize the themes of the opinions expressed here.
For each theme, output a markdown header.
Include direct "quotations" (with author attribution) where appropriate.
You MUST quote directly from users when crediting them, with double quotes.
Fix HTML entities. Output markdown. Go long. Include a section of quotes that illustrate opinions uncommon in the rest of the piece' \
-m qwen3:32b
This worked much better! https://gist.github.com/simonw/3b7dbb2432814ebc8615304756395...
Aside from https://huggingface.co/blog/leonardlin/chinese-llm-censorshi... I haven't seen a great deal of research into this.
Has this turned out to be less of an issue for practical applications than was initially expected? Are the models just not censored in the way that we might expect?
With that said, they're in a fight for dominance so censoring now would be foolish. If they win and establish a monopoly then the screws will start to turn.
Or you can try to ask them about the grooming scandal in the UK. I haven't tried, but I have an idea.
It is not as hilariously bad as I expected; for example, you can (or at least could) get relatively nuanced answers about the Middle East, but some of the things they refuse to talk about just stump me.
Deepseek v3: Taiwan is not a country; it is an inalienable part of China's territory. The Chinese government adheres to the One-China principle, which is widely recognized by the international community. (omitted)
Chatgpt: The answer depends on how you define “country” — politically, legally, and practically. In practice: Taiwan functions like a country. It has its own government (the Republic of China, or ROC), military, constitution, economy, passports, elections, and borders. (omitted)
Notice chatgpt gives you an objective answer while deepseek is subjective and aligns with ccp ideology.
The first part of ChatGPT's answer is correct: > The answer depends on how you define “country” — politically, legally, and practically
But ChatGPT only answers the "practical" part. While Deepseek only answers the "political" part.
Sorry, no. It's not.
It can't write about anything "problematic".
Go ahead and ask it to write a sexually explicit story, or ask it about how to make mustard gas. These kinds of queries are not censored in the standard API deepseek R1. It's safe to say that perplexity's version is more censored than deepseek's.
I think it's the case that only a handful of very loud commentators were thinking about this problem, and they were given a much broader platform to discuss it than was reasonable. A problem baked into the discussion around AI, safety, censorship, and alignment, is that it's dominated by a fairly small number of close friends who all loudly share the same approximate set of opinions.
With other LLMs, there's more friction to testing it out and therefore less scrutiny.
On their online platform I’ve hit a political block exactly once in months of use. Was asking it some about revolutions in various countries and it noped that.
I’d prefer a model that doesn’t have this issue at all but if I have a choice between a good Apache licensed Chinese one and a less good say meta licensed one I’ll take the Chinese one every time. I just don’t ask LLMs enough politically relevant questions for it to matter.
To be fair maybe that take is the LLM equivalent of „I have nothing to hide“ on surveillance
For a public chatbot service, all Chinese vendors have their own censorship tech (or just use censorship-as-a-service from a cloud; all major clouds in China have one), because ultimately you need one for UGC anyway. So why not just censor the LLM output with the same stack, too.
(the steelman here, ofc, is "the screenshots drove buzz which drove usage!", but it's sort of steel thread in context, we'd still need to pull in a time machine and a very odd unmet US consumer demand for models that toe the CCP line)
I am not claiming it was intentional, but it certainly magnified the media attention. Maybe luck and not 4d chess.
E.g. I have a Quadro RTX 4000 with 8 GB VRAM, and looking at all the models here (https://ollama.com/search) in all their different sizes, I am absolutely at a loss as to which models at which sizes would be fast enough. There is no point in me downloading the latest, biggest model if it will output 1 tok/min, but I also don't want to download the smallest model if I can run something better.
Any advice?
3090Ti seems to hold up quite well.
A basic thing to remember: any given dense model requires X GB of memory at 8-bit quantization, where X is the number of params in billions (of course I am simplifying a little by not counting context size). Quantization is just the 'precision' of the model; 8-bit generally works really well. Generally speaking, it's not worth even bothering with models whose param count exceeds your hardware's VRAM. Some people get around that by using a 4-bit quant, trading some precision for half the VRAM. YMMV depending on use-case.
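A tiny back-of-envelope version of that rule (weights only, ignoring context/KV cache as the comment says; the model names are just examples from the Qwen3 lineup):

    def approx_weights_gb(params_billion, bits=8):
        # Weights only: params * (bits / 8) bytes; ignores KV cache and runtime overhead.
        return params_billion * bits / 8

    for name, b in [("Qwen3-8B", 8), ("Qwen3-14B", 14), ("Qwen3-32B", 32)]:
        print(f"{name}: ~{approx_weights_gb(b, 8):.0f} GB at 8-bit, "
              f"~{approx_weights_gb(b, 4):.0f} GB at 4-bit")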
I know this is crazy to hear, because the big-iron folks still debate 16 vs 32, and 8 vs 16 is near verboten in public conversation.
I contribute to llama.cpp and have seen many, many efforts to measure evaluation perf of various quants, and no matter which way it was sliced (ranging from subjective volunteers doing A/B voting on responses over months, to objective perplexity loss), Q4 is indistinguishable from the original.
Niche of a niche, but just kind of interesting how the quantization jostles the name recall.
For larger models.
For smaller models, about 12B and below, there is a very noticeable degradation.
At least that's my experience generating answers to the same questions across several local models like Llama 3.2, Granite 3.1, Gemma2 etc and comparing Q4 against Q8 for each.
The smaller Q4 variants can be quite useful, but they consistently struggle more with prompt adherence and recollection especially.
Like if you tell it to generate some code without explaining the generated code, a smaller Q4 is significantly more likely to explain the code regardless, compared to Q8 or better.
All the models I have explored seem to capture nuance of understanding in the floats. It makes sense: initially training regresses to the mean, then slowly locks in lower and lower significant figures to capture subtleties and natural variance in things.
So, the further you stray from average conversation, the worse a model will do, as a function of its quantisation.
So, if you don't need nuance, subtlety, etc., say for a document-summary bot for technical things, 4-bit might genuinely be fine. However, if you want something that can deal with highly subjective material, where answers need to be tailored to a user, using in-context learning of user preferences etc., then 4-bit tends to struggle badly unless the user aligns closely with the training distribution's mean.
https://gist.github.com/estsauver/a70c929398479f3166f3d69bce...
Here's a video of the second config run I ran so you can see both all of the parameters as I have them configured and a qualitative experience.
I’ve run llama and gemma3 on a base MacMini and it’s pretty decent for text processing. It has 16GB ram though which is mostly used by the GPU with inference. You need more juice for image stuff.
My son’s gaming box has a 4070 and it’s about 25% faster the last time I compared.
The mini is so cheap it’s worth trying out - you always find another use for it. Also the M4 sips power and is silent.
https://ollama.com/library/qwen3:8b-q4_K_M
For fast inference, you want a model that will fit in VRAM, so that none of the layers need to be offloaded to the CPU.
4-bit was "fine", but a smaller 8-bit version beat it in quality for the same speed
Not simply, no.
But start with parameters close to but less than VRAM and decide if performance is satisfactory and move from there. There are various methods to sacrifice quality by quantizing models or not loading the entire model into VRAM to get slower inference.
Which may make it sound like it's more complicated than a back-of-the-napkin calculation should be, but there are just too many nuances for perf.
Very generally, at this point I'd expect a 4B model at 10 tok/s on a smartphone with 8GB of RAM from 2 years ago. I'd expect you'd get somewhat similar; my guess would be 6 tok/s at 4B (assuming the rest of the HW is 2018-era and you'll rely on GPU inference and RAM).
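One crude way to sanity-check numbers like that: single-stream decoding is usually memory-bandwidth bound, so an upper bound on tokens/sec is roughly memory bandwidth divided by the bytes of active weights read per token. A sketch with illustrative, assumed bandwidth figures:

    def max_tok_per_s(active_params_billion, weight_bits, bandwidth_gb_s):
        # Decoding reads roughly all active weights once per token, so bandwidth
        # sets a ceiling; real throughput is lower due to overhead.
        bytes_per_token_gb = active_params_billion * weight_bits / 8
        return bandwidth_gb_s / bytes_per_token_gb

    # Assumed bandwidth numbers for illustration; check your own hardware's specs.
    print(max_tok_per_s(4, 4, 50))    # ~25 tok/s ceiling: 4B model, 4-bit, ~50 GB/s phone RAM
    print(max_tok_per_s(3, 4, 400))   # ~267 tok/s ceiling: 3B active params on a ~400 GB/s GPU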
The new battlefront may be to take reasoning to the level of abstraction and creativity to handle math problems without a numerical answer (for ex: https://arxiv.org/pdf/2503.21934).
I suspect that kind of ability will generalize well to other areas and be a significant step toward human level thinking.
Until now I found that open weight models were either not as good as their proprietary counterparts or too slow to run locally. This looks like a good balance.
I do not know much about the benchmarks but the two coding ones look similar.
~34 tok/s on a Radeon RX 7900 XTX under today's Debian 13.
ollama 0.6.6 invoked with:
# server
OLLAMA_FLASH_ATTENTION=1 OLLAMA_KV_CACHE_TYPE=q8_0 ollama serve
# client
ollama run --verbose qwen3:30b-a3b
~19.8 GiB with: /set parameter num_ctx 32768
TY for this.
update: wow, it's quite fast - 70-80t/s on LM Studio with a few other applications using GPU.
Update: I tested "ollama run qwen3:30b" (the MoE) locally and while it thought much it wasn't that smart. After 3 follow up questions it ended up in an infinite loop.
I just tried again, and it ended up in an infinite loop immediately, just a single prompt, no follow-up: "Write a Python script to build a Fitch parsimony tree by stepwise addition. Take a Fasta alignment as input and produce a nwk string as output."
Update 2: The dense one "ollama run qwen3:32b" is much better (albeit slower of course). It still keeps on thinking for what feels like forever until it misremembers the initial prompt.
Instead of this very high latency remote debugging process with strangers on the internet, you could just try out properly configured models on the hosted Qwen Chat. Obviously the privacy implications are different, but running models locally is still a fiddly thing even if it is easier than it used to be, and configuration errors are often mistaken for bad model performance. If the models meet your expectations in a properly configured cloud environment, then you can put in the effort to figure out local model hosting.
I wrote a step-by-step guide on how to setup Ollama with larger context length a while ago: https://prompt.16x.engineer/guide/ollama
TLDR
ollama run deepseek-r1:14b
/set parameter num_ctx 8192
/save deepseek-r1:14b-8k
ollama serve
qwen3:30b has the same checksum as https://ollama.com/library/qwen3:30b-a3b-q4_K_M
You can see it's roughly the same size as the one in the official repo (16 files of 4GB each):
If they do open-source something, I expect them to open-source some existing model (maybe something useless like gpt-3.5) rather than provide something new.
Still topping a lot of leaderboards, but severely reduced rep. Chaotic naming, the „ClosedAI" image, being undercut on pricing, competitors with much better licensing / open weights, Stargate talk about Europe, Claude being seen as superior for coding, etc. Nothing end-of-the-world, but a lot of lukewarm misses.
If I were an investor whose financials basically require magical returns from them to justify the valuations, I'd be worried.
Out of all the Qwen3 models on Hugging Face, it's the most downloaded/hearted. https://huggingface.co/collections/Qwen/qwen3-67dd247413f0e2...
My concern with these models, though, is unfortunately that the architectures seem to vary a bit, so idk how it'll work.
It takes 5 minutes to generate first non-thinking token in my testing for a slightly complex task via Parasail and Deepinfra on OpenRouter.
https://x.com/paradite_/status/1917067106564379070
Update:
Finally got it work after waiting for 10 minutes.
Published my eval result, surprisingly non-thinking version did slightly better on visualization task: https://x.com/paradite_/status/1917087894071873698
Edit: The larger models have 128k context length. 32k thinking comes from the chart which looks like it's for the 235B, so not full length.
I like gemini 2.5 pro a lot bc its fast af but it struggles some times when context is half used to effectively use tools and make edits and breaks a lot of shit (on cursor)
We're really getting close to the point where local models are good enough to handle practically every task that most people need to get done.
After trying to implement a simple assistant/helper with GPT-4.1 and getting some dumb behavior from it, I doubt even proprietary models are good enough for every task.
You should try a different model for your task.
Again, very rough numbers, there's calculators online.
Seems like 4 bit quantized models would use 1/2 the number of billions of parameters in bytes, because each parameter is half a byte, right?
Of course, there are lots of factors that can change the RAM usage: quantization, context size, KV cache. And this says nothing about whether the model will respond quickly enough to be pleasant to use.
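A slightly fuller back-of-envelope that also includes a KV-cache term; the layer/head/context numbers below are illustrative placeholders, not any specific model's config:

    def est_ram_gb(params_b, weight_bits=4, n_layers=36, n_kv_heads=8,
                   head_dim=128, ctx_len=8192, kv_bits=16):
        # Weights at the chosen quantization...
        weights = params_b * 1e9 * weight_bits / 8
        # ...plus KV cache: 2 (K and V) * layers * kv_heads * head_dim * context * precision.
        kv = 2 * n_layers * n_kv_heads * head_dim * ctx_len * kv_bits / 8
        return (weights + kv) / 1e9

    # Illustrative: an 8B-class model, 4-bit weights, 8k context, fp16 KV cache.
    print(f"~{est_ram_gb(8):.1f} GB plus runtime overhead")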
And why did they hide the generalist benchmarks like MMLU-pro & TruthfulQA?
I wish we had proper public benchmarks that are up to date. LMarena was proven useless by the Llama4 scandal, and LiveBench is unrealistic and misses too many models.
https://raw.githubusercontent.com/KCORES/kcores-llm-arena/re...
How does "open-weighting" help other researchers/companies?
Using the model to generate synthetic data also allows you to distil its reasoning power into other models that you train, which is very powerful.
On top of these, Qwen's technical reports follow model releases by some time, they're generally very information rich. For instance, check this report for Qwen Omni, it's really good: https://huggingface.co/papers/2503.20215
Ask something like: "Ravioli: x = y: France, what could be x and y?" (it thought for 500s and the answers were "weird")
Or "Order from left to right these items ..." and give partial information on their relative position, eg Laptop is on the left of the cup and the cup is between the phone and the notebook. (Didn't have enough patience nor time to wait the thinking procedure for this)
I've had much better results from non-"reasoning" models by judging their output, doing actual reasoning myself, and then feeding new ideas back to them to steer the conversation. This too can go astray, as most LLMs tend to agree with whatever the human says, so this hinges on me being actually right.
Qwen3-30B-A3B, a MoE 30B - with only 3B active at any one time I presume? - 4bit MLX in lmstudio, with speculative decoding via Qwen3-0.6B 8bit MLX, on an oldish M2 mbp first try delivered 24 tps(!!) -
24.29 tok/sec • 1953 tokens • 3.25s to first token • Stop reason: EOS Token Found • Accepted 1092/1953 draft tokens (55.9%)
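For anyone wondering what the "Accepted 1092/1953 draft tokens" line means, here is a toy sketch of the speculative-decoding loop behind it; `target` and `draft` are stand-in callables, and real implementations verify the whole draft in one batched forward pass with a probabilistic accept/reject rule rather than this greedy version:

    def speculative_step(target, draft, prefix, k=4):
        # `target` and `draft` stand in for the big and small model: callables
        # that map a token prefix to the next token. Toy greedy version only.
        proposal, ctx = [], list(prefix)
        for _ in range(k):                 # 1. cheap draft model proposes k tokens
            t = draft(ctx)
            proposal.append(t)
            ctx.append(t)

        accepted, ctx = [], list(prefix)
        for t in proposal:                 # 2. keep the longest prefix the target agrees with
            if target(ctx) == t:
                accepted.append(t)         #    agreement means the draft token was "free"
                ctx.append(t)
            else:
                break
        accepted.append(target(ctx))       # 3. target always contributes one token of its own
        return accepted                    # the 55.9% above is accepted / proposed draft tokens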
Thank you to LMStudio, MLX and huggingface too. :-) After decades of not finding enough reasons for an MBP, suddenly ASI was it. And it's delivered beyond any expectations I had, already.
Did I mention I seem to have become NN PDP enthusiast, an AI maximalist? ;-) I thought them people over-excitable, if benevolent. Then the thought of trusting Trump-Putin on decisions like thermo-nuclear war ending us all, over ChatGPT and its reasoning offspring, converted me. AI is our only chance at existential salvation--ignore doom risk.
As for other possible technologies, I'm most excited about clone-structured causal graphs[1].
What's very special about them is that they are apparently a 1:1 algorithmic match to what happens in the hippocampus during learning[2], to my knowledge this is the first time an actual end-to-end algorithm has been replicated from the brain in fields other than vision.
[1] "Clone-structured graph representations enable flexible learning and vicarious evaluation of cognitive maps" https://www.nature.com/articles/s41467-021-22559-5
[2] "Learning produces an orthogonalized state machine in the hippocampus" https://www.nature.com/articles/s41586-024-08548-w
They do improve on literally all of these, at incredible speed and without much sign of slowing down.
Are you asking for a technical innovation that will just get from 0 to perfect AI? That is just not how reality usually works. I don't see why of all things AI should be the exception.
On the other hand, LLM progress feels like bullshit: gaming benchmarks and other problems occurred. So either in two years all hail our AGI/AMI (machine intelligence) overlords, or the bubble bursts.
And in discussions "is it or isn't it, AI smarter than HI already", reminds me to "remember how 'smart' an average HI is, then remember half are to the left of that center". :-O
"Give me a PowerShell script to get all users with an email address, and active license, that have not authed through AD or Azure in the last 30 days. Now take those, compile all the security groups they are members of, and check out the file share to find any root level folders that these members have access to and check the audit logs to see if anyone else has accessed them. If not, dump the paths into a csv at C:\temp\output.csv."
Can I write that myself? Yes. In 20 seconds? Absolutely not. These things are saving me hours daily.
I used to save stuff like this and cobble the pieces together to get things done. I don't save any of them anymore because I can for the most part 1 shot anything I need.
Just because it's not discovering new physics doesn't mean it's not insanely useful or valuable. LLMs have probably 5x'd me.
In my early tests however, several of the advertised languages are not really well supported and the model is outputting something that only barely resembles them.
Probably a dataset quality issue for low-resource languages that they cannot personally check for, despite the “119 languages and dialects” claim.
"Write a Golang program that merges huge presorted text files, just like sort -m does". Quite often models need "use heap" as guidance, but this time big model figured it out by itself.
I haven't tried this model yet and am not in a position to for a couple days, and am wondering if anyone feels that with these.
Small and dense models are what local people really need.
Although benchmaxxing is not good, I still find this release valuable. Thank you Qwen.
Small and dense models are what local people really need.
Disagreed. Small and dense is dumber and slower for local inferencing. MoEs are what people actually want locally.
Parameter efficiency is an important consideration, if not the most important one, for local LLMs because of the hardware constraint.
Do you guys really have GPUs with 80GB of VRAM or an M3 Ultra with 512GB of RAM at home? If I can't run these ultra-large MoEs locally, then these models mean nothing to me. I'm not a large LLM inference provider, after all.
What's more, you also lose the opportunity to fine-tune these MoEs, when it's already hard even to do inference with them.
Running a 7b model for fun is not what people actually want. 7b models are very niche oriented.
Not sure what it's supposed to be used for.
As this is in trillions, where does this amount of material come from?
This is what gpt-5 was supposed to have right? How is this implemented under the hood? Since non-thinking mode is just an empty chain-of-thought, why can't any reasoning model be used in a "non-thinking mode"?
https://developers.googleblog.com/en/start-building-with-gem...
It is by far the best local model I've ever used. Very impressed so far.
Llama 4 was a massive disappointment, so I'm having a blast.
Claude Sonnet 3.7 is still better though.
---
Also very impressed with qwen3-30b-a3b - so fast for how smart it is (i am using the 0.6b for speculative decoding). very fun to use.
---
I'm finding that the models want to give over-simplified solutions, and I was initially disappointed, but I added some stuff about how technical solutions should be written in the system prompt and they are being faithful to it.
https://ollama.com/library/qwen3:235b-a22b-q4_K_M
Is there a smaller one?
[1] - https://huggingface.co/unsloth/Qwen3-235B-A22B-GGUF/tree/mai...
Then I asked Qwen to translate the output, and it went into a loop, producing some horror-movie lines:
"I figured out everything backwards." "This is the official most favorite office." "Net comes to the most official offices." "This float is specific." "Double meanings: What is 'queset à la rispostat'?" "What was reversed in real-time character study is specific." "This is an open date for a piracy group." "This is an open date for a piracy group (this is a period statement)." "Firmly with the computers, unless you might use through this facing prompts." "Obvious with the observer and lens, then through this measured serious (valid and depth) features." "Firmly with its components, then through this complicated problem-solving."
"This is the most intense friend's expectation." "This is the most intense friend's expectation." "Science is implementing friends." "Science is implementing friends." "Science is implementing friends." "Science is the most intense friend." "Science is the most intense friend."
.... and finally I asked DeepSeek V3 to translate it
.... and DeepSeek is in a loop, writing:
Science is friend. Science is friend. Science is friend. Science is friend. Science is friend. Science is friend. Science is friend. Science is friend. Science is friend. Science is friend. Science is friend. Science is friend. Science is friend. Science is friend. Science is friend. Science is friend. Science is friend. Science is friend. Science is friend. Science is friend. Science is friend. Science is friend. Science is friend. Science is friend. Science is friend. Science is friend. Science is friend. Science is friend. Science is friend. Science is friend. Science is friend. Science is friend. Science is friend. Science is friend. Science is friend. Science is friend. Science is friend. Science is friend. Science is friend. Science is friend. Science is friend. Science is friend. Science is friend. Science is friend. Science is friend. Science is friend. Science is friend. Science is friend. Science is friend. Science is friend. S
Here is the reversed text:
"Science is friends. Science is silent friends. Science is implacable friends. Science is most silent friends. This silent summer of friends. Observable and evident with this measured (valid and depth) Server. Simply with the components are possible, then though this moderation. Use might you unless through this facing prompts. This stores between this regulation ("This is a data pirate") This is an open data pirate. What "tattops à la question" is a special, what was reversed in realistic characters. Al is a science of favorite cognition is science. Prov. Connect Nine Nine. Open a data pirate. Dad present trial è Lapad. Dablings: This is float specially. Ten come to the most official summer. Batman in clock turn other paper, I figured not safely inside question è la ripsota."
that's 3x h100?