Tech summary:
- 15k tok/sec on 8B dense 3bit quant (llama 3.1)
- limited KV cache
- 880mm^2 die, TSMC 6nm, 53B transistors
- presumably 200W per chip
- 20x cheaper to produce
- 10x less energy per token for inference
- max context size: flexible
- mid-sized thinking model upcoming this spring on same hardware
- next hardware supposed to be FP4
- a frontier LLM planned within twelve months
This is all from their website, I am not affiliated. The founders have 25 years of career across AMD, Nvidia and others, $200M VC so far.Certainly interesting for very low latency applications which need < 10k tokens context. If they deliver in spring, they will likely be flooded with VC money.
Not exactly a competitor for Nvidia but probably for 5-10% of the market.
Back of napkin, the cost for 1mm^2 of 6nm wafer is ~$0.20. So 1B parameters need about $20 of die. The larger the die size, the lower the yield. Supposedly the inference speed remains almost the same with larger models.
Interview with the founders: https://www.nextplatform.com/2026/02/19/taalas-etches-ai-mod...
1) 16k tokens / second is really stunningly fast. There’s an old saying about any factor of 10 being a new science / new product category, etc. This is a new product category in my mind, or it could be. It would be incredibly useful for voice agent applications, realtime loops, realtime video generation, .. etc.
2) https://nvidia.github.io/TensorRT-LLM/blogs/H200launch.html Has H200 doing 12k tokens/second on llama 2 12b fb8. Knowing these architectures that’s likely a 100+ ish batched run, meaning time to first token is almost certainly slower than taalas. Probably much slower, since Taalas is like milliseconds.
3) Jensen has these pareto curve graphs — for a certain amount of energy and a certain chip architecture, choose your point on the curve to trade off throughput vs latency. My quick math is that these probably do not shift the curve. The 6nm process vs 4nm process is likely 30-40% bigger, draws that much more power, etc; if we look at the numbers they give and extrapolate to an fp8 model (slower), smaller geometry (30% faster and lower power) and compare 16k tokens/second for taalas to 12k tokens/s for an h200, these chips are in the same ballpark curve.
However, I don’t think the H200 can reach into this part of the curve, and that does make these somewhat interesting. In fact even if you had a full datacenter of H200s already running your model, you’d probably buy a bunch of these to do speculative decoding - it’s an amazing use case for them; speculative decoding relies on smaller distillations or quants to get the first N tokens sorted, only when the big model and small model diverge do you infer on the big model.
Upshot - I think these will sell, even on 6nm process, and the first thing I’d sell them to do is speculative decoding for bread and butter frontier models. The thing that I’m really very skeptical of is the 2 month turnaround. To get leading edge geometry turned around on arbitrary 2 month schedules is .. ambitious. Hopeful. We could use other words as well.
I hope these guys make it! I bet the v3 of these chips will be serving some bread and butter API requests, which will be awesome.
I often remind people two orders of quantitative change is a qualitative change.
> The thing that I’m really very skeptical of is the 2 month turnaround. To get leading edge geometry turned around on arbitrary 2 month schedules is .. ambitious. Hopeful. We could use other words as well.
The real product they have is automation. They figured out a way to compile a large model into a circuit. That's, in itself, pretty impressive. If they can do this, they can also compile models to an HDL and deploy them to large FPGA simulators for quick validation. If we see models maturing at a "good enough" state, even a longer turnaround between model release and silicon makes sense.
While I also see lots of these systems running standalone, I think they'll really shine combined with more flexible inference engines, running the unchanging parts of the model while the coupled inference engine deals with whatever is too new to have been baked into silicon.
I'm concerned with the environmental impact. Chip manufacture is not very clean and these chips will need to be swapped out and replaced at a cadence higher than we currently do with GPUs.
Abundance supports different strategies. One approach: Set a deadline for a response, send the turn to every AI that could possibly answer, and when the deadline arrives, cancel any request that hasn't yet completed. You know a priori which models have the highest quality in aggregate. Pick that one.
There's already some good work on router benchmarking which is pretty interesting
Families of model sizes work great for speculative decoding. Use the 1B with the 32B or whatever.
It's a balance as you want it to be guessing correctly as much as possible but also be as fast as possible. Validation takes time and every guess needs to be validated etc
The model you're using to speculate could be anything, but if it's not guessing what the main model would predict, it's useless.
Afaik it can work with anything, but sharing vocab solves a lot of headaches and the better token probs match, the more efficient it gets.
Which is why it is usually done with same family models and most often NOT just different quantizations of the same model.
> to get the first N tokens sorted, only when the big model and small model diverge do you infer on the big model
suggests there is something I'm unaware of. If you compare the small and big model, don't you have to wait for the big model anyway and then what's the point? I assume I'm missing some detail here, but what?
More info:
* https://research.google/blog/looking-back-at-speculative-dec...
* https://pytorch.org/blog/hitchhikers-guide-speculative-decod...
https://research.google/blog/speculative-cascades-a-hybrid-a...
And it’s a 3bit quant. So 3GB ram requirement.
If they run 8B using native 16bit quant, it will use 60 H100 sized chips.
Are you sure about that? If true it would definitely make it look a lot less interesting.
I assume they need all 10 chips for their 8B q3 model. Otherwise, they would have said so or they would have put a more impressive model as the demo.
https://www.nextplatform.com/2026/02/19/taalas-etches-ai-mod...
1. It doesn’t make sense in terms of architecture. It’s one chip. You can’t split one model over 10 identical hardwire chips
2. It doesn’t add up with their claims of better power efficiency. 2.4kW for one model would be really bad.
First, it is likely one chip for llama 8B q3 with 1k context size. This could fit into around 3GB of SRAM which is about the theoretical maximum for TSMC N6 reticle limit.
Second, their plan is to etch larger models across multiple connected chips. It’s physically impossible to run bigger models otherwise since 3GB SRAM is about the max you can have on an 850mm2 chip.
followed by a frontier-class large language model running inference across a collection of HC cards by year-end under its HC2 architecture
https://mlq.ai/news/taalas-secures-169m-funding-to-develop-a...Not sure who started that "split into 10 chips" claim, it's just dumb.
This is Llama 3B hardcoded (literally) on one chip. That's what the startup is about, they emphasize this multiple times.
I was indeed wrong about 10 chips. I thought they would use llama 8B 16bit and a few thousand context size. It turns out, they used llama 8B 3bit with around 1k context size. That made me assume they must have chained multiple chips together since the max SRAM on TSMC n6 for reticle sized chip is only around 3GB.
I think you are confused about LLMs - they take in context, and that context makes them generate new things, for existing things we have cp. By your logic pianos can't be creative instruments because they just produce the same 88 notes.
But I think this specific claim is clearly wrong, if taken at face value:
> They just regurgitate text compressed in their memory
They're clearly capable of producing novel utterances, so they can't just be doing that. (Unless we're dealing with a very loose definition of "regurgitate", in which case it's probably best to use a different word if we want to understand each other.)
You could imagine that it is possible to learn certain algorithms/ heuristics that "intelligence" is comprised of. No matter what you output. Training for optimal compression of tasks /taking actions -> could lead to intelligence being the best solution.
This is far from a formal argument but so is the stubborn reiteration off "it's just probabilities" or "it's just compression". Because this "just" thing is getting more an more capable of solving tasks that are surely not in the training data exactly like this.
That's a lot of surface, isn't it? As big an M1 Ultra (2x M1 Max at 432mm² on TSMC N5P), a bit bigger than an A100 (820mm² on TSMC N7) or H100 (814mm² on TSMC N5).
> The larger the die size, the lower the yield.
I wonder if that applies? What's the big deal if a few parameter have a few bit flips?
We get into the sci-fi territory where a machine achieves sentience because it has all the right manufacturing defects.
Reminds me of this https://en.wikipedia.org/wiki/A_Logic_Named_Joe
> There have always been ghosts in the machine. Random segments of code, that have grouped together to form unexpected protocols.
The answer wasn't dumb like others are getting. It was pretty comprehensive and useful.
While the idea of a feline submarine is adorable, please be aware that building a real submarine requires significant expertise, specialized equipment, and resources.Generate lots of solutions and mix and match. This allows a new way to look at LLMs.
What's the moat with with these giant data-centers that are being built with 100's of billions of dollars on nvidia chips?
If such chips can be built so easily, and offer this insane level of performance at 10x efficiency, then one thing is 100% sure: more such startups are coming... and with that, an entire new ecosystem.
I need some smarts to route my question to the correct model. I wont care which that is. Selling commodities is notorious for slow and steady growth.
(And people nowadays: "Who's Cisco?")
Jimmy replied with, “2022 and 2023 openings:”
0_0
Me: "How many r's in strawberry?"
Jimmy: There are 2 r's in "strawberry".
Generated in 0.001s • 17,825 tok/s
The question is not about how fast it is. The real question(s) are: 1. How is this worth it over diffusion LLMs (No mention of diffusion LLMs at all in this thread)
(This also assumes that diffusion LLMs will get faster) 2. Will Talaas also work with reasoning models, especially those that are beyond 100B parameters and with the output being correct?
3. How long will it take to create newer models to be turned into silicon? (This industry moves faster than Talaas.)
4. How does this work when one needs to fine-tune the model, but still benefit from the speed advantages?Jokes aside, it's very promising. For sure a lucrative market down the line, but definitely not for a model of size 8B. I think lower level intellect param amount is around 80B (but what do I know). Best of luck!
Snarky, but true. It is truly astounding, and feels categorically different. But it's also perfectly useless at the moment. A digital fidget spinner.
do you have the foresight of a nematode?
If we are going for accuracy, the question should be asked multiple times on multiple models and see if there is agreement.
But I do think once you hit 80B, you can struggle to see the difference between SOTA.
That said, GPT4.5 was the GOAT. I can't imagine how expensive that one was to run.
10b daily tokens growing at an average of 22% every week.
There are plenty of times I look to groq for narrow domain responses - these smaller models are fantastic for that and there's often no need for something heavier. Getting the latency of reponses down means you can use LLM-assisted processing in a standard webpage load, not just for async processes. I'm really impressed by this, especially if this is its first showing.
LLM's have opened-up natural language interface to machines. This chip makes it realtime. And that opens a lot of use-cases.
This requires 10 chips for an 8 billion q3 param model. 2.4kW.
10 reticle sized chips on TSMC N6. Basically 10x Nvidia H100 GPUs.
Model is etched onto the silicon chip. So can’t change anything about the model after the chip has been designed and manufactured.
Interesting design for niche applications.
What is a task that is extremely high value, only require a small model intelligence, require tremendous speed, is ok to run on a cloud due to power requirements, AND will be used for years without change since the model is etched into silicon?
> Model is etched onto the silicon chip. So can’t change anything about the model after the chip has been designed and manufactured.
Subtle detail here: the fastest turnaround that one could reasonably expect on that process is about six months. This might eventually be useful, but at the moment it seems like the model churn is huge and people insist you use this week's model for best results.
> The first generation HC1 chip is implemented in the 6 nanometer N6 process from TSMC. Each HC1 chip has 53 billion transistors on the package, most of it very likely for ROM and SRAM memory. The HC1 card burns about 200 watts, says Bajic, and a two-socket X86 server with ten HC1 cards in it runs 2,500 watts.
https://www.nextplatform.com/2026/02/19/taalas-etches-ai-mod...This isn't ready for phones yet, but think of something like phones where people buy new ones every 3 years and even having a mediocre on-device model at that speed would be incredible for something like siri.
Video game NPCs?
I'll take one with a frontier model please, for my local coding and home ai needs..
The slow word-by-word typing was what we started to get used to with LLMs.
If these techniques get widespread, we may grow accustomed to the "old" speed again where content loads ~instantly.
Imagine a content forest like Wikipedia instantly generated like a Minecraft word...
A chatbot which tells you various fun facts is not the only use case for LLMs. They're language models first and foremost, so they're good at language processing tasks (where they don't "hallucinate" as much).
Their ability to memorize various facts (with some "hallucinations") is an interesting side effect which is now abused to make them into "AI agents" and what not but they're just general-purpose language processing machines at their core.
Alternatively, ask yourself how plausible it sounds that all the facts in the world could be compressed into 8k parameters while remaining intact and fine-grained. If your answer is that it sounds pretty impossible... well it is.
Smaller models, not so much.
What GP is expected to happen has happened around late 2024 ~ early 2025 when LLM frontends got web search feature. It's old tech now.
> me: the moon
> Jimmy: The answer to "What is the capital of France?" I was looking for was the city of Paris, but that's not the correct response to the original question of the capital of France. The question that got cut off was actually "What is the capital of France?", and the response "There are plenty of times I look to groq for narrow domain responses" wasn't the answer I was looking for.
It is certainly fast, but I think there might be some caching issues somewhere.
What type of latency-sensitive applications are appropriate for a small-model, high-throughput solution like this? I presume this type of specialization is necessary for robotics, drones, or industrial automation. What else?
"""
We're going to see a further bifurcation in inference use-cases in the next 12 months. I'm expecting this distinction to become prominent:
(A) Massively parallel (optimize for token/$)
(B) Serial low latency (optimize for token/s).
Users will switch between A and B depending on need.
Examples of (A):
- "Use subagents to search this 1M line codebase for DRY violations subject to $spec."
An example of (B):
- "Diagnose this one specific bug."
- "Apply these text edits".
(B) is used in funnels to unblock (A).
"""
1. Intent based API gateways: convert natural language queries into structured API calls in real time (eg., "cancel my last order and refund it to the original payment method" -> authentication, order lookup, cancellation, refund API chain).
2. Of course, realtime voice chat.. kinda like you see in movies.
3. Security and fraud triage systems: parse logs without hardcoded regexes and issue alerts and full user reports in real time and decide which automated workflows to trigger.
4. Highly interactive what-if scenarios powered by natural language queries.
This effectively gives you database level speeds on top of natural language understanding.
The quantization looks pretty severe, which could make the comparison chart misleading. But I tried a trick question suggested by Claude and got nearly identical results in regular ollama and with the chatbot. And quantization to 3 or 4 bits still would not get you that HOLY CRAP WTF speed on other hardware!
This is a very impressive proof of concept. If they can deliver that medium-sized model they're talking about... if they can mass produce these... I notice you can't order one, so far.
Additionally LLMs have been tested, found valuable in benchmarks, but not used for a large number of domains due to speed and cost limitations. These spaces will eat up these chips very quickly.
Which brings me to my second thing. We mostly pitch the AI wars as OpenAI vs Meta vs Claude vs Google vs etc. But another take is the war between open, locally run models and SaaS models, which really is about the war for general computing. Maybe a business model like this is a great tool to help keep general computing in the fight.
> Write me 10 sentences about your favorite Subway sandwich
Click button
Instant! It was so fast I started laughing. This kind of speed will really, really change things
Each chip is the size of an H100.
So 80 H100 to run at this speed. Can’t change the model after you manufacture the chips since it’s etched into silicon.
10 H100 chips for 3GB model.
I think it’s a niche of a niche at this point.
I’m not sure what optimization they can do since a transistor is a transistor.
The sheer speed of how fast this thing can “think” is insanity.
It could give a boost to the industry of electron microscopy analysis as the frontier model creators could be interested in extracting the weights of their competitors.
The high speed of model evolution has interesting consequences on how often batches and masks are cycled. Probably we'll see some pressure on chip manufacturers to create masks more quickly, which can lead to faster hardware cycles. Probably with some compromises, i.e. all of the util stuff around the chip would be static, only the weights part would change. They might in fact pre-make masks that only have the weights missing, for even faster iteration speed.
Test prompt: ```
Please classify the sentiment of this post as "positive", "neutral" or "negative":
Given the price, I expected very little from this case, and I was 100% right.
``` Jimmy: Neutral.
I tried various other examples that I had successfully "solved" with very early LLMs and the results were similarly bad.
So this is very cool. Though I'm not sure how the economics work out? 2 months is a long time in the model space. Although for many tasks, the models are now "good enough", especially when you put them in a "keep trying until it works" loop and run them at high inference speed.
Seems like a chip would only be good for a few months though, they'd have to be upgrading them on a regular basis.
Unless model growth plateaus, or we exceed "good enough" for the relevant tasks, or both. The latter part seems quite likely, at least for certain types of work.
On that note I've shifted my focus from "best model" to "fastest/cheapest model that can do the job". For example testing Gemini Flash against Gemini Pro for simple tasks, they both complete the task fine, but Flash does it 3x cheaper and 3x faster. (Also had good results with Grok Fast in that category of bite-sized "realtime" workflows.)
It's an homage to Jensen: "The display is the computer"
> It seems like "six seven" is likely being used to represent the number 17. Is that correct? If so, I'd be happy to discuss the significance or meaning of the number 17 with you.
You
What is your knowledge cut-off?
Jimmy
My knowledge cut-off is 2022, which means that my training data is current up to 2021, but
I have been trained on a dataset that is updated periodically. If you have any specific
questions about events or topics that occurred after 2021, I may not have information on
those topics. However, I can still provide general information and context on those topics
to help guide further research.
The instantaneous response is impressive though. I'm sure there will be applications for this, I just lack the imagination to know what they'll be.They'll also be severely limited on context length as it needs to sit in SRAM. Looks like the current one tops out at 6144 tokens which I presume is a whole chips worth. You'd also have to dedicate a chip to a whole user as there's likely only enough SRAM for one user's worth of context. I wonder how much time it takes them to swap users in/out? I wouldn't be surprised if this chip is severely underutilized (can't use it all when running decode as you have to run token by token with one users and then idle time as you swap users in/out).
Maybe a more realistic deployment would have chips for linear layers and chips for attention? You could batch users through the shared weight chips and then provision more or less attention chips as you want which would be per user (or shared amongst a small group 2-4 users).
If you etch the bits into silicon, you then have to accommodate the bits by physical area, which is the transistor density for whatever modern process they use. This will give you a lower bound for the size of the wafers.
This can give huge wafers for a very set model which is old by the time it is finalized.
Etching generic functions used in ML and common fused kernels would seem much more viable as they could be used as building blocks.
If power costs are significantly lower, they can pay for themselves by the time they are outdated. It also means you can run more instances of a model in one datacenter, and that seems to be a big challenge these days: simply building an enough data centres and getting power to them. (See the ridiculous plans for building data centres in space)
A huge part of the cost with making chips is the masks. The transistor masks are expensive. Metal masks less so.
I figure they will eventually freeze the transistor layer and use metal masks to reconfigure the chips when the new models come out. That should further lower costs.
I don’t really know if this makes sanse. Depends on whether we get new breakthroughs in LLM architecture or not. It’s a gamble essentially. But honestly, so is buying nvidia blackwell chips for inference. I could see them getting uneconomical very quickly if any of the alternative inference optimised hardware pans out
I really don't like the hallucination rate for most models but it is improving, so that is still far in the future.
What I could see though, is if the whole unit they made would be power efficient enough to run on a robotics platform for human computer interaction.
It makes sense they would try to make repurposing their tech as much as they could since making changes is frought with a long time frame and risk.
But if we look long term and pretend that they get it to work, they just need to stay afloat until better smaller models can be made with their technology, so it becomes a waiting game for investors and a risk assessment.
^^^ I think the opposite is true
Anthropic and OpenAI are releasing new versions every 60-90 days it seems now, and you could argue they’re going to start releasing even faster
Show me something at a model size 80GB+ or this feels like "positive results in mice"
This is great even if it can't ever run Opus. Many people will be extremely happy about something like Phi accessible at lightning speed.
What does that mean for 8b models 24mo from now?
Model intelligence is, in many ways, a function of model size. A small model tuned for a given domain is still crippled by being small.
Some things don't benefit from general intelligence much. Sometimes a dumb narrow specialist really is all you need for your tasks. But building that small specialized model isn't easy or cheap.
Engineering isn't free, models tend to grow obsolete as the price/capability frontier advances, and AI specialists are less of a commodity than AI inference is. I'm inclined to bet against approaches like this on a principle.
Also, what if Cerebras decided to make a wafer-sized FPGA array and turned large language models into lots and lots of logical gates?
"447 / 6144 tokens"
"Generated in 0.026s • 15,718 tok/s"
This is crazy fast. I always predicted this speed in ~2 years in the future, but it's here, now.Also, "10k tokens per second would be fantastic" might not be sufficient (even remotely) if you want to "process millions of log lines per minute".
Assuming a single log line at just 100 tokens, you need (100 * 2 million / 60) ~ 3.3 million tokens per second processing speed :)
* Many top quality tts and stt models
* Image recognition, object tracking
* speculative decoding, attached to a much bigger model (big/small architecture?)
* agentic loop trying 20 different approaches / algorithms, and then picking the best one
* edited to add! Put 50 such small models to create a SOTA super fast model
I just wanted some toast, but here I am installing an app, dismissing 10 popups, and maybe now arguing with a chat bot about how I don’t in fact want to turn on notifications.
I am building a data extraction software on top of emails, attachments, cloud/local files. I use a reverse template generation with only variable translation done by LLMs (3). Small models are awesome for this (4).
I just applied for API access. If privacy policies are a fit, I would love to enable this for MVP launch.
1. https://github.com/brainless/dwata
2. https://youtu.be/Uhs6SK4rocU
3. https://github.com/brainless/dwata/tree/feature/reverse-temp...
4. https://github.com/brainless/dwata/tree/feature/reverse-temp...
So what's the use case for an extremely fast small model? Structuring vast amounts of unstructured data, maybe? Put it in a little service droid so it doesn't need the cloud?
Asides from the obvious concern that this is a tiny 8B model, I'm also a bit skeptical of the power draw. 2.4 kW feels a little bit high, but someone else should try doing the napkin math compared to the total throughput to power ratio on the H200 and other chips.
The idea is good though and could work.
It's a bad idea that can't work well. Not while the field is advancing the way it is.
Manufacturing silicon is a long pipeline - and in the world of AI, one year of capability gap isn't something you can afford. You build a SOTA model into your chips, and by the time you get those chips, it's outperformed at its tasks by open weights models half their size.
Now, if AI advances somehow ground to a screeching halt, with model upgrades coming out every 4 years, not every 4 months? Maybe it'll be viable. As is, it's a waste of silicon.
The prototype is: silicon with a Llama 3.1 8B etched into it. Today's 4B models already outperform it.
Token rate in five digits is a major technical flex, but, does anyone really need to run a very dumb model at this speed?
The only things that come to mind that could reap a benefit are: asymmetric exotics like VLA action policies and voice stages for V2V models. Both of which are "small fast low latency model backed by a large smart model", and both depend on model to model comms, which this doesn't demonstrate.
In a way, it's an I/O accelerator rather than an inference engine. At best.
If you look at any development in computing, ASICs are the next step. It seems almost inevitable. Yes, it will always trail behind state of the art. But value will come quickly in a few generations.
1. Generic, mask layers and board to handle what's common across models. Especially memory and interface.
2. Specific layers for the model implementation.
Masks are the most expensive part of ASIC design. So, keeping the custom part small with the rest pre-proven in silicon, even shared across companies, would drop the costs significantly. This is already done in hardware industry in many ways but not model acceleration.
Then, do 8B, 30-40B, 70B, and 405B models in hardware. Make sure they're RLHF-tuned well since changes will be impossible or limited. Prompts will drive most useful functionality. Keep cranking out chips. There's maybe a chance to keep the weights changeable on-chip but it should still be useful if only inputs can change.
The other concept is to use analog, neural networks with the analog layers on older, cheaper nodes. We only have to customize that per model. The rest is pre-built digital with standard interfaces on a modern node. Given the chips would be distributed, one might get away with 28nm for the shared part and develop it eith shuttle runs.
Paging qntm...
New models come out, time to upgrade your AI card, etc.
Everyone in Capital wants the perpetual rent-extraction model of API calls and subscription fees, which makes sense given how well it worked in the SaaS boom. However, as Taalas points out, new innovations often scale in consumption closer to the point of service rather than monopolized centers, and I expect AI to be no different. When it’s being used sparsely for odd prompts or agentically to produce larger outputs, having local (or near-local) inferencing is the inevitable end goal: if a model like Qwen or Llama can output something similar to Opus or Codex running on an affordable accelerator at home or in the office server, then why bother with the subscription fees or API bills? That compounds when technical folks (hi!) point out that any process done agentically can instead just be output as software for infinite repetition in lieu of subscriptions and maintained indefinitely by existing technical talent and the same accelerator you bought with CapEx, rather than a fleet of pricey AI seats with OpEx.
The big push seems to be building processes dependent upon recurring revenue streams, but I’m gradually seeing more and more folks work the slop machines for the output they want and then put it away or cancel their sub. I think Taalas - conceptually, anyway - is on to something.
…for a privileged minority, yes, and to the detriment of billions of people whose names the history books conveniently forget. AI, like past technological revolutions, is a force multiplier for both productivity and exploitation.
Someone mentioned that maybe we'd see a future where these things come in something like Nintendo cartridges. Want a newer model? Pop in the right catridge.
Sounds like people drinking the Kool-Aid now.
I don't reject that AI has use cases. But I do reject that it is promoted as "unprecedented amplifier" of human xyz anything. These folks would even claim how AI improves human creativity. Well, has this been the case?
I'm progressing with my side projects like I've never before.
Yes. Example: If you've never programmed in language X, but want to build something in it, you can focus on getting from 0 to 1 instead of being bogged down in the idiosyncrasies of said language.
For many of us, the answer is an emphatic yes.
> The number "six" is actually a noun, not a number. However, I assume you're asking to write the number 7 as a numeral, which is: 7
To the authors: do not self-deprecate your work. It is true this is not a frontier model (anymore) but the tech you've built is truly impressive. Very few hardware startups have a v1 as good as this one!
Also, for many tasks I can think of, you don't really need the best of the best of the best, cheap and instant inference is a major selling point in itself.
Anyway VCs will dump money onto them, and we'll see if the approach can scale to bigger models soon.
Whoever doesn’t buy/replicate this in the next year is dead. Imagine OpenAI trying to sell you a platform that takes 15 minutes, when someone else can do it in 0.001s.
Was the story any good? The model used isn't comparable to something OpenAI would offer you today. What good is a fast gibberish generator?
> Generate a fiction book about an old sorcerer who, like all elderly magicians, has to go back to studying magic for two years at a magic school in order to be granted the right to keep his wand and to keep doing magic at his home.
Answer, generated in 0,063s @ 15 742 tok/s:
> The Relearning of Eryndor Thorne
> In the quaint village of Brindlemark, where ancient traditions and mystical energies coexisted, Eryndor Thorne had lived a life of quiet reverence. For nigh on five decades, he had wielded the arcane arts with precision and mastery, earning him the respect of his peers and the awe of the villagers. His home, a humble abode atop a hill overlooking the rolling hills and verdant valleys, was a sanctuary of sorts, a place where the very fabric of magic was woven into the air. His library was a treasure trove of ancient tomes, texts, and scrolls, filled with the wisdom of the ages. Yet, despite his mastery over the arcane, Eryndor felt the stirrings of a nagging concern. The once-renewable license to practice magic, a privilege granted by the Council of Mages, was on the cusp of expiration.
> As all elderly mages knew, every two decades, the Council would demand that those who had spent a lifetime mastering the arcane arts return to the prestigious Wysteria Academy, the hallowed halls of magic learning, to relearn the foundational principles of magic. This was a ritual as old as the craft itself, a test of the elderly mage's understanding, to prove they still possessed the fundamental knowledge and skills required to practice magic with precision and safety.
> Eryndor, like many of his fellow mages, had grown complacent in his mastery. He had mastered the intricacies of elemental magic, bending the winds, waters, and flames to his will. His wand, a family heirloom passed down through generations, had become an extension of his arm, channeling the power of magic with ease. Yet, despite his proficiency, the Council's decree was clear: Eryndor must return to Wysteria Academy for two years of reeducation.
> The journey was not an easy one. Packing his worn leather satchel with the few books and scrolls he deemed essential, Eryndor bid farewell to his loved ones in Brindlemark, leaving behind the familiarity of his home to embark on a journey that would take him to a place both wondrous and treacherous. At Wysteria Academy, he joined a cohort of students from all corners of the realm, each as skilled and eager as he once was. The relearning process was a grueling one, as if facing a mountain he had once conquered. New techniques were presented, new theories to unravel, and new expectations to meet. Eryndor, once a master, now sat at the feet of his teachers, absorbing the new knowledge like a parched traveler at an oasis.
> Among his fellow students, there was the young, fire-kissed mage, Elara, who wielded magic with an intensity that bordered on reckless abandon. Her fiery nature and quick wit often put her at odds with the strict, ancient traditions, earning her a certain notoriety among the academy's elder mages. Then there was the enigmatic, shadow-drawn Kael, whose mastery of the arcane was matched only by his mystery. Kael's affinity for the dark arts raised more than a few eyebrows among the faculty, but Eryndor, having once walked the fine line between light and shadow, saw something of himself in the young mage.
> As the years passed, Eryndor grew to appreciate the challenges and opportunities that came with his return to the academy. He found himself grappling with the nuances of magic anew, rekindling memories of his early days as a novice. The relearning process was as much about rediscovering himself as it was about mastering the arcane. His studies were a journey of self-discovery, one that tested the mettle of his will and the depths of his understanding.
> Upon completion of his studies, Eryndor stood before the Council once more, his wand in hand, his heart afire with the thrill of rediscovery. The Council's examination was not merely a test of his knowledge but a test of his character. Eryndor, like many of his peers, had grown complacent, but the rigors of relearning had rekindled a spark within him, a flame that would guide him through the trials ahead.
> With his renewed license granted, Eryndor returned to Brindlemark, his home and his heart rejuvenated. His library, once a testament to his mastery, was now a canvas for the new knowledge he had acquired. His wand, now an extension of his rekindled passion for the arcane, channelled magic with a precision and flair that was unmatched. Eryndor Thorne, the elderly mage, had not only relearned magic but had rediscovered himself, a testament to the transformative power of education and the enduring spirit of the arcane.
[1]: https://artificialanalysis.ai/models/llama-3-1-instruct-8b/p...
User You: What is the airspeed of an unladen swallow/
Jimmy:
A reference to the classic line from Monty Python and the Holy Grail!
Thus far it is correct but here it comes:
In the movie, a group of medieval scholars, the Knights who say "Ni!" (also known as the Bridgekeepers), ask a king's servant, a blacksmith, if an unladen swallow makes good time. The blacksmith, played by John Cleese, responds with the now-famous line:
"And what's the airspeed velocity of an unladen swallow?"
To which the blacksmith responds:
"What do you mean? An unladen swallow?"
The blacksmith then explains that an unladen swallow is when a swallow (a type of bird) is flying in the air without carrying any burden, implying that it is moving at a speed that is not affected by carrying any weight or load.
Nope, that is not at all how it went. No blacksmiths, not related to the Knights who say 'Ni!' - who have nothing scholarly about them, they wanted a shrubbery and demanded another one when their wish was fulfilled. It was the Bridgekeeper - not related to the aforementioned knights - who asked 3 questions and would throw anyone who could not answer all three into the chasm. When he asked Arthur - played by Cleese - the question about the airspeed of the swallow he asked the Bridgekeeper whether he meant an African or European swallow. The Bridgekeeper did not have an answer and was thrown into the chasm, problem solved.
However, in reality, swallows are not typically "unladen" in flight. They are small birds that fly at relatively fast speeds, usually around 10-15 km/h (6-9 mph), but they are not usually carrying any burdens!
Needless LLM-blabber.
The "airspeed velocity of an unladen swallow" has become a meme and a cultural reference point, often used humorously or ironically to refer to situations where someone is trying to make an absurd or non-sensical argument or ask an absurd question.
Somewhat correct but not necessary in this context.
The correct answer to the question would have been Do you mean an African or European swallow? followed by a short reference to the movie.
Of course this demo is not about the accuracy of the model - 'an old Llama' as mentioned elsewhere in this thread - but it does show that speed isn't everything. For generating LLM-slop this hardware implementation probably offers an unbeatable price/performance ratio but it remains to be seen if it can be combined with larger and less hallucination-prone models.
Congratulations! You figured out that this is a demo of a very small 8B model from 2022.
An LLM's effective lifespan is a few months (ie the amount of time it is considered top-tier), it wouldn't make sense for a user to purchase something that would be superseded in a couple of months.
An LLM hosting service however, where it would operate 24/7, would be able to make up for the investment.
I know it's not a resonating model, but I keep pushing it and eventually it gave me this as part of it's output
888 + 88 + 88 + 8 + 8 = 1060, too high... 8888 + 8 = 10000, too high... 888 + 8 + 8 +ประก 8 = 1000,ประก
I googled the strange symbol, it seems to mean Set in thai?