208 points by datadrivenangel 5 hours ago | 47 comments
  • bastawhiz5 hours ago
    This isn't a good analysis, and it's because it keeps rounding everything up. He rounds up the cost of electricity by 10%. He has a range of power use, takes the high end (which is 2x the low end) and multiplies it by the inflated electricity cost.

    But then he talks about using a newly purchased Mac to do the inference, running at full capacity, 24/7. Why would you do that? Apple silicon is fast, but as the author points out, you're only getting 10-40 tokens per second. It's not bad, but it's not meant for this!

    It's comparing apples to oranges. Yeah, data centers don't pay residential electricity rates. Data centers use chips that are power efficient. Data centers use chips that aren't designed to be a Mac.

    Apple silicon works out pretty well if you're not burning tokens 24/7/365 and you're not buying hardware specifically to do it. I use my Mac Studio a few times a week for things that I need it for, but I can run ollama on it over the tailnet "for free". The economics work when I'm not trying to make my Mac Studio behave like an H100 cluster with liquid cooling. Which should come as no surprise to anyone: more tokens per watt on hardware that's multi-tenant with cheap electricity will pretty much always win.

    • datadrivenangel4 hours ago
      Rounding everything down in the most optimistic setting got me to $0.40 per million tokens, and OpenRouter has the same model at $0.38/Mtok.
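      A minimal sketch of that arithmetic, with every input an assumed round number rather than the article's exact figures:

```python
# Back-of-envelope electricity cost per million output tokens on a Mac.
# All inputs are illustrative assumptions, not measured values.
watts = 60             # assumed average draw during inference
usd_per_kwh = 0.17     # assumed residential electricity rate
tokens_per_sec = 25    # assumed Apple Silicon throughput (mid-range of 10-40)

seconds_per_mtok = 1_000_000 / tokens_per_sec
kwh_per_mtok = (watts / 1000) * (seconds_per_mtok / 3600)
electricity_per_mtok = kwh_per_mtok * usd_per_kwh
print(f"${electricity_per_mtok:.2f} per million tokens, electricity only")
```

      Under these assumptions electricity alone comes to roughly $0.11/Mtok; the rest of a local cost estimate is dominated by amortizing the hardware purchase over its useful life.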
      • nativeit an hour ago
        But once all that is done you still own a Mac in one case, and you don’t in the other, correct?
        • odo1242 an hour ago
          Yea this; it’s the same reason why mortgaging is cheaper than renting
          • ericpauley an hour ago
            This is far from a universal truth: https://www.nytimes.com/interactive/2024/upshot/buy-rent-cal...

            Real estate is only a clearly good investment if you ignore opportunity cost.

            • sgt 32 minutes ago
              Articles like that still miss a bit of the nuance. Imagine having your house paid for, and you grow old and you have no rent to pay. Yes, you could have invested but likely you would have spent some of that money on something else, or your investments might have not worked out so well, or any other reason. Human reasons, to be specific. Owning property is like a lock.
            • seanmcdirmid an hour ago
              You also need to pay close attention to rent vs purchase ratios. A lot of cities are cheap to rent but expensive to buy (eg beijing 10 years ago).
              • mantas 23 minutes ago
                Key word being „ago“.
          • BoorishBears15 minutes ago
            Except one day the hype will catch up to the reality that was always true: people will realize their $20,000 Mac has less utility as a "way to learn AI" than some kid's 3090 Fortnite machine, and it'll be back to below MSRP.
      • 650REDHAIR4 hours ago
        I’ll keep my data local over a $.02/mtok difference.
        • quietsegfault4 hours ago
          It’s more than just data locality. OpenRouter is faster, no? I have an M4 pro, and anything but the smallest dumbest models are unusably slow for interactive use. I personally haven’t yet found a good use case for offline/non-interactive LLM work locally.
          • datadrivenangel3 hours ago
            Yeah. The speed is the biggest issue. The intelligence of open models is good enough for serious work (though still worse than the frontier models), but the cloud models are often 3-7 times faster, and you can get more parallelization and so get speeds on the order of hundreds of tokens per second, which makes things fast!
            • freeopinion an hour ago
              Even extremely slow LLMs can generate Part B faster than I can audit Part A. So the LLM can generate Part A while I look over my email. Then it can worry over Part B while I look over Part A.

              It can worry over Part C while I have my 10:30 group meet. And it can worry over Part D while I do whatever other silly, time-wasting thing all humans do in almost all organizations. Then I still haven't reviewed Part B, yet, so the extremely slow AI is waiting on me.

              Maybe someday I'll be good enough to need faster AI so I can rewrite something like Bun in a few days. Right now, slow and local fits my use case very well.

              • quietsegfault11 minutes ago
                I don’t think it matters if you’re “good enough” or not. Much of AI development is iterative. If you context switch between A from project 1 to B from project 2 back to check A, then maybe C while B finishes up, you will lose the flow state that AI assistance can enable with speed for those who are not fluent coders.

                Sure, I can wait hours for my local model to finish, or I can spend basically as much and get the answer right away.

                There’s a lot of exciting stuff with local LLMs despite the speed, but for me I don’t have the discipline and working memory to jump from project to project.

          • threatofrain an hour ago
            And continuing the argument of "more than just...", if you stopped inferencing on your Mac you still have a generally nice computer. The difference between rent vs buy.
      • formerly_proven2 hours ago
        What is it with AI SaaS naming themselves "openxyz" when there is 0% open about them?
        • em500 2 hours ago
          They learnt from OpenAI that naming yourself open-xyz doesn't actually require opening anything.
        • debugnik an hour ago
          It's the next co-opted buzzword after "democratize".
    • faitswulff4 hours ago
      The article makes no sense. I can't use OpenRouter as a general purpose computing device. Why are we comparing a whole computer to a single purpose SaaS?
      • mpyne3 hours ago
        They're responding to the people doing things like buying the most expensive Mac they can find specifically to do local inference for their AI agents.

        Some do it to have control over their ability to use AI. Some do it because they think it will be cheaper to not have to pay a SaaS to generate tokens for them.

        But for those interested in the latter case, it seems like it's not actually cheaper after all, at least at current prices. But then I don't expect prices to drastically jump because of how much competition there is in model development.

        • datadrivenangel2 hours ago
          It's worth paying a premium for the privacy (assuming that llama.cpp and ollama aren't sending my sessions back to the cloud regardless...), and for the concerns about not getting a surprise bill.
        • dcrazy an hour ago
          You also have control over your costs. It is reasonable to assume that tokens will cost significantly more in the near to medium future as the market consolidates and subsidies decline.
      • sheepscreek2 hours ago
        No, that’s not the point. I think this is to help people who are thinking about getting a beefier Mac so they can run their LLMs on it too. Some in particular want a dedicated Mac Mini or Studio for this purpose. The breakdown, even if slightly flawed, offers a good insight into the economics of it.

        For most people, they might be better off with OpenRouter models and providers supporting Zero Data Retention. On the cloud, that’s as good as it gets for privacy - your data is never retained beyond the life of the request.

      • tuwtuwtuwtuw4 hours ago
        I think it's because there are a lot of people writing articles about the benefits of running local models. I think it's fair to say that there are daily threads on HN singing the praises of local inference. I also see people buying new hardware where the main trigger is the ability to run local models.
        • FuckButtons3 hours ago
          But the people who want to do local inference are putting some value on privacy that isn't captured by the raw monetary cost, so just comparing the price is somewhat beside the point. It's also true that if you have, e.g., a Mac and use it as your main computing device, you would have spent the money on it anyway, so you can't really compare its cost to spending on something that's not general purpose.
          • apf6 an hour ago
            That's a lot of assumptions. I think there are also people buying new hardware specifically for this purpose, and their motivation to do it is thinking it will be cheaper in the long run. Privacy is not necessarily the motivation.
          • datadrivenangel2 hours ago
            My overall opinion is that the smart thing is not to upgrade to the maximum memory for AI purposes. It's worth quantifying how much extra we pay for privacy.
          • tuwtuwtuwtuw2 hours ago
            I replied to a comment asking why the article exists.

            As for privacy, I'm sure there are many people that are not so interested in that aspect.

    • statestreet123 3 hours ago
      Rounded up, yes, and oddly inefficient for someone obsessed with inefficiency. One could buy a brand new 64GB M5 MacBook for well over $4k. Another could buy a scratched-up but functioning M1 Max 64GB off of eBay for a little over $1k, and somehow get the same 10-20 t/s with 31b that the author does with an M5. Or better yet, have a frontier model do the planning and judging, and have a local MoE model execute at 50 t/s. All of this achievable by a former English major with too much free time.
    • dist-epoch4 hours ago
      using it 24/7 brings the average cost down, not up.

      the less you use local LLM, the less sense it makes since you paid a lot for hardware you don't use

      • bastawhiz3 hours ago
        That's the point: why would you buy a device that's specifically not optimized to be used for 24/7 inference? It's expensive hardware that's not designed to be used in that situation! The power use for inference isn't especially good and you're not getting even a fraction of the benefit from the hardware that you're paying for.
        • apf6 43 minutes ago
          Good question but people are doing it anyway. It's a fact that right now tons of people are buying Mac Minis specifically for this use case, to treat them as their personal data center for agents. The concept of "power use for inference" is foreign. Those people are the ones that motivated this blog post I think.
        • dist-epoch an hour ago
          > why would you buy a device that's specifically not optimized to be used for 24/7 inference

          because it costs $1k-$2k instead of $10k-30k+ for optimized devices

      • groundzeros2015 4 hours ago
        The hardware has multiple uses for the same cost. The pay-per-use server does not.
    • llm_nerd3 hours ago
      Your post makes sense if you bought the hardware for other reasons, and maybe run models occasionally as a novelty.

      That isn't the case for many, though, and there is a whole social media space where people are hyping up the latest homebrew options for running models, believing it frees them from the yoke of big AI.

      Millions of people are buying big $ maxed-out hardware like the Mac Studios or DGX specifically to run LLMs. Someone rationally running the numbers is a good thing.

      • atq2119 an hour ago
        Let's not get ahead of ourselves. Millions, really? I can believe there are a lot of enthusiasts doing this, but "millions" needs a citation.
    • cyanydeez4 hours ago
      nothing about the current data center craze looks efficient.
      • bastawhiz3 hours ago
        Whether or not you think building data centers is a good idea, it's inarguable that the per-token efficiency (power, hardware, etc.) is FAR higher in a data center. That's literally what it's designed for.
        • cyanydeez an hour ago
          I'm talking per value. Look at the efficiency of Chinese open source models; then look at SOTA sucking gigawatts, then the proposals.

          America is basically proposing AI using the equivalent bloatware of Windows 11.

      • trollbridge3 hours ago
        Probably because lots of data centres are being built (or half-built) which are sitting idle.
        • mpyne3 hours ago
          If there are datacenters sitting idle right now then you could probably make a lot of money selling that capacity to Anthropic at this point...
    • espadrine an hour ago
      [dead]
  • applfanboysbgon5 hours ago
    Unless I'm misunderstanding, this is counting the entire laptop in the cost of generating tokens. The calculation seems to omit that, in addition to receiving LLM output, you have also received a laptop in exchange for your money. If you intend to put this machine in a dark corner and run it solely as a token-munching server, a laptop would be an exceptionally poor choice of technology for this purpose. But if you intend to use the laptop as a laptop, having a laptop is a pretty big benefit over not having a laptop.

    You also get the benefit of privacy, freedom from censorship, and control over the model used (i.e. it will not be rugpulled on you in three months after you've built a workflow around a specific model's idiosyncrasies).

    • andai5 hours ago
      Yeah, a better metric might be, the difference in cost between the laptop you need to run local models, and the laptop you would have bought anyway.
      • fwipsy3 hours ago
        The base 14" M5 MacBook Pro is $1,700 with 16GB/1TB. The author's spec is $4,300, which is $2,600 more.

        It depends on how often you use it (and your tolerance for slow inference) and whether you would have otherwise bought a higher spec. For my needs, this costs a LOT more.
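        As a hedged sketch of that trade-off (the $2,600 premium is from above; the API price and daily token volume are my own assumptions):

```python
# Payback period for the hardware premium vs. paying an API per token.
# The premium comes from the comment above; everything else is assumed.
hardware_premium = 2600.0     # extra cost of the higher spec, USD
api_price_per_mtok = 0.40     # assumed OpenRouter-class price, USD/Mtok
tokens_per_day = 2_000_000    # assumed heavy personal usage

api_cost_per_day = (tokens_per_day / 1_000_000) * api_price_per_mtok
break_even_days = hardware_premium / api_cost_per_day
print(f"break-even after ~{break_even_days:,.0f} days "
      f"(~{break_even_days / 365:.1f} years)")
```

        Even ignoring electricity, at these assumed rates the premium takes roughly nine years to pay back; lighter usage or cheaper API pricing stretches it further, while heavier usage shortens it.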

    • BoorishBears7 minutes ago
      OP is giving you the absolute best case compared to most of the people who've been overcome with psychosis hoarding Macs.

      An unreasonable number of these people spent $10,000+ for Mac Studios that are still compute bottlenecked and don't have anything more efficient than Gemma 4 to run.

    • xienze2 hours ago
      > in addition to receiving LLM output, you have also received a laptop in exchange for your money

      And, since it's a Mac, whenever you're ready to upgrade it'll still have a fairly decent resale value.

    • dist-epoch4 hours ago
      > control over the model used

      but you lose access to the most capable models, you can run only the small ones

      • bel8 an hour ago
        And they run slower and quantized.
  • dijit3 hours ago
    Frontier AI companies are selling at a loss.

    Excusing everything else that u/bastawhiz said[0]; the obvious fact here is that Claude, OpenAI, Gemini et al. are quite literally burning through 100's of billions of dollars and selling it back to you for pennies on the dollar in the hopes that they get to be the only one left.

    If I spend $10 growing Oranges and sell them to you for $1; then of course it's more expensive for you to do the growing.

    I feel like I'm taking crazy pills. These models will become more expensive over time, it's functionally impossible for them not to, they just want to capture the market before they have to stop selling at a huge loss.

    [0]: https://news.ycombinator.com/item?id=48168433

    • vanviegen3 hours ago
      That seems unlikely. There are many providers for open models on OpenRouter, and they're presumably not throwing money away on each token they sell.

      Also, there are good technical reasons for inference being much more efficient at scale.

      • dijit3 hours ago
        The providers on OpenRouter serving open models aren't "throwing money away", agreed.

        But that's not the point I'm making. (or, it kind of is, but it's more high level than that).

        They're running spot and preemptible GPU instances (60-80% cheaper than on-demand), paying wholesale industrial electricity rates, and running at multi-tenant utilisation densities that make your MacBook look like a bonfire. Of course they're not individually loss-making on inference, they're aggregating cheap commodity compute and skimming a margin, and on paper that's what makes it seem like a good idea, certainly not a loss leader right?

        But zoom out a bit; the entire stack is swimming in VC money. OpenRouter itself just raised at a $1.3B valuation backed by a16z. The Chinese models that now account for 36% of all tokens routed through the platform (DeepSeek, Qwen) are priced the way they are because Beijing-adjacent capital has decided market share matters more than margin right now.

        So yes, technically no single party is "throwing money away" on each token; they're just all simultaneously subsidising different parts of the stack for strategic reasons. The floor price you're seeing isn't a stable equilibrium, it's a pile of investor money that hasn't entirely finished burning yet.

        • vlovich123 3 hours ago
          > The floor price you're seeing isn't a stable equilibrium, it's a pile of investor money that hasn't entirely finished burning yet.

          All that says is that it gets more expensive in the future as competitors exit the market and sustainability becomes important. That’s why Uber and Lyft were so cheap until they killed taxis. One major difference of course is that some models will remain largely good enough and the incremental cost of running will keep dropping to 0 over time since the hardware needed doesn’t get more expensive and is already purchased.

          • dijit3 hours ago
            I think we agree.

            I only object to taking current prices as if they are perpetual prices.

    • rprend28 minutes ago
      This is not true. API tokens are not sold at a loss, and hardware gets more efficient over time, so serving inference on the same model gets cheaper. Llama 3.1 405B was $6/$12 per million tokens (input/output) in 2024, but in 2026 that same model is $3/$3 per million tokens.

      The most intelligent model at a given time is much larger than the previous, which is why token costs for GPT5.5 are higher than 5.4. But you should expect that 2 years from now, serving a GPT5.5 sized model will be cheaper than GPT5.5 today. You should expect it to be even cheaper to get an equally intelligent model 2 years from now, because distillation techniques are effective at reducing the necessary parameter count for the same benchmark scores.

    • brianwawok3 hours ago
      So many more efficiencies are possible at scale, though. I cannot keep a local model 98% utilized 24/7, at least not with my current workload. A big cloud can. I can't power my servers with DC; I have this AC-to-DC conversion nonsense. The list goes on.
      • visarga2 hours ago
        Besides fill factor being hard to match, there is also scaling - you can't scale local inference 10x for a spike, but you can with cloud inference.
    • NicuCalcea3 hours ago
      The blog compares the cost of running Gemma4 31b, which on OpenRouter is offered by small no-name inference providers, not by frontier AI companies. It seems like a fair comparison to me.
      • pornel an hour ago
        LLM generation is bottlenecked by RAM bandwidth and latency. You can get almost linear scaling by evaluating more prompts in parallel, because the GPU has nothing to do for the relative eternity it takes to read all of the weights from DRAM for every layer for every token.

        On Apple Silicon you can get 4x-8x more tokens per second if you run more queries in parallel (as long as your inference server supports it, and has enough spare RAM for more KV caches).

        When inference is done at datacenter scales, when you distribute generation across multiple GPUs and have kernels carefully tuned to specific hardware, the compute vs DRAM bandwidth speed ratio gets absurd like 200:1. That's why everyone gives you batch inference at a steep discount.

    • OsrsNeedsf2P 3 hours ago
      The models have been dropping 10x in price for completing the same tasks, year over year. Even if you think Anthropic is losing money charging 10x more than everyone else for their 400B model, the prices will continue to go down based on model improvement alone
    • ianberdin3 hours ago
      Do you have a proof? Anthropic’s CEO said they Are profitable. Same with OpenAI.
      • dijit3 hours ago
        Profitable for inference if you completely ignore training costs and that you absolutely must continuously train new models.
        • vlovich123 3 hours ago
          Which is where your analogy breaks down and why you think you’re taking crazy pills. Inference is growing and selling the oranges in your analogy. Model building is growing the farm to sell larger, juicier more addicting oranges.
          • skippyboxedhero3 hours ago
            The same mistake was made with Amazon, and a million other tech companies in the early 2010s.

            Amazon was losing money because it was growing and spent all of its cash flow on growth. It wasn't merely regarded as a hopelessly unprofitable business; it was regarded as potentially fraudulent. The share price collapsed in 2014 because, some thought, the profit would never come, investing in growth was pointless, etc.

            Last year Amazon made nearly $100bn in profit. The stock is up 20x from then... and this is after AWS was known (everyone also said that was a massive fraud and could never be profitable; we now know it was printing money from day one), after it became the world's biggest retailer, etc.

            It is difficult to overstate how consistently people make this mistake, not just individually but in aggregate. You see the same thing with restaurants, consumer products, office leasing, so many businesses. This is not to say that the future will happen any particular way, but what Anthropic and co are doing is obviously rational and based upon very real cash flow. Anthropic's growth in revenue is, I believe, unparalleled in modern corporate history. A slight difference in this case is also that the economics of training these models is improving exponentially over time.

          • dijit3 hours ago
            Are ya fuckin' serious mate?

            The restaurant next to the mines were profitable up until the moment the mines themselves shut down: one doesn't exist without the other.

            You can't ringfence inference as "the profitable bit" and then hand-wave away the training. Without continuous training there is no inference product.

            Claude 3 Opus isn't sitting there making revenue in 2026 - the thing is just deprecated. The moment you stop spending billions on the next model, your "profitable" inference business is on borrowed time until someone else makes it obsolete.

            Maybe I made a mistake in my analogy... They're not growing a farm and then selling oranges. They're on a treadmill where stopping is death, and the treadmill costs $10bn a year to keep running.

            • atq2119 43 minutes ago
              > Without continuous training there is no inference product.

              This claim deserves teasing apart.

              Clearly, training is a Red Queen's race today. If a model provider were to unilaterally decide to stop training, they would very quickly lose market share to competitors with better models.

              On the other hand, what if market and investment conditions change such that everybody has to stop training?

              In that case, the models are still there and still as useful as they were the day before. So why wouldn't there still be an inference product?

            • vlovich123 3 hours ago
              > They're on a treadmill where stopping is death, and the treadmill costs $10bn a year to keep running.

              You’re literally describing all companies. Google takes about $270bn/year to run. If they stopped spending that they’d die pretty darn quick. It’s also a description of working - unless you’d built up significant savings, if you stopped working you’re also going to die.

              • bjt2 hours ago
                > You’re literally describing all companies.

                No, not quite. It really comes down to opex vs capex and the depreciation schedule for your investment.

                Software development is typically categorized as capex, on a 3-5 year depreciation schedule. You assume the software you write today will be generating value for you that long.

                If a big, expensive model training project only gives you value for a year or less, that is not like most companies.

                • vlovich123 an hour ago
                  No, the IRS made that change a while back as part of the TCJA but that’s been reverted in the OBBBA. If you build something and never touch it, sure that should probably be capex you have to depreciate. But if you’re investing continuously in it over time, I don’t see how it’s anything other than opex - there’s nothing being depreciated because you’re constantly improving it. Automobile manufacturers don’t have to count their labor force as capex. Indeed I can’t think of any other industry where labor is capex.

                  But believing that the financials of a project are governed solely by how IRS rules force you to account for headcount is kind of silly.

                  > If a big, expensive model training project only gives you value for a year or less, that is not like most companies.

                  The model itself that gets built? Sure (although clearly the timelines are getting longer). However the important bit here is the research that got done along the way and the infrastructure built to make that model building process cheaper, better etc. all of that stuff sticks around but because it’s hard to appreciate externally you discount it to 0 when it’s literally what they actually spent the money on.

                  But none of that even matters. Google had 270B in opex and their capex has grown from 50B in 2024 to 90B in 2025 and is projected to grow to ~175B for 2026. But even if you discount the “AI” treadmill, you’re still looking at many tens of billions in capex that if they stopped they’d die.

                • Anon1096 34 minutes ago
                  Software that is sold as a service and requires ongoing maintenance like running in the cloud (and people to keep it running in the cloud) is opex not capex. Google Search is most definitely opex.
              • Danox2 hours ago
                The problem is I don't think computing is going back to the mainframe era, where all the computing is done remotely and the only thing in front of you is a terminal. That is the AI slop maker's dream, but the computing power on the desktop/laptop/tablet/phone is getting better and the models are getting smaller and quicker.

                There is no moat. In the end, what we are calling AI today will just be something incorporated into the existing programs people use to help them accomplish a task. The public will not pay more for it. It will just be a commodity added to the existing ecosystems we have today.

            • genxy an hour ago
              > Claude 3 Opus

              Unless they are changing the architecture in huge ways, the pre-training done for 3 goes into later models. I am sure the frontier labs are figuring out how to pretrain generic feedstocks that can be fed into downstream training pipelines. DeepSeek's incremental training run cost was what, $5M? Alibaba and DeepSeek have the best, most efficient training pipelines; look at the rate at which custom Qwen models are being pumped out.

          • no-name-here2 hours ago
            > Inference is growing and selling the oranges in your analogy. Model building is growing the farm to sell larger, juicier more addicting oranges.

            In this analogy, model training would be akin to developing better oranges, but your competitors are also developing better oranges so if you stop spending heavily to improve your oranges, consumers are going to buy ~zero oranges from you within a couple years. (Expanding the farm might be analogous to expanding data centers.)

          • xienze2 hours ago
            In this particular case, inference and training are intertwined. It might be one thing if Anthropic could get away with training a new model every five years and control costs that way. But they can't. Put another way, their inference has no value without continuous, very expensive training. Because consumers aren't purchasing based on price but capability, otherwise the Chinese models on OpenRouter would have buried OpenAI and Anthropic already.
        • spzb3 hours ago
          And ignore capital costs, depreciation, user churn etc
      • tiffanyh3 hours ago
        Do you mind sharing source links to that profitability claim?

        I’m struggling to find the quotes.

      • Danox3 hours ago
        AI CEOs are known to say many things; telling the truth probably isn't one of them.
      • miltonlost3 hours ago
        If only they had their books open to do more than just "say"
    • tempest_ 3 hours ago
      It is the model training that is dragging them down.

      If the arms race stopped tomorrow the current price pays for the inference.

      • Danox3 hours ago
        But isn’t training models, a forever task like iterating in tech you can never take a day off, adding humans to the equation don’t humans train/teach themselves new skills over a lifetime, and isn’t one of the selling points in the future when selling this AI slop your AI never goes to sleep and can always be trained forever? The AI price for entry as we go on into the future will only increase.
        • atq2119 39 minutes ago
          I agree that training is a forever task, and the current rate of training is probably not sustainable. But all that means is that once the current investment mania ends, the market will most likely find a new equilibrium where continuous training still happens, but at a slower rate that can be sustained by inference revenue.
        • asjir2 hours ago
          Just keeping it up to date with competitors is much cheaper, by copying better ones like Qwen did with Claude. Also a bunch of research is trickling into open source / arxiv so catching up should continue becoming cheaper at least as a fraction of training from scratch
    • visarga3 hours ago
      > Frontier AI companies are selling at a loss.

      There are huge economies to be had by batching requests and using lots of RAM for MoE (sparse models). You can't achieve that efficiency at batch size 1 on a single node.

      • asjir2 hours ago
        Exactly, they put a lot of money into engineering and it does give results
    • poly2it3 hours ago
      Well, I'd be surprised if non-R&D inference providers were selling at a loss. There are a plethora to choose from, competition is quite healthy. Will they keep providing cheap tokens while the labs raise their prices? Probably, but then I don't see how they could be raised in the first place. And what timescale are you talking about? A couple of years? It is appropriate to assume inference will become more efficient over time. If you raise your prices, you are going to be out competed before it's profitable (if you assume it is unprofitable) which would be negligent. I don't see how this makes sense.
    • throwatdem12311 2 hours ago
      The Michael Scott AI Companies.
    • vlovich123 3 hours ago
      Except that’s not what the analysis is. They’re spending < $1 to get $1 from you and the other $9 to figure out how to improve the model further and build up products on top of that to turn that $1 spend into $5 in the future.

      In other words, inference is fairly profitable for them and the rest of the money is spent growing revenue as quickly as possible. Building models is still an expensive line item but the costs for that are going down with time.

      There is also maybe a “capture the market” mentality but I don’t think that’s necessarily it - the tools and processes are largely fungible and that’s a huge problem. They need to figure out how to make it sticky for “capture the market”, but there’s also a very real “grow as big as possible as quickly as possible to take on Google”; Google has an existential threat here.

    • EGreg3 hours ago
      > These models will become more expensive over time, it's functionally impossible for them not to, they just want to capture the market before they have to stop selling at a huge loss.

      They could have said the same about transistors. People keep inventing new ways to keep the costs down. Just look at the latest Qwen, DeepSeek, BitNet. Interesting tidbit: they’re all open, and as Google said in 2022: they have no moat.

    • MattRix3 hours ago
      The inference is absolutely not sold at a loss, at least not when paying API prices (the subscriptions are less clear). The reason frontier model companies aren’t profitable is because training the models is so costly, not inference.
    • MuffinFlavored3 hours ago
      > Frontier AI companies are selling at a loss.

      How big/deep of a loss?

      I feel like I read this every day: for years Uber pursued this same "idiotic, losing" strategy (as it was pitched/discussed), and then one day we woke up and... without much fuss, boom, they were profitable, seemingly overnight.

      • Danox2 hours ago
        As long as you have slaves/sharecroppers driving for the people at the top of the pyramid, Uber is profitable. Uber makes money as long as you don't care about the workers and as long as you can get around all of the regulations put on traditional cab companies, if there are any left on the road.

        For me, nothing says low class like the Porsche dealer saying "we can call an Uber for you to take you home." Ridiculous... and it was a low-class experience: dirty car, small, never again, ha ha ha...

      • brianwawok3 hours ago
        Well, Uber cut the driver pay in half and doubled the price. They didn't really find any efficiencies; robo drivers don't exist yet. That's also why I hardly touch them anymore.
        • onesociety20222 hours ago
          All that tells me is they did find an efficiency. If they didn’t, their driver supply would have dropped. Unlike the taxi business, Uber/Lyft can tap into otherwise dormant supply of drivers who already own a car but aren’t willing to spend all 40-60 hours a week driving a taxi. With Uber/Lyft, they can become part-time drivers (they have flexibility and they can use an asset they already own anyway). Is it worse for the full time taxi drivers who used to have the supply artificially constrained in the old medallion system? Yes, but does it also benefit others who want to do this as a flexible job, zero skills required other than driving, no boss to deal with, no job interviews, etc. Yes!
        • MuffinFlavored2 hours ago
          > Well and uber cut the driver pay in half and doubled the price

          Devil's advocate:

          * inflation caused everything to go up to some degree since then

          * if it was "that bad" as you say, they wouldn't be extremely profitable and have so many users

          both things can be true? "they cut the driver pay in half and doubled the price" did not lead to the collapse of the business/people to stop using it.

      • spzb3 hours ago
        Ed Zitron discusses this as part of his post on AI economics : https://www.wheresyoured.at/ais-economics-dont-make-sense/
    • ajross3 hours ago
      > I feel like I'm taking crazy pills.

      Why? It's no less crazy than when Uber and Lyft were doing the same thing. Or when the entire tech industry was doing it in the dot com boom.

      Investment-driven market growth at a loss is like the least surprising thing in all of this. The tech is new and fascinating. The bubble is just another trip through the funhouse.

  • sleepyeldrazi3 hours ago
    If you want a good dense model, use qwen3.6 27B instead; speed will be up, and if you don't take my word for it being smarter, let OpenRouter's prices for it against the bigger, slower, and less memory-efficient Gemma do the talking.

    If you want a faster model, go for qwen3.6 35B (or gemma 4 26B if Gemma models perform better for your tasks). There is a reason people (myself included) haven't shut up about those two (especially the 27B). It's small enough to run at a decent speed (especially with the built-in MTP that finally has official llama.cpp support), and for many workloads (every benchmark I have ever thrown at it) it is matching or surpassing models it has no right to.

    A couple of days ago I woke up with my internet down, started 27B in pi, told it to diagnose what's wrong by giving it my router's password, went to grab a coffee, and by the time I got back I had a full report with suggestions on how to proceed. I love OpenRouter and I use it for many things, but it is not cheaper.

    Subjectivity and opinions based on personal experience with all those models implied, naturally. I assume the 31B Gemma has cases in which it edges out; I've just failed to find any, and I have been running all 4 models mentioned nonstop for different tasks since hours after each of them dropped. Hell, for my hermes, I've started getting better results once I switched from gemma 4 26B to qwen3.5 9B, not even the massively improved 3.6 series. It just feels outdated/cherry-picked not to use what by many accounts is the current consumer-hardware SOTA when doing such an analysis.

    • trollbridge3 hours ago
      Right. Qwen 3.6 45b (6 parameter) runs on a commodity 5090, which, if you're into video games, you probably already have one of. It is entirely usable for most code generation tasks. (Not all, but most.)

      Likewise, DeepSeek V4 Flash is quite accessible on local models, with DwarfStar 4 making it easy to run on a 96GB MacBook.

      There's nothing wrong with paying for inference, but local models bring up some pretty amazing possibilities, such as entirely offline usage or being able to work on private PII, legally privileged, etc. sort of data, or performing tasks with no concern given whatsoever towards billing overruns.

      The other possibility is being able to build a service which you can be 100% assured you can keep running without worrying about a service going down or being end-of-lifed, which is currently a problem with frontier models. My local Qwen setup is entirely predictable. It can run as long as I can keep finding hardware to run it.

      A sensible strategy uses both: have local inference tools available, and use both low-cost and high-cost cloud based models. You can use GPT-5.5 and Opus-4.7 for things they excel at (including laundering the latter via a Claude subscription to make it cheaper) for demanding reasoning tasks, DeepSeek V4 Pro for slightly less demanding tasks, V4 Flash for most (not all) code generation, and then local models for things where you want a local model.

    • ekojs2 hours ago
      Not disagreeing with your argument, but:

      > If you want a good dense model, use qwen3.6 27B instead, speed will be up, and if you don't take my word for it being smarter, take openrouter's prices of it against the bigger, slower and less memory-efficient gemma do the talking.

      Don't know if this is the correct read. I think those providers are simply taking a cue from Alibaba's first-party pricing for the 27B Dense. It's kinda overpriced, imo. Perhaps it can be explained by how 'reasoning-inefficient' (relative to frontier models or even Gemma) the Qwen models are, since longer sequence lengths are expensive to serve.

  • konaraddi3 hours ago
    A lot of comments here are about the issues with the analysis in OP's post, but many of them are "a distinction without a difference" with respect to the broader conclusion. When we look purely at cost and performance (setting aside privacy), it's better for individual devs to pay for hosted than to self-host. Employers are paying for tokens on the job, and most devs find $PREFERRED_PROVIDER's $20/$100/$200/month subscription sufficient outside of work. Most devs don't fall within the conditions under which running local models makes sense purely on the basis of cost vs performance.

    More critically, in practice, setting up local models seems more like a hobby, an educational exercise, or an act of privacy control than a means of cost cutting or productivity.

    • Danox2 hours ago
      The model makers' mainframe dream of computing isn't coming back, no matter what OpenAI, Google, Anthropic, or Microsoft want. There are too many smart tech barbarians at the gate who want in, and they're not going to be satisfied going back to the computer terminal era.

      Personal computers eliminated an earlier terminal era, and most, if not all, of those companies are gone; IBM and a few stragglers remain, and they are a shell of their former selves.

  • Jayakumark4 hours ago
    OP is comparing against Gemma everywhere but concludes paying Anthropic makes more sense. Anthropic is $15 per million output tokens, which is 30-35x more expensive, even on OpenRouter.

    This is like comparing an e-bike at home with an e-bike rental and concluding we should therefore rent a Toyota since it can go at similar speeds. I'm getting tired of bad posts getting this much attention.

  • antirez4 hours ago
    Mmmm, nope, if you do the smart thing. A MacBook M5 Max 128GB is a premium laptop at $6k, but with it you can do many things and it's a good main driver for the day. Then, it can also run DeepSeek V4 Flash and perform non-trivial tasks locally, without censorship or limitations, even without an internet connection and on very privacy-sensitive data. That's a good deal. If you spend $25k on a dual Mac Studio 512GB setup to abandon OpenAI and company, you are going to be disappointed by both performance and cost.
    • throwa3562627 minutes ago
      Don't tell the HN crowd, but you can run some of these models on a $200 RPi 5 or a $500 AMD mini PC.

      Another open secret is that certain companies give you tens of thousands of tokens for free, with pretty respectable models such as Gemini 4.1 and GLM 4.6.

    • datadrivenangel2 hours ago
      The smart thing is to get a ~48gb MacBook and use it as your daily driver, and then budget ~$800/year for AI subscriptions or tokens and you'll end up at the same price.

      I say, as the author of the blog post, writing this on a MacBook M5 max 128gb..

      • antirezan hour ago
        I agree with you, practically. But there is another angle to the story: for instance, models are starting to be useless for security work, since they are more censored every day. Also, prices have skyrocketed in recent months; what will happen later? A few months ago I was shocked that people resisted spending $20/month to get basically free frontier models, and I warned we were headed toward house-rent figures in the future as AI becomes more and more required to do work. So what you say is absolutely true now (but $800/year is not accurate: you need the $200 tier to do real work in my experience, so $200 * 12 = $2400/year). But if you have a 128GB MacBook, that no longer looks so costly compared to $2400/year of frontier models; you can experience uncensored LLMs, a quick thing that always works for low-value tasks like "TLDR this blog post for me", "what's wrong in this function?", or "could you explain this API?". And for this kind of work, DeepSeek V4 Flash looks basically frontier. So if you look at things in perspective, they could have a different shape.
    • kamranjon3 hours ago
      Yea, my M4 Max with 128GB has ended up making a lot of sense for me. I do video editing, I train ML models, I run large open AI models, I do 3D modeling, rendering, and CAD work. I never do all of this 100% of the time: I'll set up an ML training run overnight and check results in the morning, during work I'll set it up as a server and run local models, and on my own time I'll edit video and work on 3D modeling. It's an incredibly versatile machine, and all of this is done while keeping your data on your device and giving you full control over your workflows.
  • maho5 hours ago
    The author only compared output token costs -- but for typical agentic workloads, input tokens dominate the costs by a large margin. Running inference locally, input tokens are, to first order, free. (They only generate implicit costs through higher time-to-first-token, higher power use, and lower token output speed).
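    A toy calculation makes this concrete. The sketch below uses hypothetical prices and a made-up session shape (30 turns, 20k starting context, 1k output tokens per turn); since every turn resends the whole growing context as input, input spend grows roughly quadratically while output spend stays linear:

```python
# Hypothetical per-Mtok prices; real providers price input and output differently.
INPUT_PRICE = 0.40   # $/Mtok input
OUTPUT_PRICE = 0.40  # $/Mtok output

def session_cost(turns, ctx_start, out_per_turn):
    """Cost of an agentic session where each turn resends the full context."""
    cost_in = cost_out = 0.0
    ctx = ctx_start
    for _ in range(turns):
        cost_in += ctx / 1e6 * INPUT_PRICE
        cost_out += out_per_turn / 1e6 * OUTPUT_PRICE
        ctx += out_per_turn  # each output is appended to the next turn's input
    return cost_in, cost_out

cost_in, cost_out = session_cost(turns=30, ctx_start=20_000, out_per_turn=1_000)
print(f"input ${cost_in:.3f} vs output ${cost_out:.3f}")
```

    With these made-up numbers, input costs come out roughly 34x the output costs, which is in line with the "input tokens dominate" observation.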
    • amluto41 minutes ago
      Even ignoring superior caching on a local setup, Mac hardware can often process input tokens around 10x as quickly as it produces output tokens. OpenRouter seems to have only a 2x difference on the same models.
    • Wilya3 hours ago
      Yeah, that completely invalidates his point.

      I looked at a couple random agentic sessions in my openrouter activity, and the input cost is 10x the output cost.

      Prompt caching on openrouter is complicated and unreliable. On local hardware with llama-cpp, it's mostly free.

  • netika2 hours ago
    In my testing, qwen-3.6-27b in full precision is well below sonnet, but above claude haiku in coding tasks. Gemma is not even close to qwen, it’s much, much worse.
  • Sinidir2 hours ago
    The article is seriously wrong, because it makes a huge mistake in the last part. You can't simply look at the produced tokens and call that your cost. In agentic coding there are lots of turns, meaning you pay not only for the output tokens but also for all the input tokens sent each time (even if they're a lot cheaper, like 10x less when cached). So this calculation does not accurately represent the API cost at all.

    The second thing is that you can sharply increase token generation locally if you use agent teams. Single conversations are memory-bandwidth bound and don't fully use your compute. If you can batch tokens from multiple agents, you can easily 5x token generation.
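    The 5x figure is plausible on bandwidth grounds alone. A back-of-envelope sketch, assuming decode is purely memory-bandwidth bound (all numbers hypothetical, not measured; real scaling is sublinear because KV-cache reads also grow with batch size):

```python
# Decode must stream the full weights once per step regardless of batch size,
# so a batch of agents shares that single pass over memory.
# All numbers below are hypothetical, not measured.
weights_gb = 30.0      # model size at the chosen quantization
bandwidth_gbs = 400.0  # machine memory bandwidth

def aggregate_tps(batch_size):
    """Ideal aggregate tokens/s if weight streaming is the only cost."""
    single = bandwidth_gbs / weights_gb  # ~13.3 tok/s at batch size 1
    return single * batch_size

print(aggregate_tps(1))  # one conversation
print(aggregate_tps(5))  # five agents batched: 5x aggregate throughput
```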

  • regexorcist5 hours ago
    I simply can't go back to cloud AI. Privacy and full control are more important to me than speed and SOTA models.
    • xyzzy1234 hours ago
      Also predictability, resilience, sovereignty. I'm not worried about other people's outages, that unexpected demand will impact me at an inconvenient time, that someone's watering down my model, that my costs will change unpredictably or that some unforseen error will lead to a huge bill.

      It's in the same category as rooftop solar for me. It doesn't have to make strict economic sense if you're the particular type of person who gets peace of mind from control of infrastructure / reduced dependency.

  • nu11ptr5 hours ago
    "Accelerated depreciation (if any) from shortening the lifespan of the device will be more expensive than the electricity"

    Shortening the lifespan?

    • Der_Einzige4 hours ago
      The FUD notion that hardware depreciates in this manner is widely held. I blame Michael Burry of The Big Short, who is perpetuating these lies to the investor community today.

      There's a bunch of retro hardware that should make people pause and realize they're stupid to assume hardware slows down on average even 5% 20 years later (it's probably closer to 2%, and I'm being generous).

      HVAC/power delivery and generation are the major factors, and if you didn't skimp on or get defective parts for these, and you replace failed moving parts (usually fans), your hardware is basically the same 20 years down the line as it is today.

      Also using LLMs locally doesn't even induce sustained 100% GPU usage over significant periods of time for most real (agentic coding in OpenCode) use-cases.

      • datadrivenangel2 hours ago
        There are tons of things that can start failing on hardware. I don't realistically expect some LLM usage to materially reduce the lifespan of the laptop, but running it 24/7 for AI usage makes me think that I'm more likely to get 3 years out of the device instead of 10.
        • an hour ago
          undefined
        • Der_Einzigean hour ago
          This is the FUD I’m talking about. B200 GPUs will easily have economically useful lifespans assuming proper cooling and power delivery.
  • synthos5 hours ago
    How much does your data privacy cost?
    • datadrivenangel5 hours ago
      As stated in the analysis, thousands of dollars. That said, the smart thing to do is target smaller models (few billion parameters) and then use larger models for non-privacy tasks.
  • cientifico2 hours ago
    Right now, local inference only makes sense for privacy reasons.

    This is common when processing PII. Lawyers, doctors, and similar professions should not be using cloud solutions.

    It's also harder to set up and always more expensive than any cloud solution.

  • SXXan hour ago
    The author forgot that after 3 years, when the hardware is no longer decent for inference, you can still resell it for 25-50% of the price.

    Obviously, if the RAM apocalypse passes, high-end configurations will preserve resale value worse than base models, but it's still a hefty bonus of Apple hardware that might change the math a lot.

  • Havoc5 hours ago
    I like that the numbers were crunched, but the answer to these is always a bit of a foregone conclusion.

    * Industrial power pricing

    * Wholesale hardware pricing

    * Utilization density of shared API

    means API always wins a cost shootout.

    Privacy & tinkering is cool too though

  • michaelbuckbee5 hours ago
    Slightly different slice into this a very similar situation (local vs OpenRouter AI inference).

    But in _every_ metric other than privacy it was better to run via OpenRouter than a local model, and not by a small amount.

    Direct link to the comparison charts:

    https://sendcheckit.com/blog/ai-powered-subject-line-alterna...

  • trvz4 hours ago
    Local LLMs aren’t about cost, but control.
  • bilekas5 hours ago
    I don't hear people debating which is cheaper, local or cloud models. The conversation, at least as I hear it, is that a lot of the time users aren't burning that many tokens; those providers get paid even if you never use them. 80-90% of the work my team and I do with AI is grunt work: write tests for this, implement an FFT here, write the DB query for X. Nothing exhausting. Those using AI for whole-cloth "vibe coded" applications and services are definitely better suited to cloud. If a work laptop can run my local models and deliver the performance my work needs for development, why wouldn't I, as a company, prefer that?

    Add to that the privacy improvements, data protection, and potentially further task-specific inference if needed, and it's a no-brainer.

    Again, AI is a tool, and I would wager, with no evidence looked up, that the majority of devs would be happy with 10-30 tokens per second locally.

  • jwr2 hours ago
    > "run a model like Gemma 4 31b, which is almost anthropic sonnet levels of performance"

    I wish people would stop deluding themselves. I regularly try (and benchmark for my purposes) local models, and they are NOWHERE near the huge models like Sonnet or Opus. Nowhere. Yes, you can sometimes get plausible-looking output for simple tasks, but for anything even remotely requiring thinking there is simply no comparison.

    Local models are useful. I use them for spam filtering, and soon intend to use them for image tagging and OCR. But let's stop saying they can get us "anthropic sonnet levels of performance", because that's just not true.

    • an hour ago
      undefined
  • freakynit5 hours ago
    So I did the India-specific analysis for a tier-3 city. Here, electricity costs 1/3rd of the US price, and you also get a solar subsidy up to a certain amount.

    https://shorturl.at/q6gRE

    tldr;

    Hardware depreciation costs are the major factor.

    But if we assume ZERO hardware depreciation (not realistic), then local inference becomes super cheap.. roughly 90%+ cheaper.

    Third case: the break-even happens only if we can get, at the very least, 8.7 years of useful hardware life. A more realistic number, however, when working 8 hrs/day and not 24 hrs/day, is around 25 years.
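    The break-even math is simple enough to sketch. All inputs below are hypothetical placeholders (plug in your own hardware price, local electricity cost, and usage); with these particular numbers the break-even lands near the ~25-year figure for 8 hrs/day use:

```python
hardware_cost = 4299.0       # USD upfront (hypothetical)
api_price = 0.45             # $/Mtok on a hosted provider (hypothetical)
electricity_per_mtok = 0.05  # $/Mtok of local electricity (hypothetical)
tps = 40.0                   # local decode speed, tokens/s
hours_per_day = 8.0

mtok_per_year = tps * 3600 * hours_per_day * 365 / 1e6  # ~420 Mtok/year
savings_per_year = (api_price - electricity_per_mtok) * mtok_per_year
break_even_years = hardware_cost / savings_per_year
print(round(break_even_years, 1))  # ~25.6 years with these numbers
```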

    So, for now, local inference is preferable if you deeply care about privacy. From cost perspective, it's still not there.

    • datadrivenangel2 hours ago
      I think your link is broken? Would love to see the analysis as well.
      • freakynit2 hours ago
        not much of an analysis really.. just simple math... anyways, that markdown share site takes around 5-10 seconds to load the page.. so, just hang on a bit more time :)
  • perbu3 hours ago
    For me, the appeal of local compute is first and foremost confidentiality, and having the possibility of running my 200K documents through an LLM just to see what happens, without having to consider the cost.
  • zkmon3 hours ago
    Consider DeepSeek as well: about 50 cents per 1M tokens, for a >1T-parameter model.
    • an hour ago
      undefined
  • empath7527 minutes ago
    It should not at all be surprising that running models at home is more expensive than commodity providers. That's just generally true of running your own stuff. Even if the cost in money isn't higher, the cost in time is often _significantly_ higher.

    This is why the idea that the AI labs are in trouble because inference will be a commodity is _completely backwards_. Some of the largest and most powerful companies in the world sell commodities. They compete on scale and efficiency, and you are never going to be able to compete with the big labs on either.

  • Archit3ch3 hours ago
    Except I already have a local Mac to run Xcode. OpenRouter cannot help with that, at any price.

    > 64 gigs should run a model like Gemma 4 31b

    No, it can run anything in the 70B range. It's a notable quality upgrade from the 30B, which isn't obvious because the famous flurry of April releases didn't contain any 70Bs.

    It can also run 120B in UD-Q3. Or 230B disk-streamed.

  • jmyeet4 hours ago
    I've dug into this previously for one simple reason: NVidia segments the market by capping VRAM and Apple silicon uses a shared memory model that could challenge that but it currently doesn't. And I really wonder if Apple realizes the potential of what they have or if they even care.

    So, for comparison, a 5090 has 32GB of VRAM and you can get one for ~$3000 maybe. To go beyond that memory with current generation (ie Blackwell) GPUs, you have to go to the RTX 6000 Pro w/ 96GB of VRAM and that's almost $10,000 for the GPU by itself. Beyond that you're in the H100/H200 GPUs and you're talking much bigger money.

    Part of the problem here is the author is looking at laptops. That's the only place you'll find the M5 Max currently. The real problem here is that the Mac Studios haven't been updated in almost 2 years. There were configs of those with 256/512GB of RAM, but they've been discontinued, possibly because of the RAM shortage and possibly because they're reaching EOL. Apple hasn't said why. They never do.

    Many expect M5 Ultra Mac Studios in Q3 and the M5 Ultra may well have >1TB/s of memory bandwidth (for comparison, the 5090 is 1.8TB/s). Memory bandwidth isn't the only issue. A 5090 will still have more compute power (most likely) but being able to run large models without going to a $10k+ GPU could be huge.

    But yes, it's hard to compete with the scales and discounted electricity of a data center. Even H200 compute hours are kinda cheap if you consider the capital cost of what you're using.

    I've looked into getting a 128GB M5 Max 16" MBP. That retails for $6k. You might be able to get it for $5400. But I don't think the value proposition is quite there yet. It's close though.

    • an hour ago
      undefined
    • gizajob3 hours ago
      I think Apple really do care and know that Moore’s law is likely to position them as major winners in this race in 3-7 years time.
      • brookst3 hours ago
        This. The M5's massive speed-up in prefill is a good sign.

        Apple isn’t expecting wholesale adoption of on-device models this year or next. But all of their design and iteration suggests they see it coming.

        • datadrivenangel2 hours ago
          Apple has been doing tons of on-device ML for years now, but it's primarily background magic like photo tagging.
  • brisket_bronson4 hours ago
    > Let's round up to $0.20 per kWh.

    Next paragraph

    > At ~50-100 watts and $0.18/kWh that's $0.009 or $0.018 per hour. $0.02 per hour. $0.48 cents per day for the electricity to be running inference at 100%.

    lol

  • clearstack3 hours ago
    Apple services are ~27% of revenue and growing double-digits. The chip is a moat for that flywheel, not a standalone compute bet.
    • an hour ago
      undefined
  • SecretDreams5 hours ago
    Will this cost structure always be this way and are there other benefits to not running your LLM on the cloud?

    E.g.

    Privacy

    Uptime

    Future cost structure controls

    This is a field that has moved very quickly. And it has moved in a direction to try to trap users into certain habits. But these habits might not best align with what best benefits end users today or some time in the future.

  • JSR_FDED5 hours ago
    Wouldn’t a Mac Mini be a better comparison?
    • sgt5 hours ago
      Yes, or Mac Studio. Laptops with screens aren't made to run 24/7 heavy workloads.
    • 650REDHAIR4 hours ago
      Also after a few years you can sell and upgrade.

      A 2022 Mac Studio w/ M1 Ultra and 128gb was ~$5200 new and I see them selling for over $4k on eBay.

      Can’t sell your used tokens…

      • onesociety20222 hours ago
        You can't, actually: due to the RAM shortage you can't even upgrade to an M3 Ultra Mac Studio with 128GB RAM. That model has been discontinued. Even the 96GB model has a wait time of 5 months in most locations. This is why the resale value is so high.
  • anonym294 hours ago
    The true advantage of locally self-hostable, open weight models isn't about monetary cost at all, it's about the CIA triad.

    Running locally, you get the confidentiality of knowing your tokens are only ever processed by your own hardware. You get the integrity of knowing your model isn't being secretly or silently quantized differently behind the scenes, or having its weights updated in ways you don't want. And you get the availability of never having to worry about an API outage, or even an internet outage, for local inference capacity.

    And this isn't even starting to address the whole added world of features and tunability you get when you control the inference stack. Sampling parameters, caching mechanisms, interpretability etc.

    OpenRouter may be cheaper than frontier labs, but you still lose all of these benefits from open weight models the moment you decide to rely on someone else's hardware for your processing.

    • an hour ago
      undefined
  • panny5 hours ago
    Your laptop AI costs too much? Speculative investors can help!
  • maxdo4 hours ago
    I'm surprised people are ignorantly talking about the advantages of buying a very expensive device, running it only sometimes, and aiming to beat cloud vendors.

    If a small model is great, it will be hosted somewhere with good electricity costs and utilized 24/7.

    Isn't this the 2+2 of economics?

    CPUs are a commodity, and we still buy CPUs and RAM from vendors for the same reason.

    • bitwize11 minutes ago
      It's a good thing the market is adjusting to the reality that no one needs to own powerful computers, just terminals into the feed of compute available through the cloud, then! Nothing could possibly go wrong from that!
    • throw12345678914 hours ago
      Put a cost on sending your intellectual property to a SaaS provider who knows where. That's half a problem when it is just your IP; hopefully it's not the IP of your clients. Maybe it's fine if one is building yet another HTML page nobody really cares about.
  • znpy25 minutes ago
    I think the main flaw in the reasoning is assuming that the cost of tokens will stay the same over the years.

    Chances are that token prices will go down, but chances also are that the AI bubble pops and all of a sudden all these companies will either have to make a buck out of the inference or go bankrupt.

    Getting your own hardware just grants you stable pricing.

  • deadbabe3 hours ago
    What would really elevate an article like this is if we could somehow quantify human brain’s equivalent outputs and compare the costs with local LLM and cloud LLMs.
    • datadrivenangel2 hours ago
      The computer does most specific tasks better, faster, and cheaper than I do.
  • SpyCoder775 hours ago
    OpenRouter doesn't cost money per say, it depends on the providers' pricing
    • moritzwarhier5 hours ago
      > OpenRouter has Gemma4 31b at ~38-50 cents per million tokens. This means that on the optimistic side (50 watts, 40 tokens per second, and 10 years) the pro max is as cheap as openrouter. On the pessimistic side (100 watts and 3 years at 10 tokens per second) the pro max is 10x the cost. I think ~3x the cost per million tokens is likely the right number for local inference on the pro max from an accounting perspective.

      Apart from that, as detailed in the article, pricing for local compute also depends on electricity prices.
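      The quoted comparison can be reproduced by folding hardware amortization into the per-token price (a sketch using the quote's own assumptions, including 24/7 operation):

```python
def local_cost_per_mtok(watts, tps, usd_per_kwh, hw_usd, years):
    """Electricity plus straight-line hardware amortization, assuming 24/7 use."""
    hours = years * 365 * 24
    mtok_per_hour = tps * 3600 / 1e6
    electricity = (watts / 1000) * usd_per_kwh / mtok_per_hour
    amortization = hw_usd / hours / mtok_per_hour
    return electricity + amortization

# Optimistic case from the quote: 50 W, 40 tok/s, 10-year life.
print(local_cost_per_mtok(50, 40, 0.20, 4299, 10))   # ~$0.41/Mtok
# Pessimistic case: 100 W, 10 tok/s, 3-year life.
print(local_cost_per_mtok(100, 10, 0.20, 4299, 3))   # ~$5.10/Mtok
```

      Note that amortization, not electricity, dominates both cases, which is why the assumed lifespan swings the result so much.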

      By the way, I don't want to snark about it, my English is not very good, but it's "per se", not "per say". Just commenting on this petty thing because it seems to be a common misspelling, and it always trips me up a bit. Makes me wonder about another supposed meaning like "from hearsay".

    • mnahkies5 hours ago
      They do take a cut of 5.5%, (as they should)
  • christkv4 hours ago
    Bizarre running local models have nothing to do with cost. It's about privacy first and foremost
  • newsclues4 hours ago
    Local isn’t (just) about cost, it’s control and trust.
  • mbgerring2 hours ago
    Now include the externalized cost in the U.S. of deploying ~100% of productive capital to build data centers instead of, for example, first-world transportation infrastructure, and tell me which one is cheaper
    • tuwtuwtuwtuw2 hours ago
      Why would I want to include that when determining the cost per token?
      • mbgerring2 hours ago
        It’s part of the cost per token
        • mbgerring2 hours ago
          Can’t reply to the reply here, but yes, you do pay for it with money. Absorbing all of America’s capital and construction labor capacity to build AI data centers, rather than, for example, reaching parity with every other developed country in transportation infrastructure, is a cost you pay every day in gas and time spent in traffic. More to the point, building AI datacenters without paying for modernizing the grid is raising electricity rates. Token costs aren’t just subsidized by VC money, they are subsidized by all of us because of idiotic policy choices. Lots of people have pointed out that the current price per token is heavily subsidized, and a fair analysis would account for that.
          • tuwtuwtuwtuw2 hours ago
            I live in one of the developed countries, not in the US.
        • tuwtuwtuwtuw2 hours ago
          I don't pay for it with my money.
  • 5 hours ago
    undefined
  • an0malous5 hours ago
    OpenRouter and other LLM platforms are being subsidized by VC investment to less than it costs them to run inference, the MacBook Pro is not
    • hankerappan hour ago
      Bingo. I, for one, am loving this phase of enjoying the LLMs at the expense of VC money. Just like how I enjoyed cheap rides and deliveries on Uber. And with the fragmentation in the field, I don't see a monopoly coming up.
    • Kwpolska5 hours ago
      When the AI bubble inevitably pops, the author will find a new way to skew results in favor of cloud LLMs. Like including the price of a desk and a chair in the local token cost.
      • datadrivenangel5 hours ago
        I really wanted the laptop to look better cost-wise, but it doesn't.
        • an0malous4 hours ago
          I mean if you’re buying it just as an LLM inference server it’s not, but most people already have laptops, in which case it’s practically free
  • Der_Einzige4 hours ago
    OpenRouter doesn't expose all the LLM sampling parameters that llamacpp, vllm, sglang, et al. expose (so no high-temperature, highly diverse outputs). OpenRouter also doesn't let you use steering vectors, LoRA, or other per-request personalization techniques. And there are no true guarantees of ZDR (zero data retention), privacy, or data sovereignty.

    Oh, and the author didn't mention anything related to inference optimization, so there's no way to know whether they even know about, or enabled, things like speculative decoding, optimized attention backends, or quantization.

    At least AI slop would have hit on far more of the things I listed above. This is worse-than-AI.
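    To make the sampling-parameter point concrete: vLLM's OpenAI-compatible server accepts engine-specific sampling fields beyond the standard API. The sketch below only constructs a request body (the model name is a placeholder and nothing is sent), to show the kind of per-request knobs a local engine exposes:

```python
import json

# Request body for a vLLM OpenAI-compatible /v1/completions endpoint.
# The model name is a placeholder; the request is built, not sent.
payload = {
    "model": "local-model",
    "prompt": "Write a haiku about caching.",
    "max_tokens": 64,
    # standard sampling knobs most hosted APIs expose
    "temperature": 1.4,   # high temperature for more diverse outputs
    "top_p": 0.95,
    # engine-specific knobs local servers accept per request
    "top_k": 100,
    "min_p": 0.05,
    "repetition_penalty": 1.1,
}
print(json.dumps(payload, indent=2))
```

    Steering vectors and per-request LoRA adapters go further still, modifying the forward pass itself, which no payload field on a generic hosted API can replicate.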

  • mrtimeman5 hours ago
    The full-amortization framing is doing a lot of work here. I bought my laptop because I needed a laptop, not as an inference box, and running a model on it is incidental to that. Once the hardware is sunk for other reasons, the only cost left is electricity plus whatever depreciation you accelerate by hammering the SoC, which the post actually acknowledges in one parenthetical before allocating the full $4299 to tokens anyway.

    Also nobody I know picks local over OpenRouter on price. They pick it for offline, for data not leaving the machine, for no rate limits, for not having a provider go down mid-task. If $/Mtok is the only axis, sure, cloud wins.

    In practice the pattern I see is leaving a small model running on easy background tasks while using the laptop normally, not a dedicated inference box hammered flat out for 5 years.
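    The gap between the two framings is easy to see with back-of-envelope arithmetic. Every number below is an illustrative assumption (machine price, 24/7 duty cycle, throughput, power draw, tariff), not a figure from the article:

```python
# Two framings of local-inference cost: full hardware amortization
# vs. marginal (electricity-only, hardware already sunk).
HARDWARE_USD = 4299.0   # assumed machine price
YEARS = 5               # assumed service life, run 24/7
TOK_PER_SEC = 30        # assumed sustained throughput
WATTS = 60              # assumed average draw under load
USD_PER_KWH = 0.15      # assumed residential tariff

hours = YEARS * 365 * 24
mtok = TOK_PER_SEC * hours * 3600 / 1e6        # lifetime output, millions of tokens
electricity_usd = WATTS / 1000 * hours * USD_PER_KWH

full_amortized = (HARDWARE_USD + electricity_usd) / mtok  # hardware charged to tokens
marginal = electricity_usd / mtok                         # hardware bought for other reasons

print(f"lifetime output:   {mtok:,.0f} Mtok")
print(f"full amortization: ${full_amortized:.2f}/Mtok")
print(f"marginal (power):  ${marginal:.3f}/Mtok")
```

    Under these assumptions the sunk-hardware framing comes out roughly an order of magnitude cheaper per million tokens than full amortization, which is why the two sides of this argument keep talking past each other.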
