Start by observing that
1/3 - 1/2 = 2/6 - 3/6 = -1/6 = -(1/2)(1/3)
or, rearranged,
1/3 = 1/2 - (1/2)(1/3)
Substitute the equation above into its own RHS repeatedly and find
1/3 = sum over N = 1..inf of -(-1/2)^N = 1/2 - 1/4 + 1/8 - 1/16 + ...
You can do this with arbitrary pairs of powers of 2 and 3 (and with other bases too).
The implication is that you can fairly easily build a fixed-time divide-by-constant circuit out of nothing but adders and subtractors, for divisors that are close to a power of two.
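To make the idea concrete, here is a rough C sketch along the lines of that series (my own illustration, not the original circuit). It truncates the alternating series 1/2 - 1/4 + 1/8 - ... after 32 terms and adds a half-unit rounding bias, so it rounds roughly to nearest rather than truncating; a real fixed-time divider would need guard bits and a proper correction step to be exact for all inputs.

    #include <stdint.h>
    #include <stdio.h>

    /* Approximate x/3 using the alternating series 1/3 = 1/2 - 1/4 + 1/8 - ...
       Every step is a shift plus an add or subtract, so in hardware this is a
       fixed-depth chain of adders/subtractors. Illustrative only: accurate to
       within one for 32-bit inputs. */
    static uint32_t approx_div3(uint32_t x)
    {
        uint64_t acc  = 0;
        uint64_t term = (uint64_t)x << 31;      /* fixed-point scale of 2^31 */
        for (int n = 1; n <= 32; n++) {
            term >>= 1;                         /* x * (1/2)^n, scaled */
            if (n & 1) acc += term;             /* + x/2, - x/4, + x/8, ... */
            else       acc -= term;
        }
        acc += (uint64_t)1 << 30;               /* half-unit rounding bias */
        return (uint32_t)(acc >> 31);           /* drop the fixed-point scale */
    }

    int main(void)
    {
        for (uint32_t x = 0; x < 10; x++)
            printf("%u / 3 ~ %u\n", x, approx_div3(x));
        return 0;
    }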
This was in the mid-1970s, using off-the-shelf 7400-series parts, and had a peak throughput of 5 MIPS.
Sure, taking 8 instructions for a multiply is a lot, but microprocessors at that time didn't even have multiply and ran below 1 MIPS. I just wanted to bring it up as yet another way someone implemented multiply, and it was pretty effective for its time.
EDIT: too late to edit my original comment, but it added the memory value to the upper half of the 24-bit result.
Just an aside, that is Peter Kogge, who did his PhD work at Stanford, worked on the space shuttle, is an IBM fellow, and invented the first multi-core cpu.
The man clearly has many deserved achievements to his name, but that is true without this, and I'm confident the world would be better without this kind of statement.
"A multi-core CPU" isn't much of an invention per se. It's an idea -- one that is fairly obvious and trivial at a certain point of semiconductor history. Getting a multi-core CPU to run is not trivial, but it's not a single invention either, and by the time we got there, development teams were so large that it would be downright insulting to claim that one person solved all these problems by himself. Perhaps Kogge led the development of the first multi-core CPU, perhaps he was even visionary in the sense that he pushed for it before others thought it was feasible (I do not know if that is the case). But either way, he didn't invent it.
I read your comment in exactly this voice https://www.getyarn.io/yarn-clip/6b70f8d0-5706-4e10-a6e9-e61...
(In this scene, Steve Martin’s character Vinnie is trying to get away from Rick Moranis’s character Barney, and he gets a bunch of actors to pretend to be his family (none of whom speak English) to play on the latter’s sympathy and ask to have his handcuffs removed. Vinnie introduces Barney as, among other things, the inventor of the rotary engine, causing one of the actresses to break character and say “I thought Wankel invented the rotary engine”.)
You could definitely make the intellectual case for this approach. In cases where there's latency or costs associated with moving data to central computing, this makes sense. (In our case, it was space-based sensors so you could make the case that way.)
But AFAIK this style of processing was never systematically adopted in any space-based processing system, although many such systems (like radars) have ad hoc data reductions performed in hardware close to the sensor.
Thanks for providing that connection!
You can observe the evolution of multipliers in the MIPS line, which I've been studying and happen to know well.
The R3000A had the same Radix-8 (aka 3-bit per cycle) integer multiplier in 1989.
The R4000 had two Radix-8 multipliers in 1991. One for integer, one for floating point.
The R4400 was an upgraded R4000 in 1992. The integer multiplier was kept at Radix-8, but the FPU was upgraded to a Radix-256 design (aka 8-bits per cycle).
In parallel, MIPS spent a lot of time creating a low power design for the target market of "Windows NT laptops". The result was the R4200, released in 1993. MIPS published quite a bit of information about the various power saving optimisations [1], [2]. Instead of separate integer and floating point data paths, they created a unified data path that did both, allowing them to use the same Radix-8 multiplier for everything. They even unified the multiplier unit into the main adder, rather than using a separate adder like earlier designs.
In 1995, MIPS released the R4300i (aka the CPU found in the Nintendo 64). It was an evolution of the R4200, keeping the unified float/integer datapath, but it gained the Radix-256 multiplier from the R4400 design, so both integer and float instructions complete at 8 bits per cycle.
As far as I can tell, this Radix-256 multiplier doesn't use any fancy tricks. It's just an array of eight 64-bit wide carry-save adders, feeding into a regular carry-lookahead adder.
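For anyone who hasn't met carry-save adders: each stage is just a 3:2 compressor applied bitwise, so nothing has to propagate a carry until the very last step. A minimal C model of one stage (my own sketch, not MIPS's circuit):

    #include <stdint.h>

    /* One carry-save (3:2) stage: compress three 64-bit inputs into a sum word
       and a carry word with no carry propagation between bit positions.
       Invariant: a + b + c == sum + carry (mod 2^64). Only the final stage of
       the tree needs a real carry-propagate (e.g. carry-lookahead) adder. */
    static void carry_save_add(uint64_t a, uint64_t b, uint64_t c,
                               uint64_t *sum, uint64_t *carry)
    {
        *sum   = a ^ b ^ c;                          /* bitwise sum, no ripple */
        *carry = ((a & b) | (a & c) | (b & c)) << 1; /* bitwise majority, shifted up */
    }

Eight of these stacked up, finished with a single carry-lookahead add, is essentially the structure described above.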
In 1996, MIPS released the R10000. Transistors are now cheap enough that they could implement a full 52-bit adder for their floating point data path, allowing them to issue one double precision multiplication every cycle (though it's pipelined with a 2 cycle latency). I assume it's just 52 stacked adders, though it seems like they probably need to be doing something fancier with carries by the time it's that big.
Most modern CPUs have ended up at the same point. Massive pipelined arrays of adders that can execute one 64-bit multiplication every cycle, with a 3 cycle latency.
[1] https://www.youtube.com/watch?v=nll5MWlG7q4 [2] https://ieeexplore.ieee.org/document/282948/
The Pentium multiplier isn't 3 bits per cycle; it's one full multiplication per cycle (though pipelined over multiple cycles).
The part that's 3 bits is the Booth multiplier trick of multiplying 3-bit digits instead of 1-bit digits, but they're all multiplied in parallel.
As such, I suspect (but don't know) that this trick or some variant of it is used even today.
On the R3000/R4000/R4200, the 3-bits-per-cycle multipliers do use radix-8 multiplication, but they don't have a dedicated 3x multiplier. Instead, the 3x result is latched during the first cycle (by adding (x << 1) + x). For the remaining cycles it can do a 3-bit multiplication with nothing more than a bit of Booth recoding logic and a single 64-bit wide adder.
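A software model of that scheme might look roughly like this (my own sketch, assuming unsigned 32-bit operands; the real datapath is wider and the recoding details surely differ): 3x is formed once up front, digits 5-7 are recoded as -3, -2, -1 with a +1 carried into the next digit, and each loop iteration stands in for one cycle of the single wide adder.

    #include <stdint.h>
    #include <stdio.h>

    /* Sketch of a 3-bits-per-cycle (radix-8) multiply with Booth-style
       recoding. The only "hard" multiple is 3x, computed once as x + (x << 1);
       everything else is a shift, a negation, or the precomputed 3x. */
    static uint64_t radix8_multiply(uint32_t a, uint32_t b)
    {
        uint64_t x   = a;
        uint64_t x3  = x + (x << 1);     /* the one precomputed hard multiple */
        uint64_t acc = 0;
        unsigned carry = 0;              /* +1 carried into the next radix-8 digit */

        for (unsigned i = 0; i < 32; i += 3) {     /* one digit per "cycle" */
            int digit = (int)((b >> i) & 7) + (int)carry;
            carry = 0;
            if (digit > 4) {             /* recode 5,6,7 as -3,-2,-1 plus a carry */
                digit -= 8;
                carry = 1;
            }
            uint64_t mag;
            switch (digit < 0 ? -digit : digit) {
            case 0:  mag = 0;      break;
            case 1:  mag = x;      break;
            case 2:  mag = x << 1; break;          /* shifts are free in hardware */
            case 3:  mag = x3;     break;          /* the latched 3x */
            default: mag = x << 2; break;          /* 4 */
            }
            if (digit < 0) acc -= mag << i;        /* unsigned wraparound = 2's complement */
            else           acc += mag << i;
        }
        return acc;   /* the top digit can't generate a carry for 32-bit inputs */
    }

    int main(void)
    {
        printf("%llu\n", (unsigned long long)radix8_multiply(123456789u, 987654321u));
        printf("%llu\n", 123456789ull * 987654321ull);   /* cross-check */
        return 0;
    }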
Then MIPS entirely abandoned this radix-8 encoding for the 8-bit-per-cycle multiplier in the R4400 and R4300, replacing it with a simple array of binary CSA adders. Probably because an array of base-2 adders is just much simpler. (Or at least that's what I think I can see on the R4300's die shot, I'm going to need to go back and take a much closer look at the multiplier)
Anything I say about radix-256 in my first comment is probably nonsense, it's not radix-256 simply because it can do 8-bits in one cycle.
What I missed is there is nothing limiting you to one radix-8 addition per cycle (like the early MIPS designs), you can combine the radix-8 encoding with an array of adders. And you only need 1/3rd the adders that a base-2 multiplier would need. The entire point of using the radix-8 encoding is that there is only one carry every 3 bits.
You are probably right. This trick with the dedicated 3x multiplier is probably still used today.
How do these manage to achieve the 3 cycle latency? If it has to go through a pipeline of 8 (or 64?) adders in a row (I don't know if it's actually that many! but I assume more than 3), how do you still manage to get just a 3 cycle latency, assuming a single adder has a latency of 1?
Also, given the trickery needed for the radix-8 multiply, where you get lucky with most values from 1-7 except 3 and 5, which use this specialized circuit, how many such specialized circuits do you need for a radix-256 multiply?
Bad assumption. A single adder is only forced to take a full clock cycle if you register its output. Otherwise, data can flow through several stages in a single cycle, as (worst-case) propagation delay allows, before that set of stages hits a register that forces clock alignment.
Depending on the design there might be some other logic that requires a longer path in a single pipe stage, but you wouldn't want it to be too much longer.
Based on modern ISA design I suspect that a 64 bit barrel shifter -- with its 6 layers of muxes -- takes slightly longer than a 64 bit add, and everyone is taking the adder input from after the 2nd layer of shifter muxes, giving you single-cycle instructions to do `a + (b << {0,1,2,3})` either disguised as an LEA (x86, Arm) or as a specific `sh3add` etc instruction (RISC-V).
But I checked the Cortex A78 optimisation manual. They take 1 cycle if the shift is 4 or less and 2 cycles in other cases.
So it kind of makes sense to support fast shift-by-four. Though it's more likely they just profiled a bunch of code and decided fast shifts by four were worth budgeting for.
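For concreteness, this is the source-level pattern being discussed; compilers commonly turn it into a single LEA on x86, an ADD with a shifted operand on Arm, or sh3add when the RISC-V Zba extension is available (exact codegen of course depends on the compiler and flags):

    #include <stdint.h>

    /* The a + (b << k) address-calculation pattern, k <= 3, that the shifter
       discussion above is about: indexing into an array of 8-byte elements. */
    static inline uint64_t *elem_addr(uint64_t *base, uint64_t i)
    {
        return (uint64_t *)((uintptr_t)base + (i << 3));   /* a + (b << 3) */
    }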
Was there a difference in node size?
Like the M603e that became the G3, it would have been great to see 4 of these R4200s as a quad processor server.
As speed is paramount, there are many more things we will not learn about until the architecture is vintage.
It's sad that speculative execution became a vulnerability.
The early R4400s and the R4200 were on the same node. (I don't have a die area for the 0.6 μm R4400, but the 0.3 μm R4400 was 132 mm² and the 0.35 μm R4300 was 45 mm².)
But the R4400 could hit much higher clock speeds: it had an eight-stage pipeline, took three cycles to take any branch (the delay slot, then two flushed instructions), and the 64-bit shifter took 2 cycles. The R4200 was designed for much lower clock speeds; it's more or less a 5-stage "classic RISC" pipeline. Branches take 1 cycle (the delay slot) and 64-bit shifts happen in 1 cycle.
The other problem is when it doesn't hit cache. The R4400 had an on-chip controller and on-chip tags for an external 1MB secondary cache (aka, L2), while the R4200 only had half the data cache (8KB vs 16KB) and no support at all for a secondary cache (not enough space for the IO pins). AFAIK, it doesn't even broadcast cache flushes on the bus, so attempting to bolt an external cache controller would be problematic.
(Cache misses are especially painful on the N64, because the R4300 uses a cut-down 32-bit bus, and there is a lot of latency and bus contention going to main memory)
And to kill your dreams of putting four of them in a single server, they removed all multi-processor support.
That last one feels semi-deliberate to me, like they didn't want their new low-power, low-cost laptop chip to compete with their more expensive chips. I don't think it would have taken that much effort to implement the full cache coherency protocol.
Because I do think the R4200 could have competed on certain memory bound workloads (at least in cost-per-performance metrics).
DEC, IBM, Motorola, and MIPS.
After the EMP flash of 2075, the sentient raccoons will enjoy reading the stories of the civilization long gone.
[0] https://typesetinthefuture.com/author/typesetinthefuture/
[1] https://www.amazon.com/Typeset-Future-Typography-Science-Fic...
I would like to hear much more about the R4200i than the 286.
I used to teach binary math to kids, and at a county fair I was upstaged by an 11-year-old girl who demonstrated both multiplication by powers of two and division by powers of two.
"What does your father do for a living?'
"He is the director of the laser fusion project!"
"Oh."
Not a lot of confirmation....
https://web.archive.org/web/20160204052256/http://intel80386...
" For those descriptors that are common to both the 80286 and the 80386, the presence of zeros in the final word causes the 80386 to interpret these descriptors exactly as 80286 does; for example: "
This is patently untrue, and documented as such. The final word is reserved for 386 use in the GDT descriptors, and this breaks 286 code, specifically Xenix 286, on a 386.
https://www.os2museum.com/wp/theres-more-to-the-286-xenix-st...
https://en.wikipedia.org/wiki/Extended_precision#x86_extende...
> (What about ×5? If you can compute ×3, you can subtract that from ×8 to get ×5.)
Why can't you do the same to subtract x4 from x7 to get x3?
Or for hexadecimal: multiplying by 0x10 (16) or by 2, 4 or 8.
In binary, 4x is a left shift by 2, rather than 2x, which is a shift by one.
Adding or subtracting in 2's complement space is the same operation for the bits of the word.
> The downside to radix-8 multiplication is that multiplying by a number from 0 to 7 is much more complicated than multiplying by 0 or 1, which is almost trivial.
I think you can leave out the "almost" there - multiplying by 0 is 0 and multiplying by 1 is the number itself, as shown in the example, so I would definitely call it trivial.
> The downside to radix-8 multiplication is that multiplying by a number from 0 to 7 is much more complicated than multiplying by 0 or 1, which is almost trivial. Fortunately, there are some shortcuts. Note that multiplying by 2 is the same as shifting the number to the left by 1 bit position, which is very easy in hardware—you wire each bit one position to the left. Similarly, to multiply by 4, shift the multiplicand two bit positions to the left.
> Multiplying by 7 seems inconvenient, but there is a trick, known as Booth's multiplication algorithm. Instead of multiplying by 7, you add 8 times the number and subtract the number, ending up with 7 times the number. You might think this requires two steps, but the trick is to multiply by one more in the digit to the left, so you get the factor of 8 without an additional step.
> Thus, you can get the ×7 by subtracting. Similarly, for a ×6 term, you can subtract a ×2 multiple and add ×8 in the next digit.
> Thus, the only difficult multiple is ×3. (What about ×5? If you can compute ×3, you can subtract that from ×8 to get ×5.)
If we can easily compute ×2 for the purposes of exploiting 6x = 8x - 2x, and we can easily compute ×4 for the purposes of exploiting 4x = 4x...
why is it more difficult than that to compute 3x as the sum of 2x and 1x, or the difference of 4x and 1x?
Also, if we can easily compute ×6 by any means, why not compute ×3 as a right shift of that value? That is admittedly an extra step, but the extra step is a shift.
The point is that ×3 is precomputed once, so you can then simply feed it into any of the 22 terms as needed. You can't feed ×2 and ×1 into a term to get ×3, because every term would then need another adder. In other words, you want one circuit to compute ×3 rather than 22 circuits.
For your ×6 question, this value is computed by putting negative ×2 into the term and conceptually adding one more to the next digit to get ×8. You can't right-shift this ×8 value since it's part of a completely different term.
Hopefully this makes sense. There are a lot of digits and sums flying around.
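Summarizing the digit handling described above as a table (a sketch of the scheme as described, not the literal Pentium circuit): each radix-8 digit selects one of the easy multiples, and only digits 5-7 pass a +1 on to the next digit, so the shared, precomputed ×3 is the one multiple that needs its own adder.

    /* Radix-8 digit recoding as described above: (selected multiple, +1 to the
       next digit). 0, ±1x, ±2x, ±4x are trivial (wiring/shifts/negation); the
       single precomputed 3x is multiplexed into whichever terms need it. */
    struct recode { int multiple; int carry_to_next; };

    static const struct recode radix8_recode[8] = {
        /* digit 0 */ {  0, 0 },
        /* digit 1 */ { +1, 0 },
        /* digit 2 */ { +2, 0 },   /* shift left by 1 */
        /* digit 3 */ { +3, 0 },   /* the shared precomputed 3x */
        /* digit 4 */ { +4, 0 },   /* shift left by 2 */
        /* digit 5 */ { -3, 1 },   /* 5 = 8 - 3 */
        /* digit 6 */ { -2, 1 },   /* 6 = 8 - 2 */
        /* digit 7 */ { -1, 1 },   /* 7 = 8 - 1 */
    };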
If you are adding n terms using an adder tree, the 1 feeds into the carry-in of the next level (it is simply a concatenation, since the carry term is shifted left by 1). Only the final carry-propagate adder will have the add-with-1.
Another interesting question is how sign extension is implemented for a negative term. The problem is that you're adding 64-bit shifted terms to create a 128-bit result. When negating a 64-bit term, all the unused bits to the left need to be set to 1. But the chip doesn't do this. (Among other things, the adders would need to be much wider than 64 bits.) So it must be using a trick for the sign extension.
- LEA is just an instruction that moves the address calculated by an addressing mode into its output operand, rather than performing a data transfer to or from that address; all the addressing that LEA can do, MOV instructions can also do.
- The indexed addressing modes of the x86 used by MOV or LEA do not support a scale factor of 3, only powers of two: 1, 2, 4, 8. So address generation has no use for multiplication by 3.
- The article makes it clear that the ×3 multiplier is part of the floating-point multiplier.
That's the pace of performance growth that led software to become bloated today; next year's performance improvements would cover up most of the sins of failing to think critically about algorithms and data flow context / locality.
Today (as far as I've read) we're at the practical limit for what's reasonable to do with silicon semiconductor technology and our present understanding of physics. The pendulum needs to swing the other direction; it's time for computers to work smarter not harder.
We’ve been at that limit for decades.
What ended was Dennard scaling, around 2006: roughly, the observation that frequency would keep going up as feature size went down. But because so many people are confused about what is what, you see a crappy, muddled message.
Moore's law has been going strong. It must end eventually; current predictions are that it will be in a decade or two.
Maybe debating what always-somehow-wrong law to cite should not be the focus? Like it's very clear to me that being technically correct about what Moore's law or the Dennard scaling refers to is leaps and bounds less important than the actual, practical computing performance trends that have been observable in the market.
I think Moore’s law should be avoided altogether when discussing progress in this area, because it’s hard to understand the effects of doubling intuitively. Rice grains on chessboards and all that.
One might think "Moore's law is slowing down" means progress was faster before and slower now, when it is in fact the complete opposite.
If you consider the 20 years between the intel 286 and the pentium 3, transistor count went from about 150 thousand to 10 million.
Today (using the ryzen 5950 and 7950 as examples), we got 5 Billion more transistors in just 2 years.
So in 2 years we added 500 times more transistors to our cpus than the first 20 years of “prime Moore’s law” did.
This enormous acceleration of progress is increasingly unnoticed due to even faster increases in software bloat, and the fact that most users aren’t doing things with their computers where they can notice any improvements in performance.
But this is not what I as a consumer end up seeing at all. Consider the RTX 5090. Gen-on-gen (so, compared to the 4090), for 20-30% more money, using 20-30% more power, you get 20-30% more raster performance. Meaning the generational improvement is 0, software notwithstanding.
> If you consider the 20 years between the intel 286 and the pentium 3, transistor count went from about 150 thousand to 10 million. Today (using the ryzen 5950 and 7950 as examples), we got 5 Billion more transistors in just 2 years.
Why would you bring absolute values into comparison with a relative value? Why compare the 286 and the P3 and span 20 years when you can match the 2 year timespan of your Ryzen comparison, and pit the P2 ('97) against the P3 ('99) instead? Mind you, that would reveal a generational improvement of 7.5M -> 28M transistors, a relative difference of +273%! Those Ryzens went from 8.3B to 13.2B, a +59% difference. But even this is misleading, because we're not considering die area or any other parameter.
The 4090 and 5090 are the same generation in reality, using the same process node. The 5090 is a bit larger but only has about 20% more transistors of the same type compared to the 4090. Which of course explains the modest performance boosts.
Nvidia could have made the 5090 on a more advanced node but they are in a market position where they can keep making the best products on an older (cheaper) node this time.
>Why would you bring absolute values into comparison with a relative value? Why compare the 286 and the P3 and span 20 years when you can match the 2 year timespan of your Ryzen comparison, and pit the P2 ('97) against the P3 ('99) instead? Mind you, that would reveal a generational improvement of 7.5M -> 28M transistors, a relative difference of +273%!
That was my point though, to highlight how relative differences in percentages represent vastly different actual performance jumps. It quickly becomes meaningless since it's not the percentages that matter, it is the actual number of transistors.
To put it another way - If you take the first pentium with about 3 million transistors as a baseline, you can express performance increases in "how many pentiums are we adding" instead of using percentages, and note that we are adding orders of magnitude more "pentiums of performance" per generation now than we did 10 years ago.
This is literally a relative measurement. You cannot reason about Moore's Law in absolute changes.
The other poster has laid it out for you in the simplest terms: a 1 billion transistor increase could mean anything. It could be a 1000% improvement - which is absolutely humongous - or a 10% improvement, which is basically irrelevant. If you want to measure how impactful an increase is, you have to look at relative change. 1 billion transistors on its own means nothing. It is only interesting with respect to the number of transistors in the previous generation - which is a relative change measurement.
Say we are at Generation 1 with 100 billion transistors. By your reasoning, if we add 1 more billion of transistors to this, that's big. 1 billion transistors are a lot. But this is absolutely incorrect! Because we started out with 100 billion transistors, the change is actually irrelevant.
[..]
>This is literally a relative measurement. You cannot reason about Moore's Law in absolute changes.
This is exactly my point. I said exactly the same thing as you: "I think Moore’s law should be avoided altogether when discussing progress in this area"!
Performance is an absolute thing, not a relative thing. Amount of work per second, not percentage-over-previous!
Doubling means that each step encompasses the sum of all previous steps. Every step gives vastly more performance than any preceding step.
If a generation gives 60% more performance and all previous generations gave 100%, the latest jump will still be the one that added the most performance.
I think this is often overlooked and worth mentioning. People are actually complaining about performance jumps that are the biggest performance jumps ever seen because they are thinking in percentages. It makes no sense.
I think the core misunderstanding is what is being discussed. You say "Performance is an absolute thing, not a relative thing. Amount of work per second, not percentage-over-previous!"
But nobody here is concerned with the absolute performance number. Absolute performance is only about what we can do right now with what we have. The discussion is about the performance improvements and how they evolve over time.
Performance improvements are inherently a relative metric. The entire industry, society and economy - everything is built on geometric growth expectations. You cannot then go back to the root metric and say "ah but what really matters is that we can now do X TFlops more"; when that number is a mere 10% improvement, nobody will care.
Built into everything around computing is the potential for geometric growth. Startups, AI, video games - it's all predicated on the expectation that our capabilities will grow in geometric fashion. The only way to benchmark geometric growth is by using % improvements. Again, it is wholly irrelevant to everyone that the 5090 can do 20 tflops more than the 4090. 20 TFlops is some number, but whether it is a big improvement or not - ie is it expanding our capabilities at the expected geometric rate or not - is not possible to divine from this number.
And when we look it up, it turns out that we went from 80 tflops to 100 tflops, which is a paltry 25% increase, well short of the expectations of performance growth from Nvidia, which were in the 50% region.
20 tflops might or might not be a lot. The only way to judge is to compare how much of an improvement it is relative to what we could have before.
I see your point and it's valid for sure, and certainly the prevailing one, but I think it's worth thinking about the absolute values from time to time even in this context.
In other contexts it's more obvious, let's use video. Going from full HD to 4K is roughly a doubling of resolution in each direction. It means going from about 2 megapixels to 8, a difference of 6.
The next step up is 8k, which is going from 8 megapixels to 33. A difference of 25.
This is a huge jump, and crosses a relevant threshold - many cameras can't even capture 33mpx, and many projectors can't display them.
If you want to capture 8k video, it's not relevant if your camera has twice the megapixel count of the previous one - it needs to have 33mpx. If it has 16 instead of 8 it doesn't matter (also it doesn't really matter if it has 100 instead of 40).
On the other hand, if you want to capture 4K, you only need 8 megapixels. Whether you have 12, 16, 24 or 40 doesn't really matter.
If we return to CPUs, absolute performance matters there too - if you want to play back video without frame drops, you need a certain absolute performance. If your computer is twice as fast and still can't do it, the doubling didn't matter. Similarly if your current computer can do it, doubling it again won't be noticeable.
But it is the percentage that matters? If I have 10B transistors and I add 1B to it, the speedup I can expect is 10%. If I have 1B transistors and add 1B to it, the speedup I can expect is 100%. Tremendous difference. Why would I ever care about the absolute number of transistors added?
Because it's the transistors that actually matter for performance.
If a unit of work (number of files compiled, number of triangles generated or whatever else) takes 1 billion transistors to complete in 1 second, you have gained the same amount of work per second by adding 10% to the former as you gained by adding 100% to the latter.
How much performance you need to feel a difference in a given workload is a separate point, and note that usually the workload changes with the hardware. Compiling the linux kernel in 2025 is a different workload than it was in 2005, for example, and running quake 3 is a different workload than cyberpunk 2077.
If you play a modern game, you don't notice a difference between a 10-year-old GPU and a 12-year-old GPU - even though one might be twice as fast, they might both be in the single digits of FPS, which feels equally useless.
So we gain more from the hardware than we used to, but since the software is doing more work we're not noticing it as much.
Benchmarking efforts usually involve taking the same amount or type of work, and comparing the runtime durations or throughputs in turn for different hardware.
Rasterization performance benchmarks for the 5090 revealed exactly the same +20% difference we see in transistor count. This is why I do not see the utility in remarking that in absolute terms we're adding more transistors than ever, because this is basically never what matters in practice. I have a set workload and I want it to go some amount faster.
Software sprawl is an issue, no doubt, but that on its own is a separate discussion. It bears only a loose relation to the absolute vs. relative differences discussion we're having here.
> How much performance you need to feel a difference in a given workload is a separate point
It was exactly the point I said at the start should be the point of focus. Maybe we're talking past one another, I don't know.
I think we do - we agree that the symptom is that we don't experience the same gains now as we used to, and that is a problem.
My issue is the notion that this is caused by a slowdown in performance gains from the hardware side, when this is clearly not the case. A common complaint is along the lines of "we only got 30% when last time we got 50%", which completely ignores that the latter 30% is way more actual new performance than the previous 50%.
>I legitimately just do not see the utility of framing the topic this way.
IMO it's always useful to identify the actual reason for a problem and think about the fundamentals.
If the problem is that we're not experiencing the performance gains, we should be asking ourselves "Why does software feel slower today despite the hardware having 10x more performance".
Instead we complain about the hardware for not managing to add the equivalent of all previous performance gains every 2 years, because Moore's law observed that it did so in the beginning of the chessboard (so to speak).
Instead of wondering whether Moore's law is dying or not, we should question why Wirth's law seems to be immortal! ;)
Can't resist throwing in this quote (from one of Wirth's students, I think):
"Software gets slower faster than hardware gets faster."
Sure, it’s not literally true of literal transistor density, but it feels arrogant to invent new names for the overall phenomenon that Moore observed. Like the way we still talk about Newton’s laws despite very valid “actually technically” incompleteness.
https://hasler.ece.gatech.edu/Published_papers/Technology_ov...
Some view it as a doubling every 18 months, or in terms of cost per transistor (which has actually gone up at the smallest nodes).
It is roughly an exponential curve in the number of transistors we can use to make a "thing" with.
It is both a question of capability (can we make things with a certain number of transistors?) and of economics (is it viable to build things of that size?).
You could stay at the current node size and halve the cost of that wafer every 18 months and you would still be on the same curve. But it is easier in our economic system to decrease the node size, keeping the rest of the fixed wafer costs the same, and get 2x or 4x the density on the same lines.
If I get nerd-sniped, I'd find the two video presentations, one by Krste and another by Jim Keller, where they unambiguously explain Dennard scaling and Moore's law in a way that is congruent with what I just said.
I find "Moore's Second Law" interesting. At least the version I'm familiar with says that the cost of a semiconductor chip fabrication plant doubles every four years. See https://en.wikipedia.org/wiki/Moore%27s_second_law
It's interesting to contrast that trajectory with global GDP. At some point, either global economic growth has to accelerate dramatically to even produce one fab, or we have to leave the 'globe', i.e. go into space (but that's still universal GDP exploding), or that law has to break down.
It would be exceedingly funny (to me) if one of the first two possibilities held true, accurately predicting either an AI singularity or some golden space age.
It's not impossible, but I think quantum computing will only become practical later than optical.
Does this actually work? At some point, and this has been the case for a while, you're limited by thermals. You can't stack more layers without adding more cooling.
[1]: https://www.asml.com/en/news/stories/2022/what-is-a-gate-all...
[2]: https://anysilicon.com/the-ultimate-guide-to-gate-all-around...
Your sources are excellent. (Thank you so much for the links.)
I'm only half-joking: the brain gets a lot of its energy efficiency out of most of its parts not working all that hard most of the time; and we are seeing some glimpses of that in mobile processors, too.
Quantum computing is next, right?
Apart from the very niche application of factoring integers into primes, there's scarcely any known application where quantum computers would even theoretically outperform classical computers. And even integer factoring is only remotely useful until people completely switch away from cryptography that relies on it.
The one useful application of quantum computing that I know of is simulating quantum systems. That's less useless than it sounds: a quantum computer can simulate not just itself (trivially) but also other quantum systems. In any case, the real-world use case for that is accelerating progress in materials science, not something you or I would use every day.
I can believe that quantum computers might be useful for chemistry simulations. (Quantum computers aren't really useful for encryption. You could theoretically use them, but they just don't give you any advantage over running a quantum-resistant algorithm on a classical computer.)
I'm especially doubtful that quantum computers would be useful for arbitrary NP problems, or even arbitrary linear algebra problems.
https://cs.stackexchange.com/questions/76525/could-a-quantum...
There are specialized algorithms for any part of it, especially search. But demonstrating that quantum computers are good for linear algebra should be enough to show that they are generally useful, I hope.
The encryption I was referring to was quantum link encryption (technically not a quantum computer, but we are splitting hairs here; it uses the same set of underlying mechanisms). Quantum link encryption gives you a communications channel where, if someone tries to man-in-the-middle it, all they do is break the link: both you and the attacker only see gibberish. It's like a one-time pad that doesn't require first exchanging pads.
See https://scottaaronson.blog/?p=8329 for something more recent.
That’s not really a takedown of the idea.
The more critical challenge is that there is a massive, massive constant-factor difference between classical and quantum computing using foreseeable technology. Even in the best case (like factoring), where a quantum computer gives you a logarithmic algorithm for a classically exponential problem, it does happen to throw in a slowdown factor of a trillion. Oops.
But still, even with large constant factors, algorithmic improvements eventually win out asymptotically. It's just a matter of building a quantum computer big enough and reliable enough to handle problems of that size.
We are already hitting the limits of what we can train on GPUs at reasonable cost. I expect that there will be many advances in the years to come from improved training algorithms. But at some point that will run dry, and further advances will come from quantum computing. Scott is right to point out that this is far, far beyond any existing company's planning horizon. But that doesn't mean quantum technologies won't work, eventually.
I think software bloat is growing faster, though.
While in practice the latency is over an order of magnitude larger, the speed of light round trip distance between two sockets on the same motherboard is likely on the scale of 1-2 nanoseconds which is already several cycles. Once you're talking about warehouse sized datacenters the speed of light distance can get into the microseconds even in a straight line through vacuum/air.
The stuff you use a computer hard drive or flash drive for has remained consistent. Some items may have moved to the internet, but most games, OS utilities and whatnot are stored on local HDDs or SSDs. Always have been, and likely will remain so.
I remember when data interchange meant burning a disc and walking a disc over to my buddy. XML will standardize that and simplify my workflow. /s
With iCloud, your primary data lives in the cloud; your SSD caches recently used data, and the OS will delete it if it runs out of space. That’s millions of users for whom “whatnot” isn’t stored locally. (That’s how a 2TB iCloud subscription can make sense even if only used with a 500 GB SSD)
Lots of Facebook and flash games are stored 'in the cloud' (but probably cached locally).
For many games, a big part of the game logic runs on some server somewhere, especially multiplayer games.
- goto/jmp to instr
- vtable lookup
- hash and lookup in a dictionary
- run a Large Language Model
- goto/jmp to instr
- vtable lookup
- hash and lookup in a dictionary
- run everything through a universal approximator
/s
Yet on Hacker News, the exact opposite sentiment is true.
Employee hours are expensive, so optimise for your own time as much as you possibly can and push as much time onto the user as they will tolerate. Focus primarily on readability, to the detriment of all other factors. Heck, don't even bother compiling your program. Use an interpreter on the server.
Keep your local cost down, your remote cost up… after all, remote costs are not your costs, especially if every company is doing the same thing.
It's ironic that I could read a book and have it be so wrong, yet that book is the foundational Bible for many people's programming journey.
It's incorrect to make a bookshelf that can't hold the weight of books, but IKEA isn't incorrect to make a crappy-but-adequate bookshelf that will get given away when you move, and a master carpenter isn't incorrect to make an expensive-but-beautiful bookshelf that will be used for many decades.
A seed-stage startup trying to see if anyone cares about their product at all should probably accept it being 100x slower than it theoretically could be. And the Linux maintainers should be working hard to shave milliseconds.
If you are a startup (many HN readers are), employee hours are relatively expensive. The goal is to get a product out as fast as possible, and most of the time you fail at making a working product anyways. Meanwhile, you only have to spend money on servers once you have a successful product. Deferring the problem makes sense here.
On the other hand, I personally write code that'll be near guaranteed to run millions of times in some factory somewhere, and there's little benefit if I ship my code before the factory starts making stuff. I'll take the time to write high-performance and high-quality software because there's now a financial benefit to doing so.
But ultimately, it's my boss who gives me insight into what I should be doing. Not a book, because the book doesn't have context that [company] will lose tens of thousands of dollars every day that I don't fix [issue].
"Well, let's say you can shave 10 seconds off of the boot time. Multiply that by five million users and thats 50 million seconds, every single day. Over a year, that's probably dozens of lifetimes. So if you make it boot ten seconds faster, you've saved a dozen lives. That's really worth it, don't you think?”
That's not even counting loss of quality-adjusted life years from time spent staring at a screen waiting, like being stuck in traffic.
Does this mean there's an "adder" to add 1 to the "next digit" prior to feeding the digits to the main part of the multiplier? That itself seems like it'd be similar to the carry-lookahead circuit. Also thinking about when that needs to be done:
7 = 8-1
6 = 8-2
5 = 8-3
4 = 8-4 <- he didn't say they do this, but it would mean the MSB of the 3-bit value could be used to determine if we need to add 1 to the next digit, saving a couple gates.
You feed the partial products into a carry-save adder tree and choose a level which minimizes the delay and the extra flop area/power. I am not sure about the Pentium technology, but whatever the area, it might be comparable to the x3 multiplier.
There is an implicit add in the x3 multiplier circuit that increases the delay, which was deemed more acceptable than the radix-4 delay. Considering all this, I strongly believe latency was a hard constraint (maybe for SW perf).
8086: 29,000
386: 275,000
486: 1.2 million
Pentium: 3.1 million
The NSA entered the game sometime after 2000, I recall.
> Multiplying a number by three is straightforward in binary: add the number to itself, shifted to the left one position. (As mentioned above, shifting to the left is the same as multiplying by two and is easy in hardware.) Unfortunately, using a simple adder is too slow.
So no, not really. In fact the entire point of article is arguably explaining why it is not that simple.
Perhaps in 10 years training runs will be cheap enough that, on a whim, you could just train a new AI model on data from only before the Pentium was released, and then see if it could come up with it?
1. Shift-left
2. Add original value
For multibyte values use rotate left which carries, and do the rotates from LSB to MSB. Then clear carry, then do the adds from LSB to MSB with carries. Should work for any word size.
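A C model of that routine (my own sketch, assuming a little-endian byte array and keeping a copy of the original value to add back in, which an 8-bit micro would also need in some register or memory location):

    #include <stdint.h>
    #include <stdio.h>

    #define N 4   /* word size in bytes */

    /* Multiply a little-endian N-byte value by 3 in place: shift the whole
       value left one bit (rotating each byte through carry, LSB to MSB), then
       add the original value back in with carries, LSB to MSB. */
    static void times3(uint8_t v[N])
    {
        uint8_t orig[N];
        unsigned carry, i, t;

        for (i = 0; i < N; i++) orig[i] = v[i];

        carry = 0;                                   /* shift-left pass */
        for (i = 0; i < N; i++) {
            t = ((unsigned)v[i] << 1) | carry;
            carry = (t >> 8) & 1;
            v[i] = (uint8_t)t;
        }

        carry = 0;                                   /* add-original pass */
        for (i = 0; i < N; i++) {
            t = v[i] + orig[i] + carry;
            carry = (t >> 8) & 1;
            v[i] = (uint8_t)t;
        }
    }

    int main(void)
    {
        uint8_t v[N] = { 0x39, 0x30, 0x00, 0x00 };   /* 12345, little-endian */
        times3(v);
        printf("%u\n", v[0] | (v[1] << 8) | ((unsigned)v[2] << 16) | ((unsigned)v[3] << 24));
        return 0;                                    /* prints 37035 */
    }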
Ofc I was a teenager but this seemed clever and simple to me.
Was (am) I just missing something?