In the olden, serial-computing days, our algorithms were standard, and CPU designers did all sorts of behind-the-scenes tricks to improve performance without burdening software developers. It wasn't a perfect abstraction, but they tried. Algorithms led the way; hardware had to follow.
CUDA threw all of that away and exposed lots of ugly details of GPU hardware design that developers _had to_ take into account. This is why, for a long time, CUDA's primary customers (the HPC community and national labs) refused to adopt it.
It's interesting how much our view on this has shifted now that CUDA has become a legitimate, widely adopted computing paradigm.
There are two arguments in favor of im2col.
1. "I don't want to implement a dedicated software kernel just for convolutions" aka laziness
2. "I don't want to implement dedicated hardware just for convolution"
The former is a sham; the latter is motivated by silicon area constraints. A dedicated convolution unit requires exactly the same number of FMAs as a matrix-multiplication unit, so adding one would roughly double your chip size and curse you with 50% utilization from the start, unless you can keep both the matrix-multiplication unit and the convolution unit busy simultaneously.
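To make the FMA-count point concrete, here's a quick back-of-the-envelope check with made-up layer dimensions (the sizes are purely illustrative):

```python
# Hypothetical conv layer, stride 1, no padding -- sizes chosen only for illustration.
N, C, H, W = 1, 64, 56, 56          # batch, in-channels, input height/width
K, R, S    = 128, 3, 3              # out-channels, kernel height/width
P, Q       = H - R + 1, W - S + 1   # output height/width

# Direct convolution: one FMA per (output pixel, output channel, kernel tap, input channel).
fmas_direct = N * K * P * Q * C * R * S

# im2col + GEMM: a (K x C*R*S) matrix times a (C*R*S x N*P*Q) matrix.
fmas_gemm = K * (C * R * S) * (N * P * Q)

assert fmas_direct == fmas_gemm     # identical arithmetic work; only the data layout differs
print(fmas_direct)                  # 214,990,848 FMAs either way
```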
Answers like this one, https://stackoverflow.com/a/47422548, are subtly wrong.
"Element wise convolution performs badly because of the irregular memory accesses involved in it." at a first glance sounds like a reasonable argument, but all you're doing with im2col is shifting the "irregular memory accesses" into a separate kernel. It doesn't fundamentally get rid of the "irregular memory accesses".
The problem with the answer is that the irregularity is purely a matter of perspective. Assuming you implement im2col in hardware, there is nothing difficult about it: what is considered irregular here is perfectly predictable from the hardware's point of view.
All you do is load x pixels from y rows simultaneously, which is extremely data-parallel and SIMD-friendly. Once the data is in local registers, you can access it any way you want (each register is effectively its own bank), which makes it easy to produce the im2col output stream and feed it straight into your matrix-multiplication unit. You could have implemented the convolution directly, but then again you'd only get 50% utilization due to the inflexibility.
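A minimal NumPy sketch of what im2col actually does (the function name and layer shapes are mine, just for illustration): the "irregular" part is a pure gather into a matrix, after which the convolution is an ordinary matrix multiplication.

```python
import numpy as np

def im2col(x, R, S):
    """Gather R x S patches of a (C, H, W) input into a (C*R*S, P*Q) matrix (stride 1, no padding)."""
    C, H, W = x.shape
    P, Q = H - R + 1, W - S + 1
    cols = np.empty((C * R * S, P * Q), dtype=x.dtype)
    for c in range(C):
        for r in range(R):
            for s in range(S):
                # Each output row is a shifted view of one input plane: perfectly
                # predictable strided loads, "irregular" only from the GEMM's point of view.
                cols[c * R * S + r * S + s] = x[c, r:r + P, s:s + Q].reshape(-1)
    return cols

# Convolution = im2col gather followed by a plain GEMM.
C, H, W, K, R, S = 3, 8, 8, 4, 3, 3
x = np.random.rand(C, H, W).astype(np.float32)
w = np.random.rand(K, C, R, S).astype(np.float32)

y = (w.reshape(K, C * R * S) @ im2col(x, R, S)).reshape(K, H - R + 1, W - S + 1)
```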
Looks like American sanctions are driving a new wave of innovation in China.
" This work addresses that gap by introducing the Ten- sor Manipulation Unit (TMU): a reconfigurable, near-memory hardware block designed to execute data-movement-intensive (DMI) operators efficiently. TMU manipulates long datastreams in a memory-to-memory fashion using a RISC-inspired execution model and a unified addressing abstraction, enabling broad support for both coarse- and fine-grained tensor transformations.
The proposed architecture integrates TMU alongside a TPU within a high-throughput AI SoC, leveraging double buffering and output forwarding to improve pipeline utilization. Fab- ricated in SMIC 40 nm technology, the TMU occupies only 0.019 mm2 while supporting over 10 representative TM operators. Benchmarking shows that TMU alone achieves up to 1413.43× and 8.54× operator-level latency reduction over ARM A72 and NVIDIA Jetson TX2, respectively.
When integrated with the in- house TPU, the complete system achieves a 34.6% reduction in end-to-end inference latency, demonstrating the effectiveness and scalability of reconfigurable tensor manipulation in modern AI SoCs."
100 million tokens per second is currently worth about $130,000,000/day. (Or so ChatGPT 4.1 told me a few days ago)
I'd like to drop that by a factor of at least 1000:1
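Sanity-checking that figure (the ~$15 per million tokens price below is my assumption, roughly in the ballpark of current frontier-model API output pricing, not something from the thread):

```python
tokens_per_sec = 100e6
seconds_per_day = 86_400
price_per_million_tokens = 15.0     # assumed API price in USD; actual prices vary by model

tokens_per_day = tokens_per_sec * seconds_per_day               # 8.64e12 tokens/day
revenue_per_day = tokens_per_day / 1e6 * price_per_million_tokens
print(f"${revenue_per_day:,.0f}/day")                           # ~$129,600,000/day, close to the quoted $130M
```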
We need FPGAs at the latest process node, with many GBs of HBM in the package. Fast reconfigurability would also be a nice-to-have.
I feel like FPGAs have stagnated over the last decade since the two largest companies in this space were acquired by Intel and AMD. Those companies haven't kept up the pace of innovation here, as it isn't their core business.
16 nm (or “14 nm”) for UltraScale+.