In my experience profiling and optimizing ML-based guitar amp models in the PiPedal project (https://rerdavies.github.io/pipedal/), when using only NEON instructions, performance is almost completely constrained by L2 memory bandwidth. Compute costs almost completely disappear into the time spent waiting for memory loads and stores.
So, although these devices have ferociously impressive FLOP rates, I'm extremely curious as to how the cost of memory loads and stores is going to work.
I can very well imagine that having large local tile buffers is going to dramatically improve performance. But I'm curious by how much. No matter how fast the compute is, it seems to me that the performance of these sorts of devices in practice is going to be constrained by memory transfer rates, and perhaps by L1 caches in the tile compute unit that are better optimized for tile computation than the L1 cache on a general-purpose CPU.
My current expectation: the performance of matrix multiplies increases linearly with tile size. I.e. a tile size of 8x8 floats will perform twice as fast as a matrix multiplier with a tile size of 4x4, since doubling the tile size halves the required transfers to and from L2 per multiply-accumulate.
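Back-of-envelope, assuming an outer-product-style kernel that keeps a TxT accumulator tile in registers and streams panels of A and B through L2 (my assumption about the kernel shape, not a measurement):

    per k step:   load 2*T floats  = 8*T bytes from L2
                  do   T*T MACs    = 2*T*T FLOPs
    arithmetic intensity = 2*T*T / (8*T) = T/4 FLOPs per byte

    T = 4  ->  1 FLOP/byte
    T = 8  ->  2 FLOPs/byte  (half the L2 traffic per FLOP, so roughly 2x if purely bandwidth-bound)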
So, compared to basic A72 ARM NEON (effectively a 4x8 tile size), I would expect about a 4x improvement simply by virtue of the larger tile size on the Apple tile processor, with both otherwise being entirely limited by the cost of L2 memory loads and stores. And maybe another 2x or 3x improvement because the tile processor's L1 caches (tile buffers) are tuned for tile multiply/accumulate operations.
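To make that concrete, here is a minimal sketch (not the actual PiPedal kernel, just an illustration using standard arm_neon.h intrinsics) of the kind of register tile NEON gives you: a 4x4 outer-product step that loads 8 floats (32 bytes) and does 16 multiply-adds (32 FLOPs):

    #include <arm_neon.h>

    /* Illustrative 4x4 NEON outer-product micro-kernel step.
       c[] holds the 4x4 accumulator tile in registers; each call
       loads one column of A and one row of B (8 floats total)
       and performs 16 fused multiply-adds. */
    static inline void tile_4x4_step(float32x4_t c[4],
                                     const float *a_col,
                                     const float *b_row)
    {
        float32x4_t b = vld1q_f32(b_row);   /* one 4-float row of B    */
        float32x4_t a = vld1q_f32(a_col);   /* one 4-float column of A */
        c[0] = vfmaq_laneq_f32(c[0], b, a, 0);
        c[1] = vfmaq_laneq_f32(c[1], b, a, 1);
        c[2] = vfmaq_laneq_f32(c[2], b, a, 2);
        c[3] = vfmaq_laneq_f32(c[3], b, a, 3);
    }

In practice you widen this to something like 4x8 by holding more accumulator registers, but either way the loads, not the FMAs, are what the core ends up waiting on.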
Could somebody comment on how these devices actually perform on real matrix multiplies? It seems inconceivable to me that these devices will achieve peak FLOP rates in anything but meaningless test cases. It also seems like a somewhat meaningless exercise to measure peak performance using test cases designed to completely eliminate L2 memory transfers.
dividuum
> Although Apple has included a matrix accelerator in its devices since 2019, it used a proprietary instruction set inaccessible to developers, who officially could only use Apple-provided numerical libraries.
How does that work? Does the hardware throw some kind of fault when using those instructions? Or are they merely undocumented and you could use them if you figure out how they work? I guess the second, as hinted by the "officially"?
freeqaz
Any comparison of how much faster this is than the previous way of doing things on the CPU?
nxobject
If Apple’s going for one SME accelerator per base M4 chiplet, it’ll be interesting to see how to program scalably for Pro/Max/Ultra variants.
kjkjadksj
I wish they made computers that ran software like games again. It seems like for the last few iterations they've been working hard on making computers that can run AI models a little faster. Are people really asking for that? I would think far more people would like to play a video game than roll their own matrix multiplication, but I guess that's why they pay the people at Apple the big bucks: they must know best.
ein0p
I'm not sure why they added this feature. All Apple SoCs have far more energy-efficient compute elsewhere on the chip than on the CPU. This would only make sense for really tiny models that need an extremely quick forward pass. For such models the overhead of a GPU or Neural Engine kernel launch would be quite noticeable. But for those, the old NEON was already OK, and if not, there is also a dedicated matrix unit called AMX. Seems kinda random to me.
brcmthrowaway
I'm dim, what's the difference between SVE and SME?
DanielLee5
Great review.
softwaredoug
I just wish they'd make native TensorFlow installation actually work without a million Apple Silicon-specific exceptions :)