fhdsgbbcaA
Looks like LLM inference will follow the same path as Bitcoin: CPU -> GPU -> FPGA -> ASIC.
bitdeep
Not sure if you guys know: Groq is already doing this with their ASIC chips. So they've already passed the FPGA phase and are in the ASIC phase.

The problem is: it seems their costs are 1x to 2x what they're charging.

jsheard
Is there any particular reason you'd want to use an FPGA for this? Unless your problem space is highly dynamic (e.g. prototyping) or you're making products in vanishingly low quantities for a price-insensitive market (e.g. military), an ASIC is always going to be better.

There doesn't seem to be much flux in the low-level architectures used for inference at this point, so you may as well commit to an ASIC, as is already happening with Apple, Qualcomm, etc. building NPUs into their SoCs.

rldjbpin
As of now there are way too many parallel developments across abstraction layers, both hardware and software, to really have the best combo just yet. Even this example targets an older architecture, because certain things just move slower than others.

But once things plateau, this approach, and then ASICs, would probably be the most efficient way forward for "stable" versions of AI models during inference.

KeplerBoy
4 times as efficient as on the SoC's low-end ARM cores, so presumably many times less efficient than on modern GPUs?

Not that I was expecting GPU-like efficiency from a fairly small-scale FPGA project. Nvidia engineers have spent thousands of man-years making sure that stuff works well on GPUs.
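
For what it's worth, the comparison being gestured at here is usually done in tokens per joule (tokens/s divided by watts). A minimal sketch of that arithmetic, with purely illustrative placeholder numbers rather than measured figures from the article:

  # Back-of-envelope efficiency comparison in tokens per joule.
  # All numbers below are hypothetical placeholders, NOT measurements.

  def tokens_per_joule(tokens_per_second: float, watts: float) -> float:
      """Energy efficiency: tokens generated per joule consumed."""
      return tokens_per_second / watts

  # Hypothetical figures, chosen only to illustrate the ratios being discussed.
  platforms = {
      "SoC ARM cores":  tokens_per_joule(tokens_per_second=1.0, watts=5.0),
      "FPGA overlay":   tokens_per_joule(tokens_per_second=4.0, watts=5.0),     # ~4x the ARM baseline
      "Datacenter GPU": tokens_per_joule(tokens_per_second=1000.0, watts=300.0),
  }

  baseline = platforms["SoC ARM cores"]
  for name, eff in platforms.items():
      print(f"{name:15s} {eff:6.3f} tok/J  ({eff / baseline:5.1f}x vs ARM baseline)")

The point is just that a per-watt ratio, not raw throughput, is the relevant comparison between an embedded FPGA and a 300 W-class GPU.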