If we just charged ahead and kept upping the parameter counts of these models by orders of magnitude, we'd probably see big improvements in quality, but is it worth it? Inference at current model sizes is already notoriously expensive, and training is expensive too, resource-constrained (see OpenAI recently signing the deal with Oracle for more data centres), and technically difficult because you're now training across entire data centres.
It seems reasonable for these labs to be focusing on fundamental improvements and research, while making incremental improvements to existing models at least until the compute catches up and it becomes economically feasible to just jam up parameter counts again.
So now, a few releases later, after spec bumps and smaller changes to the architectures, there's not much room left to grow easily.
The availability and quality of training data are also limiting factors (speed and resource needs can be improved more easily than that).
I'd speculate OpenAI hires a lot of people to make new data for them. There are stories of people who were interviewed for jobs that consisted entirely of solving questions to be fed to GPT-4. They're likely looking at responses that were thumbed down and adding several other data points to handle those cases.
There's also the quality of this data to consider. You could feed in data from private conversations (which Meta probably does), but most of it would be low quality.
We're finding that Claude, GPT, Llama, even DeepL consistently mistranslate "you" in Indonesian as "Anda". In real-world usage it's never capitalized, and the lowercase form is the safe, neutral choice, roughly the equivalent of "they/them" pronouns in English. Imagine how weird it would read if someone wrote "Alice walks Their dog every afternoon." The models were likely trained on government docs or perhaps some kind of ads.
But the point is there are lots of holes in the data, and the perception of quality comes down to how often a model avoids falling into them.
None of the benchmarks catch this bug either. And it's not just Indonesian: most of the code benchmarks only use Python, for instance. So we'll likely need better benchmarks in the future too.
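To be concrete about what "catching" this kind of bug could look like: a register check that flags mid-sentence capitalized "Anda" is trivial to write. This is just a hypothetical sketch, not part of any real benchmark, and the function name is made up:

```python
import re

def flags_formal_anda(text: str) -> bool:
    """Flag translations using capitalized 'Anda' where everyday
    Indonesian would use a lowercase form.

    Hypothetical check for illustration: it looks for 'Anda'
    capitalized anywhere other than at the start of a sentence.
    """
    for match in re.finditer(r"\bAnda\b", text):
        # Sentence-initial capitalization is expected; mid-sentence is not.
        preceding = text[:match.start()].rstrip()
        if preceding and preceding[-1] not in ".!?":
            return True
    return False

# A natural sentence passes; a stilted mid-sentence "Anda" is flagged.
print(flags_formal_anda("Kamu mau makan apa?"))       # → False
print(flags_formal_anda("Apakah Anda sudah makan?"))  # → True
```

A handful of targeted checks like this per language would surface register errors that aggregate translation scores completely miss.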