If we just charged ahead and kept upping the parameter counts of these models by orders of magnitude, we'd probably see big improvements in quality, but is it worth it? Inference at current model sizes is already notoriously expensive, and training is expensive too, resource-constrained (see OpenAI recently signing the deal with Oracle for more data centres), and technically difficult because you're now training across entire data centres.
It seems reasonable for these labs to be focusing on fundamental improvements and research, while making incremental improvements to existing models at least until the compute catches up and it becomes economically feasible to just jam up parameter counts again.
So now, a few releases later, after spec bumps and smaller changes to the architectures, there's not much room left to grow easily.
The availability and quality of training data are also limiting factors (speed and resource needs can be improved more easily than that).
I'd speculate OpenAI hires a lot of people to make new data for them. There are stories of people who were interviewed for jobs that consisted entirely of solving questions to be fed to GPT-4. They're likely looking at responses that were thumbed down and adding several other data points to handle those cases.
There's also the quality of this data to consider. You could feed in data from private conversations (which Meta probably does), but most of it would be low quality.
We're finding that Claude, GPT, Llama, even DeepL consistently mistranslate "you" in Indonesian as "Anda". In real-world usage it's never capitalized, and the lowercase form is the safe, neutral choice, roughly the equivalent of "they/them" pronouns in English. Imagine how weird it would read if someone wrote "Alice walks Their dog every afternoon." The models were likely trained on government docs or perhaps some kind of ads.
But the point is there are lots of holes in the data, and the perception of quality comes down to how often a model avoids falling into them.
None of the benchmarks catch this bug either. And it's not just Indonesian: most of the code benchmarks only use Python, for instance. So we'll likely need better benchmarks in the future too.
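To be concrete about what "catching" this kind of bug could look like: a register check that flags mid-sentence capitalized "Anda" is trivial to write. This is just a hypothetical sketch, not part of any real benchmark, and the function name is made up:

```python
import re

def flags_formal_anda(text: str) -> bool:
    """Flag translations using capitalized 'Anda' where everyday
    Indonesian would use a lowercase form.

    Hypothetical check for illustration: it looks for 'Anda'
    capitalized anywhere other than at the start of a sentence.
    """
    for match in re.finditer(r"\bAnda\b", text):
        # Sentence-initial capitalization is expected; mid-sentence is not.
        preceding = text[:match.start()].rstrip()
        if preceding and preceding[-1] not in ".!?":
            return True
    return False

# A natural sentence passes; a stilted mid-sentence "Anda" is flagged.
print(flags_formal_anda("Kamu mau makan apa?"))       # → False
print(flags_formal_anda("Apakah Anda sudah makan?"))  # → True
```

A handful of targeted checks like this per language would surface register errors that aggregate translation scores completely miss.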