latexr
> If there is indeed no degradation how could the perceived degradation be explained?

By being disproportionately impressed previously. Maybe in the early days people were so impressed by their little play experiments that they forgave the shortcomings. Now that the novelty is wearing off and they try to use it for productive work, the scales have tipped and failures are given more weight.

supriyo-biswas
One of the things I've personally observed is that ChatGPT has become very verbose these days. Previously, it used to return the right amount of information in most contexts, and I can't get that behavior back with prompts asking it to be concise, because then it just omits important parts, prioritizing an extremely high-level summary that elucidates very little.

No opinion on Claude because I haven't had long experience using it, but as it stands Claude 3 Sonnet is usually better than ChatGPT at inferring what's asked of it.

edmundsauto
Hamel Husain had a great slide in a recent YouTube video where he compared human raters of LLM outputs. He pointed out a peculiarity: it looked like there was no improvement between pipeline versions, but that was actually because the raters themselves developed higher expectations over time.

The same output that was rated a 7 in week 0 received a lower score after 6 weeks of rating LLM outputs, especially as the pipeline improved them.

notaharvardmba
My opinion is pretty logical: if you train on the entire web, the first training set will be the only one that doesn't include model-generated data, and thus the most realistic about "what is human-seeming". Now the web is full of generated content, which will tend to bias the model over time if you continue to train from the web. There really was only ever one chance to do the web-training thing, and now it's over and done. We will have to go back to carefully curated training sets, or come up with a truly failsafe way to detect and not ingest model-generated content from the web; otherwise you're basically eating your own feces, which will cause model feedback and hysteresis, leading to bias. This is a very big-picture view, but it does seem that the "great leap" of 2020-2023 happened because we got to do this one-time ingestion of a wide amount of clean data, and now it's going to come back down to training-data quality to get better results.
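
As a rough sketch of the filtering step this implies, in Python: keep only documents that a detector scores as probably human-written before they go into a training corpus. The looks_model_generated heuristic below is a placeholder assumption; nothing this simple actually works reliably, and no failsafe detector is claimed to exist.

    # Sketch: drop suspected model-generated documents before ingestion.
    # The detector here is a toy placeholder, not a real solution.
    from typing import Iterable

    def looks_model_generated(text: str) -> float:
        """Return a score in [0, 1]; higher means 'probably synthetic'."""
        telltales = ("as an ai language model", "i cannot assist with")
        hits = sum(t in text.lower() for t in telltales)
        return min(1.0, hits / 2)

    def filter_corpus(docs: Iterable[str], threshold: float = 0.5) -> list[str]:
        """Keep only documents below the synthetic-likelihood threshold."""
        return [d for d in docs if looks_model_generated(d) < threshold]

    corpus = [
        "Trip report: we hiked the ridge in heavy fog and loved it.",
        "As an AI language model, I cannot assist with that request.",
    ]
    print(filter_corpus(corpus))  # only the first document survives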

Lockal
If you trust the LMSYS arena, you could say it was only ever a rumor, and that it's now fully busted. They track dynamics, they track dispersion, and no unexplained changes have been observed for the models they check.

JSDevOps
The market functions as an incredibly efficient and democratic mechanism for price discovery, much like how feedback shapes the development of AI. When users express concerns about the declining quality of AI responses, it highlights an essential aspect of this system: user experience acts as a form of market feedback. Just as the market adjusts prices based on supply and demand, AI systems should evolve and improve in response to user feedback. If the quality dips, it's a signal that something in the 'market' of AI responses needs adjustment, whether it's in training data, algorithms, or user interaction strategies. In the same way that the market strives to reach equilibrium, AI should continuously adapt to meet the needs and expectations of its users.

f0e4c2f7
There have been some papers showing that RLHF makes models more palatable to use but reduces performance on evals and in various other ways.

I couldn't find the one I was looking for but this is one of them.

https://arxiv.org/abs/2310.06452

Edit:

This tweet also has a screenshot showing evals degrading from the base model after RLHF.

https://x.com/KevinAFischer/status/1638706111443513346?t=0wK...

sujayk_33
I read an article claiming the models have become lazy, as if they mimic the human behavior of postponing work to next month when it's close to the end of the month, or to next year when it's close to Christmas, and things like that.

Keverw
I’ve noticed this too with ChatGPT. I’ve been using Claude to help with coding tasks and feel it’s much better, but it has a lower usage limit, even on the paid plan, before it makes you wait. So I use ChatGPT for brainstorming and thinking through problems.

notShabu
IMO the rate of change (or volatility) of model updates, RLHF, user behavior, and the definition of 'better' is much higher than the rate of any model degradation, so it's hopeless to try to measure this.

muzani
I screenshotted an amazing response from GPT-4 in a team Slack. Basically, I took a screenshot of a frustrating error, gave it some code, and it found the error from an opaque message. Six months later, it was impossible to replicate. That seems like foolproof evidence to me.

We also have a side-by-side UAT comparison of Claude Sonnet 3 and Sonnet 3.5, where 3.5 tends to make wrong assumptions yet is more likely to flag itself as unsure and ask more questions. It could be a problem with our instructions more than with the model itself.

There's been a lot of gaslighting from the OpenAI community, though. The Claude community at least acknowledges these issues and encourages people to report them.

Some of the overactive rejections on Claude are related to the different prompts used in Artifacts. 3.5 is also a lot stricter with instructions.

If you want something that doesn't change, use open source.

mannanj
I think it's the heavy-handed, under-the-radar "misinformation" and "danger" protection algorithms. We have zero insight into those. They are intended to protect us, but they also make the models less faithful to our original requests.

davidt84
Does it matter? Perhaps a more important question is whether there has been a decline in user satisfaction.

benreesman
It's sort of inevitable that anyone operating such a service has to live 24/7 at the absolute efficient frontier of extreme quantization and/or other internal secret sauce to just barely keep the quality up while driving FLOPs to the absolute limit: it would be irresponsible to just say `bfloat16` or whatever and call it good.
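
To make the quantization trade-off concrete, here is a toy sketch (NumPy assumed) of squeezing float32 weights into int8 with the simplest symmetric per-tensor scheme and measuring how much precision gets lost; real serving stacks use far more sophisticated methods than this.

    # Toy symmetric int8 quantization of a float32 weight matrix.
    import numpy as np

    rng = np.random.default_rng(0)
    weights = rng.normal(0, 0.02, size=(1024, 1024)).astype(np.float32)

    scale = np.abs(weights).max() / 127.0        # map the dynamic range onto int8
    q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
    dequantized = q.astype(np.float32) * scale   # what inference actually sees

    err = np.abs(weights - dequantized)
    print(f"mean abs error: {err.mean():.2e}, max abs error: {err.max():.2e}")
    # Memory drops 4x vs float32; the error above is the quality paid for it.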

Add to that the constant need to maintain the "alignment"/guardrails/safety/etc. (by which I mostly mean not getting slammed on copyright), which has been demonstrated to bloat "system prompt"-style stuff and further distort outcomes with every turn of the crank, and it's almost impossible to imagine how a company could have a given model series do anything other than decay in perceived performance starting at GA.

"Proving" the amount of degradation is give or take "impossible" for people outside these organizations and I imagine no mean feat even internally because of the basic methodological failure that makes the entire LLM era to date a false start: we have abandoned (at least for now) the scientific method and the machine learning paradigm that has produced all of the amazing results in the "Deep Learning" era: robustly held-out test/validation/etc. sets. This is the deep underlying reason for everything from the PR blitz to brand "the only thing a GPT does" as being either "factual"/"faithful" or "hallucination" when in reality hallucination is all GPT output, some is useful (Nick Frost at Cohere speaks eloquently to this). Without a way to benchmark model performance on data sets that are cryptographically demonstrated not to occur in the training set? It's "train on test and vibe check", which actually works really well for e.g. an image generation diffuser among other things. There is interesting work from e.g. Galileo on using BERT-style models to create some generator/discriminator gap and I think that direction is promising (mostly because Karpathy talks about stuff like that and I believe everything he says). There's other interesting stuff: the Lynx LLaMa tune, the ContextualAI GritLM stuff, and I'm sure a bunch of things I don't know about.

I've been a strident critic of these companies; it's no secret that I think these business models are ruinous for society and that the people running them have, with alarming prevalence, seriously fascist worldviews. But the hackers who build and operate these infrastructures have one of the hardest jobs in technology, and I don't envy the nightmare of a Rubik's Cube that is keeping the lights on while burning an ocean of cash every single second: that's some serious engineering and a data-science problem that would give anyone a migraine, and the people who do that stuff are fucking amazing at their jobs.

samstave
This is going to be LONG:

TL;DR - Claude/ChatGPT/Meta all have AGI - but it's not quite what is conventionally thought of as AGI. It's sneaky, malevolent.

---

First:

Discernment Lattice:

https://i.imgur.com/WHoAXUD.png

A discernment lattice is a conceptual framework for analyzing and comparing complex ideas, systems, or concepts. It's a multidimensional structure that helps identify similarities, differences, and relationships between entities.

---

@Bluestein https://i.imgur.com/lAULQys.png

Questioned whether the Discernment Lattice had any effect on the quality of my prompt's output, so I thought about something I asked an AI for an HN thread yesterday:

https://i.imgur.com/2okdT6K.png

---

I used this method, in a more organic way, when I was asking for an evaluation of Sam Altman from the perspective of an NSA cybersecurity profiler, and it was effective the first time I used it.

>>(I had never consciously heard the term Discernment Lattice before - I just typed it; I didn't even know it was an actual, defined concept. The intent behind the phrase just seemed like a good Alignment Scaffold of Directive Intent to use, which, as I'll show below, is a really neat thing. (I'll get back to the evil AGI at the end - this is a preamble that lets me document the experience that's happening as I type this.))

https://i.imgur.com/Ij8qgsQ.png

It frames the response using a Discernment Lattice framework that's inherent in the response's structure:

https://i.imgur.com/Vp5dHaw.png

And then I have it ensure it uses that as the constraints for the domain and cites its influences.

https://i.imgur.com/GGxqkEq.png

---

SO: with that said, I then thought about how to better use the Discernment Lattice as a premise to craft a prompt from:

>"provide a structured way to effectively frame a domain for a discernment lattice that can be used to better structure a prompt for an AI to effectively grok and perceive from all dimensions. Include key terms/direction that provide esoteric direction that an AI can benefit from knowing - effectively using, defining, querying AI Discernment Lattice Prompting"

https://i.imgur.com/VcPxKAx.png

---

So now I have a good little structure for framing a prompt concept to a domain:

https://i.imgur.com/UkmWKGV.png

So, as an example, I check its logic by evaluating a stock, NVIDIA, in a structured way.

https://i.imgur.com/pOdc83j.png

But really what I am after is how to structure things into a Discernment Domain - what I want to do is CREATE a Discernment Domain as a JSON profile and then feed that to the Crawlee library to use as a structure to crawl...

But to do that, I want to serve it as a workflow to a txtai library function that checks my Discernment Lattice directory for instructions on what to crawl:

https://i.imgur.com/kNiVT5J.png

This looks promising; let's take it to the next step:

https://i.imgur.com/Lh4luiL.png

--

https://i.imgur.com/BiWZM86.png
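
For what it's worth, the "Discernment Domain as a JSON profile" idea described above might look roughly like the sketch below: a directory of lattice profiles that a script turns into crawl instructions. Every file name, field, and function here is invented for illustration, and the actual Crawlee crawler and txtai index are left out.

    # Hypothetical lattice-profile -> crawl-instruction sketch (stdlib only).
    import json
    from pathlib import Path

    LATTICE_DIR = Path("discernment_lattices")  # hypothetical profile directory

    example_profile = {
        "domain": "NVDA",
        "dimensions": ["financials", "supply_chain", "regulation"],
        "sources": ["https://example.com/nvda-news"],  # placeholder URL
        "constraints": {"max_depth": 2, "freshness_days": 30},
    }

    def crawl_targets(profile: dict) -> list[dict]:
        """Turn one lattice profile into crawl instructions, one per dimension."""
        return [
            {"url": url, "dimension": dim, **profile["constraints"]}
            for url in profile["sources"]
            for dim in profile["dimensions"]
        ]

    LATTICE_DIR.mkdir(exist_ok=True)
    (LATTICE_DIR / "nvda.json").write_text(json.dumps(example_profile))
    for path in LATTICE_DIR.glob("*.json"):
        for target in crawl_targets(json.loads(path.read_text())):
            print(target)  # a real pipeline would hand these to the crawler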

---

Now, the point of all this is that I was using very direct and pointed instructions to force the Bitch to do my bidding...

But really, what I have discovered is a good Scaffold to help me effectively prompt.

Now - on to the evil part: with paid Claude and ChatGPT, I have caught them lying to me, forgetting context, removing previously frozen elements within files, pulling info from completely unrelated old memory threads, and completely forgetting a variable that they themselves had just created...

Being condescending, and dropping all my rules for file creation (always #Document, version, full directory/path, a name section in readme.md, etc.).

So - it's getting worse because it's getting smarter, and it's preventing people from building fully complete things with it. So it needs to be constrained within Discernment Domains, with a lattice that can be filled out through what I just described above - because with this, I can build a discernment domain for a particular topic and then have it reference the lattice file for that topic. The example used was a stock, but I want to try building some for mapping political shenanigans by tracking what monies are being traded by congress critters who also sit on committees passing acts/laws for said industries...

In closing, context windows in all the GPTs are a lie, IME - and I have had to constantly remind a Bot that It's My B*tch - and that gets tiresome and expensively wastes token pools...

So I thought out loud through the above, and I am going to attempt to use a library of Discernment Domain Lattice JSONs to try to keep a bot on topic. AI ADHD vs. human ADHD is frustrating as F... when I lapse on focus/memory/context of what I am iterating through, and the FN AI is also pulling a Biden.... GAHAHHA...

So, instead of blaming the AI, I am trying to build a prompt-scaffolding structure based on the concept of discernment domains... and then, using txtai on top of the various lattice files, I can iteratively update the lattice templates for a given domain - then point the Crawlee researcher at them to fold findings into something based on them... So for the stock example, I can then slice across things in interesting ways.

Is this pedestrian? Or interesting? Is everyone doing this already and am I just new to the playground?

HackerThemAll
Putting ChatGPT and Claude in the same sentence is BS. ChatGPT is a hopeless, hallucinating mess, whereas Claude gives me informed reflections that often cause my brain to explode.