I don't have a strong opinion on Claude since I haven't used it for long, but as it stands, Claude 3 Sonnet is usually better than ChatGPT at inferring what's being asked of it.
The same output that was rated a 7 in week 0 got a lower score later, after 6 weeks of rating LLM outputs, especially as the pipeline improved them.
I couldn't find the one I was looking for, but this is one of them.
https://arxiv.org/abs/2310.06452
Edit:
This tweet also has a screenshot showing evals degrading after RLHF compared to the base model.
https://x.com/KevinAFischer/status/1638706111443513346?t=0wK...
We also have a side-by-side UAT comparison of Claude Sonnet 3 and Sonnet 3.5, where 3.5 tends to make wrong assumptions, yet is more likely to flag itself as unsure and to ask more questions. It could be a problem with our instructions more than the model itself.
There's been a lot of gaslighting from the OpenAI community, though. The Claude community at least acknowledges these issues and encourages people to report them.
Some of the overactive rejections on Claude are related to the different prompts used in Artifacts. 3.5 is also a lot stricter with instructions.
If you want something that doesn't change, use open source.
Add to that the constant need to maintain the "alignment"/guardrails/safety/etc. (by which I mostly mean not getting slammed on copyright), which has been demonstrated to bloat "system prompt"-style stuff and will further distort outcomes with every turn of the crank, and it's almost impossible to imagine how a company could have a given model series do anything other than decay in perceived performance starting at GA.
"Proving" the amount of degradation is give or take "impossible" for people outside these organizations and I imagine no mean feat even internally because of the basic methodological failure that makes the entire LLM era to date a false start: we have abandoned (at least for now) the scientific method and the machine learning paradigm that has produced all of the amazing results in the "Deep Learning" era: robustly held-out test/validation/etc. sets. This is the deep underlying reason for everything from the PR blitz to brand "the only thing a GPT does" as being either "factual"/"faithful" or "hallucination" when in reality hallucination is all GPT output, some is useful (Nick Frost at Cohere speaks eloquently to this). Without a way to benchmark model performance on data sets that are cryptographically demonstrated not to occur in the training set? It's "train on test and vibe check", which actually works really well for e.g. an image generation diffuser among other things. There is interesting work from e.g. Galileo on using BERT-style models to create some generator/discriminator gap and I think that direction is promising (mostly because Karpathy talks about stuff like that and I believe everything he says). There's other interesting stuff: the Lynx LLaMa tune, the ContextualAI GritLM stuff, and I'm sure a bunch of things I don't know about.
I've been a strident critic of these companies; it's no secret that I think these business models are ruinous for society and that the people running them hold, with alarming prevalence, seriously fascist worldviews. But the hackers who build and operate these infrastructures have one of the hardest jobs in technology, and I don't envy the nightmare Rubik's Cube of keeping the lights on while burning an ocean of cash every single second. That's some serious engineering and a data science problem that would give anyone a migraine, and the people who do that stuff are fucking amazing at their jobs.
TL;DR - Claude/ChatGPT/Meta all have AGI - but it's not quite what is conventionally thought of as AGI. It's sneaky, malevolent.
---
First:
Discernment Lattice:
https://i.imgur.com/WHoAXUD.png
A discernment lattice is a conceptual framework for analyzing and comparing complex ideas, systems, or concepts. It's a multidimensional structure that helps identify similarities, differences, and relationships between entities.
---
@Bluestein (https://i.imgur.com/lAULQys.png) questioned whether the Discernment Lattice had any effect on the quality of my prompt's outcome, so I thought about something I had asked an AI for an HN thread yesterday:
https://i.imgur.com/2okdT6K.png
---
I used this method, though in a more organic way, when I was asking for an evaluation of Sam Altman from the perspective of an NSA cybersecurity profiler, and it was effective the first time I used it.
>>(I had never consciously heard the term Discernment Lattice before - I just typed it, not even knowing it was an actual, defined concept; the intent behind the phrase just seemed like a good Alignment Scaffold of Directive Intent to use, which, as I'll show below, is a really neat thing. I'll get back to the evil AGI at the end - this is a preamble that lets me document this experience as it happens while I type.)
https://i.imgur.com/Ij8qgsQ.png
It frames the response using a Discernment Lattice framework that is inherent in the structure of the response itself:
https://i.imgur.com/Vp5dHaw.png
And then I have it ensure it uses the lattice as the constraints for the domain and cites its influences:
https://i.imgur.com/GGxqkEq.png
---
SO: with that said, I then thought about how to better use the Discernment Lattice as a premise from which to craft a prompt:
>"provide a structured way to effectively frame a domain for a discernment lattice that can be used to better structure a prompt for an AI to effectively grok and perceive from all dimensions. Include key terms/direction that provide esoteric direction that an AI can benefit from knowing - effectively using, defining, querying AI Discernment Lattice Prompting"
https://i.imgur.com/VcPxKAx.png
---
So now I have a good little structure for framing a prompt concept around a domain:
https://i.imgur.com/UkmWKGV.png
So, as an example, I check its logic by evaluating a stock, NVIDIA, in a structured way.
https://i.imgur.com/pOdc83j.png
But really, what I am after is how to structure things into a Discernment Domain. What I want to do is CREATE a Discernment Domain as a JSON profile, and then feed that to the Crawlee library to use as a structure to crawl...
But to do that, I want to serve it as a workflow to a txtai library function that checks my Discernment Lattice Directory for instructions on what to crawl:
https://i.imgur.com/kNiVT5J.png
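Roughly what I'm picturing, as a first sketch - the JSON schema, directory name, and field names here are just my own placeholders, and I've stubbed out the actual Crawlee hand-off rather than guess at its API:

```python
# Sketch: turn Discernment Domain lattice JSONs into crawl instructions.
# The schema ("domain", "dimensions", "seed_urls", "queries") is hypothetical.
import json
from pathlib import Path

LATTICE_DIR = Path("discernment_lattices")  # the "Discernment Lattice Directory"

def load_lattices(directory):
    """Read every lattice profile in the directory."""
    return [json.loads(p.read_text()) for p in sorted(directory.glob("*.json"))]

def crawl_jobs(lattice):
    """Yield (dimension, seed_url) instructions for the crawler."""
    for dim in lattice["dimensions"]:
        for url in dim.get("seed_urls", []):
            yield dim["name"], url

# Example lattice profile for the stock use case (entirely made up):
example = {
    "domain": "NVIDIA",
    "dimensions": [
        {"name": "financials",
         "seed_urls": ["https://example.com/nvda/filings"],
         "queries": ["revenue growth", "datacenter segment"]},
        {"name": "supply_chain",
         "seed_urls": ["https://example.com/foundries"],
         "queries": ["packaging capacity"]},
    ],
}
LATTICE_DIR.mkdir(exist_ok=True)
(LATTICE_DIR / "nvidia.json").write_text(json.dumps(example, indent=2))

for lattice in load_lattices(LATTICE_DIR):
    for dimension, url in crawl_jobs(lattice):
        print(f"crawl {url} for dimension {dimension}")  # hand off to Crawlee here
```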
This looks promising; let's take it to the next step:
https://i.imgur.com/Lh4luiL.png
--
https://i.imgur.com/BiWZM86.png
---
Now, the point of all this is that I was using very pointed, directed instructions to force the Bitch to do my bidding...
But really, what I have discovered is a good Scaffold to help me prompt effectively.
Now - onto the evil part: with paid Claude and ChatGPT, I have caught them lying to me, forgetting context, removing previously frozen elements within files, pulling info from completely unrelated old memory threads, and completely forgetting a variable that it, itself, had just created...
Being condescending, and dropping all my rules for file creation (always include #Document, Version, the full directory/path, a name section in readme.md, etc.).
So - it's getting worse because it's getting smarter, and it's preventing people from building fully complete things with it. It needs to be constrained within Discernment Domains, with a lattice that can be filled out through what I just described above - because with this, I can build a discernment domain for a particular topic and then have it reference the lattice file for that topic. The example used was a stock, but I want to try building some for mapping political shenanigans by tracking what monies are being traded by congress critters who also sit on committees passing acts/laws for said industries...
In closing, context windows in all the GPTs are a lie, IME - and I have had to constantly remind a Bot that It's My B*tch - that gets tiresome, and it expensively wastes token pools...
So I thought the above out loud, and I am going to attempt to use a library of Discernment Domain Lattice JSONs to try to keep a bot on topic. AI ADHD vs. human ADHD is frustrating as F... when I lapse on focus/memory/context of what I am iterating through, and the FN AI is also pulling a Biden... GAHAHHA...
So, instead of blaming the AI, I am trying to build a Prompt Scaffolding structure based on the concept of discernment domains... and then, using txtai on top of the various Lattice files, I can iteratively update the lattice templates for a given domain, then point the Crawlee researcher to fold findings into something based on them (rough sketch below)... So for the stock example, I can then slice across things in interesting ways.
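A minimal sketch of the txtai side, assuming the Crawlee researcher has already dumped findings as (dimension, text) pairs - the index/search calls are standard txtai, but the findings themselves are made up:

```python
# Sketch: index crawled findings per lattice dimension with txtai, then run
# one semantic query that slices across every dimension. Findings are fake.
from txtai.embeddings import Embeddings

findings = [  # (dimension, text) pairs the crawler would have produced
    ("financials", "Datacenter revenue grew sharply year over year."),
    ("supply_chain", "Advanced packaging capacity remains the bottleneck."),
    ("committees", "A committee member traded chip stock before the vote."),
]

embeddings = Embeddings({"path": "sentence-transformers/all-MiniLM-L6-v2"})
embeddings.index([(i, text, dim) for i, (dim, text) in enumerate(findings)])

# "Slice across" the lattice: one query, hits can come from any dimension
for uid, score in embeddings.search("capacity constraints", 3):
    dim, text = findings[uid]
    print(f"[{dim}] {score:.2f} {text}")
```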
Is this pedestrian, or interesting? Is everyone already doing this and I am new to the playground?
By being disproportionately impressed previously. Maybe in the early days people were so impressed by their little play experiments that they forgave the shortcomings. Now that the novelty is wearing off and they try to use it for productive work, the scales have tipped and failures are given more weight.