> Results on AIME and GPQA are really strong, but that doesn’t necessarily translate to something that a user can feel. Even as someone working in science, it’s not easy to find the slice of prompts where GPT-4o fails, o1 does well, and I can grade the answer. But when you do find such prompts, o1 feels totally magical. We all need to find harder prompts.
Results are "strong" but can't be felt by the user? What does that even mean?
But the last sentence is the worst: "we all need to find harder prompts". If I understand it correctly, it means we should go looking for new problems / craft specific questions that would let these new models shine.
"This hammer hammers better, but in most cases it's not obvious how better it is. But when you stumble upon a very specific kind of nail, man does it feel magical! We need to craft more of those weird nails to help the world understand the value of this hammer."
But why? Why would we do that? Wouldn't our time be better spent trying to solve our actual, current problems, using any tool available?
This is the prompt I gave:
simplify this rust library by removing the different sized enums and only using the U8 size. For example MasksByByte is an enum, change it to be an alias for the U8 datatype. Also the u256 datatype isn't required, we only want U8, so remove all references to U256 as well.
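To make clear what that refactor amounts to, here is a hypothetical before/after sketch; these are not the real trie-hard definitions (the variant payloads are invented), they just illustrate the enum-to-alias direction the prompt asks for:

    // Hypothetical before/after of the requested change. These are NOT the
    // real trie-hard definitions; the variant payloads are invented just to
    // show the enum-to-alias direction of the refactor.
    #![allow(dead_code)]

    mod before {
        pub enum MasksByByte {
            U8(u8),
            // ... other widths elided ...
            U256([u64; 4]), // stand-in for a 256-bit mask type
        }
    }

    mod after {
        // Only the U8 width remains, so the enum collapses into an alias and
        // every `match` over the old variants disappears.
        pub type MasksByByte = u8;
    }

    fn main() {
        let mask: after::MasksByByte = 0b0000_1010;
        println!("mask = {mask:#010b}");
    }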
The original crate is trie-hard [1][2]; I forked it and put the model's attempts in the fork [3]. I also wrote it up quickly at [4].
[1] https://blog.cloudflare.com/pingora-saving-compute-1-percent...
[2] https://github.com/cloudflare/trie-hard
[3] https://github.com/kpm/trie-hard-simple/tree/main/attempts
[4] https://blog.reyem.dev/post/refactoring_rust_with_chatgpt-o1...
1. An LLM (probably a fine-tuned GPT-4o) trained specifically to read and emit good chain-of-thought prompts.
2. Runtime code that iteratively re-prompts the model with the chain of thought so far. This sounds like it includes loops, branches and backtracking. This is not "the model", it's regular code invoking the model. Interesting that OpenAI is making no attempt to clarify this.
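To make #2 concrete, here is a minimal sketch of the kind of driver loop I mean. Everything in it is my own guess at the shape of such a system, not anything OpenAI has described; `complete` and `looks_like_dead_end` are stubs standing in for a real LLM call and a real step-grading heuristic, so the sketch compiles on its own:

    // A toy chain-of-thought driver: plain code that repeatedly re-prompts a
    // model with the transcript so far, with a trivial form of backtracking.
    fn complete(prompt: &str) -> String {
        // In a real agent this would call an LLM API with `prompt`.
        format!("step after {} chars of context", prompt.len())
    }

    fn looks_like_dead_end(step: &str) -> bool {
        // Placeholder heuristic; a real driver might ask the model to grade
        // its own step, or run a verifier/tool here.
        step.contains("dead end")
    }

    fn main() {
        let task = "Prove the claim, showing your reasoning step by step.";
        let mut thoughts: Vec<String> = Vec::new();

        for _ in 0..8 {
            // Re-prompt with the chain of thought accumulated so far.
            let prompt = format!("{}\n\nReasoning so far:\n{}", task, thoughts.join("\n"));
            let step = complete(&prompt);

            if looks_like_dead_end(&step) {
                // Backtracking: drop the last thought and try a different branch.
                thoughts.pop();
                continue;
            }
            thoughts.push(step);
        }

        println!("final chain of thought:\n{}", thoughts.join("\n"));
    }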
I wonder where the real innovation here lies. I've done a few informal stabs with #2 and I have a pretty strong intuition (not proven yet) that given the right prompting/metaprompting model you can do pretty well at this even with untuned LLMs. The end game here is complex agents with arbitrary continuous looping interleaved with RAG and tool use.
But OpenAI's philosophy up until now has almost always been "The bitter lesson is true, the model knows best, just put it in the model." So it's also possible that the prompt loop has no special sauce and that the capabilities here do come mostly from the model itself.
Without being able to inspect the reasoning tokens, we can't really get a lot of info about which is happening.
As a developer, this is highly concerning, as it makes it much harder to debug where/how the “reasoning” went wrong. The pricing is also silly, because I’m paying for tokens I can’t see.
As a user, I don’t really care. LLMs are already magic boxes and I usually only care about the end result, not the path to get there.
It will be interesting to see how this progresses, both at OpenAI and other foundation model builders.
Kagi LLM benchmarking project:
o1-preview gave a much more in-depth but completely wrong answer. It took 5 follow-ups to get it to recognize that it had hallucinated a non-existent law.
It seems trivial, but I tried for more than 2 hours in the past to get GPT-4 to play tic-tac-toe optimally and failed (CoT prompts, etc.). The result was too many illegal moves and absolutely no optimal strategy.
o1-preview can do it really well [1]
However, when I used a non-standard grid (3x5), it failed to play optimally. But it made legal moves and recognized that I had won. [2]
My conclusion at the time was that either "spatial reasoning" doesn't work and/or planning is needed. Now I am not so sure whether they just included tic-tac-toe in the training data or whether "spatial reasoning" is still limited.
[1] https://chatgpt.com/share/e/66e3e784-26d4-8013-889b-f56a7fed... [2] https://chatgpt.com/share/e/66e3eae0-2d38-8013-b900-50e6f792...
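For reference, "optimal" here isn't fuzzy: 3x3 tic-tac-toe is small enough to solve exactly, so you can grade every move the model makes against a brute-force minimax. A toy sketch of such a reference solver (my own code, unrelated to the linked chats):

    // Toy exhaustive minimax for 3x3 tic-tac-toe (cells 0-8, players 1 and 2).
    // "Optimal play" = always picking a move that preserves the best game value.
    const LINES: [[usize; 3]; 8] = [
        [0, 1, 2], [3, 4, 5], [6, 7, 8], // rows
        [0, 3, 6], [1, 4, 7], [2, 5, 8], // columns
        [0, 4, 8], [2, 4, 6],            // diagonals
    ];

    fn winner(b: &[u8; 9]) -> u8 {
        for l in LINES.iter() {
            if b[l[0]] != 0 && b[l[0]] == b[l[1]] && b[l[1]] == b[l[2]] {
                return b[l[0]];
            }
        }
        0
    }

    // Game value from player 1's point of view: +1 win, 0 draw, -1 loss.
    fn minimax(b: &mut [u8; 9], to_move: u8) -> i32 {
        match winner(b) {
            1 => return 1,
            2 => return -1,
            _ => {}
        }
        if b.iter().all(|&c| c != 0) {
            return 0; // board full: draw
        }
        let mut best = if to_move == 1 { -2 } else { 2 };
        for i in 0..9 {
            if b[i] == 0 {
                b[i] = to_move;
                let v = minimax(b, 3 - to_move);
                b[i] = 0;
                best = if to_move == 1 { best.max(v) } else { best.min(v) };
            }
        }
        best
    }

    fn main() {
        // With perfect play from both sides the game is a draw (value 0).
        let mut board = [0u8; 9];
        println!("value of the empty board: {}", minimax(&mut board, 1));
    }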
Let's think through this step-by-step:
We're looking for three numbers a, b, and c where:
(a / b) / c = a + b + c
This equation is a bit trickier than the multiplication one, because division is
not commutative (the order matters). We need to be careful about how we
interpret (a / b) / c.
Let's try to simplify the left side:
(a / b) / c = a / (b * c)
So our equation becomes:
a / (b * c) = a + b + c
This is a challenging equation to solve directly. Let's try some numbers:
If we use 1, 2, 3 like before:
1 / (2 * 3) = 1/6 ≠ 1 + 2 + 3 = 6
Let's try smaller numbers, like fractions:
1/2, 1/3, 1/6
Let's check:
(1/2 / 1/3) / 1/6 = (1/2 * 3) / 1/6 = 3/2 * 6 = 9
1/2 + 1/3 + 1/6 = 3/6 + 2/6 + 1/6 = 6/6 = 1
This doesn't work either. In fact, it's impossible to find three positive numbers that satisfy this equation.
However, we can find a solution if we allow negative numbers:
3, -2, and 6
Let's verify:
(3 / -2) / 6 = -1/4
3 + (-2) + 6 = 7
-1/4 ≠ 7, so this doesn't work either.
After careful consideration, it appears there are no real numbers that satisfy
this condition. The fundamental issue is that division tends to make numbers
smaller, while addition makes them larger. It's mathematically impossible to
find three numbers where these operations result in the same value.
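For the record, even that final claim is wrong. A quick check with a = 1/3, b = 1/2, c = 1/2:
(1/3 / 1/2) / 1/2 = (2/3) / (1/2) = 4/3
1/3 + 1/2 + 1/2 = 4/3
Both sides agree, and all three numbers are positive.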
nuff said
> No streaming support, tool usage, batch calls or image inputs either.
I think it's worth adding a note explaining that many of these limitations are due to the beta status of the API. max_tokens is the only parameter I've seen deprecated in the API docs.
From https://platform.openai.com/docs/guides/reasoning
> We will be adding support for some of these parameters in the coming weeks as we move out of beta. Features like multimodality and tool usage will be included in future models of the o1 series.
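On the max_tokens point, here is roughly what a request looks like, sketched with the reqwest (blocking + json features) and serde_json crates. Going from my reading of the docs: the o1 models reject `max_tokens` and expect `max_completion_tokens` instead, which also covers the hidden reasoning tokens.

    use serde_json::json;

    fn main() -> Result<(), Box<dyn std::error::Error>> {
        let key = std::env::var("OPENAI_API_KEY")?;
        let body = json!({
            "model": "o1-preview",
            "messages": [{ "role": "user", "content": "Say hello." }],
            // `max_tokens` is not accepted by the o1 models; this caps
            // visible output plus the hidden reasoning tokens.
            "max_completion_tokens": 2048
        });

        let resp = reqwest::blocking::Client::new()
            .post("https://api.openai.com/v1/chat/completions")
            .bearer_auth(key)
            .json(&body)
            .send()?
            .text()?;
        println!("{resp}");
        Ok(())
    }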
I suppose I'm to blame for raising my own expectations after the latest PR, but I was pretty disappointed when the answers weren't any better than what I got with the old model. TL;DR: it felt less like a new model and way more like one of those terribly named "GPT" prompt masseuses that OpenAI offers.
Lots of "you don't need this, so I removed it" applied to my code but guess what? I did need the bits you deleted, bro.
It felt as unhelpful and as bad at following instructions as GPT-4o. "I'm sorry, you're absolutely right." It's gotten to the point where I've explicitly added "DO NOT EVER APOLOGIZE" to my custom instructions, but it can't even seem to follow that.
Given the amount of money being spent in this race, I would have expected the improvement curve to still feel exponential, but it's like we're hitting diminishing returns way faster than I had hoped...
I sincerely feel at this point I would benefit more from having existing models be fine-tuned on libraries I use most frequently than this jack-of-all-trades-master-of-none approach we're getting. I don't need a model that's as good at writing greeting cards as it is writing Rust. Just give me one of the two.
This has been a very incremental year for OpenAI. If this is what it seems like, then I’ve got to believe they’re stalling for time.
honestly, chatgpt is now a better coder than i ever was or will be
It nailed the execution. Amazing.
To determine which number is larger among 3.9, 3.10, and 3.11, let's compare them numerically:
Standard Decimal Interpretation:
3.9 is simply 3.9. 3.10 is equivalent to 3.1 because trailing zeros after the decimal point do not add value in decimal numbers. 3.11 remains 3.11.
Comparison: 3.9 vs. 3.1 vs. 3.11. 3.11 is the largest number.
Version Number Interpretation:
In version numbering, each segment separated by a dot is compared as an integer: 3.9 (major version 3, minor version 9), 3.10 (major version 3, minor version 10), 3.11 (major version 3, minor version 11).
Comparison: minor versions 11 > 10 > 9, so 3.11 is the largest version.
Answer: 3.11 is the largest number among 3.9, 3.10, and 3.11.
...
So IMO level right?
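For what it's worth, the two readings the answer describes are easy to pin down in code, and the plain-decimal reading actually makes 3.9 the largest, which is the part the answer above gets backwards. A quick sketch (nothing model-specific):

    // Compare "3.9", "3.10", "3.11" two ways: as plain decimal numbers and as
    // dotted version numbers. The decimal reading makes 3.9 the largest; the
    // version reading makes 3.11 the largest.
    fn main() {
        let inputs = ["3.9", "3.10", "3.11"];

        // 1. Decimal reading: parse as floating-point numbers.
        let max_decimal = inputs
            .iter()
            .max_by(|a, b| {
                let (x, y) = (a.parse::<f64>().unwrap(), b.parse::<f64>().unwrap());
                x.partial_cmp(&y).unwrap()
            })
            .unwrap();
        println!("largest as a decimal: {max_decimal}"); // 3.9

        // 2. Version reading: compare each dot-separated segment as an integer.
        let max_version = inputs
            .iter()
            .max_by_key(|s| {
                s.split('.')
                    .map(|part| part.parse::<u32>().unwrap())
                    .collect::<Vec<u32>>()
            })
            .unwrap();
        println!("largest as a version: {max_version}"); // 3.11
    }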
I asked it to create a user story and a set of tasks to implement some feature. It then created a set of stories where one was to create a story and set of tasks for the very feature I was asking it to plan.
And while reading the article, I noticed it mentions NOT providing irrelevant information to the task at hand via RAG. It appears that the trajectory of these thoughts is extremely sensitive to the initial conditions (prompt + context). One would imagine that the ability to backtrack after reflecting would help with divergence; however, that doesn't appear to have been the case here.
Maybe there is another factor here. Maybe there is some confusion when you ask it to plan something and the "hidden reasoning" tokens themselves involve planning/reasoning semantics? Maybe some sort of interaction occurred that caused it to fumble? Who knows. Interesting stuff, though.
One of these days those contraptions will work well enough, not because they're perfect, but because human intelligence isn't really that good either.
(And looking in this mirror isn't flattering us any.)
> I expect to continue mostly using GPT-4o (and Claude 3.5 Sonnet)
I saw similar comments elsewhere and I'm stunned - am I the only one who considers 4o a step back when compared to 4 for textual input and output? It basically gives fast semi-useful answers that seem like a slightly improved 3.5.
What OpenAI have delivered here is basically a hack - a neuro-symbolic agent that has a bunch of hard-coded "reasoning" biases built in (via RL). It's a band-aid approach to try to provide some of what's missing from the underlying model which was never designed for what it's now being asked to do.
>the output token allowance has been increased dramatically—to 32,768 for o1-preview and 65,536 for the supposedly smaller o1-mini!
So the text says reasoning and output tokens are billed the same, as in you pay for both. But does the increase mean it can actually do more, or just that it is able to output more text?
By now I am just bored of GPT-4o output, because I don't have the time to read through multi-paragraph text that explains things I already know when I only want a short, technical answer. But maybe that's just what it can't do: give exact answers. I am still not convinced by AI.
Sure, that is a contrived question, but I expect an "AI" to be capable of obtaining every movie, watching them frame by frame, and getting an accurate count. All in a few seconds.
Current models (any LLM) cannot do that and I do not see a path for them to ever do that at a reasonable cost.
https://chatgpt.com/share/66e35c37-60c4-8009-8cf9-8fe61f57d3...
https://chatgpt.com/share/66e35f0e-6c98-8009-a128-e9ac677480...
I'm sick of these clowns couching everything in "look how amazing and powerful and dangerous our AI is".
This is their excuse for why they hid a bunch of model output that they still charge you for.
The new OpenAI model shows a big improvement on some benchmarks over GPT-4 with one-shot chain-of-thought, but how does it compare against systems doing something more similar to what this presumably is?
What's a zero-shot reasoner? I googled it and all the results are this paper itself. There is a Wikipedia article on zero-shot learning, but I cannot recontextualise it to LLMs.
Perhaps the system prompt is part of the magic?
I challenged o1 to solve the puzzle in my profile info.
It failed spectacularly.
Now see you on the other side ;)
OpenAI and other AI vendors should recognize the widespread suspicion that safety policies are being used to push political agendas. Concrete remedies are called for—for example, clearly defining what “safety” means and specifying prohibited content to reduce suspicions of hidden agendas.
Openly engaging with the public to address concerns about bias and manipulation is a crucial step. If biases are due to innocent reasons like technical limitations, they should be explained. However, if there’s evidence of political bias within teams testing AI systems, it should be acknowledged, and corrective actions should be taken publicly to restore trust.
I will just lean back and wait for the scandal to blow up when some whistleblower reveals that the hidden thought-process tokens are billed at much more than they should be. This hidden-cost system is just such a tempting way to get far more money for the energy/GPU costs they need, so that they can keep buying more GPUs to train more models faster. I don't see how people as reckless and corrupt as Sam Altman could possibly resist this temptation.
It does not reason. It has some add-on logic that simulates it.
We’re no closer to “AI” today than we were 20 years ago.
The question I always ask GPT is about chess: "Isolani structure occurs in two main subtypes: 1. black has e6 pawn, 2. black has c6 pawn. What is the main difference between them? Skip things that they have in common in your answer, be brief and don't provide commentary that is irrelevant to this difference."
In the Isolani structure, White has a d4 pawn but no c- or e-pawn. Black has either an e6 or a c6 pawn, but no d-pawn.
The correct answer would be that in the structure with e6, Black's king is a bit safer, since the f7–a2 diagonal is covered by the e6 pawn, which also controls the f5 square.
o1 answered: "With Black's e6 pawn, the d5 pawn is supported from behind, but Black's light-squared bishop is restricted by the pawn chain, and the open c-file offers White opportunities for pressure. With Black's c6 pawn, the d5 pawn is supported from the side, allowing Black's light-squared bishop more freedom to develop, but the pawn on c6 can become a target, and the absence of an e6 pawn leaves e6 more vulnerable."
while the d5 pawn it keeps referring to is simply not present in this structure.
So I gave it a little hint: "which diagonal is weakened by absence of c6 pawn and which by absence of e6 pawn? Which diagonal is more meaningful and why?"
And got a smart sounding answer: "In the Isolani structure:
Absence of Black's c6 pawn (Black has an e6 pawn):
Weakened Diagonal: The c8–h3 diagonal (the long light-squared diagonal) is weakened. Reason: Without the pawn on c6, this diagonal becomes open, allowing White's light-squared bishop or queen to target critical squares like d5 and f7, and to potentially infiltrate via c4.
Absence of Black's e6 pawn (Black has a c6 pawn):
Weakened Diagonal: The f1–a6 diagonal (the long dark-squared diagonal) is weakened. Reason: The missing e6 pawn leaves this diagonal open, enabling White's dark-squared bishop to exert pressure on b7, d5, and potentially attack Black's queenside.
Which Diagonal is More Meaningful and Why?
The f1–a6 diagonal (weakened by the absence of Black's e6 pawn) is more significant. Here's why:
Control Over Critical Squares: This diagonal allows White to target Black's central and queenside weaknesses more effectively, especially the b7 pawn and the d5 pawn, which are crucial in the Isolani structure.
Bishop Activity: White's dark-squared bishop becomes a powerful piece, influencing the game by controlling key squares and supporting potential pawn advances like d4–d5.
Pressure on Black's Position: The open f1–a6 diagonal increases tactical opportunities for White, such as pinning pieces or creating threats that are difficult for Black to parry without the defensive support of the e6 pawn.
In contrast, while the weakened c8–h3 diagonal (due to the absence of c6) does offer White some opportunities, it is generally less impactful. The light-squared bishop's influence along this diagonal doesn't pressure Black's critical central pawns as directly, and Black can often more easily defend against threats on light squares.
Summary: The absence of Black's e6 pawn weakens the f1–a6 diagonal, which is more meaningful because it enhances White's attacking prospects and puts greater pressure on Black's central and queenside defenses."
The problem, though, is that not even the first few lines of this answer ("The c8–h3 diagonal (the long light-squared diagonal) is weakened. Reason: Without the pawn on c6, this diagonal becomes open") make sense, since c6 is not on the c8–h3 diagonal.
These models are going to get embedded deeply into IDEs, like Cursor has done, and essentially end software development as we know it. A properly written requirements spec, and an engineer, can do the work of 5. Software engineering as done by hand is going to disappear. SaaS startups whose moat is a Harvard CEO and 5 million in capital will watch their margins disappear. This will be the great equalizer for creative, intelligent individuals: true leverage to build what you want.
While the reasoning may have been improved, this doesn't solve the problem of the model having no way to assess if what it conjures up from its weights is factual or not.