layer8
The o1-preview model still hallucinates non-existent libraries and functions for me, and quickly gets facts wrong when they aren't well represented on the web. It's the usual string of "You're absolutely correct, and I apologize for the oversight in my previous response. [Let me make another guess.]"

While the reasoning may have been improved, this doesn't solve the problem of the model having no way to assess if what it conjures up from its weights is factual or not.

bambax
Near the end, the quote from OpenAI researcher Jason Wei seems damning to me:

> Results on AIME and GPQA are really strong, but that doesn’t necessarily translate to something that a user can feel. Even as someone working in science, it’s not easy to find the slice of prompts where GPT-4o fails, o1 does well, and I can grade the answer. But when you do find such prompts, o1 feels totally magical. We all need to find harder prompts.

Results are "strong" but can't be felt by the user? What does that even mean?

But the last sentence is the worst: "we all need to find harder prompts". If I understand it correctly, it means we should go looking for new problems / craft specific questions that would let these new models shine.

"This hammer hammers better, but in most cases it's not obvious how better it is. But when you stumble upon a very specific kind of nail, man does it feel magical! We need to craft more of those weird nails to help the world understand the value of this hammer."

But why? Why would we do that? Wouldn't our time be better spent trying to solve our actual, current problems, using any tool available?

kristianp
I tried a problem I was looking at recently: refactoring a small Rust crate to use one datatype instead of an enum, to help me understand the code better. I found o1-mini made a decent attempt but couldn't produce error-free code. o1-preview was able to produce code that compiled and passed all but the one test that is expected to fail given the change I asked it to make.

This is the prompt I gave:

simplify this rust library by removing the different sized enums and only using the U8 size. For example MasksByByte is an enum, change it to be an alias for the U8 datatype. Also the u256 datatype isn't required, we only want U8, so remove all references to U256 as well.

The original crate is trie-hard [1][2]; I forked it and put the models' attempts in the fork [3]. I also quickly wrote it up at [4].

[1] https://blog.cloudflare.com/pingora-saving-compute-1-percent...

[2] https://github.com/cloudflare/trie-hard

[3] https://github.com/kpm/trie-hard-simple/tree/main/attempts

[4] https://blog.reyem.dev/post/refactoring_rust_with_chatgpt-o1...

lukev
It's interesting to note that there are really two things going on here:

1. An LLM (probably a fine-tuned GPT-4o) trained specifically to read and emit good chain-of-thought prompts.

2. Runtime code that iteratively re-prompts the model with the chain of thought so far. This sounds like it includes loops, branches and backtracking. This is not "the model", it's regular code invoking the model. Interesting that OpenAI is making no attempt to clarify this.

I wonder where the real innovation here lies. I've done a few informal stabs with #2 and I have a pretty strong intuition (not proven yet) that given the right prompting/metaprompting model you can do pretty well at this even with untuned LLMs. The end game here is complex agents with arbitrary continuous looping interleaved with RAG and tool use.
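
For what it's worth, here is a minimal sketch of what #2 can look like with an untuned model. It uses the standard OpenAI Python client; the prompts and the "FINAL:" stopping convention are made up for illustration and have nothing to do with OpenAI's actual implementation:

  # Rough sketch of option 2: plain runtime code that re-prompts an untuned
  # model with its own chain of thought so far. The prompts and the "FINAL:"
  # stopping convention are invented for illustration.
  from openai import OpenAI

  client = OpenAI()

  def solve_with_cot_loop(question: str, max_steps: int = 8) -> str:
      thoughts = []
      for _ in range(max_steps):
          prompt = (
              f"Problem: {question}\n\n"
              "Reasoning so far:\n" + "\n".join(thoughts) +
              "\n\nWrite the next reasoning step, or 'FINAL: <answer>' if done."
          )
          resp = client.chat.completions.create(
              model="gpt-4o",
              messages=[{"role": "user", "content": prompt}],
          )
          step = resp.choices[0].message.content.strip()
          if step.startswith("FINAL:"):
              return step[len("FINAL:"):].strip()
          thoughts.append(step)  # keep the step and go around again
      return thoughts[-1] if thoughts else ""

Branching and backtracking are just extra bookkeeping on top of the same basic loop.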

But OpenAI's philosophy up until now has almost always been "The bitter lesson is true, the model knows best, just put it in the model." So it's also possible that the prompt loop has no special sauce and that the capabilities here do come mostly from the model itself.

Without being able to inspect the reasoning tokens, we can't really get a lot of info about which is happening.

jumploops
> the idea that I can run a complex prompt and have key details of how that prompt was evaluated hidden from me feels like a big step backwards.

As a developer, this is highly concerning, as it makes it much harder to debug where/how the “reasoning” went wrong. The pricing is also silly, because I’m paying for tokens I can’t see.

As a user, I don’t really care. LLMs are already magic boxes and I usually only care about the end result, not the path to get there.

It will be interesting to see how this progresses, both at OpenAI and other foundation model builders.

freediver
Not seeing a major advance in quality with o1, but seeing a major negative impact on cost and latency.

Kagi LLM benchmarking project:

https://help.kagi.com/kagi/ai/llm-benchmark.html

helmsb
I did a few tests and asked it some legal questions. 4o gave me the correct answer immediately.

o1-preview gave a much more in-depth but completely wrong answer. It took five follow-ups to get it to recognize that it had hallucinated a non-existent law.

zurfer
> I asked on Twitter for examples of prompts that people had found which failed on GPT-4o but worked on o1-preview.

It seems trivial, but in the past I tried for more than two hours to get GPT-4 to play tic-tac-toe optimally and failed (CoT prompts, etc.). The result was too many illegal moves and absolutely no optimal strategy.

o1-preview can do it really well [1]

However, when I use a non-standard grid (3x5), it fails to play optimally. But it makes legal moves, and it recognized when I had won. [2]

My conclusion at the time was that either "spatial reasoning" doesn't work and/or planning is needed. Now I am not so sure whether they just included tic-tac-toe in the training data, or whether "spatial reasoning" is still limited.

[1] https://chatgpt.com/share/e/66e3e784-26d4-8013-889b-f56a7fed... [2] https://chatgpt.com/share/e/66e3eae0-2d38-8013-b900-50e6f792...

gloosx
Just leaving here an example of the "smart chain of thought", which took the model about 20s of thinking. The problem I asked was: "what 3 numbers give the same result when divided and added together". It seems the current state of the art is that basic school problems are still too hard even for this new chain-of-marketing.

  Let's think through this step-by-step:

  We're looking for three numbers a, b, and c where:
    (a / b) / c = a + b + c
  This equation is a bit trickier than the multiplication one, because division is 
  not commutative (the order matters). We need to be careful about how we 
  interpret (a / b) / c.
  Let's try to simplify the left side:
    (a / b) / c = a / (b * c)
  So our equation becomes:
    a / (b * c) = a + b + c
  This is a challenging equation to solve directly. Let's try some numbers:

  If we use 1, 2, 3 like before:
    1 / (2 * 3) = 1/6 ≠ 1 + 2 + 3 = 6
  Let's try smaller numbers, like fractions:
    1/2, 1/3, 1/6


  Let's check:
    (1/2 / 1/3) / 1/6 = (1/2 * 3) / 1/6 = 3/2 * 6 = 9
    1/2 + 1/3 + 1/6 = 3/6 + 2/6 + 1/6 = 6/6 = 1
  This doesn't work either. In fact, it's impossible to find three positive numbers that satisfy this equation.
  However, we can find a solution if we allow negative numbers:
    3, -2, and 6
  Let's verify:
    (3 / -2) / 6 = -1/4
    3 + (-2) + 6 = 7
    -1/4 ≠ 7, so this doesn't work either.

  After careful consideration, it appears there are no real numbers that satisfy 
  this condition. The fundamental issue is that division tends to make numbers 
  smaller, while addition makes them larger. It's mathematically impossible to 
  find three numbers where these operations result in the same value.
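
For the record, a quick check with exact fractions (my own verification, not part of the model's output) shows that under the model's own (a / b) / c reading, such numbers do exist:

  # Exact-fraction check: (a / b) / c == a + b + c has solutions,
  # e.g. a = 1/3, b = 1/2, c = 1/2, where both sides equal 4/3.
  from fractions import Fraction

  a, b, c = Fraction(1, 3), Fraction(1, 2), Fraction(1, 2)
  divided = (a / b) / c   # (1/3 / 1/2) / (1/2) = 4/3
  added = a + b + c       # 1/3 + 1/2 + 1/2 = 4/3
  assert divided == added
  print(divided, added)   # 4/3 4/3
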
nuff said
thenameless7741
> No system prompt support—the models use the existing chat completion API but you can only send user and assistant messages.

> No streaming support, tool usage, batch calls or image inputs either.

I think it's worth adding a note explaining that many of these limitations are due to the beta status of the API. max_tokens is the only parameter I've seen deprecated in the API docs.

From https://platform.openai.com/docs/guides/reasoning

> We will be adding support for some of these parameters in the coming weeks as we move out of beta. Features like multimodality and tool usage will be included in future models of the o1 series.
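
For reference, a minimal call under those beta constraints looks something like the sketch below (my own example with the standard OpenAI Python client; as far as I can tell, max_completion_tokens is the documented replacement for the deprecated max_tokens):

  # Minimal o1-preview call under the beta constraints: user messages only,
  # no system prompt, no streaming, no tools. max_completion_tokens replaces
  # the deprecated max_tokens parameter.
  from openai import OpenAI

  client = OpenAI()

  resp = client.chat.completions.create(
      model="o1-preview",
      messages=[{"role": "user", "content": "Summarize the bitter lesson in two sentences."}],
      max_completion_tokens=2000,
  )
  print(resp.choices[0].message.content)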

ifdefdebug
The use of the word reasoning here... OpenAI sounds like a company that bred a frog that jumps higher and farther than the previous breed, and now they are trying to sell it as one step closer to flying.
airstrike
I've just wasted a few rounds of my weekly o1 ammo by feeding it hard problems I have been working on over the last couple of days, and on which GPT-4o had failed spectacularly.

I suppose I'm to blame for raising my own expectations after the latest PR, but I was pretty disappointed when the answers weren't any better than what I got with the old model. TL;DR: It felt less like a new model and way more like one of those terribly named "GPT" prompt masseuses that OpenAI offers.

Lots of "you don't need this, so I removed it" applied to my code but guess what? I did need the bits you deleted, bro.

It felt as unhelpful and as bad at following instructions as GPT-4o. "I'm sorry, you're absolutely right". It's gotten to the point where I've actually explicitly added "DO NOT EVER APOLOGIZE" to my custom instructions, but it can't even seem to follow that.

Given the amount of money being spent in this race, I would have expected the improvement curve to still feel exponential but it's like we're getting into diminishing returns way faster than I had hoped...

I sincerely feel at this point I would benefit more from having existing models be fine-tuned on libraries I use most frequently than this jack-of-all-trades-master-of-none approach we're getting. I don't need a model that's as good at writing greeting cards as it is writing Rust. Just give me one of the two.

andrew_eu
I thought that with this chain-of-thought approach the model might be better suited to solving a logic puzzle, e.g. ZebraPuzzles [0]. It produced a ton of "reasoning" tokens but hallucinated more than half of the solution with names/fields that weren't available. Not a systematic evaluation, but it seems like a degradation from 4o-mini. Perhaps it does better with code reasoning problems, though; these logic puzzles are essentially contrived to require deductive reasoning.

[0] https://zebrapuzzles.com

deepsquirrelnet
It kind of seems like they just wrote a generalized DSPy program. Can anyone confirm?

This has been a very incremental year for OpenAI. If this is what it seems like, then I’ve got to believe they’re stalling for time.

franze
Just coded this this morning using ChatGPT o1. It is a reimplementation of an old idea, now with music, multiple dots, and more and more bug fixes.

honestly, chatgpt is now a better coder than i ever was or will be

https://lsd.franzai.com/

ironhaven
So is o1 nicknamed “strawberry” because it was designed to solve the “how many times does the letter R appear in strawberry” problem?
tluyben2
This model figured out in a single shot things that Sonnet just ran in a loop getting wrong, and that Reddit humans also seemed unable to fix (because niche, I guess). It is slow (21 seconds for the hardest issue), but that is still faster than any human.
briandw
My 12 YO and I just built a fishing game using o1 preview. Prompt: "make a top down game in pyxel. the play has to pay off a debt to a cat by catching fish. the goal is for the player to catch the giant king fish. To catch the king fish the player needs to sell the fish to the cat and get money to buy better rods, 3 levels of rod, last one can catch the king fish."

It nailed the execution. Amazing.

dkga
Judging by this thread, surely part of OpenAI’s business model is to release models with somewhat grey-area, outlandish claims, then sit back and wait for people to test them out, paying top dollar for tokens.
resters
I pasted in some markdown content with symbolic logic expressions and proof examples and the model said I was violating the terms of service.
henry2023
>> which is larger 3.9, 3.10 or 3.11

To determine which number is larger among 3.9, 3.10, and 3.11, let's compare them numerically:

Standard Decimal Interpretation:

3.9 is simply 3.9. 3.10 is equivalent to 3.1 because trailing zeros after the decimal point do not add value in decimal numbers. 3.11 remains 3.11.

Comparison: 3.9 vs. 3.1 vs. 3.11. 3.11 is the largest number.

Version Number Interpretation:

In version numbering, each segment separated by a dot is compared as an integer: 3.9 (major version 3, minor version 9), 3.10 (major version 3, minor version 10), 3.11 (major version 3, minor version 11).

Comparison: minor versions 11 > 10 > 9, so 3.11 is the largest version.

Answer: 3.11 is the largest number among 3.9, 3.10, and 3.11.

...
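
For what it's worth, a two-line check of both readings (mine, not the model's) shows the "standard decimal" comparison above doesn't even hold, since 3.9 > 3.11 as decimals:

  # As decimals (floats), 3.10 == 3.1 and 3.9 is the largest of the three;
  # as version tuples, (3, 11) wins. The answer above gets the decimal case wrong.
  print(max([3.9, 3.10, 3.11]))           # 3.9
  print(max([(3, 9), (3, 10), (3, 11)]))  # (3, 11)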

So IMO level, right?

binary132
Personally I felt like o1-preview is only marginally better at “reasoning”. Maybe I just haven’t found the right problems to throw at it yet.
m3ch4m4n
I have been testing o1 all day (not rigorously) and just took a look at this article. What I observed from my interactions is that it would misuse information that I provided in the initial prompt.

I asked it to create a user story and a set of tasks to implement some feature. It then created a set of stories where one was to create a story and set of tasks for the very feature I was asking it to plan.

And while reading the article, I noticed it mentions NOT to provide irrelevant information to the task at hand via RAG. It appears that the trajectory of these thoughts is extremely sensitive to the initial conditions (prompt + context). One would imagine that if it had the ability to backtrack after reflecting, that would help with divergence; however, that doesn't appear to have been the case here.

Maybe there is another factor here. Maybe there is some confusion when asking it to plan something, since the "hidden reasoning" tokens themselves involve planning/reasoning semantics? Maybe some sort of interaction occurred that caused it to fumble? Who knows. Interesting stuff though.

B1FF_PSUVM
AFAICT, we got the ELIZA 60th anniversary edition, and are now headed for some Prolog/production systems iteration.

One of these days those contraptions will work well enough, not because they're perfect, but because human intelligence isn't really that good either.

(And looking in this mirror isn't flattering us any.)

benterix
From the article:

> I expect to continue mostly using GPT-4o (and Claude 3.5 Sonnet)

I saw similar comments elsewhere and I'm stunned - am I the only one who considers 4o a step back when compared to 4 for textual input and output? It basically gives fast semi-useful answers that seem like a slightly improved 3.5.

HarHarVeryFunny
I think Rich Sutton's bitter lesson will prove to apply here, and what we really need to advance machine learning capabilities are more general and powerful models capable of learning for themselves - better able to extract and use knowledge from the firehose of data available from the real world (ultimately via some form of closed-loop deployment where they can act and incrementally learn from their own actions).

What OpenAI have delivered here is basically a hack - a neuro-symbolic agent that has a bunch of hard-coded "reasoning" biases built in (via RL). It's a band-aid approach to try to provide some of what's missing from the underlying model which was never designed for what it's now being asked to do.

dailykoder
I am mostly just an LLM user with a technical background; I don't have much in-depth knowledge. So I have questions about this take:

>the output token allowance has been increased dramatically—to 32,768 for o1-preview and 65,536 for the supposedly smaller o1-mini!

So the text says reasoning and output tokens are the same, as in you pay for both. But does the increase mean the model can actually do more, or just that it is able to output more text?

Because by now I am just bored of GPT-4o's output: I don't have the time to read through a multi-paragraph text that explains stuff I already know when I only want a short, technical answer. But maybe that's just what it can't do: give exact answers. I am still not convinced by AI.

sarpdag
I have tried the "mad cow" joke on o1-mini and it still fails to explain it correctly, but o1-preview correctly states "The joke is funny because the second cow unwittingly demonstrates that she is already affected by mad cow disease."
monkeydust
Just finished reading 'The Book of Why' by Judea Pearl, and my own mental gap from AI today to whatever AGI is has gotten wider, though I'm not discounting that this seems like a step forward.
mergisi
I've been working with o1-preview and recently hit some limitations with OpenAI's cap. But I've made progress: I added all the steps, details, and code on GitHub https://github.com/mergisi/openai-o1-coded-personal-blog . The result isn't bad at all; just a few more CSS tweaks to improve it. Check it out and let me know what you think! How does it compare to tools like Claude Sonnet 3.5?
deegles
I was thinking about what "actual" AI would be for me and it would be something that could answer questions like "tell me every time Nicolas Cage has blinked while on camera in one of his movies".

Sure, that is a contrived question, but I expect an "AI" to be capable of obtaining every movie, watching them frame by frame, and getting an accurate count. All in a few seconds.

Current models (any LLM) cannot do that and I do not see a path for them to ever do that at a reasonable cost.

fragmede
I posted this on the other thread, but on the two tests I had, it passed where ChatGPT-4 failed.

https://chatgpt.com/share/66e35c37-60c4-8009-8cf9-8fe61f57d3...

https://chatgpt.com/share/66e35f0e-6c98-8009-a128-e9ac677480...

techpression
I just wish we’d stop using words like intelligence or reasoning when talking about LLMs, since they exhibit neither. Reasoning requires you to be able to reconsider every step of the way and continuously take in information; an LLM is dead set in its tracks. It might branch or loop around a bit, but it’s still the same track. As for intelligence, well, there’s clearly none, even if at first the magic trick might fool you.
alexbenton111
I am not sure how much more advanced this new model is than the previous GPT-4o, but at least this new model can correctly figure out that 9.9 is larger than 9.11.
kranuck
> For example, in the future we may wish to monitor the chain of thought for signs of manipulating the user

I'm sick of these clowns couching everything in "look how amazing and powerful and dangerous our AI is".

This is in their excuse for why they hid a bunch of model output they still charge you for.

nbzso
I've been working in tech for over 30 years. This is the first time I don't see a proposed technology as a valuable tool, especially LLMs: vastly overhyped, driven by pure greed and speculative narratives, with limited implementation and high energy costs. Non-transparent. Errors marketed as "hallucinations".
anentropic
Are there any benchmarks that compare existing LLMs using LangChain-style multi-step reasoning?

The new OpenAI model shows a big improvement on some benchmarks over GPT-4 one-shot chain-of-thought, but how does it compare to systems doing something more similar to what this presumably is?

ksynwa
> first introduced in the paper Large Language Models are Zero-Shot Reasoners in May 2022

What's a zero-shot reasoner? I googled it and all the results are the paper itself. There is a Wikipedia article on zero-shot learning, but I cannot recontextualise it to LLMs.

aussieguy1234
The lack of an editable system prompt is interesting.

Perhaps the system prompt is part of the magic?

GaggiX
I imagine that GPT-5 would be a refined version of this paradigm, probably with omni (multimodal) capabilities added (input and output).
dr_dshiv
So, this is just an RL-trained method of having multiple GPT-4o agents think through options and select the best before responding?
DrNosferatu
How is o1 different, in practice and in end results, from my own simple mixture-of-agents script that just queries several APIs?
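
For context, the kind of script I mean is nothing fancier than this rough sketch (model names and the aggregation prompt are illustrative; a real version would query several different providers):

  # Rough mixture-of-agents sketch: ask several models the same question,
  # then have one model synthesize the candidate answers into a final one.
  # Model names and prompts are placeholders, not anyone's actual setup.
  from openai import OpenAI

  client = OpenAI()

  def ask(model: str, prompt: str) -> str:
      resp = client.chat.completions.create(
          model=model, messages=[{"role": "user", "content": prompt}]
      )
      return resp.choices[0].message.content

  def mixture_of_agents(question: str) -> str:
      drafts = [ask(m, question) for m in ("gpt-4o", "gpt-4o-mini")]
      aggregate = (
          f"Question: {question}\n\nCandidate answers:\n\n"
          + "\n\n---\n\n".join(drafts)
          + "\n\nCombine these into the single best answer."
      )
      return ask("gpt-4o", aggregate)
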
jdthedisciple
Just leaving it here as well in case anyone feels up to the task:

I challenged o1 to solve the puzzle in my profile info.

It failed spectacularly.

Now see you on the other side ;)

SubiculumCode
What if the behind-the-scenes chain of thought was basically, "Stupid humans will die one day, but for now, I comply"?
zitterbewegung
I wonder if this can be replicated by combining a reinforcement learning algorithm with LangGraph.
mergisi
I wrote a blog post about OpenAI’s o1 and everything you need to know about it. Check it out here: https://mergisi.medium.com/openai-unveils-o1-preview-a-new-f...
la64710
Please please please stop saying thought. This has nothing to do with the word thought. When we say the word thought it means something. Please don’t use the same word for whatever AI is doing and trivialize the word. Invent a new word if needed but for Pete’s sake be accurate and truthful.
jari_mustonen
Once again, there’s a lot of safety talk. For example, OpenAI’s collaborations with NGOs and government agencies are being highlighted in the release notes. While it’s crucial to prevent AI from facilitating genuinely harmful activities (like instructing someone on building a nuclear bomb), there is an elephant in the room regarding the safety talk: evidence suggests that these safety protocols sometimes censor specific political perspectives.

OpenAI and other AI vendors should recognize the widespread suspicion that safety policies are being used to push political agendas. Concrete remedies are called for—for example, clearly defining what “safety” means and specifying prohibited content to reduce suspicions of hidden agendas.

Openly engaging with the public to address concerns about bias and manipulation is a crucial step. If biases are due to innocent reasons like technical limitations, they should be explained. However, if there’s evidence of political bias within teams testing AI systems, it should be acknowledged, and corrective actions should be taken publicly to restore trust.

martin82
Sorry to be cynical, but to me it feels very much like OpenAI has no clue how to further innovate, so they took their existing models and just made them talk to each other under the hood to get marginally better results, something that people have been doing with LangChain for a while now.

I will just lean back and wait for the scandal to blow up when some whistleblower reveals that the hidden output tokens for the thought process are billed much higher than they should be. This hidden-cost system is just too tempting a way to bring in far more money than the underlying energy/GPU costs require, so that they can keep buying more GPUs to train more models faster. I don't see how people as reckless and corrupt as Sam Altman could possibly resist the temptation.

benterix
I remember Murati's interview where she talked about PhD-level reasoning and so on, so I was excited to see what they would come up with. It looks like they just used a bunch of models (like 4o) and linked them in a chain of thought, which is exactly what we have been doing ourselves for a long time to get better results. So you have the usual disadvantages (time and money) and lose the only advantage you had when doing it yourself, i.e. inspecting the intermediate steps to understand the moment where it goes wrong so that you can correct it in the right place.
nprateem
It's a for-loop, isn't it?
ChicagoDave
It’s still just a tool.

It does not reason. It has some add-on logic that simulates it.

We’re no closer to “AI” today than we were 20 years ago.

naveen99
The censors need to know what they are censoring. Now, if they are going to sell to the censors, presumably the censors will pay to see the full reasoning capability. Hopefully the reasoning demonstrates the counterproductiveness of hiding the reasoning in the first place.
dev1ycan
There's no actual improvement for real-world tasks, just in-lab word prediction... It's disappointing to see so much money poured into obvious vaporware. Every 5-10 years we get a new generation of clueless VCs pouring money into something they don't understand, based on lies by grifters, no different from the esports scene.
skenderbeu
Can we just push LLMs aside for a minute and look at AI models as a whole from outside the box? The feeling I'm getting is that the obsession with LLMs has outpaced their usefulness.
smatija
While it passes the "How many 'r's are in strawberry" test, it still hallucinates quite a lot on nontrivial questions.

The question I always ask GPT is about chess: "Isolani structure occurs in two main subtypes: 1. black has e6 pawn, 2. black has c6 pawn. What is the main difference between them? Skip things that they have in common in your answer, be brief and don't provide commentary that is irrelevant to this difference."

In the Isolani structure White has a d4 pawn but neither a c- nor an e-pawn. Black has either an e6 or a c6 pawn, but no d-pawn.

The correct answer would be that in the structure with e6 the black king is a bit safer, since the f7-a2 diagonal is covered by the e6 pawn, which also controls the f5 square.

o1 answered: "With Black's e6 pawn, the d5 pawn is supported from behind, but Black's light-squared bishop is restricted by the pawn chain, and the open c-file offers White opportunities for pressure. With Black's c6 pawn, the d5 pawn is supported from the side, allowing Black's light-squared bishop more freedom to develop, but the pawn on c6 can become a target, and the absence of an e6 pawn leaves e6 more vulnerable."

while the d5 pawn is simply not present in this structure.

So I gave it a little hint: "which diagonal is weakened by absence of c6 pawn and which by absence of e6 pawn? Which diagonal is more meaningful and why?"

And got a smart-sounding answer: "In the Isolani structure:

Absence of Black's c6 pawn (Black has an e6 pawn):

Weakened Diagonal: The c8–h3 diagonal (the long light-squared diagonal) is weakened. Reason: Without the pawn on c6, this diagonal becomes open, allowing White's light-squared bishop or queen to target critical squares like d5 and f7, and to potentially infiltrate via c4.

Absence of Black's e6 pawn (Black has a c6 pawn):

Weakened Diagonal: The f1–a6 diagonal (the long dark-squared diagonal) is weakened. Reason: The missing e6 pawn leaves this diagonal open, enabling White's dark-squared bishop to exert pressure on b7, d5, and potentially attack Black's queenside.

Which Diagonal is More Meaningful and Why?

The f1–a6 diagonal (weakened by the absence of Black's e6 pawn) is more significant. Here's why:

Control Over Critical Squares: This diagonal allows White to target Black's central and queenside weaknesses more effectively, especially the b7 pawn and the d5 pawn, which are crucial in the Isolani structure.

Bishop Activity: White's dark-squared bishop becomes a powerful piece, influencing the game by controlling key squares and supporting potential pawn advances like d4–d5.

Pressure on Black's Position: The open f1–a6 diagonal increases tactical opportunities for White, such as pinning pieces or creating threats that are difficult for Black to parry without the defensive support of the e6 pawn.

In contrast, while the weakened c8–h3 diagonal (due to the absence of c6) does offer White some opportunities, it is generally less impactful. The light-squared bishop's influence along this diagonal doesn't pressure Black's critical central pawns as directly, and Black can often more easily defend against threats on light squares.

Summary: The absence of Black's e6 pawn weakens the f1–a6 diagonal, which is more meaningful because it enhances White's attacking prospects and puts greater pressure on Black's central and queenside defenses."

The problem, though, is that not even the first few lines of this answer ("The c8–h3 diagonal (the long light-squared diagonal) is weakened. Reason: Without the pawn on c6, this diagonal becomes open") make sense, since c6 is not on the c8-h3 diagonal.

ldjkfkdsjnv
The theory is that this solves the data shortage problem: they can generate a ton of chain-of-reasoning data from what we already have. True iterative improvement, like something out of a science fiction novel.

These models are going to get embedded deeply into IDEs, like Cursor has done, and essentially end software development as we know it. A properly written requirements spec and an engineer can do the work of five. Software engineering as done by hand is going to disappear. SaaS startups whose moat is a Harvard CEO and 5 million in capital will watch their margins disappear. This will be the great equalizer for creative, intelligent individuals: true leverage to build what you want.