segmondy
This is not even remotely close, and very silly: a Chain-of-Thought in a loop.

Tree of Thoughts is a more sophisticated method; see https://arxiv.org/pdf/2305.10601

The clue we all had with OpenAI for a long time was that this was a search through a tree: they hired Noam Brown, and his past work all hinted towards that. Q* is obviously a search over a tree, like A*. So take something like CoT, build out a tree, and search across it for the best solution. The search is the "system-2 reasoning".
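
Purely as an illustration of that idea (not a claim about OpenAI's actual method), a best-first search over chains of thought would look roughly like the sketch below; generate_steps, value, and is_final are hypothetical stand-ins for an LLM call, a value model, and a termination check.

    import heapq

    def generate_steps(chain: str, k: int = 3) -> list[str]:
        """Ask an LLM for k candidate next reasoning steps (stub)."""
        raise NotImplementedError

    def value(chain: str) -> float:
        """Heuristic or learned estimate of how promising a partial chain is (stub)."""
        raise NotImplementedError

    def is_final(chain: str) -> bool:
        """True once the chain ends with a final answer (stub)."""
        raise NotImplementedError

    def best_first_cot_search(question: str, max_expansions: int = 50) -> str:
        # Best-first search over partial chains of thought: repeatedly expand the
        # most promising chain with candidate next steps until one reaches an answer.
        frontier = [(-value(question), question)]
        best = question
        for _ in range(max_expansions):
            if not frontier:
                break
            _, chain = heapq.heappop(frontier)
            best = chain
            if is_final(chain):
                return chain
            for step in generate_steps(chain):
                extended = chain + "\n" + step
                heapq.heappush(frontier, (-value(extended), extended))
        return best  # best effort if the search budget runs out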

sebzim4500
>In all-caps to improve prompt compliance by emphasizing the importance of the instruction

This kind of thing is still so funny to me.

I wonder if the first guy who gets AGI to work will do it by realizing that he can improve LLM reliability over some threshold by telling it in all caps that his pet's life depends on the answer.

thorum
o1’s innovation is not Chain-of-Thought. It’s teaching the model to do CoT well (from massive amounts of human feedback) instead of just pretending to. You’ll never get o1 performance just from prompt engineering.
GaggiX
This seems to be the usual CoT that has been used for a while. o1 was trained with reinforcement learning under some unknown policy, so it's much better at utilizing the chain of thought.
codelion
This is good. I worked on something similar in optillm - https://github.com/codelion/optillm. You can do this with any LLM and several optimization techniques (including cot_reflection), such as mcts, plansearch, moa, etc.
zby
I am always looking for definitions of "reasoning". My theory is that if we find a good definition, it will turn out that we can build systems that combine fuzzy LLM thinking with classical algorithms to solve "reasoning".

All the problems with LLMs not reasoning (like planning, counting letters, or deductive inference) are easy for classical algorithms. There needs to be a way to split the thinking process into two parts and then execute each part on the appropriate model.
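
A toy sketch of that split, purely illustrative (the routing regex and ask_llm are made up): exact sub-tasks like letter counting go to ordinary code, while everything fuzzy falls through to the LLM.

    import re

    def count_letter(word: str, letter: str) -> int:
        # Classical algorithm: exact and trivial, yet something LLMs often get wrong.
        return word.lower().count(letter.lower())

    def ask_llm(question: str) -> str:
        """Placeholder for the fuzzy LLM path."""
        raise NotImplementedError

    def answer(question: str) -> str:
        m = re.match(r"how many (\w)'?s? are in (\w+)\??$", question, re.IGNORECASE)
        if m:
            letter, word = m.groups()
            return f"There are {count_letter(word, letter)} '{letter}'s in '{word}'."
        return ask_llm(question)

    print(answer("How many Rs are in strawberry?"))  # -> There are 3 'R's in 'strawberry'.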

punnerud
I changed it to run 100% locally with an 8B model via Ollama: https://github.com/punnerud/g1

Haven't updated the README yet.

pseudotensor
This is related: https://www.reddit.com/r/LocalLLaMA/comments/1fiw84a/open_st...

The idea is not silly in my view; I did something similar here: https://github.com/pseudotensor/open-strawberry

The idea is that data generation is required first, to make the reasoning traces. ToT etc. are not required.

dangoodmanUT
> Prompt: Which is larger, .9 or .11?

> Result: .9 is larger than .11

we've broken the semver barrier!

ed
FYI this is just a system prompt and not a fine-tuned model
londons_explore
> This alone, without any training, is sufficient to achieve ~70% accuracy on the Strawberry problem (n=10, "How many Rs are in strawberry?"). Without prompting, Llama-3.1-70b had 0% accuracy and ChatGPT-4o had 30% accuracy.

I think this class of problem might be better solved by allowing the LLM to 'zoom in' and view the input differently. Rather like you might peer closer for more detail if someone asked you about the print quality of something you were reading.

'Zoom in' could feed in the same text letter by letter, or even in image form (rasterize the text), to help answer questions like "How many letters in the word strawberry contain straight lines?"
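
A hypothetical sketch of the text version of such a "zoom in" tool (not from any of the linked repos): re-present a word one character at a time so a tokenizer-blind question like "how many Rs?" becomes trivial.

    def zoom_in(word: str) -> str:
        # Spell the word out with explicit indices, one character per line.
        return "\n".join(f"{i + 1}: {ch}" for i, ch in enumerate(word))

    print(zoom_in("strawberry"))
    # 1: s
    # 2: t
    # 3: r
    # ...
    # The expanded form would then be appended to the prompt before asking the
    # character-level question.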

esoltys
For fun I forked the project to run Llama-3.1 8B or other models using Ollama locally. It doesn't get strawberry right, but it can figure out that 0.9 is bigger.

https://github.com/esoltys/o1lama

bofadeez
Not going to work - https://arxiv.org/abs/2310.01798
bofadeez
You can reproduce both of those responses zero-shot on 70B with "Let's verify step by step" appended at the end.
a-dub
So is this o1 thing just CoT (like what has been around for a few years), but baked into the training transcripts, RLHF, and the inference pipeline?
asah
Benchmark results?
zozbot234
How does this benchmark against Reflection, which was fine-tuned to do the same thing: provide a detailed Chain of Thought with self-corrections, then write out a final answer?
arnaudsm
The latency of Groq is impressive, much better than o1!

Did you benchmark your system against MMLU-pro?

lobochrome
So it’s the ASIC Groq guys, right?

Because it doesn't say so anywhere in the repo.

Man, Elon makes things confusing.

michelsedgh
I love seeing stuff like this. I'm guessing it won't be long until this method becomes the norm.
4ad
This is the system prompt it uses:

    You are an expert AI assistant that explains your reasoning step by step. For each step, provide a title that describes what you're doing in that step, along with the content. Decide if you need another step or if you're ready to give the final answer. Respond in JSON format with 'title', 'content', and 'next_action' (either 'continue' or 'final_answer') keys. USE AS MANY REASONING STEPS AS POSSIBLE. AT LEAST 3. BE AWARE OF YOUR LIMITATIONS AS AN LLM AND WHAT YOU CAN AND CANNOT DO. IN YOUR REASONING, INCLUDE EXPLORATION OF ALTERNATIVE ANSWERS. CONSIDER YOU MAY BE WRONG, AND IF YOU ARE WRONG IN YOUR REASONING, WHERE IT WOULD BE. FULLY TEST ALL OTHER POSSIBILITIES. YOU CAN BE WRONG. WHEN YOU SAY YOU ARE RE-EXAMINING, ACTUALLY RE-EXAMINE, AND USE ANOTHER APPROACH TO DO SO. DO NOT JUST SAY YOU ARE RE-EXAMINING. USE AT LEAST 3 METHODS TO DERIVE THE ANSWER. USE BEST PRACTICES.
The Python crap around it is superfluous.
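
(If you do want to drive the prompt yourself, a minimal loop is all it takes. The sketch below assumes an OpenAI-compatible client; the model name and 25-step cap are arbitrary placeholders, not a claim about what g1 actually does.)

    import json
    from openai import OpenAI

    client = OpenAI()  # or point base_url at any OpenAI-compatible server (e.g. Groq)
    SYSTEM_PROMPT = "..."  # the prompt quoted above

    def reason(question: str, max_steps: int = 25) -> list[dict]:
        messages = [
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": question},
        ]
        steps = []
        for _ in range(max_steps):
            resp = client.chat.completions.create(
                model="llama-3.1-70b-versatile",  # placeholder model name
                messages=messages,
                response_format={"type": "json_object"},
            )
            # Each turn returns one JSON reasoning step with 'title', 'content',
            # and 'next_action'; stop once the model asks for the final answer.
            step = json.loads(resp.choices[0].message.content)
            steps.append(step)
            messages.append({"role": "assistant", "content": json.dumps(step)})
            if step.get("next_action") == "final_answer":
                break
        return steps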

Does it work? Well not really:

https://lluminous.chat/?sl=Yjkxpu

https://lluminous.chat/?sl=jooz48

I have also been using this prompt, and while it fails on the problem above, it works better for me than OP's prompt:

    Write many chains of thought for how you’d approach solving the user's question. In this scenario, more is more. You need to type out as many thoughts as possible, placing all your thoughts inside <thinking> tags. 
    Your thoughts are only visible to yourself, the user does not see them and they should not be considered to be part of the final response.
    Consider every possible angle, recheck your work at every step, and backtrack if needed.
    Remember, there are no limits in terms of how long you can think - more thinking will always lead to a better solution.
    You should use your thoughts as a scratchpad, much like humans do when performing complicated math with paper and pen. Don't omit any calculation, write everything out explicitly.
    When counting or maths is involved, write down an enormously verbose scratchpad containing the full calculation, count, or proof, making sure to LABEL every step of the calculation, and writing down the solution step by step.
    Always remember that if you find yourself consistently getting stuck, taking a step back and reconsidering your approach is a good idea. If multiple solutions are plausible, explore each one individually, and provide multiple answers.
    Always provide mathematical proofs of mathematical answers. Be as formal as possible and use LaTeX.
    Don't be afraid to give obvious answers. At the very very end, after pages upon pages of deep thoughts, synthesize the final answer, inside <answer> tags.
In particular it solves this problem: https://lluminous.chat/?sl=LkIWyS
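For completeness, a tiny hypothetical helper (not from any of the linked repos) for pulling the final answer out of that tag format:

    import re

    def extract_answer(completion: str) -> str:
        # The prompt puts thoughts in <thinking> tags and the result in <answer>
        # tags; keep the last <answer> block, falling back to the raw text if
        # the model forgot the tags.
        answers = re.findall(r"<answer>(.*?)</answer>", completion, re.DOTALL)
        return answers[-1].strip() if answers else completion.strip()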
tonetegeatinst
Grok 2 isn't as open as Grok 1, IIRC. Still hoping we at least get open weights.
aktuel
Let's just assume for a moment that the hype is real and that these LLMs are incredibly intelligent and will replace us all soon. Then the model shouldn't be any less intelligent if we remove facts like Uma Thurman's measurements and other vapid information. If the model already has the capability to use tools, then all of that crap is redundant anyway. And while we are at it, let's remove a ton of other junk, like languages I will never use, which also don't make the model any smarter. So how small can this kernel get while still being clearly intelligent, able to communicate flawlessly in English, and able to apply logical reasoning? That would be a worthwhile endeavor, and maybe even possible without boiling the oceans.