This kind of thing is still so funny to me.
I wonder if the first guy who gets AGI to work will do it by realizing that he can improve LLM reliability over some threshold by telling it in all caps that his pet's life depends on the answer.
All the problems where LLMs fail to reason (like planning, counting letters, or deductive inference) are easy for classical algorithms. There needs to be a way to split the thinking process into two parts and then execute each part on the appropriate system.
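As a toy example of that split: let the LLM handle language and routing, and hand the brittle sub-tasks to deterministic code. The router regex and the `ask_llm` stub below are purely illustrative, not any particular framework's API:

    import re

    def classical_letter_count(word: str, letter: str) -> int:
        # Counting is trivial for classical code, hard for tokenized LLMs.
        return word.lower().count(letter.lower())

    def ask_llm(question: str) -> str:
        raise NotImplementedError("plug in your model client here")

    def route(question: str):
        # Toy router: counting questions go to classical code,
        # everything else goes to the LLM.
        m = re.match(r"how many '(\w)'s? in '(\w+)'", question.lower())
        if m:
            letter, word = m.groups()
            return classical_letter_count(word, letter)
        return ask_llm(question)

    print(route("How many 'r's in 'strawberry'?"))  # -> 3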
Haven't updated the README yet.
The idea is not silly in my view; I did something similar here: https://github.com/pseudotensor/open-strawberry
The idea is that data generation is required first, to make the reasoning traces. ToT etc. are not required.
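Roughly, a minimal version of that data-generation step could look like the sketch below (rejection sampling against known answers; `sample_trace` is a placeholder, not the repo's actual API):

    # Sample several chains of thought per question and keep the ones
    # whose final answer matches the ground truth, to use as
    # reasoning-trace training data.
    def sample_trace(question: str) -> tuple[str, str]:
        raise NotImplementedError("plug in your model client here")

    def generate_traces(question: str, gold_answer: str, k: int = 8) -> list[dict]:
        kept = []
        for _ in range(k):
            trace, answer = sample_trace(question)
            if answer.strip() == gold_answer:
                kept.append({"question": question, "trace": trace})
        return kept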
> Result: .9 is larger than .11
we've broken the semver barrier!
I think this class of problem might be better solved by allowing the LLM to 'zoom in' and view the input differently. Rather like you might peer closer for more detail if someone asked you about the print quality of something you were reading.
'zoom in' could input the same text letter by letter, or even in image form (rasterize the text) to help answer questions like "How many letters in the word strawberry contain straight lines?"
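For the text case, a minimal 'zoom in' tool could be as simple as re-rendering the input one character at a time and feeding that back into the prompt (the rasterize-to-image variant would additionally need a vision model). Just a sketch:

    def zoom_in(text: str) -> str:
        # Re-present the input one character per line, with indices, so
        # the model sees letters instead of tokens.
        return "\n".join(f"{i}: {ch}" for i, ch in enumerate(text))

    print(zoom_in("strawberry"))
    # 0: s
    # 1: t
    # 2: r
    # ... feed this back in as tool output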
Did you benchmark your system against MMLU-Pro?
The repo doesn't mention it anywhere.
Man Elon makes things confusing.
You are an expert AI assistant that explains your reasoning step by step. For each step, provide a title that describes what you're doing in that step, along with the content. Decide if you need another step or if you're ready to give the final answer. Respond in JSON format with 'title', 'content', and 'next_action' (either 'continue' or 'final_answer') keys. USE AS MANY REASONING STEPS AS POSSIBLE. AT LEAST 3. BE AWARE OF YOUR LIMITATIONS AS AN LLM AND WHAT YOU CAN AND CANNOT DO. IN YOUR REASONING, INCLUDE EXPLORATION OF ALTERNATIVE ANSWERS. CONSIDER YOU MAY BE WRONG, AND IF YOU ARE WRONG IN YOUR REASONING, WHERE IT WOULD BE. FULLY TEST ALL OTHER POSSIBILITIES. YOU CAN BE WRONG. WHEN YOU SAY YOU ARE RE-EXAMINING, ACTUALLY RE-EXAMINE, AND USE ANOTHER APPROACH TO DO SO. DO NOT JUST SAY YOU ARE RE-EXAMINING. USE AT LEAST 3 METHODS TO DERIVE THE ANSWER. USE BEST PRACTICES.
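(For concreteness, a minimal driver loop for this JSON step protocol might look like the sketch below. The OpenAI-compatible client and model name are my assumptions, not the repo's actual code, and real code would need to handle the model emitting invalid JSON.)

    import json
    from openai import OpenAI  # any OpenAI-compatible client

    SYSTEM_PROMPT = "..."  # paste the step-by-step JSON prompt quoted above

    client = OpenAI()
    messages = [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": "Which is larger, .9 or .11?"},
    ]

    # Keep requesting reasoning steps until the model signals it is done.
    while True:
        resp = client.chat.completions.create(model="gpt-4o-mini", messages=messages)
        step = json.loads(resp.choices[0].message.content)  # may raise on bad JSON
        print(f"{step['title']}: {step['content']}")
        messages.append({"role": "assistant", "content": json.dumps(step)})
        if step["next_action"] == "final_answer":
            break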
The Python crap around it is superfluous. Does it work? Well, not really:
https://lluminous.chat/?sl=Yjkxpu
https://lluminous.chat/?sl=jooz48
I have also been using this prompt, and while it fails on the problem above, it works better for me than OP's prompt:
Write many chains of thought for how you’d approach solving the user's question. In this scenario, more is more. You need to type out as many thoughts as possible, placing all your thoughts inside <thinking> tags.
Your thoughts are only visible to yourself, the user does not see them and they should not be considered to be part of the final response.
Consider every possible angle, recheck your work at every step, and backtrack if needed.
Remember, there are no limits in terms of how long you can think - more thinking will always lead to a better solution.
You should use your thoughts as a scratchpad, much like humans do when performing complicated math with paper and pen. Don't omit any calculation, write everything out explicitly.
When counting or maths is involved, write down an enormously verbose scratchpad containing the full calculation, count, or proof, making sure to LABEL every step of the calculation, and writing down the solution step by step.
Always remember that if you find yourself consistently getting stuck, taking a step back and reconsidering your approach is a good idea. If multiple solutions are plausible, explore each one individually, and provide multiple answers.
Always provide mathematical proofs of mathematical answers. Be as formal as possible and use LaTeX.
Don't be afraid to give obvious answers. At the very very end, after pages upon pages of deep thoughts, synthesize the final answer, inside <answer> tags.
In particular it solves this problem: https://lluminous.chat/?sl=LkIWyS
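If anyone wants to wire this prompt up, stripping the scratchpad back out is a one-liner. A sketch; real code should tolerate missing or unclosed tags:

    import re

    def extract_answer(completion: str) -> str:
        # Return the content of the <answer> tags, discarding the private
        # <thinking> scratchpad; fall back to the raw completion if the
        # model forgot the tags.
        m = re.search(r"<answer>(.*?)</answer>", completion, re.DOTALL)
        return m.group(1).strip() if m else completion.strip()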
Tree of Thoughts is a more sophisticated method; see https://arxiv.org/pdf/2305.10601
The clue we all had with OpenAI for a long time was that this was a search through a tree: they hired Noam Brown, and his past work all hinted towards that. Q* is obviously a search on a tree, like A*. So take something like CoT, build out a tree, and search for the best solution across it. The search is the "system-2 reasoning".
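Nobody outside OpenAI knows the details, but the generic shape of "build a tree over CoT steps and search it" is easy to sketch. Here `expand` would be an LLM proposing candidate next thoughts and `score` a verifier/value model; both are stubs, and the whole thing is my guess at the shape, not their method:

    import heapq

    def tree_search(question, expand, score, beam=3, depth=4):
        # Best-first search over partial chains of thought.
        frontier = [(0.0, [])]  # (negated score, list of thoughts so far)
        for _ in range(depth):
            candidates = []
            for _, path in frontier:
                for thought in expand(question, path):
                    new_path = path + [thought]
                    heapq.heappush(candidates, (-score(question, new_path), new_path))
            if not candidates:
                break  # expansion dried up; keep the current frontier
            frontier = heapq.nsmallest(beam, candidates)
        return frontier[0][1]  # highest-scoring chain of thought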