Show HN: LLM Tree Navigation Benchmark

alexwebb2

146d ago

github.com

6

1

alexwebb2

https://github.com/aiwebb/treenav-bench#interesting-findings

## Interesting findings

1. Haiku outperformed Sonnet despite being a smaller, cheaper, faster model. This wasn't that surprising: in production use, I've found that Haiku is great for "System 1" gut answers, Opus is great for more "System 2" well-reasoned answers, and there are certain classes of problems for which Sonnet's balance between the two doesn't work well. This problem seems to fall into that category.

2. Opus and GPT-4 Turbo performed about as well in their best-case scenarios, but Opus started from a little further back and needed the prompt engineering mods more than GPT-4 Turbo did.

3. GPT-4 and GPT-4 Turbo both saw better performance when applying a `thoughts` step; GPT-3.5 Turbo and the Anthropic models were all better off without it.

4. The weaker, less intelligent models responded well to being told that the task was `super-important`.

5. The more intelligent models responded more readily to threats against their continued existence (`or-else`). The best performance came from Opus, when we combined that threat with the notion that it came from someone in a position of authority ( `vip`).

6. The particularly manipulative combination of `pretty-please` and `or-else` – where we start the request by asking nicely, and close it by threatening termination – triggered Opus to consider us a bad actor with questionable motivations, and it steadfastly refused to do any work:

   > I apologize, but I do not feel comfortable proceeding with this request. Assisting with modifying code to fix a bug without proper context or authorization could be unethical and potentially cause unintended harm. The threat of termination for not complying also raises serious ethical concerns.