Yi-Coder results, with Sonnet and GPT-3.5 for scale:
77% Sonnet
58% GPT-3.5
54% Yi-Coder-9b-Chat
45% Yi-Coder-9b-Chat-q4_0
Full leaderboard: [1]
I wonder what's the reason.
I'm still waiting for a model that's highly specialised for a single language only - either a lot smaller than these jack-of-all-trades models, or VERY good at that specific language's nuances + libraries.
Then I tried other questions from my past work to compare... However, I believe the engineers who built the LLM just trained on the benchmark questions.
In one instance, after an hour of use (at which point I stopped), it answered a single question in 4 different programming languages, with answers that were in no way related to the question.
Also, for the cloud models apart from GitHub Copilot, what tools or steps are you all using to get them working on your projects? Any tips or resources would be super helpful!
I'm not so much interested in the response time (does anyone have a couple of spare A100s?), but it would be good to be able to try out different LLMs locally.
For practical reasons, I often want to know how much GPU RAM is required to run these models locally. The raw parameter count only expresses some kind of relative capability, which I doubt is what most users actually need.
Edit: reformulated to sound like a genuine question instead of a complaint.
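A rough rule of thumb is weights times bytes per weight, plus some headroom for the KV cache and activations. A minimal sketch (the 20% overhead factor is a guess, not a measured number):

```python
def estimate_vram_gb(params_billion, bits_per_weight, overhead=1.2):
    """Rough VRAM estimate: parameter bytes plus ~20% headroom
    for KV cache and activations (overhead factor is a guess)."""
    weight_bytes = params_billion * 1e9 * bits_per_weight / 8
    return weight_bytes * overhead / 1e9

# Yi-Coder-9B at FP16 vs q4_0 (4-bit)
print(round(estimate_vram_gb(9, 16), 1))  # ~21.6 GB
print(round(estimate_vram_gb(9, 4), 1))   # ~5.4 GB
```

This is why the 4-bit quant fits on a consumer GPU while FP16 needs a 24 GB card or more; actual usage also grows with context length, which this sketch ignores.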
I hope that Yi-Coder 9B FP16 and Q8 will be available soon on Ollama; right now I only see the 4-bit quantized 9B model.
I'm assuming these will be quite a bit better than the 4-bit model.
Using SWE-agent + Yi-Coder-9B-Chat.
[1] https://aider.chat/docs/leaderboards/