BWStearns
We have a script and an input library that has a bunch of scoring dimensions and allows a head-to-head comparison of a new candidate prompt vs. what's in prod. It takes a configuration (prompt, which LLM to use, temperature, etc.), runs it against all the various inputs, and produces a JSON blob of the outputs for scoring.
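Roughly, the harness looks something like this (a minimal sketch; names like PromptConfig, run_harness, and the call_llm callable are just for illustration, not our actual code):

    import json
    from dataclasses import dataclass, asdict

    @dataclass
    class PromptConfig:
        prompt: str          # candidate (or prod) prompt template
        model: str           # which LLM to use
        temperature: float

    def run_harness(config, inputs, call_llm):
        # Run one config against every input case and collect the raw outputs.
        # `inputs` is assumed to be a list of dicts that fill the prompt template.
        results = []
        for item in inputs:
            completion = call_llm(
                model=config.model,
                prompt=config.prompt.format(**item),
                temperature=config.temperature,
            )
            results.append({"input": item, "output": completion})
        # One JSON blob per config, which the scoring step consumes.
        return json.dumps({"config": asdict(config), "results": results})

You run it once with the prod config and once with the candidate config, then score the two blobs head to head.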

Most of the score dimensions are deterministic but we've added some where we integrated an LLM to do the scoring (... which brings the new problem of scoring the scoring prompt!). We also do a manual scan of the outputs to sanity check. Not doing any fine tuning yet as we're getting pretty good results with just prompting.
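The scoring dimensions are basically just a list of functions over each output; the LLM-graded ones wrap a judge prompt. Again only a sketch with made-up names, assuming the same call_llm helper as above:

    def score_length(output):
        # Deterministic dimension: penalize overly long answers.
        return 1.0 if len(output) < 800 else 0.0

    def score_tone(output, call_llm):
        # LLM-graded dimension: ask a judge model for a 0-1 score.
        # (This judge prompt is itself something we now have to evaluate.)
        verdict = call_llm(
            model="judge-model",
            prompt=f"Rate the tone of the following reply from 0 to 1. "
                   f"Answer with only the number.\n\n{output}",
            temperature=0.0,
        )
        return float(verdict.strip())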

ivanpashenko
"7 likes / no comments" --> should I read it as: people interested in others people experience, but have nothing to share about their own? - No prompt on production? - No testing or other routines about it yet?

Please share your current status :)