Molmo: a family of open multimodal AI models

imjonse

Apart from benchmark results, what sets the AllenAI models (OLMo/OLMoE/Molmo) apart is that they are fully open, not just open-weights or free to use. The datasets used, a crucial ingredient, are also disclosed and open. UPDATE: they say the datasets will be made available, but they aren't yet.

espadrine

The paper: https://molmo.allenai.org/paper.pdf

> Our key innovation is a simple but effective data collection strategy that avoids these problems: we ask annotators to describe images in speech

I see this as another example that datasets trump architecture nowadays.

The architecture is not where the innovation is: it is only CLIP embeddings projected into the LLM's token space through an MLP, with some pooling to reduce the token count.
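To make that concrete, here is a rough toy sketch of such a connector in plain Python. It is not the Molmo code: the 2x2 mean pooling, the ReLU, the tiny dimensions, and every function name are my own assumptions for illustration, and the paper's actual pooling and layer sizes may differ.

```python
import random

def mean_pool_2x2(grid):
    """Average each non-overlapping 2x2 block of patch vectors.
    Assumption: plain mean pooling; the real model may pool differently."""
    rows, cols = len(grid), len(grid[0])
    pooled = []
    for r in range(0, rows, 2):
        for c in range(0, cols, 2):
            block = [grid[r][c], grid[r][c + 1], grid[r + 1][c], grid[r + 1][c + 1]]
            pooled.append([sum(vals) / 4.0 for vals in zip(*block)])
    return pooled

def linear(x, weight, bias):
    """Dense layer: weight has shape (d_out, d_in)."""
    return [sum(xi * wi for xi, wi in zip(x, row)) + b for row, b in zip(weight, bias)]

def relu(x):
    return [max(0.0, v) for v in x]

def init_linear(rng, d_in, d_out):
    """Random weights stand in for trained parameters (hypothetical helper)."""
    weight = [[rng.uniform(-0.1, 0.1) for _ in range(d_in)] for _ in range(d_out)]
    return weight, [0.0] * d_out

def connector(grid, w1, b1, w2, b2):
    """Pool the patch grid, then map each pooled vector to the LLM width."""
    tokens = []
    for vec in mean_pool_2x2(grid):
        hidden = relu(linear(vec, w1, b1))
        tokens.append(linear(hidden, w2, b2))
    return tokens

rng = random.Random(0)
VISION_DIM, HIDDEN_DIM, LLM_DIM = 8, 16, 12  # toy sizes, not the real ones

# A 4x4 grid of fake patch embeddings standing in for the vision encoder output.
grid = [[[rng.gauss(0, 1) for _ in range(VISION_DIM)] for _ in range(4)] for _ in range(4)]
w1, b1 = init_linear(rng, VISION_DIM, HIDDEN_DIM)
w2, b2 = init_linear(rng, HIDDEN_DIM, LLM_DIM)

tokens = connector(grid, w1, b1, w2, b2)
print(len(tokens), len(tokens[0]))  # 16 patches pooled to 4 tokens of width 12
```

The pooled, projected vectors would then be fed to the LLM alongside the text token embeddings. The point is how little machinery sits between the vision encoder and the language model.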

causal

That graphic comparing benchmark averages is really nice, wish things were presented so clearly more often.

That being said, I think the average definitely tilts things in Molmo's favor by including so many benchmarks that seem to favor Molmo, in particular the counting ones. It also hides that Molmo has a pretty modest MMLU score compared to the state of the art.

danielcampos93

Not mentioned in their blog posts, but on the model cards on Hugging Face: "Molmo 72B is based on Qwen2-72B and uses OpenAI CLIP as vision backbone. Molmo-72B achieves the highest academic benchmark score and ranks second on human evaluation, just slightly behind GPT-4o." Others are based on Qwen 7B. What happened to the OLMo chain?

naiv

Image was flagged as inappropriate by the Google Vision API?