> Our key innovation is a simple but effective data collection strategy that avoids these problems: we ask annotators to describe images in speech
I see this as another example of datasets trumping architecture nowadays.
The architecture is not where the innovation is: it is just CLIP embeddings projected into the LLM's token space through an MLP, with some pooling to reduce the token count.
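To make the "CLIP embeddings, pooling, MLP" description concrete, here is a rough NumPy sketch of that kind of connector. Everything in it is my own assumption for illustration: the function name `connector`, the 2x2 average pooling, the ReLU, and the dimensions are not taken from the paper, which may pool and project differently.

```python
import numpy as np

def connector(patch_feats, w1, b1, w2, b2, pool=2):
    """Map a grid of vision-encoder patch embeddings to LLM-sized tokens.

    patch_feats: (H, W, D_vit) array of patch embeddings (e.g. from CLIP).
    w1, b1, w2, b2: weights of a two-layer MLP (hypothetical shapes).
    pool: side of the square pooling window; H and W must be divisible by it.
    """
    h, w, d = patch_feats.shape
    # Average each pool x pool neighborhood, cutting the token count by pool**2.
    # Average pooling is a stand-in here, not necessarily what Molmo uses.
    pooled = patch_feats.reshape(h // pool, pool, w // pool, pool, d).mean(axis=(1, 3))
    tokens = pooled.reshape(-1, d)              # (H*W / pool**2, D_vit)
    hidden = np.maximum(tokens @ w1 + b1, 0.0)  # ReLU chosen only for simplicity
    return hidden @ w2 + b2                     # (num_tokens, D_llm)

# Toy dimensions, far smaller than a real ViT or LLM.
rng = np.random.default_rng(0)
feats = rng.normal(size=(8, 8, 16))             # 64 patches, 16-dim each
w1, b1 = rng.normal(size=(16, 32)), np.zeros(32)
w2, b2 = rng.normal(size=(32, 24)), np.zeros(24)
llm_tokens = connector(feats, w1, b1, w2, b2)   # 16 tokens, 24-dim each
```

The output rows would then be placed in the LLM's input sequence alongside the text token embeddings, which is all the "architecture" amounts to in this reading.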
That being said, I think the evaluation definitely tilts things in Molmo's favor by including so many benchmarks that seem to favor it, in particular the counting ones. The average hides a pretty modest MMLU score compared to the state of the art.