I used to use embeddings from the text encoder of the CLIP model to augment the prompt to better match corresponding images. For example, given the word "building" in a prompt, I would find its nearest neighbours in the embedding matrix, like "concrete", "underground" etc., and substitute/append those after the corresponding word. This led to a higher recall for most of the queries in my limited experiments!
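Roughly, the trick looks like this (a minimal sketch using the `openai/clip-vit-base-patch32` checkpoint from HF transformers; the model choice and the single-token simplification are my assumptions, not exactly what I ran):

```python
import torch
from transformers import CLIPTokenizer, CLIPTextModel

tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-base-patch32")
model = CLIPTextModel.from_pretrained("openai/clip-vit-base-patch32")
emb = model.get_input_embeddings().weight  # (vocab_size, dim) token embedding matrix

def nearest_tokens(word: str, k: int = 5) -> list[str]:
    # simplification: assumes the word maps to a single CLIP BPE token
    token_id = tokenizer(word, add_special_tokens=False)["input_ids"][0]
    with torch.no_grad():
        sims = torch.nn.functional.cosine_similarity(emb[token_id].unsqueeze(0), emb)
        top = sims.topk(k + 1).indices.tolist()  # k+1 so we can drop the word itself
    return [tokenizer.decode([i]).strip() for i in top if i != token_id][:k]

# e.g. neighbours to append after "building" in the prompt
print(nearest_tokens("building"))
```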
Looking forward to doing some benchmarking over the next couple weeks
But it seems quite dated technically - which I understand is a tradeoff for performance. Could you provide a way to toggle between different types of similarity (e.g. semantic, NLI, noun-abstract)?
E.g. I sometimes want "Freezing" and "Burning" to be very similar (similarity of 1), say when grouping/clustering newspaper articles into categories like "Extreme environmental events" - the kind of similarity measured on MTEB/Sentence-Similarity, and what classic Word2Vec/GloVe would give. But if this was a chemistry article, I'd want them to be opposites, like ChatGPT embeddings would score them. And sometimes I want to use NLI embeddings to work out the causal link between two things. Because the latter two embedding types are more recent (2019+), they are where the technical opportunity is - not the older MTEB/semantic-similarity ones, which have been performant enough for many use cases since 2014 and got a big boost around 2019 with MiniLM-v2 etc.
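To make the toggle idea concrete, here's a sketch using sentence-transformers models as stand-ins (the model names are my picks, not a claim about what this library should ship):

```python
from sentence_transformers import SentenceTransformer, CrossEncoder
from sentence_transformers.util import cos_sim

# semantic similarity: "Freezing" and "Burning" tend to score fairly high,
# since both describe extreme temperature events
sem = SentenceTransformer("all-MiniLM-L6-v2")
a, b = sem.encode(["Freezing", "Burning"])
print("semantic cosine:", cos_sim(a, b).item())

# NLI: score the relation between two statements instead of raw similarity
nli = CrossEncoder("cross-encoder/nli-deberta-v3-base")
scores = nli.predict([("The sample is freezing.", "The sample is burning.")])
print("nli logits:", scores)  # check the model card for the label order
```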
For the above 3 embedding types I can use SBERT, but the dimensions are large, the models are quite large, and having to load multiple models for different similarity types strains resources - it often takes about 6GB, because generative embedding models (or E5 etc.) are large, as are NLI models.
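One way I've softened the strain, as a sketch: only load each similarity type's model when it's first requested, and cache it, so memory is paid per type actually used (model choices here are again my assumptions):

```python
from functools import lru_cache
from sentence_transformers import SentenceTransformer, CrossEncoder

# one loader per similarity type; nothing is loaded at import time
LOADERS = {
    "semantic": lambda: SentenceTransformer("all-MiniLM-L6-v2"),
    "nli": lambda: CrossEncoder("cross-encoder/nli-deberta-v3-base"),
}

@lru_cache(maxsize=None)
def get_model(kind: str):
    # first call per kind pays the load cost; repeats hit the cache
    return LOADERS[kind]()
```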