I used to use embeddings from the text encoder of the CLIP model to augment the prompt to better match corresponding images. For example, given the word "building" in a prompt, I would find its nearest neighbours in the embedding matrix, like "concrete", "underground" etc., and substitute/append those after the corresponding word. This led to a higher recall for most of the queries in my limited experiments!
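Roughly, the trick looks like this (a minimal sketch using the `openai/clip-vit-base-patch32` checkpoint from HF transformers; the model choice and the single-token simplification are my assumptions, not exactly what I ran):

```python
import torch
from transformers import CLIPTokenizer, CLIPTextModel

tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-base-patch32")
model = CLIPTextModel.from_pretrained("openai/clip-vit-base-patch32")
emb = model.get_input_embeddings().weight  # (vocab_size, dim) token embedding matrix

def nearest_tokens(word: str, k: int = 5) -> list[str]:
    # simplification: assumes the word maps to a single CLIP BPE token
    token_id = tokenizer(word, add_special_tokens=False)["input_ids"][0]
    with torch.no_grad():
        sims = torch.nn.functional.cosine_similarity(emb[token_id].unsqueeze(0), emb)
        top = sims.topk(k + 1).indices.tolist()  # k+1 so we can drop the word itself
    return [tokenizer.decode([i]).strip() for i in top if i != token_id][:k]

# e.g. neighbours to append after "building" in the prompt
print(nearest_tokens("building"))
```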
Looking forward to doing some benchmarking over the next couple weeks
But it seems quite dated technically - which I understand is a tradeoff for performance. Could you provide a way to toggle between different types of similarity (e.g. semantic, NLI, noun-abstract)?
E.g. I sometimes want "Freezing" and "Burning" to be very similar (similarity of 1), say when grouping/clustering newspaper articles into categories like "Extreme environmental events" - the kind of similarity measured on MTEB/Sentence-Similarity, and what classic Word2Vec/GloVe would give. But if this was a chemistry article, I'd want them to be opposites, like ChatGPT embeddings would score them. And sometimes I want to use NLI embeddings to work out the causal link between two things. Because the latter two embedding types are more recent (2019+), they are where the technical opportunity is - not the older MTEB/semantic-similarity ones, which have been performant enough for many use cases since 2014 and got a big boost around 2019 with MiniLM-v2 etc.
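To make the toggle idea concrete, here's a sketch using sentence-transformers models as stand-ins (the model names are my picks, not a claim about what this library should ship):

```python
from sentence_transformers import SentenceTransformer, CrossEncoder
from sentence_transformers.util import cos_sim

# semantic similarity: "Freezing" and "Burning" tend to score fairly high,
# since both describe extreme temperature events
sem = SentenceTransformer("all-MiniLM-L6-v2")
a, b = sem.encode(["Freezing", "Burning"])
print("semantic cosine:", cos_sim(a, b).item())

# NLI: score the relation between two statements instead of raw similarity
nli = CrossEncoder("cross-encoder/nli-deberta-v3-base")
scores = nli.predict([("The sample is freezing.", "The sample is burning.")])
print("nli logits:", scores)  # check the model card for the label order
```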
For the above 3 embedding types I can use SBERT, but the dimensions are large, the models are quite large, and having to load multiple models for different similarity types strains resources - it often takes about 6GB, because generative embedding models (or E5 etc.) are large, as are NLI models.
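One way I've softened the strain, as a sketch: only load each similarity type's model when it's first requested, and cache it, so memory is paid per type actually used (model choices here are again my assumptions):

```python
from functools import lru_cache
from sentence_transformers import SentenceTransformer, CrossEncoder

# one loader per similarity type; nothing is loaded at import time
LOADERS = {
    "semantic": lambda: SentenceTransformer("all-MiniLM-L6-v2"),
    "nli": lambda: CrossEncoder("cross-encoder/nli-deberta-v3-base"),
}

@lru_cache(maxsize=None)
def get_model(kind: str):
    # first call per kind pays the load cost; repeats hit the cache
    return LOADERS[kind]()
```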