kibwen
> A tokenizer might encode “once upon a time” as “once,” “upon,” “a,” “time,” for example, while encoding “once upon a ” (which has a trailing whitespace) as “once,” “upon,” “a,” ” .” Depending on how a model is prompted — with “once upon a” or “once upon a ,” — the results may be completely different, because the model doesn’t understand (as a person would) that the meaning is the same.

TechCrunch, I respect that you have a style guide vis-à-vis punctuation and quotation marks, but please understand when it's appropriate to break the rules. :P
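If you want to see the effect the quoted passage describes, here's a quick sketch using OpenAI's tiktoken library with the cl100k_base encoding (the exact splits vary from tokenizer to tokenizer):

    # Compare how a BPE tokenizer splits a phrase with and without a trailing
    # space. Requires `pip install tiktoken`; cl100k_base is just one common
    # encoding, and other tokenizers may split the text differently.
    import tiktoken

    enc = tiktoken.get_encoding("cl100k_base")

    for text in ["once upon a time", "once upon a "]:
        ids = enc.encode(text)
        pieces = [enc.decode([i]) for i in ids]
        print(repr(text), "->", pieces)

    # The trailing space typically becomes its own token, so the two prompts
    # hand the model different token sequences even though a human reads them
    # as the same prefix.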

Lerc
While I think tokenization does cause limitations, it is not as clear-cut as one might think.

GPT-4 can answer questions given to it in Base64. I would imagine it suffers some degree of degradation in ability from the extra workload this causes, but I haven't seen any measurements of this.

I have wondered about other architectures that might help. What happens when a little subnet encodes the (16 or 32?) characters in the neighborhood of the token into an embedding that gets attached to the top-level token embedding?
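Something like this, maybe (a minimal PyTorch sketch of the idea; the window size, the convolutional pooling, and the way the two embeddings get combined are all arbitrary choices on my part):

    # A character-aware token embedding: a small subnet encodes the characters
    # surrounding each token, and its output is concatenated onto the ordinary
    # learned token embedding before being projected back down.
    import torch
    import torch.nn as nn

    class CharAugmentedEmbedding(nn.Module):
        def __init__(self, vocab_size, d_token=768, d_char=64, n_chars=256):
            super().__init__()
            self.token_emb = nn.Embedding(vocab_size, d_token)
            self.char_emb = nn.Embedding(n_chars, d_char)
            # Tiny convolutional subnet over the character window.
            self.char_net = nn.Sequential(
                nn.Conv1d(d_char, d_char, kernel_size=3, padding=1),
                nn.ReLU(),
                nn.AdaptiveMaxPool1d(1),
            )
            self.proj = nn.Linear(d_token + d_char, d_token)

        def forward(self, token_ids, char_windows):
            # token_ids:    (batch, seq)
            # char_windows: (batch, seq, window) character codes around each token
            tok = self.token_emb(token_ids)                     # (B, S, d_token)
            B, S, W = char_windows.shape
            ch = self.char_emb(char_windows.view(B * S, W))     # (B*S, W, d_char)
            ch = self.char_net(ch.transpose(1, 2)).squeeze(-1)  # (B*S, d_char)
            ch = ch.view(B, S, -1)                              # (B, S, d_char)
            return self.proj(torch.cat([tok, ch], dim=-1))      # (B, S, d_token)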

kevincox
This seems like saying "Synonyms are a big reason that humans fall short."

Part of what makes AI interesting is that it can understand a huge variety of differently phrased inputs. It seems like different token encodings would add only minor complexity compared to the variety of human language itself.

amrb
An alternative approach to BPE tokenization: https://arxiv.org/abs/2406.19223
greenyies
I find this article very weird.

It doesn't really explain anything besides talking about tokenization at seemingly random levels.

You need a certain amount of data to even understand that "once upon a time" might be a higher-level concept.

ttul
Tokenization is a statistical technique that greatly compresses the input while providing some semantic hints to the underlying model. It is not the big thing holding back generative models. There are so many other challenges being worked on and steadily overcome, and progress has been insanely rapid.
PaulHoule
One take on it is that chatbots ought to know something about the tokens they consume. For instance, you should be able to ask one how it tokenizes a phrase, what the numeric IDs of those tokens are, etc. One possibility is to train it on synthetic documents that describe the tokenization system.
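For example, something along these lines could churn out those synthetic documents (a sketch that assumes tiktoken's cl100k_base encoding stands in for whatever tokenizer the chatbot actually uses):

    # Generate synthetic documents that spell out how a phrase is tokenized,
    # so a model could in principle be fine-tuned to answer questions about
    # its own tokenizer. Requires `pip install tiktoken`.
    import tiktoken

    enc = tiktoken.get_encoding("cl100k_base")

    def describe_tokenization(phrase: str) -> str:
        ids = enc.encode(phrase)
        pieces = [enc.decode([i]) for i in ids]
        lines = [f'The phrase "{phrase}" is split into {len(ids)} tokens.']
        for piece, tid in zip(pieces, ids):
            lines.append(f"Token {tid} corresponds to the text {piece!r}.")
        return "\n".join(lines)

    print(describe_tokenization("once upon a time"))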
deepsquirrelnet
I implemented a hierarchical model that pooled UTF-8-encoded sequences into word vectors and trained it with a decoder on text denoising.

I think the future is a small word encoder model that replaces the token embedding codebook.

And here’s the reason: you can still create a codebook after training and then use the encoder model only for OOV. I’m not sure there’s an excuse not to be doing this, but open to suggestions.
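Roughly the shape of what I mean (a PyTorch sketch, not the exact model I trained; the GRU pooling, the dimensions, and the caching policy here are placeholders):

    # A small byte-level word encoder that produces word vectors. After
    # training, frequent words can be cached into a codebook (a plain lookup
    # table), and the encoder only runs for out-of-vocabulary words.
    import torch
    import torch.nn as nn

    class ByteWordEncoder(nn.Module):
        def __init__(self, d_model=512, d_byte=64, max_bytes=32):
            super().__init__()
            self.max_bytes = max_bytes
            self.byte_emb = nn.Embedding(256, d_byte)
            self.pool = nn.GRU(d_byte, d_model, batch_first=True)

        def encode_word(self, word: str) -> torch.Tensor:
            data = word.encode("utf-8")[: self.max_bytes]
            ids = torch.tensor([list(data)])          # (1, n_bytes)
            _, h = self.pool(self.byte_emb(ids))      # h: (1, 1, d_model)
            return h[0, 0]                            # (d_model,)

    class CodebookWithFallback:
        """Precompute vectors for frequent words; run the encoder only for OOV."""
        def __init__(self, encoder, frequent_words):
            self.encoder = encoder
            with torch.no_grad():
                self.codebook = {w: encoder.encode_word(w) for w in frequent_words}

        def __call__(self, word: str) -> torch.Tensor:
            vec = self.codebook.get(word)
            if vec is None:  # OOV: fall back to the encoder
                with torch.no_grad():
                    vec = self.encoder.encode_word(word)
            return vec

    # Usage
    encoder = ByteWordEncoder()
    lookup = CodebookWithFallback(encoder, ["once", "upon", "a", "time"])
    print(lookup("once").shape, lookup("zyzzyva").shape)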

soloist11
This is like saying binary numbers are the reason generative AI falls short. Computers work with transistors, which are either on or off, so what are these people proposing as the next computational paradigm to fix the problems with binary generative AI?