Will tokenizers disappear?

Dec 28, 2024


This is the strong claim made by Byte Latent Transformer (BLT), a paper from Meta.

BLT relies on a relatively convoluted process to transform bytes into ngrams and then patches. This is probably one of the main downsides of the approach: a tokenizer is relatively transparent, with one unit corresponding to a word or part of a word. Even though at pleias we have spent a lot of time studying and designing tokenizers, it was still unclear to us what precisely was happening here (especially the part where they use embeddings from byte ngrams).
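For what it's worth, my rough reading of that ngram step is a hash-based embedding along the following lines; the ngram sizes, the shared table, and the hash function are my own simplifications, not the paper's exact setup.

```python
import torch
import torch.nn as nn

class ByteNgramEmbedding(nn.Module):
    """Sketch: each byte gets its own embedding, plus the embeddings of the
    byte ngrams ending at its position, hashed into a fixed-size table."""

    def __init__(self, dim=256, table_size=50_000, ngram_sizes=(3, 4, 5)):
        super().__init__()
        self.byte_emb = nn.Embedding(256, dim)          # one entry per possible byte value
        self.ngram_emb = nn.Embedding(table_size, dim)  # shared hashed ngram table (assumption)
        self.table_size = table_size
        self.ngram_sizes = ngram_sizes

    def forward(self, byte_ids: torch.Tensor) -> torch.Tensor:
        seq = byte_ids.tolist()
        reps = []
        for i in range(len(seq)):
            vec = self.byte_emb(byte_ids[i])
            for n in self.ngram_sizes:
                if i + 1 >= n:
                    # hash the ngram ending at position i into a bucket of the table
                    bucket = hash(tuple(seq[i + 1 - n : i + 1])) % self.table_size
                    vec = vec + self.ngram_emb(torch.tensor(bucket))
            reps.append(vec)
        return torch.stack(reps)  # (seq_len, dim)

byte_ids = torch.tensor(list("Game of Thrones".encode("utf-8")))
print(ByteNgramEmbedding()(byte_ids).shape)  # torch.Size([15, 256])
```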

Different approaches have been explored to create the patches, and the most efficient one is based on… entropy. Basically, BLT merges every low-entropy, least surprising sequence of bytes and conversely keeps "decisive" bytes as one-letter patches, so that "Game of Thrones" can be cut into "G" and then "ame of Thrones". This process is not that far from the LLM sampler Entropix: while Entropix does not change the existing tokenizer, it similarly allocates more "thinking" to decisive tokens.
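A toy version of this entropy-driven patching, with a hypothetical `next_byte_probs` function standing in for the small byte-level language model that scores entropy (the paper's actual boundary rules, with global and monotonicity thresholds, are more refined than this):

```python
import math

def entropy(probs):
    """Shannon entropy (in bits) of a distribution over the 256 possible next bytes."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def entropy_patches(data: bytes, next_byte_probs, threshold: float = 2.0):
    """Merge runs of low-entropy (predictable) bytes into patches and keep
    high-entropy, 'decisive' bytes as short patches of their own."""
    patches, current = [], bytearray()
    for i in range(len(data)):
        if entropy(next_byte_probs(data[:i])) > threshold:
            if current:                      # close the running low-entropy patch
                patches.append(bytes(current))
                current = bytearray()
            patches.append(data[i : i + 1])  # decisive byte stands alone
        else:
            current.append(data[i])
    if current:
        patches.append(bytes(current))
    return patches

TITLE = b"Game of Thrones"

def toy_probs(prefix: bytes):
    """Toy stand-in for the byte LM: once it has seen 'G' it 'recognises'
    the title and confidently predicts the rest; otherwise it is clueless."""
    if prefix and TITLE.startswith(prefix) and len(prefix) < len(TITLE):
        dist = [1e-4] * 256
        dist[TITLE[len(prefix)]] = 1 - 1e-4 * 255
        return dist
    return [1.0 / 256] * 256

print(entropy_patches(TITLE, toy_probs))  # [b'G', b'ame of Thrones']
```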

The patches are interpreted or generated by an encoder and a decoder model. Even though they don’t reference it, the approach reminds me a lot of image transformers/SigLIP. In this sense, "tokenizers" are not really dead as a concept: we still need to train a separate piece of the model to process text inputs.
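As a mental model, the overall architecture I take away looks roughly like the skeleton below: a lightweight local encoder pools bytes into patch vectors, a larger latent transformer models the patch sequence, and a local decoder maps back to byte predictions. Module names, sizes, and the mean-pooling shortcut are my assumptions; the actual model uses cross-attention between bytes and patches, and its local decoder is itself a small transformer.

```python
import torch
import torch.nn as nn

class ToyBLT(nn.Module):
    """Skeleton only: a small local encoder pools bytes into patch vectors,
    a larger latent transformer models the patch sequence, and a local
    decoder predicts next-byte logits per patch. Mean-pooling and the
    linear decoder are simplifications of the real cross-attention modules."""

    def __init__(self, dim=256, latent_dim=512):
        super().__init__()
        self.byte_emb = nn.Embedding(256, dim)
        self.local_encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True), num_layers=1)
        self.to_latent = nn.Linear(dim, latent_dim)
        self.latent_transformer = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(latent_dim, nhead=8, batch_first=True), num_layers=6)
        self.local_decoder = nn.Linear(latent_dim, 256)  # next-byte logits per patch

    def forward(self, patches):
        # patches: list of 1-D byte-id tensors, one per entropy patch
        patch_vecs = []
        for p in patches:
            h = self.local_encoder(self.byte_emb(p).unsqueeze(0))  # (1, len, dim)
            patch_vecs.append(self.to_latent(h.mean(dim=1)))       # pool bytes -> patch vector
        latent = self.latent_transformer(torch.cat(patch_vecs).unsqueeze(0))
        return self.local_decoder(latent)  # (1, n_patches, 256)

patches = [torch.tensor(list(b"G")), torch.tensor(list(b"ame of Thrones"))]
print(ToyBLT()(patches).shape)  # torch.Size([1, 2, 256])
```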

This leads to the strongest claim from the article: entropy patches are more compute-efficient. Basically, we allocate more compute to the "harder" parts of the text and less to the "easier" parts, with text segmentation being redefined continuously. In contrast with past similar concepts, BLT has been tested at scale and shown to perform as well as, if not better than, a vanilla Llama on several benchmarks. Especially interesting, BLT seems to outperform significantly on "noisy" benchmarks, as byte-level processing brings more resiliency. This could prove especially critical for models in production, as they frequently have to interact with noisier resources than common benchmarks: badly OCRized texts, for a start.

More surprisingly, BLT has only been used on an English-language text corpus. The most interesting applications for me would be either highly multilingual texts or "actual" multimodality: all digital content is still ultimately made of bytes. Beyond technical feasibility, it remains unknown at this stage whether BLT can effectively ingest images, sounds, or binary data.

In the end, Byte Latent Transformer is not really tokenizer-less. We still have:
* A process encoding text representations that has to be trained somewhere.
* Higher-order units representing input, likely better optimized but also more convoluted (bytes to ngrams to patches).
* Some potential for bias, as languages less represented in the training data will also have a higher entropy.