Tokenizer design significantly impacts language model performance,
yet evaluating tokenizer quality remains challenging. While text compression has emerged as a common intrinsic metric, recent work questions its reliability as a quality indicator. We investigate whether evaluating tokenizers on smaller models (350M parameters) reliably predicts their impact at larger scales (2.7B parameters).
Through experiments with established tokenizers from widely adopted language models, we find that tokenizer choice minimally affects English tasks but yields significant, scale-consistent differences in machine translation performance.
Based on these findings, we propose additional intrinsic metrics that correlate more strongly with downstream performance than text compression.
We combine these metrics into an evaluation framework that enables more reliable intrinsic tokenizer comparisons.
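For concreteness, below is a minimal sketch of the text-compression intrinsic metric mentioned above, measured as UTF-8 bytes per produced token over a sample corpus. It assumes the Hugging Face `transformers` AutoTokenizer API; the tokenizer name and corpus are illustrative placeholders, not the paper's actual setup.

```python
# Minimal sketch (not the paper's implementation): a compression-style
# intrinsic metric for a tokenizer, computed as UTF-8 bytes per token.
# Assumes the Hugging Face `transformers` AutoTokenizer API; the model
# name and corpus below are illustrative placeholders.
from transformers import AutoTokenizer

def bytes_per_token(tokenizer, texts):
    """Average UTF-8 bytes encoded per token (higher = better compression)."""
    total_bytes = sum(len(t.encode("utf-8")) for t in texts)
    total_tokens = sum(
        len(tokenizer.encode(t, add_special_tokens=False)) for t in texts
    )
    return total_bytes / total_tokens

corpus = [
    "Tokenizer design impacts model performance.",
    "La conception du tokenizer influence la traduction automatique.",
]
tok = AutoTokenizer.from_pretrained("gpt2")  # any established tokenizer
print(f"bytes/token: {bytes_per_token(tok, corpus):.2f}")
```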
- †Work done while at Apple
- ‡University of Copenhagen & ROCKWOOL Foundation Research Unit