arXiv cs.CL AI Research Apr 21

Defragmenting Language Models: An Interpretability-based Approach for Vocabulary Expansion

Significance: 3/5

The paper investigates 'token over-fragmentation' in large language models, in which text in non-Latin scripts is split into more tokens than comparable English text. It proposes an interpretability-based approach to vocabulary expansion and embedding initialization that aims to improve efficiency and performance.
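
To make over-fragmentation concrete, the sketch below computes a simple tokens-per-word ratio (often called fertility) for an English and a Hindi sentence. It is an illustration only, not the paper's method: the greedy longest-match tokenizer, the helper names, and the tiny English-only `TOY_VOCAB` are all hypothetical stand-ins for a real subword tokenizer.

```python
def greedy_tokenize(word, vocab):
    """Longest-match-first segmentation (hypothetical toy tokenizer).

    Any character not covered by a vocabulary entry becomes its own token.
    """
    tokens = []
    i = 0
    while i < len(word):
        # Try the longest remaining substring first, shrinking to one character.
        for j in range(len(word), i, -1):
            if word[i:j] in vocab or j == i + 1:
                tokens.append(word[i:j])
                i = j
                break
    return tokens


def fertility(text, vocab):
    """Average number of tokens per whitespace-separated word (text must be non-empty)."""
    words = text.split()
    n_tokens = sum(len(greedy_tokenize(w, vocab)) for w in words)
    return n_tokens / len(words)


# Assumed toy vocabulary covering only English subwords.
TOY_VOCAB = {"the", "cat", "sleep", "s"}

english = "the cat sleeps"
hindi = "बिल्ली सोती है"

print("English fertility:", fertility(english, TOY_VOCAB))
print("Hindi fertility:  ", fertility(hindi, TOY_VOCAB))
```

Because the toy vocabulary has no entries for Devanagari, every Hindi word falls apart into single characters, while the English words map to one or two tokens each. Real tokenizers show a milder version of the same gap.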

Why it matters: Moving beyond frequency-based expansion addresses the fundamental efficiency bottlenecks and linguistic biases inherent in current LLM tokenization architectures.

