sentencepiece

davila7
Tags: Design, Tokenization, SentencePiece, Language-Independent, BPE, Unigram, Multilingual, CJK Languages, Unicode, Deterministic, Google

About

SentencePiece is a language-independent tokenizer that processes raw Unicode text using BPE or Unigram algorithms. It's fast, memory-efficient, and provides deterministic vocabulary generation without needing language-specific pre-tokenization. Use it for multilingual models, CJK language support, or when you require reproducible tokenization.

Quick Install

Claude Code

Plugin Command (Recommended)
/plugin add https://github.com/davila7/claude-code-templates
Git Clone (Alternative)
git clone https://github.com/davila7/claude-code-templates.git ~/.claude/skills/sentencepiece

Copy and paste one of these commands into Claude Code to install this skill.

Documentation

SentencePiece - Language-Independent Tokenization

Unsupervised tokenizer that works on raw text without language-specific preprocessing.

When to use SentencePiece

Use SentencePiece when:

  • You are building multilingual models (no language-specific rules)
  • You are working with CJK languages (Chinese, Japanese, Korean)
  • You need reproducible tokenization (deterministic vocabulary)
  • You want to train on raw text (no pre-tokenization needed)
  • You require lightweight deployment (~6 MB model memory, ~50,000 sentences/sec)

Performance:

  • Speed: 50,000 sentences/sec
  • Memory: ~6MB for loaded model
  • Languages: All (language-independent)

Consider these alternatives instead:

  • HuggingFace Tokenizers: Faster training, more flexibility
  • tiktoken: OpenAI models (GPT-3.5/4)
  • BERT WordPiece: English-centric tasks

Quick start

Installation

# Python
pip install sentencepiece

# C++ (requires CMake)
git clone https://github.com/google/sentencepiece.git
cd sentencepiece
mkdir build && cd build
cmake .. && make -j $(nproc)
sudo make install

Train model

# Command-line (BPE with 8000 vocab)
spm_train --input=data.txt --model_prefix=m --vocab_size=8000 --model_type=bpe

# Python API
import sentencepiece as spm

spm.SentencePieceTrainer.train(
    input='data.txt',
    model_prefix='m',
    vocab_size=8000,
    model_type='bpe'
)

Training time: ~1-2 minutes for a 100 MB corpus

Encode and decode

import sentencepiece as spm

# Load model
sp = spm.SentencePieceProcessor(model_file='m.model')

# Encode to pieces
pieces = sp.encode('This is a test', out_type=str)
print(pieces)  # ['▁This', '▁is', '▁a', '▁test']

# Encode to IDs
ids = sp.encode('This is a test', out_type=int)
print(ids)  # [284, 47, 11, 1243]

# Decode
text = sp.decode(ids)
print(text)  # "This is a test"
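
The processor also exposes the vocabulary directly, which is handy for debugging a trained model. A minimal sketch, assuming the m.model trained above (the printed values are illustrative):

# Inspect the vocabulary of the loaded model
print(sp.get_piece_size())         # 8000 (total vocabulary, including <unk>, <s>, </s>)

# Map between pieces and IDs in both directions
print(sp.piece_to_id('▁test'))     # e.g. 1243
print(sp.id_to_piece(0))           # '<unk>' (default unk_id is 0)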

Language-independent design

Whitespace as symbol (▁)

text = "Hello world"
pieces = sp.encode(text, out_type=str)
print(pieces)  # ['▁Hello', '▁world']

# Decode preserves spaces
decoded = sp.decode_pieces(pieces)
print(decoded)  # "Hello world"

Key principle: text is treated as a raw Unicode sequence, and whitespace is encoded as the meta symbol ▁, so decoding losslessly restores the original string.
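
Because there is no language-specific pre-tokenizer, text without whitespace is handled the same way. A minimal sketch, assuming a model trained on a corpus that includes Japanese (the segmentation shown is illustrative):

# No word boundaries are required in the input
pieces = sp.encode('これはテストです', out_type=str)
print(pieces)  # e.g. ['▁これは', 'テスト', 'です']

# Decoding restores the original string exactly
print(sp.decode_pieces(pieces))  # "これはテストです"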

Tokenization algorithms

BPE (Byte-Pair Encoding)

spm.SentencePieceTrainer.train(
    input='data.txt',
    model_prefix='bpe_model',
    vocab_size=16000,
    model_type='bpe'
)

Used by: mBART

Unigram (default)

spm.SentencePieceTrainer.train(
    input='data.txt',
    model_prefix='unigram_model',
    vocab_size=8000,
    model_type='unigram'
)

Used by: T5, ALBERT, XLNet
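
Both model types expose the same API, so it is easy to compare how they segment the same input. A minimal sketch, assuming the bpe_model and unigram_model trained above (the pieces shown are illustrative):

import sentencepiece as spm

bpe = spm.SentencePieceProcessor(model_file='bpe_model.model')
uni = spm.SentencePieceProcessor(model_file='unigram_model.model')

text = 'internationalization'
print(bpe.encode(text, out_type=str))  # e.g. ['▁intern', 'ational', 'ization']
print(uni.encode(text, out_type=str))  # e.g. ['▁international', 'ization']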

Training configuration

Essential parameters

spm.SentencePieceTrainer.train(
    input='corpus.txt',
    model_prefix='m',
    vocab_size=32000,
    model_type='unigram',
    character_coverage=0.9995,  # 1.0 for CJK
    user_defined_symbols=['[SEP]', '[CLS]'],
    unk_piece='<unk>',
    num_threads=16
)
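
Symbols passed via user_defined_symbols are reserved as single pieces and are never split during encoding, which is easy to verify after training. A minimal sketch, assuming the model trained above (IDs and pieces are illustrative):

sp = spm.SentencePieceProcessor(model_file='m.model')

# Reserved symbols get fixed IDs and stay atomic
print(sp.piece_to_id('[SEP]'))
print(sp.encode('first[SEP]second', out_type=str))  # e.g. ['▁first', '[SEP]', 'second']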

Character coverage

Language Type    Coverage    Rationale
English          0.9995      Most common characters
CJK (Chinese)    1.0         All characters needed
Multilingual     0.9995      Balance
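
For a CJK corpus, the main changes are full character coverage and typically a larger vocabulary, since rare characters would otherwise fall back to <unk>. A minimal sketch (the corpus file name and sizes are assumptions):

spm.SentencePieceTrainer.train(
    input='zh_corpus.txt',          # hypothetical Chinese corpus
    model_prefix='zh',
    vocab_size=32000,
    model_type='unigram',
    character_coverage=1.0          # keep every character seen in the corpus
)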

Encoding options

Subword regularization

# Sample different tokenizations
for _ in range(3):
    pieces = sp.encode('tokenization', out_type=str, enable_sampling=True, alpha=0.1)
    print(pieces)

# Output (different each time):
# ['▁token', 'ization']
# ['▁tok', 'en', 'ization']

Use case: Data augmentation for robustness.
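
In practice, sampling is applied on the fly while building training batches, so each epoch sees slightly different segmentations of the same sentences. A minimal sketch, assuming the m.model loaded above:

sentences = ['This is a test', 'Subword regularization improves robustness']

def sample_ids(text):
    # nbest_size=-1 samples over all candidate segmentations; alpha controls the smoothing
    return sp.encode(text, out_type=int, enable_sampling=True, nbest_size=-1, alpha=0.1)

for epoch in range(2):
    for s in sentences:
        ids = sample_ids(s)  # IDs differ across epochs for the same sentence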

Common patterns

T5-style training

spm.SentencePieceTrainer.train(
    input='c4_corpus.txt',
    model_prefix='t5',
    vocab_size=32000,
    model_type='unigram',
    user_defined_symbols=[f'<extra_id_{i}>' for i in range(100)],
    unk_id=2,
    eos_id=1,
    pad_id=0,
    bos_id=-1  # T5 does not use a BOS token; also avoids a clash with eos_id=1
)
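
The sentinel tokens registered above behave like ordinary vocabulary entries, so T5-style span-corruption pairs can be built directly with the raw model. A minimal sketch, assuming the t5.model produced above (IDs are illustrative):

sp = spm.SentencePieceProcessor(model_file='t5.model')

# Each sentinel is a single piece with its own ID
print(sp.piece_to_id('<extra_id_0>'))

# Corrupted input / target pair in the T5 format
inp = sp.encode('The <extra_id_0> walks in <extra_id_1> park', out_type=int)
tgt = sp.encode('<extra_id_0> cute dog <extra_id_1> the <extra_id_2>', out_type=int)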

Integration with transformers

from transformers import T5Tokenizer

# T5 uses SentencePiece internally
tokenizer = T5Tokenizer.from_pretrained('t5-base')
inputs = tokenizer('translate English to French: Hello', return_tensors='pt')
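
The token strings produced by the HF wrapper are the underlying SentencePiece pieces, complete with the ▁ whitespace marker, which you can inspect directly. A short check (the exact pieces are illustrative):

print(tokenizer.tokenize('translate English to French: Hello'))
# e.g. ['▁translate', '▁English', '▁to', '▁French', ':', '▁Hello']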

Performance benchmarks

Training speed

Corpus    BPE (16k)    Unigram (8k)
100 MB    1-2 min      3-4 min
1 GB      10-15 min    30-40 min

Tokenization speed

  • SentencePiece: 50,000 sentences/sec
  • HF Tokenizers: 200,000 sentences/sec (4× faster)
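
Throughput depends on hardware, sentence length, and vocabulary size, so it is worth measuring on your own data. A rough sketch, assuming the m.model and data.txt from the quick start:

import time
import sentencepiece as spm

sp = spm.SentencePieceProcessor(model_file='m.model')

with open('data.txt') as f:
    sentences = f.read().splitlines()

start = time.time()
sp.encode(sentences)               # encode accepts a list of strings
elapsed = time.time() - start
print(f'{len(sentences) / elapsed:.0f} sentences/sec')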

Supported models

  • T5 family: t5-base, t5-large (32k vocab, Unigram)
  • ALBERT: albert-base-v2 (30k vocab, Unigram)
  • XLNet: xlnet-base-cased (32k vocab, Unigram)
  • mBART: facebook/mbart-large-50 (250k vocab, BPE)

Resources

GitHub Repository

davila7/claude-code-templates
Path: cli-tool/components/skills/ai-research/tokenization-sentencepiece

Related Skills

whisper

Other

Whisper is OpenAI's multilingual speech recognition model for transcribing and translating audio across 99 languages. It handles tasks like podcast transcription, meeting notes, and processing noisy or multilingual audio. Developers should use it for robust, production-ready speech-to-text capabilities.

View skill

huggingface-tokenizers

Documents

This Claude Skill provides high-performance tokenization using Rust-based HuggingFace tokenizers, supporting BPE, WordPiece, and Unigram algorithms. It enables custom tokenizer training, alignment tracking, and handles padding/truncation while integrating seamlessly with transformers. Use it when you need production-ready tokenization (under 20 seconds per GB) or require custom tokenizer development.

View skill

sentence-transformers

Meta

The sentence-transformers skill provides a production-ready framework for generating high-quality text and image embeddings using over 5,000 pre-trained models. It's ideal for developers needing semantic search, RAG implementations, clustering, or multilingual similarity tasks. Use it when you require state-of-the-art embeddings for retrieval or analysis in a scalable environment.

View skill

web-cli-teleport

Design

This skill helps developers choose between Claude Code Web and CLI interfaces based on task complexity and workflow needs. It enables seamless teleportation of sessions between environments to maintain context and optimize productivity. Use it for session management and to determine the best interface for coding tasks requiring different levels of iteration or back-and-forth interaction.

View skill