User Dictionary

Add custom words to improve analysis for your domain.

Runtime Loading

Load dictionary entries at runtime using loadUserDictionary():

typescript
const suzume = await Suzume.create()

// Add a single word
suzume.loadUserDictionary('ChatGPT,NOUN')

// Add multiple words
suzume.loadUserDictionary(`
スカイツリー,NOUN
ポケモン,NOUN
DeepL,NOUN
`)

Format

Basic Format

surface,pos
| Field | Required | Description |
|---|---|---|
| surface | Yes | The word as it appears in text |
| pos | Yes | Part of speech |

Full Format

surface,pos,cost,lemma
| Field | Required | Description |
|---|---|---|
| surface | Yes | The word as it appears in text |
| pos | Yes | Part of speech |
| cost | No | Word cost (lower = more likely to be selected) |
| lemma | No | Base/dictionary form |
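
When entries come from application data rather than hand-written text, it can help to build the lines programmatically. The following is a minimal sketch, not part of Suzume: `DictEntry`, `formatEntry`, and `formatDictionary` are hypothetical names. It assumes fields are strictly positional, that no quoting mechanism exists for commas (so they are rejected), and that the default cost of roughly 8000 listed under Cost Tuning is a reasonable filler when a lemma is given without a cost.

```typescript
// Hypothetical helper type; field names mirror the columns of the full format.
interface DictEntry {
  surface: string
  pos: string
  cost?: number
  lemma?: string
}

// Assumption: used only when a lemma is supplied without a cost, because the
// cost column must be present for the lemma column to line up.
const ASSUMED_DEFAULT_COST = 8000

// Turn one structured entry into a "surface,pos[,cost[,lemma]]" line.
function formatEntry(entry: DictEntry): string {
  const { surface, pos, cost, lemma } = entry
  if (surface === '' || pos === '') {
    throw new Error('surface and pos are required')
  }
  // Assumption: no escaping is documented, so commas and line breaks are rejected.
  for (const field of [surface, pos, lemma ?? '']) {
    if (/[,\r\n]/.test(field)) {
      throw new Error(`Field must not contain commas or line breaks: ${JSON.stringify(field)}`)
    }
  }
  const fields = [surface, pos]
  if (cost !== undefined || lemma !== undefined) {
    fields.push(String(cost ?? ASSUMED_DEFAULT_COST))
  }
  if (lemma !== undefined) {
    fields.push(lemma)
  }
  return fields.join(',')
}

// Join several entries into the multi-line text that loadUserDictionary() accepts.
function formatDictionary(entries: DictEntry[]): string {
  return entries.map(formatEntry).join('\n')
}
```

The result of `formatDictionary([...])` can then be passed to `suzume.loadUserDictionary()` in a single call.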

Part of Speech Values

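| Value | Description | Japanese |
|---|---|---|
| NOUN | Nouns, proper nouns | 名詞 |
| VERB | Verbs | 動詞 |
| ADJ | Adjectives | 形容詞 |
| ADV | Adverbs | 副詞 |
| PARTICLE | Particles | 助詞 |
| AUX | Auxiliary verbs | 助動詞 |
| PRON | Pronouns | 代名詞 |
| DET | Adnominal adjectives | 連体詞 |
| CONJ | Conjunctions | 接続詞 |
| INTJ | Interjections | 感動詞 |
| PREFIX | Prefixes | 接頭辞 |
| SUFFIX | Suffixes | 接尾辞 |
| SYMBOL | Symbols | 記号 |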

Japanese POS names

You can also use Japanese POS names (e.g., 名詞, 動詞, 形容詞) instead of English values.
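
If a dictionary mixes English and Japanese POS names, normalizing them before loading keeps stored files consistent and catches typos early. The sketch below is an illustration only: `normalizePos` is a hypothetical helper, and the mapping is copied from the table above. It deliberately treats lowercase values such as `noun` as invalid, following the uppercase recommendation in this guide; whether the library itself accepts them is not stated here.

```typescript
// Japanese POS names mapped to the English values from the table above.
const POS_BY_JAPANESE_NAME = new Map<string, string>([
  ['名詞', 'NOUN'],
  ['動詞', 'VERB'],
  ['形容詞', 'ADJ'],
  ['副詞', 'ADV'],
  ['助詞', 'PARTICLE'],
  ['助動詞', 'AUX'],
  ['代名詞', 'PRON'],
  ['連体詞', 'DET'],
  ['接続詞', 'CONJ'],
  ['感動詞', 'INTJ'],
  ['接頭辞', 'PREFIX'],
  ['接尾辞', 'SUFFIX'],
  ['記号', 'SYMBOL'],
])

const POS_VALUES = new Set<string>(Array.from(POS_BY_JAPANESE_NAME.values()))

// Returns the English POS value, or null when the input matches neither an
// English value nor a Japanese name from the table.
function normalizePos(pos: string): string | null {
  const trimmed = pos.trim()
  if (POS_VALUES.has(trimmed)) {
    return trimmed
  }
  return POS_BY_JAPANESE_NAME.get(trimmed) ?? null
}
```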

Examples

Tech Terms

csv
ChatGPT,NOUN
GitHub,NOUN
TypeScript,NOUN
WebAssembly,NOUN
Kubernetes,NOUN

Brand Names

csv
スカイツリー,NOUN
ポケモン,NOUN
任天堂,NOUN
ソニー,NOUN

Compound Words

csv
形態素解析,NOUN
機械学習,NOUN
自然言語処理,NOUN

Verbs with Conjugation

csv
ググる,VERB,5000,ググる
バズる,VERB,5000,バズる

Cost Tuning

The cost parameter controls word selection priority:

  • Lower cost = More likely to be selected
  • Default cost = ~8000
  • Common words = 5000-7000
  • Rare words = 9000+
csv
# Prefer "東京都" over "東京" + "都"
東京都,NOUN,5000

# Less common compound
超電磁砲,NOUN,9000
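
To see why a lower cost makes a word more likely to be selected, consider a toy segmenter that picks the split with the smallest total word cost. This is a simplified illustration, not Suzume's actual scoring: real analyzers typically also weigh how adjacent words connect, and the costs and the `segment` function below are made up for the example.

```typescript
// Surface form -> word cost (hypothetical values).
type Lexicon = Map<string, number>

// Find the segmentation of `text` with the lowest summed word cost,
// using dynamic programming over character positions.
function segment(text: string, lexicon: Lexicon): string[] | null {
  const chars = Array.from(text)
  const n = chars.length
  const best: number[] = new Array(n + 1).fill(Infinity) // best[i]: min cost to cover chars[0..i)
  const back: number[] = new Array(n + 1).fill(-1) // back[i]: start index of the last word
  best[0] = 0

  for (let end = 1; end <= n; end++) {
    for (let start = 0; start < end; start++) {
      if (best[start] === Infinity) continue
      const cost = lexicon.get(chars.slice(start, end).join(''))
      if (cost === undefined) continue
      if (best[start] + cost < best[end]) {
        best[end] = best[start] + cost
        back[end] = start
      }
    }
  }

  if (best[n] === Infinity) return null // no segmentation covers the whole text

  const tokens: string[] = []
  for (let pos = n; pos > 0; pos = back[pos]) {
    tokens.unshift(chars.slice(back[pos], pos).join(''))
  }
  return tokens
}

const systemOnly: Lexicon = new Map([['東京', 6000], ['都', 6000]])
console.log(segment('東京都', systemOnly)) // [ '東京', '都' ]  (total 12000)

const withUserEntry: Lexicon = new Map([...systemOnly, ['東京都', 5000]])
console.log(segment('東京都', withUserEntry)) // [ '東京都' ]  (total 5000)
```

In this model the single-token entry wins as long as its cost stays below the combined cost of the pieces, which is the intuition behind giving preferred compounds a low cost.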

Use Cases

Search Indexing

typescript
// Add domain-specific terms for better tokenization
suzume.loadUserDictionary(`
React,NOUN
Next.js,NOUN
Tailwind,NOUN
`)

const tags = suzume.generateTags('Next.jsでReactアプリを作成')
// ['Next.js', 'React', 'アプリ', '作成']

Chat Applications

typescript
// Add slang and neologisms
suzume.loadUserDictionary(`
草,INTJ
ワロタ,INTJ
エモい,ADJ
`)

E-commerce

typescript
// Add product names and brands
suzume.loadUserDictionary(`
iPhone,NOUN
MacBook,NOUN
AirPods,NOUN
`)

Best Practices

  1. Keep entries minimal - Only add words that are mis-tokenized
  2. Use uppercase POS - NOUN not noun
  3. Test incrementally - Add a few words and verify results
  4. Consider compounds - Add 東京都 if you want it as one token
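
One way to follow practice 3 is to load entries in small batches and check each batch before moving on. The sketch below is an assumption-laden outline: `DictionaryTarget` captures only the `loadUserDictionary()` call shown in this guide, and `verify` is a hypothetical callback standing in for whatever analysis check fits your application.

```typescript
// Minimal view of the analyzer: only the method used in this guide.
interface DictionaryTarget {
  loadUserDictionary(entries: string): void
}

// Load `entries` in batches of `batchSize` and return the batches
// for which `verify` reported a problem.
function loadInBatches(
  target: DictionaryTarget,
  entries: string[],
  batchSize: number,
  verify: (batch: string[]) => boolean,
): string[][] {
  if (!Number.isInteger(batchSize) || batchSize < 1) {
    throw new RangeError('batchSize must be a positive integer')
  }
  const failed: string[][] = []
  for (let i = 0; i < entries.length; i += batchSize) {
    const batch = entries.slice(i, i + batchSize)
    target.loadUserDictionary(batch.join('\n'))
    if (!verify(batch)) {
      failed.push(batch)
    }
  }
  return failed
}
```

A small batch size keeps the list of suspect entries short when a check fails, at the price of more load calls.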

Binary Dictionary

For faster loading, dictionaries can be pre-compiled to binary format (.dic) using the suzume-cli tool:

bash
# Compile TSV to binary
suzume-cli dict compile user.tsv   # → user.dic

Then load the binary dictionary at runtime:

typescript
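// Node.js: read the compiled dictionary from disk
import { readFile } from 'fs/promises'
const fileDictData = new Uint8Array(await readFile('user.dic'))
suzume.loadBinaryDictionary(fileDictData)

// Browser: fetch the compiled dictionary instead of reading a file
const response = await fetch('/dictionaries/user.dic')
if (!response.ok) {
  throw new Error(`Failed to fetch dictionary: ${response.status}`)
}
const fetchedDictData = new Uint8Array(await response.arrayBuffer())
suzume.loadBinaryDictionary(fetchedDictData)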

Performance

Binary dictionaries load significantly faster than CSV format, making them ideal for production deployments with large custom vocabularies.

.dic Format Overview

The binary dictionary is a compact format with the following layout:

[Header (40 bytes, magic: "SZMD")]
[Double-Array Trie]
[Entry Array (12 bytes each)]
[String Pool (UTF-8)]

  • Double-array trie: enables fast common-prefix lookup of surface forms, O(m) per query, where m is the length of the text being matched
  • Entry array: each entry stores string pool offsets for the surface and lemma, along with POS and flags
  • String pool: concatenated, deduplicated UTF-8 strings
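
Common-prefix lookup means finding every dictionary surface that matches the start of the remaining text. The sketch below illustrates the operation with a plain map-based trie rather than a double-array trie; the double-array layout packs the same transitions into flat arrays for speed and compactness. `SurfaceTrie` and its methods are hypothetical names, not Suzume APIs.

```typescript
class TrieNode {
  children = new Map<string, TrieNode>()
  entryIds: number[] = [] // indexes into an entry array; empty if no word ends here
}

interface PrefixMatch {
  surface: string
  entryIds: number[]
}

class SurfaceTrie {
  private root = new TrieNode()

  insert(surface: string, entryId: number): void {
    let node = this.root
    for (const ch of Array.from(surface)) {
      let next = node.children.get(ch)
      if (!next) {
        next = new TrieNode()
        node.children.set(ch, next)
      }
      node = next
    }
    node.entryIds.push(entryId)
  }

  // Walk the text once from its start, reporting every stored surface passed
  // on the way. The walk stops at the first character with no transition.
  commonPrefixSearch(text: string): PrefixMatch[] {
    const matches: PrefixMatch[] = []
    let node = this.root
    let prefix = ''
    for (const ch of Array.from(text)) {
      const next = node.children.get(ch)
      if (!next) break
      node = next
      prefix += ch
      if (node.entryIds.length > 0) {
        matches.push({ surface: prefix, entryIds: [...node.entryIds] })
      }
    }
    return matches
  }
}

const trie = new SurfaceTrie()
trie.insert('東京', 0)
trie.insert('東京都', 1)
trie.insert('京都', 2)
console.log(trie.commonPrefixSearch('東京都庁').map((m) => m.surface)) // [ '東京', '東京都' ]
```

Each query takes one trie step per input character, which is where the O(m) bound comes from.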

During compilation, verbs and adjectives are expanded into their conjugated forms, and all entries are sorted before being packed into the trie.
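
Given the layout above, a cheap sanity check before calling `loadBinaryDictionary()` is to confirm that the buffer is at least as long as the 40-byte header and carries the magic string. The sketch assumes the magic occupies the first four bytes as ASCII, which this overview does not state explicitly; `hasSuzumeDictHeader` is a hypothetical helper, not part of Suzume.

```typescript
const HEADER_SIZE = 40 // header length from the layout above
const MAGIC = 'SZMD' // magic string from the layout above

// Assumption: the magic string sits at byte offset 0 of the header.
function hasSuzumeDictHeader(data: Uint8Array): boolean {
  if (data.length < HEADER_SIZE) {
    return false
  }
  for (let i = 0; i < MAGIC.length; i++) {
    if (data[i] !== MAGIC.charCodeAt(i)) {
      return false
    }
  }
  return true
}
```

A check like this can distinguish a compiled `.dic` file from, say, an HTML error page returned by a misconfigured server, but it does not validate the rest of the file.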

Persistence

Dictionary entries are held in memory and are lost when the instance is destroyed. To persist them, save the entries in your own storage and reload them on initialization:

typescript
// Load from your storage on init
const savedDict = localStorage.getItem('myDictionary')
if (savedDict) {
  suzume.loadUserDictionary(savedDict)
}

// Save when adding new words
function addWord(word: string, pos: string) {
  const entry = `${word},${pos}`
  suzume.loadUserDictionary(entry)

  // Append to storage, avoiding a leading blank line on the first save
  const current = localStorage.getItem('myDictionary')
  localStorage.setItem('myDictionary', current ? `${current}\n${entry}` : entry)
}
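
Appending blindly stores the same entry again each time a word is re-added. A possible refinement, sketched below, skips entries that are already saved. `KeyValueStore` and `appendEntry` are hypothetical names; the interface lists only the two methods used, so it can be backed by `localStorage` or by any other storage with the same shape.

```typescript
// The subset of the Web Storage API that this helper needs.
interface KeyValueStore {
  getItem(key: string): string | null
  setItem(key: string, value: string): void
}

// Append `entry` to the dictionary text saved under `key`.
// Returns false when the entry is blank or already stored.
function appendEntry(store: KeyValueStore, key: string, entry: string): boolean {
  const saved = store.getItem(key) ?? ''
  const lines = saved
    .split('\n')
    .map((line) => line.trim())
    .filter((line) => line !== '')
  const line = entry.trim()
  if (line === '' || lines.includes(line)) {
    return false
  }
  lines.push(line)
  store.setItem(key, lines.join('\n'))
  return true
}
```

In a browser, `appendEntry(localStorage, 'myDictionary', 'ChatGPT,NOUN')` would work as is, since `localStorage` provides both methods.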