How SQLite FTS5's tokenizer silently blocks Japanese search — and the fix

When you rely on SQLite’s FTS5 extension for full-text search, you expect reliable substring matching across languages. But the default unicode61 tokenizer introduces a subtle but critical limitation: it fails to index Japanese (and other CJK) text properly, causing full-text queries to miss content that exists right in your database.

This issue often goes unnoticed until users actively search for Japanese phrases or mixed-language content. For personal knowledge bases, note-taking apps, or memory layers built on top of SQLite, this silent failure can mean half your data remains invisible — even though it’s stored.

Why the default tokenizer misses Japanese text

FTS5’s default unicode61 tokenizer splits text into tokens based on whitespace and Unicode word boundaries. While this works well for English and other space-delimited languages, it breaks down for Japanese, Chinese, Korean, and Thai, where words are not separated by spaces.

For example, consider this Japanese phrase inserted into an FTS5 table:

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE VIRTUAL TABLE notes USING fts5(content)
""")

conn.execute("""
    INSERT INTO notes(content) 
    VALUES ('Tried time-blocking with the new 朝の運用フロー — the 9-11 slot worked but the 午後 part collapsed again.')
""")

When searching for Japanese substrings like 朝の運用 or 午後, the query returns zero results — even though the text exists in the table:

for q in ["time", "blocking", "朝の運用", "午後"]:
    hits = conn.execute(
        "SELECT count(*) FROM notes WHERE content MATCH ?", (q,)
    ).fetchone()[0]
    print(f"{q!r}: {hits} hit(s)")

Output:

'time': 1 hit(s)
'blocking': 1 hit(s)
'朝の運用': 0 hit(s)
'午後': 0 hit(s)

The issue stems from how the tokenizer processes the Japanese text. The phrase 朝の運用フロー is treated as a single, unbreakable token because there are no spaces. Searching for 朝の運用 effectively asks FTS5: “Is this exact token equal to 朝の運用?” Since the token is 朝の運用フロー, the match fails.

This behavior isn’t a bug — it’s the documented behavior of the unicode61 tokenizer. But for users building multilingual search tools, it acts like a silent recall failure that can go undetected for weeks or even months.

The one-line fix: switch to trigram tokenization

The solution is simple: switch from the default unicode61 tokenizer to the trigram tokenizer. This change requires only one line of SQL:

CREATE VIRTUAL TABLE notes USING fts5(
    content,
    tokenize = 'trigram'
);

With the trigram tokenizer, FTS5 no longer relies on word boundaries. Instead, it indexes every overlapping 3-character sequence (trigram) from the input text. Re-running the same test now returns all four queries with full matches:

'time': 1 hit(s)
'blocking': 1 hit(s)
'朝の運用': 1 hit(s)
'午後': 1 hit(s)

How does it work? The Japanese phrase 朝の運用フロー is broken down into the following trigrams:

朝の
の運
運用
用フ
フロー

Now, when you search for 朝の運用, FTS5 checks whether the trigrams 朝の and の運用 exist in the index — and they do, enabling the query to return a match.

This approach is language-agnostic: it works equally well for Japanese, Chinese, Korean, or Thai, since it operates at the character-sequence level rather than relying on word boundaries.

However, there are trade-offs to consider. Trigram indexing increases the size of the full-text index significantly — typically 1.5x to 2x the size of the raw text. For personal knowledge bases with a few megabytes of content, this is negligible. For large corpora in the gigabytes or terabytes, the storage impact should be evaluated carefully.

Another consideration: trigram searches require queries of at least three characters to work efficiently. Searches for single characters (e.g., 朝) or two-character sequences (e.g., 朝の) may degrade in performance, as FTS5 falls back to less efficient query plans. In application code, it’s wise to filter out short queries or handle them separately.

Production setup: Git as source of truth, SQLite as fast cache

A practical way to implement this fix is to pair SQLite’s FTS5 trigram index with a Git repository holding your raw content. This architecture provides durability, human readability, and easy recovery, while the SQLite index delivers fast search performance.

Here’s how it works in practice. Your raw content lives in Git as human-readable JSON or Markdown files, organized by date:

sample_events/
├── 2026-06-09/
│   ├── abc-def.json
│   └── ghi-jkl.json
└── _data/
    └── memory_index.db  (gitignored)

Each file represents an event or decision, with metadata and content:

{
  "event_id": "0bf61ebd-c835-4a45-a147-675369258d61",
  "timestamp": "2026-05-09T12:00:00Z",
  "type": "decision",
  "content": "Picked SQLite over a vector store because grep needs to still work.",
  "metadata": {
    "session": "S8",
    "tags": ["architecture", "memory"]
  }
}

When a new event is saved, the system performs three steps in sequence:

Write the event as a JSON file to disk.
Commit the file to Git with a descriptive message.
Update the SQLite FTS5 trigram index with the new content.

This design ensures the Git repository remains the source of truth. The SQLite file is disposable — if corrupted, accidentally deleted, or rebuilt on a new machine, it can be regenerated instantly from the Git history. This eliminates risk while maintaining fast search performance.

Since the SQLite index is not synced or versioned, it can be safely added to .gitignore. The index rebuild process takes seconds, even for hundreds of documents, making it ideal for personal or team memory tools.

Another FTS5 footgun: the hyphen character

Beyond the tokenizer issue, FTS5 introduces another subtle trap: the hyphen (-) character. In FTS5, the hyphen is treated as a token separator unless explicitly escaped or configured otherwise. This can break queries for compound terms like time-blocking or 9-11 slot.

For example, searching for time-blocking may split the query into time and blocking, reducing precision. To avoid this, either:

Use quotes in your query: "time-blocking"
Escape the hyphen in the query: time\-blocking
Configure the tokenizer to treat hyphens as part of tokens, depending on your use case

This quirk is well-documented in SQLite’s FTS5 documentation but often overlooked in application logic, leading to unexpected search failures.

Takeaways and next steps

If your full-text search relies on SQLite FTS5 and includes non-English content, especially Japanese, Chinese, or Korean, the default unicode61 tokenizer may be silently filtering out your queries. Switching to the trigram tokenizer restores full recall with minimal effort — one line of SQL.

For production systems, pair the trigram index with a Git-backed source of truth to ensure durability and recoverability. And remember to handle edge cases like hyphenated terms and short queries to maintain query accuracy and performance.

Before you ship your next multilingual search tool, test your tokenizer choice with real data. You might discover your search index is far more limited than you expected.

AI summary

SQLite FTS5’in varsayılan ayarları Japonca, Çince ya da Korece metinlerde arama yaparken neden çalışmaz? Trigram tokenizer ile nasıl basit bir şekilde çözebilirsiniz? Detaylı açıklama ve uygulama örneği.

How SQLite FTS5's tokenizer silently blocks Japanese search — and the fix

Why the default tokenizer misses Japanese text

The one-line fix: switch to trigram tokenization

Production setup: Git as source of truth, SQLite as fast cache

Another FTS5 footgun: the hyphen character

Takeaways and next steps

Comments

How Quantization Affects AI Model Speed and Output Quality

How a 5-Minute Memory Window Shapes AI Conversations

Automate AWS EC2 with Python using Terraform in minutes