enhance: better word division for highlighting by gotoh · Pull Request #2251 · sourcegit-scm/sourcegit

gotoh · 2026-04-08T06:32:06Z

Summary

Improve inline diff highlighting by replacing the ASCII delimiter–based
word division with a Unicode category–based approach, primarily to
produce more precise highlights in languages like Japanese and Chinese.

Problem

The previous implementation splits text at a fixed set of delimiter
characters (spaces, tabs, and common ASCII symbols such as +-*/=!,;).
Non-delimiter characters — including CJK ideographs, Hiragana, and
Katakana — are never treated as boundaries, so a small change within
Japanese or Chinese text causes the entire surrounding phrase to be
highlighted as changed.
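
To make the problem concrete, here is a minimal sketch of delimiter-based splitting. It is not the project's actual code: the `DELIMITERS` set and the `split_by_delimiters` name are assumptions, built only from the symbols listed above.

```python
# Hypothetical delimiter set: the description only names spaces, tabs,
# and "common ASCII symbols such as +-*/=!,;".
DELIMITERS = set(" \t+-*/=!,;")


def split_by_delimiters(text: str) -> list[str]:
    """Emit each delimiter as its own chunk; keep runs of
    non-delimiter characters together as one chunk."""
    chunks: list[str] = []
    start = 0
    for i, ch in enumerate(text):
        if ch in DELIMITERS:
            if i > start:
                chunks.append(text[start:i])  # pending non-delimiter run
            chunks.append(ch)                 # the delimiter itself
            start = i + 1
    if start < len(text):
        chunks.append(text[start:])           # trailing run
    return chunks


print(split_by_delimiters("a == b"))
# ['a', ' ', '=', '=', ' ', 'b']
print(split_by_delimiters("設定を変更しました。"))
# ['設定を変更しました。']
```

The Japanese sentence contains no ASCII delimiter, so it comes back as a single chunk, and any edit inside it marks the whole sentence as changed.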

Solution

Each character is classified into one of three categories. Consecutive
characters of the same category are grouped into one chunk, except for
the Other category which retains the same per-character behavior as
the previous implementation:

| Category | Characters | Chunking |
| --- | --- | --- |
| `Letter` | Latin, Greek, Cyrillic and diacritic variants (é, ü, ß…) | grouped |
| `OtherLetter` | CJK, Hiragana, Katakana, Hangul, Thai, Arabic, etc. | grouped |
| `Other` | Whitespace, punctuation, symbols (same as previous delimiters) | per character |

CJK punctuation (。、「」…) falls into Other and acts as a natural
boundary between OtherLetter chunks, making highlighted changes
more precise without requiring language-specific word segmentation.
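
The classification and grouping described above can be sketched roughly as follows. This is an illustration, not the merged implementation: the `classify` and `chunk` names are hypothetical, and mapping "digits" to the Unicode `Nd` category is an assumption.

```python
import unicodedata
from itertools import groupby

LETTER, OTHER_LETTER, OTHER = "Letter", "OtherLetter", "Other"


def classify(ch: str) -> str:
    """Map one character to one of the three categories."""
    cat = unicodedata.category(ch)
    # Assumption: "digits" means Unicode decimal digits (Nd).
    if cat in ("Lu", "Ll", "Lt", "Lm", "Nd"):
        return LETTER
    if cat == "Lo":
        return OTHER_LETTER
    return OTHER


def chunk(text: str) -> list[str]:
    """Group consecutive characters of the same category into one chunk,
    except Other, where every character is its own chunk."""
    chunks: list[str] = []
    for category, run in groupby(text, key=classify):
        chars = list(run)
        if category == OTHER:
            chunks.extend(chars)            # per-character precision
        else:
            chunks.append("".join(chars))   # one chunk per run
    return chunks


print(chunk("x != y"))
# ['x', ' ', '!', '=', ' ', 'y']
print(chunk("設定をABCに変更。"))
# ['設定を', 'ABC', 'に変更', '。']
```

In the second example the embedded Latin word and the closing 。 both act as boundaries, so an edit to 変更 no longer highlights the whole sentence.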

Category values for all 65,536 char values are pre-computed into a
static read-only array at startup for lock-free O(1) lookup.
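
A rough sketch of such a pre-computed table, assuming one byte per 16-bit code unit. The names `CATEGORY_TABLE` and `category_of` are hypothetical, and treating code points above U+FFFF as Other is an assumption that mirrors how lone surrogate code units would be classified.

```python
import unicodedata

LETTER, OTHER_LETTER, OTHER = 0, 1, 2


def _classify(code_unit: int) -> int:
    """Classify one 16-bit value by its Unicode general category."""
    cat = unicodedata.category(chr(code_unit))
    # Assumption: "digits" means Unicode decimal digits (Nd).
    if cat in ("Lu", "Ll", "Lt", "Lm", "Nd"):
        return LETTER
    if cat == "Lo":
        return OTHER_LETTER
    return OTHER  # includes surrogates (Cs) and unassigned values (Cn)


# Built once at import time. `bytes` is immutable, so concurrent readers
# need no locking.
CATEGORY_TABLE: bytes = bytes(_classify(u) for u in range(0x10000))


def category_of(ch: str) -> int:
    """O(1) lookup; anything outside the table is treated as Other."""
    code = ord(ch)
    return CATEGORY_TABLE[code] if code < 0x10000 else OTHER


print(len(CATEGORY_TABLE))  # 65536
```

The table costs 64 KiB of memory in exchange for skipping a category computation on every character of every diffed line.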

Japanese — before / after

[Screenshots: Before | After]

English — no regression

[Screenshots: Before | After]

Check point

- Japanese diff: non-CJK characters mixed into a sentence split it into chunks, even without spaces.
- Japanese diff: CJK punctuation correctly splits chunks at sentence boundaries.
- ASCII operators and punctuation: per-character precision is preserved (/ vs *, == vs !=, etc.).
- European words with diacritics (café, über…): treated as single word chunks.

Each line is divided into several chunks to highlight the changes. The previous implementation splits text at a fixed set of delimiter characters (spaces, tabs, and common ASCII symbols such as `+-*/=!,;`). Non-delimiter characters, including CJK ideographs, Hiragana, and Katakana, are never treated as boundaries, so they tend to form large, coarse chunks in languages like Japanese or Chinese that do not use spaces to separate words. A small change within such text causes the entire surrounding phrase to be highlighted.

This new implementation classifies each character into one of three categories and groups consecutive characters of the same category into one chunk, except for the Other category, which is always split character by character:

- Letter (Unicode Ll/Lu/Lt/Lm + digits): ASCII letters, digits, and letters with diacritics such as é, ü, ß, ñ, ё. Consecutive Letter characters form one chunk, keeping European words intact.
- OtherLetter (Unicode Lo): CJK, Hiragana, Katakana, Hangul, Thai, Arabic, Hebrew, etc. Consecutive OtherLetter characters form one chunk. CJK punctuation (。、「」…) falls into the Other category and therefore acts as a natural boundary between chunks.
- Other (default): whitespace, control characters, punctuation, and symbols. This category corresponds to the delimiter characters of the previous implementation. Each character is always its own chunk, preserving the same per-character precision as before for operators, spaces, and punctuation.

Category values for all 65,536 char values are pre-computed into a static read-only array at startup for lock-free O(1) lookup.

love-linger self-assigned this Apr 8, 2026

love-linger added the enhancement New feature or request label Apr 8, 2026

love-linger merged commit a3c0b22 into sourcegit-scm:develop Apr 8, 2026
14 checks passed

gotoh deleted the highlighting branch April 8, 2026 06:56
