enhance: better word division for highlighting #2251

Merged
love-linger merged 1 commit into sourcegit-scm:develop from gotoh:highlighting
Apr 8, 2026

gotoh commented Apr 8, 2026

Summary

Improve inline diff highlighting by replacing the ASCII delimiter–based
word division with a Unicode category–based approach, primarily to
produce more precise highlights in languages like Japanese and Chinese.

Problem

The previous implementation splits text at a fixed set of delimiter
characters (spaces, tabs, and common ASCII symbols such as +-*/=!,;).
Non-delimiter characters — including CJK ideographs, Hiragana, and
Katakana — are never treated as boundaries, so a small change within
Japanese or Chinese text causes the entire surrounding phrase to be
highlighted as changed.
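
To make the problem concrete, here is a minimal Python sketch of delimiter-based splitting. It is not the project's code: the function name `split_by_delimiters` and the exact contents of `DELIMITERS` are assumptions for illustration, since the description above only names a few of the delimiter characters.

```python
# Assumed delimiter set: the PR only says "spaces, tabs, and common ASCII
# symbols such as +-*/=!,;", so the real set may contain more characters.
DELIMITERS = set(" \t+-*/=!,;")


def split_by_delimiters(text: str) -> list[str]:
    """Split text so that every delimiter is its own chunk and every run of
    non-delimiter characters stays together as one chunk."""
    chunks = []
    start = 0  # start index of the current run of non-delimiter characters
    for i, ch in enumerate(text):
        if ch in DELIMITERS:
            if i > start:
                chunks.append(text[start:i])  # flush the pending run
            chunks.append(ch)  # the delimiter itself is a one-character chunk
            start = i + 1
    if start < len(text):
        chunks.append(text[start:])  # flush the trailing run
    return chunks
```

With this sketch, `split_by_delimiters("a+b = c")` yields seven small chunks, while a Japanese string such as `"今日は晴れです。明日は雨です。"` comes back as a single chunk, because none of its characters (including the full-width `。`) is in the ASCII delimiter set.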

Solution

Each character is classified into one of three categories. Consecutive
characters of the same category are grouped into one chunk, except for
the Other category, which keeps the per-character splitting of the
previous implementation:

| Category | Characters | Chunking |
|---|---|---|
| Letter | Latin, Greek, Cyrillic and diacritic variants (é, ü, ß…) | grouped |
| OtherLetter | CJK, Hiragana, Katakana, Hangul, Thai, Arabic, etc. | grouped |
| Other | Whitespace, punctuation, symbols (same as previous delimiters) | per character |

CJK punctuation (。、「」…) falls into Other and acts as a natural
boundary between OtherLetter chunks, making highlighted changes
more precise without requiring language-specific word segmentation.
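
The classification and grouping rules can be sketched in Python as follows. This is an illustration under assumptions, not the merged implementation: the names `classify` and `chunk` are hypothetical, the Letter rule follows the Unicode categories given in the commit message (Ll/Lu/Lt/Lm plus digits), and "digits" is assumed here to mean the Unicode category Nd.

```python
import unicodedata

LETTER, OTHER_LETTER, OTHER = "Letter", "OtherLetter", "Other"


def classify(ch: str) -> str:
    """Map one character to one of the three categories."""
    cat = unicodedata.category(ch)
    # Cased and modifier letters, plus decimal digits (assumption: "digits" = Nd).
    if cat in ("Lu", "Ll", "Lt", "Lm", "Nd"):
        return LETTER
    # Letters without case, such as CJK ideographs, Hiragana, Katakana, Hangul.
    if cat == "Lo":
        return OTHER_LETTER
    # Whitespace, punctuation, symbols, control characters, and the rest.
    return OTHER


def chunk(text: str) -> list[str]:
    """Group consecutive Letter or OtherLetter characters into chunks;
    every Other character becomes a chunk of its own."""
    chunks = []
    start = 0
    prev = None
    for i, ch in enumerate(text):
        cat = classify(ch)
        # Start a new chunk when the category changes, or always for Other.
        if i > 0 and (cat == OTHER or cat != prev):
            chunks.append(text[start:i])
            start = i
        prev = cat
    if text:
        chunks.append(text[start:])
    return chunks
```

Under these assumptions, `chunk("今日は晴れ。明日は雨")` gives `['今日は晴れ', '。', '明日は雨']`, and `chunk("東京Tokyo駅")` gives `['東京', 'Tokyo', '駅']`, so both CJK punctuation and embedded Latin text act as boundaries without any dictionary-based segmentation.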

Category values for all 65,536 char values are pre-computed into a
static read-only array at startup for lock-free O(1) lookup.
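
A pre-computed lookup table of this kind could look like the Python sketch below. The names `CATEGORY_TABLE` and `lookup` are hypothetical, the numeric category codes are arbitrary, and treating code points above U+FFFF as Other is an assumption made for this example (the description only covers the 65,536 16-bit values).

```python
import unicodedata

LETTER, OTHER_LETTER, OTHER = 0, 1, 2  # arbitrary codes for this sketch


def _classify(code_unit: int) -> int:
    """Slow path, used only once per value while building the table."""
    cat = unicodedata.category(chr(code_unit))
    if cat in ("Lu", "Ll", "Lt", "Lm", "Nd"):  # assumption: digits = Nd
        return LETTER
    if cat == "Lo":
        return OTHER_LETTER
    return OTHER


# Built once at import time. bytes is immutable, so concurrent readers can
# index it without any locking.
CATEGORY_TABLE = bytes(_classify(cu) for cu in range(0x10000))


def lookup(ch: str) -> int:
    """Constant-time category lookup for a single character."""
    cu = ord(ch)
    # Assumption: anything outside the 16-bit range is treated as Other.
    return CATEGORY_TABLE[cu] if cu < 0x10000 else OTHER
```

The table costs 64 KiB at one byte per entry, and each lookup is a single index operation instead of a Unicode category query per character.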

Japanese — before / after

[Screenshots: Japanese diff highlighting, before and after]

English — no regression

[Screenshots: English diff highlighting, before and after]

Checkpoints

  • Japanese diff: chunks are split where non-CJK characters are mixed into a sentence, even without spaces
  • Japanese diff: CJK punctuation correctly splits chunks at sentence boundaries
  • ASCII operators and punctuation: per-character precision is preserved (/ vs *, == vs !=, etc.)
  • European words with diacritics (café, über…): treated as single-word chunks

Each line is divided into several chunks to highlight the changes.

The previous implementation splits text at a fixed set of delimiter
characters (spaces, tabs, and common ASCII symbols such as `+-*/=!,;`).
Non-delimiter characters — including CJK ideographs, Hiragana, and
Katakana — are never treated as boundaries, so they tend to form large,
coarse chunks in languages like Japanese or Chinese that do not use
spaces to separate words. A small change within such text causes the
entire surrounding phrase to be highlighted.

This new implementation classifies each character into one of three
categories and groups consecutive characters of the same category into
one chunk, except for the Other category which is always split
character by character:

- Letter (Unicode Ll/Lu/Lt/Lm + digits): ASCII letters, digits, and
  letters with diacritics such as é, ü, ß, ñ, ё. Consecutive Letter
  characters form one chunk, keeping European words intact.
- OtherLetter (Unicode Lo): CJK, Hiragana, Katakana, Hangul, Thai,
  Arabic, Hebrew, etc. Consecutive OtherLetter characters form one
  chunk. CJK punctuation (。、「」…) falls into the Other category
  and therefore acts as a natural boundary between chunks.
- Other (default): whitespace, control characters, punctuation, and
  symbols. This category corresponds to the delimiter characters of
  the previous implementation. Each character is always its own chunk,
  preserving the same per-character precision as before for operators,
  spaces, and punctuation.

Category values for all 65,536 char values are pre-computed into a
static read-only array at startup for lock-free O(1) lookup.

love-linger self-assigned this Apr 8, 2026
love-linger added the enhancement (New feature or request) label Apr 8, 2026
love-linger merged commit a3c0b22 into sourcegit-scm:develop Apr 8, 2026
14 checks passed
gotoh deleted the highlighting branch April 8, 2026 06:56