The Architecture of Kanji: Components, Positions, and Composition Rules

hbaristr 6 min read

Kanji are a compression algorithm

Unicode's CJK Unified Ideographs blocks together encode 97,680 characters. The KanjiJump decomposition reduces the 3,500 most-used Japanese kanji to 281 atomic components, or 200 if you fold positional variants together. That is roughly 12:1 compression, from an inventory of fewer than 300 parts. The Cihai dictionary catalogs 675 primitives across 16,339 characters; a 2009 Chinese national standard trims the inventory to 514 for common use.
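The ratios behind these figures are easy to check. The snippet below is only a back-of-the-envelope sketch: the numbers are the ones quoted in the paragraph above, while the labels and the `ratio` helper are made up for illustration.

```python
# (characters covered, component count), as quoted in the text above.
inventories = {
    "KanjiJump, positional variants kept": (3500, 281),
    "KanjiJump, positional variants folded": (3500, 200),
    "Cihai": (16339, 675),
}


def ratio(characters: int, components: int) -> float:
    """Characters described per component in the inventory."""
    return characters / components


for name, (characters, components) in inventories.items():
    print(f"{name}: {ratio(characters, components):.1f} : 1")
```

With the quoted numbers this gives about 12.5, 17.5, and 24.2 characters per component.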

This is not metaphor. Kanji form a combinatorial writing system with a formal grammar, positional constraints, and a phonetic encoding layer — properties Unicode has literally codified into twelve composition operators.

The twelve composition operators

The Unicode range U+2FF0–U+2FFB holds the original twelve Ideographic Description Characters (IDCs): prefix operators that describe how components combine. Together they behave like a context-free grammar for glyph structure:

| Symbol | Code Point | Name | Example Decomposition |
|---|---|---|---|
| ⿰ | U+2FF0 | Left to right | ⿰木目 |
| ⿱ | U+2FF1 | Above to below | ⿱木口 |
| ⿲ | U+2FF2 | Left-middle-right | ⿲彳氵亍 |
| ⿳ | U+2FF3 | Above-middle-below | ⿳亠口小 |
| ⿴ | U+2FF4 | Full surround | ⿴囗口 |
| ⿵ | U+2FF5 | Surround from above | ⿵几皇 |
| ⿶ | U+2FF6 | Surround from below | ⿶凵㐅 |
| ⿷ | U+2FF7 | Surround from left | ⿷匚斤 |
| ⿸ | U+2FF8 | Above-left surround | ⿸疒丙 |
| ⿹ | U+2FF9 | Above-right surround | ⿹戈廾 |
| ⿺ | U+2FFA | Below-left surround | ⿺走召 |
| ⿻ | U+2FFB | Overlaid | ⿻工从 |

Figure: Examples of Ideographic Description Sequences. 字 decomposes as ⿱宀子, 匠 as ⿷匚斤, 京 as ⿳亠口小, 米 as ⿻八木. Each character is rewritten as an IDC operator followed by its operand components. Source: Wikimedia Commons.

Ten of the twelve are binary; ⿲ and ⿳ take three operands. Unicode 15.1 added four more (U+2FFC–U+2FFF) for left-open surround, bottom-right surround, horizontal reflection, and rotation, bringing the total to sixteen; the reflection and rotation operators each take a single operand. The original twelve carry most of the load.
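Because every operator has a fixed operand count, a prefix IDS string can be parsed with a few lines of recursive descent. The sketch below covers only the original twelve operators; `Node`, `parse_ids`, and `_parse` are illustrative names, not part of any existing library.

```python
from dataclasses import dataclass
from typing import Union

# Operand counts for the original twelve operators (U+2FF0–U+2FFB).
ARITY = {chr(cp): 2 for cp in range(0x2FF0, 0x2FFC)}
ARITY["\u2FF2"] = 3  # ⿲ left-middle-right
ARITY["\u2FF3"] = 3  # ⿳ above-middle-below


@dataclass
class Node:
    op: str          # the IDC operator at this node
    children: list   # operands: component strings or nested Node objects


Tree = Union[str, Node]


def parse_ids(ids: str) -> Tree:
    """Parse one prefix-notation IDS string into a tree."""
    tree, end = _parse(ids, 0)
    if end != len(ids):
        raise ValueError(f"unexpected trailing characters at position {end}")
    return tree


def _parse(ids: str, pos: int):
    """Return (subtree, next position) for the expression starting at pos."""
    if pos >= len(ids):
        raise ValueError("sequence ended before all operands were read")
    ch = ids[pos]
    pos += 1
    if ch not in ARITY:
        return ch, pos  # a plain component is a leaf
    children = []
    for _ in range(ARITY[ch]):
        child, pos = _parse(ids, pos)
        children.append(child)
    return Node(ch, children), pos


print(parse_ids("⿱宀子"))
print(parse_ids("⿰言⿱五口"))  # nested: the right operand is itself an IDS
```

Supporting the four Unicode 15.1 operators would mean extending `ARITY`, including the single-operand entries for reflection and rotation.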

Figure: The full Ideographic Description Characters block (U+2FF0–U+2FFF), sixteen code points including the four operators added in Unicode 15.1. Source: Wikimedia Commons.

The CHISE project and the cjkvi-ids database have applied IDS decomposition to over 75,000 CJK ideographs — a machine-readable structural atlas of the entire character space. The distribution is heavily skewed. Gao and Kao (2002) found that over 60% of high-frequency characters use ⿰ (left-right), roughly 20% use ⿱ (top-bottom), and the rest split across enclosure and overlay. Left-right dominance is a fingerprint of the phono-semantic architecture: semantic radical on the left, phonetic on the right.
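A distribution like this can be measured by tallying the first operator of each IDS string. The following is a minimal sketch over a hand-picked sample: the `SAMPLE` table and the `top_level_shares` function are hypothetical, and eight characters say nothing about real frequencies.

```python
from collections import Counter

# Every code point in the Ideographic Description Characters block.
IDC = {chr(cp) for cp in range(0x2FF0, 0x3000)}

# Tiny illustrative sample; a real count would load a full IDS database.
SAMPLE = {
    "相": "⿰木目",
    "語": "⿰言吾",
    "字": "⿱宀子",
    "杏": "⿱木口",
    "京": "⿳亠口小",
    "匠": "⿷匚斤",
    "病": "⿸疒丙",
    "超": "⿺走召",
}


def top_level_shares(ids_by_char: dict) -> dict:
    """Share of characters whose outermost operator is each IDC."""
    counts = Counter(
        ids[0] for ids in ids_by_char.values() if ids and ids[0] in IDC
    )
    total = sum(counts.values())
    return {op: n / total for op, n in counts.most_common()}


for op, share in top_level_shares(SAMPLE).items():
    print(op, f"{share:.0%}")
```

Run against a full database restricted to high-frequency characters, the same tally is what would show the left-right skew described above.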

The seven positional slots

Japanese pedagogy names seven component positions (部首の位置). They aren't labels for a chart — they're structural constraints that decide which shape variant a component takes and which slot it gets to live in.

| Position | Japanese Reading | Location | Examples |
|---|---|---|---|
| Hen | へん | Left side | 氵, 亻, 扌 |
| Tsukuri | つくり | Right side | 刂, 攵 in 教, 頁 in 頭 |
| Kanmuri | かんむり | Top (crown) | 艹, 宀, 雨 |
| Ashi | あし | Bottom (legs) | 灬, 心, 皿 |
| Tare | たれ | Top-left drape | 广, 疒, 尸 |
| Nyou | にょう | Bottom-left wrap | 辶, 廴, 之 |
| Kamae | かまえ | Full/partial surround | 門, 囗, 行 |

Hen and tsukuri dominate: over 60% of all placements, a direct consequence of ⿰'s prevalence. Across the 2,136 joyo kanji, 6 radicals cover 25% of all characters and 50 radicals cover 75%. Almost all of them sit in hen or kanmuri. Many radicals are pinned to a single slot: 氵 is always hen, 刂 is always tsukuri, 艹 is always kanmuri. Move a component and its shape mutates. 水 becomes 氵 on the left. 心 becomes 忄 on the left but stays 心 on the bottom. 火 collapses into 灬 underneath.
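The slot-dependent shapes can be modelled as a lookup keyed on the base component and its position. This is a deliberately small sketch built from the examples in this section; `VARIANTS` and `shape_in_slot` are illustrative names, and real usage has more variants and exceptions than the table lists.

```python
# (base component, slot) -> the shape written in that slot.
VARIANTS = {
    ("水", "hen"): "氵",
    ("心", "hen"): "忄",
    ("人", "hen"): "亻",
    ("手", "hen"): "扌",
    ("火", "ashi"): "灬",
}


def shape_in_slot(component: str, slot: str) -> str:
    """Return the variant for this slot, or the base form if none is listed."""
    return VARIANTS.get((component, slot), component)


print(shape_in_slot("心", "hen"))   # 忄
print(shape_in_slot("心", "ashi"))  # 心 (no entry, so the base form is kept)
```

Falling back to the base form keeps the table short: only the combinations that actually change shape need an entry.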

Phonetic components — the sound layer

Somewhere between 67% and 82% of kanji are phono-semantic compounds (形声文字), depending on whose count you trust. The phonetic component (声符 seifu) carries the on'yomi; the semantic radical signals the meaning domain. EDRDG catalogs 150 phonetic components. KanjiJump documents 808 — and notes that 74% of the 3,500 most-used kanji either include a sound component or serve as one.

Reliability is uneven. Some series hold at 100%. Others rot through centuries of sound change between Old Chinese and modern on'yomi. Ten of the most productive:

| Component | On'yomi | Derivatives | Reliability | Example Series |
|---|---|---|---|---|
| | | ~30 | Medium | 匙 |
| | シャ | ~23 | Medium | |
| | セイ | ~21 | Medium | |
| | ホウ | ~20 | Medium | |
| | サイ | ~19 | Medium | |
| | | ~17 | High | |
| 圭 | ケイ | ~17 | High | 畦, 桂, 蛙 |
| | ホウ | ~16 | High | |
| | ハク | ~15 | High | |
| | カク | ~15 | High | |

High: >80% of derivatives share the predicted on'yomi. Medium: 50–80%. Sources: EDRDG, The Kanji Code, KanjiJump.

The perfect series are the highest-leverage components in the whole writing system. 票 (ヒョウ) generates 12 derivatives, among them 瓢 and 剽, every one ヒョウ, zero exceptions. 冓 (コウ) yields 10, all コウ. 包 (ホウ) gives 6, including 胞, all ホウ. Learn one component, predict the on'yomi of every character in the family on sight. That is a compounding asset, not flashcard busywork.
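The High and Medium bands defined under the table reduce to a single ratio: the share of derivatives whose on'yomi matches the component's. Below is a minimal sketch of that calculation. The function names are illustrative, the "Low" band for anything under 50% is an assumption (the sources define only High and Medium), and the second data set is synthetic.

```python
def reliability(predicted: str, readings: dict) -> float:
    """Share of derivatives whose on'yomi equals the predicted reading."""
    if not readings:
        raise ValueError("need at least one derivative")
    hits = sum(1 for reading in readings.values() if reading == predicted)
    return hits / len(readings)


def grade(share: float) -> str:
    """Map a share onto the bands used under the table above."""
    if share > 0.80:
        return "High"
    if share >= 0.50:
        return "Medium"
    return "Low"  # assumption: the sources do not name a band below 50%


# Four derivatives of 票, each with a single on'yomi for simplicity.
HYOU_SERIES = {"漂": "ヒョウ", "標": "ヒョウ", "瓢": "ヒョウ", "剽": "ヒョウ"}

# Synthetic series with one mismatch, only to exercise the Medium band.
SYNTHETIC = {"d1": "ホウ", "d2": "ホウ", "d3": "ボウ", "d4": "ホウ"}

print(grade(reliability("ヒョウ", HYOU_SERIES)))  # High
print(grade(reliability("ホウ", SYNTHETIC)))      # Medium
```

Characters with several on'yomi would need a set of readings per entry and a membership test in place of the equality check.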

Decomposition as compilation

The CHISE project (Character Processing Based on a Huge Structured Environment), out of Kyoto University, serializes its IDS decomposition database in RDF and exposes it via SPARQL. Each character is a tree — composition operators at the nodes, atomic components at the leaves. An abstract syntax tree for a glyph. The cjk-decomp project covers 75,000 ideographs and surfaces roughly 10,000 intermediate composite components sitting between the atomic primitives and the final characters.

The shape mirrors how a compiler represents an expression. Terminals (atomic strokes and components). Non-terminals (composite sub-components). Production rules (the IDS operators). The implication is the interesting part: kanji are not 50,000 independent symbols. They are 50,000 strings generated by a grammar with 300–700 terminals and 12 production rules. The writing system is closer to a codebase than a dictionary — and component analysis is the decompiler.
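The grammar view can be made concrete by expanding a character through its intermediate composites until only terminals remain. This is a toy sketch: `DECOMP` is a three-entry hypothetical table, `terminals` is an illustrative name, and real databases such as cjkvi-ids differ in format and in where they stop decomposing.

```python
# Every code point in the Ideographic Description Characters block.
IDC = {chr(cp) for cp in range(0x2FF0, 0x3000)}

# Toy decomposition table; a character without an entry counts as a terminal.
DECOMP = {
    "語": "⿰言吾",
    "吾": "⿱五口",  # intermediate composite (non-terminal)
    "字": "⿱宀子",
}


def terminals(char: str, table: dict) -> list:
    """Expand a character recursively and return its leaf components in order."""
    ids = table.get(char)
    if ids is None or ids == char:
        return [char]  # no further decomposition known
    leaves = []
    for symbol in ids:
        if symbol in IDC:
            continue  # operators shape the tree but are not components
        leaves.extend(terminals(symbol, table))
    return leaves


print(terminals("語", DECOMP))  # ['言', '五', '口']
```

Here 吾 plays the role of a non-terminal: it never appears in the output, only the components it expands into do.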

We wrote one. The Kanji Atlas renders the full component graph for the 2,136 joyo characters. Atlas Grade 1 is the easiest place to see the principle in action — every Grade 1 character broken down into its kanji, radicals, and graphemes.

References
