The Architecture of Kanji: Components, Positions, and Composition Rules

Kanji are a compression algorithm

Unicode encodes 97,680 characters across its CJK Unified Ideographs blocks (as of Unicode 15.1). The KanjiJump decomposition reduces the 3,500 most-used Japanese kanji to 281 atomic components, or 200 if you fold positional variants together. That is roughly 12:1 compression from an inventory of a few hundred parts. The Cihai dictionary catalogs 675 primitives across 16,339 characters; a 2009 Chinese national standard trims the inventory to 514 for common use.

This is not metaphor. Kanji form a combinatorial writing system with a formal grammar, positional constraints, and a phonetic encoding layer — properties Unicode has literally codified into twelve composition operators.

The twelve composition operators

The range U+2FF0–U+2FFB in Unicode's Ideographic Description Characters block defines twelve prefix operators that describe how components combine. Together they act as a context-free grammar for glyph structure:

Symbol	Code Point	Name	Example	Decomposition
⿰	U+2FF0	Left to right	相	⿰木目
⿱	U+2FF1	Above to below	杏	⿱木口
⿲	U+2FF2	Left-middle-right	衍	⿲彳氵亍
⿳	U+2FF3	Above-middle-below	京	⿳亠口小
⿴	U+2FF4	Full surround	回	⿴囗口
⿵	U+2FF5	Surround from above	凰	⿵几皇
⿶	U+2FF6	Surround from below	凶	⿶凵㐅
⿷	U+2FF7	Surround from left	匠	⿷匚斤
⿸	U+2FF8	Above-left surround	病	⿸疒丙
⿹	U+2FF9	Above-right surround	戒	⿹戈廾
⿺	U+2FFA	Below-left surround	超	⿺走召
⿻	U+2FFB	Overlaid	巫	⿻工从

Figure: worked Ideographic Description Sequences. 字 decomposes as ⿱宀子, 匠 as ⿷匚斤, 京 as ⿳亠口小, and 米 as ⿻八木; each character is rewritten as an IDC operator followed by its operand components. Source: Wikimedia Commons.

Ten of the twelve are binary; ⿲ and ⿳ take three operands. Unicode 15.1 added four more to the block (U+2FFC–U+2FFF) for left-open surround, bottom-right surround, horizontal reflection, and rotation; the last two take a single operand. That brings the block to sixteen, but the original twelve carry most of the load.
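Because the operators are prefix and have fixed arity, an IDS can be read with a few lines of recursive descent. The following is a minimal sketch in Python, not code from any of the projects named here: the names `ARITY` and `parse_ids` are hypothetical, and only the original twelve operators are handled.

```python
# Arity of the original twelve operators (U+2FF0–U+2FFB):
# ⿲ and ⿳ take three operands, the other ten take two.
ARITY = {chr(cp): 2 for cp in range(0x2FF0, 0x2FFC)}
ARITY["\u2FF2"] = 3  # ⿲ left-middle-right
ARITY["\u2FF3"] = 3  # ⿳ above-middle-below


def _parse(ids, pos):
    """Read one operand starting at pos; return (node, next position)."""
    if pos >= len(ids):
        raise ValueError("sequence ended before all operands were read")
    ch = ids[pos]
    pos += 1
    if ch not in ARITY:
        return ch, pos  # a leaf: any character that is not an operator
    children = []
    for _ in range(ARITY[ch]):
        child, pos = _parse(ids, pos)
        children.append(child)
    return (ch, tuple(children)), pos


def parse_ids(ids):
    """Parse a prefix-notation IDS into nested (operator, children) tuples."""
    tree, end = _parse(ids, 0)
    if end != len(ids):
        raise ValueError(f"unexpected trailing characters at position {end}")
    return tree


print(parse_ids("⿰木目"))      # ('⿰', ('木', '目'))
print(parse_ids("⿳亠口小"))    # ('⿳', ('亠', '口', '小'))
print(parse_ids("⿱⿰木目心"))  # ('⿱', (('⿰', ('木', '目')), '心'))
```

The third call shows nesting: an operand may itself be a full sequence, which is what makes the notation a grammar rather than a flat list of parts.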

Figure: the full Ideographic Description Characters block (U+2FF0–U+2FFF), shown as sixteen code-point cells, including the four operators added in Unicode 15.1. Source: Wikimedia Commons.

The CHISE project and the cjkvi-ids database have applied IDS decomposition to over 75,000 CJK ideographs, producing a machine-readable structural atlas of the entire character space. The distribution is heavily skewed. Gao and Kao (2002) found that over 60% of high-frequency characters use ⿰ (left-right), roughly 20% use ⿱ (top-bottom), and the rest split across enclosure and overlay. Left-right dominance is a fingerprint of the phono-semantic architecture: typically a semantic radical on the left and a phonetic on the right.
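A skew like this can be measured by tallying the first character of each IDS, since a prefix sequence always starts with its outermost operator. The sketch below uses ten hand-entered decompositions as a stand-in for a real database; the sample and the name `top_level_shares` are assumptions for illustration, and the resulting shares say nothing about the corpus figures cited above.

```python
from collections import Counter

IDC = {chr(cp) for cp in range(0x2FF0, 0x2FFC)}  # the original twelve operators

# Hand-entered sample, not a frequency-weighted corpus.
SAMPLE_IDS = {
    "海": "⿰氵毎", "休": "⿰亻木", "持": "⿰扌寺", "判": "⿰半刂", "頭": "⿰豆頁",
    "花": "⿱艹化", "安": "⿱宀女", "思": "⿱田心",
    "店": "⿸广占", "道": "⿺辶首",
}


def top_level_shares(ids_by_char):
    """Return {operator: share} for the outermost operator of each IDS."""
    counts = Counter(
        ids[0] for ids in ids_by_char.values() if ids and ids[0] in IDC
    )
    total = sum(counts.values())
    return {op: n / total for op, n in counts.most_common()}


for op, share in top_level_shares(SAMPLE_IDS).items():
    print(f"{op} {share:.0%}")
```

Characters whose entry does not begin with an operator (atomic characters) are skipped rather than counted.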

The seven positional slots

Japanese pedagogy names seven component positions (部首の位置). They aren't labels for a chart — they're structural constraints that decide which shape variant a component takes and which slot it gets to live in.

Position	Japanese	Reading	Location	Examples
Hen	偏	へん	Left side	氵 in 海, 亻 in 休, 扌 in 持
Tsukuri	旁	つくり	Right side	刂 in 判, 攵 in 教, 頁 in 頭
Kanmuri	冠	かんむり	Top (crown)	艹 in 花, 宀 in 安, 雨 in 雲
Ashi	脚	あし	Bottom (legs)	灬 in 然, 心 in 思, 皿 in 盤
Tare	垂	たれ	Top-left drape	广 in 店, 疒 in 病, 尸 in 届
Nyou	繞	にょう	Bottom-left wrap	辶 in 道, 廴 in 建, 走 in 起
Kamae	構	かまえ	Full/partial surround	門 in 間, 囗 in 国, 行 in 術
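The slots in the table line up loosely with operand positions in an IDS. The sketch below labels the parts of a flat, two-operand sequence with slot names. The `SLOT_BY_OPERAND` table is a simplification made up for this illustration, not an official mapping: it covers only six operators and ignores three-part layouts such as 行 in 術.

```python
# Hypothetical mapping from operator to the slot of each operand.
# None means the traditional scheme has no name for that position.
SLOT_BY_OPERAND = {
    "⿰": ("hen", "tsukuri"),
    "⿱": ("kanmuri", "ashi"),
    "⿸": ("tare", None),
    "⿺": ("nyou", None),
    "⿴": ("kamae", None),
    "⿵": ("kamae", None),
}


def label_slots(ids):
    """Pair each operand of a flat binary IDS with its slot name."""
    if len(ids) != 3 or ids[0] not in SLOT_BY_OPERAND:
        raise ValueError(f"unsupported sequence: {ids!r}")
    return list(zip(ids[1:], SLOT_BY_OPERAND[ids[0]]))


print(label_slots("⿰氵毎"))  # 海: [('氵', 'hen'), ('毎', 'tsukuri')]
print(label_slots("⿸疒丙"))  # 病: [('疒', 'tare'), ('丙', None)]
print(label_slots("⿺辶首"))  # 道: [('辶', 'nyou'), ('首', None)]
```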

Hen and tsukuri dominate, with over 60% of all placements, a direct consequence of ⿰'s prevalence. Across the 2,136 joyo kanji, 6 radicals cover 25% of all characters and 50 radicals cover 75%, and almost all of those radicals sit in hen or kanmuri. Many radicals are pinned to a single slot: 氵 is always hen, 刂 is always tsukuri, 艹 is always kanmuri. Move a component and its shape often mutates. 水 becomes 氵 on the left. 心 becomes 忄 on the left but usually stays 心 on the bottom. 火 typically collapses into 灬 underneath.
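The two constraints in this paragraph, slot-dependent shape and slot pinning, can each be modeled as a small lookup. This is a sketch only: the tables hold just the handful of cases mentioned in the text, and the names `VARIANTS`, `FIXED_SLOT`, `shape_in_slot`, and `is_allowed` are hypothetical.

```python
# Shape variants keyed by (base form, slot); only pairs named in the text.
VARIANTS = {
    ("水", "hen"): "氵",
    ("心", "hen"): "忄",
    ("心", "ashi"): "心",
    ("火", "ashi"): "灬",
}

# Radicals the text describes as pinned to one slot.
FIXED_SLOT = {"氵": "hen", "刂": "tsukuri", "艹": "kanmuri"}


def shape_in_slot(base, slot):
    """Return the form base takes in slot, or base itself if none is recorded."""
    return VARIANTS.get((base, slot), base)


def is_allowed(component, slot):
    """False only when the component is recorded as pinned to another slot."""
    return FIXED_SLOT.get(component, slot) == slot


print(shape_in_slot("水", "hen"))   # 氵
print(shape_in_slot("心", "ashi"))  # 心
print(is_allowed("氵", "tsukuri"))  # False
```

Falling back to the base form is a design choice for the sketch; a fuller table would also need to record exceptions such as components that keep their full shape in some characters.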

Phonetic components — the sound layer

Somewhere between 67% and 82% of kanji are phono-semantic compounds (形声文字), depending on whose count you trust. The phonetic component (声符 seifu) carries the on'yomi; the semantic radical signals the meaning domain. EDRDG catalogs 150 phonetic components. KanjiJump documents 808 — and notes that 74% of the 3,500 most-used kanji either include a sound component or serve as one.

Reliability is uneven. Some series hold at 100%. Others rot through centuries of sound change between Old Chinese and modern on'yomi. Ten of the most productive:

Component	On'yomi	Derivatives	Reliability	Example Series
匕	ヒ	~30	Medium	比, 匙, 旨, 尼, 北
者	シャ	~23	Medium	暑, 署, 諸, 緒, 都
生	セイ	~21	Medium	性, 星, 姓, 牲, 産
勹	ホウ	~20	Medium	包, 抱, 泡, 砲, 飽
隹	スイ	~19	Medium	推, 維, 雄, 集, 準
可	カ	~17	High	何, 河, 荷, 歌, 苛
圭	ケイ	~17	High	掛, 畦, 桂, 蛙, 街
方	ホウ	~16	High	放, 防, 紡, 坊, 芳
白	ハク	~15	High	伯, 拍, 泊, 迫, 舶
各	カク	~15	High	格, 閣, 額, 客, 略

High: >80% of derivatives share the predicted on'yomi. Medium: 50–80%. Sources: EDRDG, The Kanji Code, KanjiJump.
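The grading rule can be written out directly. In the sketch below, the thresholds are the ones stated above, while the "Low" band for anything under 50% is an assumption, since the text names no grade for it. The reading lists are hand-entered for the five example characters in two table rows, not the full derivative sets, so the shares are for illustration only.

```python
def reliability(predicted, readings_by_char):
    """Return (share, grade) for a phonetic component's predicted on'yomi."""
    if not readings_by_char:
        raise ValueError("need at least one derivative")
    hits = sum(predicted in readings for readings in readings_by_char.values())
    share = hits / len(readings_by_char)
    if share > 0.8:
        grade = "High"
    elif share >= 0.5:
        grade = "Medium"
    else:
        grade = "Low"  # assumption: the text defines no band below 50%
    return share, grade


# On'yomi of the five example characters in the 生 and 可 rows.
SEI_SERIES = {
    "性": ["セイ", "ショウ"], "星": ["セイ", "ショウ"], "姓": ["セイ", "ショウ"],
    "牲": ["セイ"], "産": ["サン"],
}
KA_SERIES = {"何": ["カ"], "河": ["カ"], "荷": ["カ"], "歌": ["カ"], "苛": ["カ"]}

print(reliability("セイ", SEI_SERIES))  # (0.8, 'Medium')
print(reliability("カ", KA_SERIES))     # (1.0, 'High')
```

Note that exactly 80% falls in Medium, because High is defined as strictly greater than 80%.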

The perfect series are the highest-leverage components in the whole writing system. 票 (ヒョウ) generates 12 derivatives, including 標, 漂, 瓢, and 剽, every one read ヒョウ with zero exceptions. 冓 (コウ) yields 10, including 構, 溝, 講, and 購, all コウ. 包 (ホウ) gives 6, including 抱, 泡, 砲, 胞, and 飽, all ホウ. Learn one component and you can predict the on'yomi of every character in the family on sight. That is a compounding asset, not flashcard busywork.
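Prediction from a perfect series amounts to scanning a character's decomposition for a known phonetic. A minimal sketch, assuming the three series named above and a hypothetical helper `predict_onyomi`; the IDS strings are hand-entered.

```python
# The exception-free series named in the text.
PERFECT_SERIES = {"票": "ヒョウ", "冓": "コウ", "包": "ホウ"}


def predict_onyomi(ids):
    """Return the on'yomi of the first known phonetic in an IDS, else None."""
    for ch in ids:
        if ch in PERFECT_SERIES:
            return PERFECT_SERIES[ch]
    return None


print(predict_onyomi("⿰木票"))  # 標 -> ヒョウ
print(predict_onyomi("⿰言冓"))  # 講 -> コウ
print(predict_onyomi("⿰石包"))  # 砲 -> ホウ
print(predict_onyomi("⿰木目"))  # 相 -> None (no known phonetic)
```

The lookup only works when the phonetic appears as a component in its own right; a decomposition that splits 票 into smaller parts would need to be matched at the intermediate level instead.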

Decomposition as compilation

The CHISE project (Character Information Service Environment), out of Kyoto University, serializes its IDS decomposition database in RDF and exposes it via SPARQL. Each character is a tree, with composition operators at the nodes and atomic components at the leaves: an abstract syntax tree for a glyph. The cjk-decomp project covers 75,000 ideographs and surfaces roughly 10,000 intermediate composite components sitting between the atomic primitives and the final characters.

The shape mirrors how a compiler represents an expression. Terminals (atomic strokes and components). Non-terminals (composite sub-components). Production rules (the IDS operators). The implication is the interesting part: kanji are not 50,000 independent symbols. They are 50,000 strings generated by a grammar with 300–700 terminals and 12 production rules. The writing system is closer to a codebase than a dictionary — and component analysis is the decompiler.
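The decompiler analogy can be made concrete with a rule table that is expanded recursively until only terminals remain. The sketch below uses a two-rule table invented for illustration (想 built from 相, which is built from 木 and 目) and treats any character without a rule as a terminal; `RULES`, `expand`, and `terminals` are hypothetical names, not part of CHISE or cjk-decomp.

```python
IDC = {chr(cp) for cp in range(0x2FF0, 0x2FFC)}  # the original twelve operators

# Tiny hypothetical rule table: character -> one-level IDS.
# 相 plays the role of an intermediate composite (a non-terminal).
RULES = {"想": "⿱相心", "相": "⿰木目"}


def expand(ch, rules):
    """Rewrite ch into a full IDS by applying rules until none apply.

    Assumes the table has no cycles other than a character mapped to itself.
    """
    ids = rules.get(ch)
    if ids is None or ids == ch:
        return ch  # terminal: no rule, or listed as its own decomposition
    return "".join(c if c in IDC else expand(c, rules) for c in ids)


def terminals(ch, rules):
    """List the leaf components of ch in reading order."""
    return [c for c in expand(ch, rules) if c not in IDC]


print(expand("想", RULES))     # ⿱⿰木目心
print(terminals("想", RULES))  # ['木', '目', '心']
```

Running `terminals` over a whole character set and collecting the distinct results is one way to arrive at a terminal inventory; where to stop expanding is a modeling decision, which is one reason the published counts differ.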

We wrote one. The Kanji Atlas renders the full component graph for the 2,136 joyo characters. Atlas Grade 1 is the easiest place to see the principle in action — every Grade 1 character broken down into its kanji, radicals, and graphemes.

References

Unicode Consortium. "Ideographic Description Characters." The Unicode Standard, Chapter 18.
CHISE Project. chise.org
cjkvi-ids. IDS Data for CJK Unified Ideographs. github.com/cjkvi/cjkvi-ids
amake/cjk-decomp. Decomposition data for 75,000 CJK ideographs. github.com/amake/cjk-decomp
KanjiJump. "The 281 Atomic Kanji Components." kanjijump.com
Gao, D.G. & Kao, H.S.R. (2002). Chinese character structure analysis. Acta Psychologica Sinica.
EDRDG. Kanji Phonetic Components. edrdg.org
Millen, A. The Kanji Code. thekanjicode.com
Wikipedia. "Ideographic Description Characters," "Chinese Character Components," "List of Kanji Radicals by Frequency."