The Architecture of Kanji: Components, Positions, and Composition Rules
Kanji are a compression algorithm
Unicode's CJK Unified Ideographs blocks together encode 97,680 characters (as of Unicode 15.1). The KanjiJump decomposition reduces the 3,500 most-used Japanese kanji to 281 atomic components, or 200 if you fold positional variants together. That is roughly 12:1 compression, from an inventory of only a few hundred parts. The Cihai dictionary catalogs 675 primitives across 16,339 characters; a 2009 Chinese national standard trims the list to 514 for common use.
This is not metaphor. Kanji form a combinatorial writing system with a formal grammar, positional constraints, and a phonetic encoding layer, and Unicode has literally codified the structural part of that grammar as twelve composition operators.
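The ratios behind the compression claim are simple division, and a minimal sketch makes them easy to check. The function name `compression_ratio` is hypothetical; the counts are the ones quoted above.

```python
def compression_ratio(characters: int, components: int) -> float:
    """Characters described per atomic component."""
    return characters / components


# Counts quoted in the text: (label, characters, components).
INVENTORIES = [
    ("KanjiJump, 281 components", 3500, 281),
    ("KanjiJump, variants folded", 3500, 200),
    ("Cihai primitives", 16339, 675),
]

for label, characters, components in INVENTORIES:
    ratio = compression_ratio(characters, components)
    print(f"{label}: {ratio:.1f}:1")
```

The 281-component figure is the one that gives the "roughly 12:1" quoted above; folding positional variants pushes the ratio higher.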
The twelve composition operators
The Unicode range U+2FF0–U+2FFB defines the original twelve Ideographic Description Characters (IDCs): prefix operators that describe how components combine. Together they behave like a context-free grammar for glyph structure:
| Symbol | Code Point | Name | Example | Decomposition |
|---|---|---|---|---|
| ⿰ | U+2FF0 | Left to right | 相 | ⿰木目 |
| ⿱ | U+2FF1 | Above to below | 杏 | ⿱木口 |
| ⿲ | U+2FF2 | Left-middle-right | 衍 | ⿲彳氵亍 |
| ⿳ | U+2FF3 | Above-middle-below | 京 | ⿳亠口小 |
| ⿴ | U+2FF4 | Full surround | 回 | ⿴囗口 |
| ⿵ | U+2FF5 | Surround from above | 凰 | ⿵几皇 |
| ⿶ | U+2FF6 | Surround from below | 凶 | ⿶凵㐅 |
| ⿷ | U+2FF7 | Surround from left | 匠 | ⿷匚斤 |
| ⿸ | U+2FF8 | Above-left surround | 病 | ⿸疒丙 |
| ⿹ | U+2FF9 | Above-right surround | 戒 | ⿹戈廾 |
| ⿺ | U+2FFA | Below-left surround | 超 | ⿺走召 |
| ⿻ | U+2FFB | Overlaid | 巫 | ⿻工从 |

Worked decomposition: each character on the left is rewritten as an IDC operator (the dashed box) followed by its operand components. Source: Wikimedia Commons.
Ten of the twelve are binary; ⿲ and ⿳ take three operands. Unicode 15.1 added four more (U+2FFC–U+2FFF): surround from right (open on the left), surround from lower right, horizontal reflection, and rotation, the last two of which take a single operand. That brings the block to sixteen in total, though the original twelve carry most of the load.

The full Ideographic Description Characters block (U+2FF0–U+2FFF), including the four operators added in Unicode 15.1. Source: Wikimedia Commons.
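Because the operators are prefix operators with fixed arity, an Ideographic Description Sequence can be parsed with a few lines of recursive descent. The following is a minimal sketch, not the parser used by any of the projects mentioned here. It covers only the original twelve operators, and `Node`, `ARITY`, and `parse_ids` are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass
from typing import List, Tuple, Union

# Arity of the original twelve operators (U+2FF0..U+2FFB):
# all binary except U+2FF2 and U+2FF3, which take three operands.
ARITY = {chr(cp): 2 for cp in range(0x2FF0, 0x2FFC)}
ARITY["⿲"] = 3
ARITY["⿳"] = 3


@dataclass
class Node:
    op: str                  # the composition operator
    children: List["Tree"]   # its operands, in reading order


Tree = Union[str, Node]      # a leaf is a single component character


def _parse(s: str, i: int) -> Tuple[Tree, int]:
    """Parse one subtree starting at index i; return it and the next index."""
    if i >= len(s):
        raise ValueError("sequence ended before all operands were read")
    ch = s[i]
    i += 1
    if ch not in ARITY:      # not an operator, so it is a leaf component
        return ch, i
    children = []
    for _ in range(ARITY[ch]):
        child, i = _parse(s, i)
        children.append(child)
    return Node(ch, children), i


def parse_ids(ids: str) -> Tree:
    """Parse a complete IDS string into a tree."""
    tree, end = _parse(ids, 0)
    if end != len(ids):
        raise ValueError("trailing characters after a complete sequence")
    return tree


print(parse_ids("⿰木目"))       # 相
print(parse_ids("⿱⿰木目心"))   # 想, with 相 expanded in place
```

Operands can themselves be sequences, which is why the second example nests: the first operand of ⿱ is the whole subtree for 相.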
The CHISE project and the cjkvi-ids database have applied IDS decomposition to over 75,000 CJK ideographs — a machine-readable structural atlas of the entire character space. The distribution is heavily skewed. Gao and Kao (2002) found that over 60% of high-frequency characters use ⿰ (left-right), roughly 20% use ⿱ (top-bottom), and the rest split across enclosure and overlay. Left-right dominance is a fingerprint of the phono-semantic architecture: semantic radical on the left, phonetic on the right.
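A distribution like this can be measured by tallying the top-level operator of each character, which is simply the first code point of its IDS. The sketch below assumes a small hand-picked table, `SAMPLE_IDS`, rather than real cjkvi-ids data, so its percentages are illustrative only and do not reproduce Gao and Kao's figures.

```python
from collections import Counter
from typing import Dict

# Hand-picked sample for illustration; not a frequency-weighted corpus.
SAMPLE_IDS = {
    "相": "⿰木目",
    "休": "⿰亻木",
    "明": "⿰日月",
    "林": "⿰木木",
    "杏": "⿱木口",
    "思": "⿱田心",
    "花": "⿱艹化",
    "病": "⿸疒丙",
    "超": "⿺走召",
    "回": "⿴囗口",
}


def operator_shares(ids_table: Dict[str, str]) -> Dict[str, float]:
    """Share of characters whose outermost operator is each IDC."""
    counts = Counter(ids[0] for ids in ids_table.values())
    total = sum(counts.values())
    return {op: n / total for op, n in counts.items()}


for op, share in sorted(operator_shares(SAMPLE_IDS).items(),
                        key=lambda item: -item[1]):
    print(f"{op} {share:.0%}")
```

On real data the same tally would need frequency weighting to match a claim about high-frequency characters; this version counts each character once.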
The seven positional slots
Japanese pedagogy names seven component positions (部首の位置). They aren't labels for a chart — they're structural constraints that decide which shape variant a component takes and which slot it gets to live in.
| Position | Japanese | Reading | Location | Examples |
|---|---|---|---|---|
| Hen | 偏 | へん | Left side | 氵 in 海, 亻 in 休, 扌 in 持 |
| Tsukuri | 旁 | つくり | Right side | 刂 in 判, 攵 in 教, 頁 in 頭 |
| Kanmuri | 冠 | かんむり | Top (crown) | 艹 in 花, 宀 in 安, 雨 in 雲 |
| Ashi | 脚 | あし | Bottom (legs) | 灬 in 然, 心 in 思, 皿 in 盤 |
| Tare | 垂 | たれ | Top-left drape | 广 in 店, 疒 in 病, 尸 in 届 |
| Nyou | 繞 | にょう | Bottom-left wrap | 辶 in 道, 廴 in 建, 走 in 起 |
| Kamae | 構 | かまえ | Full/partial surround | 門 in 間, 囗 in 国, 行 in 術 |
Hen and tsukuri dominate — over 60% of all placements, a direct consequence of ⿰'s prevalence. Across the 2,136 joyo kanji, 6 radicals cover 25% of all characters and 50 radicals cover 75%. Almost all of them sit in hen or kanmuri. Many radicals are pinned to a single slot: 氵 is always hen, 刂 is always tsukuri, 艹 is always kanmuri. Move a component and its shape mutates. 水 becomes 氵 on the left. 心 becomes 忄 on the left but stays 心 on the bottom. 火 collapses into 灬 underneath.
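The shape changes described above amount to a lookup keyed on component and slot. The sketch below is a toy table, not a complete inventory: `VARIANTS`, `shape_in_slot`, and `fold_variant` are hypothetical names, and the entries are limited to the examples in this section plus 人 and 手, whose left-side forms 亻 and 扌 appear in the table.

```python
# (base component, position) -> shape it takes in that slot.
VARIANTS = {
    ("水", "hen"): "氵",
    ("心", "hen"): "忄",
    ("心", "ashi"): "心",
    ("火", "ashi"): "灬",
    ("人", "hen"): "亻",
    ("手", "hen"): "扌",
}

# Reverse map used to fold positional variants back onto one base component.
BASE_FORM = {variant: base for (base, _), variant in VARIANTS.items()}


def shape_in_slot(component: str, position: str) -> str:
    """Return the variant used in a slot, or the component itself if unlisted."""
    return VARIANTS.get((component, position), component)


def fold_variant(shape: str) -> str:
    """Map a positional variant to its base component."""
    return BASE_FORM.get(shape, shape)


print(shape_in_slot("水", "hen"))   # left-side form of 水
print(shape_in_slot("心", "ashi"))  # 心 keeps its shape on the bottom
print(fold_variant("灬"))           # folds back to 火
```

Folding with a table like `BASE_FORM` is the same operation that takes a component inventory from 281 entries down to 200 when positional variants are merged.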
Phonetic components — the sound layer
Somewhere between 67% and 82% of kanji are phono-semantic compounds (形声文字), depending on whose count you trust. The phonetic component (声符 seifu) carries the on'yomi; the semantic radical signals the meaning domain. EDRDG catalogs 150 phonetic components. KanjiJump documents 808 — and notes that 74% of the 3,500 most-used kanji either include a sound component or serve as one.
Reliability is uneven. Some series hold at 100%. Others rot through centuries of sound change between Old Chinese and modern on'yomi. Ten of the most productive:
| Component | On'yomi | Derivatives | Reliability | Example Series |
|---|---|---|---|---|
| 匕 | ヒ | ~30 | Medium | 比, 匙, 旨, 尼, 北 |
| 者 | シャ | ~23 | Medium | 暑, 署, 諸, 緒, 都 |
| 生 | セイ | ~21 | Medium | 性, 星, 姓, 牲, 産 |
| 勹 | ホウ | ~20 | Medium | 包, 抱, 泡, 砲, 飽 |
| 隹 | スイ | ~19 | Medium | 推, 維, 雄, 集, 準 |
| 可 | カ | ~17 | High | 何, 河, 荷, 歌, 苛 |
| 圭 | ケイ | ~17 | High | 掛, 畦, 桂, 蛙, 街 |
| 方 | ホウ | ~16 | High | 放, 防, 紡, 坊, 芳 |
| 白 | ハク | ~15 | High | 伯, 拍, 泊, 迫, 舶 |
| 各 | カク | ~15 | High | 格, 閣, 額, 客, 略 |
High: >80% of derivatives share the predicted on'yomi. Medium: 50–80%. Sources: EDRDG, The Kanji Code, KanjiJump.
The perfect series are the highest-leverage components in the whole writing system. 票 (ヒョウ) generates 12 derivatives, including 標, 漂, 瓢, and 剽, every one read ヒョウ with zero exceptions. 冓 (コウ) yields 10, including 構, 溝, 講, and 購, all コウ. 包 (ホウ) gives 6, including 抱, 泡, 砲, 胞, and 飽, all ホウ. Learn one component and you can predict the on'yomi of every character in the family on sight. That is a compounding asset, not flashcard busywork.
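The reliability grades used in the table can be computed as the share of derivatives whose on'yomi matches the component's. The sketch below assumes hand-entered primary readings for a few characters from the series listed here, so it scores only those samples, not the full series. The "Low" label for anything under 50% is an assumption; the source defines only High and Medium.

```python
from typing import Dict


def reliability(component_reading: str, derivatives: Dict[str, str]) -> float:
    """Share of derivatives whose on'yomi equals the component's reading."""
    matches = sum(1 for reading in derivatives.values()
                  if reading == component_reading)
    return matches / len(derivatives)


def grade(share: float) -> str:
    """High: above 80%. Medium: 50 to 80%. 'Low' is an assumed extra label."""
    if share > 0.8:
        return "High"
    if share >= 0.5:
        return "Medium"
    return "Low"


# Primary on'yomi entered by hand for illustration (partial series only).
HYOU_SERIES = {"標": "ヒョウ", "漂": "ヒョウ", "瓢": "ヒョウ", "剽": "ヒョウ"}
SEI_SERIES = {"性": "セイ", "星": "セイ", "姓": "セイ", "牲": "セイ", "産": "サン"}

for name, reading, series in [("票", "ヒョウ", HYOU_SERIES),
                              ("生", "セイ", SEI_SERIES)]:
    share = reliability(reading, series)
    print(name, f"{share:.0%}", grade(share))
```

Exact string equality is the strictest possible test. A more forgiving scorer might count voiced alternations such as ホウ and ボウ as matches, which would raise the score of a series like 方.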
Decomposition as compilation
The CHISE project (CHaracter Information Service Environment), out of Kyoto University, serializes its IDS decomposition database in RDF and exposes it via SPARQL. Each character is a tree, with composition operators at the nodes and atomic components at the leaves: an abstract syntax tree for a glyph. The cjk-decomp project covers 75,000 ideographs and surfaces roughly 10,000 intermediate composite components sitting between the atomic primitives and the final characters.
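Expanding a character down to its leaves is a recursive lookup, and the intermediate composites fall out as the entries that are both decomposed and reused as operands. The sketch below assumes a three-entry toy table, `IDS_TABLE`, and treats any character without an entry as atomic. Real databases decompose further and use their own file formats, which are not reproduced here.

```python
from typing import Dict, List, Set

# All sixteen operators in the block, U+2FF0..U+2FFF.
IDC = {chr(cp) for cp in range(0x2FF0, 0x3000)}

# Toy table for illustration; characters with no entry count as atomic.
IDS_TABLE = {
    "想": "⿱相心",
    "霜": "⿱雨相",
    "相": "⿰木目",
}


def atomic_leaves(ch: str, table: Dict[str, str]) -> List[str]:
    """Expand a character recursively into its leaf components, in order."""
    ids = table.get(ch, ch)
    if ids == ch:                 # no entry, or an entry that maps to itself
        return [ch]
    leaves: List[str] = []
    for part in ids:
        if part in IDC:           # operators shape the tree but are not leaves
            continue
        leaves.extend(atomic_leaves(part, table))
    return leaves


def intermediate_components(table: Dict[str, str]) -> Set[str]:
    """Characters that are decomposed and also used as an operand elsewhere."""
    operands = {part for ids in table.values() for part in ids
                if part not in IDC}
    return {ch for ch in table if ch in operands}


print(atomic_leaves("想", IDS_TABLE))
print(atomic_leaves("霜", IDS_TABLE))
print(intermediate_components(IDS_TABLE))
```

In this table 相 plays the role of an intermediate composite: it is a character in its own right and also a reusable subtree inside 想 and 霜.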
The shape mirrors how a compiler represents an expression. Terminals (atomic strokes and components). Non-terminals (composite sub-components). Production rules (the IDS operators). The implication is the interesting part: kanji are not 50,000 independent symbols. They are 50,000 strings generated by a grammar with 300–700 terminals and 12 production rules. The writing system is closer to a codebase than a dictionary — and component analysis is the decompiler.
We wrote one. The Kanji Atlas renders the full component graph for the 2,136 joyo characters. Atlas Grade 1 is the easiest place to see the principle in action — every Grade 1 character broken down into its kanji, radicals, and graphemes.
References
- Unicode Consortium. "Ideographic Description Characters." The Unicode Standard, Chapter 18.
- CHISE Project. chise.org
- cjkvi-ids. IDS Data for CJK Unified Ideographs. github.com/cjkvi/cjkvi-ids
- amake/cjk-decomp. Decomposition data for 75,000 CJK ideographs. github.com/amake/cjk-decomp
- KanjiJump. "The 281 Atomic Kanji Components." kanjijump.com
- Gao, D.G. & Kao, H.S.R. (2002). Chinese character structure analysis. Acta Psychologica Sinica.
- EDRDG. Kanji Phonetic Components. edrdg.org
- Millen, A. The Kanji Code. thekanjicode.com
- Wikipedia. "Ideographic Description Characters," "Chinese Character Components," "List of Kanji Radicals by Frequency."