I stumbled upon an interesting statistical way of comparing various kanji learning systems when I played with the XML kanji data available from Jim Breen’s kick-ass KANJIDIC2 project (the XML version of his KANJIDIC project). I present the findings below.
During my attempt to learn Japanese, I had to find a good way to learn the kanji characters. Even though I learned hiragana and katakana in a day or two and mastered reading them almost automatically just by reading Japanese text, kanji turned out to be quite a different story. At first I just tried remembering the shapes in the hope that they would stick in my memory, but that turned out to be futile for me. As somebody who grew up with the Latin alphabet, I am terrible at remembering shapes, not to mention stroke order.
So I tried memorizing the shapes using drills, pretty much the way Japanese schoolkids learn in elementary school. It didn’t take too long for me to realize how boring this method is. Also, kanji-a-day is not exactly the speed I was aiming for.
I looked around a bit and found two books that offered mnemonic methods, somewhat similar to each other: Guide to Remembering Japanese Characters by Kenneth G. Henshall and Remembering the Kanji by James W. Heisig. Both books contain about 2000 kanji and provide mnemonics for rapid learning. I eventually chose Heisig’s book since I think it is a more systematic way of learning, and the mnemonics focus more on the meaning than on the shape of kanji.
I think that for me, the book did what it promised to do. I’m able to recognize most of the book’s kanji, and I can write them in the correct stroke order (I think this is the strength of Heisig’s method compared to drills). I could even impress my Japanese friends by correcting their stroke order mistakes.
The main criticism of the book that you’ll find online is the ordering of characters. Almost all other systems teach the most frequent characters first. Heisig chooses a different approach: he groups characters according to the primitive elements they contain. Primitive elements are parts of the kanji that are shared among multiple characters. Many of them are characters themselves. Thus, learning the characters that serve as primitive elements first, followed by the characters that contain them, reinforces the learning process. It worked great for me. The drawback, however, is that you are learning a mix of common and rare kanji. Heisig is honest about this, and he states clearly in the introduction that the book is intended for self-study for those who want to master all 2000 most common characters.
A neat way to compare the ordering of the characters in the two books and also the order in which the characters are learned in Japanese elementary schools is to plot the frequency of use versus the order in which they are learned. All the data is available online thanks to the awesome work of Jim Breen et al. and his KANJIDIC2 project. The data can be downloaded in XML format. This makes it easy to work with. I used Wolfram’s Mathematica with its built-in ability to process XML data.
The kanjidic2.xml file contains an entry for each Japanese kanji. When available, the kanji entry contains the grade in which the character is learned and the Henshall and Heisig indexes, which represent the order in which the character is learned in each of these books. Moreover, the 2,500 most-used characters carry a frequency-of-use ranking. The kanjidic2.xml file description states:
The 2,500 most-used characters have a ranking; … The frequency is a number from 1 to 2,500 that expresses the relative frequency of occurrence of a character in modern Japanese. This is based on a survey in newspapers, so it is biased towards kanji used in newspaper articles. The discrimination between the less frequently used kanji is not strong. (emphasis mine)
So even though it has limitations, the ranking can still provide a useful comparison. As for the grade information, the kanjidic2.xml description says:
1 through 6 indicate the grade in which the kanji is taught in Japanese schools. 8 indicates it is one of the remaining Jouyou Kanji to be learned in junior high school, and 9 indicates it is a Jinmeiyou (for use in names) kanji.
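For readers who would rather follow along in Python than Mathematica, here is a minimal sketch of pulling those four fields out of the file. It is not the code used for this post. It assumes the layout I understand kanjidic2.xml to have (a `character` element holding `literal`, `misc/grade`, `misc/freq`, and `dic_number/dic_ref` entries with a `dr_type` of `heisig` or `henshall`), so check the element names against the file's own documentation. The embedded sample and its numbers are placeholders for illustration only, and `parse_kanjidic` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Tiny stand-in for kanjidic2.xml. The numbers are placeholders, not real data.
SAMPLE = """<kanjidic2>
  <character>
    <literal>日</literal>
    <misc><grade>1</grade><freq>1</freq></misc>
    <dic_number>
      <dic_ref dr_type="heisig">12</dic_ref>
      <dic_ref dr_type="henshall">62</dic_ref>
    </dic_number>
  </character>
  <character>
    <literal>亜</literal>
    <misc><grade>8</grade></misc>
  </character>
</kanjidic2>"""


def _int_or_none(text):
    """Convert element text to int, keeping None for missing elements."""
    return int(text) if text is not None else None


def parse_kanjidic(root):
    """Return one dict per <character> with grade, freq and book indexes."""
    records = []
    for ch in root.iter("character"):
        rec = {
            "literal": ch.findtext("literal"),
            "grade": _int_or_none(ch.findtext("misc/grade")),
            "freq": _int_or_none(ch.findtext("misc/freq")),
            "heisig": None,
            "henshall": None,
        }
        # Assumed layout: each book index is a <dic_ref> tagged by dr_type.
        for ref in ch.iterfind("dic_number/dic_ref"):
            kind = ref.get("dr_type")
            if kind in ("heisig", "henshall"):
                rec[kind] = int(ref.text)
        records.append(rec)
    return records


records = parse_kanjidic(ET.fromstring(SAMPLE))
```

With the real file, something like `parse_kanjidic(ET.parse("kanjidic2.xml").getroot())` should be the only change needed, assuming the file sits in the working directory. Fields a character lacks come back as `None`, which is what lets us later find characters that have a frequency ranking but no book index.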
I produced 3 plots. I plotted the dependence of frequency-of-use ranking on Heisig’s index, Henshall’s index and the grade, respectively. I also included the characters that don’t appear in those learning systems but have a frequency-of-use ranking, to see which “common” kanji are omitted. The plots are below. The points to the right of the dashed line represent characters that have a frequency-of-use ranking but don’t appear in the learning system. I added random noise to the horizontal position of each of these points so that the distribution of their frequency-of-use rankings is visible. I also added horizontal noise to the grade plot, since the grades are integer values, which would again hide the ranking distribution.
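The horizontal noise is the only slightly tricky part of these plots, so here is a rough Python sketch of the idea. It is an illustration under my own assumptions, not the Mathematica code behind the plots: the function names, the uniform noise, and the `missing_x`, `width` and `spread` parameters are all hypothetical choices.

```python
import random


def jittered_points(pairs, missing_x, width, seed=0):
    """Turn (book_index_or_None, freq_rank_or_None) pairs into (x, y) points.

    Characters without a book index are spread uniformly over
    [missing_x, missing_x + width], i.e. to the right of the dashed line.
    """
    rng = random.Random(seed)  # fixed seed so the plot is reproducible
    points = []
    for index, rank in pairs:
        if rank is None:
            continue  # no frequency ranking, nothing to plot
        if index is None:
            x = missing_x + rng.uniform(0.0, width)
        else:
            x = float(index)
        points.append((x, rank))
    return points


def jittered_grades(pairs, spread=0.3, seed=0):
    """Same idea for (grade, freq_rank) pairs: nudge each integer grade sideways."""
    rng = random.Random(seed)
    return [
        (grade + rng.uniform(-spread, spread), rank)
        for grade, rank in pairs
        if grade is not None and rank is not None
    ]


# Made-up example input: two indexed characters, one missing from the book,
# and one without a frequency ranking (which is dropped).
demo = jittered_points([(1, 50), (2, 900), (None, 2100), (3, None)],
                       missing_x=2100, width=200)
```

The resulting `(x, y)` pairs can be handed to whatever scatter-plot routine you like. Keeping `spread` below 0.5 keeps neighboring grades from overlapping.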
The first plot confirms that the criticism of Heisig’s book is valid. There is no obvious correlation between the order in which Heisig presents the characters in his book and their frequency of use. I would recommend his book only to those who wanna learn all 2042 kanji that it contains. On the other hand, Heisig’s order makes for more entertaining and effective learning, and I can’t stress enough how important that is for learning success.
Henshall’s book shows a much better correlation. The first half clearly presents characters in roughly frequency-of-use order, and so does the Japanese elementary school system. It is apparent to me that Henshall based his order on the grade system, presenting the 6 grades in the first half of his book and the remaining characters (grades 8 and 9 in the kanjidic2 data) in the second half.
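I judged the correlations by eye, but if you want to put a number on “no obvious correlation” versus “much better correlation”, one option is Spearman’s rank correlation between learning order and frequency ranking. The sketch below is a plain-Python version of that standard statistic, not something I computed for this post. The function names are mine, and it assumes both lists have the same length and that neither is constant.

```python
def average_ranks(values):
    """1-based ranks of values, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values equal to values[order[i]].
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # average of 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def spearman(xs, ys):
    """Spearman's rho: the Pearson correlation of the rank-transformed data."""
    return pearson(average_ranks(xs), average_ranks(ys))
```

You would call it as `spearman(indexes, freq_ranks)` on the characters that have both a book index and a frequency ranking. A value near 1 means the book teaches frequent characters first, and a value near 0 means the order is unrelated to frequency. Ties matter mostly for the grade plot, where many characters share the same grade.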
Interestingly, grade 1 contains some fairly uncommon characters such as 犬, 耳, 虫, 糸 and 貝, all with a frequency-of-use ranking above 1300. But this is most likely an artifact of the ranking’s source, newspapers. In fact, these characters are quite useful in basic Japanese.