Kanji-kana ratio for stylometry?Posted: February 6, 2019
I ended my blogpost on hiragana frequency as a stylometric indicator with the remark that, rather than the frequency distribution of different hiragana in the text, the ratio of kana to kanji is used as one of several key characteristics in actual stylometric analysis of Japanese texts. I was curious to find out if this number alone could tell us something about the 4 manga text samples in question (2 randomly selected scenes from Katsuhiro Ōtomo’s Akira and 2 series from Morning magazine, Miko Yasu’s Hakozume and Rito Asami’s Ichikei no karasu – in the following text referred to as A1, A2, M1 and M2, respectively). My intuition was that the results wouldn’t be meaningful because the samples were too small, but let’s see:
This time I chose a sample size of 200 characters (hiragana, katakana, and kanji) per text.
Among the first 200 characters in A1 (i.e. Akira vol. 5, p. 16), there are 113 hiragana, 42 katakana and 45 kanji. This results in a kanji-kana ratio of 45 : (113 + 42) = 0.29.
In A2 (Akira vol. 3, pp. 125 ff.), the first 200 characters comprise of 126 hiragana, 34 katakana, and 40 kanji, i.e. the kanji-kana ratio is 0.25.
In M1, there are 122 hiragana, 9 katakana, and 69 kanji, resulting in a kanji-kana ratio of 0.52.
In M2, there are 117 hiragana, 0 katakana, and 83 kanji, resulting in a kanji-kana ratio of 0.71!
Thus this time the authorship attribution seems to have worked: the two Ōtomo samples have an almost identical score, whereas those of the two Morning samples are completely different. Interestingly, this result contradicts the interpretation from the earlier blogpost in which I had suggested that the scientists in Akira and the lawyers in Karasu have similar ways of talking. The difference in the kanji-kana ratio between Akira and the two Morning manga, though, is explained not only through the more frequent use of kanji in the latter, but also through the vast differences in katakana usage (note that only characters in proper word balloons, i.e. dialogue, are counted, not sound effects).
Ōtomo uses katakana for two different purposes: in A1 mainly to reproduce the names of the foreign researchers, and in A2 to stretch syllables otherwise written in hiragana at the end of words, e.g. なにィ nanii (“whaaat?”) or 何だァ nandaa (“what is iiit?”). Therefore the similarity of the character use in the two Akira samples is superficial only and the pure numbers somewhat misleading. On the other hand, it makes sense that an action-packed scene such as A2 contains less than half as many kanji as the courtroom dialogue in M2; in A2 there are more simple, colloquial words for which the hiragana spelling is more common, e.g. くそう kusou (“shit!”) or うるせェ urusee (“quiet!”), whereas technical terms such as 被告人 hikokunin (“defendant”) in M2 are more clearly and commonly expressed in kanji.
In the end, the old rule applies: only with a large number of sample texts, with a large size of each sample, and through a combination of several different metrics can such stylometric approaches possibly succeed.