Kanji-kana ratio for stylometry?

I ended my blogpost on hiragana frequency as a stylometric indicator with the remark that, rather than the frequency distribution of different hiragana in the text, the ratio of kana to kanji is used as one of several key characteristics in actual stylometric analysis of Japanese texts. I was curious to find out if this number alone could tell us something about the 4 manga text samples in question (2 randomly selected scenes from Katsuhiro Ōtomo’s Akira and 2 series from Morning magazine, Miko Yasu’s Hakozume and Rito Asami’s Ichikei no karasu – in the following text referred to as A1, A2, M1 and M2, respectively). My intuition was that the results wouldn’t be meaningful because the samples were too small, but let’s see:

This time I chose a sample size of 200 characters (hiragana, katakana, and kanji) per text.

Among the first 200 characters in A1 (i.e. Akira vol. 5, p. 16), there are 113 hiragana, 42 katakana and 45 kanji. This results in a kanji-kana ratio of 45 : (113 + 42) = 0.29.

In A2 (Akira vol. 3, pp. 125 ff.), the first 200 characters comprise 126 hiragana, 34 katakana, and 40 kanji, i.e. the kanji-kana ratio is 0.25.

In M1, there are 122 hiragana, 9 katakana, and 69 kanji, resulting in a kanji-kana ratio of 0.53.

In M2, there are 117 hiragana, 0 katakana, and 83 kanji, resulting in a kanji-kana ratio of 0.71!
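The counting above was done by hand, but it could be sketched in code roughly as follows. This is only a minimal sketch under my own assumptions: the function names are made up for illustration, the Unicode ranges are a simplification (the prolonged sound mark ー, the middle dot ・ and the iteration mark 々 are not counted at all), and the ratio is undefined for a sample without any kana.

```python
def classify(ch):
    """Return the script of a single character, or None if it is not counted."""
    code = ord(ch)
    if 0x3041 <= code <= 0x3096:   # hiragana letters (assumed range)
        return "hiragana"
    if 0x30A1 <= code <= 0x30FA:   # katakana letters, without the mark ー (assumption)
        return "katakana"
    if 0x4E00 <= code <= 0x9FFF:   # CJK Unified Ideographs, i.e. most kanji
        return "kanji"
    return None


def count_scripts(text, limit=200):
    """Count hiragana, katakana and kanji among the first `limit` such characters."""
    counts = {"hiragana": 0, "katakana": 0, "kanji": 0}
    for ch in text:
        script = classify(ch)
        if script is None:
            continue  # punctuation, spaces, Latin letters etc. are skipped
        counts[script] += 1
        if sum(counts.values()) >= limit:
            break
    return counts


def kanji_kana_ratio(hiragana, katakana, kanji):
    """Number of kanji divided by the number of all kana."""
    return kanji / (hiragana + katakana)


# Manual counts from this post: (hiragana, katakana, kanji) per 200-character sample.
samples = {
    "A1": (113, 42, 45),
    "A2": (126, 34, 40),
    "M1": (122, 9, 69),
    "M2": (117, 0, 83),
}
for name, (hira, kata, kanji) in samples.items():
    print(name, round(kanji_kana_ratio(hira, kata, kanji), 2))
```

With a digital text at hand, `count_scripts(text)` would supply the three numbers that go into `kanji_kana_ratio`.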

6 hiragana, 2 katakana, 3 kanji in A2 (Akira vol. 3, p. 125).

Thus this time the authorship attribution seems to have worked: the two Ōtomo samples have an almost identical score, whereas those of the two Morning samples are completely different. Interestingly, this result contradicts the interpretation from the earlier blogpost in which I had suggested that the scientists in Akira and the lawyers in Karasu have similar ways of talking. The difference in the kanji-kana ratio between Akira and the two Morning manga, though, is explained not only through the more frequent use of kanji in the latter, but also through the vast differences in katakana usage (note that only characters in proper word balloons, i.e. dialogue, are counted, not sound effects).

Ōtomo uses katakana for two different purposes: in A1 mainly to reproduce the names of the foreign researchers, and in A2 to stretch syllables otherwise written in hiragana at the end of words, e.g. なにィ nanii (“whaaat?”) or 何だァ nandaa (“what is iiit?”). Therefore the similarity of the character use in the two Akira samples is only superficial and the pure numbers are somewhat misleading. On the other hand, it makes sense that an action-packed scene such as A2 contains fewer than half as many kanji as the courtroom dialogue in M2; in A2 there are more simple, colloquial words for which the hiragana spelling is more common, e.g. くそう kusou (“shit!”) or うるせェ urusee (“quiet!”), whereas technical terms such as 被告人 hikokunin (“defendant”) in M2 are more clearly and commonly expressed in kanji.

In the end, the old rule applies: only with a large number of sample texts, with a large size of each sample, and through a combination of several different metrics can such stylometric approaches possibly succeed.


Hiragana for stylometry?

The other day I was made aware that some things I said in an earlier blogpost, “Author dictionaries and lexical analysis for comics”, might be misleading. So let’s be clear: if you would like to find something out about the writing style of an author or text, it’s not the best idea to look at the frequently used nouns, kanji, or other units of high semantic content. Those are more useful for analysing the content, i.e. the topic(s), of texts. In stylometry, units with low semantic content, such as function words (the, a, it, etc.), are more attractive objects of study, as they can be used almost independently of the topic and often present writers with a choice of which word to use when. In other words, the same writer tends to use the same function words and may be identified by them. (In practice, though, a combination of different characteristics is used for analysis – see the Stylometry article at Wikipedia and the references there.)

In order to automatically separate function words from content words in a digital text, part-of-speech tagging software may be employed. For Japanese, there is e.g. Kuromoji. But isn’t there a simpler way? Can’t we make use of the kanji–kana distinction used in the aforementioned earlier blogpost? If we identified kanji as the semantically rich(er) units, wouldn’t it be sufficient to focus on the kana for stylometric analysis? Maybe, maybe not. The results would probably be poorer, for two main reasons:

  1. Every content word (noun, verb, adjective), even if usually written in kanji, may also be written in kana. For instance, 分かる (to understand) is more frequently spelled in hiragana only, わかる. So when we gather kana from a text, we might end up with unwanted content words.
  2. In inflectional suffixes, hiragana are dependent on the preceding kanji, and thus ultimately on the content of the text. For instance, a text on musical performance might contain many instances of the verb 引く hiku (to play an instrument), so one can expect the hiragana か ka, き ki, く ku, け ke and こ ko to occur more frequently than in other texts, as they are used for inflecting 引く.

That being said, why don’t we put this kana analysis method to the test anyway? Let’s take the example from Akira vol. 5, p. 16 again in which the scientists are talking (初めまして。スタンリー・シモンズ博士です etc.). We’ll focus on hiragana and ignore katakana, as they tend to be used for nouns too. Starting from those two panels, I manually counted these and the following hiragana until I reached 100. Here are the 5 most frequent hiragana in this set:

  • de: 8
  • i: 7
  • shi: 7
  • te: 7
  • no: 6
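I counted by hand, but with a digital text the same tally could be produced along these lines. Again a minimal sketch under assumptions of my own: the function name `hiragana_counts` is hypothetical, the hiragana range is a simplification, and small kana such as っ are counted as characters in their own right, so folding them into their full-size counterparts would need an extra mapping.

```python
from collections import Counter


def hiragana_counts(text, limit=100):
    """Tally the first `limit` hiragana in `text`, ignoring all other characters."""
    counts = Counter()
    seen = 0
    for ch in text:
        if "\u3041" <= ch <= "\u3096":  # hiragana letters (assumed range)
            counts[ch] += 1
            seen += 1
            if seen == limit:
                break
    return counts


# The opening line of the scene quoted above; far too short to be a real sample.
sample = "初めまして。スタンリー・シモンズ博士です"
counts = hiragana_counts(sample)
for kana, n in counts.most_common(5):
    print(kana, n)
```

The output uses the hiragana characters themselves rather than the romanised forms in the list above.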

That means, if this were a sufficiently large sample, in any other piece of text by Ōtomo, or at least within Akira, roughly 8% of its hiragana should be de, 7% should be i, etc. So I randomly picked another scene from Akira (vol. 3, pp. 125 ff.) and looked at the first 100 hiragana there. The 5 most frequently used hiragana from the previous example are used less often here, with the exception of i:

de, su, u, ru, se, da

  • de: 3
  • i: 8
  • shi: 1
  • te: 2
  • no: 3

In these pages in vol. 3, the most frequently used hiragana are mainly different ones, such as tsu (9 times, including small tsu), ga (6 times), o (5 times) and su (5 times). That, however, doesn’t tell us anything yet about the similarity of these two pieces of text (which I’m going to call “Akira 1” and “Akira 2” from here on). We need to add a third example, and for this purpose I’m going to use 100 hiragana from Miko Yasu’s Hakozume from the recently reviewed Morning magazine. If our method is successful, the differences between Hakozume and each of the two Akira scenes should be larger than those between Akira 1 and Akira 2. With frequency values for approximately 50 distinct hiragana we now have 3 × ~50 data points on which we could unleash the whole range of advanced statistical methods. But we’ll keep things simple by just adding up the differences in frequencies: Hakozume contains only 6 instances of de, i.e. 2 fewer than Akira 1; Hakozume uses i 3 times as opposed to 7 times in Akira 1, i.e. 4 fewer; Hakozume contains 6 fewer instances of shi than Akira 1; etc. Here’s the table of frequencies of de, i, shi, te and no in Hakozume:

a, no, na, n, de, a, no, ga…

  • de: 6
  • i: 3
  • shi: 1
  • te: 6
  • no: 8

The combined difference between Hakozume and Akira 1 for these 5 hiragana would be 2+4+6+1+2 = 15. For all ~50 different hiragana, the sum is 96.
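Adding up the differences in this way amounts to a sum of absolute differences between two frequency tables. As a minimal sketch (the function name is my own, and only the five hiragana listed in this post are included, not the full tables of about 50):

```python
def sum_of_differences(freq_a, freq_b):
    """Sum of absolute frequency differences over all kana found in either table."""
    kana = set(freq_a) | set(freq_b)
    # A kana missing from one table counts as occurring 0 times there.
    return sum(abs(freq_a.get(k, 0) - freq_b.get(k, 0)) for k in kana)


# Counts per 100 hiragana, taken from the lists in this post.
akira_1 = {"de": 8, "i": 7, "shi": 7, "te": 7, "no": 6}
hakozume = {"de": 6, "i": 3, "shi": 1, "te": 6, "no": 8}

print(sum_of_differences(akira_1, hakozume))  # 2 + 4 + 6 + 1 + 2 = 15
```

The measure is symmetric, so it doesn’t matter which of the two samples is named first.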

This looks like a large number, and indeed, when we calculate the difference between Akira 1 and Akira 2 in this way, the result is smaller: 82. This means that the two Akira chunks are more similar in their usage of hiragana than Hakozume and Akira 1.

However, we’re not done yet. We still need to compare Hakozume to Akira 2. The result of this comparison may come as a surprise: the sum of differences is also 82! So Akira 2 is as similar to Hakozume as it is to Akira 1. If our goal were to find out whether a given piece of text is taken from Akira or not, our method would fail if we used Akira 2 as our base text with which to compare all others.

ha, no, ki, ka, ra, ho, do, de, ki, wo…

Just to make sure, I took another 100 hiragana from a different random manga in the same issue of Morning, Rito Asami’s Ichikei no karasu. I’ll refer to Ichikei no karasu as Morning 2 from now on, and to Hakozume as Morning 1. The results of the comparisons are even ‘worse’: while the sum of differences between Morning 2 and Akira 2 is 98 – i.e. vastly different – the difference between Morning 2 and Akira 1 is only 74, i.e. very similar.

Frequency of all hiragana in each of the four 100-hiragana samples

In a way, the results do make sense though. We’re looking at dialogue, after all, and the way scientists (in Akira 1) speak is closer to that of lawyers (in Morning 2) than that of insurgent thugs (in Akira 2). And apparently, the conversation between the two policewomen (in Morning 1) is not unlike the latter.

As so often, we could now blame the unsatisfactory results on the small sample size – if we had used chunks of 1000 hiragana instead of 100, surely our attribution attempts would have been more successful? We’ll never find out (unless we obtain a complete digital copy of Akira and extract the hiragana automatically). Another way to improve results would be to tweak the methodology: using data mining algorithms, more elaborate metrics such as co-occurrence of several hiragana could be employed. In actual stylometric research, hiragana seem to be used in yet another metric – the ratio of all hiragana to all other characters (kanji, katakana, rōmaji).

Author dictionaries and lexical analysis for comics

Every once in a while I learn something at my day job that I think would be applicable to comics research too. For instance, in literary studies, dictionaries are compiled that contain all the words (or only the nouns, similar to an encyclopedia) used by a particular author, or even only those used in one single literary text. Think of it as a sort of commentary in a critical edition which explains references to real-world entities, or obscure words that aren’t used anymore, only separate from the source text and organised alphabetically.

Applying this method to comics, we would, of course, ignore all the images and lose the information they convey. On the other hand, looking at the words alone might yield interesting results. For instance, by comparing the frequency of words used in a particular comic to the frequency with which they occur in written language in general, we could test common hypotheses such as “author X uses word Y a lot”.

For comics of more than a few pages in length, it would be nice to automatically create a list of all the words in digital form (at least those in speech/thought bubbles and captions – sound effects and inscriptions/labels can be difficult to automatically recognise). Unless a script for the comic you’re interested in is already available, a straightforward (though not necessarily easy) way to get such a list would be to obtain digital images (e.g. scans) of the pages of the comic, then run Optical Character Recognition (OCR) software on them.

As an example, consider these two panels from Akira, in which a scientist is introduced to some colleagues:

Two panels from Katsuhiro Ōtomo’s Akira

The OCR software www.onlineocr.net recognises the text in the five speech bubbles like this:

  1. 初めまして
  2. スタンリー・
  3. よろしく
  4. ジョノレジュ
  5. 初めまして

As far as I can see, only two mistakes (ノレ instead of ル and ですノ instead of です) were made. Instead of focusing on nouns (for which there probably are detection algorithms for Japanese), it’s easier for now to just look at the kanji and filter out all hiragana and katakana characters. (While you can’t simply say that kanji represent nouns and kana represent other parts of speech, the idea here is that kanji tend to carry more semantic information than kana, which are often only used as inflectional suffixes.) That leaves us with the six kanji 初, 名, 前, 博, 士, and 初 again.

We can look up the frequency with which they occur in the Japanese language in general, e.g. the frequency rank at WWWJDIC:

  • 前: 27
  • 初: 152
  • 名: 177
  • 士: 526
  • 博: 794

i.e. 前 is the most frequent of the five, 博 the least frequent. Compare these ranks to the frequency with which they occur in our slim sample of two panels:

  • 初: 33% of all kanji
  • 前, 名, 士, 博: 17% each
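Filtering out the kana and working out these shares could be sketched as follows. This is a minimal sketch under my own assumptions: the function names are hypothetical, the kanji range covers only the basic CJK Unified Ideographs block, and the input string is a stand-in that merely reproduces the six kanji named above with some kana mixed in; it is not the actual OCR output.

```python
from collections import Counter


def extract_kanji(text):
    """Keep only kanji, dropping kana, punctuation and everything else."""
    return [ch for ch in text if "\u4e00" <= ch <= "\u9fff"]


def kanji_shares(kanji):
    """Share of each kanji among all kanji, in whole percent."""
    counts = Counter(kanji)
    total = sum(counts.values())
    return {k: round(100 * n / total) for k, n in counts.items()}


# Stand-in text (not the real OCR result): 初 occurs twice, the others once each.
stand_in = "初めまして名前博士初めまして"
kanji = extract_kanji(stand_in)
for k, share in kanji_shares(kanji).items():
    print(k, f"{share}%")
```

Because the shares are rounded to whole percent, they need not add up to exactly 100.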

What we can see here, if anything, is that two kanji, 士 and 博, are significantly more often used by Katsuhiro Ōtomo than by the average Japanese author. This doesn’t come as a surprise, as the compound 博士 signifies the academic title ‘Dr.’, which is the appropriate form of address for the scientists in this scene, whereas the other kanji 前, 初 and 名 are linked to names and introductions in general, and thus more often used in standard Japanese.

However, even if the frequency of 士 and 博 remained above-average if we analysed all of Akira’s over 2000 pages, that wouldn’t necessarily mean we had discovered a lexical characteristic of Ōtomo’s writing style. What it would tell us is that there is a subplot about scientists in Akira. Of course, topic analysis based on word frequency is nothing new. More interesting from a formal-lexical point of view would be if we discovered kanji used in different frequencies than we would expect with regard to the subject matter treated in Akira. In this situation it might be useful to look at synonyms: when Ōtomo had several options to express the same thing, why did he choose some words over others?

Panel detail from Akira by Katsuhiro Ōtomo

For instance, on the same page as the example above, the relatively infrequent (rank 920) kanji 栄 is used as part of the word “honour” in the expression “I’m honoured to meet you”. Instead, Ōtomo could have used the phrase “nice to meet you” for a third time, using the kanji 初 again, but he didn’t. Suppose there were a significant number of further instances of 栄 in Akira, maybe that would be a formal-stylistic choice, rather than one merely implied by the content of the comic?

I’m aware that all this is very hypothetical, and that looking at just a few panels doesn’t show anything, but if I wanted to analyse a comic in this way, I would basically go about it as described here, only with more scans. If you would like to learn more about this kind of analysis, I recommend Allen Riddell’s tutorial on “Feature selection: finding distinctive words”.

Top 10 words from Frederik L. Schodt

Cover of Frederik L. Schodt’s Dreamland Japan

Out of the many authors who publish on comics, Frederik L. Schodt is one of the few with a truly distinct writing style – neither academic nor fannish, neither highbrow nor colloquial, his writings are full of rather obscure words, some of which I have never seen anywhere else. Recently I re-read the beginning of his book Dreamland Japan, and while doing so, just for fun,* assembled this list of my favourite eccentric words therein and their meanings (as far as I could find out):

to accord – p. 19: “Japan is the first nation in the world to accord ‘comic books’ […] nearly the same social status as novels and films.” – to grant, to give.

bone-crushing – p. 28: “Yet along with this celebration of the ordinary is the bone-crushing reality that the vast majority of manga border on trash.” – back-breaking, depressing (cf. German: ‘erdrückend’).

hari-kari – p. 11: “in due time both words [manga and anime] will undoubtedly be listed in the standard English dictionary along with other Japanese imports like ‘hari-kari’ and ‘karaoke.’” – variant of harakiri (ritual suicide).

finicky – p. 13: “In Japan, people’s names are usually listed with the family name first and the given name last. Certain academic types in the English-speaking world are rather finicky about this convention and insist on preserving it even in English texts” – difficult to please, demanding.

to flounder – p. 34: “Japanese people have floundered about trying to find the right term to describe the sequential picture-panels that tell a story.” – to struggle.

full-figured – p. 26: “Japanese manga offer far more visual diversity than mainstream American comics, which […] still reveal an obsession with muscled males and full-figured females” – according to Wiktionary, ‘full-figured’ means ‘fat’ or ‘plump’, but here it’s probably used in the sense of ‘curvaceous’ or ‘voluptuous’.

persnickety – p. 14: “Fans of Japanese manga (even more than academics) can be a rather persnickety and unforgiving lot” – see finicky.

profuse – p. 15: “Profuse thanks are offered to all who helped.” – plenty, abundant.

raga-like – p. 14: “[…] with raga-like stories that may continue for thousands of pages” – maybe Schodt means, ‘as lengthy as an Indian epic (raga)’?

satori-like – p. 21: “his face lit up in a satori-like realization” – (Buddhist) enlightenment.

* On the other hand, this little exercise can also be seen as a tentative reflection on the serious topic of academic writing style and the way in which we, as scholars, communicate our findings.