Multivariate statistics: how to measure similarity between comics (or anything, really) based on several characteristics

In recent blogposts about stylometry (e.g. here), I skipped a bit of maths that, in hindsight, might be worth talking about. As it turns out, it’s actually both highly useful and easy to understand.

The examples used here are going to be the same as in the aforementioned post, i.e. 2 scenes from Katsuhiro Ōtomo’s Akira (vol. 5, p. 16 ff, which we’ll call A1, and vol. 3, p. 125 ff, which we’ll call A2) and 2 manga chapters from the October 11, 2018 issue of Morning magazine, Miko Yasu’s Hakozume (M1) and Rito Asami’s Ichikei no karasu (M2).

1 variable

Let’s say you want to compare these 4 comics based on 1 variable, e.g. the frequency of the hiragana character で de. (Which is not the most realistic stylometric indicator, but it will make more and more sense with an increasing number of variables.) Nothing easier than that. First, here are the numbers of で de per 100 hiragana for each text:

  • A1: 8
  • A2: 3
  • M1: 6
  • M2: 7

By subtracting the numbers from each other and ignoring the sign, we get the difference between any pair of manga, which serves as a measure of their similarity: the smaller the difference, the more similar the pair. Ranked from smallest difference to largest, these would be:

  • A1/M2: 1
  • M1/M2: 1
  • A1/M1: 2
  • A2/M1: 3
  • A2/M2: 4
  • A1/A2: 5

So the two Morning manga and one of the Akira scenes can be said to be similar, while the other Akira scene is the odd one out.
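
For those who would rather let the computer do the subtracting, here is a minimal Python sketch of the ranking above. The names `de_freq` and `pairwise_differences` are made up for this illustration; only the four frequencies come from the list in this section.

```python
from itertools import combinations

# Occurrences of で (de) per 100 hiragana, as listed above.
de_freq = {"A1": 8, "A2": 3, "M1": 6, "M2": 7}

def pairwise_differences(freq):
    """Return (pair, absolute difference) tuples, smallest difference first."""
    diffs = [
        (f"{a}/{b}", abs(freq[a] - freq[b]))
        for a, b in combinations(freq, 2)
    ]
    # sorted() is stable, so tied pairs keep their original order.
    return sorted(diffs, key=lambda item: item[1])

ranking = pairwise_differences(de_freq)
for pair, diff in ranking:
    print(pair, diff)
```

Taking the absolute value means it doesn't matter which manga of a pair is subtracted from which.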

2 variables

With 2 variables, it gets more interesting. Let’s assume you decide that the similarity of these manga is best based on their use of the hiragana で de and い i. The frequencies for the latter are:

  • A1: 7
  • A2: 8
  • M1: 3
  • M2: 2

On a side note, at this point it might be a good idea to think about normalisation: are the numbers of the two variables comparable, so that a difference of e.g. “2” carries the same weight for both characteristics? In our example, this is not a problem because we’re dealing with two hiragana frequencies measured on the same scale. But if your two variables are e.g. the total number of kana characters per chapter and the shoe size of the author, the former will probably have much more impact on the similarity scores than the latter, because its range of numbers is wider. To avoid that, you would have to adjust the scale of the variables, unless this unequal impact is precisely what you want.
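
One common way of adjusting the scale is min-max scaling, which squeezes every variable into the range from 0 to 1. This is just one option picked for illustration, not something the analysis in this post relies on, and the function name `min_max_scale` is made up. A rough sketch, using the で de frequencies from the previous section as input:

```python
def min_max_scale(values):
    """Rescale numbers linearly so that the minimum maps to 0.0 and the maximum to 1.0."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values identical: nothing to stretch, so return zeros.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

# で (de) frequencies for A1, A2, M1, M2 from the previous section.
de_scaled = min_max_scale([8, 3, 6, 7])
print(de_scaled)
```

If every variable is rescaled like this before the distances are calculated, a variable with a wide range of numbers no longer dominates one with a narrow range.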

Anyway, now we have 4 pairs of values, (8/7), (3/8), (6/3) and (7/2), which we could plot as points on an x and a y axis.

To calculate the distance between any two of these points (i.e. the similarity of two manga), you’ll probably want to use Pythagoras and his a² + b² = c² formula, a.k.a. the Euclidean distance, with ‘a’ and ‘b’ representing the horizontal and vertical distances and ‘c’ being the diagonal line we’re looking for. There’s nothing wrong with that, but it might surprise you that in actual statistics and stylometrics, there are several other ways of measuring this distance. However, we’re going to stick with good old Pythagoras here.

The distance between A1 (で de: 8 / い i: 7) and A2 (3/8), for instance, would be the square root of the sum of (8-3)² and (7-8)², which is approximately 5.1. All distances, ranked from lowest to highest, would be (rounded to one decimal):

  • M1/M2: 1.4
  • A1/M1: 4.5
  • A1/A2: 5.1
  • A1/M2: 5.1
  • A2/M1: 5.8
  • A2/M2: 7.2

Now the two Akira excerpts rank as more similar than they did when the similarity was based only on the frequency of で de, and the two Morning manga are closer to each other than either of them is to the first Akira excerpt.
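
The Pythagoras calculation for all six pairs can be sketched in a few lines of Python. `math.hypot` is a standard library function that returns the square root of a² + b²; the names `points` and `distance_2d` are made up for this example.

```python
import math
from itertools import combinations

# (で de, い i) per 100 hiragana for each sample, from the lists above.
points = {"A1": (8, 7), "A2": (3, 8), "M1": (6, 3), "M2": (7, 2)}

def distance_2d(p, q):
    """Euclidean distance between two points in the plane (Pythagoras)."""
    a = p[0] - q[0]  # horizontal difference
    b = p[1] - q[1]  # vertical difference
    return math.hypot(a, b)  # sqrt(a**2 + b**2)

distances = {
    f"{m}/{n}": distance_2d(points[m], points[n])
    for m, n in combinations(points, 2)
}
for pair, d in sorted(distances.items(), key=lambda item: item[1]):
    print(f"{pair}: {d:.1f}")
```

Because the differences are squared, their sign doesn't matter here either.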

3 variables

Just as you imagine two points in 2-dimensional space forming two corners of a right-angled triangle (see above), in 3-dimensional space you have to imagine a rectangular cuboid – a ‘box’ (see the illustration on Wikipedia). Apparently, how to calculate the distance between the two opposite corner points of a cuboid is something you learn in high school, but I couldn’t remember and had to look it up. The formula for distance ‘d’ is: d² = a² + b² + c².

As our third variable, we’re going to use the frequency of the hiragana し shi. In the following list, the number of し shi per 100 hiragana is added as the third coordinate to each manga:

  • A1 (8/7/7)
  • A2 (3/8/1)
  • M1 (6/3/1)
  • M2 (7/2/5)

For instance, the distance between A1 and A2 is the square root of: (8-3)² + (7-8)² + (7-1)², i.e. roughly 7.9. Here are all the distances:

  • M1/M2: 4.2
  • A1/M2: 5.5
  • A2/M1: 5.8
  • A1/M1: 7.5
  • A1/A2: 7.9
  • A2/M2: 8.2

As we can see, the main difference between this similarity ranking and the previous one is that the similarity between the two Akira scenes has become smaller.

4 variables

You might have guessed it by now: even though it gets harder to imagine (and even more so to illustrate) a space of more than 3 dimensions, we can apply more or less the same formula regardless of the number of variables. We only need to add a new summand/addend for each new variable. For 4 variables, the distance between two points would be the square root of (a² + b² + c² + d²), where ‘d’ now stands for the difference in the fourth variable rather than for the distance itself. These are the distances if we add the hiragana て te (which occurs 7 times per 100 hiragana in A1, 2 times in A2, 6 in M1, 4 in M2) as the 4th dimension:

  • M1/M2: 4.7
  • A1/M2: 6.2
  • A2/M1: 7.1
  • A1/M1: 7.5
  • A2/M2: 8.5
  • A1/A2: 9.3

Note how the changes become smaller now – apart from the last two pairs having swapped places, the similarity ranking is the same as before.
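
Since the formula only grows by one summand per variable, a single function can handle any number of dimensions. The following sketch assumes Python 3.8 or later, where the standard library offers `math.dist` for exactly this purpose; `samples` and `ranked_distances` are names made up for this example. It compares the rankings for the first 3 and for all 4 variables.

```python
import math
from itertools import combinations

# Frequencies per 100 hiragana in the order (で de, い i, し shi, て te).
samples = {
    "A1": (8, 7, 7, 7),
    "A2": (3, 8, 1, 2),
    "M1": (6, 3, 1, 6),
    "M2": (7, 2, 5, 4),
}

def ranked_distances(data, n_vars):
    """Rank all pairs by Euclidean distance over the first n_vars variables."""
    pairs = [
        (f"{a}/{b}", math.dist(data[a][:n_vars], data[b][:n_vars]))
        for a, b in combinations(data, 2)
    ]
    return sorted(pairs, key=lambda item: item[1])

for n in (3, 4):
    order = [pair for pair, _ in ranked_distances(samples, n)]
    print(n, "variables:", order)
```

Adding a 5th or a 25th variable would only mean making the tuples longer; the function itself stays the same.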

dialogue in Katsuhiro Ōtomo’s Akira (A1, left) vs. dialogue in Miko Yasu’s Hakozume (M1, right)

25 variables

So how about 25 hiragana frequencies? This is more than half of all the different hiragana in our (100-hiragana samples of the) four manga. I added 21 random hiragana (see the graph) to the 4 from the previous section, and these are the resulting distances:

  • A1/M2: 9.7
  • A2/M1: 11.0
  • A1/A2: 12.5
  • M1/M2: 13.0
  • A2/M2: 13.3
  • A1/M1: 14.7

Who would have thought that? Now it looks as if the ‘scientists’ scene from Akira (A1) is similar to Ichikei no karasu (M2), and the ‘insurgent thugs’ scene from Akira (A2) is similar to Hakozume (M1). Which is what we suspected all along. So who knows, maybe we can do away with all this maths stuff after all? However, the usual caveat applies: proper stylometry should really be based on larger samples than 100 characters per text.


Kanji-kana ratio for stylometry?

I ended my blogpost on hiragana frequency as a stylometric indicator with the remark that, rather than the frequency distribution of different hiragana in the text, the ratio of kana to kanji is used as one of several key characteristics in actual stylometric analysis of Japanese texts. I was curious to find out if this number alone could tell us something about the 4 manga text samples in question (2 randomly selected scenes from Katsuhiro Ōtomo’s Akira and 2 series from Morning magazine, Miko Yasu’s Hakozume and Rito Asami’s Ichikei no karasu – in the following text referred to as A1, A2, M1 and M2, respectively). My intuition was that the results wouldn’t be meaningful because the samples were too small, but let’s see:

This time I chose a sample size of 200 characters (hiragana, katakana, and kanji) per text.
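
Counting by hand works for 200 characters, but for larger samples one could let a script sort the characters by their Unicode block. The following is only a rough sketch and not how the numbers below were obtained: the function names are made up, the kanji range covers only the basic CJK Unified Ideographs block, and the katakana block also contains marks such as ー and ・, which you may or may not want to count as katakana.

```python
from collections import Counter

def script_of(ch):
    """Classify one character by Unicode block (a simplification)."""
    code = ord(ch)
    if 0x3040 <= code <= 0x309F:
        return "hiragana"
    if 0x30A0 <= code <= 0x30FF:
        return "katakana"  # includes ー and ・
    if 0x4E00 <= code <= 0x9FFF:
        return "kanji"  # basic CJK Unified Ideographs only
    return "other"

def count_scripts(text, limit=200):
    """Count scripts among the first `limit` kana/kanji characters of text."""
    counts = Counter()
    for ch in text:
        script = script_of(ch)
        if script == "other":
            continue  # skip punctuation, Latin letters, digits, spaces
        counts[script] += 1
        if sum(counts.values()) >= limit:
            break
    return counts

# A short test string, for illustration only.
example = count_scripts("何だァ! うるせェ")
print(example)
```

For the test string, this counts 4 hiragana, 2 katakana and 1 kanji.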

Among the first 200 characters in A1 (i.e. Akira vol. 5, p. 16), there are 113 hiragana, 42 katakana and 45 kanji. This results in a kanji-kana ratio of 45 : (113 + 42) = 0.29.

In A2 (Akira vol. 3, pp. 125 ff.), the first 200 characters comprise 126 hiragana, 34 katakana, and 40 kanji, i.e. the kanji-kana ratio is 0.25.

In M1, there are 122 hiragana, 9 katakana, and 69 kanji, resulting in a kanji-kana ratio of 0.53.

In M2, there are 117 hiragana, 0 katakana, and 83 kanji, resulting in a kanji-kana ratio of 0.71!
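
The four ratios can be recalculated from the counts with a short Python sketch. The names `counts` and `kanji_kana_ratio` are made up for this example; the numbers are the ones given above.

```python
# (hiragana, katakana, kanji) counts among the first 200 characters.
counts = {
    "A1": (113, 42, 45),
    "A2": (126, 34, 40),
    "M1": (122, 9, 69),
    "M2": (117, 0, 83),
}

def kanji_kana_ratio(hiragana, katakana, kanji):
    """Number of kanji divided by the number of kana (hiragana + katakana)."""
    return kanji / (hiragana + katakana)

ratios = {name: kanji_kana_ratio(*c) for name, c in counts.items()}
for name, ratio in ratios.items():
    print(f"{name}: {ratio:.2f}")
```

Note that with this definition, a text with as many kanji as kana would score 1.0, and a text without any kanji would score 0.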

6 hiragana, 2 katakana, 3 kanji in A2 (Akira vol. 3, p. 125).

Thus this time the authorship attribution seems to have worked: the two Ōtomo samples have an almost identical score, whereas those of the two Morning samples are completely different. Interestingly, this result contradicts the interpretation from the earlier blogpost in which I had suggested that the scientists in Akira and the lawyers in Karasu have similar ways of talking. The difference in the kanji-kana ratio between Akira and the two Morning manga, though, is explained not only through the more frequent use of kanji in the latter, but also through the vast differences in katakana usage (note that only characters in proper word balloons, i.e. dialogue, are counted, not sound effects).

Ōtomo uses katakana for two different purposes: in A1 mainly to reproduce the names of the foreign researchers, and in A2 to stretch syllables otherwise written in hiragana at the end of words, e.g. なにィ nanii (“whaaat?”) or 何だァ nandaa (“what is iiit?”). Therefore the similarity of the character use in the two Akira samples is superficial only and the pure numbers somewhat misleading. On the other hand, it makes sense that an action-packed scene such as A2 contains less than half as many kanji as the courtroom dialogue in M2; in A2 there are more simple, colloquial words for which the hiragana spelling is more common, e.g. くそう kusou (“shit!”) or うるせェ urusee (“quiet!”), whereas technical terms such as 被告人 hikokunin (“defendant”) in M2 are more clearly and commonly expressed in kanji.

In the end, the old rule applies: only with a large number of sample texts, with a large size of each sample, and through a combination of several different metrics can such stylometric approaches possibly succeed.


Artifacts from Japan, part 5: Morning #43, 2018

Two years ago I already introduced another original Japanese manga magazine here, Weekly Young Jump, but I don’t want to give the impression that all manga magazines in Japan are like that. So here’s a look at a magazine that is also filed under seinen (i.e. targeted towards young adult men), but much more mature.

Language: Japanese
Authors: various
Publisher: Kōdansha
Pages: 400
Price: ¥370 ($3.30 / €2.85)
Website: http://morning.moae.jp/ (Japanese)

Morning (or “Weekly Morning” according to Wikipedia, but the word “Weekly” is not on the cover as far as I have seen) is not quite as widely read as Young Jump, but its circulation (well over 100,000 copies per issue) is still huge compared to Western comic magazines. In the past, Morning has run famous manga series such as Gon, Planetes, Space Brothers, and Vagabond.

The copy of the issue at hand (dated October 11, but actually published two weeks earlier) has the same dimensions as Young Jump and the same printing quality (or lack thereof), but already on the outside, the content is quite different: instead of an erotic photograph, there’s a cover image that actually refers to one of the manga inside – グラゼニ / Gurazeni by Yūji Moritaka and Keiji Adachi, a baseball series that seems to be relatively popular in Japan. Inside there is very little editorial content apart from a 4-page interview with Moritaka and film director Hitoshi Ōne.

Which brings us to the manga in this issue. There are roughly 20 chapters of 18 pages on average, and these are the more noteworthy ones apart from Gurazeni:

A particularly striking page from Toshiya Higashimoto’s Theseus no fune.

As you can perhaps see from these short descriptions, most of the manga in Morning are set in the real world rather than some fantasy or science fiction setting. Considering Morning and Young Jump alone, the vast variety of manga within the seinen demographic becomes palpable – a variety hardly represented by the few of these titles that have been published in the West.