Multivariate statistics: how to measure similarity between comics (or anything, really) based on several characteristics

In recent blogposts about stylometry (e.g. here), I skipped a bit of maths that, in hindsight, might be worth talking about. As it turns out, it’s actually both highly useful and easy to understand.

The examples used here are going to be the same as in the aforementioned post, i.e. 2 scenes from Katsuhiro Ōtomo’s Akira (vol. 5, p. 16 ff, which we’ll call A1, and vol. 3, p. 125 ff, which we’ll call A2) and 2 manga chapters from the October 11, 2018 issue of Morning magazine, Miko Yasu’s Hakozume (M1) and Rito Asami’s Ichikei no karasu (M2).

1 variable

Let’s say you want to compare these 4 comics based on 1 variable, e.g. the frequency of the hiragana character で de. (Which is not the most realistic stylometric indicator, but it will make more and more sense with an increasing number of variables.) Nothing easier than that. First, here are the numbers of で de per 100 hiragana for each text:

  • A1: 8
  • A2: 3
  • M1: 6
  • M2: 7

By simply subtracting the numbers from each other, we get the difference between any pair of manga and thus their similarity. Ranked from smallest difference to largest, these would be:

  • A1/M2: 1
  • M1/M2: 1
  • A1/M1: 2
  • A2/M1: 3
  • A2/M2: 4
  • A1/A2: 5

So the two Morning manga and one of the Akira scenes can be said to be similar, while the other Akira scene is the odd one out.
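The ranking above is easy to reproduce in a few lines of Python. This is only a minimal sketch: the dictionary name de_freq is made up for illustration, and the numbers are the four frequencies listed above.

```python
from itertools import combinations

# Occurrences of で (de) per 100 hiragana, as listed above.
de_freq = {"A1": 8, "A2": 3, "M1": 6, "M2": 7}

# Absolute difference for every pair of texts, smallest difference first.
differences = sorted(
    (abs(de_freq[a] - de_freq[b]), a, b) for a, b in combinations(de_freq, 2)
)

for diff, a, b in differences:
    print(f"{a}/{b}: {diff}")
```

Taking the absolute value means it doesn't matter which of the two numbers is subtracted from which.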

2 variables

With 2 variables, it gets more interesting. Let’s assume you decide that the similarity of these manga is best based on their use of the hiragana で de and い i. The frequencies for the latter are:

  • A1: 7
  • A2: 8
  • M1: 3
  • M2: 2

On a side note, at this point it might be a good idea to think about normalisation: are the numbers of the two variables comparable, so that a difference of e.g. “2” carries the same weight for both characteristics? In our example, this is not a problem because we’re dealing with two hiragana frequencies measured on the same scale. But if your two variables are e.g. the total number of kana characters per chapter and the shoe size of the author, the former will probably have much more impact on the similarity scores than the latter, because its range of numbers is wider – unless you adjust the scale of the variables, or unless this unequal impact is precisely what you want.
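What “adjusting the scale” could look like: one common option (my assumption here, there are others) is to convert each variable to z-scores, so that every variable ends up with mean 0 and standard deviation 1. The helper name standardise is hypothetical, and the sketch simply reuses our two hiragana frequencies, even though they don’t strictly need it.

```python
from statistics import mean, pstdev

def standardise(values):
    """Rescale values to mean 0 and standard deviation 1 (z-scores)."""
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation
    return [(v - mu) / sigma for v in values]

# Frequencies per 100 hiragana for A1, A2, M1, M2 (from the lists above).
de = [8, 3, 6, 7]  # で de
i = [7, 8, 3, 2]   # い i

de_z = standardise(de)
i_z = standardise(i)

for name, d, k in zip(["A1", "A2", "M1", "M2"], de_z, i_z):
    print(f"{name}: de = {d:+.2f}, i = {k:+.2f}")
```

After this step, a difference of “1” means “one standard deviation” for either variable, regardless of the original units.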

Anyway, now we have 4 pairs of values, (8/7), (3/8), (6/3) and (7/2), which we could plot as points on an x and a y axis, with one hiragana frequency per axis.

To calculate the distance between any two of these points (i.e. the similarity of two manga), you’ll probably want to use Pythagoras and his a² + b² = c² formula, a.k.a. the Euclidean distance, with ‘a’ and ‘b’ representing the horizontal and vertical distances and ‘c’ being the diagonal line we’re looking for. There’s nothing wrong with that, but it might surprise you that in actual statistics and stylometrics, there are several other ways of measuring this distance. However, we’re going to stick with good old Pythagoras here.

The distance between A1 (で de: 8 / い i: 7) and A2 (3/8), for instance, would be the square root of the sum of (8-3)² and (7-8)², which is approximately 5.1. All distances, ranked from lowest to highest, would be (rounded to one decimal):

  • M1/M2: 1.4
  • A1/M1: 4.5
  • A1/A2: 5.1
  • A1/M2: 5.1
  • A2/M1: 5.8
  • A2/M2: 7.2

Relative to the other pairs, the two Akira excerpts now appear more similar than when the similarity was based only on the frequency of で de: their pair has moved from last place to the middle of the ranking. At the same time, the two Morning manga are now clearly more similar to each other than the first Akira excerpt is to either of them.
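For anyone who wants to check the 2-variable distances, here is a short sketch. It assumes Python 3.8 or later, where math.dist computes the Euclidean distance between two points; the dictionary names are made up for illustration.

```python
from itertools import combinations
from math import dist

# (で de, い i) per 100 hiragana, taken from the two lists above.
points = {"A1": (8, 7), "A2": (3, 8), "M1": (6, 3), "M2": (7, 2)}

# Euclidean distance for every pair of texts, i.e. sqrt(a² + b²).
distances = {
    f"{a}/{b}": dist(points[a], points[b]) for a, b in combinations(points, 2)
}

# Print from most similar (smallest distance) to least similar.
for pair, d in sorted(distances.items(), key=lambda item: item[1]):
    print(f"{pair}: {d:.1f}")
```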

3 variables

Just as you imagine two points in 2-dimensional space forming two corners of a right-angled triangle (see above), in 3-dimensional space you have to imagine a rectangular cuboid – a ‘box’ (see the illustration on Wikipedia). Apparently, how to calculate the distance between the two opposite corner points of a cuboid is something you learn in high school, but I couldn’t remember and had to look it up. The formula for distance ‘d’ is: d² = a² + b² + c².

As our third variable, we’re going to use the frequency of the hiragana し shi. In the following list, the number of し shi per 100 hiragana is added as the third coordinate to each manga:

  • A1 (8/7/7)
  • A2 (3/8/1)
  • M1 (6/3/1)
  • M2 (7/2/5)

For instance, the distance between A1 and A2 is the square root of: (8-3)² + (7-8)² + (7-1)², i.e. roughly 7.9. Here are all the distances:

  • M1/M2: 4.2
  • A1/M2: 5.5
  • A2/M1: 5.8
  • A1/M1: 7.5
  • A1/A2: 7.9
  • A2/M2: 8.2

As we can see, the main difference between this similarity ranking and the previous one is that the similarity between the two Akira scenes has become smaller.

4 variables

You might have guessed it by now: even though it gets harder to imagine (and even more so to illustrate) a space of more than 3 dimensions, we can apply more or less the same formula regardless of the number of variables. We only need to add a new summand/addend for each new variable. For 4 variables, the distance between two points would be the square root of (a² + b² + c² + d²). These are the distances if we add the hiragana て te (which occurs 7 times per 100 hiragana in A1, 2 times in A2, 6 in M1, 4 in M2) as the 4th dimension:

  • M1/M2: 4.7
  • A1/M2: 6.2
  • A2/M1: 7.1
  • A1/M1: 7.5
  • A2/M2: 8.5
  • A1/A2: 9.3

Note how the changes become smaller now – apart from the last two pairs having swapped places, the similarity ranking is the same as before.
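Since the formula only grows by one summand per variable, a single function can handle any number of dimensions. The following is a sketch under my own naming (euclidean, ranking and texts are made up for illustration), using the four frequencies per text given so far.

```python
from itertools import combinations
from math import sqrt

# Frequencies per 100 hiragana in the order (で de, い i, し shi, て te).
texts = {
    "A1": (8, 7, 7, 7),
    "A2": (3, 8, 1, 2),
    "M1": (6, 3, 1, 6),
    "M2": (7, 2, 5, 4),
}

def euclidean(p, q):
    """Square root of the summed squared differences, for any number of variables."""
    return sqrt(sum((x - y) ** 2 for x, y in zip(p, q)))

def ranking(n_vars):
    """Pairs sorted from most to least similar, using only the first n_vars variables."""
    dists = {
        f"{a}/{b}": euclidean(texts[a][:n_vars], texts[b][:n_vars])
        for a, b in combinations(texts, 2)
    }
    return sorted(dists.items(), key=lambda item: item[1])

for n in (3, 4):
    print(f"{n} variables:")
    for pair, d in ranking(n):
        print(f"  {pair}: {d:.1f}")
```

Adding a 5th or 25th variable only means appending another number to each tuple; the function itself stays the same.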

dialogue in Katsuhiro Ōtomo’s Akira (A1, left) vs. dialogue in Miko Yasu’s Hakozume (M1, right)

25 variables

So how about 25 hiragana frequencies? This is more than half of all the different hiragana in our (100-hiragana samples of the) four manga. I added 21 random hiragana (see the graph) to the 4 from the previous section, and these are the resulting distances:

  • A1/M2: 9.7
  • A2/M1: 11.0
  • A1/A2: 12.5
  • M1/M2: 13.0
  • A2/M2: 13.3
  • A1/M1: 14.7

Who would have thought that? Now it looks as if the ‘scientists’ scene from Akira (A1) is similar to Ichikei no karasu (M2), and the ‘insurgent thugs’ scene from Akira (A2) is similar to Hakozume (M1). Which is what we suspected all along. So who knows, maybe we can do away with all this maths stuff after all? However, the usual caveat applies: proper stylometry should really be based on larger samples than 100 characters per text.


Flesch reading ease for stylometry?

The Flesch reading-ease score (FRES, also called FRE – ‘Flesch Reading Ease’) is still a popular measurement for the readability of texts, despite some criticism and suggestions for improvement since it was first proposed by Rudolf Flesch in 1948. (I’ve never read his original paper, though; all my information is taken from Wikipedia.) On a scale from 0 to 100, it indicates how difficult it is to understand a given text based on sentence length and word length, with a low score meaning difficult to read and a high score meaning easy to read.

Sentence length and word length are also popular factors in stylometry, the idea here being that some authors (or, generally speaking, kinds of text) prefer longer sentences and/or words while others prefer shorter ones. Thus such scores based on sentence length and word length might serve as an indicator of how similar two given texts are. In fact, FRES is used in actual stylometry, albeit only as one factor among many (e.g. in Brennan, Afroz and Greenstadt 2012 (PDF)). Compared with other stylometric indicators, FRES would have the added benefit that it actually says something in itself about the text, rather than being merely a number that only means something in relation to another.

The original FRES formula was developed for English and has been modified for other languages. In the last few stylometry blogposts here, the examples were taken from Japanese manga, but FRES is not well suited for Japanese. The main reason is that syllables don’t play much of a role in Japanese readability. More important factors are the number of characters and the ratio of kanji, as the number of syllables per character varies. A two-kanji compound, for instance, can have fewer syllables than a single-kanji word (e.g. 部長 bu‧chō ‘head of department’ vs. 力 chi‧ka‧ra ‘power’). Therefore, we’re going to use our old English-language X-Men examples from 2017 again.

The comics in question are: Astonishing X-Men #1 (1995) written by Scott Lobdell, Ultimate X-Men #1 (2001) written by Mark Millar, and Civil War: X-Men #1 (2006) written by David Hine. Looking at just the opening sequence of each comic (see the previous X-Men post for some images), we get the following sentence / word / syllable counts:

  • AXM: 3 sentences, 68 words, 100 syllables.
  • UXM: 6 sentences, 82 words, 148 syllables.
  • CW:XM: 7 sentences, 79 words, 114 syllables.

We don’t even need to use Flesch’s formula to get an idea of the readability differences: the sentences in AXM are really long and those in CW:XM are much shorter. As for word length, UXM stands out with rather long words such as “unconstitutional”, which is reflected in the high ratio of syllables per word.

Applying the formula (cf. Wikipedia), we get the following FRESs:

  • AXM: 59.4
  • UXM: 40.3
  • CW:XM: 73.3

Who would have thought that! It looks like UXM (or at least the selected portion) is harder to read than AXM – a FRES of 40.3 is already ‘College’ level according to Flesch’s table.
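For reference, the formula as commonly given (e.g. on Wikipedia) is 206.835 − 1.015 × (words per sentence) − 84.6 × (syllables per word). Here is a minimal sketch that applies it to the three counts above; the function name flesch_reading_ease is my own choice, and the counting of sentences, words and syllables is assumed to have been done by hand beforehand.

```python
def flesch_reading_ease(sentences, words, syllables):
    """Flesch reading-ease score from raw sentence, word and syllable counts."""
    return 206.835 - 1.015 * (words / sentences) - 84.6 * (syllables / words)

# (sentences, words, syllables) for the three opening sequences.
counts = {
    "AXM": (3, 68, 100),
    "UXM": (6, 82, 148),
    "CW:XM": (7, 79, 114),
}

scores = {name: round(flesch_reading_ease(*c), 1) for name, c in counts.items()}

for name, score in scores.items():
    print(f"{name}: {score}")
```

The two ratios in the formula are exactly the two things we eyeballed above: average sentence length and average word length.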

But how do these numbers help us if we’re interested in stylometric similarity? All three texts are written by different writers. So far we could only say (again – based on an insufficiently sized sample) that Hine’s writing style is closer to Lobdell’s than to Millar’s. The ultimate test for a stylometric indicator would be to take an additional example text that is written by one of the three authors, and see if its FRES is close to the one from the same author’s X-Men text.

Our 4th example will be the rather randomly selected Nemesis by Millar (2010, art by Steve McNiven) from which we’ll also take all text from the first few panels.

3 panels from Nemesis by Mark Millar and Steve McNiven

Part of the opening scene from Nemesis.

These are the numbers for the selected text fragment from Nemesis:

  • 8 sentences, 68 words, 88 syllables.
  • This translates to a FRES of 88.7!

In other words, Nemesis and UXM, the two comics written by Millar, appear to be the most dissimilar of the four! However, that was to be expected. Millar would be a poor writer if he always applied the same style to each character in each scene. And the two selected scenes are very different: a TV news report in UXM in contrast to a dialogue (or perhaps more like the typical villain’s monologue) in Nemesis.

Interestingly, there is a TV news report scene in Nemesis too (Part 3, p. 3). Wouldn’t that make for a more suitable comparison?

Here are the numbers for this TV scene which I’ll call N2:

  • 4 sentences, 81 words, 146 syllables.
  • FRES: 33.8

Now this looks more like Millar’s writing from UXM: the difference between the two scores is so small (6.5) that they can be said to be almost identical.
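To see how this gap compares with all the others, we can treat FRES as a single variable and rank every pair of texts by the difference between their scores. A rough sketch, using the five scores reported in this post (the dictionary name fres and the label “Nemesis” for the first Nemesis excerpt are my own shorthand):

```python
from itertools import combinations

# FRES values as reported in this post (N2 is the TV news scene from Nemesis).
fres = {"AXM": 59.4, "UXM": 40.3, "CW:XM": 73.3, "Nemesis": 88.7, "N2": 33.8}

# Absolute score difference for every pair, smallest gap first.
gaps = sorted(
    (round(abs(fres[a] - fres[b]), 1), a, b) for a, b in combinations(fres, 2)
)

for gap, a, b in gaps:
    print(f"{a}/{b}: {gap}")
```

With only one variable, this is the same kind of comparison as subtracting hiragana frequencies from each other.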

Still, we haven’t really proven anything yet. One possible interpretation of the scores is that the ~30-40 range is simply the usual range for this type of text, i.e. TV news reports. So perhaps these scores are not specific to Millar (or even to comics). One would have to look at similar scenes by Lobdell, Hine and/or other writers to verify that, and ideally also at real-world news transcripts.

On the other hand, one thing has worked well: two texts that we had intuitively identified as similar – UXM and N2 – indeed showed similar Flesch scores. That means FRES is not only a measurement of readability but also of stylometric similarity – albeit a rather crude one which is, as always, best used in combination with other metrics.


Kanji-kana ratio for stylometry?

I ended my blogpost on hiragana frequency as a stylometric indicator with the remark that, rather than the frequency distribution of different hiragana in the text, the ratio of kana to kanji is used as one of several key characteristics in actual stylometric analysis of Japanese texts. I was curious to find out if this number alone could tell us something about the 4 manga text samples in question (2 randomly selected scenes from Katsuhiro Ōtomo’s Akira and 2 series from Morning magazine, Miko Yasu’s Hakozume and Rito Asami’s Ichikei no karasu – in the following text referred to as A1, A2, M1 and M2, respectively). My intuition was that the results wouldn’t be meaningful because the samples were too small, but let’s see:

This time I chose a sample size of 200 characters (hiragana, katakana, and kanji) per text.

Among the first 200 characters in A1 (i.e. Akira vol. 5, p. 16), there are 113 hiragana, 42 katakana and 45 kanji. This results in a kanji-kana ratio of 45 : (113 + 42) = 0.29.

In A2 (Akira vol. 3, pp. 125 ff.), the first 200 characters comprise 126 hiragana, 34 katakana, and 40 kanji, i.e. the kanji-kana ratio is 0.25.

In M1, there are 122 hiragana, 9 katakana, and 69 kanji, resulting in a kanji-kana ratio of 0.53.

In M2, there are 117 hiragana, 0 katakana, and 83 kanji, resulting in a kanji-kana ratio of 0.71!
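Counting by hand is tedious, so here is a sketch of how the counting and the ratio could be automated. It rests on assumptions of mine: characters are classified by Unicode block (hiragana U+3041–U+309F, katakana U+30A0–U+30FF, kanji U+4E00–U+9FFF), which means that marks inside the katakana block such as ー or ・ count as katakana, and everything else (punctuation, Latin letters, digits) is ignored. The function names are made up for illustration.

```python
def classify(char):
    """Return 'hiragana', 'katakana', 'kanji' or None for a single character."""
    code = ord(char)
    if 0x3041 <= code <= 0x309F:
        return "hiragana"
    if 0x30A0 <= code <= 0x30FF:
        return "katakana"  # assumption: includes marks like ー and ・
    if 0x4E00 <= code <= 0x9FFF:
        return "kanji"
    return None  # punctuation, Latin letters, digits etc. are ignored

def count_scripts(text):
    """Count hiragana, katakana and kanji characters in a string."""
    counts = {"hiragana": 0, "katakana": 0, "kanji": 0}
    for char in text:
        script = classify(char)
        if script is not None:
            counts[script] += 1
    return counts

def kanji_kana_ratio(counts):
    """Number of kanji divided by the number of kana (hiragana + katakana)."""
    return counts["kanji"] / (counts["hiragana"] + counts["katakana"])

# Counts for the first 200 characters of each sample, as given above.
samples = {
    "A1": {"hiragana": 113, "katakana": 42, "kanji": 45},
    "A2": {"hiragana": 126, "katakana": 34, "kanji": 40},
    "M1": {"hiragana": 122, "katakana": 9, "kanji": 69},
    "M2": {"hiragana": 117, "katakana": 0, "kanji": 83},
}

for name, counts in samples.items():
    print(f"{name}: {kanji_kana_ratio(counts):.2f}")

# The counting function applied to a single word from A2.
print(count_scripts("何だァ"))
```

A text without any kana at all would cause a division by zero here, which is not an issue for samples of this size but worth guarding against in real use.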

6 hiragana, 2 katakana, 3 kanji in A2 (Akira vol. 3, p. 125).

Thus this time the authorship attribution seems to have worked: the two Ōtomo samples have an almost identical score, whereas those of the two Morning samples are completely different. Interestingly, this result contradicts the interpretation from the earlier blogpost in which I had suggested that the scientists in Akira and the lawyers in Karasu have similar ways of talking. The difference in the kanji-kana ratio between Akira and the two Morning manga, though, is explained not only through the more frequent use of kanji in the latter, but also through the vast differences in katakana usage (note that only characters in proper word balloons, i.e. dialogue, are counted, not sound effects).

Ōtomo uses katakana for two different purposes: in A1 mainly to reproduce the names of the foreign researchers, and in A2 to stretch syllables otherwise written in hiragana at the end of words, e.g. なにィ nanii (“whaaat?”) or 何だァ nandaa (“what is iiit?”). Therefore the similarity of the character use in the two Akira samples is superficial only and the pure numbers somewhat misleading. On the other hand, it makes sense that an action-packed scene such as A2 contains less than half as many kanji as the courtroom dialogue in M2; in A2 there are more simple, colloquial words for which the hiragana spelling is more common, e.g. くそう kusou (“shit!”) or うるせェ urusee (“quiet!”), whereas technical terms such as 被告人 hikokunin (“defendant”) in M2 are more clearly and commonly expressed in kanji.

In the end, the old rule applies: only with a large number of sample texts, with a large size of each sample, and through a combination of several different metrics can such stylometric approaches possibly succeed.