Multivariate statistics: how to measure similarity between comics (or anything, really) based on several characteristics

In recent blog posts about stylometry (e.g. here), I skipped a bit of maths that, in hindsight, is worth talking about. As it turns out, it’s both highly useful and easy to understand.

The examples used here are going to be the same as in the aforementioned post, i.e. 2 scenes from Katsuhiro Ōtomo’s Akira (vol. 5, p. 16 ff, which we’ll call A1, and vol. 3, p. 125 ff, which we’ll call A2) and 2 manga chapters from the October 11, 2018 issue of Morning magazine, Miko Yasu’s Hakozume (M1) and Rito Asami’s Ichikei no karasu (M2).

1 variable

Let’s say you want to compare these 4 comics based on 1 variable, e.g. the frequency of the hiragana character で de. (Which is not the most realistic stylometric indicator, but it will make more and more sense with an increasing number of variables.) Nothing easier than that. First, here are the numbers of で de per 100 hiragana for each text:

  • A1: 8
  • A2: 3
  • M1: 6
  • M2: 7

By simply subtracting the numbers from each other (and ignoring the sign), we get the difference between any pair of manga: the smaller the difference, the more similar the pair. Ranked from smallest difference to largest, these would be:

  • A1/M2: 1
  • M1/M2: 1
  • A1/M1: 2
  • A2/M1: 3
  • A2/M2: 4
  • A1/A2: 5
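
For anyone who would rather let a computer do the subtracting, here is a minimal Python sketch of the same ranking. The variable names are my own choice for illustration; only the four frequencies come from the list above.

```python
from itertools import combinations

# Occurrences of で (de) per 100 hiragana, taken from the list above.
de_freq = {"A1": 8, "A2": 3, "M1": 6, "M2": 7}

# Absolute difference for every unordered pair of texts.
differences = {
    (a, b): abs(de_freq[a] - de_freq[b])
    for a, b in combinations(de_freq, 2)
}

# Smallest difference (most similar pair) first.
ranking = sorted(differences.items(), key=lambda item: item[1])

for (a, b), diff in ranking:
    print(f"{a}/{b}: {diff}")
```

Taking the absolute value is what makes the order of subtraction irrelevant, so A1/A2 and A2/A1 give the same result.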

So the two Morning manga and one of the Akira scenes can be said to be similar, while the other Akira scene is the odd one out.

2 variables

With 2 variables, it gets more interesting. Let’s assume you decide that the similarity of these manga is best based on their use of the hiragana で de and い i. The frequencies for the latter are:

  • A1: 7
  • A2: 8
  • M1: 3
  • M2: 2

On a side note, at this point it is worth thinking about normalisation: are the numbers of the two variables comparable, so that a difference of e.g. “2” carries the same weight for both characteristics? In our example, this is not a problem because we’re dealing with two hiragana frequencies measured on the same scale. But if your two variables are e.g. the total number of kana characters per chapter and the shoe size of the author, the former will probably have much more impact on the similarity scores than the latter, because its range of numbers is wider. Unless this different impact is precisely what you want, you should adjust the scale of the variables first.
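
One common way of adjusting the scale is to convert each variable to z-scores, i.e. to subtract the mean and divide by the standard deviation. The following is only a sketch of that idea: the kana counts and shoe sizes are invented for illustration, and `z_scores` is a helper name I made up.

```python
from statistics import mean, pstdev

def z_scores(values):
    """Rescale values so that their mean is 0 and their standard deviation is 1."""
    mu = mean(values)
    sigma = pstdev(values)  # would be 0 (and fail below) if all values were equal
    return [(v - mu) / sigma for v in values]

# Invented numbers for illustration only, not measured data.
kana_per_chapter = [1200, 950, 1800, 1500]
shoe_size = [41, 43, 39, 42]

kana_z = z_scores(kana_per_chapter)
shoe_z = z_scores(shoe_size)

print([round(z, 2) for z in kana_z])
print([round(z, 2) for z in shoe_z])
```

After rescaling, both variables have the same spread, so neither dominates the distance just because its raw numbers are bigger.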

Anyway, now we have 4 pairs of values, (8/7), (3/8), (6/3) and (7/2), which we could plot as points on an x and a y axis.

To calculate the distance between any two of these points (i.e. the similarity of two manga), you’ll probably want to use Pythagoras and his a² + b² = c² formula, a.k.a. the Euclidean distance, with ‘a’ and ‘b’ representing the horizontal and vertical distances and ‘c’ being the diagonal line we’re looking for. There’s nothing wrong with that, but it might surprise you that in actual statistics and stylometrics, there are several other ways of measuring this distance. However, we’re going to stick with good old Pythagoras here.
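
To give an idea of what such an alternative looks like, here is a small Python sketch that compares the Euclidean distance with the Manhattan (or ‘city block’) distance for A1 and A2. Manhattan is just one well-known example that I picked for illustration; it is not used anywhere else in this post.

```python
import math

a1 = (8, 7)  # (で de, い i) per 100 hiragana in A1
a2 = (3, 8)  # the same two frequencies in A2

horizontal = abs(a1[0] - a2[0])  # 'a' in the formula
vertical = abs(a1[1] - a2[1])    # 'b' in the formula

euclidean = math.sqrt(horizontal**2 + vertical**2)  # the diagonal 'c'
manhattan = horizontal + vertical                   # walking along the axes instead

print(round(euclidean, 1), manhattan)  # 5.1 6
```

The two measures don’t have to produce the same ranking, which is why the choice of distance is a decision in its own right.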

The distance between A1 (で de: 8 / い i: 7) and A2 (3/8), for instance, would be the square root of the sum of (8-3)² and (7-8)², which is approximately 5.1. All distances, ranked from lowest to highest, would be (rounded to one decimal):

  • M1/M2: 1.4
  • A1/M1: 4.5
  • A1/A2: 5.1
  • A1/M2: 5.1
  • A2/M1: 5.8
  • A2/M2: 7.2
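
The whole list can be reproduced with a few lines of Python. This is a sketch with names of my own choosing (`points`, `distance_2d`); `math.hypot` is the standard library function that returns the length of the diagonal for given horizontal and vertical distances.

```python
import math
from itertools import combinations

# (で de, い i) per 100 hiragana, from the two lists above.
points = {"A1": (8, 7), "A2": (3, 8), "M1": (6, 3), "M2": (7, 2)}

def distance_2d(p, q):
    """Length of the straight line between two points in the plane."""
    return math.hypot(p[0] - q[0], p[1] - q[1])

distances = {
    f"{a}/{b}": distance_2d(points[a], points[b])
    for a, b in combinations(points, 2)
}

# Most similar pair (smallest distance) first.
for pair, dist in sorted(distances.items(), key=lambda item: item[1]):
    print(f"{pair}: {dist:.1f}")
```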

Now the two Akira excerpts rank as more similar to each other than before, when the similarity was only based on the frequency of で de: their pair is tied for third place instead of coming last. The two Morning manga, on the other hand, are now clearly closer to each other than the first Akira excerpt is to either of them.

3 variables

Just as you imagine two points in 2-dimensional space forming two corners of a right-angled triangle (see above), in 3-dimensional space you have to imagine a rectangular cuboid – a ‘box’ (see the illustration on Wikipedia). Apparently, how to calculate the distance between the two opposite corner points of a cuboid is something you learn in high school, but I couldn’t remember and had to look it up. The formula for distance ‘d’ is: d² = a² + b² + c².
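
In case you wonder where this formula comes from: it is simply Pythagoras applied twice, first to the diagonal of the bottom of the box, then to the triangle formed by that diagonal and the height of the box. As a short derivation, with ‘e’ being a name I introduce here for the diagonal of the bottom:

```latex
\begin{align*}
e^2 &= a^2 + b^2 && \text{(diagonal of the bottom rectangle)} \\
d^2 &= e^2 + c^2 && \text{(bottom diagonal and height meet at a right angle)} \\
    &= a^2 + b^2 + c^2
\end{align*}
```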

As our third variable, we’re going to use the frequency of the hiragana し shi. In the following list, the number of し shi per 100 hiragana is added as the third coordinate to each manga:

  • A1 (8/7/7)
  • A2 (3/8/1)
  • M1 (6/3/1)
  • M2 (7/2/5)

For instance, the distance between A1 and A2 is the square root of: (8-3)² + (7-8)² + (7-1)², i.e. roughly 7.9. Here are all the distances:

  • M1/M2: 4.2
  • A1/M2: 5.5
  • A2/M1: 5.8
  • A1/M1: 7.5
  • A1/A2: 7.9
  • A2/M2: 8.2

As we can see, the main difference between this similarity ranking and the previous one is that the similarity between the two Akira scenes has become smaller.

4 variables

You might have guessed it by now: even though it gets harder to imagine (and even more so to illustrate) a space of more than 3 dimensions, we can apply the same formula regardless of the number of variables. We only need to add a new summand/addend for each new variable. For 4 variables, the distance between two points would be the square root of (a² + b² + c² + d²), where ‘d’ now stands for the difference in the fourth variable rather than for the distance itself. These are the distances if we add the hiragana て te (which occurs 7 times per 100 hiragana in A1, 2 times in A2, 6 in M1, 4 in M2) as the 4th dimension:

  • M1/M2: 4.7
  • A1/M2: 6.2
  • A2/M1: 7.1
  • A1/M1: 7.5
  • A2/M2: 8.5
  • A1/A2: 9.3
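
Because the formula only grows by one summand per variable, a single function can handle any number of dimensions. Here is a Python sketch; the names `profiles`, `euclidean` and `ranking` are my own, and the numbers are the four frequencies given so far for each text.

```python
import math
from itertools import combinations

# Frequencies per 100 hiragana in the order (で de, い i, し shi, て te).
profiles = {
    "A1": (8, 7, 7, 7),
    "A2": (3, 8, 1, 2),
    "M1": (6, 3, 1, 6),
    "M2": (7, 2, 5, 4),
}

def euclidean(p, q):
    """Square root of the summed squared differences, for any number of variables."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(p, q)))

def ranking(n_variables):
    """Pairs sorted from most to least similar, using only the first n_variables."""
    scored = [
        (f"{a}/{b}", euclidean(profiles[a][:n_variables], profiles[b][:n_variables]))
        for a, b in combinations(profiles, 2)
    ]
    return sorted(scored, key=lambda item: item[1])

for n in (3, 4):
    print(n, "variables:", [f"{pair} {dist:.1f}" for pair, dist in ranking(n)])
```

Calling `ranking(3)` and `ranking(4)` gives the two lists from this and the previous section, so adding a fifth or a twenty-fifth variable is only a matter of making the tuples longer.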

Note how the changes become smaller now – apart from the last two pairs having swapped places, the similarity ranking is the same as before.

dialogue in Katsuhiro Ōtomo’s Akira (A1, left) vs. dialogue in Miko Yasu’s Hakozume (M1, right)

25 variables

So how about 25 hiragana frequencies? This is more than half of all the different hiragana in our (100-hiragana samples of the) four manga. I added 21 random hiragana (see the graph) to the 4 from the previous section, and these are the resulting distances:

  • A1/M2: 9.7
  • A2/M1: 11.0
  • A1/A2: 12.5
  • M1/M2: 13.0
  • A2/M2: 13.3
  • A1/M1: 14.7

Who would have thought that? Now it looks as if the ‘scientists’ scene from Akira (A1) is similar to Ichikei no karasu (M2), and the ‘insurgent thugs’ scene from Akira (A2) is similar to Hakozume (M1). Which is what we suspected all along. So who knows, maybe we can do away with all this maths stuff after all? However, the usual caveat applies: proper stylometry should really be based on larger samples than 100 characters per text.
