Multivariate statistics: how to measure similarity between comics (or anything, really) based on several characteristics
Posted: December 18, 2019
In recent blogposts about stylometry (e.g. here), I skipped a bit of maths that, in hindsight, might be worth talking about. As it turns out, it’s actually both highly useful and easy to understand.
The examples used here are going to be the same as in the aforementioned post, i.e. 2 scenes from Katsuhiro Ōtomo’s Akira (vol. 5, p. 16 ff, which we’ll call A1, and vol. 3, p. 125 ff, which we’ll call A2) and 2 manga chapters from the October 11, 2018 issue of Morning magazine, Miko Yasu’s Hakozume (M1) and Rito Asami’s Ichikei no karasu (M2).
Let’s say you want to compare these 4 comics based on 1 variable, e.g. the frequency of the hiragana character で de. (Which is not the most realistic stylometric indicator, but it will make more and more sense with an increasing number of variables.) Nothing easier than that. First, here are the numbers of で de per 100 hiragana for each text:
- A1: 8
- A2: 3
- M1: 6
- M2: 7
By simply subtracting the numbers from each other, we get the difference between any pair of manga and thus their similarity. Ranked from smallest difference to largest, these would be:
- A1/M2: 1
- M1/M2: 1
- A1/M1: 2
- A2/M1: 3
- A2/M2: 4
- A1/A2: 5
So the two Morning manga and one of the Akira scenes can be said to be similar, while the other Akira scene is the odd one out.
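For anyone who would rather let a computer do the subtracting, the one-variable comparison above can be sketched in a few lines of Python. This is only an illustration; the names `de_freq`, `differences` and `ranking` are made up for this example, and the numbers are simply the ones from the list above.

```python
from itertools import combinations

# Occurrences of で (de) per 100 hiragana, as listed above.
de_freq = {"A1": 8, "A2": 3, "M1": 6, "M2": 7}

# Absolute difference for every unordered pair of texts.
differences = {
    (a, b): abs(de_freq[a] - de_freq[b])
    for a, b in combinations(de_freq, 2)
}

# Rank from most similar (smallest difference) to least similar.
ranking = sorted(differences.items(), key=lambda item: item[1])
for (a, b), diff in ranking:
    print(f"{a}/{b}: {diff}")
```

Taking the absolute value means it doesn't matter which of the two numbers is subtracted from which.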
With 2 variables, it gets more interesting. Let’s assume you decide that the similarity of these manga is best based on their use of the hiragana で de and い i. The frequencies for the latter are:
- A1: 7
- A2: 8
- M1: 3
- M2: 2
On a side note, at this point it might be a good idea to think about normalisation: are the numbers of the two variables comparable, so that a difference of e.g. “2” carries the same weight for both characteristics? In our example, this is not a problem because we’re dealing with two hiragana frequencies measured on the same scale, but if your two variables are e.g. the total number of kana characters per chapter and the shoe size of the author, the former will probably have much more impact on the similarity scores than the latter, because the range of numbers is wider – unless you adjust the scale of the variables. Except if this different impact was precisely what you wanted.
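One common way of adjusting the scale (by no means the only one) is to convert each variable to z-scores, so that every variable ends up with a mean of 0 and a standard deviation of 1. The following is a minimal sketch under that assumption; the function name `standardise` and all the numbers are invented purely for illustration and are not measured data.

```python
from statistics import mean, pstdev

def standardise(values):
    """Rescale values to mean 0 and standard deviation 1 (z-scores)."""
    mu = mean(values)
    sigma = pstdev(values)  # would be 0 (and break) if all values were equal
    return [(v - mu) / sigma for v in values]

# Invented numbers for illustration only, not measured data.
kana_per_chapter = [1200, 1500, 900, 1800]
shoe_size = [40, 42, 41, 45]

kana_z = standardise(kana_per_chapter)
shoe_z = standardise(shoe_size)

print([round(z, 2) for z in kana_z])
print([round(z, 2) for z in shoe_z])
```

After this step, a difference of 1 means "one standard deviation" for both variables, so neither dominates just because its raw numbers are bigger.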
To calculate the distance between any two of these points (i.e. the similarity of two manga), you’ll probably want to use Pythagoras and his a² + b² = c² formula, a.k.a. the Euclidean distance, with ‘a’ and ‘b’ representing the horizontal and vertical distances and ‘c’ being the diagonal line we’re looking for. There’s nothing wrong with that, but it might surprise you that in actual statistics and stylometrics, there are several other ways of measuring this distance. However, we’re going to stick with good old Pythagoras here.
The distance between A1 (で de: 8 / い i: 7) and A2 (3/8), for instance, would be the square root of the sum of (8-3)² and (7-8)², which is approximately 5.1. All distances, ranked from lowest to highest, would be (rounded to one decimal):
- M1/M2: 1.4
- A1/M1: 4.5
- A1/A2: 5.1
- A1/M2: 5.1
- A2/M1: 5.8
- A2/M2: 7.2
Now the two Akira excerpts appear to be more similar than before when the similarity was only based on the frequency of で de, and the similarity between the two Morning manga is greater than that between the first Akira excerpt and either of the two Morning manga.
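The two-variable distances can be recalculated with a short Python sketch. It uses `math.hypot` from the standard library, which returns the length of the hypotenuse for two given sides; the names `points` and `euclidean_2d` are just labels chosen for this example.

```python
import math
from itertools import combinations

# (で de, い i) per 100 hiragana, taken from the two lists above.
points = {"A1": (8, 7), "A2": (3, 8), "M1": (6, 3), "M2": (7, 2)}

def euclidean_2d(p, q):
    """Length of the hypotenuse spanned by the horizontal and vertical gaps."""
    return math.hypot(p[0] - q[0], p[1] - q[1])

distances = {
    f"{a}/{b}": euclidean_2d(points[a], points[b])
    for a, b in combinations(points, 2)
}

# Print the pairs from most similar to least similar.
for pair, dist in sorted(distances.items(), key=lambda item: item[1]):
    print(f"{pair}: {dist:.1f}")
```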
Just as you imagine two points in 2-dimensional space forming two corners of a right-angled triangle (see above), in 3-dimensional space you have to imagine a rectangular cuboid – a ‘box’ (see the illustration on Wikipedia). Apparently, how to calculate the distance between the two opposite corner points of a cuboid is something you learn in high school, but I couldn’t remember and had to look it up. The formula for distance ‘d’ is: d² = a² + b² + c².
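In case you are wondering where that formula comes from: it is simply Pythagoras applied twice. In the sketch below, ‘e’ is a helper symbol introduced only for this derivation; it stands for the diagonal across the bottom face of the box (with sides ‘a’ and ‘b’), which then forms a right-angled triangle with the height ‘c’ and the space diagonal ‘d’.

```latex
\begin{align}
e^2 &= a^2 + b^2 \\
d^2 &= e^2 + c^2 = a^2 + b^2 + c^2
\end{align}
```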
As our third variable, we’re going to use the frequency of the hiragana し shi. In the following list, the number of し shi per 100 hiragana is added as the third coordinate to each manga:
- A1 (8/7/7)
- A2 (3/8/1)
- M1 (6/3/1)
- M2 (7/2/5)
For instance, the distance between A1 and A2 is the square root of: (8-3)² + (7-8)² + (7-1)², i.e. roughly 7.9. Here are all the distances:
- M1/M2: 4.2
- A1/M2: 5.5
- A2/M1: 5.8
- A1/M1: 7.5
- A1/A2: 7.9
- A2/M2: 8.2
As we can see, the main difference between this similarity ranking and the previous one is that the similarity between the two Akira scenes has become smaller.
You might have guessed it by now: even though it gets harder to imagine (and even more so to illustrate) a space of more than 3 dimensions, we can apply more or less the same formula regardless of the number of variables. We only need to add a new summand/addend for each new variable. For 4 variables, the distance between two points would be the square root of (a² + b² + c² + d²). These are the distances if we add the hiragana て te (which occurs 7 times per 100 hiragana in A1, 2 times in A2, 6 in M1, 4 in M2) as the 4th dimension:
- M1/M2: 4.7
- A1/M2: 6.2
- A2/M1: 7.1
- A1/M1: 7.5
- A2/M2: 8.5
- A1/A2: 9.3
Note how the changes become smaller now – apart from the last two pairs having swapped places, the similarity ranking is the same as before.
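Since the formula only grows by one summand per variable, one function can handle any number of dimensions. Here is a minimal sketch using `math.dist` (available from Python 3.8), which computes the Euclidean distance between two points of equal dimension; `profiles` and `ranked_distances` are names made up for this example.

```python
import math
from itertools import combinations

# (で de, い i, し shi, て te) per 100 hiragana, from the text above.
profiles = {
    "A1": (8, 7, 7, 7),
    "A2": (3, 8, 1, 2),
    "M1": (6, 3, 1, 6),
    "M2": (7, 2, 5, 4),
}

def ranked_distances(vectors, n_vars):
    """Rank all pairs by Euclidean distance over the first n_vars variables."""
    pairs = {
        f"{a}/{b}": math.dist(vectors[a][:n_vars], vectors[b][:n_vars])
        for a, b in combinations(vectors, 2)
    }
    return sorted(pairs.items(), key=lambda item: item[1])

# Compare the rankings for 3 and 4 variables.
for n in (3, 4):
    print(n, [(pair, round(d, 1)) for pair, d in ranked_distances(profiles, n)])
```

Adding a 5th or a 25th variable would only mean making the tuples longer; the function itself stays the same.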
So how about 25 hiragana frequencies? This is more than half of all the different hiragana in our (100-hiragana samples of the) four manga. I added 21 random hiragana (see the graph) to the 4 from the previous section, and these are the resulting distances:
- A1/M2: 9.7
- A2/M1: 11.0
- A1/A2: 12.5
- M1/M2: 13.0
- A2/M2: 13.3
- A1/M1: 14.7
Who would have thought that? Now it looks as if the ‘scientists’ scene from Akira (A1) is similar to Ichikei no karasu (M2), and the ‘insurgent thugs’ scene from Akira (A2) is similar to Hakozume (M1). Which is what we suspected all along. So who knows, maybe we can do away with all this maths stuff after all? However, the usual caveat applies: proper stylometry should really be based on larger samples than 100 characters per text.
Akira Code 7 Alert is an unofficial animated short film by Richard Nyst that went online on YouTube two weeks ago. I hesitate to call it a ‘fan film’ because it looks so professional. The interesting thing about it is that it focuses on characters from the Akira manga that didn’t make it into the anime: the caretaker robots, also known as ‘Security Balls’, which the military employs for riot control. (They are quite relevant though if one reads Akira as a cyberpunk manga, as I have argued elsewhere.) In animation, they are reminiscent of the Tachikoma in the Ghost in the Shell: Stand Alone Complex anime series. Or maybe the other way round: you can see that Masamune Shirow most likely got the inspiration for the Fuchikoma in his Ghost in the Shell manga from Katsuhiro Ōtomo’s Akira manga.
Disclosure: I’m credited as “Japanese script advisor” in the film.
I ended my blogpost on hiragana frequency as a stylometric indicator with the remark that, rather than the frequency distribution of different hiragana in the text, the ratio of kana to kanji is used as one of several key characteristics in actual stylometric analysis of Japanese texts. I was curious to find out if this number alone could tell us something about the 4 manga text samples in question (2 randomly selected scenes from Katsuhiro Ōtomo’s Akira and 2 series from Morning magazine, Miko Yasu’s Hakozume and Rito Asami’s Ichikei no karasu – in the following text referred to as A1, A2, M1 and M2, respectively). My intuition was that the results wouldn’t be meaningful because the samples were too small, but let’s see:
This time I chose a sample size of 200 characters (hiragana, katakana, and kanji) per text.
Among the first 200 characters in A1 (i.e. Akira vol. 5, p. 16), there are 113 hiragana, 42 katakana and 45 kanji. This results in a kanji-kana ratio of 45 : (113 + 42) = 0.29.
In A2 (Akira vol. 3, pp. 125 ff.), the first 200 characters consist of 126 hiragana, 34 katakana, and 40 kanji, i.e. the kanji-kana ratio is 0.25.
In M1, there are 122 hiragana, 9 katakana, and 69 kanji, resulting in a kanji-kana ratio of 0.53.
In M2, there are 117 hiragana, 0 katakana, and 83 kanji, resulting in a kanji-kana ratio of 0.71!
Thus this time the authorship attribution seems to have worked: the two Ōtomo samples have an almost identical score, whereas those of the two Morning samples are completely different. Interestingly, this result contradicts the interpretation from the earlier blogpost in which I had suggested that the scientists in Akira and the lawyers in Karasu have similar ways of talking. The difference in the kanji-kana ratio between Akira and the two Morning manga, though, is explained not only through the more frequent use of kanji in the latter, but also through the vast differences in katakana usage (note that only characters in proper word balloons, i.e. dialogue, are counted, not sound effects).
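With a digital text at hand, the counting could be left to a script. The following is a rough sketch that sorts characters into hiragana, katakana and kanji by their Unicode code points. The exact ranges are my own simplifying assumptions: the long vowel mark ー is counted as katakana, the middle dot ・ is skipped, and rarer kanji outside the basic CJK block as well as the repetition mark 々 are ignored. The hand counts in this post may have followed different conventions. The names `count_scripts` and `kanji_kana_ratio` are made up for this example.

```python
def count_scripts(text):
    """Count hiragana, katakana and kanji in text by Unicode code point."""
    counts = {"hiragana": 0, "katakana": 0, "kanji": 0}
    for ch in text:
        code = ord(ch)
        if 0x3041 <= code <= 0x3096:
            counts["hiragana"] += 1
        elif 0x30A1 <= code <= 0x30FF and code != 0x30FB:  # skip the middle dot
            counts["katakana"] += 1
        elif 0x4E00 <= code <= 0x9FFF:  # basic CJK block only (assumption)
            counts["kanji"] += 1
    return counts

def kanji_kana_ratio(hiragana, katakana, kanji):
    """Number of kanji divided by the number of all kana."""
    return kanji / (hiragana + katakana)

# A single line of dialogue from A1, far too short for real analysis.
print(count_scripts("初めまして。スタンリー・シモンズ博士です"))

# The hand-counted figures from this post as (hiragana, katakana, kanji).
hand_counts = {
    "A1": (113, 42, 45),
    "A2": (126, 34, 40),
    "M1": (122, 9, 69),
    "M2": (117, 0, 83),
}
ratios = {name: kanji_kana_ratio(*c) for name, c in hand_counts.items()}
for name, ratio in ratios.items():
    print(f"{name}: {ratio:.2f}")
```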
Ōtomo uses katakana for two different purposes: in A1 mainly to reproduce the names of the foreign researchers, and in A2 to stretch syllables otherwise written in hiragana at the end of words, e.g. なにィ nanii (“whaaat?”) or 何だァ nandaa (“what is iiit?”). Therefore the similarity of the character use in the two Akira samples is superficial only and the pure numbers somewhat misleading. On the other hand, it makes sense that an action-packed scene such as A2 contains less than half as many kanji as the courtroom dialogue in M2; in A2 there are more simple, colloquial words for which the hiragana spelling is more common, e.g. くそう kusou (“shit!”) or うるせェ urusee (“quiet!”), whereas technical terms such as 被告人 hikokunin (“defendant”) in M2 are more clearly and commonly expressed in kanji.
In the end, the old rule applies: only with a large number of sample texts, with a large size of each sample, and through a combination of several different metrics can such stylometric approaches possibly succeed.
The other day I was made aware that some things I said in an earlier blogpost, “Author dictionaries and lexical analysis for comics”, might be misleading. So let’s be clear: if you would like to find something out about the writing style of an author or text, it’s not the best idea to look at the frequently used nouns, kanji, or other units of high semantic content. Those are more useful for analysing the content, i.e. the topic(s), of texts. In stylometry, units with low semantic content, such as function words (the, a, it, etc.), are more attractive objects of study, as they can be used almost independently of the topic and often present writers with a choice of which word to use when. In other words, the same writer tends to use the same function words and may be identified by them. (In practice, though, a combination of different characteristics is used for analysis – see the Stylometry article at Wikipedia and the references there.)
In order to automatically separate function words from content words in a digital text, part-of-speech tagging software may be employed. For Japanese, there is e.g. Kuromoji. But isn’t there a simpler way? Can’t we make use of the kanji–kana distinction used in the aforementioned earlier blogpost? If we identified kanji as the semantically rich(er) units, wouldn’t it be sufficient to focus on the kana for stylometric analysis? Maybe, maybe not. The results would probably be poorer, due to two main reasons:
- Every content word (noun, verb, adjective), even if usually written in kanji, may also be written in kana. For instance, 分かる (to understand) is more frequently spelled in hiragana only, わかる. So when we gather kana from a text, we might end up with unwanted content words.
- In inflection suffixes, hiragana are dependent on the preceding kanji, and thus ultimately on the content of the text. For instance, a text on musical performance might contain many instances of the verb 弾く hiku (to play an instrument), so one can expect the hiragana か ka, き ki, く ku, け ke and こ ko to occur more frequently than in other texts, as they are used for inflecting 弾く.
That being said, why don’t we put this kana analysis method to the test anyway? Let’s take the example from Akira vol. 5, p. 16 again in which the scientists are talking (初めまして。スタンリー・シモンズ博士です etc.). We’ll focus on hiragana and ignore katakana, as they tend to be used for nouns too. Starting from those two panels, I manually counted these and the following hiragana until I reached 100. Here are the 5 most frequent hiragana in this set:
- de: 8
- i: 7
- shi: 7
- te: 7
- no: 6
That means, if this was a sufficiently large sample, in any other piece of text by Ōtomo, or at least within Akira, roughly 8% of its hiragana should be de, 7% should be i, etc. So I randomly picked another scene from Akira (vol. 3, p. 125 ff) and looked at the first 100 hiragana there. The 5 most frequently used hiragana from the previous example are used less often here, with the exception of i:
- de: 3
- i: 8
- shi: 1
- te: 2
- no: 3
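Counting by hand, as done for the two scenes above, is tedious. Given a digital text, it could be automated along the following lines. This is only a sketch: the function name `hiragana_frequencies` is made up, the Unicode range used for hiragana is a simplification, and folding small っ into つ is my reading of how the manual count was done.

```python
import unicodedata
from collections import Counter

def hiragana_frequencies(text, sample_size=100):
    """Count each hiragana among the first sample_size hiragana in text."""
    # Compose characters such as で into single code points.
    text = unicodedata.normalize("NFC", text)
    # Count small tsu together with regular tsu (assumption, see above).
    text = text.replace("っ", "つ")
    hiragana = [ch for ch in text if "\u3041" <= ch <= "\u3096"]
    return Counter(hiragana[:sample_size])

# A single line of dialogue, just to show the mechanics.
freq = hiragana_frequencies("初めまして。スタンリー・シモンズ博士です")
print(freq.most_common(5))
```

Kanji, katakana and punctuation are simply skipped, so the sample always consists of hiragana only.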
In these pages in vol. 3, the most frequently used hiragana are mainly others, such as tsu (9 times, including small tsu), ga (6 times), o (5 times) and su (5 times). That, however, doesn’t tell us anything yet about the similarity of these two pieces of text (which I’m going to call “Akira 1” and “Akira 2” from here on). We need to add a third example, and for this purpose I’m going to use 100 hiragana from Miko Yasu’s Hakozume from the recently reviewed Morning magazine. If our method is successful, the differences between Hakozume and each of the two Akira scenes should be larger than those between Akira 1 and Akira 2. With frequency values for approximately 50 distinct hiragana we now have 3 × ~50 data points on which we could unleash the whole range of advanced statistical methods. But we’ll keep things simple by just adding up the absolute differences in frequencies: Hakozume contains only 6 instances of de, i.e. 2 fewer than Akira 1; Hakozume uses i 3 times as opposed to the 7 in Akira 1, i.e. 4 fewer; Hakozume contains 6 fewer instances of shi than Akira 1; etc. Here are the frequencies of de, i, shi, te and no in Hakozume:
- de: 6
- i: 3
- shi: 1
- te: 6
- no: 8
The combined difference between Hakozume and Akira 1 for these 5 hiragana would be 2+4+6+1+2 = 15. For all ~50 different hiragana, the sum is 96.
This looks like a large number, and indeed, when we calculate the difference between Akira 1 and Akira 2 in this way, the result is 82. This means that the two Akira chunks are more similar in their usage of hiragana than Hakozume and Akira 1.
However, we’re not done yet. We still need to compare Hakozume to Akira 2. The result of this comparison may come as a surprise: the sum of differences is also 82! So Akira 2 is as similar to Hakozume as it is to Akira 1. If our goal was to find out whether a given piece of text is taken from Akira or not, our method would fail if we used Akira 2 as our base text with which to compare all others.
Just to make sure, I took another 100 hiragana from a different random manga in the same issue of Morning, Rito Asami’s Ichikei no karasu. I’ll refer to Ichikei no karasu as Morning 2 from now on, and to Hakozume as Morning 1. The results of the comparisons are even ‘worse’: while the sum of differences between Morning 2 and Akira 2 is 98 – i.e. vastly different – the difference between Morning 2 and Akira 1 is only 74, i.e. very similar.
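The measure used in these comparisons, adding up the absolute differences for each hiragana, is sometimes called the Manhattan or city-block distance. Here is a minimal sketch of it, restricted to the five hiragana for which numbers are listed above; the totals for all ~50 hiragana can’t be reproduced from the figures given in this post. The function name `sum_of_differences` is made up for this example.

```python
# Occurrences per 100 hiragana of de, i, shi, te, no (the five listed above).
akira_1 = {"de": 8, "i": 7, "shi": 7, "te": 7, "no": 6}
akira_2 = {"de": 3, "i": 8, "shi": 1, "te": 2, "no": 3}
hakozume = {"de": 6, "i": 3, "shi": 1, "te": 6, "no": 8}

def sum_of_differences(freq_a, freq_b):
    """Add up |a - b| over every hiragana seen in either text."""
    keys = freq_a.keys() | freq_b.keys()
    # A hiragana missing from one text counts as 0 occurrences there.
    return sum(abs(freq_a.get(k, 0) - freq_b.get(k, 0)) for k in keys)

print(sum_of_differences(hakozume, akira_1))  # 2 + 4 + 6 + 1 + 2 = 15
```

Using `.get(k, 0)` matters once all hiragana are included, because a character that occurs in one sample may not occur at all in another.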
In a way, the results do make sense though. We’re looking at dialogue, after all, and the way scientists (in Akira 1) speak is closer to that of lawyers (in Morning 2) than that of insurgent thugs (in Akira 2). And apparently, the conversation between the two policewomen (in Morning 1) is not unlike the latter.
As so often, we could now blame the unsatisfactory results on the small sample size – if we had used chunks of 1000 hiragana instead of 100, surely our attribution attempts would have been more successful? We’ll never find out (unless we obtain a complete digital copy of Akira and extract the hiragana automatically). Another way to improve results would be to tweak the methodology: using data mining algorithms, more elaborate metrics such as co-occurrence of several hiragana could be employed. In actual stylometric research, hiragana seem to be used in yet another metric – the ratio of all hiragana to all other characters (kanji, katakana, rōmaji).
Earlier this year I gave a talk at MSU Comics Forum, and now a journal article based on that talk has already been published:
Has Akira Always Been a Cyberpunk Comic?
Arts 7(3), https://doi.org/10.3390/arts7030032
Here’s the abstract again:
Between the late 1980s and early 1990s, interest in the cyberpunk genre peaked in the Western world, perhaps most evidently when Terminator 2: Judgment Day became the highest-grossing film of 1991. It has been argued that the translation of Katsuhiro Ōtomo’s manga Akira into several European languages at just that time (into English beginning in 1988, into French, Italian, and Spanish beginning in 1990, and into German beginning in 1991) was no coincidence. In hindsight, cyberpunk tropes are easily identified in Akira to the extent that it is nowadays widely regarded as a classic cyberpunk comic. But has this always been the case? When Akira was first published in America and Europe, did readers see it as part of a wave of cyberpunk fiction? Did they draw the connections to previous works of the cyberpunk genre across different media that today seem obvious? In this paper, magazine reviews of Akira in English and German from the time when it first came out in these languages will be analysed in order to gauge the past readers’ genre awareness. The attribution of the cyberpunk label to Akira competed with others such as the post-apocalyptic, or science fiction in general. Alternatively, Akira was sometimes regarded as an exceptional, novel work that transcended genre boundaries. In contrast, reviewers of the Akira anime adaptation, which was released at roughly the same time as the manga in the West (1989 in Germany and the United States), more readily drew comparisons to other cyberpunk films such as Blade Runner.
Read the article online for free at http://www.mdpi.com/2076-0752/7/3/32.
Fun fact: this is my 10th publication (not counting reviews, translations, and articles related to my library ‘day job’)! Find them all here: https://www.bibsonomy.org/cv/user/iglesia
In less than a month, I’m going to participate in a panel on cyberpunk comics at Michigan State University Comics Forum. Here’s the abstract for my paper, which is closely connected to my PhD research:
Between the late 1980s and early 1990s, interest in the cyberpunk genre peaked in the Western world, perhaps most evidently when Terminator 2: Judgment Day became the highest-grossing film of 1991. It has been argued that the translation of Katsuhiro Ōtomo’s manga Akira into several European languages at just that time (from 1988 in English, from 1991 in French, German, Italian and Spanish) was no coincidence. In hindsight, cyberpunk tropes are easily identified in Akira to the extent that it is nowadays widely regarded as a classic cyberpunk comic. But has this always been the case? When Akira was first published in America and Europe, did readers see it as part of a wave of cyberpunk fiction? Did they draw the connections to previous works of the cyberpunk genre across different media that today seem obvious? In this paper, magazine reviews of Akira in English and German from the time when it first came out in these languages are analysed in order to gauge the past readers’ genre awareness. The attribution of the cyberpunk label to Akira competed with others such as the post-apocalyptic, or science fiction in general. Alternatively, Akira was sometimes regarded as an exceptional, novel work that transcended genre boundaries. In contrast, reviewers of the Akira anime adaptation, which was released at roughly the same time as the manga in the West (1989 in Germany and the United States), more readily drew comparisons to other cyberpunk films such as Blade Runner.
Bwana, producer of electronic music from Toronto/Berlin, has released an EP titled The Capsule’s Pride (Bikes) (Comicgate reported last week) for which he has rearranged the Akira anime soundtrack into 9 EDM tracks. This EP is available for free both as audio stream and YouTube video playlist. The latter is more interesting in this context: each video consists of a sparsely animated black-and-white still image from Akira. The funny thing is, the images are taken from the manga, not from the anime.
It’s funny because not only music samples were taken from the anime, but also dialogue samples (from the English dub) that directly refer to the major plot difference between the comic and its adaptation: “there is your messiah…” (in both track 1 and 5). At first I thought, whoever made those videos didn’t know the material well. On the other hand, at least two of the videos fit the titles of the corresponding tracks: the video for the title track “Capsule’s Pride (Bikes)” shows Kaneda on his motorcycle (pictured) – his first one, the one he has when he is still leader of the “Capsule” gang – and the video for “K&K (Lovers in the Light)” shows Kei and Kaneda. Another nice touch is that the Canon decal in the former image has been inconspicuously replaced by one bearing Bwana’s name.