Hiragana for stylometry?

The other day I was made aware that some things I said in an earlier blogpost, “Author dictionaries and lexical analysis for comics”, might be misleading. So let’s be clear: if you would like to find something out about the writing style of an author or text, it’s not the best idea to look at the frequently used nouns, kanji, or other units of high semantic content. Those are more useful for analysing the content, i.e. the topic(s), of texts. In stylometry, units with low semantic content, such as function words (the, a, it, etc.), are more attractive objects of study, as they can be used almost independently of the topic and often present writers with a choice of which word to use when. In other words, the same writer tends to use the same function words and may be identified by them. (In practice, though, a combination of different characteristics is used for analysis – see the Stylometry article at Wikipedia and the references there.)

In order to automatically separate function words from content words in a digital text, part-of-speech tagging software may be employed. For Japanese, there is e.g. Kuromoji. But isn’t there a simpler way? Can’t we make use of the kanji–kana distinction used in the aforementioned earlier blogpost? If we identified kanji as the semantically rich(er) units, wouldn’t it be sufficient to focus on the kana for stylometric analysis? Maybe, maybe not. The results would probably be poorer, for two main reasons:

  1. Every content word (noun, verb, adjective), even if usually written in kanji, may also be written in kana. For instance, 分かる (to understand) is more frequently spelled in hiragana only, わかる. So when we gather kana from a text, we might end up with unwanted content words.
  2. In inflectional suffixes, hiragana are dependent on the preceding kanji, and thus ultimately on the content of the text. For instance, a text on musical performance might contain many instances of the verb 引く hiku (to play an instrument), so one can expect the hiragana か ka, き ki, く ku, け ke and こ ko to occur more frequently than in other texts, as they are used for inflecting 引く.

That being said, why don’t we put this kana analysis method to the test anyway? Let’s take the example from Akira vol. 5, p. 16 again in which the scientists are talking (初めまして。スタンリー・シモンズ博士です etc.). We’ll focus on hiragana and ignore katakana, as they tend to be used for nouns too. Starting from those two panels, I manually counted these and the following hiragana until I reached 100. Here are the 5 most frequent hiragana in this set:

  • de: 8
  • i: 7
  • shi: 7
  • te: 7
  • no: 6
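
The manual count above could in principle be automated for a digital text. The following is only a minimal sketch, not the procedure actually used here: it assumes that "hiragana" means the Unicode range U+3041 to U+3096, and the function names are made up for illustration. Unlike the manual count, it treats small っ and regular つ as two different characters.

```python
from collections import Counter

# Assumed definition of "hiragana": the Unicode range U+3041..U+3096.
HIRAGANA_START, HIRAGANA_END = "\u3041", "\u3096"

def is_hiragana(ch: str) -> bool:
    """True if the single character `ch` lies in the assumed hiragana range."""
    return HIRAGANA_START <= ch <= HIRAGANA_END

def hiragana_counts(text: str, limit: int = 100) -> Counter:
    """Count the first `limit` hiragana in `text`, skipping everything else."""
    sample = [ch for ch in text if is_hiragana(ch)][:limit]
    return Counter(sample)

# The opening line quoted above; kanji, katakana and punctuation are ignored.
line = "初めまして。スタンリー・シモンズ博士です"
counts = hiragana_counts(line)
print(counts.most_common(5))
```

Fed with a longer transcript, `counts.most_common(5)` would yield a top-five list like the one above, keyed by the kana themselves rather than by their romanisation.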

That means, if this were a sufficiently large sample, in any other piece of text by Ōtomo, or at least within Akira, roughly 8% of its hiragana should be de, 7% should be i, etc. So I randomly picked another scene from Akira (vol. 3, p. 125 ff) and looked at the first 100 hiragana there. The 5 most frequently used hiragana from the previous example are used less often here, with the exception of i:

de, su, u, ru, se, da

  • de: 3
  • i: 8
  • shi: 1
  • te: 2
  • no: 3

In these pages in vol. 3, we find mainly other hiragana such as tsu (9 times – including small tsu), ga (6 times), o (5 times) and su (5 times) to be the most frequently used. That, however, doesn’t tell us anything yet about the similarity of these two pieces of text (which I’m going to call “Akira 1” and “Akira 2” from here on). We need to add a third example, and for this purpose I’m going to use 100 hiragana from Miko Yasu’s Hakozume from the recently reviewed Morning magazine. If our method is successful, the differences between Hakozume and each of the two Akira scenes should be larger than those between Akira 1 and Akira 2. With frequency values for approximately 50 distinct hiragana we now have 3 × ~50 data points on which we could unleash the whole range of advanced statistical methods. But we’ll keep things simple by just adding up the differences in frequencies: Hakozume contains only 6 instances of de, i.e. 2 fewer than Akira 1; Hakozume uses i 3 times as opposed to the 7 in Akira 1, i.e. 4 fewer; Hakozume contains 6 fewer instances of shi than Akira 1; etc. Here’s the table of frequencies of de, i, shi, te and no in Hakozume:

a, no, na, n, de, a, no, ga…

  • de: 6
  • i: 3
  • shi: 1
  • te: 6
  • no: 8

The combined difference between Hakozume and Akira 1 for these 5 hiragana would be 2+4+6+1+2 = 15. For all ~50 different hiragana, the sum is 96.
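
Adding up the absolute differences like this is what is elsewhere called the Manhattan (or L1) distance. As a small sketch, using only the five frequencies listed above and a made-up helper name `summed_difference`, the calculation looks like this:

```python
# Frequencies of the five hiragana as listed in the two tables above.
akira_1 = {"de": 8, "i": 7, "shi": 7, "te": 7, "no": 6}
hakozume = {"de": 6, "i": 3, "shi": 1, "te": 6, "no": 8}

def summed_difference(a: dict, b: dict) -> int:
    """Sum of absolute frequency differences over all keys in either sample."""
    keys = set(a) | set(b)
    # A hiragana missing from one sample counts as frequency 0 there.
    return sum(abs(a.get(k, 0) - b.get(k, 0)) for k in keys)

print(summed_difference(akira_1, hakozume))  # 15
```

Passing in the full frequency tables of ~50 hiragana instead of these five entries would give the overall sums discussed here.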

This looks like a large number, and indeed, when we calculate the difference between Akira 1 and Akira 2 in this way, the result is 82. This means that the two Akira chunks are more similar in their usage of hiragana than Hakozume and Akira 1.

However, we’re not done yet. We still need to compare Hakozume to Akira 2. The result of this comparison may come as a surprise: the sum of differences is also 82! So Akira 2 is as similar to Hakozume as it is to Akira 1. If our goal were to find out whether a given piece of text is taken from Akira or not, our method would fail if we used Akira 2 as our base text with which to compare all others.

ha, no, ki, ka, ra, ho, do, de, ki, wo…

Just to make sure, I took another 100 hiragana from a different random manga in the same issue of Morning, Rito Asami’s Ichikei no karasu. I’ll refer to Ichikei no karasu as Morning 2 from now on, and to Hakozume as Morning 1. The results of the comparisons are even ‘worse’: while the sum of differences between Morning 2 and Akira 2 is 98 – i.e. vastly different – the difference between Morning 2 and Akira 1 is only 74, i.e. very similar.
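
To see at a glance what these numbers mean for attribution, here is a small sketch that looks up, for a given base text, which other sample has the smallest sum of differences. It uses only the five sums reported so far (the pair Morning 1 and Morning 2 was not compared and is therefore missing); the function name `closest_to` is made up for illustration.

```python
# Sums of differences as reported in the text (unordered pair -> value).
reported = {
    frozenset({"Akira 1", "Morning 1"}): 96,
    frozenset({"Akira 1", "Akira 2"}): 82,
    frozenset({"Akira 2", "Morning 1"}): 82,
    frozenset({"Akira 2", "Morning 2"}): 98,
    frozenset({"Akira 1", "Morning 2"}): 74,
}

def closest_to(base: str) -> list:
    """Return the sample(s) with the smallest reported difference to `base`."""
    candidates = {
        next(iter(pair - {base})): value
        for pair, value in reported.items()
        if base in pair
    }
    best = min(candidates.values())
    # Several samples may tie for the smallest difference.
    return sorted(name for name, value in candidates.items() if value == best)

print(closest_to("Akira 1"))  # ['Morning 2']
print(closest_to("Akira 2"))  # ['Akira 1', 'Morning 1']
```

With either Akira sample as the base text, the nearest sample is not unambiguously the other Akira sample.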

Frequency of all hiragana in each of the four 100-hiragana samples

In a way, the results do make sense though. We’re looking at dialogue, after all, and the way scientists (in Akira 1) speak is closer to that of lawyers (in Morning 2) than to that of insurgent thugs (in Akira 2). And apparently, the conversation between the two policewomen (in Morning 1) is not unlike the latter.

As so often, we could now blame the unsatisfactory results on the small sample size – if we had used chunks of 1000 hiragana instead of 100, surely our attribution attempts would have been more successful? We’ll never find out (unless we obtain a complete digital copy of Akira and extract the hiragana automatically). Another way to improve results would be to tweak the methodology: using data mining algorithms, more elaborate metrics such as co-occurrence of several hiragana could be employed. In actual stylometric research, hiragana seem to be used in yet another metric – the ratio of all hiragana to all other characters (kanji, katakana, rōmaji).
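
That last metric is easy to sketch for a digital text. The exact definition used in stylometric research isn't spelled out here, so the following rests on two assumptions of mine: hiragana are the Unicode range U+3041 to U+3096, and "all other characters" are whatever else Python's `str.isalpha()` accepts, which leaves out punctuation, digits and whitespace. The function name is made up as well.

```python
def hiragana_ratio(text: str) -> float:
    """Ratio of hiragana to all other letter-like characters in `text`."""
    hiragana = 0
    others = 0
    for ch in text:
        if "\u3041" <= ch <= "\u3096":  # assumed hiragana range
            hiragana += 1
        elif ch.isalpha():  # kanji, katakana, Latin letters; no punctuation
            others += 1
    if others == 0:
        raise ValueError("text contains no kanji, katakana or rōmaji")
    return hiragana / others

line = "初めまして。スタンリー・シモンズ博士です"
print(hiragana_ratio(line))
```

For the single line from Akira used here, the value says little, of course; the metric would only become meaningful over much longer stretches of text.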

Article “Has Akira Always Been a Cyberpunk Comic?” published

Earlier this year I gave a talk at MSU Comics Forum, and now a journal article based on that talk has already been published:

Has Akira Always Been a Cyberpunk Comic?
Arts 7(3), https://doi.org/10.3390/arts7030032

Here’s the abstract again:

Between the late 1980s and early 1990s, interest in the cyberpunk genre peaked in the Western world, perhaps most evidently when Terminator 2: Judgment Day became the highest-grossing film of 1991. It has been argued that the translation of Katsuhiro Ōtomo’s manga Akira into several European languages at just that time (into English beginning in 1988, into French, Italian, and Spanish beginning in 1990, and into German beginning in 1991) was no coincidence. In hindsight, cyberpunk tropes are easily identified in Akira to the extent that it is nowadays widely regarded as a classic cyberpunk comic. But has this always been the case? When Akira was first published in America and Europe, did readers see it as part of a wave of cyberpunk fiction? Did they draw the connections to previous works of the cyberpunk genre across different media that today seem obvious? In this paper, magazine reviews of Akira in English and German from the time when it first came out in these languages will be analysed in order to gauge the past readers’ genre awareness. The attribution of the cyberpunk label to Akira competed with others such as the post-apocalyptic, or science fiction in general. Alternatively, Akira was sometimes regarded as an exceptional, novel work that transcended genre boundaries. In contrast, reviewers of the Akira anime adaptation, which was released at roughly the same time as the manga in the West (1989 in Germany and the United States), more readily drew comparisons to other cyberpunk films such as Blade Runner.

Read the article online for free at http://www.mdpi.com/2076-0752/7/3/32.

Fun fact: this is my 10th publication (not counting reviews, translations, and articles related to my library ‘day job’)! Find them all here: https://www.bibsonomy.org/cv/user/iglesia


Upcoming talk: Has Akira always been a cyberpunk comic?

In less than a month, I’m going to participate in a panel on cyberpunk comics at Michigan State University Comics Forum. Here’s the abstract for my paper, which is closely connected to my PhD research:

Between the late 1980s and early 1990s, interest in the cyberpunk genre peaked in the Western world, perhaps most evidently when Terminator 2: Judgment Day became the highest-grossing film of 1991. It has been argued that the translation of Katsuhiro Ōtomo’s manga Akira into several European languages at just that time (from 1988 in English, from 1991 in French, German, Italian and Spanish) was no coincidence. In hindsight, cyberpunk tropes are easily identified in Akira to the extent that it is nowadays widely regarded as a classic cyberpunk comic. But has this always been the case? When Akira was first published in America and Europe, did readers see it as part of a wave of cyberpunk fiction? Did they draw the connections to previous works of the cyberpunk genre across different media that today seem obvious? In this paper, magazine reviews of Akira in English and German from the time when it first came out in these languages are analysed in order to gauge the past readers’ genre awareness. The attribution of the cyberpunk label to Akira competed with others such as the post-apocalyptic, or science fiction in general. Alternatively, Akira was sometimes regarded as an exceptional, novel work that transcended genre boundaries. In contrast, reviewers of the Akira anime adaptation, which was released at roughly the same time as the manga in the West (1989 in Germany and the United States), more readily drew comparisons to other cyberpunk films such as Blade Runner.


Article “The Task of Manga Translation: Akira in the West” published

My conference paper from 2014, which had so far been published only in German and in print, is now available online and in English:

de la Iglesia, Martin 2016, ‘The Task of Manga Translation: Akira in the West’. The Comics Grid: Journal of Comics Scholarship 6(1), http://dx.doi.org/10.16995/cg.59

There’s also a PDF version.

Abstract:
Translated editions of Katsuhiro Ōtomo’s manga Akira played an important role in the popularisation of manga in the Western world. Published in Japan between 1982 and 1990, editions in European languages followed as soon as the late 1980s. In the first US edition (Epic 1988–1995) the originally black and white manga was printed in colour and published in 38 issues, which were designed not unlike typical American comic books. The first German edition (Carlsen 1991–1996) marked the beginning of Carlsen’s manga publishing efforts. It was based on the English-language edition and also printed in colour, and combined two American issues in one.

This article analyses the materiality of these two translated editions with a focus on three main issues – the mirroring (or ‘flipping’) which changes the reading direction from right-to-left into left-to-right, the colouring of the originally black and white artwork, and the translation of different kinds of script (sound effects, speech bubble text, and inscriptions or labels) – before concluding with a brief examination of their critical reception.


Bartkira the animated trailer

(via Major Spoilers)

Remember Bartkira, the comic mashup of Akira and The Simpsons (mentioned briefly here one year ago)? Based on this idea, Kaitlin Sullivan, in collaboration with many other artists, has made an animated short film. This fan film adapts the animated Akira film rather than the comic, so we get to see some new scenes and characters not present in Bartkira the comic.


Conference paper “Akira im Westen” published

panel from Akira by Katsuhiro Ōtomo

Last year at a conference on “the translation and adaptation of comics” in Hildesheim, Germany, I gave a talk on the first English and German editions of Katsuhiro Ōtomo’s Akira. The conference proceedings have now been published as a book, albeit with most of the papers in German, including my own. I’m working on making an English-language, Open Access version of my talk available soon. Anyway, here’s the bibliographic data:

de la Iglesia, Martin. “Akira im Westen.” In Comics. Übersetzungen und Adaptionen, edited by Nathalie Mälzer, 355-373. Berlin: Frank & Timme, 2015.

The ISBN of the book is: 978-3-7329-0131-9


Social Network Analysis of co-occurring comic characters

Another thing I learned at my librarian job is that Social Network Analysis (SNA) methods seem to be becoming increasingly popular in the Humanities. The basic idea of SNA is that you define a type of entity as nodes (actors), and some criterion for establishing edges (connections) between them. Once you have constructed such a network, you can analyse it by applying various mathematical operations. The difficult part is defining your nodes and particularly your edges in a way that is both feasible and meaningful.

Some Literature scholars have tackled this problem by using SNA for drama. Written plays are highly structured: speakers are indicated in fairly standardised ways, so that they can be used as nodes in a network. Edges between them can be formed by looking at which characters are on stage at the same time (i.e. during the same scene), possibly indicating a dialogue or other interaction. Another benefit of using drama for SNA is that many older texts are available digitally. Crowdsourcing may be used to clean up this data, thus making it machine-readable for SNA purposes. The resulting graphs may provide insight into certain historic developments, e.g. the number of characters per play increasing over time (PDF, German).

In comics, such automatic processing is still a distant dream, but on a smaller scale, networks may be constructed manually. Identifying nodes is more problematic in comics, though, because unlike in drama, characters aren’t explicitly named each time they appear. They usually have to be identified by their looks, which isn’t always easy. Another problem is how to define the edges. A research group from Paderborn recently proposed (PDF, German) to establish an edge between two characters whenever they appear on a page together. In my opinion, a more suitable category than the page would be the panel, as there are sometimes narrative shifts between panels on the same page, so that the co-occurrence of characters on a page doesn’t necessarily imply interaction. Furthermore, some comics don’t have pages, but they all have panels.
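
Once the panels have been annotated by hand, turning them into weighted edges is straightforward. Here is a minimal sketch of the panel-based definition, with entirely hypothetical data: each panel is represented as the set of characters visible in it (the names A to D are placeholders), and the function name is made up.

```python
from collections import Counter
from itertools import combinations

# Hypothetical annotations: one set of character names per panel.
panels = [
    {"A", "B"},
    {"A", "B", "C"},
    {"D"},          # a lone character creates no edge
    {"C", "A"},
]

def cooccurrence_edges(panels) -> Counter:
    """Map each character pair to the number of panels both appear in."""
    edges = Counter()
    for characters in panels:
        # Sorting makes ("A", "B") and ("B", "A") the same key.
        for pair in combinations(sorted(characters), 2):
            edges[pair] += 1
    return edges

edges = cooccurrence_edges(panels)
print(edges[("A", "B")])  # 2
print(edges[("A", "C")])  # 2
print(edges[("B", "C")])  # 1
```

The resulting pair counts correspond to edge weights and could be exported as an edge list for a tool such as Gephi. Swapping panels for pages would only change how the input sets are formed.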

To test the feasibility of this approach, I built a little character network based on co-occurrence within panels, once again using Akira. Here is a Gephi rendering of such a network from the first 16 pages of volume 3 (blue numbers indicate the number of panels in which both of the connected characters appear):

Character co-occurrence network of Akira vol. 3, pp. 5-20

I assigned the group of soldiers to one single node rather than one node per visible soldier, similar to a speaker designation for groups of people in a play. As we will see in the second example, these ‘crowd’ nodes may cause some headaches. Anyway, the most striking thing about this network is that it consists of three unconnected clusters. In other words, the action takes place at three different places on these 16 pages: the military base, Miyako’s temple, and the streets of Neo Tokyo. (Actually there are two more locales – the site of the SOL laser beam impact and SOL in space – but no character interaction takes place there.) Keep that in mind as we look at the first 17* pages of the 4th volume:

character co-occurrence network of Akira, vol. 4, pp. 4-20

2 panels from p. 13 of Akira vol. 4 by Katsuhiro Ōtomo

Spot the Lieutenant.

At first glance, this graph is very different from the first: instead of three clusters, there is one small and one large cluster. However, this impression is misleading. Because I lumped all of Tetsuo’s henchmen together as “Great Tokyo Empire mob”, they act as a bridge between the actually unconnected scenes at the rescue helicopter on the one hand, and Lieutenant Yamada and his diving unit entering the city on the other. (Another problem here is that Yamada can’t be recognised until he takes off his diving suit – for simplicity’s sake I just assumed he is always among the group of divers depicted.)
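
The bridging effect of such a crowd node can be made visible by counting the clusters (connected components) with and without it. The following sketch uses a deliberately tiny, hypothetical edge list, not the actual data from vol. 4; only the labels "Mob" and "Yamada" echo the scene described above, and the function name is made up.

```python
def components(edges):
    """Group nodes into connected clusters using a simple depth-first search."""
    neighbours = {}
    for a, b in edges:
        neighbours.setdefault(a, set()).add(b)
        neighbours.setdefault(b, set()).add(a)
    seen, clusters = set(), []
    for start in neighbours:
        if start in seen:
            continue
        stack, cluster = [start], set()
        while stack:
            node = stack.pop()
            if node in cluster:
                continue
            cluster.add(node)
            stack.extend(neighbours[node] - cluster)
        seen |= cluster
        clusters.append(cluster)
    return clusters

# Hypothetical edges: the crowd node "Mob" links two otherwise separate scenes.
edges = [
    ("Rescuer 1", "Rescuer 2"),
    ("Rescuer 1", "Mob"),
    ("Mob", "Yamada"),
    ("Yamada", "Divers"),
]
without_mob = [edge for edge in edges if "Mob" not in edge]

print(len(components(edges)))        # 1
print(len(components(without_mob)))  # 2
```

Dropping the crowd node, or splitting it into one node per scene, brings the separate scenes back as separate clusters.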

Thus we can tentatively recognise a pattern in Ōtomo’s storytelling: rather than building his story around one central protagonist, he frequently jumps between parallel lines of action, with shifts taking place approximately every 2-8 pages.


*Why the different number of pages (16 and 17, respectively)? The reason is that I analysed both volumes until p. 20, but vol. 3 starts on p. 5, whereas vol. 4 starts on p. 4.