Author dictionaries and lexical analysis for comics
Posted: March 19, 2015
Every once in a while I learn something at my day job that I think would be applicable to comics research too. For instance, in literary studies, dictionaries are compiled that contain all the words (or only the nouns, similar to an encyclopedia) used by a particular author, or even only those used in one single literary text. Think of it as a sort of commentary in a critical edition which explains references to real-world entities, or obscure words that aren’t used anymore, only separate from the source text and organised alphabetically.
Applying this method to comics, we would, of course, ignore all the images and lose the information they convey. On the other hand, looking at the words alone might yield interesting results. For instance, by comparing the frequency of words used in a particular comic to the frequency with which they occur in written language in general, we could test common hypotheses such as “author X uses word Y a lot”.
For comics of more than a few pages in length, it would be nice to automatically create a list of all the words in digital form (at least those in speech/thought bubbles and captions – sound effects and inscriptions/labels can be difficult to recognise automatically). Unless a script for the comic you’re interested in is already available, a straightforward (though not necessarily easy) way to get such a list would be to obtain digital images (e.g. scans) of the pages of the comic, then run Optical Character Recognition (OCR) software on them.
As an example, consider these two panels from Akira, in which a scientist is introduced to some colleagues:
The OCR software www.onlineocr.net recognises the text in the five speech bubbles like this:
As far as I can see, only two mistakes were made (ノレ instead of ル and ですノ instead of です). Instead of focusing on nouns (for which there probably are detection algorithms for Japanese), it’s easier for now to just look at the kanji and filter out all hiragana and katakana characters. (While you can’t simply say that kanji represent nouns and kana represent other parts of speech, the idea here is that kanji tend to carry more semantic information than kana, which are often only used as inflectional suffixes.) That leaves us with six kanji: 初, 名, 前, 博, 士, and 初 again.
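The filtering step can be sketched in a few lines of Python. This is only a minimal illustration: it assumes that "kanji" means any character in the CJK Unified Ideographs block (U+4E00 to U+9FFF), which ignores rarer characters in extension blocks and the iteration mark 々. The `sample` string is a made-up stand-in that happens to contain the same six kanji, not the actual OCR output, and the function name `extract_kanji` is my own.

```python
import re

# Assumption: "kanji" = characters in the CJK Unified Ideographs block.
# Hiragana, katakana, punctuation and Latin letters fall outside this range.
KANJI = re.compile(r"[\u4e00-\u9fff]")

def extract_kanji(text):
    """Return all kanji in `text` in reading order, dropping everything else."""
    return KANJI.findall(text)

# Hypothetical stand-in text, NOT the real OCR result of the two panels.
sample = "初めまして 名前 博士 初めまして"
print(extract_kanji(sample))
```

Because the function returns a list of tokens rather than a set, repeated kanji such as 初 are kept, which is what we need for counting later.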
We can look up the frequency with which they occur in the Japanese language in general, e.g. the frequency rank at WWWJDIC:
- 前: 27
- 初: 152
- 名: 177
- 士: 526
- 博: 794
i.e. 前 is the most frequent of the five, 博 the least frequent. Compare these ranks to the frequency with which they occur in our slim sample of two panels:
- 初: 33% of all kanji
- 前, 名, 士, 博: 17% each
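The percentages above are simply each kanji's share of the six kanji tokens. A minimal sketch of that calculation, with the token list typed in by hand and the helper name `kanji_shares` chosen for illustration:

```python
from collections import Counter

# The six kanji tokens left over from the two panels after dropping kana.
kanji_tokens = ["初", "名", "前", "博", "士", "初"]

def kanji_shares(tokens):
    """Map each kanji to its share of all kanji tokens (0.0 to 1.0)."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {kanji: count / total for kanji, count in counts.items()}

shares = kanji_shares(kanji_tokens)
# Print the most frequent kanji first, rounded to whole percent.
for kanji, share in sorted(shares.items(), key=lambda item: -item[1]):
    print(f"{kanji}: {share:.0%}")
```

Rounded to whole percent, 2/6 gives 33% and 1/6 gives 17%, which is why the listed figures add up to slightly more than 100%.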
What we can see here, if anything, is that two kanji, 士 and 博, are used more often by Katsuhiro Ōtomo than their low general frequency ranks would lead us to expect. This doesn’t come as a surprise, as the compound 博士 signifies the academic title ‘Dr.’, which is the appropriate form of address for the scientists in this scene, whereas the other kanji 前, 初 and 名 are linked to names and introductions in general, and thus more often used in standard Japanese.
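One crude way to automate this comparison is to flag every kanji that turns up in the sample even though its general frequency rank is low. The sketch below uses the WWWJDIC ranks quoted above; the cut-off of 500 is an arbitrary value picked for illustration, and the function name is hypothetical. With only ranks (not actual frequencies) and six tokens, this is a heuristic, not a statistical test.

```python
# General frequency ranks as quoted from WWWJDIC (1 = most frequent).
general_rank = {"前": 27, "初": 152, "名": 177, "士": 526, "博": 794}

# Occurrences in the two-panel sample.
sample_count = {"初": 2, "名": 1, "前": 1, "博": 1, "士": 1}

# Assumption: kanji ranked below this are "rare enough" to be worth a look.
RANK_THRESHOLD = 500

def unexpectedly_present(ranks, counts, threshold):
    """Return kanji that occur in the sample despite a rank beyond `threshold`,
    ordered from more common to less common in general usage."""
    flagged = [k for k, n in counts.items() if n > 0 and ranks[k] > threshold]
    return sorted(flagged, key=lambda k: ranks[k])

print(unexpectedly_present(general_rank, sample_count, RANK_THRESHOLD))
```

For a real corpus one would want actual reference frequencies rather than ranks, so that the sample share could be compared against an expected share.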
However, even if the frequency of 士 and 博 remained above average when we analysed all of Akira’s over 2000 pages, that wouldn’t necessarily mean we had discovered a lexical characteristic of Ōtomo’s writing style. What it would tell us is that there is a subplot about scientists in Akira. Of course, topic analysis based on word frequency is nothing new. More interesting from a formal-lexical point of view would be if we discovered kanji used with different frequencies than we would expect given the subject matter treated in Akira. In this situation it might be useful to look at synonyms: when Ōtomo had several options to express the same thing, why did he choose some words over others?
For instance, on the same page as the example above, the relatively infrequent (rank 920) kanji 栄 is used as part of the word “honour” in the expression “I’m honoured to meet you”. Ōtomo could instead have used the phrase “nice to meet you” for a third time, using the kanji 初 again, but he didn’t. If there were a substantial number of further instances of 栄 in Akira, might that be a formal-stylistic choice, rather than one merely implied by the content of the comic?
I’m aware that all this is very hypothetical, and that looking at just a few panels doesn’t show anything, but if I wanted to analyse a comic in this way, I would basically go about it as described here, only with more scans. If you would like to learn more about this kind of analysis, I recommend Allen Riddell’s tutorial on “Feature selection: finding distinctive words”.