Chi-squared: which characteristics really matter?

In our little stylometric experiments, we compared different manga in terms of their hiragana frequencies. While we were able to say how similar or different the comics are to each other, it’s hard to tell in what way precisely they are different, i.e. which hiragana differed vastly in quantity and which were more or less the same. Intuitively, we ought to be able to answer this by looking at how much the hiragana counts differ from the average, but it would be good to have a more exact measure of what it means to differ “vastly” or to be “more or less” the same. If we could identify those hiragana in which the manga are hardly any different, we could ignore them in future experiments, which would be a relief since we’re otherwise stuck with as many as ~50 hiragana to keep track of.

Enter chi-squared (also called chi-square, or χ²), which is perhaps the most widespread of several statistical tests for this purpose. I first learned about it during my Master’s, but either I forgot about it or I had never really understood it in the first place. But now that I’ve looked it up again, I found it’s actually quite simple: the idea is to not only calculate the difference between the actual (observed) and the “average” (expected) value, but to square the result and then divide it by the expected value. The squaring has the effect of making large differences stand out more, while the division puts the values for frequent and rare characteristics on a comparable scale.

So, the formula would be:

(observed – expected)² / expected

[You might have seen this formula with a sum sign at the beginning: when you perform a “chi-squared test”, you take the sum of all calculated values and look it up in a table to determine whether your experiment is random or not (see below). In our case, it definitely isn’t.]

Let’s take the hiragana で de as an example. In our first 100-character sample from Katsuhiro Ōtomo’s Akira (A1), で de occurred 8 times (see the chart here). In the second Akira sample (A2), it is found 3 times. In the two manga samples from Morning magazine, Miko Yasu’s Hakozume (M1) and Rito Asami’s Ichikei no karasu (M2), で de is found 6 and 7 times, respectively. Overall, there are 24 で de in those four manga samples. The sum of all hiragana in these manga samples is 435 (so it turns out I took slightly more than 100 hiragana for each sample; don’t ask me why), which means that on average, で de should occur with a frequency of 24/435 = 0.0552. In other words, roughly every 18th hiragana in any of the four manga should be a で de. For the first of the two Akira samples, A1, which consists of 112 hiragana in total, the expected value for で de is 112 * 0.0552 = 6.18, i.e. we expect to find 6 or 7 で de in A1.

There actually are 8 で de in A1. That’s a difference of 8 – 6.18 = 1.82. Squared and divided by the expected value of 6.18, this results in a chi-squared value of 0.536.

Compare this to the frequency of で de in the other Akira sample, A2, where it occurs only 3 times, i.e. much less than one would have thought. Given a hiragana total of 106 for A2, we get an expected value of 106 * 0.0552 = 5.85. Accordingly, chi-squared for で/A2 is (3 – 5.85)² / 5.85 = 1.39.

However, our aim was to compare different hiragana, so let’s also calculate the chi-squared values for し shi, which occurs 7 times in A1, 1 time in A2, and 6 times in the other two manga, so the total for し shi is 14. Chi-squared for し in A1 is (7 – (14/435)*112)² / ((14/435)*112) = 3.198 and chi-squared for し in A2 is (1 – (14/435)*106)² / ((14/435)*106) = 1.705.

As you can see, the chi-squared values for し shi are higher than for で de, which means that the former hiragana contributes more to the overall difference between A1 and A2 than the latter. In other words, the usage of で de throughout Akira is close to the average, thus comparatively unremarkable and perhaps not the most relevant stylometric property.
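The four values calculated above can be reproduced with a few lines of Python. This is only a minimal sketch: the function and variable names are my own, and the counts are the ones quoted in the text. Because the expected values are not rounded along the way, the last digit may differ slightly from a hand calculation.

```python
# Counts quoted in the text; all names here are illustrative.
TOTAL_HIRAGANA = 435                      # all hiragana in the four samples
SAMPLE_SIZE = {"A1": 112, "A2": 106}      # hiragana per Akira sample
OBSERVED = {
    "de":  {"A1": 8, "A2": 3, "total": 24},
    "shi": {"A1": 7, "A2": 1, "total": 14},
}


def expected_count(kana_total, sample_size, grand_total=TOTAL_HIRAGANA):
    """Expected count if the kana were spread evenly over all samples."""
    return kana_total / grand_total * sample_size


def chi_squared_term(observed, expected):
    """Contribution of a single cell: (observed - expected)^2 / expected."""
    return (observed - expected) ** 2 / expected


for kana, counts in OBSERVED.items():
    for sample, size in SAMPLE_SIZE.items():
        exp = expected_count(counts["total"], size)
        chi2 = chi_squared_term(counts[sample], exp)
        print(f"{kana}/{sample}: expected {exp:.2f}, chi-squared {chi2:.3f}")
```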

Here’s a chart of the chi-squared values for all 51 hiragana characters that occur in the four manga samples:

A dialogue excerpt from Hakozume by Miko Yasu which illustrates the above-average frequency of the hiragana character え e. With regard to our little example corpus, か ka and ん n are relatively frequent too.

One can easily see several spikes at the hiragana え e, ん n, と to and ず zu, though more important than the individual values are the sums, which are also high for お o and こ ko. These 6 hiragana alone contribute roughly 70% towards the overall sum of chi-squares! If our corpus were of a sufficient size (which it is definitely not), we could focus on these 6 hiragana in further experiments, as differences in hiragana usage among manga would most likely be connected to them.

In contrast, hiragana like び bi, く ku and に ni, with chi-square values close to zero, seem to have very little explanatory power over stylometric differences; their usage hardly differs among the four manga in question.

Of course, chi-squared can not only be applied to character counts in stylometry, but also to anything else that is countable. For instance, I recently mentioned the 1:1 gender ratio as a potential criterion for corpus building. One possible null hypothesis would be that good (or popular) comics are equally likely to be authored by men or by women. If we look at the 60 people who authored the top 10 comics from each of the last four years’ best-of lists (only counting the first-mentioned author when there are more than 3), we end up with 41 men and 19 women. This distribution isn’t quite the 30:30 we might have expected, but can it still be said to be roughly equal?

To answer this with the help of chi-squared, we calculate the two chi-squared values, one for male authors:

(41 – 30)² / 30 = 4.033

and one for female authors:

(19 – 30)² / 30 = 4.033

Now we add those two numbers together and look up the result in a table like this one. We need to use the first row as we have 1 “degree of freedom” in our essentially binary variable. There, our chi-squared sum of 8.07 lies between the p=0.01 and the p=0.001 column, meaning that the null hypothesis can be rejected with high confidence. In other words, the deviation of our sample from a 30:30 gender ratio is statistically significant. Of course, what exactly this gender bias means and where it comes from is another question.
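If you don’t have a table at hand, the same test can be sketched in Python. The names below are my own. The p-value function relies on the fact that, for 1 degree of freedom only, the upper-tail probability of a chi-squared value x equals erfc(√(x/2)); for more degrees of freedom you would need a statistics library instead.

```python
import math


def chi_squared_statistic(observed, expected):
    """Sum of (observed - expected)^2 / expected over all categories."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


def p_value_one_degree_of_freedom(chi2):
    """Upper-tail probability for a chi-squared value with exactly 1 df."""
    return math.erfc(math.sqrt(chi2 / 2))


observed = [41, 19]   # male and female authors, as counted in the text
expected = [30, 30]   # null hypothesis: equal numbers

chi2 = chi_squared_statistic(observed, expected)
p = p_value_one_degree_of_freedom(chi2)
print(f"chi-squared = {chi2:.2f}, p = {p:.4f}")
```

The printed p-value should fall between 0.001 and 0.01, in line with the table lookup.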

In case all of this didn’t make any sense to you, there are many online tutorials on chi-squared which perhaps explain it better, among which I recommend this video by Paul Andersen on YouTube.

Multivariate statistics: how to measure similarity between comics (or anything, really) based on several characteristics

In recent blogposts about stylometry (e.g. here), I skipped a bit of maths that, in hindsight, might be worth talking about. As it turns out, it’s actually both highly useful and easy to understand.

The examples used here are going to be the same as in the aforementioned post, i.e. 2 scenes from Katsuhiro Ōtomo’s Akira (vol. 5, p. 16 ff, which we’ll call A1, and vol. 3, p. 125 ff, which we’ll call A2) and 2 manga chapters from the October 11, 2018 issue of Morning magazine, Miko Yasu’s Hakozume (M1) and Rito Asami’s Ichikei no karasu (M2).

1 variable

Let’s say you want to compare these 4 comics based on 1 variable, e.g. the frequency of the hiragana character で de. (Which is not the most realistic stylometric indicator, but it will make more and more sense with an increasing number of variables.) Nothing easier than that. First, here are the numbers of で de per 100 hiragana for each text:

  • A1: 8
  • A2: 3
  • M1: 6
  • M2: 7

By simply subtracting the numbers from each other, we get the difference between any pair of manga and thus their similarity. Ranked from smallest difference to largest, these would be:

  • A1/M2: 1
  • M1/M2: 1
  • A1/M1: 2
  • A2/M1: 3
  • A2/M2: 4
  • A1/A2: 5

So the two Morning manga and one of the Akira scenes can be said to be similar, while the other Akira scene is the odd one out.

2 variables

With 2 variables, it gets more interesting. Let’s assume you decide that the similarity of these manga is best based on their use of the hiragana で de and い i. The frequencies for the latter are:

  • A1: 7
  • A2: 8
  • M1: 3
  • M2: 2

On a side note, at this point it might be a good idea to think about normalisation: are the numbers of the two variables comparable, so that a difference of e.g. “2” carries the same weight for both characteristics? In our example, this is not a problem because we’re dealing with two hiragana frequencies measured on the same scale, but if your two variables are e.g. the total number of kana characters per chapter and the shoe size of the author, the former will probably have much more impact on the similarity scores than the latter, because the range of numbers is wider – unless you adjust the scale of the variables. Except if this different impact was precisely what you wanted.
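Just to illustrate what “adjusting the scale” could look like: one common option (among several, and not something we need for the hiragana counts) is min-max scaling, which maps each variable to the range from 0 to 1. The function name is my own; this is a sketch, not a recommendation.

```python
def min_max_scale(values):
    """Rescale a list of numbers linearly so that min -> 0 and max -> 1."""
    lowest, highest = min(values), max(values)
    if highest == lowest:
        # All values identical: no spread to rescale.
        return [0.0 for _ in values]
    return [(v - lowest) / (highest - lowest) for v in values]


# Counts per 100 hiragana for A1, A2, M1, M2, as given in the text.
de_counts = [8, 3, 6, 7]
i_counts = [7, 8, 3, 2]

print(min_max_scale(de_counts))
print(min_max_scale(i_counts))
```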

Anyway, now we have 4 pairs of values, (8/7), (3/8), (6/3) and (7/2), which we could plot on an x and a y axis, like this:

To calculate the distance between any two of these points (i.e. the similarity of two manga), you’ll probably want to use Pythagoras and his a² + b² = c² formula, a.k.a. the Euclidean distance, with ‘a’ and ‘b’ representing the horizontal and vertical distances and ‘c’ being the diagonal line we’re looking for. There’s nothing wrong with that, but it might surprise you that in actual statistics and stylometrics, there are several other ways of measuring this distance. However, we’re going to stick with good old Pythagoras here.

The distance between A1 (で de: 8 / い i: 7) and A2 (3/8), for instance, would be the square root of the sum of (8-3)² and (7-8)², which is approximately 5.1. All distances, ranked from lowest to highest, would be (rounded to one decimal):

  • M1/M2: 1.4
  • A1/M1: 4.5
  • A1/A2: 5.1
  • A1/M2: 5.1
  • A2/M1: 5.8
  • A2/M2: 7.2
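The list above can be reproduced with a short Python sketch (names are my own; the coordinates are the で de and い i counts from the text):

```python
import math
from itertools import combinations

# (de, i) per 100 hiragana, as given in the text.
points = {"A1": (8, 7), "A2": (3, 8), "M1": (6, 3), "M2": (7, 2)}


def euclidean_2d(p, q):
    """Length of the diagonal between two points in the plane."""
    return math.hypot(p[0] - q[0], p[1] - q[1])


# One distance per unordered pair of manga.
distances = {
    f"{a}/{b}": euclidean_2d(points[a], points[b])
    for a, b in combinations(points, 2)
}

for pair, d in sorted(distances.items(), key=lambda item: item[1]):
    print(f"{pair}: {d:.1f}")
```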

Now the two Akira excerpts appear to be more similar than before when the similarity was only based on the frequency of で de, and the similarity between the two Morning manga is greater than that between the first Akira excerpt and either of the two Morning manga.

3 variables

Just as you imagine two points in 2-dimensional space forming two corners of a right-angled triangle (see above), in 3-dimensional space you have to imagine a rectangular cuboid – a ‘box’ (see the illustration on Wikipedia). Apparently, how to calculate the distance between the two opposite corner points of a cuboid is something you learn in high school, but I couldn’t remember and had to look it up. The formula for distance ‘d’ is: d² = a² + b² + c².

As our third variable, we’re going to use the frequency of the hiragana し shi. In the following list, the number of し shi per 100 hiragana is added as the third coordinate to each manga:

  • A1 (8/7/7)
  • A2 (3/8/1)
  • M1 (6/3/1)
  • M2 (7/2/5)

For instance, the distance between A1 and A2 is the square root of: (8-3)² + (7-8)² + (7-1)², i.e. roughly 7.9. Here are all the distances:

  • M1/M2: 4.2
  • A1/M2: 5.5
  • A2/M1: 5.8
  • A1/M1: 7.5
  • A1/A2: 7.9
  • A2/M2: 8.2

As we can see, the main difference between this similarity ranking and the previous one is that the similarity between the two Akira scenes has become smaller.

4 variables

You might have guessed it by now: even though it gets harder to imagine (and even more so to illustrate) a space of more than 3 dimensions, we can apply more or less the same formula regardless of the number of variables. We only need to add a new summand/addend for each new variable. For 4 variables, the distance between two points would be the square root of (a² + b² + c² + d²). These are the distances if we add the hiragana て te (which occurs 7 times per 100 hiragana in A1, 2 times in A2, 6 in M1, 4 in M2) as the 4th dimension:

  • M1/M2: 4.7
  • A1/M2: 6.2
  • A2/M1: 7.1
  • A1/M1: 7.5
  • A2/M2: 8.5
  • A1/A2: 9.3
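Since the formula only grows by one summand per variable, a single function can handle any number of dimensions. Here is a minimal sketch (function names are my own) that recomputes the rankings for 3 and 4 variables from the counts given in the text:

```python
import math
from itertools import combinations

# (de, i, shi, te) per 100 hiragana, as given in the text.
profiles = {
    "A1": (8, 7, 7, 7),
    "A2": (3, 8, 1, 2),
    "M1": (6, 3, 1, 6),
    "M2": (7, 2, 5, 4),
}


def euclidean(p, q):
    """Euclidean distance between two points of equal dimension."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def ranked_distances(profiles, dimensions):
    """All pairwise distances using only the first `dimensions` variables."""
    pairs = {
        f"{a}/{b}": euclidean(profiles[a][:dimensions], profiles[b][:dimensions])
        for a, b in combinations(profiles, 2)
    }
    return sorted(pairs.items(), key=lambda item: item[1])


for n in (3, 4):
    ranking = [(pair, round(d, 1)) for pair, d in ranked_distances(profiles, n)]
    print(n, "variables:", ranking)
```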

Note how the changes become smaller now – apart from the last two pairs having swapped places, the similarity ranking is the same as before.

dialogue in Katsuhiro Ōtomo’s Akira (A1, left) vs. dialogue in Miko Yasu’s Hakozume (M1, right)

25 variables

So how about 25 hiragana frequencies? This is more than half of all the different hiragana in our (100-hiragana samples of the) four manga. I added 21 random hiragana (see the graph) to the 4 from the previous section, and these are the resulting distances:

  • A1/M2: 9.7
  • A2/M1: 11.0
  • A1/A2: 12.5
  • M1/M2: 13.0
  • A2/M2: 13.3
  • A1/M1: 14.7

Who would have thought that? Now it looks as if the ‘scientists’ scene from Akira (A1) is similar to Ichikei no karasu (M2), and the ‘insurgent thugs’ scene from Akira (A2) is similar to Hakozume (M1). Which is what we suspected all along. So who knows, maybe we can do away with all this maths stuff after all? However, the usual caveat applies: proper stylometry should really be based on larger samples than 100 characters per text.

Flesch reading ease for stylometry?

The Flesch reading-ease score (FRES, also called FRE – ‘Flesch Reading Ease’) is still a popular measurement for the readability of texts, despite some criticism and suggestions for improvement since it was first proposed by Rudolf Flesch in 1948. (I’ve never read his original paper, though; all my information is taken from Wikipedia.) On a scale from 0 to 100, it indicates how difficult it is to understand a given text based on sentence length and word length, with a low score meaning difficult to read and a high score meaning easy to read.

Sentence length and word length are also popular factors in stylometry, the idea here being that some authors (or, generally speaking, kinds of text) prefer longer sentences and/or words while others prefer shorter ones. Thus such scores based on sentence length and word length might serve as an indicator of how similar two given texts are. In fact, FRES is used in actual stylometry, albeit only as one factor among many (e.g. in Brennan, Afroz and Greenstadt 2012 (PDF)). Over other stylometric indicators, FRES would have the added benefit that it actually says something in itself about the text, rather than being merely a number that only means something in relation to another.

The original FRES formula was developed for English and has been modified for other languages. In the last few stylometry blogposts here, the examples were taken from Japanese manga, but FRES is not well suited for Japanese. The main reason is that syllables don’t play much of a role in Japanese readability. More important factors are the number of characters and the ratio of kanji, as the number of syllables per character varies. A two-kanji compound, for instance, can have fewer syllables than a single-kanji word (e.g. 部長 bu‧chō ‘head of department’ vs. 力 chi‧ka‧ra ‘power’). Therefore, we’re going to use our old English-language X-Men examples from 2017 again.

The comics in question are: Astonishing X-Men #1 (1995) written by Scott Lobdell, Ultimate X-Men #1 (2001) written by Mark Millar, and Civil War: X-Men #1 (2006) written by David Hine. Looking at just the opening sequence of each comic (see the previous X-Men post for some images), we get the following sentence / word / syllable counts:

  • AXM: 3 sentences, 68 words, 100 syllables.
  • UXM: 6 sentences, 82 words, 148 syllables.
  • CW:XM: 7 sentences, 79 words, 114 syllables.

We don’t even need to use Flesch’s formula to get an idea of the readability differences: the sentences in AXM are really long and those in CW:XM are much shorter. As for word length, UXM stands out with rather long words such as “unconstitutional”, which is reflected in the high ratio of syllables per word.

Applying the formula (cf. Wikipedia), we get the following FRESs:

  • AXM: 59.4
  • UXM: 40.3
  • CW:XM: 73.3
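For reference, the standard English-language formula is 206.835 − 1.015 × (words / sentences) − 84.6 × (syllables / words). A small Python sketch (the function name is my own) reproduces the three scores from the counts above:

```python
def flesch_reading_ease(sentences, words, syllables):
    """Flesch reading-ease score from raw counts (English formula)."""
    return 206.835 - 1.015 * (words / sentences) - 84.6 * (syllables / words)


# (sentences, words, syllables) for each opening sequence, as counted above.
samples = {
    "AXM": (3, 68, 100),
    "UXM": (6, 82, 148),
    "CW:XM": (7, 79, 114),
}

for name, counts in samples.items():
    print(f"{name}: {flesch_reading_ease(*counts):.1f}")
```

Counting the syllables is the laborious part; the sketch assumes this has already been done by hand.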

Who would have thought that! It looks like UXM (or at least the selected portion) is harder to read than AXM – a FRES of 40.3 is already ‘College’ level according to Flesch’s table.

But how do these numbers help us if we’re interested in stylometric similarity? All three texts are written by different writers. So far we could only say (again – based on an insufficiently sized sample) that Hine’s writing style is closer to Lobdell’s than to Millar’s. The ultimate test for a stylometric indicator would be to take an additional example text that is written by one of the three authors, and see if its FRES is close to the one from the same author’s X-Men text.

Our 4th example will be the rather randomly selected Nemesis by Millar (2010, art by Steve McNiven) from which we’ll also take all text from the first few panels.

3 panels from Nemesis by Mark Millar and Steve McNiven

Part of the opening scene from Nemesis.

These are the numbers for the selected text fragment from Nemesis:

  • 8 sentences, 68 words, 88 syllables.
  • This translates to a FRES of 88.7!

In other words, Nemesis and UXM, the two comics written by Millar, appear to be the most dissimilar of the four! However, that was to be expected. Millar would be a poor writer if he always applied the same style to each character in each scene. And the two selected scenes are very different: a TV news report in UXM in contrast to a dialogue (or perhaps more like the typical villain’s monologue) in Nemesis.

Interestingly, there is a TV news report scene in Nemesis too (Part 3, p. 3). Wouldn’t that make for a more suitable comparison?

Here are the numbers for this TV scene which I’ll call N2:

  • 4 sentences, 81 words, 146 syllables.
  • FRES: 33.8

Now this looks more like Millar’s writing from UXM: the difference between the two scores is so small (6.5) that they can be said to be almost identical.

Still, we haven’t really proven anything yet. One possible interpretation of the scores is that the ~30-40 range is simply the usual range for this type of text, i.e. TV news reports. So perhaps these scores are not specific to Millar (or even to comics). One would have to look at similar scenes by Lobdell, Hine and/or other writers to verify that, and ideally also at real-world news transcripts.

On the other hand, one thing has worked well: two texts that we had intuitively identified as similar – UXM and N2 – indeed showed similar Flesch scores. That means FRES is not only a measurement of readability but also of stylometric similarity – albeit a rather crude one which is, as always, best used in combination with other metrics.

Kanji-kana ratio for stylometry?

I ended my blogpost on hiragana frequency as a stylometric indicator with the remark that, rather than the frequency distribution of different hiragana in the text, the ratio of kana to kanji is used as one of several key characteristics in actual stylometric analysis of Japanese texts. I was curious to find out if this number alone could tell us something about the 4 manga text samples in question (2 randomly selected scenes from Katsuhiro Ōtomo’s Akira and 2 series from Morning magazine, Miko Yasu’s Hakozume and Rito Asami’s Ichikei no karasu – in the following text referred to as A1, A2, M1 and M2, respectively). My intuition was that the results wouldn’t be meaningful because the samples were too small, but let’s see:

This time I chose a sample size of 200 characters (hiragana, katakana, and kanji) per text.

Among the first 200 characters in A1 (i.e. Akira vol. 5, p. 16), there are 113 hiragana, 42 katakana and 45 kanji. This results in a kanji-kana ratio of 45 : (113 + 42) = 0.29.

In A2 (Akira vol. 3, pp. 125 ff.), the first 200 characters comprise 126 hiragana, 34 katakana, and 40 kanji, i.e. the kanji-kana ratio is 0.25.

In M1, there are 122 hiragana, 9 katakana, and 69 kanji, resulting in a kanji-kana ratio of 0.53.

In M2, there are 117 hiragana, 0 katakana, and 83 kanji, resulting in a kanji-kana ratio of 0.71!
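I counted the characters by hand, but with a digital text the counting could be automated. The following is only a sketch under some assumptions of mine: characters are classified by their Unicode code point, the prolonged sound mark ー is counted as katakana, and the middle dot ・ as well as all punctuation is ignored. Rare kanji outside the main CJK block would be missed.

```python
def classify(char):
    """Return 'hiragana', 'katakana', 'kanji' or None for anything else."""
    code = ord(char)
    if 0x3041 <= code <= 0x3096:
        return "hiragana"
    # Katakana letters plus the prolonged sound mark (assumption: counts as kana).
    if 0x30A1 <= code <= 0x30FA or code == 0x30FC:
        return "katakana"
    # Main CJK Unified Ideographs block only.
    if 0x4E00 <= code <= 0x9FFF:
        return "kanji"
    return None


def count_scripts(text):
    """Count hiragana, katakana and kanji in a string."""
    counts = {"hiragana": 0, "katakana": 0, "kanji": 0}
    for char in text:
        script = classify(char)
        if script:
            counts[script] += 1
    return counts


def kanji_kana_ratio(hiragana, katakana, kanji):
    """Number of kanji divided by the number of kana."""
    return kanji / (hiragana + katakana)


print(count_scripts("何だァ"))
print(round(kanji_kana_ratio(113, 42, 45), 2))   # counts for A1 from the text
```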

6 hiragana, 2 katakana, 3 kanji in A2 (Akira vol. 3, p. 125).

Thus this time the authorship attribution seems to have worked: the two Ōtomo samples have an almost identical score, whereas those of the two Morning samples are completely different. Interestingly, this result contradicts the interpretation from the earlier blogpost in which I had suggested that the scientists in Akira and the lawyers in Karasu have similar ways of talking. The difference in the kanji-kana ratio between Akira and the two Morning manga, though, is explained not only through the more frequent use of kanji in the latter, but also through the vast differences in katakana usage (note that only characters in proper word balloons, i.e. dialogue, are counted, not sound effects).

Ōtomo uses katakana for two different purposes: in A1 mainly to reproduce the names of the foreign researchers, and in A2 to stretch syllables otherwise written in hiragana at the end of words, e.g. なにィ nanii (“whaaat?”) or 何だァ nandaa (“what is iiit?”). Therefore the similarity of the character use in the two Akira samples is superficial only and the pure numbers somewhat misleading. On the other hand, it makes sense that an action-packed scene such as A2 contains less than half as many kanji as the courtroom dialogue in M2; in A2 there are more simple, colloquial words for which the hiragana spelling is more common, e.g. くそう kusou (“shit!”) or うるせェ urusee (“quiet!”), whereas technical terms such as 被告人 hikokunin (“defendant”) in M2 are more clearly and commonly expressed in kanji.

In the end, the old rule applies: only with a large number of sample texts, with a large size of each sample, and through a combination of several different metrics can such stylometric approaches possibly succeed.

Hiragana for stylometry?

The other day I was made aware that some things I said in an earlier blogpost, “Author dictionaries and lexical analysis for comics”, might be misleading. So let’s be clear: if you would like to find something out about the writing style of an author or text, it’s not the best idea to look at the frequently used nouns, kanji, or other units of high semantic content. Those are more useful for analysing the content, i.e. the topic(s), of texts. In stylometry, units with low semantic content, such as function words (the, a, it, etc.), are more attractive objects of study, as they can be used almost independently of the topic and often present writers with a choice of which word to use when. In other words, the same writer tends to use the same function words and may be identified by them. (In practice, though, a combination of different characteristics is used for analysis – see the Stylometry article at Wikipedia and the references there.)

In order to automatically separate function words from content words in a digital text, part-of-speech tagging software may be employed. For Japanese, there is e.g. Kuromoji. But isn’t there a simpler way? Can’t we make use of the kanji–kana distinction used in the aforementioned earlier blogpost? If we identified kanji as the semantically rich(er) units, wouldn’t it be sufficient to focus on the kana for stylometric analysis? Maybe, maybe not. The results would probably be poorer, for two main reasons:

  1. Every content word (noun, verb, adjective), even if usually written in kanji, may also be written in kana. For instance, 分かる (to understand) is more frequently spelled in hiragana only, わかる. So when we gather kana from a text, we might end up with unwanted content words.
  2. In inflection suffixes, hiragana are dependent on the preceding kanji, and thus ultimately on the content of the text. For instance, a text on musical performance might contain many instances of the verb 引く hiku (to play an instrument), so one can expect the hiragana か ka, き ki, く ku, け ke and こ ko to occur more frequently than in other texts, as they are used for inflecting 引く.

That being said, why don’t we put this kana analysis method to the test anyway? Let’s take the example from Akira vol. 5, p. 16 again in which the scientists are talking (初めまして。スタンリー・シモンズ博士です etc.). We’ll focus on hiragana and ignore katakana, as they tend to be used for nouns too. Starting from those two panels, I manually counted these and the following hiragana until I reached 100. Here are the 5 most frequent hiragana in this set:

  • de: 8
  • i: 7
  • shi: 7
  • te: 7
  • no: 6

That means, if this were a sufficiently large sample, in any other piece of text by Ōtomo, or at least within Akira, roughly 8% of its hiragana should be de, 7% should be i, etc. So I randomly picked another scene from Akira (vol. 3, p. 125 ff) and looked at the first 100 hiragana there. The 5 most frequently used hiragana from the previous example are used less often here, with the exception of i:

de, su, u, ru, se, da

  • de: 3
  • i: 8
  • shi: 1
  • te: 2
  • no: 3

In these pages in vol. 3, we find mainly other hiragana such as tsu (9 times – including small tsu), ga (6 times), o (5 times) and su (5 times) to be the most frequently used. That, however, doesn’t tell us anything yet about the similarity of these two pieces of text (which I’m going to call “Akira 1” and “Akira 2” from here on). We need to add a third example, and for this purpose I’m going to use 100 hiragana from Miko Yasu’s Hakozume from the recently reviewed Morning magazine. If our method is successful, the differences between Hakozume and each of the two Akira scenes should be larger than those between Akira 1 and Akira 2. With frequency values for approximately 50 distinct hiragana we now have 3 × ~50 data points on which we could unleash the whole range of advanced statistical methods. But we’ll keep things simple and just add up the absolute differences in frequencies: Hakozume contains only 6 instances of de, i.e. 2 fewer than Akira 1; Hakozume uses i 3 times as opposed to the 7 in Akira 1, i.e. 4 fewer; Hakozume contains 6 fewer instances of shi than Akira 1; etc. Here’s the table of frequencies of de, i, shi, te and no in Hakozume:

a, no, na, n, de, a, no, ga…

  • de: 6
  • i: 3
  • shi: 1
  • te: 6
  • no: 8

The combined difference between Hakozume and Akira 1 for these 5 hiragana would be 2+4+6+1+2 = 15. For all ~50 different hiragana, the sum is 96.
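In code, this “sum of differences” is simply the sum of absolute differences over all hiragana (sometimes called the Manhattan distance). The sketch below, with names of my own choosing, covers only the 5 hiragana listed in this post, so it reproduces the 15 but not the total over all ~50 hiragana:

```python
# Counts per 100 hiragana for the 5 hiragana listed in the text.
akira_1 = {"de": 8, "i": 7, "shi": 7, "te": 7, "no": 6}
akira_2 = {"de": 3, "i": 8, "shi": 1, "te": 2, "no": 3}
hakozume = {"de": 6, "i": 3, "shi": 1, "te": 6, "no": 8}


def sum_of_differences(a, b):
    """Sum of absolute count differences; a missing kana counts as 0."""
    kana = set(a) | set(b)
    return sum(abs(a.get(k, 0) - b.get(k, 0)) for k in kana)


print(sum_of_differences(hakozume, akira_1))
```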

This looks like a large number, and indeed, when we calculate the difference between Akira 1 and Akira 2 in this way, the result is only 82. This means that the two Akira chunks are more similar in their usage of hiragana than Hakozume and Akira 1.

However, we’re not done yet. We still need to compare Hakozume to Akira 2. The result of this comparison may come as a surprise: the sum of differences is also 82! So Akira 2 is as similar to Hakozume as it is to Akira 1. If our goal was to find out whether a given piece of text is taken from Akira or not, our method would fail if we used Akira 2 as our base text with which to compare all others.

ha, no, ki, ka, ra, ho, do, de, ki, wo…

Just to make sure, I took another 100 hiragana from a different random manga in the same issue of Morning, Rito Asami’s Ichikei no karasu. I’ll refer to Ichikei no karasu as Morning 2 from now on, and to Hakozume as Morning 1. The results of the comparisons are even ‘worse’: while the sum of differences between Morning 2 and Akira 2 is 98 – i.e. vastly different – the difference between Morning 2 and Akira 1 is only 74, i.e. very similar.

Frequency of all hiragana in each of the four 100-hiragana samples

In a way, the results do make sense though. We’re looking at dialogue, after all, and the way scientists (in Akira 1) speak is closer to that of lawyers (in Morning 2) than that of insurgent thugs (in Akira 2). And apparently, the conversation between the two policewomen (in Morning 1) is not unlike the latter.

As so often, we could now blame the unsatisfactory results on the small sample size – if we had used chunks of 1000 hiragana instead of 100, surely our attribution attempts would have been more successful? We’ll never find out (unless we obtain a complete digital copy of Akira and extract the hiragana automatically). Another way to improve results would be to tweak the methodology: using data mining algorithms, more elaborate metrics such as co-occurrence of several hiragana could be employed. In actual stylometric research, hiragana seem to be used in yet another metric – the ratio of all hiragana to all other characters (kanji, katakana, rōmaji).

Trying to understand Latent Dirichlet Allocation

Latent Dirichlet Allocation (LDA) is one of the most popular algorithms for Topic Modeling, i.e. having a computer find out what a text is about. LDA is also perhaps easier to understand than the other popular Topic Modeling approach, (P)LSA, but even though there are two well-written blog posts that explain LDA (Edwin Chen’s and Ted Underwood’s) to non-mathematicians, it still took me quite some time to grasp LDA well enough to be able to code it in a Perl script (which I have made available on GitHub, in case anyone is interested). Of course, you can always simply use software like Mallet that runs LDA over your documents and outputs the results, but if you want to know what LDA actually does, I suggest you read Edwin Chen’s and Ted Underwood’s blog posts first, and then, if you still feel you don’t really get LDA, come back here. OK?

Welcome back. Disclaimer: I’m not a mathematician and there’s still the possibility that I got it all wrong. That being said, let’s take a look at Edwin Chen’s first example again, and this time we’re going to calculate it through step by step:

  • I like to eat broccoli and bananas.
  • I ate a banana and spinach smoothie for breakfast.
  • Chinchillas and kittens are cute.
  • My sister adopted a kitten yesterday.
  • Look at this cute hamster munching on a piece of broccoli.

We immediately see that these sentences are about either eating or pets or both, but even if we didn’t know about these two topics, we still have to make an assumption about the number of topics within our corpus of documents. Furthermore, we have to make an assumption about how these topics are distributed over the corpus. (In real-life LDA analyses, you’d run the algorithm multiple times with different parameters and then see which fit best.) For simplicity’s sake, let’s assume there are 2 topics, which we’ll call A and B, and they’re distributed evenly: half of the words in the corpus belong to topic A and the other half to topic B.

Apparently, hamsters do indeed eat broccoli. Photograph CC-BY

What exactly is a word, though? I found the use of this term confusing in both Chen’s and Underwood’s text, so instead I’ll speak of tokens and lemmata: the lemma ‘cute’ appears as 2 tokens in the corpus above. Before we apply the actual LDA algorithm, it makes sense to not only tokenise but also lemmatise our 5 example documents (i.e. sentences), and also to remove stop words such as pronouns and prepositions, which may result in something like this:

  • like eat broccoli banana
  • eat banana spinach smoothie breakfast
  • chinchilla kitten cute
  • sister adopt kitten yesterday
  • look cute hamster munch piece broccoli

Now we randomly assign topics to tokens according to our assumptions (2 topics, 50:50 distribution). This may result in e.g. ‘cute’ getting assigned once to topic A and once to topic B. An initial random topic assignment may look like this:

  • like -> A, eat -> B, broccoli -> A, banana -> B
  • eat -> A, banana -> B, spinach -> A, smoothie -> B, breakfast -> A
  • chinchilla -> B, kitten -> A, cute -> B
  • sister -> A, adopt -> B, kitten -> A, yesterday -> B
  • look -> A, cute -> B, hamster -> A, munch -> B, piece -> A, broccoli -> B

Clearly, this isn’t a satisfying result yet; words like ‘eat’ and ‘broccoli’ are assigned to multiple topics when they should belong to only one, etc. Ideally, all words connected to the topic of eating should be assigned to one topic and all words related to pets should belong to the other. Now the LDA algorithm goes through the documents to improve this initial topic assignment: it computes probabilities which topic each token should belong to, based on three criteria:

  1. Which topics are the other tokens in this document assigned to? Probably the document is about one single topic, so if all or most other tokens belong to topic A, then the token in question should most likely also get assigned to topic A.
  2. Which topics are the other tokens in *all* documents assigned to? Remember that we assume a 50:50 distribution of topics, so if the majority of tokens is assigned to topic A, the token in question should get assigned to topic B to establish an equilibrium.
  3. If there are multiple tokens of the same lemma: which topic is the majority of tokens of that lemma assigned to? If most instances of ‘eat’ belong to topic A, then the token in question probably also belongs to topic A.

The actual formulas to calculate the probabilities given by Chen and Underwood seem to differ a bit from each other, but instead of bothering you with a formula, I’ll simply describe how it works in the example (my understanding being closer to Chen’s formula, I think). Let’s start with the first token of the first document (although the order doesn’t matter), ‘like’, currently assigned to topic A.

Should ‘like’ belong to topic B instead? If ‘like’ belonged to topic B, 3 out of 4 tokens in this document would belong to the same topic, as opposed to 2:2 if we stay with topic A. On the other hand, changing ‘like’ to topic B would threaten the equilibrium of topics over all documents: topic B would consist of 12 tokens and topic A of only 10, as opposed to the perfect 11:11 equilibrium if ‘like’ remains in topic A. In this case, the former consideration outweighs the latter, as the two factors get multiplied: the probability for ‘change this token to topic B’ is 3/4 * 1/12 = 6.3%, whereas the probability for ‘stay with topic A’ is 2/4 * 1/11 = 4.5%. We can also normalise these numbers (so that they add up to 100%) and say: ‘like’ is 58% topic B and 42% topic A.

What are you supposed to do with these percentages? We’ll get there in a minute. Let’s first calculate them for the next token, ‘eat’, because it’s one of those interesting lemmata with multiple tokens in our corpus. Currently, ‘eat’ in the first document is assigned to topic B, but in the second document it’s assigned to topic A. The probability for ‘eat stays in topic B’ is the same as for ‘like stays in topic A’ above: within this document, the ratio of ‘B’ tokens to ‘A’ tokens is 2:2, which gives us 2/4 or 0.5 for the first factor; ‘eat’ would be 1 out of 11 tokens that make up topic B across all documents, giving us 1/11 for the second factor. The probability for ‘change eat to topic A’ is much higher, though, because there is already another ‘eat’ token assigned to this topic in another document. The first factor is 3/4 again, but the second is 2/12, because out of the 12 tokens that would make up topic A if we changed this token to topic A, 2 tokens would be of the same lemma, ‘eat’. In percentages, this means: this first ‘eat’ token is 73% topic A and only 27% topic B.

In this way we can calculate probabilities for each token in the corpus. Then we randomly assign new topics to each token, only this time not on a 50:50 basis, but according to the percentages we’ve figured out before. So this time, it’s more likely that ‘like’ will end up in topic B, but there’s still a 42% chance it will get assigned to topic A again. The new distribution of topics might be slightly better than the first one, but depending on how lucky you were with the random assignment in the beginning, it’s still unlikely that all tokens pertaining to food are neatly put in one topic and the animal tokens in the other.
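The calculation for ‘like’ and ‘eat’ and the subsequent weighted re-assignment can be sketched in Python like this. It follows the two-factor description given in this post, not necessarily the exact formulas of Chen or Underwood, and all names (`topic_probabilities`, `resample` etc.) are made up for the illustration. The percentages are computed from the unrounded factors, so they may differ by a point from figures that were rounded at an intermediate step.

```python
import random

DOCS = [
    ["like", "eat", "broccoli", "banana"],
    ["eat", "banana", "spinach", "smoothie", "breakfast"],
    ["chinchilla", "kitten", "cute"],
    ["sister", "adopt", "kitten", "yesterday"],
    ["look", "cute", "hamster", "munch", "piece", "broccoli"],
]
# The initial random topic assignment listed above, token by token.
ASSIGNMENT = [
    ["A", "B", "A", "B"],
    ["A", "B", "A", "B", "A"],
    ["B", "A", "B"],
    ["A", "B", "A", "B"],
    ["A", "B", "A", "B", "A", "B"],
]
TOPICS = ("A", "B")


def topic_probabilities(docs, assignment, d, i):
    """Normalised topic probabilities for token i of document d."""
    lemma = docs[d][i]
    scores = {}
    for topic in TOPICS:
        # Factor 1: share of this document's tokens that would be in
        # `topic` if the token in question were assigned to it.
        in_doc = 1 + sum(1 for j, t in enumerate(assignment[d])
                         if j != i and t == topic)
        # Factor 2: share of tokens in `topic` (across all documents)
        # that would be of the same lemma, again counting the token itself.
        topic_size = same_lemma = 1
        for dd, doc in enumerate(docs):
            for jj, other in enumerate(doc):
                if (dd, jj) != (d, i) and assignment[dd][jj] == topic:
                    topic_size += 1
                    same_lemma += other == lemma
        scores[topic] = (in_doc / len(docs[d])) * (same_lemma / topic_size)
    total = sum(scores.values())
    return {topic: score / total for topic, score in scores.items()}


def resample(probs, rng):
    """Draw a new topic at random, weighted by the probabilities."""
    return rng.choices(list(probs), weights=list(probs.values()))[0]


p_like = topic_probabilities(DOCS, ASSIGNMENT, 0, 0)  # 'like', document 1
p_eat = topic_probabilities(DOCS, ASSIGNMENT, 0, 1)   # 'eat', document 1
rng = random.Random(0)
for name, probs in (("like", p_like), ("eat", p_eat)):
    shares = ", ".join(f"{t}: {p:.0%}" for t, p in probs.items())
    print(f"{name}: {shares}; new topic drawn: {resample(probs, rng)}")
```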

The solution is to iterate: repeat the process of probability calculations with the new topic assignments, then randomly assign new topics based on the latest probabilities, and so on. After a couple of thousand iterations, the probabilities should make more sense. Ideally, there should now be some tokens with high percentages for each topic, so that both topics are clearly defined.
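Put together, the iteration might look like the following Python sketch. It is a simplified illustration of the procedure as described in this post (compute the probabilities for all tokens, then re-assign all tokens at once) and neither the script used for the results below nor Mallet’s implementation; the function names, the seed and the number of iterations are assumptions.

```python
import random

DOCS = [
    ["like", "eat", "broccoli", "banana"],
    ["eat", "banana", "spinach", "smoothie", "breakfast"],
    ["chinchilla", "kitten", "cute"],
    ["sister", "adopt", "kitten", "yesterday"],
    ["look", "cute", "hamster", "munch", "piece", "broccoli"],
]
TOPICS = ("A", "B")


def topic_probabilities(docs, assignment, d, i):
    """Normalised topic probabilities for token i of document d."""
    lemma = docs[d][i]
    scores = {}
    for topic in TOPICS:
        # Factor 1: share of the document in `topic`, counting the token.
        in_doc = 1 + sum(1 for j, t in enumerate(assignment[d])
                         if j != i and t == topic)
        # Factor 2: share of `topic` made up of this lemma, counting the token.
        topic_size = same_lemma = 1
        for dd, doc in enumerate(docs):
            for jj, other in enumerate(doc):
                if (dd, jj) != (d, i) and assignment[dd][jj] == topic:
                    topic_size += 1
                    same_lemma += other == lemma
        scores[topic] = (in_doc / len(docs[d])) * (same_lemma / topic_size)
    total = sum(scores.values())
    return {topic: score / total for topic, score in scores.items()}


def run(docs, iterations, seed=0):
    """Alternate between computing probabilities and re-assigning topics."""
    rng = random.Random(seed)
    # Start with a random 50:50 assignment.
    assignment = [[rng.choice(TOPICS) for _ in doc] for doc in docs]
    probs = []
    for _ in range(iterations):
        # Step 1: probabilities for every token under the current assignment.
        probs = [[topic_probabilities(docs, assignment, d, i)
                  for i in range(len(doc))] for d, doc in enumerate(docs)]
        # Step 2: draw a new topic for every token according to them.
        assignment = [[rng.choices(TOPICS, weights=[p[t] for t in TOPICS])[0]
                       for p in doc_probs] for doc_probs in probs]
    return assignment, probs


assignment, probs = run(DOCS, iterations=1000)
for topic in TOPICS:
    # List the tokens that lean most strongly towards this topic.
    ranked = sorted(((p[topic], lemma)
                     for doc, doc_probs in zip(DOCS, probs)
                     for lemma, p in zip(doc, doc_probs)), reverse=True)
    top = ", ".join(f"{lemma} ({share:.0%})" for share, lemma in ranked[:4])
    print(f"topic {topic}: {top}")
```

Because every step involves random draws, different seeds will generally lead to different final assignments.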

Only with this example, it doesn’t work out. After 10,000 iterations, the LDA script I’ve written produces results like this:

  • topic A: cute (88%), like (79%), chinchilla (77%), hamster (76%), …
  • topic B: kitten (89%), sister (79%), adopt (79%), yesterday (79%), …

As you can see, words from the ‘animals’ category ended up in both topics, so this result is worthless. The result given by Mallet after 10,000 iterations is slightly better:

  • topic 0: cute kitten broccoli munch hamster look yesterday sister chinchilla spinach
  • topic 1: banana eat piece adopt breakfast smoothie like

Topic 0 is clearly the ‘animal’ topic here. Words like ‘broccoli’ and ‘munch’ slipped in because they occur in the mixed-topic sentence, “Look at this cute hamster munching on a piece of broccoli”. No idea why ‘spinach’ is in there too though. It’s equally puzzling that ‘adopt’ somehow crept into topic 1, which otherwise can be identified as the ‘food’ topic.

The reason for this ostensible failure of the LDA algorithm is probably the small size of the test data set. The results become more convincing the greater the number of tokens per document.

Detail from p. 1 of Astonishing X-Men (1995) #1 by Scott Lobdell and Joe Madureira. The text in the caption boxes (with stop words liberally removed) can be tokenised and lemmatised as: begin break man heart sear soul erik lehnsherr know world magneto founder astonishing x-men last bastion hope world split asunder ravage eugenics war human mutant know exact ask homo superior comrade day ask die

For a real-world example with more tokens, I have selected some X-Men comics. The idea is that because they are about similar subject matters, we can expect some words to be used in multiple texts from which topics can be inferred. This new test corpus consists of the first 100 tokens (after stop word removal) from each of the following comic books that I more or less randomly pulled from my longbox/shelf: Astonishing X-Men #1 (1995) by Scott Lobdell, Ultimate X-Men #1 (2001) by Mark Millar, and Civil War: X-Men #1 (2006) by David Hine. All three comics open with captions or dialogue with relatively general remarks about the ‘mutant question’ (i.e. government action / legislation against mutants, human rights of mutants) and human-mutant relations, so that otherwise uncommon lemmata such as ‘mutant’, ‘human’ or ‘sentinel’ occur in all three of them. To increase the number of documents, I have split each 100-token batch into two parts at semantically meaningful points, e.g. when the text changes from captions to dialogue in AXM, or after the voice from the television is finished in CW:XM.

Page 6, panel 1 from Ultimate X-Men #1 by Mark Millar and Adam Kubert. Tokens: good evening boaz eshelmen watch channel nine new update tonight top story trial run sentinel hail triumphant success mutant nest los angeles uncover neutralize civilian casualty

I then ran my LDA script (as described above) over these 6 documents with ~300 tokens, again with the assumption that there are 2 equally distributed topics (because I had carelessly hard-coded this number of topics in the script and now I’m too lazy to re-write it). This is the result after 1,000 iterations:

  • topic A: x-men (95%), sentinel (93%), sentinel (91%), story (91%), different (90%), …
  • topic B: day (89%), kitty (86%), die (86%), …

So topic A looks like the ‘mutant question’ issue with tokens like ‘x-men’ and two times ‘sentinel’, even though ‘mutant’ itself isn’t among the high-scoring tokens. Topic B, on the other hand, makes less sense (Kitty Pryde only appears in CW:XM, so that ‘kitty’ occurs in merely 2 of the 6 documents), and its highest percentages are also much lower than those in topic A. Maybe this means that there’s only one actual topic in this corpus.

Page 1, panel 5 from Civil War: X-Men #1 by David Hine and Yanick Paquette. Tokens: incessant rain hear thing preternatural acute hearing cat flea

Running Mallet over this corpus (2 topics, 10,000 iterations) yields an even less useful result. The first 5 words in each topic are:

  • topic 0: mutant, know, x-men, ask, cooper
  • topic 1: say, sentinel, morph, try, ready

(Valerie Cooper and Morph are characters that appear in only one comic, CW:XM and AXM, respectively.)

Topic 0 at least associates ‘x-men’ with ‘mutant’, but then again, ‘sentinel’ is assigned to the other topic. Thus neither topic can be related to an intuitively perceived theme in the comics. It’s clear how these topics were generated though: there’s only 1 document in which ‘sentinel’ doesn’t occur, the first half of the CW:XM excerpt, in which Valerie Cooper is interviewed on television. But ‘x-men’ and ‘mutant’ do occur in this document, the latter even twice, and also ‘know’ occurs more frequently (3 times) here than in other documents.

So the results from Mallet and maybe even my own Perl script seem to be correct, in the sense that the LDA algorithm has been properly performed and one can see from the results how the algorithm got there. But what’s the point of having ‘topics’ that can’t be matched to what we intuitively perceive as themes in a text?

The problem with our two example corpora here was that they were still not large enough for LDA to yield meaningful results. As with all statistical methods, LDA works better the larger the corpus. In fact, the idea of such methods is that they are best applied to amounts of text that are too large for a human to read. Therefore, LDA might not be that useful for disciplines (such as comics studies) in which it’s difficult to gather large text corpora in digital form. But do feel free to e.g. randomly download texts from Wikisource, and you’ll find that within them, LDA is able to successfully detect clusters of words that occur in semantically similar documents.