Trying to understand Latent Dirichlet Allocation

Latent Dirichlet Allocation (LDA) is one of the most popular algorithms for Topic Modeling, i.e. having a computer find out what a text is about. LDA is also perhaps easier to understand than the other popular Topic Modeling approach, (P)LSA, but even though there are two well-written blog posts that explain LDA (Edwin Chen’s and Ted Underwood’s) to non-mathematicians, it still took me quite some time to grasp LDA well enough to be able to code it in a Perl script (which I have made available on GitHub, in case anyone is interested). Of course, you can always simply use a software like Mallet that runs LDA over your documents and outputs the results, but if you want to know what LDA actually does, I suggest you read Edwin Chen’s and Ted Underwood’s blog posts first, and then, if you still feel you don’t really get LDA, come back here. OK?

Welcome back. Disclaimer: I’m not a mathematician and there’s still the possibility that I got it all wrong. That being said, let’s take a look at Edwin Chen’s first example again, and this time we’re going to calculate it through step by step:

  • I like to eat broccoli and bananas.
  • I ate a banana and spinach smoothie for breakfast.
  • Chinchillas and kittens are cute.
  • My sister adopted a kitten yesterday.
  • Look at this cute hamster munching on a piece of broccoli.

We immediately see that these sentences are about either eating or pets or both, but even if we didn’t know about these two topics, we still have to make an assumption about the number of topics within our corpus of documents. Furthermore, we have to make an assumption how these topics are distributed over the corpus. (In real life LDA analyses, you’d run the algorithm multiple times with different parameters and then see which fit best.) For simplicity’s sake, let’s assume there are 2 topics, which we’ll call A and B, and they’re distributed evenly: half of the words in the corpus belong to topic A and the other half to topic B.

Apparently, hamsters do indeed eat broccoli. Photograph CC-BY

What exactly is a word, though? I found the use of this term confusing in both Chen’s and Underwood’s text, so instead I’ll speak of tokens and lemmata: the lemma ‘cute’ appears as 2 tokens in the corpus above. Before we apply the actual LDA algorithm, it makes sense to not only tokenise but also lemmatise our 5 example documents (i.e. sentences), and also to remove stop words such as pronouns and prepositions, which may result in something like this:

  • like eat broccoli banana
  • eat banana spinach smoothie breakfast
  • chinchilla kitten cute
  • sister adopt kitten yesterday
  • look cute hamster munch piece broccoli

Now we randomly assign topics to tokens according to our assumptions (2 topics, 50:50 distribution). This may result in e.g. ‘cute’ getting assigned once to topic A and once to topic B. An initial random topic assignment may look like this:

  • like -> A, eat -> B, broccoli -> A, banana -> B
  • eat -> A, banana -> B, spinach -> A, smoothie -> B, breakfast -> A
  • chinchilla -> B, kitten -> A, cute -> B
  • sister -> A, adopt -> B, kitten -> A, yesterday -> B
  • look -> A, cute -> B, hamster -> A, munch -> B, piece -> A, broccoli -> B

Clearly, this isn’t a satisfying result yet; words like ‘eat’ and ‘broccoli’ are assigned to multiple topics when they should belong to only one, etc. Ideally, all words connected to the topic of eating should be assigned to one topic and all words related to pets should belong to the other. Now the LDA algorithm goes through the documents to improve this initial topic assignment: it computes probabilities which topic each token should belong to, based on three criteria:

  1. Which topics are the other tokens in this document assigned to? Probably the document is about one single topic, so if all or most other tokens belong to topic A, then the token in question should most likely also get assigned to topic A.
  2. Which topics are the other tokens in *all* documents assigned to? Remember that we assume a 50:50 distribution of topics, so if the majority of tokens is assigned to topic A, the token in question should get assigned to topic B to establish an equilibrium.
  3. If there are multiple tokens of the same lemma: which topic is the majority of tokens of that lemma assigned to? If most instances of ‘eat’ belong to topic A, then the token in question probably also belongs to topic A.

The actual formulas to calculate the probabilities given by Chen and Underwood seem to differ a bit from each other, but instead of bothering you with a formula, I’ll simply describe how it works in the example (my understanding being closer to Chen’s formula, I think). Let’s start with the first token of the first document (although the order doesn’t matter), ‘like’, currently assigned to topic A.

Should ‘like’ belong to topic B instead? If ‘like’ belonged to topic B, 3 out of 4 tokens in this document would belong to the same topic, as opposed to 2:2 if we stay with topic A. On the other hand, changing ‘like’ to topic B would threaten the equilibrium of topics over all documents: topic B would consist of 12 tokens and topic A of only 10, as opposed to the perfect 11:11 equilibrium if ‘like’ remains in topic A. In this case, the former consideration outweighs the latter, as the two factors get multiplied: the probability for ‘change this token to topic B’ is 3/4 * 1/12 = 6%, whereas the probability for ‘stay with topic A’ is 2/4 * 1/11 = 4.5%. We can also convert these numbers to absolute percentages (so that they add up to 100%) and say: ‘like’ is 57% topic B and 43% topic A.

What are you supposed to do with these percentages? We’ll get there in a minute. Let’s first calculate them for the next token, ‘eat’, because it’s one of those interesting lemmata with multiple tokens in our corpus. Currently, ‘eat’ in the first document is assigned to topic B, but in the second document it’s assigned to topic A. The probability for ‘eat stays in topic B’ is the same as the same as for ‘like stays in topic A’ above: within this document, the ratio of ‘B’ tokens to ‘A’ tokens is 2:2, which gives us 2/4 or 0.5 for the first factor; ‘eat’ would be 1 out of 11 tokens that make up topic B across all documents, giving us 1/11 for the second factor. The probability for ‘change eat to topic A’ is much higher, though, because there is already another ‘eat’ token assigned to this topic in another document. The first factor is 3/4 again, but the second is 2/12, because out of the 12 tokens that would make up topic A if we changed this token to topic A, 2 tokens would be of the same lemma, ‘eat’. In percentages, this means: this first ‘eat’ token is 74% topic A and only 26% topic B.

In this way we can calculate probabilities for each token in the corpus. Then we randomly assign new topics to each token, only this time not on a 50:50 basis, but according to the percentages we’ve figured out before. So this time, it’s more likely that ‘like’ will end up in topic B, but there’s still a 43% chance it will get assigned to topic A again. The new distribution of topics might be slightly better than the first one, but depending on how lucky you were with the random assignment in the beginning, it’s still unlikely that all tokens pertaining to food are neatly put in one topic and the animal tokens in the other.

The solution is to iterate: repeat the process of probability calculations with the new topic assignments, then randomly assign new topics based on the latest probabilities, and so on. After a couple of thousand iterations, the probabilities should make more sense. Ideally, there should now be some tokens with high percentages for each topic, so that both topics are clearly defined.

Only with this example, it doesn’t work out. After 10,000 iterations, the LDA script I’ve written produces results like this:

  • topic A: cute (88%), like (79%), chinchilla (77%), hamster (76%), …
  • topic B: kitten (89%), sister (79%), adopt (79%), yesterday (79%), …

As you can see, words from the ‘animals’ category ended up in both topics, so this result is worthless. The result given by Mallet after 10,000 iterations is slightly better:

  • topic 0: cute kitten broccoli munch hamster look yesterday sister chinchilla spinach
  • topic 1: banana eat piece adopt breakfast smoothie like

Topic 0 is clearly the ‘animal’ topic here. Words like ‘broccoli’ and ‘much’ slipped in because they occur in the mixed-topic sentence, “Look at this cute hamster munching on a piece of broccoli”. No idea why ‘spinach’ is in there too though. It’s equally puzzling that ‘adopt’ somehow crept into topic 1, which otherwise can be identified as the ‘food’ topic.

The reason for this ostensible failure of the LDA algorithm is probably the small size of the test data set. The results become more convincing the greater the number of tokens per document.

Detail from p. 1 of Astonishing X-Men (1995) #1 by Scott Lobdell and Joe Madureira. The text in the caption boxes (with stop words liberally removed) can be tokenised and lemmatised as: begin break man heart sear soul erik lehnsherr know world magneto founder astonishing x-men last bastion hope world split asunder ravage eugenics war human mutant know exact ask homo superior comrade day ask die

For a real-world example with more tokens, I have selected some X-Men comics. The idea is that because they are about similar subject matters, we can expect some words to be used in multiple texts from which topics can be inferred. This new test corpus consists of the first 100 tokens (after stop word removal) from each of the following comic books that I more or less randomly pulled from my longbox/shelf: Astonishing X-Men #1 (1995) by Scott Lobdell, Ultimate X-Men #1 (2001) by Mark Millar, and Civil War: X-Men #1 (2006) by David Hine. All three comics open with captions or dialogue with relatively general remarks about the ‘mutant question’ (i.e. government action / legislation against mutants, human rights of mutants) and human-mutant relations, so that otherwise uncommon lemmata such as ‘mutant’, ‘human’ or ‘sentinel’ occur in all three of them. To increase the number of documents, I have split each 100-token batch into two parts at semantically meaningful points, e.g. when the text changes from captions to dialogue in AXM, or after the voice from the television is finished in CW:XM.

Page 6, panel 1 from UItimate X-Men #1 by Mark Millar and Adam Kubert. Tokens: good evening boaz eshelmen watch channel nine new update tonight top story trial run sentinel hail triumphant success mutant nest los angeles uncover neutralize civilian casualty

I then ran my LDA script (as described above) over these 6 documents with ~300 tokens, again with the assumption that there are 2 equally distributed topics (because I had carelessly hard-coded this number of topics in the script and now I’m too lazy to re-write it). This is the result after 1,000 iterations:

  • topic A: x-men (95%), sentinel (93%), sentinel (91%), story (91%), different (90%), …
  • topic B: day (89%), kitty (86%), die (86%), …

So topic A looks like the ‘mutant question’ issue with tokens like ‘x-men’ and two times ‘sentinel’, even though ‘mutant’ itself isn’t among the high-scoring tokens. Topic B, on the other hand, makes less sense (Kitty Pryde only appears in CW:XM, so that ‘kitty’ occurs in merely 2 of the 6 documents), and its highest percentages are also much lower than those in topic A. Maybe this means that there’s only one actual topic in this corpus.

Page 1, panel 5 from Civil War: X-Men #1 by David Hine and Yanick Paquette. Tokens: incessant rain hear thing preternatural acute hearing cat flea

Running Mallet over this corpus (2 topics, 10,000 iterations) yields an even less useful result. The first 5 words in each topic are:

  • topic 0: mutant, know, x-men, ask, cooper
  • topic 1: say, sentinel, morph, try, ready

(Valerie Cooper and Morph are characters that appear in only one comic, CW:XM and AXM, respectively.)

Topic 0 at least associates ‘x-men’ with ‘mutant’, but then again, ‘sentinel’ is assigned to the other topic. Thus neither topic can be related to an intuitively perceived theme in the comics. It’s clear how these topics were generated though: there’s only 1 document in which ‘sentinel’ doesn’t occur, the first half of the CW:XM excerpt, in which Valerie Cooper is interviewed on television. But ‘x-men’ and ‘mutant’ do occur in this document, the latter even twice, and also ‘know’ occurs more frequently (3 times) here than in other documents.

So the results from Mallet and maybe even my own Perl script seem to be correct, in the sense that the LDA algorithm has been properly performed and one can see from the results how the algorithm got there. But what’s the point of having ‘topics’ that can’t be matched to what we intuitively perceive as themes in a text?

The problem with our two example corpora here was, they were still not large enough for LDA to yield meaningful results. As with all statistical methods, LDA works better the larger the corpus. In fact, the idea of such methods is that they are best applied to amounts of text that are too large for a human to read. Therefore, LDA might be not that useful for disciplines (such as comics studies) in which it’s difficult to gather large text corpora in digital form. But do feel free to e.g. randomly download texts from Wikisource, and you’ll find that within them, LDA is able to successfully detect clusters of words that occur in semantically similar documents.