The Flesch reading-ease score (FRES, also called FRE – ‘Flesch Reading Ease’) is still a popular measurement for the readability of texts, despite some criticism and suggestions for improvement since it was first proposed by Rudolf Flesch in 1948. (I’ve never read his original paper, though; all my information is taken from Wikipedia.) On a scale from 0 to 100, it indicates how difficult it is to understand a given text based on sentence length and word length, with a low score meaning difficult to read and a high score meaning easy to read.
Sentence length and word length are also popular factors in stylometry, the idea here being that some authors (or, generally speaking, kinds of text) prefer longer sentences and/or words while others prefer shorter ones. Thus scores based on sentence length and word length might serve as an indicator of how similar two given texts are. In fact, FRES is used in actual stylometry, albeit only as one factor among many (e.g. in Brennan, Afroz and Greenstadt 2012). Compared with other stylometric indicators, FRES has the added benefit that it actually says something in itself about the text, rather than being merely a number that only means something in relation to another.
The original FRES formula was developed for English and has been modified for other languages. In the last few stylometry blogposts here, the examples were taken from Japanese manga, but FRES is not well suited for Japanese. The main reason is that syllables don’t play much of a role in Japanese readability. More important factors are the number of characters and the ratio of kanji, as the number of syllables per character varies. A two-kanji compound, for instance, can have fewer syllables than a single-kanji word (e.g. 部長 bu‧chō ‘head of department’ vs. 力 chi‧ka‧ra ‘power’). Therefore, we’re going to use our old English-language X-Men examples from 2017 again.
The comics in question are: Astonishing X-Men #1 (1995) written by Scott Lobdell, Ultimate X-Men #1 (2001) written by Mark Millar, and Civil War: X-Men #1 (2006) written by David Hine. Looking at just the opening sequence of each comic (see the previous X-Men post for some images), we get the following sentence / word / syllable counts:
- AXM: 3 sentences, 68 words, 100 syllables.
- UXM: 6 sentences, 82 words, 148 syllables.
- CW:XM: 7 sentences, 79 words, 114 syllables.
We don’t even need to use Flesch’s formula to get an idea of the readability differences: the sentences in AXM are really long and those in CW:XM are much shorter. As for word length, UXM stands out with rather long words such as “unconstitutional”, which is reflected in the high ratio of syllables per word.
Applying the formula (cf. Wikipedia), we get the following FRES values:
- AXM: 59.4
- UXM: 40.3
- CW:XM: 73.3
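For anyone who wants to check these numbers, here is a minimal Python sketch. It assumes the standard English-language Flesch formula as given on Wikipedia (206.835 − 1.015 × words per sentence − 84.6 × syllables per word); the function name `flesch_reading_ease` and the `counts` dictionary are only illustrative, and the counts themselves are the ones listed above.

```python
def flesch_reading_ease(sentences, words, syllables):
    """Flesch reading-ease score from raw counts (standard English formula)."""
    words_per_sentence = words / sentences
    syllables_per_word = syllables / words
    return 206.835 - 1.015 * words_per_sentence - 84.6 * syllables_per_word


# (sentences, words, syllables) for the three opening sequences
counts = {
    "AXM": (3, 68, 100),
    "UXM": (6, 82, 148),
    "CW:XM": (7, 79, 114),
}

scores = {name: round(flesch_reading_ease(*c), 1) for name, c in counts.items()}

for name, score in scores.items():
    print(f"{name}: {score}")
```

Counting sentences, words and syllables is the hard part in practice; the sketch deliberately starts from counts made by hand.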
Who would have thought that! It looks like UXM (or at least the selected portion) is harder to read than AXM – a FRES of 40.3 is already ‘College’ level according to Flesch’s table.
But how do these numbers help us if we’re interested in stylometric similarity? All three texts are written by different writers. So far we could only say (again – based on an insufficiently sized sample) that Hine’s writing style is closer to Lobdell’s than to Millar’s. The ultimate test for a stylometric indicator would be to take an additional example text that is written by one of the three authors, and see if its FRES is close to the one from the same author’s X-Men text.
Our 4th example will be the rather randomly selected Nemesis by Millar (2010, art by Steve McNiven) from which we’ll also take all text from the first few panels.
These are the numbers for the selected text fragment from Nemesis:
- 8 sentences, 68 words, 88 syllables.
- This translates to a FRES of 88.7!
In other words, Nemesis and UXM, the two comics written by Millar, appear to be the most dissimilar of the four! However, that was to be expected. Millar would be a poor writer if he always applied the same style to each character in each scene. And the two selected scenes are very different: a TV news report in UXM in contrast to a dialogue (or perhaps more like the typical villain’s monologue) in Nemesis.
Interestingly, there is a TV news report scene in Nemesis too (Part 3, p. 3). Wouldn’t that make for a more suitable comparison?
Here are the numbers for this TV scene which I’ll call N2:
- 4 sentences, 81 words, 146 syllables.
- FRES: 33.8
Now this looks more like Millar’s writing from UXM: the difference between the two scores is so small (6.5) that they can be said to be almost identical.
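To see at a glance which texts lie closest to each other, the five scores can be compared pairwise. The following is only a rough sketch: it takes the rounded scores from this post as given and uses the absolute difference between two scores as a stand-in for stylistic distance, which is my own simplification and not an established stylometric measure.

```python
from itertools import combinations

# FRES values as reported in this post
fres = {
    "AXM": 59.4,      # Lobdell
    "UXM": 40.3,      # Millar
    "CW:XM": 73.3,    # Hine
    "Nemesis": 88.7,  # Millar, villain's monologue
    "N2": 33.8,       # Millar, TV news report
}

# Absolute score difference for every pair of texts, smallest first
differences = sorted(
    (round(abs(fres[a] - fres[b]), 1), a, b) for a, b in combinations(fres, 2)
)

for diff, a, b in differences:
    print(f"{a} vs. {b}: {diff}")

closest = differences[0]
```

With these inputs the pair with the smallest difference is UXM and N2, the two TV news report scenes.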
Still, we haven’t really proven anything yet. One possible interpretation of the scores is that the ~30-40 range is simply the usual range for this type of text, i.e. TV news reports. So perhaps these scores are not specific to Millar (or even to comics). One would have to look at similar scenes by Lobdell, Hine and/or other writers to verify that, and ideally also at real-world news transcripts.
On the other hand, one thing has worked well: two texts that we had intuitively identified as similar – UXM and N2 – indeed showed similar Flesch scores. That means FRES is not only a measurement of readability but also of stylometric similarity – albeit a rather crude one which is, as always, best used in combination with other metrics.
Authors: Max Bemis (writer), various
Publication Dates: June – December 2018
Pages per issue: 20
Price per issue: $3.99
Another year has passed in which Moon Knight was largely ignored by critics. Rightfully so? The last story arc by Max Bemis and Jacen Burrows, collected in a trade paperback titled “Crazy Runs in the Family”, showed great potential. What came afterwards, though, was quite a mixed bag:
#194, drawn by Ty Templeton, is seemingly a one-shot which introduces Uncle Ernst, a supervillain from Marc Spector’s childhood.
#195-196, with brilliant artwork by Paul Davidson, is a weird and charming little story about The Collective, a new supervillain (or group of villains?).
#197-198, drawn by Jacen Burrows again, seem to tell a very similar tale about another group of adversaries, the Société des Sadiques. Their leader turns out to be none other than Uncle Ernst, which in hindsight makes #194 the first part of this story arc.
Although the story appears to be finished with #198 (which is also the last issue to be collected in the TPB, “Phases”), #199 (art by Davidson again) continues it with another face-off between Moon Knight and Ernst.
#200 (still drawn by Davidson), finally, brings back the supervillains from the previous arc, Sun King and The Truth, the former allying with Moon Knight while the latter has been corrupted by Ernst.
Thus, with the interruption of #195-196, we basically have a five-part finale, the cohesion of which is further damaged by the change of artists. Bemis has injected a lot of clever and darkly humorous ideas into these issues, though their connections to the Nazi Holocaust are sometimes bordering on tastelessness. Still, the cancellation of this series after this anniversary issue is a remarkable marketing failure, even for Marvel. Usually, such an anniversary would be used to invigorate and generate new interest in a series at least for the next couple of issues (which has recently worked well for e.g. Action Comics at DC), but Marvel didn’t even seem to have had that much faith in Moon Knight. The 200th issue itself is not that flashy either: a slightly increased size (30 pages) for an increased price ($5), some guest artist pages (one each by Jeff Lemire and Bill Sienkiewicz), and an action sequence of two double-page spreads by Davidson – that’s it.
What remains in memory of this Bemis/Burrows/Davidson run is a number of wacky characters, stunningly drawn panels, witty lines of dialogue, and ways of storytelling that at least feel fresh. And three comic creators to watch (although Bemis seems to identify more as a rock musician). However, the lack of success of a rock-solid series such as Moon Knight also says a lot about the current state of American superhero comics in which such a vast amount of material is published each week that the comic books are cannibalising each other in their competition for reader attention.
Rating: ● ● ● ○ ○
Happy May Day everyone, or ‘Warren Ellis Day’ as for some reason it has come to be known in this little corner of the Web. This time we’re going to look at politics in Warren Ellis’s classic, Planetary (art by John Cassaday). Planetary was published in 27 issues by Wildstorm/DC from 1998-2009. As far as the main story is concerned, the political setup of Planetary is a standard Warren Ellis one: it’s a conspiracy of supervillains who pull all the strings in this world, and the democratically elected governments of the world are powerless against them. It takes superheroes – vigilantes, rogues, operating outside of the law – to protect the world from these supervillains.
There is more going on here, though. Among the earlier issues (collected in Planetary Book One, not to be confused with Planetary Volume 1 which only contains #1-6), some stand out in particular from a political perspective because they comment on real-world political events and figures. Of these, we’ll discuss issue #2 (“Island”) here (but #7 and #8 are also noteworthy in this regard).
“Island” is mostly set on “Island Zero”, a fictional island that “forms the far north-western tip of the Japanese archipelago. Also the closest island in the group to the Eurasian landmass – specifically, Russia”, says Shinya Fukuda, a Tokyo-based employee of the Planetary organisation. He continues, “It’s off-limits, due to an issue of war legality still under arbitration. Basically, we think it’s ours, and the Russians think it’s theirs. One of our prime ministers visited Yeltsin to try and iron it out last year, but, you know…”
Another Japanese character, the terrorist Ryu who plans to overthrow the Japanese government, describes Island Zero like this: “The last island between Japan and Siberian Russia. Unpopulated because of its nature as a political football. The Russians claim it as spoils of World War Two. We, naturally, claim it as part of Japan. Legally, this island is a nowhere thing.”
Ellis probably alludes to the Kuril Islands dispute here, even though they are located north-east of Japan, not north-west. The status of the Kuril Islands has been settled in several treaties which say they belong to Russia (as the successor of the Soviet Union). The Japanese government accepts these treaties, but claims that the four islands closest to Hokkaidō do not belong to the Kurils and are therefore not part of the treaties. Another difference between the disputed Kuril Islands and Island Zero is that the former are not entirely uninhabited: almost 20,000 people live on three of them, while on the fourth there’s a Russian border guard outpost.
The interesting thing in Planetary, however, is how the two aforementioned Japanese characters – only one of whom is a fanatic nationalist – talk about Island Zero: “we think it’s ours”, “we claim it as part of Japan”. Why do they include themselves in the pronoun? It’s the government that does the claiming, so why do Shinya and Ryu adopt this claim as their own? What would Shinya and Ryu specifically gain if Russia ceded Island Zero to Japan? Sure, if Island Zero were part of Japan, Ryu could go on his hiking trip there without the risk of getting caught by the military, but the reason he goes there in the first place is precisely its remoteness due to its disputed status.
For Shinya and Ryu there’s nothing at stake in the dispute over Island Zero, so they probably don’t really “think” and “claim” much about it. More likely, there are some common but oversimplifying conflations at work here: of state and nation, of individual citizen and nation, and of state and individual politician. As abstract entities, states can’t think or claim anything – politicians such as the Japanese prime minister mentioned by Shinya can. And while it can be said that some views are more prevalent in a given nation than others, the assuredness with which both Shinya and Ryu include all Japanese people in their “we” creates the illusion of a completely homogeneous society in which everyone agrees with their government.
It’s particularly problematic that it’s the Japanese society, because this basically repeats the old prejudice of a purported Japanese conformity that borders on blind obedience. It seems like in the world of Planetary, governmental authority is only questioned by superhumans (who are powerful enough to stand above it anyway). Ryu says he wants to topple the government and become “paramount leader of Japan”, but he never says what his problem with the current government is. He is dismissed by Shinya as having “that Yukio Mishima, Aum Shin Ryko [sic, i.e. Rikyō] smell about” him. However, Aum Shinrikyō had their religious doomsday beliefs and Mishima wanted to restore the divinity of the Emperor. What does Ryu believe in? One of his followers says to him, “I believe in your theories. I believe in armed resurrection and revolution and nerve gas and acceptable casualties and all the rest of it.” But what are Ryu’s theories? Ellis doesn’t say. Ideological debates don’t seem to interest him. Apparently ideology is something for fanatics and terrorists, who make for good plot devices – but these characters must be wrong, because they’re the villains, so their ideology must be wrong too and doesn’t need to be discussed. Neither do we learn much about the political beliefs of the protagonists, the three superhero members of Planetary – they’re the good guys, so if they believe in anything, surely it must be right after all…
Regular readers of this weblog might have gathered from earlier posts that the two previous Moon Knight incarnations, the Ellis/Shalvey run and particularly the Lemire/Smallwood run, ought to be regarded as highlights of the superhero genre of this decade. Now that the first storyarc in the first six issues of the latest Moon Knight run (#188-193 in the annoying new “Legacy” numbering) has been completed, it’s time to ask: how does it hold up?
Authors: Max Bemis (writer), Jacen Burrows (artist), Mat Lopes (colourist)
Publication Dates: November 2017 – March 2018
Pages per issue: 20-25
Price per issue: $3.99
In the afterword to the first issue, artist Jacen Burrows says, “Moon Knight has been in a sort of creative renaissance since Warren Ellis and Declan Shalvey relaunched the character in 2014, all the way through the amazing arc recently completed by Jeff Lemire, Greg Smallwood and company, and we hope to continue this by making the next important chapter in Marc Spector’s life thought-provoking, intense, a little scary, and a little funny.”
It’s reassuring to read that Bemis and Burrows decided to honour the – ahem – legacy of Moon Knight instead of wiping the slate once again, as some previous Moon Knight authors have done. The first issue (#188) is even entirely told from the perspective of Dr. Emmet, Marc Spector’s psychiatrist, a character created only recently by Lemire and Smallwood. Telling a story about a character from the perspective of his or her psychiatrist isn’t a new device. Neither is the introduction of an ‘evil twin’ sort of villain, a character similar to Moon Knight who is set up as his rival. However, combining these two devices to the effect that Moon Knight himself doesn’t directly appear in the whole first issue is quite a daring move.
The second issue (#189), however, introduces another villain, “The Truth”, who is chased and confronted by Moon Knight. The concept of Moon Knight’s split personality disorder (Marc Spector / Steven Grant / Jake Lockley) is expanded to the effect that he now, more deliberately than before, switches between his personalities so that he has e.g. Jake Lockley do all the dirty work. Jake is the personality that contains Moon Knight’s darkest, most violent and ruthless aspects, from which the other personalities are kept clean.
In #190, Jake and Marc have a conversation about this in his (their?) mind. Jake says, “Kid, you sliced me off your personality and sent me to live among freaks, addicts, and criminals. There are things you don’t want to know. […] Look. Steven is the wealthy benefactor. Khonshu is our connection to the bigger picture. You’re the voice of reason. And I deal with the grimy leftovers. You built us this way.” Just how great the divide between these personalities is becomes clear later in this third issue, when Marc visits his ex-girlfriend Marlene and finds out that, unbeknownst to him, as it were, she had been dating Jake instead after having split up with Marc.
Khonshu does a lot of talking too, as he is the narrator for most of this story. In #191, he dispenses a peculiar theological lecture to Moon Knight in which he suggests that the Lovecraftian Old Ones, the Judeo-Christian God, and Ancient Egyptian Ra (father of Khonshu) are one and the same. However, as always, we can’t be sure whether Khonshu is really a supernatural individual or just another aspect of Moon Knight’s twisted mind.
Meanwhile, the other supervillain, who calls himself Ra because he believes he’s the avatar of this Egyptian god, has teamed up with The Truth and lured Moon Knight to a remote island. In the final issue of this storyarc (#193), Moon Knight and Ra fight. It’s not a very fair fight because Ra is a pyrokinetic, whereas Moon Knight doesn’t have any superpowers. Or so one might have thought, but then Steven Grant figures it all out: “Khonshu. Are you saying […] if Sun King’s [i.e. Ra’s] belief is a part of him, and in some weird metatextual way relates to his abilities, that, in a way, Marc has powers of his own?”
Some weird metatextual way indeed. The power which Moon Knight’s delusion grants him is only his near-superhuman tenacity (“the power of crazy”), but doesn’t that also mean Ra got his pyrokinetic ability because he became mentally ill? More precisely, ironically it was Dr. Emmet who gave him ideas about Egyptian mythology and thus unintentionally awakened his superpower. Quite a problematic plot point, but then again, this is the Marvel Universe, where people acquire supernatural abilities through gamma rays and the like, so why not through the sheer power of imagination…
So the writing is a mixed bag of good and not so good ideas. As for the art, it’s more than solid, even beautiful. Jacen Burrows’s style is perhaps best compared to Frank Quitely’s, with its thin clear outlines and little shading. However, while there are many clever compositions and layouts to be found here, Burrows’s art lacks the groundbreaking creative force and the eagerness to experiment for which his predecessors on the title, Smallwood and Shalvey, will be remembered. An unfair comparison, perhaps, but unavoidable. Nevertheless, I’m looking forward to finding out where Bemis and Burrows are going to take Moon Knight – this still has the potential to turn into another historic run.
Rating: ● ● ● ○ ○
Thanks to Marvel’s ‘Legacy’ reboot, a new Moon Knight series with a new creative team has started recently (more on that in a later blogpost). The last 5 issues of the Lemire/Smallwood run have been collected as trade paperback vol. 3: “Birth and Death” (even though the story arc is titled “Death and Birth” in the individual comic books), and if there was any justice in the world, this comic would now show up on all of those year-end best-of lists for 2017 (it doesn’t – more on that in a later post). For what it’s worth, here’s why you should read it anyway.
Authors: Jeff Lemire (writer), Greg Smallwood (artist), Jordie Bellaire (colourist)
Pages per issue: 20
Price per issue: $3.99
Previously in Moon Knight: Marc Spector has escaped the mental asylum, but his friend Crawley is being held captive by the god Anubis. And Moon Knight has yet to confront Khonshu, the god who created him.
In the beginning of this new story arc, Moon Knight seeks out Anubis. They strike a deal: if Moon Knight succeeds in rescuing Anubis’s wife Anput from the Overvoid (a parallel dimension reminiscent of ancient Egypt, except that people ride on giant dragonflies through the air and pyramids float above the ground), Crawley will be released. This story is intertwined with another, Moon Knight’s origin, the two strands alternating in segments of 3-6 pages each.
The flashback to Moon Knight’s past starts early, in Marc Spector’s childhood. We learn that already back then he created an imaginary friend (or so his psychiatrist says), Steven Grant, who later becomes an aspect of his own personality. And Marc is already visited by Khonshu who introduces himself as Marc’s real father.
Later, we see Marc as a U.S. Marine in Iraq when he gets dishonorably discharged because of his mental illness. He stays in the region and becomes first an illegal prizefighter, then a mercenary. On a mission to plunder an archaeological excavation site “near the Sudanese-Egyptian border”, he turns against his employer, Bushman, when the latter ruthlessly kills the archaeologists. Spector is defeated by Bushman and left to die alone in the desert, but Khonshu resurrects him.
Then we’re back in the present again and Marc faces Khonshu. I won’t spoil the outcome of this confrontation, but let’s look instead at that last transition from past to present in detail: in issue #14, p. 4 we’re in the desert in Marc’s past. Then on p. 5, Moon Knight in his ‘Mr Knight’ persona in the white suit is in the mental asylum again. He enters a room where he is greeted by his “good friends Bobby and Billy and Doc Ammut” – hybrid creatures of asylum staff and mythological figures. They subdue Mr Knight and give him an injection which knocks him out.
On the first panel of p. 6, we’re in the Egyptian temple in the desert again, where Khonshu carries the dying Marc Spector onto an altar before the statue of Khonshu. Marc asks, “Wh-what is this? What’s happening to me?”, and Khonshu replies: “This is a flashback, Marc. It is being intercut with the present.” On the next panel, the unconscious Marc is put on a table too, but this time by Bobby and Billy in the mental hospital. Khonshu’s voice continues though: “Time means little here.” This back-and-forth goes on for the next 4 panels of the page and so does Khonshu: “So past and present intermingle. They blend together and become one. Just like different aspects of your broken mind. The moment of your birth is here and there. It is then and now. All times lead to this instant.”
This is the most (delightfully) confusing and metafictional transition sequence, but there are many more of these mind-bending moments in this comic, and they are the main reason why it’s so brilliant. Add to this all the clever design, layout, composition and colouring decisions that Jeff Lemire, Greg Smallwood and Jordie Bellaire have made and you get one of the most remarkable superhero comics in recent history.
Rating: ● ● ● ● ○
Back in July, Aaron Kashtan concluded his short review of Champions #10 which had come out that same month with the following words:
I’ll have to think twice before buying any more Mark Waid comics, and I say that as someone who’s been a fan of his for almost 25 years.
As a regular reader of both Aaron Kashtan’s weblog and Mark Waid’s comics, I had to check out this comic book for myself. Aaron’s problem with Champions #10 is that writer Mark Waid “defends” the fictitious internment camp in which most of the story is set (or maybe even internment camps in general?) and portrays it in an insensitive manner. Several other people have shared this sentiment on the Internet, e.g. Joe Glass at Bleeding Cool, but not so many that it would qualify as a full-blown outrage. Anyway, here’s how I see Champions #10, and please note that this is only about the comic and not about the opinions of Aaron Kashtan or Joe Glass or Mark Waid (who identifies himself as a “liberal” and “progressive” writer for what it’s worth).
In the current status quo of the Marvel universe at the time of Champions #10, the villainous organisation Hydra has taken over the United States, and Inhumans (basically a superpowered alien race living among humans) “are being imprisoned in camps across the country”, as the introductory text puts it. The first three comic pages show life in one of these camps in a nutshell: behind the idyllic appearance, a surveillance regime is in operation in which merely talking about escape can get inmates killed immediately.
The action then switches over to the Champions, a superhero team consisting of (Miles Morales) Spider-Man, (Amadeus Cho) Hulk and Viv (daughter of Vision). They locate their missing fourth member, Ms. Marvel, in one of those camps, and set out to free her. After managing to break into the camp and incapacitating the guards, they face the unexpected problem that “some want to go, but some want to stay”, as Hulk says on p. 14 (or 15 – not sure whether the first page after the cover already counts as part of the story). Ms. Dawood, one of the detainees, expands: “What’s happening here is brutally unjust, but we and our children are well cared for here. Out there, we would be hunted relentlessly. It would be a life of fear and desperation. Some of us are willing to make that trade and fight, even though we may not win. But those who stay may be made to pay for their escape, and that terrifies them.”
So far, so good, but then Hulk comments (still on p. 14): “Trust me, as an Asian American, I have a deep historical hatred for internment, but we might have to retreat and try some other–”. This is the crucial point (the rest of the story is of no importance here): Hulk’s comment links the fictitious camp to real-world history. Even though (or rather because) he doesn’t really say much, it triggers questions in the reader’s mind such as whether Hulk thinks that the US government that imprisoned Japanese Americans (and also Korean Americans – Amadeus Cho is of Korean descent) in WWII is morally as bad as Hydra, or whether he feels that the conditions of living are as bad in the camp he is standing in as they were in WWII internment camps. Such ideas might be offensive to Asian Americans – but they are not explicitly expressed in the comic. (Who knows, maybe Hulk is merely thinking, it’s wrong to imprison someone because of his or her race, then and now.) Even if they were, it would be Hulk who has these controversial opinions, not Mark Waid. In the end, Amadeus Cho is only a teenager who hasn’t experienced WWII internment camps, so why should his opinion have such a weight that it could be mistaken for the ‘message’ of the whole comic or its writer? Waid could have devised a better stand-in for himself to broadcast his opinion, if that had been his aim.
Besides Hulk’s comment, is the plot point itself offensive that some of the inmates choose to stay imprisoned in this particular camp rather than break out? How can Ms. Dawood say she is safer inside than outside the camp when she all but witnesses the execution of two other prisoners? One could argue that, once outside the camp, Inhumans would have a good chance of escaping and hiding from Hydra by using their superpowers. However, the inmates are probably safer inside the camp, for as long as everyone plays by the rules and doesn’t try to escape, no one is executed. This is an important difference from real-world Nazi concentration camps, many of which were death camps with the purpose of ultimately killing all inmates. Ms. Dawood is also right about “being well cared for”: from what we see of the Inhuman camp, it looks like they live in spacious, well-kept houses with their own lawns. This is an important difference from real-world Asian American internment camps in WWII, in which conditions were miserable.
However, the problem of Champions #10 lies not in the story but in how it is told. The comic has a serious problem with its pacing and crams too much action into too few pages. The situation of the Inhuman inmates and the opinions of their two conflicted groups are relayed mainly through the Champions instead of the Inhumans themselves, because they have already turned into a raging mob and are busy fighting each other. It’s also telling that – after the camp wall has been breached and the guards have been taken out – it’s up to the Champions to come up with a solution to the problem of approaching Hydra reinforcements. The Inhumans, even though they have superpowers too, are relegated to passive victims in need of rescue. And even though there are “hundreds” of inmates in the camp, the Champions only ever talk to two of them (not counting the terrified Inhumans they first meet on p. 10), so the majority of the Inhumans – despite their portrayal as heterogeneous – lack not only agency but also their own voices.
To sum up: is it allowed to allude to real-world internment camps in a superhero comic book? Of course it is. But if the comic is poorly written and the subject matter is not treated with the necessary sensitivity, don’t be surprised if people are offended. That being said, this whole ‘controversy’ seems to be a non-issue along the lines of Action Comics (2011) #1 / “GD” and Batgirl (2011) #37 / “But you’re a–” (neither of which I have read, though).
Latent Dirichlet Allocation (LDA) is one of the most popular algorithms for Topic Modeling, i.e. having a computer find out what a text is about. LDA is also perhaps easier to understand than the other popular Topic Modeling approach, (P)LSA, but even though there are two well-written blog posts that explain LDA (Edwin Chen’s and Ted Underwood’s) to non-mathematicians, it still took me quite some time to grasp LDA well enough to be able to code it in a Perl script (which I have made available on GitHub, in case anyone is interested). Of course, you can always simply use software like Mallet that runs LDA over your documents and outputs the results, but if you want to know what LDA actually does, I suggest you read Edwin Chen’s and Ted Underwood’s blog posts first, and then, if you still feel you don’t really get LDA, come back here. OK?
Welcome back. Disclaimer: I’m not a mathematician and there’s still the possibility that I got it all wrong. That being said, let’s take a look at Edwin Chen’s first example again, and this time we’re going to calculate it through step by step:
- I like to eat broccoli and bananas.
- I ate a banana and spinach smoothie for breakfast.
- Chinchillas and kittens are cute.
- My sister adopted a kitten yesterday.
- Look at this cute hamster munching on a piece of broccoli.
We immediately see that these sentences are about either eating or pets or both, but even if we didn’t know about these two topics, we would still have to make an assumption about the number of topics within our corpus of documents. Furthermore, we have to make an assumption about how these topics are distributed over the corpus. (In real-life LDA analyses, you’d run the algorithm multiple times with different parameters and then see which fit best.) For simplicity’s sake, let’s assume there are 2 topics, which we’ll call A and B, and they’re distributed evenly: half of the words in the corpus belong to topic A and the other half to topic B.
What exactly is a word, though? I found the use of this term confusing in both Chen’s and Underwood’s text, so instead I’ll speak of tokens and lemmata: the lemma ‘cute’ appears as 2 tokens in the corpus above. Before we apply the actual LDA algorithm, it makes sense to not only tokenise but also lemmatise our 5 example documents (i.e. sentences), and also to remove stop words such as pronouns and prepositions, which may result in something like this:
- like eat broccoli banana
- eat banana spinach smoothie breakfast
- chinchilla kitten cute
- sister adopt kitten yesterday
- look cute hamster munch piece broccoli
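The preprocessing step can be sketched in a few lines of Python. Note that this is a toy version: the stop word list `STOP_WORDS` and the lemma table `LEMMAS` are hand-made for these five sentences only (both names are my own), whereas a real project would use a proper lemmatiser and a full stop word list.

```python
import re

# Hand-made for this example only, not a general stop word list
STOP_WORDS = {"i", "to", "and", "a", "for", "are", "my", "at", "this", "on", "of"}

# Hand-made lemma table covering just the inflected forms in the example
LEMMAS = {
    "bananas": "banana",
    "ate": "eat",
    "chinchillas": "chinchilla",
    "kittens": "kitten",
    "adopted": "adopt",
    "munching": "munch",
}


def preprocess(sentence):
    """Tokenise, drop stop words, and map each token to its lemma."""
    tokens = re.findall(r"[a-z]+", sentence.lower())
    return [LEMMAS.get(token, token) for token in tokens if token not in STOP_WORDS]


sentences = [
    "I like to eat broccoli and bananas.",
    "I ate a banana and spinach smoothie for breakfast.",
    "Chinchillas and kittens are cute.",
    "My sister adopted a kitten yesterday.",
    "Look at this cute hamster munching on a piece of broccoli.",
]

docs = [preprocess(s) for s in sentences]

for doc in docs:
    print(" ".join(doc))
```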
Now we randomly assign topics to tokens according to our assumptions (2 topics, 50:50 distribution). This may result in e.g. ‘cute’ getting assigned once to topic A and once to topic B. An initial random topic assignment may look like this:
- like -> A, eat -> B, broccoli -> A, banana -> B
- eat -> A, banana -> B, spinach -> A, smoothie -> B, breakfast -> A
- chinchilla -> B, kitten -> A, cute -> B
- sister -> A, adopt -> B, kitten -> A, yesterday -> B
- look -> A, cute -> B, hamster -> A, munch -> B, piece -> A, broccoli -> B
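A random starting point of this kind could be generated as follows. This is a sketch under one assumption of mine: to get an exact 11:11 split as in the example, the labels are drawn from a balanced, shuffled pool. Independent coin flips per token would only give 50:50 on average. The function name `random_assignment` and the `seed` parameter are made up for illustration.

```python
import random

docs = [
    ["like", "eat", "broccoli", "banana"],
    ["eat", "banana", "spinach", "smoothie", "breakfast"],
    ["chinchilla", "kitten", "cute"],
    ["sister", "adopt", "kitten", "yesterday"],
    ["look", "cute", "hamster", "munch", "piece", "broccoli"],
]

def random_assignment(docs, topics=("A", "B"), seed=0):
    """Give every token a random topic while keeping the topics evenly sized."""
    rng = random.Random(seed)
    n_tokens = sum(len(doc) for doc in docs)
    # Balanced pool of labels (11 x A and 11 x B for 22 tokens), then shuffled.
    pool = [topics[i % len(topics)] for i in range(n_tokens)]
    rng.shuffle(pool)
    labels = iter(pool)
    return [[next(labels) for _ in doc] for doc in docs]

assignment = random_assignment(docs)
for doc, row in zip(docs, assignment):
    print(", ".join(f"{lemma} -> {topic}" for lemma, topic in zip(doc, row)))
```

A different seed gives a different starting assignment, which is why two runs of the same procedure need not end up with the same topics.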
Clearly, this isn’t a satisfying result yet; words like ‘eat’ and ‘broccoli’ are assigned to multiple topics when they should belong to only one, etc. Ideally, all words connected to the topic of eating should be assigned to one topic and all words related to pets should belong to the other. Now the LDA algorithm goes through the documents to improve this initial topic assignment: it computes probabilities which topic each token should belong to, based on three criteria:
- Which topics are the other tokens in this document assigned to? Probably the document is about one single topic, so if all or most other tokens belong to topic A, then the token in question should most likely also get assigned to topic A.
- Which topics are the other tokens in *all* documents assigned to? Remember that we assume a 50:50 distribution of topics, so if the majority of tokens is assigned to topic A, the token in question should get assigned to topic B to establish an equilibrium.
- If there are multiple tokens of the same lemma: which topic is the majority of tokens of that lemma assigned to? If most instances of ‘eat’ belong to topic A, then the token in question probably also belongs to topic A.
The actual formulas to calculate the probabilities given by Chen and Underwood seem to differ a bit from each other, but instead of bothering you with a formula, I’ll simply describe how it works in the example (my understanding being closer to Chen’s formula, I think). Let’s start with the first token of the first document (although the order doesn’t matter), ‘like’, currently assigned to topic A.
Should ‘like’ belong to topic B instead? If ‘like’ belonged to topic B, 3 out of 4 tokens in this document would belong to the same topic, as opposed to 2:2 if we stay with topic A. On the other hand, changing ‘like’ to topic B would threaten the equilibrium of topics over all documents: topic B would consist of 12 tokens and topic A of only 10, as opposed to the perfect 11:11 equilibrium if ‘like’ remains in topic A. In this case, the former consideration outweighs the latter, as the two factors get multiplied: the probability for ‘change this token to topic B’ is 3/4 * 1/12 ≈ 6%, whereas the probability for ‘stay with topic A’ is 2/4 * 1/11 ≈ 4.5%. We can also normalise these rounded numbers so that they add up to 100% and say: ‘like’ is roughly 57% topic B and 43% topic A.
What are you supposed to do with these percentages? We’ll get there in a minute. Let’s first calculate them for the next token, ‘eat’, because it’s one of those interesting lemmata with multiple tokens in our corpus. Currently, ‘eat’ in the first document is assigned to topic B, but in the second document it’s assigned to topic A. The probability for ‘eat stays in topic B’ is the same as for ‘like stays in topic A’ above: within this document, the ratio of ‘B’ tokens to ‘A’ tokens is 2:2, which gives us 2/4 or 0.5 for the first factor; ‘eat’ would be 1 out of 11 tokens that make up topic B across all documents, giving us 1/11 for the second factor. The probability for ‘change eat to topic A’ is much higher, though, because there is already another ‘eat’ token assigned to this topic in another document. The first factor is 3/4 again, but the second is 2/12, because out of the 12 tokens that would make up topic A if we changed this token to topic A, 2 tokens would be of the same lemma, ‘eat’. In percentages, this means: this first ‘eat’ token is roughly 74% topic A and only 26% topic B.
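The two worked examples can be checked with a small Python sketch. It implements only the simplified two-factor score described here (share of the topic within the document, times share of the lemma within the topic), not necessarily Chen’s or Underwood’s exact formula. The function name `topic_shares` is made up for illustration. Using exact fractions, it gives about 57.9% / 42.1% for ‘like’ and 73.3% / 26.7% for ‘eat’, within a point of the figures in the text, which are derived from rounded intermediate percentages.

```python
from fractions import Fraction

docs = [
    ["like", "eat", "broccoli", "banana"],
    ["eat", "banana", "spinach", "smoothie", "breakfast"],
    ["chinchilla", "kitten", "cute"],
    ["sister", "adopt", "kitten", "yesterday"],
    ["look", "cute", "hamster", "munch", "piece", "broccoli"],
]
# The initial random assignment from the example above.
assignment = [
    ["A", "B", "A", "B"],
    ["A", "B", "A", "B", "A"],
    ["B", "A", "B"],
    ["A", "B", "A", "B"],
    ["A", "B", "A", "B", "A", "B"],
]

def topic_shares(docs, assignment, d, i, topics=("A", "B")):
    """Normalised share of each topic for token i of document d."""
    lemma = docs[d][i]
    scores = {}
    for topic in topics:
        # Tentatively give the token this topic; everything else stays as it is.
        trial = [list(row) for row in assignment]
        trial[d][i] = topic
        in_doc = trial[d].count(topic)
        in_topic = sum(row.count(topic) for row in trial)
        same_lemma = sum(
            1
            for doc, row in zip(docs, trial)
            for other, assigned in zip(doc, row)
            if other == lemma and assigned == topic
        )
        # First factor: document share; second factor: lemma share in the topic.
        scores[topic] = Fraction(in_doc, len(docs[d])) * Fraction(same_lemma, in_topic)
    total = sum(scores.values())
    return {topic: score / total for topic, score in scores.items()}

like = topic_shares(docs, assignment, 0, 0)
eat = topic_shares(docs, assignment, 0, 1)
print({topic: f"{float(share):.1%}" for topic, share in like.items()})
print({topic: f"{float(share):.1%}" for topic, share in eat.items()})
```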
In this way we can calculate probabilities for each token in the corpus. Then we randomly assign new topics to each token, only this time not on a 50:50 basis, but according to the percentages we’ve figured out before. So this time, it’s more likely that ‘like’ will end up in topic B, but there’s still a 43% chance it will get assigned to topic A again. The new distribution of topics might be slightly better than the first one, but depending on how lucky you were with the random assignment in the beginning, it’s still unlikely that all tokens pertaining to food are neatly put in one topic and the animal tokens in the other.
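Drawing a new topic ‘according to the percentages’ is a weighted random choice. A minimal sketch, using the rounded 57% / 43% figures for ‘like’ from above (the helper name `resample` and the seed are arbitrary choices of mine):

```python
import random

def resample(shares, rng):
    """Pick one topic at random, weighted by the given shares."""
    topics = list(shares)
    weights = [shares[topic] for topic in topics]
    return rng.choices(topics, weights=weights, k=1)[0]

rng = random.Random(42)
like_shares = {"A": 0.43, "B": 0.57}

# Repeat the draw many times to see that both outcomes really occur.
draws = [resample(like_shares, rng) for _ in range(10_000)]
share_b = draws.count("B") / len(draws)
print(f"'like' ended up in topic B in {share_b:.1%} of the draws")
```

In the algorithm itself each token gets only one draw per round, so a token can land in its less likely topic in any given round.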
The solution is to iterate: repeat the process of probability calculations with the new topic assignments, then randomly assign new topics based on the latest probabilities, and so on. After a couple of thousand iterations, the probabilities should make more sense. Ideally, there should now be some tokens with high percentages for each topic, so that both topics are clearly defined.
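Put together, the whole loop might look like the sketch below. It follows the description in this post literally (compute the shares for all tokens, then redraw all topics at once) and uses the simplified two-factor score from the worked example. It is neither my Perl script nor what Mallet does internally. As far as I understand, standard Gibbs samplers for LDA redraw one token at a time and add smoothing parameters, which this sketch leaves out. The names `shares` and `run` and the defaults for `iterations` and `seed` are arbitrary.

```python
import random

docs = [
    ["like", "eat", "broccoli", "banana"],
    ["eat", "banana", "spinach", "smoothie", "breakfast"],
    ["chinchilla", "kitten", "cute"],
    ["sister", "adopt", "kitten", "yesterday"],
    ["look", "cute", "hamster", "munch", "piece", "broccoli"],
]

def shares(docs, assignment, d, i, topics):
    """Normalised two-factor score per topic for token i of document d."""
    lemma = docs[d][i]
    scores = []
    for topic in topics:
        trial = [list(row) for row in assignment]
        trial[d][i] = topic  # tentatively move the token to this topic
        in_doc = trial[d].count(topic)
        in_topic = sum(row.count(topic) for row in trial)
        same_lemma = sum(
            1
            for doc, row in zip(docs, trial)
            for other, assigned in zip(doc, row)
            if other == lemma and assigned == topic
        )
        scores.append((in_doc / len(docs[d])) * (same_lemma / in_topic))
    total = sum(scores)
    return [score / total for score in scores]

def run(docs, topics=("A", "B"), iterations=200, seed=0):
    """Random start, then repeatedly rescore all tokens and redraw their topics."""
    rng = random.Random(seed)
    assignment = [[rng.choice(topics) for _ in doc] for doc in docs]
    for _ in range(iterations):
        # Compute every token's shares from the current assignment first ...
        table = [
            [shares(docs, assignment, d, i, topics) for i in range(len(doc))]
            for d, doc in enumerate(docs)
        ]
        # ... then redraw every token's topic from its own shares.
        assignment = [
            [rng.choices(topics, weights=weights, k=1)[0] for weights in row]
            for row in table
        ]
    return assignment

final = run(docs)
for doc, row in zip(docs, final):
    print(", ".join(f"{lemma} -> {topic}" for lemma, topic in zip(doc, row)))
```

Because every step involves random draws, different seeds can lead to different final assignments.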
Only with this example, it doesn’t work out. After 10,000 iterations, the LDA script I’ve written produces results like this:
- topic A: cute (88%), like (79%), chinchilla (77%), hamster (76%), …
- topic B: kitten (89%), sister (79%), adopt (79%), yesterday (79%), …
As you can see, words from the ‘animals’ category ended up in both topics, so this result is worthless. The result given by Mallet after 10,000 iterations is slightly better:
- topic 0: cute kitten broccoli munch hamster look yesterday sister chinchilla spinach
- topic 1: banana eat piece adopt breakfast smoothie like
Topic 0 is clearly the ‘animal’ topic here. Words like ‘broccoli’ and ‘munch’ slipped in because they occur in the mixed-topic sentence, “Look at this cute hamster munching on a piece of broccoli”. No idea why ‘spinach’ is in there too though. It’s equally puzzling that ‘adopt’ somehow crept into topic 1, which otherwise can be identified as the ‘food’ topic.
The reason for this ostensible failure of the LDA algorithm is probably the small size of the test data set. The results become more convincing the greater the number of tokens per document.
For a real-world example with more tokens, I have selected some X-Men comics. The idea is that because they are about similar subject matters, we can expect some words to be used in multiple texts from which topics can be inferred. This new test corpus consists of the first 100 tokens (after stop word removal) from each of the following comic books that I more or less randomly pulled from my longbox/shelf: Astonishing X-Men #1 (1995) by Scott Lobdell, Ultimate X-Men #1 (2001) by Mark Millar, and Civil War: X-Men #1 (2006) by David Hine. All three comics open with captions or dialogue with relatively general remarks about the ‘mutant question’ (i.e. government action / legislation against mutants, human rights of mutants) and human-mutant relations, so that otherwise uncommon lemmata such as ‘mutant’, ‘human’ or ‘sentinel’ occur in all three of them. To increase the number of documents, I have split each 100-token batch into two parts at semantically meaningful points, e.g. when the text changes from captions to dialogue in AXM, or after the voice from the television is finished in CW:XM.
I then ran my LDA script (as described above) over these 6 documents with ~300 tokens, again with the assumption that there are 2 equally distributed topics (because I had carelessly hard-coded this number of topics in the script and now I’m too lazy to re-write it). This is the result after 1,000 iterations:
- topic A: x-men (95%), sentinel (93%), sentinel (91%), story (91%), different (90%), …
- topic B: day (89%), kitty (86%), die (86%), …
So topic A looks like the ‘mutant question’ issue with tokens like ‘x-men’ and two times ‘sentinel’, even though ‘mutant’ itself isn’t among the high-scoring tokens. Topic B, on the other hand, makes less sense (Kitty Pryde only appears in CW:XM, so that ‘kitty’ occurs in merely 2 of the 6 documents), and its highest percentages are also much lower than those in topic A. Maybe this means that there’s only one actual topic in this corpus.
Running Mallet over this corpus (2 topics, 10,000 iterations) yields an even less useful result. The first 5 words in each topic are:
- topic 0: mutant, know, x-men, ask, cooper
- topic 1: say, sentinel, morph, try, ready
(Valerie Cooper and Morph are characters that appear in only one comic, CW:XM and AXM, respectively.)
Topic 0 at least associates ‘x-men’ with ‘mutant’, but then again, ‘sentinel’ is assigned to the other topic. Thus neither topic can be related to an intuitively perceived theme in the comics. It’s clear how these topics were generated though: there’s only 1 document in which ‘sentinel’ doesn’t occur, the first half of the CW:XM excerpt, in which Valerie Cooper is interviewed on television. But ‘x-men’ and ‘mutant’ do occur in this document, the latter even twice, and also ‘know’ occurs more frequently (3 times) here than in other documents.
So the results from Mallet and maybe even my own Perl script seem to be correct, in the sense that the LDA algorithm has been properly performed and one can see from the results how the algorithm got there. But what’s the point of having ‘topics’ that can’t be matched to what we intuitively perceive as themes in a text?
The problem with our two example corpora was that they were still not large enough for LDA to yield meaningful results. As with all statistical methods, LDA works better the larger the corpus. In fact, the idea of such methods is that they are best applied to amounts of text that are too large for a human to read. Therefore, LDA might not be that useful for disciplines (such as comics studies) in which it’s difficult to gather large text corpora in digital form. But do feel free to e.g. randomly download texts from Wikisource, and you’ll find that within them, LDA is able to successfully detect clusters of words that occur in semantically similar documents.