Chi-squared: which characteristics really matter?

In our little stylometric experiments, we compared different manga in terms of their hiragana frequencies. While we were able to say how similar or different the comics are to each other, it’s hard to tell in what way precisely they are different, i.e. which hiragana differed vastly in quantity and which were more or less the same. Intuitively, we ought to be able to answer this by looking at how much the hiragana counts differ from the average, but it would be good to have a more exact measure of what it means to differ “vastly” or to be “more or less” the same. If we could identify those hiragana in which the manga are hardly any different, we could ignore them in future experiments, which would be a relief since we’re otherwise stuck with as many as ~50 hiragana to keep track of.

Enter chi-squared (also called chi-square, or χ²), which is perhaps the most widespread of several statistical tests for this purpose. I first learned about it during my Master’s, but either I forgot about it or I had never really understood it in the first place. But now that I’ve looked it up again, I found it’s actually quite simple: the idea is to not only calculate the difference between the actual (observed) and the “average” (expected) value, but to square the result and then divide it by the expected value. The squaring removes the sign of the difference and makes large differences stand out more, while the division scales the result relative to the expected count, so that the values for frequent and rare characters become comparable.

So, the formula would be:

(observed – expected)² / expected

[You might have seen this formula with a sum sign at the beginning: when you perform a “chi-squared test”, you take the sum of all calculated values and look it up in a table to determine whether the deviations from the expected values could plausibly be due to chance alone (see below). In our case, they definitely aren’t.]

Let’s take the hiragana で de as an example. In our first 100-character sample from Katsuhiro Ōtomo’s Akira (A1), で de occurred 8 times (see the chart here). In the second Akira sample (A2), it is found 3 times. In the two manga samples from Morning magazine, Miko Yasu’s Hakozume (M1) and Rito Asami’s Ichikei no karasu (M2), で de is found 6 and 7 times, respectively. Overall, there are 24 で de in those four manga samples. The sum of all hiragana in these manga samples is 435 (so it turns out I took slightly more than 100 hiragana for each sample; don’t ask me why), which means that on average, で de should occur with a frequency of 24/435 = 0.0552. In other words, roughly every 18th hiragana in any of the four manga should be a で de. For the first of the two Akira samples, A1, which consists of 112 hiragana in total, the expected value for で de is 112 * 0.0552 = 6.18, i.e. we expect to find 6 or 7 で de in A1.

There actually are 8 で de in A1. That’s a difference of 8 – 6.18 = 1.82. Squared and divided by the expected value of 6.18, this results in a chi-squared value of 0.536.

Compare this to the frequency of で de in the other Akira sample, A2, where it occurs only 3 times, i.e. much less than one would have thought. Given a hiragana total of 106 for A2, we get an expected value of 106 * 0.0552 = 5.85. Accordingly, chi-squared for で/A2 is (3 – 5.85)² / 5.85 = 1.39.
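For readers who would rather check the arithmetic in code, here is a minimal Python sketch of the two で de calculations. It uses only the counts quoted above; the function and variable names are made up for this illustration, and the script works with the unrounded frequency 24/435 rather than 0.0552, so later decimals can differ slightly from the hand calculation.

```python
def chi_squared_term(observed, expected):
    """One cell's contribution: (observed - expected)^2 / expected."""
    return (observed - expected) ** 2 / expected


# Counts quoted in the text (names are illustrative only).
DE_TOTAL = 24                            # で in all four samples
GRAND_TOTAL = 435                        # all hiragana in all four samples
SAMPLE_SIZES = {"A1": 112, "A2": 106}    # hiragana per Akira sample
DE_OBSERVED = {"A1": 8, "A2": 3}         # で per Akira sample

# Pooled relative frequency of で across the whole mini-corpus.
de_frequency = DE_TOTAL / GRAND_TOTAL

de_terms = {}
for sample, size in SAMPLE_SIZES.items():
    expected = de_frequency * size
    de_terms[sample] = chi_squared_term(DE_OBSERVED[sample], expected)
    print(f"{sample}: expected {expected:.2f}, chi-squared {de_terms[sample]:.3f}")
```

Rounded, the two results agree with the values 0.536 and 1.39 derived by hand.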

However, our aim was to compare different hiragana, so let’s also calculate the chi-squared values for し shi, which occurs 7 times in A1, 1 time in A2, and 6 times in the other two manga combined, so the total for し shi is 14. Chi-squared for し in A1 is (7 – (14/435)*112)² / ((14/435)*112) = 3.198 and chi-squared for し in A2 is (1 – (14/435)*106)² / ((14/435)*106) = 1.705.

As you can see, the chi-squared values for し shi are higher than for で de, which means that the former hiragana contributes more to the overall difference between A1 and A2 than the latter. In other words, the usage of で de throughout Akira is close to the average, thus comparatively unremarkable and perhaps not the most relevant stylometric property.
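The comparison between the two hiragana can be sketched in a few lines of Python. This is only an illustration built from the counts mentioned in this post; `akira_terms` and the dictionary names are hypothetical helpers, not part of any library. The expected count is written as row total × sample size / grand total, which is the same thing as the frequency-times-sample-size calculation used above.

```python
GRAND_TOTAL = 435                         # all hiragana in all four samples
SAMPLE_SIZES = {"A1": 112, "A2": 106}     # hiragana per Akira sample
KANA_TOTALS = {"で": 24, "し": 14}         # occurrences across all four samples
OBSERVED = {
    "で": {"A1": 8, "A2": 3},
    "し": {"A1": 7, "A2": 1},
}


def akira_terms(kana):
    """Chi-squared contributions of one hiragana in the two Akira samples."""
    terms = {}
    for sample, size in SAMPLE_SIZES.items():
        expected = KANA_TOTALS[kana] * size / GRAND_TOTAL
        terms[sample] = (OBSERVED[kana][sample] - expected) ** 2 / expected
    return terms


for kana in OBSERVED:
    terms = akira_terms(kana)
    cells = ", ".join(f"{s}: {v:.3f}" for s, v in terms.items())
    print(f"{kana}  {cells}  sum: {sum(terms.values()):.3f}")
```

Both the individual values and their sum come out higher for し than for で, which is the sense in which し contributes more to the difference between the two Akira samples.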

Here’s a chart of the chi-squared values for all 51 hiragana characters that occur in the four manga samples:

[Image: a dialogue excerpt from Hakozume by Miko Yasu which illustrates the above-average frequency of the hiragana character え e. With regard to our little example corpus, か ka and ん n are relatively frequent too.]

One can easily see several spikes at the hiragana え e, ん n, と to and ず zu, though more important than the individual values are the sums, which are also high for お o and こ ko. These 6 hiragana alone contribute roughly 70% towards the overall sum of chi-squares! If our corpus were of a sufficient size (which it definitely is not), we could focus on these 6 hiragana in further experiments, as differences in hiragana usage among manga would most likely be connected to them.

In contrast, hiragana like び bi, く ku and に ni, with chi-squared values close to zero, seem to have very little explanatory power over stylometric differences; their usage hardly differs among the four manga in question.
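Ranking hiragana by their share of the overall chi-squared sum could be done along the following lines. This is a hedged sketch, not the script behind the chart: `rank_by_contribution` is a hypothetical helper, only the two hiragana whose counts are given in this post are filled in, and M1 and M2 are merged into one column because their individual sample sizes aren’t stated here (only that all four samples add up to 435). The resulting shares therefore illustrate the method and say nothing about the 70% figure.

```python
def rank_by_contribution(counts, column_totals):
    """Return (kana, chi-squared sum, share of overall sum), largest first.

    counts maps each kana to its observed count per sample; column_totals
    holds the number of ALL hiragana per sample, not just the listed ones.
    """
    grand_total = sum(column_totals.values())
    sums = {}
    for kana, row in counts.items():
        row_total = sum(row.values())
        total = 0.0
        for sample, size in column_totals.items():
            expected = row_total * size / grand_total
            total += (row[sample] - expected) ** 2 / expected
        sums[kana] = total
    overall = sum(sums.values())
    ranked = sorted(sums.items(), key=lambda item: item[1], reverse=True)
    return [(kana, value, value / overall) for kana, value in ranked]


# Sample sizes: A1 and A2 as stated; the remainder of 435 is M1 and M2 together.
column_totals = {"A1": 112, "A2": 106, "M1+M2": 435 - 112 - 106}
counts = {
    "で": {"A1": 8, "A2": 3, "M1+M2": 6 + 7},
    "し": {"A1": 7, "A2": 1, "M1+M2": 6},
}

ranking = rank_by_contribution(counts, column_totals)
for kana, value, share in ranking:
    print(f"{kana}: sum {value:.3f}, share {share:.0%}")
```

With a full table of all 51 hiragana, one would read off the top of this ranking to find the characters worth keeping, and the bottom to find those that can safely be ignored.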

Of course, chi-squared can not only be applied to character counts in stylometry, but also to anything else that is countable. For instance, I recently mentioned the 1:1 gender ratio as a potential criterion for corpus building. One possible null hypothesis would be that good (or popular) comics are equally likely to be authored by men or by women. If we look at the 60 people who authored the top 10 comics from each of the last four years’ best-of lists (only counting the first-mentioned author when there are more than 3), we end up with 41 men and 19 women. This distribution isn’t quite the 30:30 we might have expected, but can it still be said to be roughly equal?

To answer this with the help of chi-squared, we calculate the two chi-squared values, one for male authors:

(41 – 30)² / 30 = 4.033

and one for female authors:

(19 – 30)² / 30 = 4.033

Now we add those two numbers together and look up the result in a table like this one. We need to use the first row as we have 1 “degree of freedom” in our essentially binary variable. There, our chi-squared sum of 8.07 lies between the values in the p=0.01 column (6.635) and the p=0.001 column (10.828), meaning that the null hypothesis can be rejected with high confidence. In other words, the deviation of our sample from a 30:30 gender ratio is statistically significant. Of course, what exactly this gender bias means and where it comes from is another question.
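Instead of a printed table, the same check can be sketched in Python using only the standard library. The function names are made up for this example. The p-value shortcut via `math.erfc` is valid only for 1 degree of freedom, which is the case here; for larger tables one would need a proper statistics library.

```python
import math


def chi_squared_statistic(observed, expected):
    """Sum of (observed - expected)^2 / expected over all categories."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


def p_value_one_degree_of_freedom(statistic):
    """Upper-tail probability of chi-squared with exactly 1 degree of freedom."""
    return math.erfc(math.sqrt(statistic / 2))


observed = [41, 19]   # male and female authors in the best-of lists
expected = [30, 30]   # what a 1:1 ratio would give for 60 authors

statistic = chi_squared_statistic(observed, expected)
p_value = p_value_one_degree_of_freedom(statistic)

print(f"chi-squared = {statistic:.2f}, p = {p_value:.4f}")
```

The computed p-value falls between 0.001 and 0.01, which matches what the table lookup tells us.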

In case all of this didn’t make any sense to you, there are many online tutorials on chi-squared which perhaps explain it better, among which I recommend this video by Paul Andersen on YouTube.


