Corpus building, the ELTeC way
Posted: May 30, 2020
Some time ago, I attended a fascinating presentation about ELTeC (European Literary Text Collection), a multilingual corpus of novels. Such a corpus is not a new idea, but the way in which novels are chosen for inclusion in ELTeC is so thoughtful and transparent that Humanities scholars (and perhaps particularly art historians) might learn something from it. Usually, they (i.e. we) don’t think much about which objects they select for analysis, much less justify their choices, and the result is an inaccurate or distorted representation of reality with little scholarly merit.
The ELTeC criteria for inclusion can be seen on the Summary Page that shows the texts included so far:
- language: the number of texts per language varies, but that is surely going to change; they seem to be capped at 100, and even languages with relatively few speakers such as Slovenian and Hungarian have reached this number already. Thus the project appears to strive for equal representation of all languages considered.
- male author / female author: some of the numbers show that ELTeC aims at a quota of either 50:50 (English) or 2:1 (German, French). In other cases the share of female authors is lower, though.
- short/medium/long: probably based on word counts, the novels are divided into three categories of length. The idea was to represent all lengths equally, but this doesn’t seem to have worked out in all languages: e.g. only 8% of the Slovenian novels are ‘long’.
- year of first publication: most likely due to copyright restrictions, only novels published in or before 1920 are included in the corpus. The earliest date is 1840, but they plan to extend the corpus to earlier novels eventually. This 1840-1920 period is divided into four 20-year segments, and again the aim is to represent all segments equally – in French, for example, exactly 25 texts are included from each segment.
- frequent/rare: this criterion concerns the canonicity of the novels, as measured by the number of reprints. Well-known and less widely known texts should be equally represented, although there doesn’t yet seem to be a strict rule in place for how many reprints make a text “frequent” or “rare”.
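To make the time-segment criterion concrete, here is a minimal Python sketch that assigns a year of first publication to one of the four 20-year segments and counts texts per segment. The function names and the sample years are made up for illustration, and the handling of boundary years (1920 is counted in the last segment) is an assumption rather than something taken from the ELTeC documentation.

```python
from collections import Counter

# Period and segment width as described above: 1840-1920 in four 20-year bins.
# Assumption: the final bin also absorbs the closing year 1920.
START, END, WIDTH, SEGMENTS = 1840, 1920, 20, 4


def time_segment(year: int) -> str:
    """Return a label such as '1860-1879' for a year of first publication."""
    if not START <= year <= END:
        raise ValueError(f"{year} lies outside {START}-{END}")
    index = min((year - START) // WIDTH, SEGMENTS - 1)
    low = START + index * WIDTH
    high = END if index == SEGMENTS - 1 else low + WIDTH - 1
    return f"{low}-{high}"


def segment_counts(years):
    """Count how many texts fall into each time segment."""
    return Counter(time_segment(year) for year in years)


# Hypothetical publication years, not taken from the actual corpus.
sample_years = [1845, 1861, 1899, 1920]
counts = segment_counts(sample_years)
for label, n in sorted(counts.items()):
    print(label, n)
```

A perfectly balanced sub-corpus of 100 texts would show 25 in each segment, as in the French example.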
For Comics Studies, a sampling approach based on these criteria is intriguing. As an example, albeit not actually a scholarly one, let’s look at the titles of the “best manga of 2016” reviews on this weblog, of which there are currently 11. So far, these manga have only been chosen for review because I happened to have been reading them (or meaning to read them) anyway, but what if I wanted to take a more systematic approach?
- language: of course they are all originally published in Japanese, but the starting point of my blogpost series was to find out which manga were popular according to English and German sources. Who knows, maybe completely different manga would surface when one turns to other parts of the world?
- male author / female author: the current ratio is 3 male mangaka to 8 female mangaka (including a team of two women). If I wanted to achieve a ratio more like 50:50, the next review should be about a manga authored by a man (spoiler: yes, it’s going to be).
- short/medium/long: instead of word counts, the number of tankōbon volumes per series should be a feasible measure of length (although my reviews only refer to one individual volume each). Based on the number of volumes published in Japan at the time of reviewing, the 3-quantiles of our current ‘corpus’ would be the following:
  - short: 1-5 volumes
  - medium: 7-13 volumes
  - long: 15-29 volumes
That’s not terribly helpful though: what if there already is a bias in the current sample? A better way would be to calculate the quantiles from all manga published in 2016. That would be a lot of work, but maybe the picture would change quite a bit due to the consideration of long-running series such as One Piece (83 volumes by 2016), or a higher number of one-shots.
- time segments: while the manga are supposed to be from a single year, 2016, there is some leeway, as sometimes the date of the American or German publication is the one that led to the inclusion of the manga in the “best comics of 2016 meta list”. The most extreme time lag is perhaps that of Goodnight Punpun (not yet reviewed here), which was originally published from 2007 to 2013; due to its American publication in 2016 it was included in that list (and even ranked within the top 20). As mentioned in an earlier blogpost, this focus on ‘2016’ is not so much about that particular year but more about getting an idea of what manga in the 2010s were like. Perhaps it’s not worth the trouble to categorise them into such small time brackets though.
- frequent/rare: while the number of reprints would not be a suitable indicator for relatively new manga, one could complement the popular manga from the 2016 meta list with lesser-known ones that were ignored by English- and German-language media. I already did that, though not systematically: in fact, 6 out of the 11 manga reviewed were not ‘nominated’ by anyone as best manga of 2016 as far as I could see.
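The length and gender checks from the list above can be sketched in a few lines of Python. The volume counts below are hypothetical placeholders, not the actual figures for the reviewed series, and the 3:8 split treats the two-woman team as a single entry, which is an assumption. The cut points come from `statistics.quantiles` with its default interpolation method; other quantile definitions may give slightly different boundaries.

```python
import statistics
from collections import Counter

# Hypothetical volume counts for eleven series (illustration only).
volumes = [1, 2, 4, 5, 7, 9, 12, 13, 15, 21, 29]

# Two cut points divide the sample into three groups of roughly equal size.
cut_low, cut_high = statistics.quantiles(volumes, n=3)


def length_category(volume_count: int) -> str:
    """Classify a series as short, medium or long relative to the sample."""
    if volume_count <= cut_low:
        return "short"
    if volume_count <= cut_high:
        return "medium"
    return "long"


length_counts = Counter(length_category(v) for v in volumes)

# Author gender per reviewed title, using the 3:8 ratio mentioned above.
genders = ["male"] * 3 + ["female"] * 8
gender_counts = Counter(genders)
female_share = gender_counts["female"] / len(genders)

print("cut points:", cut_low, cut_high)
print("length categories:", dict(length_counts))
print(f"female share: {female_share:.0%}")
```

Because the cut points are derived from the sample itself, the three length groups come out nearly equal by construction. Feeding in the volume counts of all manga published in a given year instead would show whether the reviewed sample leans towards short or long series.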
Regardless of the purpose of your corpus, the ELTeC criteria might help you detect biases. There’s no need to follow them religiously and strive for exact equality in all categories, but they are a good starting point for thinking about how you want to select the objects of your study. In other words: if there are e.g. no female authors in your corpus, you’d better be prepared to explain why.