A number of years ago, I gave a talk on mapping correspondence – that is, about the ways in which you can plot letters and epistolary exchanges on a map. Perhaps the most important point arising from that talk, for me anyway, was the understanding that mapping correspondence was by no means a straightforward matter. What exactly do you map, when you map correspondence? The writers’ locations? Or those of both the writers and the recipients? Or the path of delivery of the letter? The duration of conveyance? The amount of correspondence? And in doing any or all of these, what *use* is the map?

Similar ponderings are behind this blog post – not about mapping, but about *counting* correspondence. What do you count, really, when you count letters? How can counting help? Are graphs useful?

These thoughts arose from reading Brenton Dickieson’s blog post, “A Statistical Look at C.S. Lewis’ Letter Writing”. Working from three published volumes of C.S. Lewis’s collected letters, Dickieson plotted the 3,274 letters on graphs, basically looking at the volume of letters Lewis wrote over time, and discussing contextual events that are reflected in the sheer numbers of Lewis’s letters. Here’s his graph of the number of letters over time (copied from his blog, with my apologies and thanks):

This graph is much as you’d expect for someone like Lewis, whose fame grew over time, bringing the inevitable mountain of letters with it: it shows overall growth over time.^{1} I’m sure you can immediately see that some of the peaks can be mapped to publications (the Narnia books started coming out in 1950), and other events (WWI in 1914-1918).

But hang on: what does this graph actually show? *What* does it count?

I think that when we look at a graph like this we tend to make a lot of assumptions. For instance, it is easy to take the above graph as depicting ‘the number of letters written by Lewis during his lifetime’ – especially as the number of letters is so high. Dickieson actually titles the chart as “the number of letters we have from Lewis each year” – which you might call ‘the number of letters which are extant today’. But what the chart in fact shows is a third figure, namely ‘the number of letters published in this one edition’. These are different things:

1. the actual number of letters written by a writer during their lifetime;
2. a subset of (1), being the number of letters which survive; and
3. a subset of (2), being the number of letters which we (or the editors, rather) know about.

These are all cases of ‘*all* the letters of X’ – a common phrase in titles of editions is “the complete correspondence”. Of course, attaining a true count of (1) is practically impossible – do you have copies of all the emails you ever sent? Exactly. So editions of “the complete letters of X” tend to strive to be (2) and say that they are (2), while they are of course (3). In fairness, (3) can equal (2), but it is not uncommon for further letters to be found after the publication of volumes of “the complete letters”.

So if we look back to the graph of Lewis’s letters above, now with the understanding that it represents (3) and, possibly, (2), but that it does *not* show (1), a second question arises. Given that the graph shows a subset of (1), is this subset **representative** of the whole? More specifically:

- Do the ups and downs of the graph reflect *actual* fluctuations in the number of letters written by Lewis, or just in the number of letters that survive?
- Similarly, does the overall trend reflect the *actual* overall trend – that of (1)?

For instance, for much of the 1930s, only about 20 letters per year survive from Lewis. Did he really write fewer letters during this decade?

Given that the graph is based on more than three thousand letters, I think that the overall trend – an increasing number of letters over time – probably does reflect (1). But its minor fluctuations are more likely to reflect what has survived than what was originally there.

More commonly, editions of letters offer only a selection of the correspondence of a writer or a group of writers. In these cases, the points I have raised become even more significant.

As an example, let’s take the letters of J.R.R. Tolkien. As far as I know, only one volume of his letters has been published, being *The Letters of J.R.R. Tolkien* (Allen & Unwin, 1981) – although more letters have been published since in various books and articles.^{2}

The *Letters* published 354 letters. A very quick search online found about the same number again in a list on the Tolkien Gateway site; if we exclude letters of uncertain date, this list gives us another 349 letters, for a total of 703. A far cry from Lewis’s 3,000+, and I would imagine that many more of Tolkien’s letters survive; but this is enough to plot in a chart to make my point:

In this chart, the blue columns show the number of letters per decade published in the *Letters*, and the red columns the number of further letters given in the list on the Tolkien Gateway website. Obviously, neither set of letters reflects (1), or even (2) or (3) as discussed above. But the point I want to make here regards overall trends teased from these counts. If we look at the blue columns, it would appear that Tolkien’s letter-writing peaked in the 1950s, with fairly even numbers from the 1940s through the 1960s. But the red columns indicate that the peak was not until the 1960s, and that not many letters date from the 1940s. So we can immediately see that the blue and the red columns cannot both be representative of (1), the actual number of letters written by Tolkien, and there is no reason to assume that either of them is.
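For anyone who wants to try this kind of tally themselves, here is a minimal sketch in Python of counting letters per decade for two sources and comparing where each one peaks. The years below are invented for illustration and are *not* the actual Tolkien data; the names `letters_per_decade` and `peak_decade` are likewise just my own labels for this example.

```python
from collections import Counter

def letters_per_decade(years):
    """Count letters per decade, keyed by the decade's first year (e.g. 1950)."""
    return Counter((year // 10) * 10 for year in years)

def peak_decade(counts):
    """Return the decade with the most letters (the earliest decade wins ties)."""
    return min(counts, key=lambda decade: (-counts[decade], decade))

# Invented years of writing, for illustration only: NOT the real Tolkien data.
edition_years = [1941, 1944, 1951, 1954, 1955, 1958, 1963, 1965]
website_years = [1937, 1952, 1961, 1962, 1964, 1966, 1968, 1972]

edition_counts = letters_per_decade(edition_years)
website_counts = letters_per_decade(website_years)
combined_counts = edition_counts + website_counts  # Counter addition sums the tallies

for label, counts in [("edition", edition_counts),
                      ("website", website_counts),
                      ("combined", combined_counts)]:
    print(label, dict(sorted(counts.items())), "peak:", peak_decade(counts))
```

With these made-up numbers the ‘edition’ set peaks in the 1950s while the ‘website’ set and the combined set peak in the 1960s: the same writer, a different trend, depending only on which letters happened to end up in which list.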

So, to recap and summarize. **The overall trend we can extract from data depends on the dataset**. This is really quite obvious. What is harder to remember is that the constitution of the dataset can be something other than what the reader expects, and this can have serious implications for the interpretation and understanding of the data.

This discrepancy becomes especially relevant in situations where only a fraction of (1) survives, which in the case of historical material is almost always. Unless we have an extremely carefully made estimate of a letter-writer’s full output, we need to be really careful when counting their letters and making inferences based on those numbers.

Here’s an example. The following chart shows the number of letters sent from England by Thomas Wilson, servant and secretary to Sir Robert Cecil, to the English merchant Richard Cocks in Bayonne, France.

In this chart, blue shows the number of letters that survive (N = 1), red the number of letters that are mentioned in other surviving sources, but which don’t survive (N = 29). Based on the surviving letters, there is no trend. Based on the number of letters reconstructed from intertextual references, there was a continuous correspondence over this whole period.
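The arithmetic behind that contrast is simple enough to spell out. The sketch below uses only the two counts given above (1 surviving letter, 29 attested but lost); the variable names are my own.

```python
surviving = 1        # letters that survive
attested_lost = 29   # letters mentioned in other surviving sources, now lost

known_total = surviving + attested_lost      # every letter we have evidence for
survival_share = surviving / known_total     # fraction of the known letters that survive

print(f"known letters: {known_total}")            # known letters: 30
print(f"share surviving: {survival_share:.1%}")   # share surviving: 3.3%
```

Note that 30 is only a floor for (1): letters that neither survive nor happen to be mentioned anywhere else are invisible to this count, so the true share of surviving letters can only be lower.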

I hope I haven’t given the impression with this blog post that I’m somehow criticizing Dickieson’s exploration of Lewis’s letters. On the contrary, I found his blog post fascinating, and I have in the past made similar graphs when trying to make sense of correspondences (as the last graph shows). I just wanted to raise some quite basic questions regarding the assumptions we make when using quantitative methods to make sense of data that we usually explore and study qualitatively.

[This post didn’t quite go where I thought it would, but it’s too long to rewrite. I’m not sure it’s particularly interesting, either, but I hope to remedy that anon with a post about dates (the calendar, not the fruit) in Early Modern letters.]

**Notes**

1. This feels quite obvious if we think about general human lifespans, too: the longer you live, the more people you meet => the more communication events are likely to follow. And this reminds me of an article in *Science* (Malmgren et al, “On universality in human correspondence activity”, *Science* 325 (1696), 2009) in which, through some serious number-crunching, the authors discovered that i) the amount of letters a person writes increases over their lifespan, ii) letter-writing is a correspondence event (when you receive a letter, you are likely to write a reply), and iii) letter-writing times correlate with the hours the writer is awake. My summary here is probably partly wrong, and certainly rather dismissive, and I have no idea about the calculations involved which I expect are the real beef of the article, but there are two points to make from their article, both of which are relevant to my present discussion: (1) number-crunching doesn’t necessarily tell you anything new; and (2) you can only get out what you put in, aka. what, *exactly*, are you counting? (Actually there’s also a third: (3) the humanities and sciences are interested in different things, ask different questions, take into account different contexts, etc etc. But let’s not go there today).

2. And I have to take this opportunity to ~~boast~~ confess that I’ve edited one previously unpublished letter myself: see Alaric Hall & Samuli Kaislaniemi (2013), “‘You tempt me grievously to a mythological essay’: J. R. R. Tolkien’s correspondence with Arthur Ransome”, in *Ex Philologia Lux: Essays in Honour of Leena Kahlas-Tarkka* ed. by Jukka Tyrkkö, Olga Timofeeva & Maria Salenius. [*Mémoires de la Société Néophilologique* XC]. Helsinki: Société Néophilologique. pp. 261-280.