Copious but not compendious – Page 2

Counting correspondence, listing letters

A number of years ago, I gave a talk on mapping correspondence – that is, about the ways in which you can plot letters and epistolary exchanges on a map. Perhaps the most important point arising from that talk, for me anyway, was the understanding that mapping correspondence was by no means a straightforward matter. What exactly do you map, when you map correspondence? The writers’ locations? Or those of both the writers and the recipients? Or the path of delivery of the letter? The duration of conveyance? The amount of correspondence? And in doing any or all of these, what use is the map?

Similar ponderings are behind this blog post – not about mapping, but about counting correspondence. What do you count, really, when you count letters? How can counting help? Are graphs useful?

These thoughts arose from reading Brenton Dickieson’s blog post, “A Statistical Look at C.S. Lewis’ Letter Writing”. Working from three published volumes of C.S. Lewis’s collected letters, Dickieson plotted the 3,274 letters on graphs, basically looking at the volume of letters Lewis wrote over time, and discussing contextual events that are reflected in the sheer numbers of Lewis’s letters. Here’s his graph of the number of letters over time (copied from his blog, with my apologies and thanks):

This graph is much as you’d expect for someone like Lewis, whose fame grew over time, bringing the inevitable mountain of letters with it: it shows overall growth over time.¹ I’m sure you can immediately see that some of the peaks can be mapped to publications (the Narnia books started coming out in 1950), and other events (WWI in 1914-1918).

But hang on: what does this graph actually show? What does it count?

I think that when we look at a graph like this we tend to make a lot of assumptions. For instance, it is easy to take the above graph as depicting ‘the amount of letters written by Lewis during his lifetime’ – especially as the number of letters is so high. Dickieson actually titles the chart as “the number of letters we have from Lewis each year” – which you might call ‘the amount of letters which are extant today’. But what the chart in fact shows is a third figure, namely ‘the amount of letters published in this one edition’. These are different things:

(1) the actual number of letters written by a writer during their lifetime;
(2) a subset of (1), being the number of letters which survive; and
(3) a subset of (2), being the number of letters which we (or the editors, rather) know about.

These are all cases of ‘all the letters of X’ – a common phrase in titles of editions is “the complete correspondence”. Of course, attaining a true count of (1) is practically impossible – do you have copies of all the emails you ever sent? Exactly. So editions of “the complete letters of X” tend to strive to be (2) and say that they are (2), while they are of course (3). In fairness, (3) can equal (2), but it is not uncommon for further letters to be found after the publication of volumes of “the complete letters”.

So if we look back to the graph of Lewis’s letters above, now with the understanding that it represents (3) and, possibly, (2), but that it does not show (1), a second question arises. Given that the graph shows a subset of (1), is this subset representative of the whole? More specifically:

Do the ups and downs of the graph reflect actual fluctuations in the number of letters written by Lewis, or just in the number of letters that survive?
Similarly, does the overall trend reflect the actual overall trend – that of (1)?

For instance, for much of the 1930s, only about 20 letters per year survive from Lewis. Did he really write fewer letters during this decade?

Given that the graph is based on more than three thousand letters, I think that the overall trend – an increasing number of letters over time – probably does reflect (1). But its minor fluctuations are more likely to reflect what has survived than what was originally there.

More commonly, editions of letters offer only a selection of the correspondence of a writer or a group of writers. In these cases, the points I have raised become even more significant.

As an example, let’s take the letters of J.R.R. Tolkien. As far as I know, only one volume of his letters has been published, being The Letters of J.R.R. Tolkien (Allen & Unwin, 1981) – although more letters have been published since in various books and articles.²

The Letters published 354 letters. A very quick search online found about the same number again in a list on the Tolkien Gateway site; if we exclude letters of uncertain date, this list gives us another 349 letters, for a total of 703. A far cry from Lewis’s 3,000+, and I would imagine that many more of Tolkien’s letters survive; but this is enough to plot in a chart to make my point:

In this chart, the blue columns show the number of letters per decade published in the Letters, and the red columns the number of further letters given in the list on the Tolkien Gateway website. Obviously, neither set of letters reflects (1), or even (2) or (3) as discussed above. But the point I want to make here regards overall trends teased from these counts. If we look at the blue columns, it would appear that the peak of Tolkien’s letter-writing activities was in the 1950s, being fairly even from the 1940s through the 1960s. But the red columns indicate that the peak was not until the 1960s, and not many letters date from the 1940s. So we can immediately see that neither the blue nor the red columns appear to be representative of (1), of the actual number of letters written by Tolkien.

So, to recap and summarize. The overall trend we can extract from data depends on the dataset. This is really quite obvious. What is harder to remember is that the constitution of the dataset can be something other than what the reader expects, and this can have serious implications for the interpretation and understanding of the data.

This discrepancy becomes especially relevant in situations when only a fraction of (1) survives. Which of course in the case of historical material is almost always. Unless we have an extremely carefully made estimate of a letter-writer’s full output, we need to be really careful when counting their letters and making inferences based on those numbers.

Here’s an example. The following chart shows the number of letters sent from England by Thomas Wilson, servant and secretary to Sir Robert Cecil, to the English merchant Richard Cocks in Bayonne, France.

In this chart, blue shows the number of letters that survive (N = 1), red the number of letters that are mentioned in other surviving sources, but which don’t survive (N = 29). Based on the surviving letters, there is no trend. Based on the number of letters reconstructed from intertextual references, there was a continuous correspondence over this whole period.
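To put a number on that gap, here is a back-of-the-envelope sketch in Python. It uses only the two totals from the chart; the function name `survival_summary` and its parameters are my own labels for illustration. Note that the total it returns is only the number of letters we know about, so it is a lower bound on what Wilson actually wrote.

```python
def survival_summary(surviving, attested_lost):
    """Return (known_total, share_surviving) for one correspondence.

    surviving      -- letters that are extant today
    attested_lost  -- letters mentioned in other sources but not extant
    """
    known_total = surviving + attested_lost
    return known_total, surviving / known_total


# Totals from the Wilson-to-Cocks chart: 1 extant letter,
# 29 letters known only from references in other surviving sources.
known_total, share = survival_summary(surviving=1, attested_lost=29)
print(f"{known_total} letters known, {share:.1%} of them extant")
```

With these totals, a count based on surviving letters alone would see about one thirtieth of the correspondence we can actually reconstruct.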

I hope I haven’t given the impression with this blog post that I’m somehow criticizing Dickieson’s exploration of Lewis’s letters. On the contrary, I found his blog post fascinating, and I have in the past made similar graphs when trying to make sense of correspondences (as the last graph shows). I just wanted to raise some quite basic questions regarding the assumptions we make when using quantitative methods to make sense of data that we usually explore and study qualitatively.

[This post didn’t quite go where I thought it would, but it’s too long to rewrite. I’m not sure it’s particularly interesting, either, but I hope to remedy that anon with a post about dates (the calendar, not the fruit) in Early Modern letters.]

Notes

1. This feels quite obvious if we think about general human lifespans, too: the longer you live, the more people you meet => the more communication events are likely to follow. And this reminds me of an article in Science (Malmgren et al, “On universality in human correspondence activity”, Science 325 (1696), 2009) in which, through some serious number-crunching, the authors discovered that i) the amount of letters a person writes increases over their lifespan, ii) letter-writing is a correspondence event (when you receive a letter, you are likely to write a reply), and iii) letter-writing times correlate with the hours the writer is awake. My summary here is probably partly wrong, and certainly rather dismissive, and I have no idea about the calculations involved which I expect are the real beef of the article, but there are two points to make from their article, both of which are relevant to my present discussion: (1) number-crunching doesn’t necessarily tell you anything new; and (2) you can only get out what you put in, aka. what, exactly, are you counting? (Actually there’s also a third: (3) the humanities and sciences are interested in different things, ask different questions, take into account different contexts, etc etc. But let’s not go there today).

2. And I have to take this opportunity to ~~boast~~ confess that I’ve edited one previously unpublished letter myself: see Alaric Hall & Samuli Kaislaniemi (2013), “‘You tempt me grievously to a mythological essay’: J. R. R. Tolkien’s correspondence with Arthur Ransome”, in Ex Philologia Lux: Essays in Honour of Leena Kahlas-Tarkka ed. by Jukka Tyrkkö, Olga Timofeeva & Maria Salenius. [Mémoires de la Société Néophilologique XC]. Helsinki: Société Néophilologique. pp. 261-280. Link to pdf.

Editing is Hell, and normalization is an illusion

As a procrastinatory excursion, here are some thoughts about editing historical texts. Rather than an insightful comment on editorial philosophy, the following stems from practical matters and contains nitty-gritty details, and is not written in conversation with other editors (sorry). I’m sure everything I say here has been said before, but repetitio etc.

1. Why normalization is an illusion

Years back, I found I had a problem. I was considering the expansion of abbreviated words, and one of the words in question was merchant. This word was very frequent in the text I was working on, and occurred in various forms, some of them abbreviated: “marchant”, “m^rchnt”, etc. I thought it would be a straightforward matter to settle on an expanded form. I decided I would normalize according to most attested usage. That is to say, editorial expansions would reproduce the most frequent forms fully spelled out.

In other words, if for instance “marchant” was the most frequent unabbreviated spelling of merchant, then “m^rchnt” would be expanded as “marchant”. (With editorial expansions indicated, e.g. “marchant”.)

My next step was to establish what was the authorial preferred spelling of the word merchant. This is what I found (word stem forms only):

m^rchant*	1
m^rchnt*	31
m^rcht*	1
marchand*	1
marchant*	31
marchnt*	21
marshant*	1

Out of 87 hits of merchant*, only 33 were spelled out fully, i.e. 54 (about 62%) were abbreviated forms. And the abbreviated form “m^rchnt*” was as frequent as the fully spelled out form “marchant*”.
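For anyone who wants to check the arithmetic, here is a minimal sketch in Python that redoes the tally from the list above. The counts are copied from that list; the names `counts`, `FULL_FORMS` and `summarize` are my own, and the decision about which stems count as "fully spelled out" is hard-coded as an assumption rather than derived by any rule.

```python
# Counts of word-stem forms of "merchant", copied from the list above.
counts = {
    "m^rchant": 1,
    "m^rchnt": 31,
    "m^rcht": 1,
    "marchand": 1,
    "marchant": 31,
    "marchnt": 21,
    "marshant": 1,
}

# Assumption: these three stems are the fully spelled-out forms;
# every other stem is treated as abbreviated.
FULL_FORMS = {"marchand", "marchant", "marshant"}


def summarize(counts, full_forms):
    """Return (total, full, abbreviated, share_abbreviated)."""
    total = sum(counts.values())
    full = sum(n for form, n in counts.items() if form in full_forms)
    abbreviated = total - full
    return total, full, abbreviated, abbreviated / total


total, full, abbreviated, share = summarize(counts, FULL_FORMS)
print(f"{total} hits: {full} spelled out, {abbreviated} abbreviated ({share:.0%})")

# Which form(s) are most frequent overall? A tie here means there is
# no single "preferred" spelling to normalize to.
top = max(counts.values())
most_frequent = sorted(form for form, n in counts.items() if n == top)
print(most_frequent)
```

The tie at the top is the whole problem in miniature: picking "the most attested form" only works if one form actually wins.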

This made me stop and think: if the fully-spelled-out form of a (stem of a) word is much rarer than abbreviated forms, can we really claim it represents authorial preference? If we chart the possible spellings of merchant as a sequence of graphs, we get the following:

Note that this image shows both the actual realized spellings in the texts, and also possible spellings which do not occur (such as “m^rchand*”). (Note also that other possible contemporary spellings are not given, such as “m^rchāt”, or indeed “merchant”).

I consequently completely abandoned the idea that we can ‘reconstruct’ authorial spelling, and also determined to make sure to explicitly indicate editorial intervention at all times. A spelling like “m^rchnt” should never be expanded as “marchant”. Instead, modern forms should be used: the only correct expansion here is “merchant”.

Having said that, using modern spellings is of course not an option in dead language varieties, such as Middle English. And further, even in Early Modern English there are graphemic and orthographical features without counterpart in Present-Day English(es). What to do with obsolete suffixes, like in “hath”, or “didst”? And what about graphs which were already obsolete but replaced with other contemporary ones, such as <y> for thorn <þ>, as in “y^e” ‘the’ and “y^t” ‘that’? And what about graphs and special characters which were common in the period but which we no longer use?

2. Editing is Hell

Here is one example of how just such a special character – a brevigraph – can cause serious problems when deciding on best editorial practice.

The image below is from a document from 1599, written in a late Elizabethan cursive (aka English secretary hand), with a slightly old-fashioned ductus. It reads: “her ma^tꝭ servyce” (or “ſervyce”, if we want to retain the long <s>¹). The second word is an abbreviation of majesty’s.

The <e>-shaped graph in the abbreviated second word is what I usually call an -es-graph. Originally a medieval brevigraph² for (usually word-final) “-is”, “-es”, and “-ys”, in Early Modern English cursive hands <ꝭ> is also used for (word-final) “-s”. Some sources claim it indicates a plural or possessive, but I have found it used in proper names, prepositions, verbs and adverbs too. (Possibly it is more frequent in one rather than another, but I expect that in its distribution scribal preference is more significant than part of speech).

So the question for today’s workshop is, what to do with the abbreviation “ma^tꝭ”?

What indeed.

Depending on your editorial principles, I can come up with 23 different possible outcomes, based on choices regarding i) spelling, ii) contractions (abbreviations), iii) superscripts, and iv) brevigraphs (special characters).

The outcomes – and the choices – are shown in the massive table below. Let me help you read it. The top three rows indicate options when retaining original spelling. The blue section in the middle indicates choices and outcomes when normalizing spelling (since regardless of my personal opinion as stated above, normalization is a common editorial practice), and the bottom ten rows the same for modernized spelling.

The numbers show whether editorial intervention has been indicated or not. “1” indicates that the editorial intervention is marked (e.g. “majesty’s”), and “0” that it is unmarked (e.g. “majesty’s”). And “(1)” means that the feature is marked, but it is so in the original text (i.e. the superscript and the brevigraph).

Some of the features are not strictly speaking divisible – that is to say, the contraction is marked by the superscript, and hence you can’t show that you’ve changed one without doing the same for the other: expanding the contraction and lowering the superscript are in essence the same thing. Thus “(1)ss” means that contraction is marked since what has been done to superscripts is marked, and “(1)c” means that superscript is marked since what has been done to contraction is marked.

Some notes are in order:

* Arguably since the contraction (and superscript) is not marked, this does not qualify as a representation of the original spelling.

† When lowering of the superscript or expansion of the contraction is indicated, indication of the normalization or modernization of the brevigraph gets subsumed – when using italics to mark editorial interventions. If other methods are used – apostrophes, parentheses – it is possible to make this distinction, as seen in the rightmost column.

‡ Some of the possible outcomes are unlikely to be chosen by the editor, because editorial intervention can also make the word more difficult to parse, rather than less. So for instance the edited forms “mates” and “mates” seem to me undesirable outcomes – “mates” is scarcely better, whereas “ma’tes” (or “ma’tes”) at least indicates that the word is an abbreviated form.

3. Um, argh?

Well, yes exactly. And this is without going into the jungle of brackets and other symbols used by editors to distinguish between different kinds of things in the text, such as interlineal insertions, deletions, damage, etc etc (a good discussion of which, with clear examples, can be found in the appendix to Michael Hunter’s Editing Early Modern Texts (2007)).

And also, other cases – other graphs, other methods of abbreviation, other hands – produce different problems, so I’m certain that the above table does not suffice for all editorial problems.³

What can the editor do, then?

I think this question can be answered: The editor can do whatever the hell they please. What they should do in any case, however, is make sure that all editorial interventions in the text are visible and the original form is recoverable. How they do this is another matter, as is the extent of their meddling. But there should always be a chapter or document setting out very clearly the editorial principles and practices followed in the edition.
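As a rough illustration of what "visible and recoverable" could mean in a digital edition, here is a small Python sketch. Everything in it is hypothetical: the class names `Segment` and `EditedWord`, and the use of square brackets to mark editorial additions, are my assumptions for the example and not a description of any existing edition or standard. The idea is simply to store the manuscript form verbatim next to the edited form, and to flag every stretch of text the editor supplied.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Segment:
    text: str        # characters as they appear in the edited form
    editorial: bool  # True if the editor supplied these characters


@dataclass(frozen=True)
class EditedWord:
    original: str                 # transcription of the manuscript form, kept verbatim
    segments: Tuple[Segment, ...]  # the edited form, split by who supplied what

    def plain(self) -> str:
        """Edited form with no marking of interventions."""
        return "".join(s.text for s in self.segments)

    def marked(self, open_: str = "[", close: str = "]") -> str:
        """Edited form with editorial segments wrapped in brackets."""
        return "".join(
            f"{open_}{s.text}{close}" if s.editorial else s.text
            for s in self.segments
        )


# "m^rchnt" expanded to the modern spelling "merchant".
# Lowering the superscript is not marked separately in this toy example;
# it stays recoverable only because `original` is stored alongside.
word = EditedWord(
    original="m^rchnt",
    segments=(
        Segment("m", False),
        Segment("e", True),
        Segment("rch", False),
        Segment("a", True),
        Segment("nt", False),
    ),
)

print(word.original, "->", word.marked(), "->", word.plain())
```

Because the original is never overwritten, the choice of how to display interventions (brackets, italics, or nothing at all) can be left to the edition's stated principles rather than baked into the data.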

Notes

1) Who would want to do that!, you exclaim. Well, I would, for one. Our knowledge of Early Modern English handwriting is ridiculously limited, and in particular quantifiable information is scarce. So you need geeks like me to count them long <s>s.

2) The -es-graph in Unicode – <ꝭ> – may be okay for medieval texts, but it is quite unlike the <e>-shaped -es-graphs in Early Modern English secretary hands. For a type facsimile edition, I would need to find a better character. And indeed I have done so – as seen by the -es-graph used in the table (which I got from the Electronic Text Edition of Depositions 1560–1760, available on the CD accompanying Merja Kytö, Peter J. Grund & Terry Walker, Testifying to Language and Life in Early Modern England (Benjamins, 2011)).

3) Something to avoid is what might be called hybrid forms. For instance, combining a normalized expanded stem with the word-final brevigraph: “maiestꝭ”. Or expanding the contraction and modernizing the stem of the word, but normalizing the expansion of the brevigraph: “majestes”. Expanding and modernizing the contraction but leaving the superscript just looks silly: “ma^jesty’s”.

quantity +/- quality

For a long time, I’ve felt that the pressure to produce MORE publications – more Things To Count, since the system as it is now uses quantitative methods to establish quality of academics – is doing everyone a disservice, with lots of half-formed publications seeing the light of day.* In this publish or(/and) perish world, I’ve been seeing it as a quantity VERSUS quality issue, and have felt that less might be more – a point that has been raised by many others, quite often citing Nobel laureates of yore who only ever published half a dozen articles. Clearly quantity is not an objective measure of academic worth.

However, there is another way of looking at this issue, one which is highly important to anyone writing for a living. To paraphrase various authors, this is the general system:

1. write
2. finish what you write
3. send it out and get it published
4. revise texts only if and when necessary
5. repeat from step 1

Which makes well good sense: it all comes down to writing things, finishing them, and getting them out into the world. On a conceptual level, writing a blog post and writing a monograph are on a par. Yes, my interminable monograph is taking forever to get done, but I also have a draft for a post on this blog that dates to December 2012. Things can – and do! – end up hanging in limbo, unfinished for a slew of reasons, all of which come down to one: not writing.

But the other step in the system – where step 1 is To Write – is To Write LOTS.

To produce Quantity.

If you’re a writer trying to make a living, I can see you need to sell your writings in order to feed your kids. But if you’re an academic..? And in any case won’t the quality suffer?

Generally speaking, I’ve tended to equate academic writing with something that takes a long time to accomplish. It takes time to do the background reading, to do the research, to do the thinking required. (To do the lab experiments, to run the programs (and correct the bugs), to process the results…) Perhaps that’s what I find infuriating about the present state of affairs: here’s this thing which takes A Long Time To Do (never mind Do Well), and we are required – Demanded – to do more of it in less time!

But things can also get written very rapidly.

Now, I’ve written articles that took years to produce. But I’ve also written things at speed: there are a couple of pages in one of the things I have published that I well remember writing – longhand (!) – in one sitting, in a pub in London.† And last fall I put together an article within a week – admittedly based on a presentation I gave earlier last year, but my presentations are not article-text read out loud. We all have moments when words just flow out, and it feels like you are channeling something or someone, rather than producing new text. But I can’t help feeling this is rarer in academia than in (some) other fields (I suspect this is something that gets easier the further along your academic career you get, i.e. experience helps, even more so than with many if not most other genres).

..okay, so if I’m willing to concede, after all, that academic writing can be fairly rapid, what am I struggling over?

Maybe it’s the business with killing your darlings. Just now, I read a short but great post on Tumblr. Here’s an excerpt:

Pottery, particularly wheel-throwing, is wonderful for this, incidentally. You fail over and over and you fail fast and you are creating quantity to lead to quality. You throw and throw and throw and things die on the wheel and things die when you take them off the wheel and things explode in the kiln and after you have made a dozen or two dozen or a thousand, none of them are precious any more. There is always more clay.

..but it’s only a page long, so go read it on Squash Tea. I can wait.

Done? Groovy.

Here’s the bit that particularly got me:

after you have made a dozen or two dozen or a thousand, none of them are precious any more

Now, I am apparently able to pull together a blog post like this one in an hour or two, and even write one based on more extensive impromptu (not to mention ad hoc) research over an evening. Am I, then, just being overly precious about my ‘real’ academic publications? (Or even about ‘real’ academic publications overall..?)

To put it another way, perhaps my problem with the perceived Quality vs Quantity issue is just misguided?

Given that the system is what it is, churning out pots and hoping most of them will be serviceable and at least some also beautiful, and also not forgetting to smash the ones that are downright bad, does not appear to be at all a bad approach to academic writing. It’s easy to get stuck on polishing texts – awareness of the impossibility of perfection notwithstanding – but it’s equally difficult to see, years later, what exactly were the faults that you so much wanted to redress.

Anyway, time to stop this rambling. I am aware that I have touched upon a slew of other points related to academia that are worth addressing (and re-addressing), but I will avoid all of them for now. This was supposed to be a short note..

But just one point as a coda. Not all academic writing is done for publications. In fact, probably the vast majority of it is produced when planning and preparing teaching and writing lectures, but also in putting together talks and conference presentations and guest lectures. And most of these are one-off shows. In the past, when I’ve attended dance, theatre or music performances, I’ve wondered about the ratio of preparation versus performance in the arts. Choreographies will be practiced for weeks, scripts rehearsed and polished and rewritten up until curtain up, and bands spend hundreds of hours playing together in preparation for ten performances, or only five, or even just one single performance – to an audience of a thousand, a hundred, or just half a dozen people. But then it struck me that this is what we do as academics: I will spend days putting together a conference paper, and then give it to a roomful of scholars in 20 minutes – and that’s it. Potentially I will write it up for publication later, but this is by no means always the case (although making it so is a worthwhile habit to create).

I guess my point is that these, really, are our pots and dishes: we churn them out by the dozen, and they do include many duds. Sometimes you work for weeks but only on actually presenting it do you see why your paper doesn’t work. But most of the time, you produce a serviceable dish. And then it’s time to reach for more clay.

* This is meant to be a short blog post so pardon me for not engaging with questions relating to quality of academic publications over time, or any other parameter for that matter. As also with issues such as other reasons why texts of dubious quality get published in the first place.

† I actually wrote out twice as much as ended in the article but had to scrap everything written after the first pint. Alcohol can be a muse but when Clio morphs into Thalia you know you’ve had one too many.

Datamoaning

At the beginning of this week, I attended the two-day Big Data Approaches to Intellectual and Linguistic History symposium at the Helsinki Collegium for Advanced Studies, University of Helsinki. Since Tuesday, I’ve found myself pondering on topics that came up at the symposium. So I thought I would write up my thoughts in order to unload them somewhere (and thus hopefully stop thinking about them) (I have a chapter to finish, and not on digital humanities stuff), and also in order to try to articulate, more clearly than the jumbled form inside my head, my reflections upon what was discussed there. I.e. the usual refrain, ‘I need to hear what I say in order to find out what I think’.

So here goes.

NB this is not a conference report, in that I’m not going to talk about specific presentations given at the symposium. For that, check out slides from the presentations linked to from the conference website, and see also the Storify of the tweets from the event (both including those from the workshop that followed on Wednesday, Helsinki Digital Humanities Day).

I’ve been a part of the DH (Digital Humanities) community for about ten years now. I started off working on digital resources – linguistic corpora, digital scholarly editing; I’ve even fiddled with mapping things – but have in recent years not been actively engaged in resource- or tool-creation as such. Yet I use digital and digitised resources on a daily basis: EEBO frequently, the broad palette of resources available on British History Online all the time, and, when I have access to them, State Papers Online and Cecil Papers (Online). (I work on British state papers from around 1600, and am lucky in that much of the material I need has been digitised and put online in one form or another). I also keep an eye on what happens in the DH world: I attend DH-related conferences and seminars and whatnot when I can, subscribe to LLC (Literary & Linguistic Computing, about to be renamed DSH, Digital Scholarship in the Humanities), and hang out with DHers both online (Twitter, mostly) and in real life.

All this goes to say that I feel quite confident about my understanding of DH projects at the macro level. (Details, certainly not: implementation, encoding, programming, etc etc).

Thus, attending a DH symposium on ‘big data’, I expected to hear presentations about things I was already familiar with. And this turned out to be the case: there were descriptions of/results from projects, descriptions of methodologies (explaining to those from other disciplines ‘what is it we do’), and explorations of concepts that keep coming up in DH work.

Don’t get me wrong: I found all the presentations (that I saw) very good, and listening to talks by people in other disciplines does give you new perspectives. Maybe not profound ones, and often you end up thinking/feeling there’s little or no common ground so why do we even bother? But it’s not a completely useless exercise. Yet what I felt to be the take-away points from this symposium were ones I feel keep coming up at DH events that I have attended over the years, and ones that we – meaning the DH community – are well aware of. Such as (by no means a comprehensive list):

1. Issues with the data

“Big Data” in the humanities is not very big when compared to Big Data in some other fields
We know Big Data is good for quantity, but rubbish for quality
- We are aware of the importance and value of the nitty-gritty details
We know that manual input is a required part of both processing/methodology – in order to fine-tune the automatic parts of the process – and more importantly, for the analysis of the results (Matti Rissanen’s maxim: “research begins where counting ends”)
We know that Our data – however Big it is – is never All data (our results are not God’s Truth)
- We are aware of the limits of the historical record (“known unknowns, unknown unknowns”)

2. Sharing tools and resources

We need to develop better tools, cross-disciplinary ones
- Our research questions may be different, but we are all accessing and querying text
We need to develop our tools as modular “building blocks”, ‘good enough’ is good enough
We need to share data/sources/databases/corpora/materials – open access; copyright is an issue, but we’re all (painfully) aware of this

Clearly, these are important points that we need to keep in mind, and challenges that we want to address. And repetitio mater studiorum est. So why do I feel that their reiteration on Monday and Tuesday only served to make me grumpier than usual?*

In the pub after Wednesday’s workshop, we talked a little bit about how pessimistically these points tend to be presented. “We can’t (yet) do XYZ”. “We need to understand that our tools and resources are terrible”. …which now reminds me of what I commented in a previous discussion, on Twitter, early this year:

"Digital resources are awesome! Except for limited access, very problematic contents, and utter rubbish metadata." #DigitalHumanities

— THE POSTDOCTOR (@samklai) February 27, 2014

One element in how I feel about the symposium could be the difficulty of cross-disciplinary communication. This, too, is familiar to me seeing as I straddle several disciplines, hanging out with historical linguists on the one hand, historians on th’other, and then DHers too. I once attended a three-day conference convened by linguists where the aim was to bring linguists and historians together. I think only one of the presentations was by a historian… So yeah, we don’t talk – as disciplines, that is: I know many individuals who talk across disciplinary borders. …and, come to think of it, I know a number of scholars who straddle such borders. But perhaps it’s just that at interdisciplinary events there’s a required level of dumbing-down on the part of the presenters on the one hand, and inevitable incomprehension on the part of the audience on the other. Admittedly, it is incredibly difficult to give an interdisciplinary paper.

A final point, perhaps, in these meandering reflections, is of course the wee fact that I don’t, in fact, work on research questions that require Big Data.† (At the moment, anyway). So I’m just not particularly interested in learning how to use computers to tell me something interesting about large amounts of texts – something that it would be impossible to see without using computational power. It’s not that the methodologies, or indeed the results produced, are not fascinating. It’s just that I guess I lack a personal connection to applying them. ..but then, I suppose this can be filed under the difficulty of interdisciplinary communication! ‘I see what you’re doing but I fail to see how it can help me in what I do’.

Hmm.

So how to conclude? I guess, first of all, kudos to HCAS for putting the symposium together – and, judging from upcoming events, for playing an important part in getting DH in Finland into motion. It’s not as if there’s been nothing previously, and HCAS definitely cannot be credited for ‘starting’ DH activities in Finland in any way – some of us have been doing this for 10 years, some for 30 years or more, and along the way, there have been events which fall under the DH umbrella. But only in the past year or so has DH become established institutionally in the University of Helsinki: we have a professor of DH now, and 4 fully-funded DH-related PhD positions. Perhaps it was the lack of institutional recognition that made previous efforts at organizing DH-related activities here for the large part intermittent and disconnected. But we’ll see how things proceed: certainly many of us are glad to see DH becoming established in Finnish academia as an entity. And judging by the full house at the symposium and the workshop that followed, it would appear that there are many of us in the local scholarly community interested in these topics. The future looks promising.

It should also be said that DH has come a long way from what it was ten years ago. The resources and tools we have today allow us to do amazing things. Just about all of the presentations at the symposium described and discussed projects that use complicated tools to do complex things. I am seriously impressed by what is being done in various fields – and simply, by what can be done today. And there is no denying that there is a Lot of work being done across and between disciplines: DH projects are often multidisciplinary by design, and many are working on and indeed producing tools and resources that can be useful to different disciplines.

Maybe it’s just the season making me cranky. You’ll certainly see me at the next local DH event. Watch this space..

* ..Maybe it’s just conference fatigue that I’m struggling with? There are only so many conference papers one can listen to attentively, and almost without exception there is never time to do much but scratch the surface and provide but a thin sketch of the material/problem/results/etc. (It’s rather like watching popular history/science documentaries/programs on tv: oh look, here’s Galileo and the heliocentric model again, ooh with pictures of the Vatican and dramatic Hollywood-movie music, for chrissakes). (I mean, yes it’s interesting and cool and all that but oh we’re out of time and have to go to commercials/questions). (So in order to retain my interest there needs to be some seriously new and exciting material/results to show, like those baby snow geese jumping off a 400ft cliff (!!!!) in David Attenborough’s fantastic new documentary Life Story, or all the fantastic multilingual multiscriptal stuff in historical manuscripts that we have only just started to look at in more detail. If it’s yet another documentary about Serengeti lions / paper about epistolary formulae in Early Modern English letters, I’m bound to skip it. I’m willing to be surprised, but these are well-trodden ground). /rant

† Incidentally, I disagree with the notion that in the Humanities we don’t have Big Data – I would say that this depends on your definition of “big”. While historical text corpora may at best only be some hundreds of millions of words, and this pales in comparison to the petabytes (or whatever) produced by, say, CERN, or Amazon, every minute, I see (historical) textual data as fractal: the closer you look at it, the more detail emerges. Admittedly, a lot of the detail does not usually get encoded in the digitised corpora (say, material and visual aspects of manuscript texts), but there’s more there per byte than in recordings of the flight paths of electrons or customer transactions. Having said this, I’m sure someone can point out how wrong I am! But really, “my data > your data”? I don’t find spitting contests particularly useful in scholarship, any more than in real life.

No signal, just noise

One of the (oh too many) things I work on is code-switching* in historical texts. Or, more broadly, how multilingual environments are reflected in Early Modern English (merchants’) letter-writing. In particular I’ve done some work on the letters of early English East India Company merchants – some of it published – and then of course a bit more on the focus of my (never-ending) dissertation, the early letters of Richard Cocks. This year, I’ve joined my interest in historical code-switching to my interest in palaeography, and have given papers on if and how and why script and typeface are also switched when there is a code-switch in Early Modern English texts. In short, I have been looking at the historical development of why we still italicise words and passages in aliis linguis – and also at other practices of typographical flagging We Still Employ On A Daily Basis. This is all still work in progress, although I will write it up for publication in due course. But I would here like to share an observation gained from conferences and discussions on these and other topics this year.

Last June, there was an excellent symposium at the University of Tampere on historical code-switching. Really the first meeting of its kind, it was hugely enlightening to have three days of papers on code-switching in historical texts, covering over 1,500 years and, although focussing on English texts (OE, ME, EModE, LModE), also including some papers on texts produced elsewhere in Europe. Although all of it was fascinating and informative, I was naturally primarily interested in finding out about visual flagging of code-switching – whether by script-switching, as in the letters I presented on, or by other means. Only a couple of the papers actually focussed on visual aspects of code-switching, but visuality did crop up often enough to give me an idea of the range of the phenomenon and variation within it.

Overall, though, I was struck particularly with the realization that what we were conferencing on under the rubric of “historical code-switching”, actually was/is a hugely diverse … er, thing. Not a practice, but a vast set of practices. Not a phenomenon, but a huge array of phenomena. One of the conclusions of the conference that came up in the closing open discussion was in fact that although most scholars working on historical code-switching have been applying methodologies developed for present-day conversational code-switching to historical texts, we have all been discovering how inadequate such conceptual models and practical approaches are for our purposes. (The same has been realized by scholars working on code-switching in present-day texts). So developing models that are more suitable for the analysis of code-switching in (historical) texts is an important part of future work.

So, practices of code-switching in historical texts vary greatly depending on which period, region, languages, text types, genres, etc etc are involved – and much is down to scribal idiolects. That is to say, code-switching in the Early Modern English letters I work on is very different from code-switching in Late Modern English literary texts, or Early Modern English printed tracts, or practices in present-day northern India, or those in the Jewish community in medieval Cairo. And equally, the code- and script-switching practices of the writer I work on are completely different to those of some of his peers.

Simply put, there appears to be so much variation over time and space and text type, that it is difficult, at this stage, to see any patterns – except when restricting the study to a single place, time, text type, or indeed writer.

Okay. So what? How is the situation different from pretty much any other field?

Well, of course it isn’t. The primary difference to many other fields is the lack of data – historical code-switching is an emerging field and studies are still thin on the ground and cover disparate material. Give it another 20 years and the picture will be clearer. And actually, the fact that we know so little about any of this means that there is unexplored material aplenty, so it is ridiculously easy to come up with further topics and sources to study. Which is exciting!

But this is my point: the variation between texts (types, times, places) is so great as to render generalizations based on a single corpus void. Thus anyone making any general points about historical code-switching is, in my view, bound to be wrong.

And all of this applies equally strongly to script-switching, and also to material aspects of letter-writing: at the moment, we know next to nothing about either of these things.

Which brings me back to my work.

I guess what’s bugging me is the fact that, particularly in PhD work, you quite desperately want to be able to contribute to scholarship, and preferably with A Point: something that can be drawn out of your study and generalized; something that can be applied to other sources. Thus it’s eminently frustrating to cover new ground through painstaking attention to detail in your sources, only to end up with the realization that you have indeed made an important finding in itself, but all you can really say, based on This Material, is how This Material behaves.

* When I say “code-switching”, I use it in the broadest possible sense to mean any use of L2 in L1, including such things as quotations (which arguably require no competence in L2) and lexical borrowings (if cul-de-sac is not French, why do we keep italicising it?) as well as ‘real’ code-switching, be it inter- or intrasentential.

Did English spelling variation end in the 1630s?

1. Early Modern English spelling variation

Yesterday, rather late in the evening, I followed a link on Twitter:

So there's an EEBO-TCP spelling variation google ngram browser http://t.co/OLxUv5NLBQ (via @dr_heil)

— heather froehlich (@heatherfro) April 24, 2014

This led to the great Early Modern Print : Text Mining Early Printed English website where there was an interface like the Google Books Ngram Viewer but for the EEBO-TCP corpus, called EEBO Spelling Browser (or more technically, EEBO-TCP Ngram Browser). With the delight of a researcher falling upon a new toy I started to play with it – but hadn’t even started when I was struck by the figure that is displayed when you navigate to the EEBO-TCP Ngram Browser page. It looks like this:

The idea of an ngram viewer – as per Google – is to look at the frequency of occurrences over time, of a word (a 1-gram) or a phrase (2-, 3-, 4- … N-gram). Frequency here means the proportion of the search phrase to all the words in the corpus, plotted over time. So for instance, the frequency of the word “war” rises during wartime, and falls in peacetime. But things get much more interesting when you look at less obvious things.
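To make the notion of frequency concrete, here is a minimal sketch of how a 1-gram’s relative frequency per year could be computed. The function name and the toy data are my own inventions for illustration; this is not how the EEBO-TCP Ngram Browser or Google’s viewer is actually implemented.

```python
from collections import Counter

def relative_frequency(tokens_by_year, variant):
    """Return {year: share of all tokens in that year that equal `variant`}.

    `tokens_by_year` is a hypothetical structure: {year: [token, token, ...]}.
    """
    result = {}
    for year, tokens in sorted(tokens_by_year.items()):
        counts = Counter(t.lower() for t in tokens)
        total = sum(counts.values())
        result[year] = counts[variant] / total if total else 0.0
    return result

# Invented toy data, not drawn from EEBO-TCP.
toy = {
    1620: ["aboue", "the", "rest", "aboue"],
    1640: ["above", "the", "rest", "all"],
}

print(relative_frequency(toy, "aboue"))  # {1620: 0.5, 1640: 0.0}
print(relative_frequency(toy, "above"))  # {1620: 0.0, 1640: 0.25}
```

Plotting one such series per spelling variant gives the kind of comparison the browser draws.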

Anyway. The point of the EEBO-TCP spelling variant ngram viewer is to compare the change and development of spelling variants over time: for instance, plotting “spell” against “spelle”, “spel”, etc:

English spelling only became standardized in the 18th century, and anyone who wants to read earlier texts has to learn to deal with the fact that apparently all spellings were equally acceptable, and that writers haphazardly used the first spelling that came to their mind* – one of the most famous (or notorious) examples being how William Shakespeare signed his name in six different ways. Despite eventual standardization, spelling variation in English has not completely disappeared today, for although varying how you spell your name today sounds outrageous and unthinkable, all students of English as a foreign language have to learn that there are British and American spellings for many familiar words: colour and color, standardize and standardise, etc.

2. What the hell happened in 1625?

But to return to the EEBO-TCP Spelling Browser, what struck me was the dramatic change in the 1630s. If you look back to the first figure above, you can see that of two spellings of the word above, the spelling “aboue” is essentially the given form until 1625, when it rapidly loses to the alternative spelling “above”, which is firmly established by about 1640.

..Hang on, what? The centuries-old practice of not differentiating between the graphemes <u> and <v> according to the phonemes they indicate – /u/ and /v/ – is replaced, over the stunningly short period of 15 years – across the board (!?) in printed texts by consistent mapping of <u> to /u/ and <v> to /v/..!?

@heatherfro …That was 90mins in the middle of the night playing with EModE spelling variation. What the hell happened in 1625?!?

— Sam Kaislaniemi (@samklai) April 24, 2014

Sooo many questions.

My very first thought was that it must be an artefact of the dataset. One word, of course, hardly tells the whole story. Did this change hold for other words that show u/v variation? What about i/j variation? Or perhaps the EEBO-TCP material was somehow skewed?

But however much I fiddled with the browser, the period between 1620 and 1640 remained the significant factor. And it also applied to i/j-words:

But I did also check EEBO proper – knowing that the results may well be different from those of the EEBO-TCP Ngram Browser. However, it turned out once again that the Browser had been right:

So what on earth happened in the 1620s and 1630s to explain this dramatic shift into standardized spelling?

…actually, I don’t know. Googling revealed that this is a known phenomenon – although so far I have not found a definitive study of it, nor a good explanation. (Clearly it has something to do with what’s going on in printing houses). But for instance in her article in the Cambridge History of the English Language vol 3 (2000), Vivian Salmon discusses historical variation in using <u> and <v> to indicate both /u/ and /v/, and then quite casually mentions how “the distinction was made in the 1630s” (p. 39). I think that a corpus-based study of this change remains to be done – although I could be wrong.

Yet rather than starting to look at this point in more detail, I pursued another question that had come to mind: how did this shift in orthographical practices manifest in non-printed texts, such as letters?

3. Non-printed texts and manuscripts

Happily, I am in a perfect position to ask this question, being part of the team who have compiled the Corpus of Early English Correspondence (CEEC). The CEEC is a corpus of English personal letters, spanning 1400-1800 and presently containing about 12,000 letters (5.2m words). It was designed for historical sociolinguistics – to apply modern sociolinguistic methods to historical texts.

Of course, there’s a caveat: CEEC is based on printed editions of letters. “Hang on”, you might say, “is a corpus built from such sources linguistically reliable? Shouldn’t the corpus have been compiled from manuscript texts?” Well, yes – but we have been careful to use neither editions that modernize the letter texts nor editions that normalize the texts extensively. For the kinds of linguistic queries that the corpus was designed for, the normalization of features such as u/v variation was deemed acceptable. And we have always been careful to stress that the CEEC is not suitable for studying English orthography.

Anyway, I nonetheless rushed right in to see what the CEEC threw up. Not having fancy tools (like DICER) to reveal the proper extent of variant spellings in the corpus, I used a short list of sample words (euer/ever, ouer/over, aboue/above, vp/up). But the results were underwhelming:

In this figure, the ratio of the old form of u/v-spelling variants was far too low through the whole period – it should have been at least around 80%, if the EEBO data was indicative of English spelling practices overall, rather than just those restricted to printed texts.
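For the curious, the counting itself is simple. The following is a rough sketch of the idea, not the actual procedure I used on the CEEC: the word pairs are the sample list given above, while the function name and the sample sentence are invented for illustration.

```python
import re
from collections import Counter

# Old/new spelling pairs from the sample word list in the post.
PAIRS = {"euer": "ever", "ouer": "over", "aboue": "above", "vp": "up"}

def old_form_ratio(text):
    """Return (old_count, new_count, share_old) for the sample pairs in `text`."""
    tokens = Counter(re.findall(r"[a-z]+", text.lower()))
    old = sum(tokens[o] for o in PAIRS)
    new = sum(tokens[n] for n in PAIRS.values())
    total = old + new
    return old, new, (old / total if total else 0.0)

# Invented sentence, not a quotation from any letter.
sample = "He went vp the hill and ouer the bridge, as euer; then up again."
print(old_form_ratio(sample))  # (3, 1, 0.75)
```

A real study would need to bucket the counts by decade and deal with capitalisation and inflected forms, which this sketch ignores.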

In order to have better data, I spent some time extracting a subcorpus from the CEEC† consisting of texts only from editions of 17th-century letters in which I could find u/v and i/j-variation. This time, the results were more interesting:

Although the ratio of old spelling variants is still much lower than I had expected, in this figure there is a sharp decline from the 1640s on – which would be in accordance with a prescribed change. (For example, if all schoolchildren are taught to spell according to certain rules, it takes a while for the older generations of writers to die out (or change their spelling habits). Similarly, it makes sense that the influence of a standard orthography in printed texts would reflect in manuscript texts with a slight time lag.)

Yet I remained unhappy with this data. In EEBO, the shift is from nearly 100% old form to 100% new form. Clearly the texts of the editions used for CEEC were normalized more than I had thought. Even given that this was a quick pilot study, the discrepancy was simply too large to accept as a difference between orthographical practices of manuscript and print.

I had one last trick up my sleeve: I did have a fairly good-sized corpus of letters from the first decade of the 1600s transcribed from manuscript, which retained original spellings and other orthographical features. It wouldn’t show me change over time, but it would give me a control figure for how much, exactly, letter-writers were using the old forms in their letters.‡

The result can be seen in the figure above – it is the red X, marking a whopping 87.9% old forms. Finally, something resembling the situation in EEBO.

There was a fair bit of variation between different words in the manuscript sources, and in some cases the new form was dominant:

	euer/ever	adu/adv	haue/have
old spelling	35	90	1017
new spelling	28	191	20
% old	56%	32%	98%

The greatest discrepancy between the manuscript sources and CEEC (namely the second extracted subcorpus) could be seen in the fact that in the manuscripts, words beginning /un-/ were spelled with a <v> 99.6% of the time (of 987 tokens), whereas in CEEC, the <v>-form occurred only 31% of the time (of 295 tokens). Even editions which claim to retain original spellings clearly cannot be taken at face value.

4. Summing up

So what can we say about that dramatic end to spelling variation in the 1630s seen in the figures from the EEBO-TCP Ngram Browser? Actually, not much.

1. It would appear that in the EEBO corpus, u/v variation became standardized between 1620 and 1640. However, without a comprehensive survey even this conclusion may be wrong – cf. for instance i/j variation in the proper name James, where it takes longer for the <i>-form to start declining, nor is it gone by the end of the century (this might have to do with capitalisation):

2. In manuscript texts, it looks like the spelling standardization process occurred 20 or more years after it took place in print. But without a broader survey, even this estimate may well be wrong.

3. The EEBO Spelling Browser is awesome!

I do remain curious about what happened in the 1620s & 30s. Particularly about whether the standardization of spelling was something more than a development in printing house practices. But I think I’ve done my share of midnight rabbit chasing for the moment.

* Students of Early Modern English beware: this is not true! There are methods in the apparent madness, although the rules may be subtle, and they do vary between writers.

† My first search of CEEC material was of c. 5,000 letters (2.2m words), finding 6,657 tokens of which 700 were old spellings. My second dataset consisted of c. 1,900 letters (just under 800k words), and 3,613 tokens of which 887 were old forms (types: euer/ever, ouer/over, aboue/above, vp/up, vs/us).

‡ This manuscript-based corpus contained about 200 letters (130k words). I expanded my sample word list (types: euer/ever, ouer/over, aboue/above, haue/have, giue/give, vp/up, vs/us, adu*/adv*, vn*/un*), extracting 2,712 tokens – of which 2,385 were old forms.
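As a sanity check on the token counts given in these two footnotes, here is a small sketch of the arithmetic. The dictionary labels and the helper name are my own shorthand; the numbers are the ones stated in the footnotes.

```python
# (old-form tokens, all tokens) as reported in footnotes † and ‡.
datasets = {
    "CEEC, first search": (700, 6657),
    "CEEC, 17th-century subcorpus": (887, 3613),
    "manuscript-based corpus": (2385, 2712),
}

def percent_old(old, total):
    """Share of old spellings among all extracted tokens, as a percentage."""
    return round(100 * old / total, 1)

for label, (old, total) in datasets.items():
    print(f"{label}: {percent_old(old, total)}% old forms")
```

The manuscript-based corpus comes out at 87.9%, the figure marked by the red X, while the two CEEC searches come out at roughly 10.5% and 24.6% overall.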

—

ETA 5.9.2024

Belatedly realized there are links to this blog post out there, even in print! Here’s the old, now defunct, link; my blog used to be hosted by the University of Helsinki, while I was affiliated there:

http://blogs.helsinki.fi/kaislani/2014/04/26/spelling-variation

(I’m hoping web crawlers will catch this so it becomes googlable, and leads here.)

How should you cite a book viewed in EEBO?

Earlier today, there was a discussion on Twitter on citing Early Modern English books seen on EEBO. But 140 characters is not enough to get my view across, so here ’tis instead.

The question: how should you cite a book viewed on EEBO in your bibliography?

When it comes to digitized sources, many if not most of us probably instinctively cite the original source, rather than the digitized version. This makes sense – the digital version is a surrogate of the original work that we are really consulting. However, digitizing requires a lot of effort and investment, so when you view a book on EEBO but only cite the original work, you are not giving credit where it is due. After all, consider how easy it now is to access thousands of books held in distant repositories, simply by navigating to a website (although only if your institution has paid for access). This kind of facilitation of research should not be taken for granted.

(What’s more, digital scholarship is not yet getting the credit it deserves – and as a creator of digital resources myself, I feel quite strongly that this needs to change.)

Anyway; so how should you cite a work you’ve read in EEBO, then?

This is what the EEBO FAQ says (edited slightly; bold emphasis mine):

When citing material from EEBO, it is helpful to give the publication details of the original print source as well as those of the electronic version. You can view the original publication details of works in EEBO by clicking on the Full Record icon that appears on the Search Results, Document Image and Full Text page views, as well as on the list of Author’s Works.

Joseph Gibaldi’s MLA Handbook for Writers of Research Papers, 7th ed. (New York: The Modern Language Association of America, 2009), deals with citations of online sources in section 5.6, pp.181-93. For works on the web with print publication data, the MLA Handbook suggests that details of the print publication should be followed by (i) the title of the database or web site, (ii) the medium of publication consulted (i.e. ‘Web’), and (iii) the date of access (see 5.6.2.c, pp. 187-8).

… When including URLs in EEBO citations, use the blue Durable URL button that appears on each Document Image and Full Record display to generate a persistent URL for the particular page or record that you are referencing. It is not advisable to copy and paste URLs from the address bar of your browser as these will not be persistent.

Here is an example based on these guidelines:

Spenser, Edmund. The Faerie Qveene: Disposed into Twelue Books, Fashioning XII Morall Vertues. London, 1590. Early English Books Online. Web. 13 May 2003. <http://gateway.proquest.com.libproxy.helsinki.fi/openurl?ctx_ver=Z39.88-2003&res_id=xri:eebo&rft_val_fmt=&rft_id=xri:eebo:image:29269:1>.

If you are citing one of the keyed texts produced by the Text Creation Partnership (TCP), the following format is recommended:

Spenser, Edmund. The Faerie Qveene: Disposed into Twelue Books, Fashioning XII Morall Vertues. London, 1590. Text Creation Partnership digital edition. Early English Books Online. Web. 13 October 2010. <http://gateway.proquest.com.libproxy.helsinki.fi/openurl?ctx_ver=Z39.88-2003&res_id=xri:eebo&rft_val_fmt=&rft_id=xri:eebo:image:29269:1>.

Here’s why I think this is a ridiculous way to cite a book viewed on EEBO:

1. Outrageous URL. Bibliographies should be readable by humans: the above URL is illegible. Further, while the URL may indeed be persistent, no-one outside the University of Helsinki network can check the validity of this particular URL. And to quote Peter Shillingsburg on giving web addresses in your references, “All these sites are more reliably found by a web search engine than by URLs mouldering in a footnote”. If you wanted to find this resource, you’d use a web search engine and look for “Spenser Faerie Queen EEBO”. Or go directly to EEBO and search there – in any case, you wouldn’t ever use this URL.
2. Redundant information. Both “Early English Books Online” and “Web”? Don’t be silly.
3. Access date. If the digital resource you are accessing is stable, there’s no need for this. If it’s a newspaper or a blog, dating is necessary (especially if the contents of the target are likely to change). In the case of resources such as the Oxford English Dictionary – which, though largely stable, undergoes constant updates – each article (headword entry) is marked with which edition of the dictionary it belongs to, which information is enough (and which explains notations like OED² and OED³, for 2nd and 3rd ed. entries, respectively).

Instead, I suggest and recommend a citation format something like the following:

Spenser, Edmund. The Faerie Qveene: Disposed into Twelue Books, Fashioning XII Morall Vertues. London, 1590. EEBO. Huntington Library.

With a separate entry in your bibliography for EEBO:

EEBO = Early English Books Online. Chadwyck-Healey. <http://eebo.chadwyck.com/home>.

And if you’ve used the TCP version, add “-TCP” to the book reference, and include a separate entry for the Text Creation Partnership (EEBO-TCP).

This makes for much shorter entries in your bibliography, and clears away pages of redundant clutter which doesn’t tell the reader anything.
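If you keep your references in a script or database, the suggested format is easy to generate. The following is only a sketch of my proposal above: the function and its parameter names are hypothetical, and rendering the TCP variant as “EEBO-TCP” is my reading of ‘add “-TCP” to the book reference’.

```python
def cite_eebo(author, title, place, year, source_library, tcp=False):
    """Format a bibliography entry in the short style suggested above.

    `tcp=True` marks the Text Creation Partnership transcription; the
    'EEBO-TCP' label is an assumption about how the suffix should look.
    """
    resource = "EEBO-TCP" if tcp else "EEBO"
    return f"{author}. {title}. {place}, {year}. {resource}. {source_library}."

entry = cite_eebo(
    "Spenser, Edmund",
    "The Faerie Qveene: Disposed into Twelue Books, Fashioning XII Morall Vertues",
    "London",
    1590,
    "Huntington Library",
)
print(entry)
```

This prints the Spenser entry exactly as given above; the separate bibliography entries for EEBO and EEBO-TCP themselves would still be written out once by hand.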

Why cite the source library?

Book historians will tell you – at some length – that there is no such thing as an edition of a hand-printed book. No two books printed by hand are exactly identical (in the way that modern printed books are identical) – due to misprints and the like, but also because for instance the paper they are printed on will be different from one codex to another (since a printer’s paper stock came from many different paper mills). So two copies of an Early Modern book (the same work, the same ‘edition’) will always differ from each other – sometimes in significant ways.

For this reason, really we should cite books-as-artefacts rather than books-as-works. Happily, EEBO gives the source library of each book, and including that information is straightforward and simple enough.

Problems and questions – can you not cite EEBO?

Some of the books on EEBO are available as images digitized from different microfilm surrogates of the source book. That is, there is more than one microfilm of the same book. Technically, these surrogate images are different artefacts and we should really reference the microfilm too… I see that this could be a problem, but have not come across an issue where citing the microfilm would have been relevant to the work I was doing.

Q: Which brings us to another important point: if you are only interested in the work, is it really necessary to cite the format, never mind the artefact?

A: Well, yes, for the reasons outlined above – and simply because it is good scholarly practice.

Q: What if you only use EEBO to double-check a page reference or the correct quotation of something you’d made a note of when you viewed (a different copy of) the work in a library?

A: Ah. Well, if you are feeling conscientious, maybe make a note that you’ve viewed the work in EEBO as well as a physical copy – say, use parentheses: “(EEBO. Huntington Library.)”.

Incidentally, since Early Modern books-as-artefacts differ from each other, technically we should always state in the bibliography which copy of the work we have seen. But I’m not sure anyone is quite that diligent – book historians perhaps excepted – and I can’t be bothered to check right now.

Q: Argh. Look, can’t we just go back to not quoting the work and not bother with all this?

A: No. Sorry.

However, I think we’ve drifted a bit far from our departure point.

All this serves to illustrate how citing Early Modern books – be it as physical copies, printed editions or facsimiles, or digital surrogates – is no simple matter. (And we haven’t discussed whether good practice should also include giving the ESTC number in order to identify the work…) So no wonder no standard practice has emerged on how to cite a work seen on EEBO.

Yet in sum, if you consult books on EEBO, I strongly urge you to give credit to EEBO in your bibliography.

ETA 27.2.2014 8am:

Another argument for making sure to cite EEBO is the rather huge matter of what, exactly, EEBO is, and how that affects scholarship. In the words of others:

Daniel Powell notes that:

[I]t seems important to realize that EEBO is quite prone to error, loss, and confusion–especially since it’s based on microfilm photographed in the 1930s-40s based on lists compiled in some cases the 1880s.

And Jacqueline Wernimont adds:

EEBO isn’t a catalogue of early modern books – it’s a catalogue of copies. More precisely, it is a repository of digital images of microfilms of single copies of books, and, if your institution subscribes to the Text Creation Partnership (TCP) phases one and/or two, text files that are outsourced transcriptions of microfilm images of single texts.

These points are particularly relevant if you treat EEBO as a library of early modern English works, but they apply equally when you access one or two books to check a reference. As Sarah Werner (among others) has shown us, digital facsimiles of (old) microfilms of early books can miss a lot of details that are clearly visible when viewing the physical books (like coloured ink). While in many cases the scans in EEBO are perfectly serviceable surrogates of the original printed book – black text on white paper tends to capture well in facsimile – the exceptions drive home the point that accessing a book as microfilm images is not the same as looking at a physical copy of the book.

This is not to say that all surrogates, and especially microfilms, are bad as such. In many cases it is the copy that survives whereas the original has been lost. And I have come across cases where the microfilm retains information that has been lost when the manuscripts have been cleaned by conservators and archivists some time after being microfilmed. (Pro tip for meticulous scholars: have a look at all the surrogates, even if you don’t need to!) Also, modern digital imaging is enabling us to read palimpsests and other messy texts with greater ease than before (or indeed at all).

In essence, then, you should make sure to cite EEBO when you use it – not only because of things you may miss due to problems with the images in EEBO, but also because digital resources enable us to do things which are simply impossible or would take forever when using physical copies.

Ok this was a long rant. But I hope this might be of use to someone!

Kindness is the child of money

Thomas Wilson (c. 1565-1629; ODNB link) – among other things, intelligencer, secretary to Sir Robert Cecil, MP, and Keeper of the State Papers at Whitehall – left quite an impressive paper trail of his life post-1600. Yet thus far I have only come across one letter from him to a family member, being CP 83/47 (in the Cecil Papers at Hatfield House), which is a letter from Wilson to his wife Margaret (née Meautys). The letter is dated the first of August 1600, and was written by Wilson on one of his tours of continental Europe, where he was engaged in gathering intelligence.

I found the following passage striking:

As I was takinge my iorney into Italie in that rude vnkind contrye of savoye , I was taken w^th myne ordinary enemy the burninge fever, who charged me w^th soe many fetters that I was not able to move one foote further, soe that all my companye and honorable frends having all stayed long for me wer forced at length to leaue me and I left desolate in the handes of such people in whom kindness is onely the chyld of monye and wherof god wott I hadd butt smale abondance the rest I leave to yo^u to coniecture / god I thanke him it is past, I am nowe in better helth & plentye and proceed alonge on my voyage though solitarye yett w^th more corage ^{\& hope/} then euer, God hath not appoynted that I shal dye yett but lyue & doe better then myne enemies wish or my frends hope

Wilson was, er, plagued by tertian ague (malaria), which recurred throughout his life; it crops up in his letters several times over the years. I am not sure whether “kindness is only the child of money” is original – googling reveals nothing, but I suspect it may be from some Latin text, and perhaps can be found in some other form in English. (I checked the Helsinki Corpus (XML version) and the Corpus of Early English Correspondence, but couldn’t find it in either.)

Wilson begins the letter to his wife by apologising for not having written, writing:

I was loth to send yo^u such ill newes as I sent them vntill it was past for that it wold haue encreased yo^r sorowe wherof I knowe yo^u haue too much

..which is fair enough. But although he assures her that he is now perfectly recovered, he goes on to say that he will not be able to write for some time as he is heading into enemy territory in Italy – one of his objectives was to learn what the King of Spain is up to, and the Kingdom of Naples belonged to the Spanish crown at this time. And as if that was not enough, he concludes his letter:

out of sauoye wher the warres ar beginning the 1 of August 1600 / Thy most loving Tho: wilson

Hardly reassuring reading! Happily, he made it back safe and sound, and didn’t have to engage in too much Bond-esque action (although there are some letters where he ponders going all Jason Bourne on a fellow Englishman..).

—

This blog has been rather quiet for some time. I expect I won’t be updating for another several months still, as there is a thesis that needs finishing. I might put my July conference paper up here, provided I write one instead of just babbling. But we shall see.

On the numbering and foliation of the Cecil Papers

While discussing the provenance of the manuscripts in my PhD edition, delving into the histories of various collections and repositories, I ran somewhat off on a tangent when writing about the Cecil Papers. Turns out that the foliation in the Cecil Papers is problematic, and references to documents in the Cecil Papers can be obscure. The little pedant in me ended up producing the following ~~rant~~ text, which is a bit too off-topic for even my thesis; for which reason it is now published here. I hope someone, one day, finds it useful. (Hope springs eternal, etc).

Oddly, Perry’s (2010) explanation of the numbering of Cecil Papers documents is incorrect. She claims that Cecil Paper numbers are formed of the volume number and the number “on the first page of that particular document”. Perry further says that “each page has been through-numbered within the volume, irrespective of where a new individual document begins”, so that consecutive document numbers may have gaps, such as her example of CP 56/1 being three “pages” long, and followed by CP 56/4. Yet browsing through the Cecil Papers reveals that the reality is more complex.

If we take Perry to mean “folio” when she says “page”, she is essentially correct. For instance, bifoliums have been given successive folio numbers on their rectos. However, a page has only been given a folio number if there is text (or other markings) on the page. Therefore, while the bifolium CP 29/17 is foliated on both its rectos (which contain text), as 17 and 18 respectively, the following document, CP 29/19, is a bifolium without text on the second recto, and this second folio has not been assigned a folio number. The document following CP 29/19 is thus CP 29/20, and not CP 29/21 as it would be if the foliation followed Perry’s description.

Since a bifolium is the most typical document form (a sheet of paper folded in half), and bifoliums with the second recto blank are very common, this means that a substantial number of folios in the Cecil Papers remain unfoliated. To complicate matters further, some of the Cecil Papers have been foliated incorrectly. For instance, CP 111/119/2 has presumably been mistakenly assigned the folio number 119 before the archivist noticed that he had already assigned 119 to the previous foliated recto, and had to correct it by adding the /2. It goes without saying that there are thus misfoliated bifolio documents with a blank second recto!

Finally, while the foliation allocating the CP numbers is done in a red ink or crayon, some of the Cecil Papers have also been foliated in pencil, including the blank rectos. For instance, CP 143/114, CP 143/115, CP 143/116 and CP 143/117 are all bifolios with blank second rectos. Their rectos, however, also carry pencilled foliation numbers in order, from “155” on CP 143/114_1r to “162” on CP 143/117_2r.

Top right corner of CP 143 f. 115r (CP 143/115)

Top right corner of CP 143 f. 115_1r (CP 143/115)

(Images from the Cecil Papers, this counts as fair use I think.)

These images are of the top right corners of the rectos of the bifolium CP 143/115, being folios 115r and a blank unfoliated-in-red-ink page I am calling 115_1r. Note the pencilled foliation which I referred to above: unlike the red ink, it is consistent, foliating these successive rectos as 157 and 158.

While emended misfoliation and secondary folio numbers may not prove insurmountable obstacles, the scholar should nonetheless be aware that many document and folio references to the Cecil Papers are thus potentially obscure. For instance, CP 143/115v – or CP 143 f. 115v – can refer either to page 2 of the said document (f. 115_1v), or to the dorse of the document, being the cover of the letter (f. 115_2v).
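To make the two numbering schemes concrete, here is a minimal Python sketch of the foliation rules as I have described them above. It is a simplification under stated assumptions, not a model of the archive: every document is treated as a bifolium whose first recto carries text, misfoliations such as CP 111/119/2 are ignored, and the names `Bifolium`, `foliate` and `document_refs` are hypothetical ones invented for the illustration.

```python
from dataclasses import dataclass


@dataclass
class Bifolium:
    """One document: a folded sheet with two rectos (hypothetical model)."""
    second_recto_has_text: bool


def foliate(volume, docs, red_start, pencil_start=None):
    """Assign red-ink and pencil folio numbers to every recto, in order.

    Red ink: only rectos carrying text receive a number.
    Pencil (where present): every recto receives a number, blank or not.
    Assumption: the first recto of each document always carries text.
    """
    red, pencil = red_start, pencil_start
    rows = []
    for doc in docs:
        # The document reference uses the red number of its first recto.
        doc_ref = f"CP {volume}/{red}"
        for has_text in (True, doc.second_recto_has_text):
            red_no = None
            if has_text:
                red_no = red
                red += 1
            rows.append({"doc": doc_ref, "red": red_no, "pencil": pencil})
            if pencil is not None:
                pencil += 1
    return rows


def document_refs(rows):
    """List the distinct document references in order of appearance."""
    refs = []
    for row in rows:
        if row["doc"] not in refs:
            refs.append(row["doc"])
    return refs


# CP 29/17 has text on both rectos; CP 29/19 has a blank second recto.
# The layout of the third document is not stated above and is assumed here.
vol29 = foliate(29, [Bifolium(True), Bifolium(False), Bifolium(False)], red_start=17)
print(document_refs(vol29))

# CP 143/114 to CP 143/117: blank second rectos, pencil foliation from 155.
vol143 = foliate(143, [Bifolium(False)] * 4, red_start=114, pencil_start=155)
for row in vol143:
    print(row)
```

Under these assumptions the sketch reproduces the sequence CP 29/17, CP 29/19, CP 29/20, and gives the two rectos of CP 143/115 the pencil numbers 157 and 158, with no red number on the blank second recto.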

Reference

Perry, Vicki. 2010. “Notes on the numbering of the Cecil Papers and the scope of the digital collection”. Cecil Papers. ProQuest and Hatfield House.

The Permissive Digital Archive

Samuli Kaislaniemi (University of Helsinki)

[This is the paper I gave at The Permissive Archive conference at UCL in London on 9 November 2012. This version includes sections that I skipped when giving the talk – these are indented in the text below. My apologies to those whose images I cribbed: I have linked to my sources, but will remove any and all borrowed images if asked.]

Let me start by saying how happy I am to be here. I don’t think I am the only one at this conference whose life has been positively changed by CELL. And I can’t think of any other academic institution that manages to host conferences that feel like parties!

0. Introduction

The digitisation revolution – for it is a revolution – has changed the way we do historical research. This applies equally to archaeologists and historical linguists, literary scholars and historians: anyone working on the past cannot but be affected by new digital tools and resources. They bring their own share of new challenges – many of which turn out to be old challenges. And they also promise – or seem to promise – to deliver new and exciting results.

I. Terminology: What is a digital archive?

What is a digital archive? The previous two presentations both talked about digital archives, but the term was not defined – so there seems to be a general understanding of what we mean by this term. Kenneth Price [1] has tried to tease out the nuances between different terms used for essentially similar digital resources, but discovered that distinctions are blurred. An electronic edition, according to Price, can mean almost anything; such editions are certainly not restricted to being digital versions of print editions. A digital project, on the other hand, is even more amorphous – but the word “project” has a sense of time, in that projects have a beginning and an end. Projects are either unfinished, or finished. In comparison, a database is usable from the moment it is set up. The term “database”, however, carries connotations of a technical nature – we think of relational databases – but when it is used as a word to describe a digital historical resource, it should be taken metaphorically. “In a digital environment”, says Price, “archive has gradually come to mean a purposeful collection of surrogates.” This is exactly what is more adequately implied by his last term, thematic research collection – and arguably, most digital resources are exactly this. But it doesn’t exactly roll off the tongue..

I’m afraid a discussion of what is an archive did not fit into this paper in the end, but to give you an idea, here is what archivist Kate Theimer [2] had to say about digital “archives”..

In other words, a digital “archive” is not an archive, but a collection. In contrast, here is Price’s comment again:

I think the use of the word archive is justifiable, since for the scholar, a repository is a repository: the details may differ from place to place, but any place you go to for access to original sources is, in essence, an archive.

Given this loose definition, “digital archives” include not only large-scale resources such as EEBO and State Papers Online, but also smaller resources such as the digital editions made here at CELL. And more importantly, I think one’s own personal research collection can be viewed as an archive. I work on archival materials, and my primary tool – after this laptop – is a digital camera. I have compiled a fairly large digital collection, having photographed almost a thousand manuscripts. These will never get published as a collection, of course, but they do form, in essence, my primary archive, which contains surrogates of all the archival materials that I (think I) need.

What can be found in a digital archive? Digitised versions of original sources, of course, as well as metadata and all the other things Jenny Bann mentioned in her paper.

II. Digital dualism

We do not need to be constantly reminded that looking at digitised books and manuscripts is not the same thing as looking at the original, material sources. However, this division into physical and electronic is not always useful, or even accurate.

Nathan Jurgenson [3] has coined the term digital dualism to refer to the false dichotomy between digital and physical worlds. (He actually differentiates between four “ideal” types of digital dualism, which you can see on the slide here – but which I don’t have time to go into.) Digital dualists are those who “believe that the digital world is ‘virtual’ and the physical world is ‘real’”. This is of course a familiar refrain to all of us, included in comments that disparage online communities in general, and the social web in particular. Facebook “is not real”, they say. But Jurgenson criticises the idea that time and energy spent in the digital world subtracts from the physical – he quotes Luciano Floridi: “we are probably the last generation to experience a clear difference between offline and online”. The digital and physical worlds may be ontologically separate, but they are both “real” in the sense of being authentic. That they have very different properties is of course true; but we live in both, and the two worlds interact. Reality, writes Jurgenson, “is always some simultaneous combination of materiality and the many different types of information, digital included.”

Jurgenson notes that “for the vast majority of writers, the relationship between the physical and digital looks like a big conceptual mess”. To remedy the situation, he provides a model of four ideal types of dualism, with “Strong Digital Dualism” at one end – which states that the physical and digital are different realities and do not interact – and “Strong Augmented Reality” at the other, which states that the realms are part of one single reality and have the same properties. Jurgenson himself takes a milder view, that of “Mild Augmented Reality” – same reality, different properties, interaction.

Lorna Hughes [4] has noted that digital tools and methodologies can well reveal more than traditional approaches: working “with a digital object (a surrogate created from a primary source that has been subject to a process of digitization, or data that were born digital) enables us to recover and challenge the ways in which our senses of time and place are historically and archaeologically understood, something that cannot be effectively communicated through traditional media.”

The usual “argument [is] that digital surrogates distance the scholar from the original sources. They do not. They give the scholar far greater control over the primary evidence, and therefore allow a previously unimaginable empowerment and democratization of source materials”. One great example of studying materiality with digital tools is Kathryn Rudy’s study of “dirty books” – using a densitometer to measure finger grease on pages of late medieval books of hours, revealing the reading habits of their readers, each unique and different from the others. And then there is multi-spectral analysis of palimpsests in order to read the erased text.

In the future, should we strive for haptic digital representations of manuscripts? Do we want to be able to feel the paper or parchment of a manuscript when viewing it on an iPad? I believe Alison Wiggins made a comment at the recent Digital Humanities Congress at Sheffield to the effect of, it is more useful for the scholar to know what kind of paper is used in a manuscript, than to have the feel of the paper recreated digitally. So perhaps haptic encoding would be more of a Turning-the-Pages –type show-off feature, than something that scholars would find useful. But I digress.

Arguably, then, the materiality of our sources does not get lost in the remediation from physical to digital format. But in any case, we are far more familiar with the visual and textual aspects of digital resources.

III. How using digital archives has changed the way we work and think

The first thing to note about digital archives is that they can be huge. SPOL contains digital images of some 2.2 million manuscripts. As they span 200 years, this comes to, on average, about 11,000 manuscripts per year. EEBO, while significantly smaller, now has 15 or 20 thousand books available as full text. And the thing about full text is that you can conduct word-searches on it.

Tim Hitchcock [5] has noted that EEBO, ECCO, and other similar resources “have in ten years essentially made redundant 300 years of carefully structured and controlled systems for the categorization and retrieval of information. In the process these developments have also had a profound impact on the way … scholars go about doing research. … it is now possible to perform keyword searches on billions of words of printed text – both literary and historical.”

But what is more, scholars “are expected to search across a large number of electronic sources” – but the process strips them of the opportunity to get to understand the context from which individual elements of information come. (The problem may be also seen to be imposed upon them: scholars – especially students – need to look at “everything” in order not to be considered lazy or neglectful).

And keyword searches make new findings very easy indeed.

Here’s one I did earlier: I looked up the word archive in the Oxford English Dictionary. Then I did a simple keyword search in EEBO, and managed to find an instance of usage of the word 70 years before the first instance recorded by the OED.

(..This is not as amazing as it may seem: in fact, antedating the OED is very easy! But that is what I just showed you.)

But less superficially – to quote Tim Hitchcock [6] again: Keyword searching of printed text “radically transforms the nature of what historians do … in two ways. First, it fundamentally undermines several versions of our claim to social authority and authenticity as interpreters of the past. … If historians speak for the archives, their role is largely finished, as the material they contain is newly liberated and endlessly replicated.” … “Second, the development of searchable electronic archives challenges historians to re-examine the broad meta-narratives which have developed to explain social change. If historians no longer ‘ventriloquize’ on behalf of the archival clerk, then they are free to rethink the nature of social change.” That is to say, if publishing archival findings becomes unnecessary since “everything is accessible online”, then we are free to try to say something bigger.

That, in any case, is the theory: but in practice we are burdened by the curse of Convenience.

Peter Shillingsburg [7] recently wrote: “I was once told that the likelihood that a scholar or student will check the accuracy of a supposed fact is in inverse proportion to the distance that has to be travelled to do the checking. If it can be checked without getting up, high likelihood; across the room, probably but maybe not; out the door across the campus to the library, only if highly motivated. Why? Convenience.”

We are all guilty of this convenience. We say that physical books are better than digital, but we are increasingly likely to prefer online sources.

The constant refrain is that “it’s so much easier to work with whatever is online, and it means you don’t have to travel to see things”.[8] This is particularly true of younger generations, who may only have ever encountered early modern books in EEBO. So we should not be surprised when “[t]hey stay at home and expect archives to work like Google”.[9] And we are also biased towards convenience in using these online sources – if something doesn’t work, we will not do it. We can’t be bothered to learn to use features we don’t know exist. So quite often we end up using EEBO as an online repository of books, without even making full use of its search capabilities.

However, convenience means that we are limited by these convenient sources: our research questions end up being constrained by the digital sources – and by what you can search for in them! Keyword searching, for one, falls on its face in front of Early Modern English spelling variation. And don’t get me started on the reliability and accuracy of the transcriptions in EEBO!
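To illustrate the spelling problem, here is a toy Python sketch. It searches a phrase from a letter of 1600 (Thomas Wilson to his wife, CP 83/47, quoted elsewhere on this blog) first by exact keyword and then after a crude normalisation. The normalisation rules (u/v and i/j/y treated as interchangeable, a final -e dropped) and the function names are my own assumptions for the example; this is not how EEBO or any other resource actually handles variant spellings.

```python
import re

# A phrase in original spelling from Thomas Wilson's letter of 1600 (CP 83/47).
TEXT = "As I was takinge my iorney into Italie in that rude vnkind contrye of savoye"

# Illustrative rules only: treat u/v and i/j/y as interchangeable letters.
LETTER_MAP = str.maketrans({"v": "u", "j": "i", "y": "i"})


def tokens(text):
    """Split a text into lower-case alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def normalise(word):
    """Map interchangeable letters to one form and drop a final -e."""
    word = word.lower().translate(LETTER_MAP)
    return word[:-1] if word.endswith("e") else word


def exact_hits(keyword, text):
    """Tokens identical to the keyword, as in a plain keyword search."""
    return [t for t in tokens(text) if t == keyword.lower()]


def normalised_hits(keyword, text):
    """Tokens whose normalised form equals the normalised keyword."""
    target = normalise(keyword)
    return [t for t in tokens(text) if normalise(t) == target]


for keyword in ("unkind", "taking", "journey", "country"):
    print(keyword, exact_hits(keyword, TEXT), normalised_hits(keyword, TEXT))
```

A modern-spelling keyword search finds none of these words. The crude normalisation recovers “vnkind” and “takinge”, but still misses “iorney” and “contrye”, which differ from their modern forms by more than a letter swap. Simple rules only get you so far.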

But there is a more serious problem with our convenient sources. Last week, at the meeting of the Consortium of European Research Libraries at the British Library, Tim Hitchcock [10] gave what he described on his blog as “a five minute rant”, in which he noted that most digitisation projects – such as EEBO, ECCO, Old Bailey Online, but also the papers of Darwin, Newton, and others – are certainly transformative, but ironically they consist of the Western canon: texts written by the dead, white, male, elite. So, while digitisation projects have produced masses of data – easily enough for sophisticated data-mining experiments – the problem is that this data is skewed.

Of course, the counter-argument is that in the humanities we are trained to be aware of the limitations of our sources. But we are also pressed for time and money, and going for the low-hanging fruit is only natural: we are designed for convenience. And in the process we often “forget” to approach our digital sources critically.

And when scholars and others from outside the humanities start to mine this data, for instance by using tools such as Google Ngrams, the results they produce are doubly skewed: first by a poor understanding of the data, and secondly by the limitations of the data itself. (This results in cases like ‘mining’ Google Ngrams for evidence of the history and development of English [11] – but in fact GBooks metadata (that the Ngrams tool uses) is atrocious, with modern editions frequently mis-tagged as historical texts, and thus the results presented in the Ngram viewer in fact contain, for instance, 3-grams (frequently occurring strings of 3 words) from the “1540s” including 3-grams such as “an edition of” and “in the Bodleian” – which most certainly do not occur in texts from the 1540s.)

This is familiar to us from the reporting of experiments in newspapers – all too often in the case of a social psychology experiment, where what has happened is that the researchers have only taken what is known as a “convenience sample” – i.e. asked their students. This is not necessarily good or representative, but it sure is convenient! All too often the subjects of study in psychological tests are WEIRD – Western, Educated, Industrialized, Rich and Democratic.[12] In biology, the same phenomenon is known as “taxonomic bias” – it is easier to decide to do research on big, cuddly mammals that are easy to find, than small beetles in the rainforest canopy. And in the case of biology, it is also, unfortunately, easier to get funding to do research on animals that seem more “important” to the layman.

(Another problematic issue relating to digital resources is that while they are used increasingly by scholars, they do not receive anything like the number of citations they should. Scholars will use EEBO to conduct their study, but then cite the original books – showing a preference for “the real thing” (in spite of their behaviour!).)

IV. The promissory nature of digital humanities and the permissive digital archive

I will wrap up my huge topic with a comment on the promissory nature of digital humanities, and the permissive nature of the digital archive.

Digital humanities is not a new discipline, but there remains a sense of newness and urgency. You might even call it millennialism – the revolution or paradigm shift is said to be “just around the corner”! But I would like to argue that in fact, we are there already. It is just a slow revolution, a revolution in small steps. When I started my studies, early modern English books could only be consulted in specialist collections, or as printed facsimiles. Students today have probably never even seen a printed facsimile – for them, the digital versions on EEBO are “Early Modern English books”.

Digital resources like EEBO are promissive in the sense that their scale and nature theoretically allow for entirely new research questions to be asked, thus paving the way for the promise of new and exciting results. The proliferation of digital resources and tools reflects this – there is a sense that if only we build enough of these things, we will figure out the meaning of it all.

This view has its critics. But as Stephen Ramsay has pointed out, “I can now search for the word “house” (maybe “domus”) in every work ever produced in Europe during the entire period in question (in seconds). To suggest that this is just the same old thing with new tools, or that scholarship based on corpora of a size unimaginable to any previous generation in history is just “a fascination with gadgets,” is to miss both the epochal nature of what’s afoot, and the ways in which technology and discourse are intertwined”.[13]

The most striking feature of the digital archive in terms of how it can be permissive, is the way in which these archives can be connected to each other, using and reusing data, adding user-created content, and functioning like a database as well as like an edition, thanks to sophisticated digital analytical tools. There are already projects that have some or all of these features – most of them are relatively small-scale, but that does not detract from their worth. I have to conclude by saying how sorry I am that I had not the time to show you some examples! Luckily the previous two papers gave you some excellent examples.

Thank you very much.

—————————

Postscript 13.11.2012

This paper was, in part, about the dangers of using digital resources uncritically. At the same time, I tried to look at some of the ways in which the existence of these resources has affected our research habits. But the following day, thinking over all the excellent papers presented at the conference, and conversations with people during the day, I realized that in fact, I was not convinced that digital resources presented a serious problem, at least to this community of scholars. To be sure, almost everyone uses resources like EEBO – and many participate in the creation of other digitised or digital archives – but everyone makes use of them while being very conscious of their failings in comparison to the physical sources. Everyone is also aware of why we use them: because they greatly facilitate research (making it easier to do some ‘old’ kinds of research, and making it possible to look at new things); and because they are convenient. But convenience is not a bad thing when one has a good understanding of the compromises involved in creating the convenience. As long as we teach this to our students – which we demonstrably are indeed doing – the existence of these resources and tools is nothing less than a blessing.

I think, however, that we could all be more diligent in citing the digital sources we use – not only for scholarly integrity, but also in order to help raise the standing of and appreciation for digital resources. Those of us who create such resources well know how little credit we receive for our tasks, a matter particularly painful considering our output is linked to funding.

[1] Kenneth Price, “Edition, Project, Database, Archive, Thematic Research Collection: What’s in a Name?”. DHQ: Digital Humanities Quarterly Vol. 3 no. 3, 2009. http://digitalhumanities.org/dhq/vol/3/3/000053/000053.html

[2] Kate Theimer, “Archives in Context and as Context”. Journal of Digital Humanities Vol. 1 no. 2, 2012. http://journalofdigitalhumanities.org/1-2/archives-in-context-and-as-context-by-kate-theimer/.

[3] Nathan Jurgenson coined the term in “Digital duality versus augmented reality”, 24 Feb. 2011, on the Cyborgology blog on the Society Pages website. The above discussion is drawn from “How to kill digital dualism without erasing differences” of 16 Sep. 2012, and “Strong and mild digital dualism”, 29 Oct. 2012, on the same blog. http://thesocietypages.org/cyborgology/.

[4] Lorna Hughes, “Conclusion: Virtual Representation of the Past – New Research Methods, Tools and Communities of Practice”, p. 192. In The Virtual Representation of the Past, ed. by Mark Greengrass and Lorna Hughes. Ashgate, 2007.

[5] Tim Hitchcock, “Digital Searching and the Re-formulation of Historical Knowledge”, pp. 84-85. In Virtual Representation of the Past.

[6] Hitchcock, ibid. p. 89.

[7] Peter Shillingsburg, “How Literary Works Exist: Convenient Scholarly Editions”, paragraph 25. DHQ: Digital Humanities Quarterly Vol. 3 no. 3, 2009. http://digitalhumanities.org/dhq/vol/3/3/000054/000054.html.

[8] Emma Huber, “Using digitised text collections in research and learning”, talk given at the JISC-funded workshop “Optical Character Recognition (OCR) for the mass digitisation of textual materials: Improving Access to Text”, Bath on 24 Sep. 2009. http://www.slideshare.net/ekhuber/using-digitised-text-collections-in-research-and-learning.

[9] Brooks, Stephen. (@Stephen_Brooks_). “@RuthNRoberts @UkNatArchives #digitaltrail they stay at home and expect archives to work like Google.” 30 Aug 2012, 2:21 PM. Tweet. Part of the #digitaltrail discussion hosted by TNA on 30 Aug. 2012, http://blog.nationalarchives.gov.uk/blog/beyond-paper-the-digital-trail. Twitter conversation archived at http://storify.com/LauraCowdrey/beyond-paper-the-digital-trail.

[10] Tim Hitchcock, “A Five Minute Rant for the Consortium of European Research Libraries” (given on 31.10.2012 at the British Library), 29 Oct. 2012, Historyonics blog. http://historyonics.blogspot.co.uk/2012/10/a-five-minute-rant-for-consortium-of.html.

[11] The following example is from John Lavagnino, “Scholarship in the EEBO-TCP Age”, talk given at the conference Revolutionizing Early Modern Studies? The Early English Books Online Text Creation Partnership in 2012, Oxford, 17 September 2012. http://www.slideshare.net/jlavagnino/scholarship-in-the-eebotcp-age.

[12] Samuel Arbesman, “Big data: Mind the gaps”. IDEAS column in The Boston Globe, 30 Sep. 2012. http://www.bostonglobe.com/ideas/2012/09/29/big-data-mind-gaps/QClupxdwdPWHtRrZO0259O/story.html.

[13] From Patrik Svensson, “Envisioning the Digital Humanities”, DHQ: Digital Humanities Quarterly 6.1, 2012. http://digitalhumanities.org/dhq/vol/6/1/000112/000112.html.