No signal, just noise

One of the (oh too many) things I work on is code-switching* in historical texts. Or, more broadly, how multilingual environments are reflected in Early Modern English (merchants’) letter-writing. In particular I’ve done some work on the letters of early English East India Company merchants – some of it published – and then of course a bit more on the focus of my (never-ending) dissertation, the early letters of Richard Cocks. This year, I’ve joined my interest in historical code-switching to my interest in palaeography, and have given papers on if and how and why script and typeface are also switched when there is a code-switch in Early Modern English texts. In short, I have been looking at the historical development of why we still italicise words and passages in aliis linguis – and also at other practices of typographical flagging We Still Employ On A Daily Basis. This is all still work in progress, although I will write it up for publication in due course. But I would here like to share an observation gained from conferences and discussions on these and other topics this year.

Last June, there was an excellent symposium at the University of Tampere on historical code-switching. Really the first meeting of its kind, it was hugely enlightening to have three days of papers on code-switching in historical texts, covering over 1,500 years and, although focussing on English texts (OE, ME, EModE, LModE), also including some papers on texts produced elsewhere in Europe. Although all of it was fascinating and informative, I was naturally primarily interested in finding out about visual flagging of code-switching – whether by script-switching, as in the letters I presented on, or by other means. Only a couple of the papers actually focussed on visual aspects of code-switching, but visuality did crop up often enough to give me an idea of the range of the phenomenon and variation within it.

Overall, though, I was struck particularly by the realization that what we were conferencing on under the rubric of “historical code-switching” actually was/is a hugely diverse … er, thing. Not a practice, but a vast set of practices. Not a phenomenon, but a huge array of phenomena. One of the conclusions of the conference that came up in the closing open discussion was in fact that although most scholars working on historical code-switching have been applying methodologies developed for present-day conversational code-switching to historical texts, we have all been discovering how inadequate such conceptual models and practical approaches are for our purposes. (The same has been realized by scholars working on code-switching in present-day texts.) So developing models that are more suitable for the analysis of code-switching in (historical) texts is an important part of future work.

So, practices of code-switching in historical texts vary greatly depending on which period, region, languages, text types, genres, etc etc are involved – and much is down to scribal idiolects. That is to say, code-switching in the Early Modern English letters I work on is very different from code-switching in Late Modern English literary texts, or Early Modern English printed tracts, or practices in present-day northern India, or those in the Jewish community in medieval Cairo. And equally, the code- and script-switching practices of the writer I work on are completely different to those of some of his peers.

Simply put, there appears to be so much variation over time and space and text type, that it is difficult, at this stage, to see any patterns – except when restricting the study to a single place, time, text type, or indeed writer.

 

Okay. So what? How is the situation different from pretty much any other field?

Well, of course it isn’t. The primary difference to many other fields is the lack of data – historical code-switching is an emerging field and studies are still thin on the ground and cover disparate material. Give it another 20 years and the picture will be clearer. And actually, the fact that we know so little about any of this means that there is unexplored material aplenty, so it is ridiculously easy to come up with further topics and sources to study. Which is exciting!

But this is my point: the variation between texts (types, times, places) is so great as to render generalizations based on a single corpus void. Thus anyone making any general points about historical code-switching is, in my view, bound to be wrong.

And all of this applies equally strongly to script-switching, and also to material aspects of letter-writing: at the moment, we know next to nothing about either of these things.

Which brings me back to my work.

I guess what’s bugging me is the fact that, particularly in PhD work, you quite desperately want to be able to contribute to scholarship, and preferably with A Point: something that can be drawn out of your study and generalized; something that can be applied to other sources. Thus it’s eminently frustrating to cover new ground through painstaking attention to detail in your sources, only to end up with the realization that you have indeed made an important finding in itself, but all you can really say, based on This Material, is how This Material behaves.

 


* When I say “code-switching”, I use it in the broadest possible sense to mean any use of L2 in L1, including such things as quotations (which arguably require no competence in L2) and lexical borrowings (if cul-de-sac is not French, why do we keep italicising it?) as well as ‘real’ code-switching, be it inter- or intrasentential.

Did English spelling variation end in the 1630s?

1. Early Modern English spelling variation

Yesterday, rather late in the evening, I followed a link on Twitter:

This led to the great Early Modern Print : Text Mining Early Printed English website where there was an interface like the Google Books Ngram Viewer but for the EEBO-TCP corpus, called EEBO Spelling Browser (or more technically, EEBO-TCP Ngram Browser). With the delight of a researcher falling upon a new toy I started to play with it – but hadn’t even started when I was struck by the figure that is displayed when you navigate to the EEBO-TCP Ngram Browser page. It looks like this:

The idea of an ngram viewer – as per Google – is to look at the frequency of occurrences over time, of a word (a 1-gram) or a phrase (2-, 3-, 4- … N-gram). Frequency here means the proportion of the search phrase to all the words in the corpus, plotted over time. So for instance, the frequency of the word “war” rises during wartime, and falls in peacetime. But things get much more interesting when you look at less obvious things.
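To make the definition of frequency concrete, here is a minimal sketch in Python of how a per-year relative frequency could be computed. It is not how the EEBO-TCP Ngram Browser or Google's viewer is implemented; the function name `relative_frequency` and the two-year toy corpus are assumptions made up purely for illustration.

```python
from collections import Counter

def relative_frequency(texts_by_year, word):
    """Return {year: share of all tokens in that year that equal `word`}."""
    freqs = {}
    for year, tokens in sorted(texts_by_year.items()):
        counts = Counter(t.lower() for t in tokens)
        total = sum(counts.values())
        # Proportion of the search word to all words for that year
        freqs[year] = counts[word.lower()] / total if total else 0.0
    return freqs

# Toy data, invented for illustration only (not taken from EEBO-TCP)
toy_corpus = {
    1620: "the heauens aboue and the earth below".split(),
    1640: "the heavens above and the earth below above all".split(),
}
freq_above = relative_frequency(toy_corpus, "above")
```

Plotting such values year by year for several spellings of the same word gives the kind of comparison the browser offers.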

Anyway. The point of the EEBO-TCP spelling variant ngram viewer is to compare the change and development of spelling variants over time: for instance, plotting “spell” against “spelle”, “spel”, etc:

English spelling only became standardized in the 18th century, and anyone who wants to read earlier texts has to learn to deal with the fact that apparently all spellings were equally acceptable, and that writers haphazardly used the first spelling that came to their mind* – one of the most famous (or notorious) examples being how William Shakespeare signed his name in six different ways. Despite eventual standardization, spelling variation in English has not completely disappeared today, for although varying how you spell your name today sounds outrageous and unthinkable, all students of English as a foreign language have to learn that there are British and American spellings for many familiar words: colour and color, standardize and standardise, etc.

2. What the hell happened in 1625?

But to return to the EEBO-TCP Spelling Browser, what struck me was the dramatic change in the 1630s. If you look back to the first figure above, you can see that of two spellings of the word above, the spelling “aboue” is essentially the given form until 1625, when it rapidly loses to the alternative spelling “above”, which is firmly established by about 1640.

..Hang on, what? The centuries-old practice of not differentiating between the graphemes <u> and <v> according to the phonemes they indicate – /u/ and /v/ – is replaced, over the stunningly short period of 15 years – across the board (!?) in printed texts by consistent mapping of <u> to /u/ and <v> to /v/..!?

Sooo many questions.

My very first thought was that it must be an artefact of the dataset. One word, of course, hardly tells the whole story. Did this change hold for other words that show u/v variation? What about i/j variation? Or perhaps the EEBO-TCP material was somehow skewed?

But however much I fiddled with the browser, the period between 1620 and 1640 remained the significant factor. And it also applied to i/j-words:

But I did also check EEBO proper – knowing that the results may well be different from those of the EEBO-TCP Ngram Browser. However, it turned out once again that the Browser had been right:

So what on earth happened in the 1620s and 1630s to explain this dramatic shift into standardized spelling?

…actually, I don’t know. Googling revealed that this is a known phenomenon – although so far I have not found a definitive study of it, nor a good explanation. (Clearly it has something to do with what’s going on in printing houses.) But for instance in her article in the Cambridge History of the English Language vol 3 (2000), Vivian Salmon discusses historical variation in using <u> and <v> to indicate both /u/ and /v/, and then quite casually mentions how “the distinction was made in the 1630s” (p. 39). I think that a corpus-based study of this change remains to be done – although I could be wrong.

Yet rather than starting to look at this point in more detail, I pursued another question that had come to mind: how did this shift in orthographical practices manifest in non-printed texts, such as letters?

3. Non-printed texts and manuscripts

Happily, I am in a perfect position to ask this question, being part of the team who have compiled the Corpus of Early English Correspondence (CEEC). The CEEC is a corpus of English personal letters, spanning 1400-1800 and presently containing about 12,000 letters (5.2m words). It was designed for historical sociolinguistics – to apply modern sociolinguistic methods to historical texts.

Of course, there’s a caveat: CEEC is based on printed editions of letters. “Hang on”, you might say, “is a corpus built from such sources linguistically reliable? Shouldn’t the corpus have been compiled from manuscript texts?” Well, yes – but we have been careful not to use editions that modernize the letter texts, or editions that normalize the texts extensively. For the kinds of linguistic queries that the corpus was designed for, the normalization of features such as u/v variation was deemed acceptable. And we have always been careful to stress that the CEEC is not suitable for studying English orthography.

Anyway, I nonetheless rushed right in to see what the CEEC threw up. Not having fancy tools (like DICER) to reveal the proper extent of variant spellings in the corpus, I used a short list of sample words (euer/ever, ouer/over, aboue/above, vp/up). But the results were underwhelming:
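A search of this kind can be sketched in a few lines of Python. This is only an illustration of the idea, not the actual queries run on the CEEC: the function name `count_variants` and the sample sentence are invented, and only the four word pairs come from the list above.

```python
import re
from collections import Counter

# Old spelling -> new spelling, for the sample words named in the text
VARIANT_PAIRS = {"euer": "ever", "ouer": "over", "aboue": "above", "vp": "up"}

def count_variants(text, pairs=VARIANT_PAIRS):
    """Count whole-word tokens of the old and new spellings in `text`."""
    tokens = re.findall(r"[a-z]+", text.lower())
    counts = Counter(tokens)
    old = sum(counts[o] for o in pairs)
    new = sum(counts[n] for n in pairs.values())
    return old, new

# Invented sample sentence, for illustration only
sample = "I was neuer vp so late, nor euer rode over the hill aboue the towne."
old, new = count_variants(sample)
```

Because only whole words are matched, forms such as "neuer" are not counted; a fuller survey would need a longer word list or a tool built for spelling variation.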

In this figure, the ratio of the old form of u/v-spelling variants was far too low through the whole period – it should have been at least around 80%, if the EEBO data was indicative of English spelling practices overall, rather than just those restricted to printed texts.

In order to have better data, I spent some time extracting a subcorpus from the CEEC† consisting of texts only from editions of 17th-century letters in which I could find u/v and i/j-variation. This time, the results were more interesting:

Although the ratio of old spelling variants is still much lower than I had expected, in this figure there is a sharp decline from the 1640s on – which would be in accordance with a prescribed change. (For example, if all schoolchildren are taught to spell according to certain rules, it takes a while for the older generations of writers to die out (or change their spelling habits). Similarly, it makes sense that the influence of a standard orthography in printed texts would be reflected in manuscript texts with a slight time lag.)

Yet I remained unhappy with this data. In EEBO, the shift is from nearly 100% old form to 100% new form. Clearly the texts of the editions used for CEEC were normalized more than I had thought. Even given that this was a quick pilot study, the discrepancy was simply too large to accept as a difference between orthographical practices of manuscript and print.

I had one last trick up my sleeve: I did have a fairly good-sized corpus of letters from the first decade of the 1600s transcribed from manuscript, which retained original spellings and other orthographical features. It wouldn’t show me change over time, but it would give me a control figure for how much, exactly, letter-writers were using the old forms in their letters.‡

The result can be seen in the figure above – it is the red X, marking a whopping 87.9% old forms. Finally, something resembling the situation in EEBO.
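For anyone who wants to check the arithmetic, the share of old forms in each dataset follows directly from the token counts given in the footnotes (†, ‡). The sketch below simply redoes that division; the helper name `share_old` and the dataset labels are my own shorthand.

```python
def share_old(old, total):
    """Percentage of tokens that use the old (u/v-interchangeable) spelling."""
    return 100 * old / total

# (old-form tokens, all tokens), as reported in the footnotes
datasets = {
    "CEEC, first search": (700, 6657),
    "CEEC, 17th-century subcorpus": (887, 3613),
    "manuscript-based corpus": (2385, 2712),
}
shares = {name: round(share_old(o, t), 1) for name, (o, t) in datasets.items()}

for name, pct in shares.items():
    print(f"{name}: {pct}% old forms")
```

The manuscript-based figure is the 87.9% marked by the red X; the two CEEC figures are far lower, which is the discrepancy discussed above.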

There was a fair bit of variation between different words in the manuscript sources, and in some cases the new form was dominant:

               euer/ever   adu*/adv*   haue/have
old spelling          35          90        1017
new spelling          28         191          20
% old                56%         32%         98%

The greatest discrepancy between the manuscript sources and CEEC (namely the second extracted subcorpus) could be seen in the fact that in the manuscripts, words beginning /un-/ were spelled with a <v> 99.6% of the time (of 987 tokens), whereas in CEEC, the <v>-form occurred only 31% of the time (of 295 tokens). Even editions which claim to retain original spellings clearly cannot be taken at face value.

4. Summing up

So what can we say about that dramatic end to spelling variation in the 1630s seen in the figures from the EEBO-TCP Ngram Browser? Actually, not much.

1. It would appear that in the EEBO corpus, u/v variation became standardized between 1620 and 1640. However, without a comprehensive survey even this conclusion may be wrong – cf. for instance i/j variation in the proper name James, where it takes longer for the <i>-form to start declining, nor is it gone by the end of the century (this might have to do with capitalisation):

2. In manuscript texts, it looks like the spelling standardization process occurred 20 or more years after it took place in print. But without a broader survey, even this estimate may well be wrong.

3. The EEBO Spelling Browser is awesome!

I do remain curious about what happened in the 1620s & 30s. Particularly in whether the standardization of spelling was something more than a development in printing house practices. But I think I’ve done my share of midnight rabbit chasing for the moment.


* Students of Early Modern English beware: this is not true! There are methods in the apparent madness, although the rules may be subtle, and they do vary between writers.

† My first search of CEEC material was of c. 5,000 letters (2.2m words), finding 6,657 tokens of which 700 were old spellings. My second dataset consisted of c. 1,900 letters (just under 800k words), and 3,613 tokens of which 887 were old forms (types: euer/ever, ouer/over, aboue/above, vp/up, vs/us).

‡ This manuscript-based corpus contained about 200 letters (130k words). I expanded my sample word list (types: euer/ever, ouer/over, aboue/above, haue/have, giue/give, vp/up, vs/us, adu*/adv*, vn*/un*), extracting 2,712 tokens – of which 2,385 were old forms.

How should you cite a book viewed in EEBO?

Earlier today, there was a discussion on Twitter on citing Early Modern English books seen on EEBO. But 140 characters is not enough to get my view across, so here ’tis instead.

The question: how should you cite a book viewed on EEBO in your bibliography?

When it comes to digitized sources, many if not most of us probably instinctively cite the original source, rather than the digitized version. This makes sense – the digital version is a surrogate of the original work that we are really consulting. However, digitizing requires a lot of effort and investment, so when you view a book on EEBO but only cite the original work, you are not giving credit where it is due. After all, consider how easy it now is to access thousands of books held in distant repositories, simply by navigating to a website (although only if your institution has paid for access). This kind of facilitation of research should not be taken for granted.

(What’s more, digital scholarship is not yet getting the credit it deserves – and as a creator of digital resources myself, I feel quite strongly that this needs to change.)

Anyway; so how should you cite a work you’ve read in EEBO, then?

This is what the EEBO FAQ says (edited slightly; bold emphasis mine):

When citing material from EEBO, it is helpful to give the publication details of the original print source as well as those of the electronic version. You can view the original publication details of works in EEBO by clicking on the Full Record icon that appears on the Search Results, Document Image and Full Text page views, as well as on the list of Author’s Works.

Joseph Gibaldi’s MLA Handbook for Writers of Research Papers, 7th ed. (New York: The Modern Language Association of America, 2009), deals with citations of online sources in section 5.6, pp.181-93. For works on the web with print publication data, the MLA Handbook suggests that details of the print publication should be followed by (i) the title of the database or web site, (ii) the medium of publication consulted (i.e. ‘Web’), and (iii) the date of access (see 5.6.2.c, pp. 187-8).

… When including URLs in EEBO citations, use the blue Durable URL button that appears on each Document Image and Full Record display to generate a persistent URL for the particular page or record that you are referencing. It is not advisable to copy and paste URLs from the address bar of your browser as these will not be persistent.

Here is an example based on these guidelines:

  • Spenser, Edmund. The Faerie Qveene: Disposed into Twelue Books, Fashioning XII Morall Vertues. London, 1590. Early English Books Online. Web. 13 May 2003. <http://gateway.proquest.com.libproxy.helsinki.fi/openurl?ctx_ver=Z39.88-2003&res_id=xri:eebo&rft_val_fmt=&rft_id=xri:eebo:image:29269:1>.

If you are citing one of the keyed texts produced by the Text Creation Partnership (TCP), the following format is recommended:

  • Spenser, Edmund. The Faerie Qveene: Disposed into Twelue Books, Fashioning XII Morall Vertues. London, 1590. Text Creation Partnership digital edition. Early English Books Online. Web. 13 October 2010. <http://gateway.proquest.com.libproxy.helsinki.fi/openurl?ctx_ver=Z39.88-2003&res_id=xri:eebo&rft_val_fmt=&rft_id=xri:eebo:image:29269:1>.

Here’s why I think this is a ridiculous way to cite a book viewed on EEBO:

  1. Outrageous URL. Bibliographies should be readable by humans: the above URL is illegible. Further, while the URL may indeed be persistent, no-one outside the University of Helsinki network can check the validity of this particular URL. And to quote Peter Shillingsburg on giving web addresses in your references, “All these sites are more reliably found by a web search engine than by URLs mouldering in a footnote”. If you wanted to find this resource, you’d use a web search engine and look for “Spenser Faerie Queen EEBO”. Or go directly to EEBO and search there – in any case, you wouldn’t ever use this URL.
  2. Redundant information. Both “Early English Books Online” and “Web”? Don’t be silly.
  3. Access date. If the digital resource you are accessing is stable, there’s no need for this. If it’s a newspaper or a blog, dating is necessary (especially if the contents of the target are likely to change). In the case of resources such as the Oxford English Dictionary – which, though largely stable, undergoes constant updates – each article (headword entry) is marked with which edition of the dictionary it belongs to, which information is enough (and which explains notations like OED2 and OED3, for 2nd and 3rd ed. entries, respectively).

Instead, I suggest and recommend a citation format something like the following:

  • Spenser, Edmund. The Faerie Qveene: Disposed into Twelue Books, Fashioning XII Morall Vertues. London, 1590. EEBO. Huntington Library.

With a separate entry in your bibliography for EEBO:

  • EEBO = Early English Books Online. Chadwyck-Healey. <http://eebo.chadwyck.com/home>.

And if you’ve used the TCP version, add “-TCP” to the book reference, and include a separate entry for the Text Creation Partnership (EEBO-TCP).
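If you keep your references in a script or database, the format I am suggesting is easy to generate. The following Python sketch is just one way to do it; the function name `format_eebo_citation` and its parameters are my own invention, not part of any citation tool.

```python
def format_eebo_citation(author, title, place, year, source_library, tcp=False):
    """Build a short EEBO citation: original imprint, resource, source library."""
    # Per the suggestion above, the TCP keyed text is flagged by adding "-TCP"
    resource = "EEBO-TCP" if tcp else "EEBO"
    return f"{author}. {title}. {place}, {year}. {resource}. {source_library}."

citation = format_eebo_citation(
    "Spenser, Edmund",
    "The Faerie Qveene: Disposed into Twelue Books, Fashioning XII Morall Vertues",
    "London",
    1590,
    "Huntington Library",
)
print(citation)
```

Passing `tcp=True` gives the same entry with "EEBO-TCP" in place of "EEBO"; the separate bibliography entries for EEBO and for the Text Creation Partnership still need to be added by hand.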

This makes for much shorter entries in your bibliography, and clears away pages of redundant clutter which doesn’t tell the reader anything.

Why cite the source library?

Book historians will tell you – at some length – that there is no such thing as an edition of a hand-printed book. No two books printed by hand are exactly identical (in the way that modern printed books are identical) – due to misprints and the like, but also because for instance the paper they are printed on will be different from one codex to another (since a printer’s paper stock came from many different paper mills). So two copies of an Early Modern book (the same work, the same ‘edition’) will always differ from each other – sometimes in significant ways.

For this reason, really we should cite books-as-artefacts rather than books-as-works. Happily, EEBO gives the source library of each book, and including that information is straightforward and simple enough.

Problems and questions – can you not cite EEBO?

Some of the books on EEBO are available as images digitized from different microfilm surrogates of the source book. That is, there is more than one microfilm of the same book. Technically, these surrogate images are different artefacts and we should really reference the microfilm too…  I see that this could be a problem, but have not come across an issue where citing the microfilm would have been relevant to the work I was doing.

Q: Which brings us to another important point: if you are only interested in the work, is it really necessary to cite the format, never mind the artefact?

A: Well, yes, for the reasons outlined above – and simply because it is good scholarly practice.

Q: What if you only use EEBO to double-check a page reference or the correct quotation of something you’d made a note of when you viewed (a different copy of) the work in a library?

A: Ah. Well, if you are feeling conscientious, maybe make a note that you’ve viewed the work in EEBO as well as a physical copy – say, use parentheses: “(EEBO. Huntington Library.)”.

Incidentally, since Early Modern books-as-artefacts differ from each other, technically we should always state in the bibliography which copy of the work we have seen. But I’m not sure anyone is quite that diligent – book historians perhaps excepted – and I can’t be bothered to check right now.

Q: Argh. Look, can’t we just go back to not quoting the work and not bother with all this?

A: No. Sorry.

However, I think we’ve drifted a bit far from our departure point.

All this serves to illustrate how citing Early Modern books – be it as physical copies, printed editions or facsimiles, or digital surrogates – is no simple matter. (And we haven’t discussed whether good practice should also include giving the ESTC number in order to identify the work…)  So no wonder no standard practice has emerged on how to cite a work seen on EEBO.

Yet in sum, if you consult books on EEBO, I strongly urge you to give credit to EEBO in your bibliography. 

 

ETA 27.2.2014 8am:

Another argument for why to make sure to cite EEBO is the rather huge matter of what, exactly, is EEBO, and how what it is affects scholarship. In the words of others:

Daniel Powell notes that:

[I]t seems important to realize that EEBO is quite prone to error, loss, and confusion–especially since it’s based on microfilm photographed in the 1930s-40s based on lists compiled in some cases the 1880s.

And Jacqueline Wernimont adds:

EEBO isn’t a catalogue of early modern books – it’s a catalogue of copies. More precisely, it is a repository of digital images of microfilms of single copies of books, and, if your institution subscribes to the Text Creation Partnership (TCP) phases one and/or two, text files that are outsourced transcriptions of microfilm images of single texts.

These points are particularly relevant if you treat EEBO as a library of early modern English works, but they apply equally when you access one or two books to check a reference. As Sarah Werner (among others) has shown us, digital facsimiles of (old) microfilms of early books can miss a lot of details that are clearly visible when viewing the physical books (like coloured ink). While in many cases the scans in EEBO are perfectly serviceable surrogates of the original printed book – black text on white paper tends to capture well in facsimile – the exceptions drive home the point that accessing a book as microfilm images is not the same as looking at a physical copy of the book.

This is not to say that all surrogates, and especially microfilms, are bad as such. In many cases it is the copy that survives whereas the original has been lost. And I have come across cases where the microfilm retains information that has been lost when the manuscripts have been cleaned by conservators and archivists some time after being microfilmed. (Pro tip for meticulous scholars: have a look at all the surrogates, even if you don’t need to!) Also, modern digital imaging is enabling us to read palimpsests and other messy texts with greater ease than before (or indeed at all).

In essence, then, you should make sure to cite EEBO when you use it – not only because of things you may miss due to problems with the images in EEBO, but also because digital resources enable us to do things which are simply impossible or would take forever when using physical copies.

 


Ok this was a long rant. But I hope this might be of use to someone!

Kindness is the child of money

Thomas Wilson (c. 1565-1629; ODNB link) – among other things, intelligencer, secretary to Sir Robert Cecil, MP, and Keeper of the State Papers at Whitehall – left quite an impressive paper trail of his life post-1600. Yet thus far I have only come across one letter from him to a family member, being CP 83/47 (in the Cecil Papers at Hatfield House), which is a letter from Wilson to his wife Margaret (née Meautys). The letter is dated the first of August 1600, and was written by Wilson on one of his tours of continental Europe, where he was engaged in gathering intelligence.

I found the following passage striking:

As I was takinge my iorney into Italie in that rude vnkind contrye of savoye , I was taken wth myne ordinary enemy the burninge fever, who charged me wth soe many fetters that I was not able to move one foote further, soe that all my companye and honorable frends having all stayed long for me wer forced at length to leaue me and I left desolate in the handes of such people in whom kindness is onely the chyld of monye and wherof god wott I hadd butt smale abondance the rest I leave to you to coniecture / god I thanke him it is past, I am nowe in better helth & plentye and proceed alonge on my voyage though solitarye yett wth more corage \& hope/ then euer, God hath not appoynted that I shal dye yett but lyue & doe better then myne enemies wish or my frends hope

Wilson was, er, plagued by tertian ague (malaria), which recurred throughout his life; it crops up in his letters several times over the years. I am not sure whether “kindness is only the child of money” is original – googling reveals nothing, but I suspect it may be from some Latin text, and perhaps can be found in some other form in English. (I checked the Helsinki Corpus (XML version) and the Corpus of Early English Correspondence, but couldn’t find it in either.)

Wilson begins the letter to his wife by apologising for not having written, writing:

I was loth to send you such ill newes as I sent them vntill it was past for that it wold haue encreased yor sorowe wherof I knowe you haue too much

..which is fair enough. But although he assures her that he is now perfectly recovered, he goes on to say that he will not be able to write for some time as he is heading into enemy territory in Italy – one of his objectives was to learn what the King of Spain is up to, and the Kingdom of Naples belonged to the Spanish crown at this time. And as if that was not enough, he concludes his letter:

out of sauoye wher the warres ar beginning the 1 of August 1600 / Thy most loving Tho: wilson

Hardly reassuring reading! Happily, he made it back safe and sound, and didn’t have to engage in too much Bond-esque action (although there are some letters where he ponders going all Jason Bourne on a fellow Englishman..).

This blog has been rather quiet for some time. I expect I won’t be updating for another several months still, as there is a thesis that needs finishing. I might put my July conference paper up here, provided I write one instead of just babbling. But we shall see.

On the numbering and foliation of the Cecil Papers

While discussing the provenance of the manuscripts in my PhD edition, delving into the histories of various collections and repositories, I ran somewhat off on a tangent when writing about the Cecil Papers. Turns out that the foliation in the Cecil Papers is problematic, and references to documents in the Cecil Papers can be obscure. The little pedant in me ended up producing the following rant text, which is a bit too off-topic for even my thesis; for which reason it is now published here. I hope someone, one day, finds it useful. (Hope springs eternal, etc).


Oddly, Perry’s (2010) explanation of the numbering of Cecil Papers documents is incorrect. She claims that Cecil Paper numbers are formed of the volume number and the number “on the first page of that particular document”. Perry further says that “each page has been through-numbered within the volume, irrespective of where a new individual document begins”, so that consecutive document numbers may have gaps, such as her example of CP 56/1 being three “pages” long, and followed by CP 56/4. Yet browsing through the Cecil Papers reveals that the reality is more complex.

If we take Perry to mean “folio” when she says “page”, she is essentially correct. For instance, bifoliums have been given successive folio numbers on their rectos. However, a page has only been given a folio number if there is text (or other markings) on the page. Therefore, while the bifolium CP 29/17 is foliated on both its rectos (which contain text), as 17 and 18 respectively, the following document, CP 29/19, is a bifolium without text on the second recto, and this second folio has not been assigned a folio number. The document following CP 29/19 is thus CP 29/20, and not CP 29/21 as it would be if the foliation followed Perry’s description.

Since a bifolium is the most typical document form (a sheet of paper folded in half), and bifoliums with the second recto blank are very common, this means that a substantial number of leaves in the Cecil Papers remain unfoliated. To complicate matters further, some of the Cecil Papers have been foliated incorrectly. For instance, CP 111/119/2 has presumably been mistakenly assigned the folio number 119 before the archivist noticed that he had already assigned 119 to the previous foliated recto, and had to correct it by adding the /2. It goes without saying that there are thus misfoliated bifolio documents with a blank second recto!

Finally, while the foliation allocating the CP numbers is done in a red ink or crayon, some of the Cecil Papers have also been foliated in pencil, including the blank rectos. For instance, CP 143/114, CP 143/115, CP 143/116 and CP 143/117 are all bifolios with blank second rectos. Their rectos, however, also carry pencilled foliation numbers in order, from “155” on CP 143/114_1r to “162” on CP 143/117_2r.

Top right corner of CP 143 f. 115r (CP 143/115)

Top right corner of CP 143 f. 115_1r (CP 143/115)

(Images from the Cecil Papers; this counts as fair use, I think.)

These images are of the top right corners of the rectos of the bifolium CP 143/115, being folios 115r and a blank unfoliated-in-red-ink page I am calling 115_1r. Note the pencilled foliation which I referred to above: unlike the red ink, it is consistent, foliating these successive rectos as 157 and 158.

While emended misfoliation and secondary folio numbers may not prove insurmountable obstacles, the scholar should nonetheless be aware that many document and folio references to the Cecil Papers are thus potentially ambiguous. For instance, CP 143/115v – or CP 143 f. 115v – can refer either to page 2 of the said document (f. 115_1v), or to the dorse of the document, being the cover of the letter (f. 115_2v).

Reference

Perry, Vicki. 2010. “Notes on the numbering of the Cecil Papers and the scope of the digital collection”. Cecil Papers. ProQuest and Hatfield House.

The Permissive Digital Archive

Samuli Kaislaniemi (University of Helsinki)

[This is the paper I gave at The Permissive Archive conference at UCL in London on 9 November 2012. This version includes sections that I skipped when giving the talk – these are indented in the text below. My apologies to those whose images I cribbed: I have linked to my sources, but will remove any and all borrowed images if asked.]

Let me start by saying how happy I am to be here. I don’t think I am the only one at this conference whose life has been positively changed by CELL. And I can’t think of any other academic institution that manages to host conferences that feel like parties!

0. Introduction

The digitisation revolution – for it is a revolution – has changed the way we do historical research. This applies equally to archaeologists and historical linguists, literary scholars and historians: anyone working on the past cannot but be affected by new digital tools and resources. They bring their own share of new challenges – many of which turn out to be old challenges. And they also promise – or seem to promise – to deliver new and exciting results.

I. Terminology: What is a digital archive?

What is a digital archive? The previous two presentations both talked about digital archives, but the term was not defined – so there seems to be a general understanding of what we mean by this term. Kenneth Price[1] has tried to tease out the nuances between different terms used for essentially similar digital resources, but discovered that distinctions are blurred. An Electronic Edition, according to Price, can mean almost anything. They certainly are not restricted to being digital versions of print editions. A digital project, on the other hand, is even more amorphous – but the word “project” has a sense of time, in that projects have a beginning and an end. Projects are either unfinished, or finished. In comparison, a database is usable from the moment it is set up. The term “database”, however, carries connotations of a technical nature – we think of relational databases – but when it is used as a word to describe a digital historical resource, it should be taken metaphorically. “In a digital environment”, says Price, “archive has gradually come to mean a purposeful collection of surrogates.” This is exactly what is more adequately implied by his last term, thematic research collection – and arguably, most digital resources are exactly this. But it doesn’t exactly roll off the tip of your tongue..

I’m afraid a discussion of what is an archive did not fit into this paper in the end, but to give you an idea, here is what archivist Kate Theimer[2] had to say about digital “archives”..

In other words, a digital “archive” is not an archive, but a collection. In contrast, here is Price’s comment again:

I think the use of the word archive is justifiable, since for the scholar, a repository is a repository: the details may differ from place to place, but any place you go to for access to original sources is, in essence, an archive.

Given this loose definition, “digital archives” include not only large-scale resources such as EEBO and State Papers Online, but also smaller resources such as the digital editions made here at CELL. And more importantly, I think one’s own personal research collection can be viewed as an archive. I work on archival materials, and my primary tool – after this laptop – is a digital camera. I have compiled a fairly large digital collection, having photographed almost a thousand manuscripts. These will never get published as a collection, of course, but they do form, in essence, my primary archive, which contains surrogates of all the archival materials that I (think I) need.

What can be found in a digital archive? Digitised versions of original sources, of course, as well as metadata and all the other things Jenny Bann mentioned in her paper.

II. Digital dualism

We do not need to be constantly reminded that digitised books and manuscripts are not the same thing as looking at the original, material sources. However, this division into physical and electronic is not always useful, or even accurate.

Nathan Jurgenson[3] has coined the term digital dualism to refer to the false dichotomy between digital and physical worlds. (He actually differentiates between four “ideal” types of digital dualism, which you can see on the slide here – but which I don’t have time to go into.) Digital dualists are those who “believe that the digital world is ‘virtual’ and the physical world is ‘real’”. This is of course a familiar refrain to all of us, included in comments that disparage online communities in general, and the social web in particular. Facebook “is not real”, they say. But Jurgenson criticises the idea that time and energy spent in the digital world subtracts from the physical – he quotes Luciano Floridi: “we are probably the last generation to experience a clear difference between offline and online”. The digital and physical worlds may be ontologically separate, but they are both “real” in the sense of being authentic. That they have very different properties is of course true; but we live in both, and the two worlds interact. Reality, writes Jurgenson, “is always some simultaneous combination of materiality and the many different types of information, digital included.”

Jurgenson notes that “for the vast majority of writers, the relationship between the physical and digital looks like a big conceptual mess”. To remedy the situation, he provides a model of four ideal types of dualism, with “Strong Digital Dualism” at one end – which states that the physical and digital are different realities and do not interact – and “Strong Augmented Reality” at the other, which states that the realms are part of one single reality and have the same properties. Jurgenson himself takes a milder view, that of “Mild Augmented Reality” – same reality, different properties, interaction.

Lorna Hughes[4] has noted that digital tools and methodologies can well reveal more than traditional approaches: working “with a digital object (a surrogate created from a primary source that has been subject to a process of digitization, or data that were born digital) enables us to recover and challenge the ways in which our senses of time and place are historically and archaeologically understood, something that cannot be effectively communicated through traditional media.”

The usual “argument [is] that digital surrogates distance the scholar from the original sources. They do not. They give the scholar far greater control over the primary evidence, and therefore allow a previously unimaginable empowerment and democratization of source materials”. One great example of studying materiality with digital tools is Kathryn Rudy’s study of “dirty books” – using a densitometer to measure finger grease on pages of late medieval books of hours, revealing the reading habits of their readers, each unique and different from the others. And then there is multi-spectral analysis of palimpsests in order to read the erased text.

In the future, should we strive for haptic digital representations of manuscripts? Do we want to be able to feel the paper or parchment of a manuscript when viewing it on an iPad? I believe Alison Wiggins made a comment at the recent Digital Humanities Congress at Sheffield to the effect of, it is more useful for the scholar to know what kind of paper is used in a manuscript, than to have the feel of the paper recreated digitally. So perhaps haptic encoding would be more of a Turning-the-Pages –type show-off feature, than something that scholars would find useful. But I digress.

Arguably, then, the materiality of our sources does not get lost in the remediation from physical to digital format. But in any case, we are far more familiar with the visual and textual aspects of digital resources.

III. How using digital archives has changed the way we work and think

The first thing to note about digital archives is that they can be huge. SPOL contains digital images of some 2.2 million manuscripts. As they span 200 years, this comes to, on average, some 11,000 manuscripts per year. EEBO, while significantly smaller, now has 15 or 20 thousand books available as full text. And the thing about full text is that you can conduct word-searches on it.

Tim Hitchcock[5] has noted that EEBO, ECCO, and other similar resources “have in ten years essentially made redundant 300 years of carefully structured and controlled systems for the categorization and retrieval of information. In the process these developments have also had a profound impact on the way … scholars go about doing research. … it is now possible to perform keyword searches on billions of words of printed text – both literary and historical.”

But what is more, scholars “are expected to search across a large number of electronic sources” – but the process strips them of the opportunity to get to understand the context from which individual elements of information come. (The problem may be also seen to be imposed upon them: scholars – especially students – need to look at “everything” in order not to be considered lazy or neglectful).

And keyword searches make new findings very easy indeed.

Here’s one I did earlier: I looked up the word archive in the Oxford English Dictionary. Then I did a simple keyword search in EEBO, and managed to find an instance of usage of the word 70 years before the first instance recorded by the OED.

(..This is not as amazing as it may seem: in fact, antedating the OED is very easy! But that is what I just showed you.)

But less superficially – to quote Tim Hitchcock[6] again: Keyword searching of printed text “radically transforms the nature of what historians do … in two ways. First, it fundamentally undermines several versions of our claim to social authority and authenticity as interpreters of the past. … If historians speak for the archives, their role is largely finished, as the material they contain is newly liberated and endlessly replicated.” … “Second, the development of searchable electronic archives challenges historians to re-examine the broad meta-narratives which have developed to explain social change. If historians no longer ‘ventriloquize’ on behalf of the archival clerk, then they are free to rethink the nature of social change.” That is to say, if publishing archival findings becomes unnecessary since “everything is accessible online”, then we are free to try to say something bigger.

That, in any case, is the theory: but in practice we are burdened by the curse of Convenience.

Peter Shillingsburg[7] recently wrote: “I was once told that the likelihood that a scholar or student will check the accuracy of a supposed fact is in inverse proportion to the distance that has to be travelled to do the checking. If it can be checked without getting up, high likelihood; across the room, probably but maybe not; out the door across the campus to the library, only if highly motivated. Why? Convenience.”

We are all guilty of this convenience. We say that physical books are better than digital, but we are increasingly likely to prefer online sources.

The constant refrain is that “it’s so much easier to work with whatever is online, and it means you don’t have to travel to see things”.[8] This is particularly true of younger generations, who may only have ever encountered early modern books in EEBO. So we should not be surprised when “[t]hey stay at home and expect archives to work like Google”.[9] And we are also biased towards convenience in using these online sources – if something doesn’t work, we will not do it. We can’t be bothered to learn to use features we don’t know exist. So quite often we end up using EEBO as an online repository of books, without even making full use of its search capabilities.

However, convenience means that we are limited by these convenient sources: our research questions end up being constrained by the digital sources – and by what you can search for in them! Keyword searching, moreover, falls on its face when confronted with Early Modern English spelling variation. And don’t get me started on the reliability and accuracy of the transcriptions in EEBO!
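To give an idea of what spelling variation does to a keyword search, here is a minimal sketch that expands a modern spelling into a pattern tolerant of a few common Early Modern interchanges (u/v, i/j/y, vv for w, an optional final -e). The rule table and the function name are my own illustrative assumptions; the rules are deliberately incomplete and this is not how EEBO or any other resource actually handles variants.

```python
import re

# Illustrative and far from exhaustive: letters that commonly
# interchange in Early Modern English printed and manuscript texts.
VARIANTS = {
    "u": "[uv]", "v": "[uv]",
    "i": "[ijy]", "j": "[ij]", "y": "[iy]",
    "w": "(?:w|vv)",
}


def variant_pattern(word: str) -> "re.Pattern[str]":
    """Build a regex matching a modern spelling and some variants of it."""
    parts = [VARIANTS.get(ch, re.escape(ch)) for ch in word.lower()]
    # Allow an optional final -e, as in "keepe" for "keep".
    return re.compile(r"\b" + "".join(parts) + r"e?\b", re.IGNORECASE)


title = "A briefe instruction and maner hovv to keepe bookes of accompts"
sentence = "after she heard that her iourney was three hundred miles"

print(variant_pattern("how").findall(title))
print(variant_pattern("keep").findall(title))
print(variant_pattern("journey").findall(sentence))
```

A plain search for "how", "keep" or "journey" would find none of these forms. Even this sketch misses "bookes" for "books" and "maner" for "manner", which is rather the point.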

But there is a more serious problem with our convenient sources. Last week, at the meeting of the Consortium of European Research Libraries at the British Library, Tim Hitchcock[10] gave what he described on his blog as “a five minute rant”, in which he noted that most digitisation projects – such as EEBO, ECCO, Old Bailey Online, but also the papers of Darwin, Newton, and others – these projects are certainly transformative, but ironically they consist of the Western canon: texts written by the dead, white, male, elite. So, while digitisation projects have produced masses of data – well enough for sophisticated data-mining experiments – the problem is that this data is skewed.

Of course, the counter-argument is that in the humanities we are trained to be aware of the limitations of our sources. But we are also pressed for time and money, and going for the low-hanging fruit is only natural: we are designed for convenience. And in the process we often “forget” to approach our digital sources critically.

And when scholars and others from outside the humanities start to mine this data, for instance by using tools such as Google Ngrams, the results they produce are doubly skewed: first by a poor understanding of the data, and secondly by the limitations of the data itself. (This results in cases like ‘mining’ Google Ngrams for evidence of the history and development of English[11] – but in fact GBooks metadata (that the Ngrams tool uses) is atrocious, with modern editions frequently mis-tagged as historical texts, and thus the results presented in the Ngram viewer in fact contain, for instance, 3-grams (frequently occurring strings of 3 words) from the “1540s” such as “an edition of” and “in the Bodleian” – which most certainly do not occur in texts from the 1540s).
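For those unfamiliar with the term, a 3-gram count is nothing more exotic than the following sketch. The function name and the sample sentence are my own invention for illustration (the sentence is made up, not drawn from any corpus), and the real Ngram tool certainly does more than this.

```python
from collections import Counter
from typing import Tuple


def trigrams(text: str) -> "Counter[Tuple[str, str, str]]":
    """Count every run of three consecutive words in a text."""
    words = text.lower().split()
    return Counter(zip(words, words[1:], words[2:]))


# A made-up sentence of the kind found in a modern editor's preface.
sample = "an edition of the letters and an edition of the papers in the Bodleian"
counts = trigrams(sample)
print(counts[("an", "edition", "of")])
print(counts[("in", "the", "bodleian")])
```

The counting is blind to where the words come from: if an editor's preface is filed under the wrong decade, its phrases are counted for that decade.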

This is familiar to us from the reporting of experiments in newspapers – all too often in the case of a social psychology experiment, where what has happened is that the researchers have only taken what is known as a “convenience sample” – i.e. asked their students. This is not necessarily good or representative, but it sure is convenient! All too often the subjects of study in psychological tests are WEIRD – Western, Educated, Industrialized, Rich and Democratic.[12] In biology, the same phenomenon is known as “taxonomic bias” – it is easier to decide to do research on big, cuddly mammals that are easy to find, than small beetles in the rainforest canopy. And in the case of biology, it is also, unfortunately, easier to get funding to do research on animals that seem more “important” to the layman.

(Another problematic issue relating to digital resources is that while they are used increasingly by scholars, they do not receive anything like the number of citations they should. Scholars will use EEBO to conduct their study, but then cite the original books – showing a preference for “the real thing” (in spite of their behaviour!).)

IV. The promissory nature of digital humanities and the permissive digital archive

I will wrap up my huge topic with a comment on the promissory nature of digital humanities, and the permissive nature of the digital archive.

Digital humanities is not a new discipline, but there remains a sense of newness and urgency. You might even call it millennialism – the revolution or paradigm shift is said to be “just around the corner”! But I would like to argue that in fact, we are there already. It is just a slow revolution, a revolution in small steps. When I started my studies, early modern English books could only be consulted in specialist collections, or as printed facsimiles. Students today have probably never even seen a printed facsimile – for them, the digital versions on EEBO are “Early Modern English books”.

Digital resources like EEBO are promissive in the sense that their scale and nature theoretically allow for entirely new research questions to be asked, thus paving the way for the promise of new and exciting results. The proliferation of digital resources and tools reflects this – there is a sense that if only we build enough of these things, we will figure out the meaning of it all.

This view has its critics. But as Stephen Ramsay has pointed out, “I can now search for the word “house” (maybe “domus”) in every work ever produced in Europe during the entire period in question (in seconds). To suggest that this is just the same old thing with new tools, or that scholarship based on corpora of a size unimaginable to any previous generation in history is just “a fascination with gadgets,” is to miss both the epochal nature of what’s afoot, and the ways in which technology and discourse are intertwined”.[13]

The most striking feature of the digital archive in terms of how it can be permissive, is the way in which these archives can be connected to each other, using and reusing data, adding user-created content, and functioning like a database as well as like an edition, thanks to sophisticated digital analytical tools. There are already projects that have some or all of these features – most of them are relatively small-scale, but that does not detract from their worth. I have to conclude by saying how sorry I am that I had not the time to show you some examples! Luckily the previous two papers gave you some excellent examples.

Thank you very much.

—————————

Postscript 13.11.2012

This paper was, in part, about the dangers of using digital resources uncritically. At the same time, I tried to look at some of the ways in which the existence of these resources has affected our research habits. But the following day, thinking over all the excellent papers presented at the conference, and conversations with people during the day, I realized that in fact, I was not convinced that digital resources presented a serious problem, at least to this community of scholars. To be sure, almost everyone uses resources like EEBO – and many participate in the creation of other digitised or digital archives – but everyone makes use of them while being very conscious of their failings in comparison to the physical sources. Everyone is also aware of why we use them: because they greatly facilitate research (making it easier to do some ‘old’ kinds of research, and making it possible to look at new things); and because they are convenient. But convenience is not a bad thing when one has a good understanding of the compromises involved in creating the convenience. As long as we teach this to our students – which we demonstrably are indeed doing – the existence of these resources and tools is nothing less than a blessing.

I think, however, that we could all be more diligent in citing the digital sources we use – not only for scholarly integrity, but also in order to help raise the standing of and appreciation for digital resources. Those of us who create such resources well know how little credit we receive for our tasks, a matter particularly painful considering our output is linked to funding.


[1] Kenneth Price, “Edition, Project, Database, Archive, Thematic Research Collection: What’s in a Name?”. DHQ: Digital Humanities Quarterly Vol. 3 no. 3, 2009. http://digitalhumanities.org/dhq/vol/3/3/000053/000053.html

[2] Kate Theimer, “Archives in Context and as Context”. Journal of Digital Humanities Vol. 1 no. 2, 2012. http://journalofdigitalhumanities.org/1-2/archives-in-context-and-as-context-by-kate-theimer/.

[3] Nathan Jurgenson coined the term in “Digital duality versus augmented reality”, 24 Feb. 2011, on the Cyborgology blog on the Society Pages website. The above discussion is drawn from “How to kill digital dualism without erasing differences” of 16 Sep. 2012, and “Strong and mild digital dualism”, 29 Oct. 2012, on the same blog. http://thesocietypages.org/cyborgology/.

[4] Lorna Hughes, “Conclusion: Virtual Representation of the Past – New Research Methods, Tools and Communities of Practice”, p. 192. In The Virtual Representation of the Past, ed. by Mark Greengrass and Lorna Hughes. Ashgate, 2007.

[5] Tim Hitchcock, “Digital Searching and the Re-formulation of Historical Knowledge”, pp. 84-85. In Virtual Representation of the Past.

[6] Hitchcock, ibid. p. 89.

[7] Peter Shillingsburg, “How Literary Works Exist: Convenient Scholarly Editions”, paragraph 25. DHQ: Digital Humanities Quarterly Vol. 3 no. 3, 2009. http://digitalhumanities.org/dhq/vol/3/3/000054/000054.html.

[8] Emma Huber, “Using digitised text collections in research and learning”, talk given at the JISC-funded workshop “Optical Character Recognition (OCR) for the mass digitisation of textual materials: Improving Access to Text”, Bath on 24 Sep. 2009. http://www.slideshare.net/ekhuber/using-digitised-text-collections-in-research-and-learning.

[9] Brooks, Stephen. (@Stephen_Brooks_). “@RuthNRoberts @UkNatArchives #digitaltrail they stay at home and expect archives to work like Google.” 30 Aug 2012, 2:21 PM. Tweet. Part of the #digitaltrail discussion hosted by TNA on 30 Aug. 2012, http://blog.nationalarchives.gov.uk/blog/beyond-paper-the-digital-trail. Twitter conversation archived at http://storify.com/LauraCowdrey/beyond-paper-the-digital-trail.

[10] Tim Hitchcock, “A Five Minute Rant for the Consortium of European Research Libraries” (given on 31.10.2012 at the British Library), 29 Oct. 2012, Historyonics blog. http://historyonics.blogspot.co.uk/2012/10/a-five-minute-rant-for-consortium-of.html.

[11] The following example is from John Lavagnino, “Scholarship in the EEBO-TCP Age”, talk given at the conference Revolutionizing Early Modern Studies? The Early English Books Online Text Creation Partnership in 2012, Oxford, 17 September 2012. http://www.slideshare.net/jlavagnino/scholarship-in-the-eebotcp-age.

[12] Samuel Arbesman, “Big data: Mind the gaps”. IDEAS column in The Boston Globe, 30 Sep. 2012. http://www.bostonglobe.com/ideas/2012/09/29/big-data-mind-gaps/QClupxdwdPWHtRrZO0259O/story.html.

[13] From Patrik Svensson, “Envisioning the Digital Humanities”, DHQ: Digital Humanities Quarterly 6.1, 2012. http://digitalhumanities.org/dhq/vol/6/1/000112/000112.html.

A Brief Treatise of Arithmeticke (1588)

In looking for something completely different, I browsed through bits of John Mellis’s 1588 manual on bookkeeping, A briefe instruction and maner hovv to keepe bookes of accompts after the order of debitor and creditor & as well for proper accompts partible, &c. […] (London. STC 18794. EEBO. Huntington Library). It contains A Short and Plaine Treatise of Arithmeticke in whole numbers, comprised into a briefer method than hetherto hath bin published, from whence the following lovely little late Elizabethan arithmetick problem cometh:

p. 15

But now in manner of a recreation, as wel as for exercise, I propose one question more: as thus.

A Gentlewoman for a certayne trespasse committed, was enioyned by her Soueraigne a certaine penance, which was this: That in her owne person going a foote, and being accompanied with two of her honest seruants she should goe from Saint Dauids in Wales to Douer, which is accempted to bee the breadth of Englande, And at each Furlongs ende, being eight in a mile, she and her men should gather in a heape, great and small togeather, two hundred and fourty stones. Uppon which harde sentence geuen by her Soueraigne, after she heard that her iourney was three hundred miles, she tooke the matter heauilie, and humbly sought and craued tolleration herein. Which in fine vpon her humble suite, and the earnest request of other Ladies and Gentlewomen, was absolutely remitted, vpon a condition, which was this. That if the Gentlewoman there presently before her Soueraigne, without the ayde of any other, could of her owne pregnant capacitie, make an absolute resolution, and accompte how [p. 16] how many stones in all she ought to haue gathered, that thereupon she should be cleerely dismissed of this pennance.

The Gentlewoman glad of this, and hauing a little sight in Arithmeticke, called for penne, inke, and paper, and wrought as here appeareth, and hauing finished the worke, did giue vp her accompt thus, that shee shoulde haue gathered iust 576000. stones in all. Which was most true, and thereuppon shee was remitted and pardoned &c.

(This is followed by calculations, but I leave those, for I am sure my readers are perfectly able to calculate the number of stones and arrive at the same result.)
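For anyone who would rather check the Gentlewoman's accompt than redo it with penne, inke, and paper, the sum is a single multiplication. The variable names below are mine; the figures are those given in the problem itself.

```python
# Figures as stated in Mellis's problem.
miles = 300                # length of the journey
furlongs_per_mile = 8      # "at each Furlongs ende, being eight in a mile"
stones_per_furlong = 240   # "two hundred and fourty stones"

total_stones = miles * furlongs_per_mile * stones_per_furlong
print(total_stones)  # 576000
```

That is, 2,400 furlongs at 240 stones apiece, which agrees with her answer.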

The next question is, what’s the Elizabethan equivalent of “a train leaves London heading North at 100mph; a second train leaves York heading South at 75mph: where will they meet and at what time?”..?

rabbits and open veins

Hmm, coming across nice little peeks into Early Modern life today:

This inclosed for your lordship was sent me euen now by sir mychell hickes, with a message that it requyrd hast and withall came thes 4 rabitts which I send by this bearer a footman, not being willing to truble a messenger vnless I had knowne the busines to be of importance, I truble your lordship with noe other matter nowe bycause I wryght with the hand whose arme hath had a vayne opened but an howre since /

Ruttland howse 4 Sept 1607

Your lordships most bonden servant

Tho wilson

(Thomas Wilson to the Earl of Salisbury, CP 193/147)

More offerings of dead animals, this time for eating. But it’s the reference to blood-letting which is more interesting here. Wilson suffered from malaria (and possibly other recurring illnesses). I remain amazed at the Early Modern propensity for drawing blood – surely at least someone noticed that patients lost rather than gained strength when leeched? (Thus I reveal my ignorance of Early Modern medicine. I’m sure my colleagues down the corridor could enlighten me – but this is digression enough.)

Deer heads for Mr Secretary

This inclosed to your lordship is from francis Seagar seruant to the lantgraue of Hess [.]  he hath sent also to your lordship 2 deeres heades the one of a Rayne deere, the other of an Ealand a kynd of deer soe caled ther [.] the heades are heer att my chamber att somersett howse vntill I vnderstand your lordhip‘s pleasure for the tyme it please you to haue them brought to Cort to see them, they were sent to me by Garter king att Armes francis seagars brother, who wold haue attended your lordship with them him selfe but that he is sicke. It may please your lordship to lett me knowe wheI shall send bringe them vp, bycause the messenger that brought them desyres to be att the deliuery of them.

Somersett howse
9ber 30. 1605 .

your lordships most humble seruant
Tho: wilson.

***

(Thomas Wilson to the Earl of Salisbury, CP 113/60)

I should of course only post this blog with some commentary on the above, but it will have to suffice for me to say that I was amused by the idea of someone sending deers’ heads as a gift to Cecil. I am presuming they are mounted for display. Which raises the question, how long has taxidermy been around to decorate the smoking rooms of the rich and famous more rich..?