What were English East India Company merchants drinking in Japan?

A note on terminology, and an addendum (and correction) to my PhD thesis

1. How did I miss that?

Doing research, it’s easy to find yourself going down rabbit holes, chasing answers that seem to always elude your grasp. You do your best, but still have to resign yourself to unsatisfactory results.

In my case, my PhD thesis consisted of lexical studies – mostly of words borrowed from various languages into English, as found in the letters of English East India Company merchants residing in Japan and in Southeast Asia in the early 1600s. For two of the words – out of about 120 – I had to concede defeat in pinpointing their exact sources. I knew what they meant, as that was very clear from the context: it was just their etymology that I was unable to establish.

Fast-forward about 16 months from finishing the corrections to my PhD. I was revisiting my list of borrowed words for a talk (which I gave to the Helsinki Society for Historical Lexicography) when I stumbled upon a source that immediately gave me the etymology of one of the two unclear terms, and pointed me in the right direction to solve the other one.

At times like this, you have to ask yourself: is it pure chance that I found this source now, or was I sloppy when I conducted the original survey? I suppose the answer lies somewhere in between – after all, you are never able to find or see every relevant source; but at the same time, in searching for sources, the keywords you use can make all the difference between finding answers and finding none.

2. First word: “xxij barsos of singe”

The material I wrote my PhD on consists of letters written by English EIC merchants stationed in Japan, 1613–1623 (Farrington 1991). In their letters to each other, the English merchants frequently discuss food and drink. In one stretch of letters from 1620, they write about ordering “singe”. From the context, this appears to be a local alcoholic beverage, at a guess some type of sake or rice wine: it is ordered from “Ichemon Dono our wyne man”.

Now, “singe” is not immediately etymologically transparent. Phonologically, if read as something like [ʃɪnʤ], it could be Japanese (or Chinese) – but I could not find corresponding words in Present-Day Japanese dictionaries. An alternative that I suggested in my thesis is that the writer is punning on French vin du singe, “wine of ape”, referring to a state or type of inebriation. (The English merchants did like their puns and nicknames, so this is not as unlikely as it may sound).

In the end, however, I could find nothing corroborating either interpretation.

Until, that is, 16 months after submitting my PhD to the printer, when I stumbled upon (i.e. located via googling) an article titled ‘Introduction of Japanese sake by foreign visitors’ (Yoshida 1993). The writer had done something similar to what I was doing in the talk I was preparing, and looked at Japanese words recorded in the Vocabvlario da Lingoa de Iapam, a Japanese-Portuguese dictionary published in Japan by the Jesuits in 1603. He had searched through the dictionary for words relating to sake (and alcohol more generally), and in the article gives them in a list grouped thematically. Each line/entry has a headword occurring in the Vocabulario, and then a definition. The list starts with:

In English:

A. Types of sake
(1) shinju (xinxu, xinju)   new/fresh sake (ataraxij saqe).
(2) koshu (coxu)   old sake (furui saqe).

(The words in parentheses are from the Vocabulario: the dictionary uses <x> for /sh/, <j> for /j/ (word-final <j> after <i> is a spelling convention: there, <j> stands for /i/), and word-medial <q> and word-initial <c> for /k/).
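For anyone wanting to convert Vocabulario spellings in bulk, the conventions just described can be sketched in a few lines of Python. This is only a rough illustration: the function name `modernise` is hypothetical, and the rules cover just the correspondences mentioned above, not the dictionary's full orthography.

```python
import re

def modernise(word: str) -> str:
    """Convert a Vocabulario-style spelling to a rough modern romanisation.

    Assumption: only the conventions noted in this post are applied;
    other spellings in the dictionary would need further rules.
    """
    w = word.lower()
    w = re.sub(r"ij\b", "ii", w)  # word-final <j> after <i> stands for /i/
    w = w.replace("x", "sh")      # <x> stands for /sh/
    w = w.replace("q", "k")       # <q> stands for /k/
    w = w.replace("c", "k")       # <c> stands for /k/
    return w

# The spellings quoted in the list above
for spelling in ["xinxu", "xinju", "coxu", "ataraxij saqe", "furui saqe"]:
    print(spelling, "->", modernise(spelling))
```

The order of the rules matters slightly: the word-final <ij> is handled first, so that the later replacements do not interfere with it.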

The very first word in Yoshida’s list is ‘new sake’, which, as the Vocabulario records, has two pronunciations: shinshu and shinju.

Having thus found a contemporary European instance of the English merchants’ “singe”, together with its identification in Japanese, there is, I think, no need to hesitate in identifying “singe” as 新酒 shinju ‘new sake’.

3. Second word: “xxij barsos of singe”

The other word I couldn’t find the root of occurs in the same letters as “singe”: where “singe” is the liquid, “barsos” are the containers. It’s quite clear from the context that a “barso” is a small barrel – from other sources it’s possible to determine that a “barso” holds c.10 litres. But the etymology evaded me: I couldn’t find “barso” in any form in Present-Day Portuguese or Spanish dictionaries, and despite the apparent connection to other words meaning ‘cask’ such as barrel, but also barrillo, barillejo, and barrico, the lack of evidence made me put my hands up.

Once again, Yoshida (1993: 59) comes to the rescue. On the same page as discussed above, his list continues to section B, 酒屋, 酒造道具, 製造工程など ‘sake shop/brewery, sake production tools, manufacturing process etc’. And number 9 in this list is as follows:

In English:

(9) saka-oke (saca uoqe)   a container for sake, like an oke [barrel] or a taru [cask] (barça).

(The <u> in “saka uoqe” reflects historical pronunciation, a /u/ or /w/ before /o/).

To repeat what I said above, the words in parentheses are from the Vocabulario – in other words, the dictionary contains the very word I was looking for. But not as a headword, since it’s a Japanese-Portuguese dictionary. So I turned to its definition of “saca uoqe” – as well as for just “uoqe”, 桶 oke, and also 樽 taru:

The definition of “saca uoqe” does indeed include the word barça! The definition translates as ‘A type of vessel, like a tina or barça for wine’. Meanwhile, “Voqe” is defined as Balde, and Taru is defined as a ‘Piparote, or barça’.

Google translate gives ‘tub’ for tina, ‘bucket’ for balde, but doesn’t help with piparote. Happily, there are plenty of old dictionaries of Portuguese on Google Books, and some of them purposely contain obsolete and trade terms. One from 1871 (de Lacerda) has the following definitions:


PIPOTE, s. dimin. from Pipa, a small cask or vessel.

TINA, s. f. a tub, a wooden vessel. — de vinho, a vat for wine.

In fine, then: it seems quite evident that barça was a common Portuguese word for a (small) barrel or a cask.

(There seems to be overlap in the terms for making and keeping alcohol, and for transporting it. That is to say, I wouldn’t immediately think of buckets, tubs and vats as vessels for carrying liquids over longer distances, but on the other hand, I suppose it may simply be a matter of whether the vessel has a lid or not).

4. A third word, to illustrate the limitations of extant sources: “a cuple barrels skarbeare”

Among all the other words relating to drinking that the English merchants used which can be identified, there is one that appears to have avoided capture by lexicographers (to my knowledge, anyway). In this case, the beverage was brewed by the merchants themselves: “skarbeare” – that is, scar beer.

Although unrecorded by the OED, this appears to be another term for small beer, or beer low in alcohol (as opposed to strong beer). Scar in scar beer may be an abbreviated variant of scarce, for one merchant writes that

“I … have fownd out the use of making scarce ale, but I want good mault”. [emphasis added]

Thus, the scar in scar beer probably means ‘small’ – or, rather, ‘weak; thin’.

Is the word dialectal? The writer using the term scar beer in the EIC letters was from Staffordshire, and the (West) Midlands dialect is evident in his use of language (spelling and lexical choice). Similarly, “scarce ale” occurs in a letter by another merchant from the West Country, although from further south, Wiltshire. A search of the English Dialect Dictionary (EDD) finds scar(r), meaning “small quantity; a morsel; a particle”, recorded in Shetland and the Orkneys (EDD, s.v.). The Orkneys are far from the West Country, but neither is close to London (which in practice drove the development of standard English), and old terms tend to live on in the periphery.

But perhaps the term is common: scar beer does crop up in contemporary literature. Henry Peacham (born in Hertfordshire) uses it in his The worth of a peny (1641):

Comparing scar beer to “a kinde of pitifull small Beere, too bad to be drunk” suggests the term has a negative connotation (I take it that “brewed with broom” means that the beer is flavoured with broom, rather than hops). This negative connotation can be found in an earlier text, a play by Henry Glapthorne (from Cambridgeshire), “Albertus Wallenstein” (printed 1639):

Further surveys (more than quick googling) would undoubtedly find more instances, but I think this case is quite straightforward. We can conclude that scar beer is more or less the same as, if not indeed a synonym for, small beer.

5. Don’t stop now! What else were they drinking?

This blog post is already long enough, but I’ll close with two more tidbits about the drinking habits of the EIC merchants in Japan.

The Englishmen brewed their own (small) beer, as seen above. And they also made cider – which could at times be undrinkable:

“I would I had a little of your pery …, I mean of the first bruinge, w’ch I suppose is not yet sower. I have a littell sider heare, but it is so sharp as viniger and cuts my throat in drincking.”

Finally, the degree to which the EIC merchants enjoyed drinking is reflected in how much they wrote about various drinks in their letters. One proxy for this is the index in the source edition I have used (Farrington 1991). Here is its list of pages on which wine is mentioned:

I should add that this is not an exhaustive list, for wine is mentioned in the documents not only as “wine”, but also with many other terms, such as “singe” as seen above – and also words like “morofack” (morohaku 諸白, a fine sake).


EDD = The English Dialect Dictionary. 1896–1905. 6 vols. Ed. by Joseph Wright. London: Henry Frowde. Available on the Internet Archive. A digitized version (EDD Online) is available at eddonline-proj.uibk.ac.at.

Farrington, Anthony (ed.). 1991. The English Factory in Japan 1613–1623. London: British Library.

Glapthorne, Henry. 1639. “Albertus Wallenstein”. In: The Old English Drama: A Selection of Plays from the Old English Dramatists, Vol. 2 (London, 1825). Google Books. Harvard College Library. books.google.com/books?id=gGX2sAcXnrUC.

de Lacerda, José. 1871. A New Dictionary of the Portuguese and English Languages: Containing All the Vocables in Common Use, with a Selection of Terms Obsolescent Or Obsolete Connected with Polite Literature Technical Terms, Or Such as are in General Use in the Arts, Manufactures, and Sciences, in Naval and Military Language, in Law, Trade, and Commerce, &c., &c., &c. Lisbon. Google Books. University of California Library. books.google.com/books?id=NrNLAQAAMAAJ.

OED = Oxford English Dictionary Online. Oxford University Press. Subscription service. www.oed.com.

Peacham, Henry. 1641. The worth of a peny, or, A caution to keep money with the causes of the scarcity and misery of the want hereof in these hard and mercilesse times : as also how to save it in our diet, apparell, recreations, &c.: and also what honest courses men in want may take to live. London. EEBO. Huntington Library.

Vocabvlario da Lingoa de Iapam com adeclaração em Portugues, feito por algvns padres, e irmaõs da Companhia de Iesv / Nippo Jisho: Pari-bon 日葡辞書: パリ本 [‘A vocabulary of the Japanese language, with explanations in Portuguese, made by certain priests and brothers of the Company of Jesus / Japanese-Portuguese Dictionary: The Paris Copy’]. 1603. Nagasaki. Facsimile repr. with discussion by Harumichi Ishizuka 晴通石塚. 1976. Tokyo. Bensei 勉誠社. Google Books. Ohio State University Library. books.google.com/books?id=TFJAAQAAMAAJ.

Yoshida, Hajime 吉田元. 1993. 外国人による日本酒の紹介(I) ‘Introduction of Japanese sake by foreign visitors (1)’. 日本醸造協会誌 Journal of the Brewing Society of Japan 88(1): 56–61.

Letterlocking: How did you fold a letter in the early modern period and what did it mean?

First impressions are important. When I receive mail – physical items by post, that is – the size and shape of the envelope alone tell me something about the sender. A5-sized envelopes (well, C5-sized, but you know what I mean; ditto below) tend to be bills or notes from the bank; A6 and smaller are probably greeting cards and concentrate around public holidays and birthdays and the like; A4-sized envelopes are rarer, but can contain official papers as well as missives of condolence. There is cultural variation, of course, and the range of shapes and sizes of envelopes as well as their meanings vary between countries and continents.

Most people probably don’t stop to think about why we have a range of envelope shapes and sizes, although having to figure out which is appropriate for a specific purpose is probably a familiar task. Job application – A4; love-letter – a long and thin envelope, like an A5 folded lengthwise. But I’m not sure anyone today would be upset if they received mail in the “wrong” envelope – possibly puzzled, but not offended. (Having said that, it’s probably a safer bet to stick to instructions when posting job applications. The recipients might not take offense per se, but may well discard your application.)

Some modern envelope sizes

In the early modern period, envelopes in the modern sense did not exist. Instead, letters would be folded to form their own covers. This skill was taught as a matter of course as a part of other letter-writing skills, such as learning the right opening and closing formulas, and how to write superscriptions (addresses). Jana Dambrogio has coined the term letterlocking for the practices of folding, securing and sealing letters. At this stage, we still know next to nothing about the vast field that is letterlocking. We have only begun to chart the myriad ways in which letters were folded, secured and sealed. We know very little about change over time from Antiquity to the present day, or about regional variation. And we have only vague conceptions about all the meaning that different types of letterlocking conveyed across time and space.

This is incredibly exciting: so much unexplored territory!

Research on epistolary materiality has already shown that material features can reveal social codes and meanings (see esp. James Daybell’s 2012 book-length overview). This applies not only to what letters are physically made of and how they are folded, but also to what I call textual materiality, features like layout or mise-en-page, and also more subtle aspects such as script and hand. Layout, being the most immediately visible … er, visual non-linguistic aspect of the text of a letter, naturally attracted the attention of scholars first, and thanks to scholars such as Jonathan Gibson (1997) the concept of significant space is now widely known.

Significant space refers to politeness and deference expressed as space on the page of a letter. Very simply put, the width of the margins, and particularly the amount of space at the top and bottom of the letter – between the salutation at the top and the main chunk of text, and between the end of the text and the signature at the bottom – can indicate deference by the writer to the recipient (cf. the image below). Scholars who have discussed significant space have looked at the letters of the elite social ranks, which abound in the minutiae of negotiating social status. In the letters of the aristocracy, how much space one left at the top of the letter could translate into fawning, respect, arrogance, or downright insult. This topic is excellently explored by Giora Sternberg (2009), who, following French scholars, calls it epistolary ceremonial.

French letter from 1598 with clear use of significant space (TNA SP 94/6 f. 78r; photograph by author)

But to return to letterlocking.

What do we know about early modern letterlocking so far? Can we tell how recipients would have reacted to different ways in which letters were folded, secured and sealed?

Well, we have learned to recognize some of the more common types of letterlocking used in early modern England.

One of the most common varieties of letterlocking in the early modern period is usually called tuck-and-seal. This appears to be particularly frequent in personal correspondence, which makes sense as the folding requires little effort but the seal ensures security. In tuck-and-sealed letters, the letter is first folded (hiding the text) so that it forms an oblong shape, and then one end is ‘tucked’ into the other, and an adhesive – usually sealing wax – is applied over the seam and pressed with a signet seal; or then, as in the letter in the following images, the wax is placed between the layers of the tucked side, and the signet is pressed through the paper.

Tuck-and-seal letter: Sir Robert Cecil to Sir John Peyton, 1603? (Folger X.c.439; images from Folger Luna, © CC BY-SA)

Another, slightly more secure type of letterlocking is sometimes called slit-and-band. In this type, the letter is again folded into an oblong shape, but this is then folded over in two and the ends tied together by cutting a narrow slit through the entire end of the packet, and inserting a thin strip of paper through the slit and then securing it with sealing wax (note the short vertical slits near the edges of the paper in the following images). Instead of a band of paper, string was also commonly used to secure the packet.

Slit-and-band letter: Sir George Talbot to Bess of Hardwick, c.1575? (Folger ; images from Folger Luna, © CC BY-SA)

A third type of letterlock is usually seen as particularly intimate, and might be called (to coin a term) plait-and-floss. In this type, the letter is folded into a minute packet – possibly by plaiting (aka accordion folding) rather than folding the paper repeatedly over itself – and, as in slit-and-band, the ends of the resulting oblong are tied together, but this time using colourful floss or ribbon. The resulting packet was very small and could fit into a palm and easily be hidden in a sleeve, making it perfect for passing surreptitious messages – or love letters. Heather Wolfe (2012) has explored these kinds of letters in a fascinating article.

Pleated letter fastened with silk floss: Jane Skipwith to Lewes Bagot, 14 April c.1610 (Folger L.a.852; images from Folger Luna, © CC BY-SA)

Comparison of plait-and-floss packet with modern envelope sizes: early modern folded letters could be tiny!

I could go on for longer, but will finish with a fourth type of letterlocking, another one which has gained a name, and has been called a blank margin lock. This type of lock is in essence a slit-and-band where the paper band is still attached to the letter it is used to secure. When such a letter is sealed with an adhesive, the resulting packet is practically impossible to open without damaging the paper (hence the long hole in the following images), and is essentially as secure as you could make a letter in the early modern period. (Secure in the sense that it cannot be opened without evidence of having been tampered with. Security in letterlocking is linked to being able to see if received letters have been opened en route; obviously any letter can be forced open.)

Blank margin lock: Simeon Fox to Sir Robert Cecil, 13 March 1602 (TNA SP 101/81 ff. 348-349; photographs by author)

To date, the most systematic attempt to categorise types of letterlocking is being conducted by Jana Dambrogio and Daniel Starza Smith; Jana’s website lists 8 different categories (as I write this). You should also check out their YouTube channel for videos of how to fold, secure and seal these kinds of letters, and many others!

Being a member of a team of historical sociolinguists, I am by default interested in social variation. Seeing how different people locked letters in different circumstances – trying to understand early modern letterlocking practices and the meanings they carried – will require charting said practices across time and space, in order to identify any trends. Although recent years have seen large-scale digitized databases and catalogues of letters – from the commercial resource State Papers Online to the online catalogue Early Modern Letters Online – at present we lack editions, databases or catalogues which record such information.* A pioneering exception can be found in Bess of Hardwick’s Letters, an online edition of the surviving correspondence of the Countess of Shrewsbury (compiled by Alison Wiggins et al.), which includes information about letterlocking. This is fantastic, and a great start – although the corpus is fairly small at only 234 letters, spanning 1552-1607, and with letterlocking information on 194 letters.

A recently launched project will expand the scope of our understanding of early modern letterlocking practices tenfold. The Signed, Sealed & Undelivered project is investigating a wonderful resource: a chest of some 2,600 letters from 1689-1707, being in essence the dead letter repository of a postmaster from 300 years ago. These letters come from across Western Europe from a wide range of letter-writers, and their study will allow for a fantastic synchronic overview of letterlocking practices. The most exciting thing about the project, or rather about their material, is that 600 of the undelivered letters in the chest remain sealed and unopened. Check out their wonderful website for more.

Chest of undelivered letters (image copied from brienne.org, with apologies and thanks)

For my part – since obviously I am blogging about letterlocking because it is something I work on too – I have been working on the British State Papers from the early 1600s, and have started to see trends in the material. For instance, in the material I work on (mainly State Papers Foreign, Spain, c. 1600-1610), tuck-and-sealed letters are relatively uncommon, and most of the surviving letters have been sent as fairly large packets fastened at one end with a paper band or with string. This appears to carry similar meaning to other aspects of material respect mentioned above, such as significant space. That is to say, this type of letterlock appears to have been the expected form when writing in a (semi-)official capacity in early modern England – not unlike sending forms and documents in an A4-size envelope to your job centre today.

But at the moment, my findings – if you can call them that – are little more than impressionistic. As a (part-time) corpus linguist, I firmly believe in quantitative evidence, and am reluctant to identify trends unless I can see the numbers. But I mean to keep working on this and hope to publish in due course.

But let’s go back to the question I posed above: can we tell how people would have reacted to different ways in which letters they received were locked?

Next week, I will be attending the Epistolary Cultures conference at York, and a part of my paper touches on this very question. In the Cecil Papers, there survives a delightful sequence of letters between Sir Robert Cecil, Secretary of State for King James I/VI, and his teenage son William. In a letter dated 15 May [1607?] (CP 228/19), Cecil comments on his son’s developing letter-writing skills:

I haue also sent yow a peece of paper fowlded as gentlemen vse to write theire letters, where yours are lyke those that come out of a grammer schoole.

Explicit information or instructions regarding material practices of letter-writing in the early modern period are in fact quite rare. Passages revealing how contemporaries understood and interpreted said material practices are even rarer. Most of our information on letterlocking has to be reconstructed from surviving letters themselves, since passages like this one ultimately raise more questions than they answer. Having said that, I still think this is a great passage, and we can gather several points out of it:

1. letterlocking was taught in grammar school;
2. gentlemen folded their letters differently from what was taught in grammar school;
3. this fact is significant enough for Cecil to want to correct his teenage son in his letterlocking practices.

But there are several things that we cannot immediately infer:

1. how were children in grammar school taught to fold and secure their letters?
2. how did gentlemen fold their letters?
3. did William Cecil learn grammar-school letterlocking in grammar school, or somewhere else? and why did he use it at all in writing to his father?
4. and, to my mind most curiously, why did Robert Cecil enclose “a peece of paper fowlded as gentlemen vse to write theire letters” – instead of just folding the very letter in which he says this in the desired way?

For my answers to these questions, you’ll have to come to York next week! But I hope to write this study up for publication anon. My fingers itch for a broader quantitative survey, but we also need lots of case studies in order to get at the nuances of early modern letterlocking practices.

* One person who may have compiled a requisite database for a broad survey of letterlocking practices is Susan Whyman, who writes of having “systematically examined” numerous collections of letters for criteria including “paper, handwriting, spelling, outside address and title, stamps, docketing practices, franks, inside spacing and layout, margins, salutation, forms of address, closure and signature”, etc. (Whyman 1999: 3). Whether she has charted letterlocking as well is uncertain, as is whether this information will ever be made publicly available.


Daybell, James. 2012. The Material Letter in Early Modern England. London: Palgrave Macmillan.

Gibson, Jonathan. 1997. “Significant space in manuscript letters.” The Seventeenth Century 12(1): 1–9.

Sternberg, Giora. 2009. “Epistolary ceremonial: Corresponding status at the time of Louis XIV.” Past & Present 204: 33–88.

Whyman, Susan. 1999. “‘Paper visits’: The post-Restoration letter as seen through the Verney archive.” In Rebecca Earle (ed.), Epistolary Selves: Letters and Letter Writers 1600–1945. Aldershot: Ashgate Press, 15–36.

Wolfe, Heather. 2012. “‘Neatly sealed, with silk, and Spanish wax or otherwise’: The practice of letter-locking with silk floss in early modern England.” In S. P. Cerasano & Steven W. May (eds.), In the Prayse of Writing: Early Modern Manuscript Studies. Essays in Honour of Peter Beal. London: The British Library, 169–189.

An addendum on the history of the word “linguist” in the sense ‘interpreter’

One of my first publications was an article titled “Jurebassos and Linguists: The East India Company and Early Modern English words for ‘interpreter’” (abstract; full paper as a pdf). The article is a fairly straightforward and I admit rather light-weight investigation of the Early Modern English semantic field of ‘interpreter’, in which I note that instead of a single word (interpreter), there were several (interpreter, truchman, dragoman, linguist, jurebasso), the use of which depended upon, among other things, geographical and linguistic setting. (So that dragoman was used in the Arabic sphere of cultural and linguistic influence; and jurebasso where Malay was used as a lingua franca).

In any case, the article’s conclusions were in part a (good-natured) stab at the OED. Not because I want to detract from the worth of that lexicographical giant, but rather because antedating the OED is, at the end of the day, and as the OED stands at the moment, all too easy, and for those of us with an antiquarian-philological-lexicographical mindset, also quite good sport. And also, because in doing so I joined the ranks of previous scholars pointing out how the OED draws most of its evidence from literature (and a rather small canonical corpus at that), and that when you look outside that corpus of evidence, there are wonders awaiting the historical lexicographer. As my conclusion says quite plainly, the records of the early English East India Company are fantastic material for historical linguists (I continue pointing this out in everything I publish which draws on English East India Company material). (They’re fantastic sources for historians too, to be sure, but as far as I know, I still remain the only linguisticist-type to have used EIC materials).

But to move on to the point of this post:

The fate of one who engages in a game of one-upping with the OED is ultimately to be defeated at their own game.

That is to say, the reason why it’s easy to antedate the OED is that a substantial number of the entries still date from the first edition (1884-1928). A quick search in any old historical corpus will bring up antedatings to much of that material; and the same applies to the 1989 second edition which, although benefiting from the appearance of computers, still dates from long before EEBO-TCP, Google Books and other massive historical text resources.

Not so in the case of the third edition – begun in 2000, currently work in progress, and estimated to be completed by 2037 or so. In my article, I had of course used the OED entries to all the words for ‘interpreter’ I list above. Most of them came from the first edition of the OED, and linguist from the second. I concluded that whereas according to the OED, linguist wasn’t used in the sense ‘interpreter’ until 1711, in my material I found instances from a century earlier. However, if you now go to OED Online, you will see that linguist has since been updated to the third edition (September 2013). The entry now duly gives instances of linguist in the sense ‘interpreter’ from 1612 on. Overall too, the definitions given for linguist have been overhauled.

I initially thought to subtitle this blog post, “Or, how OED antedated my antedating of OED’s definition of linguist” – but actually, in my article I wrote that “[t]he first occurrence of linguist in the sense of ‘interpreter’ is from 1610”. Which is two years earlier than the OED’s current earliest attestation. And going back to my notes, I find that this 1610 attestation comes from Nicholas Downton’s journal of the EIC sixth voyage. Here’s the extract and the reference:

As soon as the fleet anchored, the Governor sent an Arab to inspect the ships, who, on the following day, boarded the Admiral to inquire who and what they were; at the same time, “Jno. Williams and Walter the trumpetter, linguists”, with others, were sent on shore with a present to the Governor

– written at Aden, early November 1610
(Markham 1877: 168; emphasis mine)

Source: “Journal of the Sixth Voyage, kept by Nicholas Downton, 1610–1613”. In The Voyages of Sir James Lancaster, Kt., to the East Indies: with abstracts of journals of voyages to the East Indies during the seventeenth century, preserved in the India Office: and the voyage of Captain John Knight (1606), to seek the North-west Passage. Ed. by Clements R. Markham. London: Hakluyt Society, 1877. pp. 151–227. Available on the Internet Archive.

Rather unfortunately, Markham edited Downton’s journal quite heavily, so that much of it consists of paraphrase, with the occasional direct quote retained for flavour – as in the excerpt above. Yet luckily for me, the word linguists occurs in one of these direct quotes.

To sum up, then.

The new third edition of the OED does a fantastic job in charting the meanings and attestations of words in the English language across time. In doing so, it puts to shame lightweight excursions into historical lexicography like my 2009 article, namely those which do not properly consider the implications of their findings. I feel I should have been able to draw some firmer conclusions from my data, and not hedged my final thoughts. And also, I guess I ought to have done a more thorough job in searching through sources and also in documenting my sources and search results.

At the same time, despite all the new tools, resources and text databases, much of historical lexicography rests on serendipity. I came across an attestation of linguist in the sense ‘interpreter’ dating from 1610; the OED editorial team didn’t. (Incidentally, I’m sure a day or two of further digging would uncover earlier attestations.)

Finally, this case makes me feel that humanities scholars should aim to publish the data they draw on – when this requirement is applicable, of course. For instance in the case of my 2009 article, most of the texts are indeed available online as full texts, but largely as OCR’d from variable quality scans of the source books, which bring their own inaccuracies and complications. So publishing a KWIC list of my word searches (with references) would have been useful in terms of reviews of my work and future work drawing on my initial endeavours.
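For what it’s worth, producing such a KWIC list does not take much. What follows is only a minimal sketch in Python, not the method I used in 2009: the function name `kwic`, the context width and the way references are paired with texts are all assumptions made for illustration, and the sample text is simply the Downton excerpt quoted above.

```python
import re

def kwic(sources, stem, width=30):
    """Return (reference, left context, hit, right context) for each word starting with stem.

    `sources` is a list of (reference, text) pairs; this layout is hypothetical.
    """
    pattern = re.compile(r"\b" + re.escape(stem) + r"\w*", re.IGNORECASE)
    rows = []
    for reference, text in sources:
        flat = " ".join(text.split())  # collapse line breaks and runs of spaces
        for m in pattern.finditer(flat):
            left = flat[max(0, m.start() - width):m.start()]
            right = flat[m.end():m.end() + width]
            rows.append((reference, left, m.group(0), right))
    return rows

# Sample text: the excerpt from Markham's edition quoted earlier in this post.
sources = [
    ("Markham 1877: 168",
     "“Jno. Williams and Walter the trumpetter, linguists”, with others, "
     "were sent on shore with a present to the Governor"),
]

rows = kwic(sources, "linguist")
for reference, left, hit, right in rows:
    # One line per hit: context, keyword, context, and where it came from.
    print(f"{left:>30} | {hit} | {right:<30} | {reference}")
```

Keeping the reference in every row is the point: each hit can then be checked against the printed source rather than against an OCR’d scan of variable quality.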

On marking language-switching in speech and writing

So what I have to say is too long to fit in a tweet or even a handful of tweets. I followed the link in this tweet –

…which led to this video by Daniel José Older, titled “Why We Don’t Italicize Spanish”, where he explains why language-switches in his books are not italicized (I’m assuming, anyway, that this applies to all languages, i.e. not only is Spanish not italicised, but neither is Italian, German, etc):

I have two thoughts/observations/comments:

1) Regarding speech.

Older argues that since we don’t flag language-switching in speech, it shouldn’t be flagged in writing.

Well, I disagree that languages are not marked when language-switching in speech. All languages have, apart from different vocabulary and grammar, also different prosody: they stress words differently, and have different intonation patterns. Thus when someone speaks a non-native language “with an accent”, this refers to the stress and intonation they use. So someone speaking English with a Finnish accent is using the stress and intonation patterns of Finnish while producing English syntax and vocabulary. Languages sound different.

Therefore while a speaker may not intend to emphasise either language when language-switching in speech, both languages are nonetheless flagged.

2) Regarding writing.

I find this very interesting. I work on the links between scripts/typefaces and languages in the Early Modern period, and am fascinated by aspects of such practices which have survived into present-day use. Such as the practice of italicising foreign words, phrases and passages. What people in general will have no idea about is that this practice has its roots in the very earliest writing practices where languages were marked by different scripts. The general idea is familiar to everyone from having seen texts written in different writing systems: for instance, I’m using the Roman alphabet to write this blog post, and you can immediately distinguish it from Japanese writing:


What is less well known is that it used to be common to use different scripts and typefaces to write, for instance, English and Spanish. That is to say, not different writing systems or even different alphabets – such as is the case with, say, Greek:

Αυτό δεν είναι γραμμένο με λατινικούς χαρακτήρες.

In the 16th century, Northern European vernaculars (German, Dutch, English, Swedish, etc) were usually written in a gothic cursive script, whereas Italian and Spanish used scripts based on italic characters. And the same applied in print:

Berlemont, dialogues in 6 languages, 1608 (STC (2nd ed.) / 1431.19A)

Anyway, long story short, this distinction of course disappeared, although gothic typefaces and scripts survived until very recently – for instance newspaper titles are often still found in blackletter. But the practice of switching script or typeface to indicate switching language was, in part, retained.

Of course, the story is far from this simple: at the same time, practices of textual emphasis developed. These included, for instance, colour, enlarged initials (capitalisation), underlining, quotation marks – and also script- and typeface-switching. Therefore whereas you might italicise a foreign word to flag it, and you might also italicise a word to emphasise it, these are in effect two different practices – even if, from a present-day perspective, they have for the most part merged. And today you could even see the italicisation of foreign words and passages as just one of many reasons you might want to emphasise text.

And this last point appears to me to be the one Older makes in his argument against italicising Spanish in his books: the italics make it look like the text is emphasised. Whereas this is not the case. And I take it he doesn’t want to give the impression that authorial emphasis is intended. It is just language-switching, that’s all.

This is great – people rarely voice their conceptions on these matters, and I find it fascinating to find out how people see methods of textual emphasis and their uses.

Anyway, I find these things incredibly interesting and bizarrely understudied. I have several publications on this coming out next year (gods willing), in which I hope to make my case and points better (and at more length) than I do here.

NB I purposely stayed away from discussing code-switching here. I don’t think it matters regarding the general points I make above.

Counting correspondence, listing letters

A number of years ago, I gave a talk on mapping correspondence – that is, about the ways in which you can plot letters and epistolary exchanges on a map. Perhaps the most important point arising from that talk, for me anyway, was the understanding that mapping correspondence was by no means a straightforward matter. What exactly do you map, when you map correspondence? The writers’ locations? Or that of both the writers and the recipients? Or the path of delivery of the letter? The duration of conveyance? The amount of correspondence? And in doing any or all of these, what use is the map?

Similar ponderings are behind this blog post – not about mapping, but about counting correspondence. What do you count, really, when you count letters? How can counting help? Are graphs useful?

These thoughts arose from reading Brenton Dickieson’s blog post, “A Statistical Look at C.S. Lewis’ Letter Writing”. Working from three published volumes of C.S. Lewis’s collected letters, Dickieson plotted the 3,274 letters on graphs, basically looking at the volume of letters Lewis wrote over time, and discussing contextual events that are reflected in the sheer numbers of Lewis’s letters. Here’s his graph of the number of letters over time (copied from his blog, with my apologies and thanks):

This graph is much as you’d expect for someone like Lewis, whose fame grew over time, bringing the inevitable mountain of letters with it: it shows overall growth over time.1 I’m sure you can immediately see that some of the peaks can be mapped to publications (the Narnia books started coming out in 1950), and other events (WWI in 1914-1918).

But hang on: what does this graph actually show? What does it count?

I think that when we look at a graph like this we tend to make a lot of assumptions. For instance, it is easy to take the above graph as depicting ‘the amount of letters written by Lewis during his lifetime’ – especially as the number of letters is so high. Dickieson actually titles the chart as “the number of letters we have from Lewis each year” – which you might call ‘the amount of letters which are extant today’. But what the chart in fact shows is a third figure, namely ‘the amount of letters published in this one edition’.  These are different things:

  1. the actual number of letters written by a writer during their lifetime;
  2. a subset of (1), being the number of letters which survive; and
  3. a subset of (2), being the number of letters which we (or the editors, rather) know about.

These are all cases of ‘all the letters of X’ – a common phrase in titles of editions is “the complete correspondence”. Of course, attaining a true count of (1) is practically impossible – do you have copies of all the emails you ever sent? Exactly. So editions of “the complete letters of X” tend to strive to be (2) and say that they are (2), while they are of course (3). In fairness, (3) can equal (2), but it is not uncommon for further letters to be found after the publication of volumes of “the complete letters”.

So if we look back to the graph of Lewis’s letters above, now with the understanding that it represents (3) and, possibly, (2), but that it does not show (1), a second question arises. Given that the graph shows a subset of (1), is this subset representative of the whole? More specifically:

  • Do the ups and downs of the graph reflect actual fluctuations in the number of letters written by Lewis, or just in the number of letters that survive?
  • Similarly, does the overall trend reflect the actual overall trend – that of (1)?

For instance, for much of the 1930s, only about 20 letters per year survive from Lewis. Did he really write fewer letters during this decade?

Given that the graph is based on more than three thousand letters, I think that the overall trend – an increasing number of letters over time – probably does reflect (1). But its minor fluctuations are more likely to reflect what has survived than what was originally there.

More commonly, editions of letters offer only a selection of the correspondence of a writer or a group of writers. In these cases, the points I have raised become even more significant.

As an example, let’s take the letters of J.R.R. Tolkien. As far as I know, only one volume of his letters has been published, being The Letters of J.R.R. Tolkien (Allen & Unwin, 1981) – although more letters have been published since in various books and articles.2

The Letters published 354 letters. A very quick search online found about the same number again in a list on the Tolkien Gateway site; if we exclude letters of uncertain date, this list gives us another 349 letters, for a total of 703. A far cry from Lewis’s 3,000+, and I would imagine that many more of Tolkien’s letters survive; but this is enough to plot in a chart to make my point:

In this chart, the blue columns show the number of letters per decade published in the Letters, and the red columns the number of further letters given in the list on the Tolkien Gateway website. Obviously, neither set of letters reflects (1), or even (2) or (3) as discussed above. But the point I want to make here regards overall trends teased from these counts. If we look at the blue columns, it would appear that the peak of Tolkien’s letter-writing activities was in the 1950s, being fairly even from the 1940s through the 1960s. But the red columns indicate that the peak was not until the 1960s, and not many letters date from the 1940s. So we can immediately see that neither the blue nor the red columns appear to be representative of (1), of the actual number of letters written by Tolkien.

So, to recap and summarize. The overall trend we can extract from data depends on the dataset. This is really quite obvious. What is harder to remember is that the constitution of the dataset can be something else than what is expected by the reader, and this can have serious implications on the interpretation and understanding of the data.

This discrepancy becomes especially relevant in situations when only a fraction of (1) survives. Which of course in the case of historical material is almost always. Unless we have an extremely carefully made estimate of a letter-writer’s full output, we need to be really careful when counting their letters and making inferences based on those numbers.

Here’s an example. The following chart shows the number of letters sent from England by Thomas Wilson, servant and secretary to Sir Robert Cecil, to the English merchant Richard Cocks in Bayonne, France.

In this chart, blue shows the number of letters that survive (N = 1), red the number of letters that are mentioned in other surviving sources, but which don’t survive (N = 29). Based on the surviving letters, there is no trend. Based on the number of letters reconstructed from intertextual references, there was a continuous correspondence over this whole period.
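To put a number on how little survives here, the two counts given above can be combined in a few lines of Python. This is just the arithmetic, sketched with variable names of my own choosing; note that the total is only the number of letters we know about, so it is a lower bound for the number Wilson actually wrote.

```python
surviving = 1        # letters that survive (blue in the chart)
attested_lost = 29   # letters mentioned in other sources but not extant (red)

# Letters we know existed: a lower bound on the real total, i.e. on (1).
known_total = surviving + attested_lost
survival_rate = surviving / known_total

print(f"{known_total} letters known, {survival_rate:.1%} of them surviving")
```

In other words, a count based on the surviving letter alone would miss all but about one thirtieth of the correspondence we can actually document.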

I hope I haven’t given the impression with this blog post that I’m somehow criticizing Dickieson’s exploration of Lewis’s letters. On the contrary, I found his blog post fascinating, and I have in the past made similar graphs when trying to make sense of correspondences (as the last graph shows). I just wanted to raise some quite basic questions regarding the assumptions we make when using quantitative methods to make sense of data that we usually explore and study qualitatively.

[This post didn’t quite go where I thought it would, but it’s too long to rewrite. I’m not sure it’s particularly interesting, either, but I hope to remedy that anon with a post about dates (the calendar, not the fruit) in Early Modern letters.]


1. This feels quite obvious if we think about general human lifespans, too: the longer you live, the more people you meet => the more communication events are likely to follow. And this reminds me of an article in Science (Malmgren et al, “On universality in human correspondence activity”, Science 325 (1696), 2009) in which, through some serious number-crunching, the authors discovered that i) the amount of letters a person writes increases over their lifespan, ii) letter-writing is a correspondence event (when you receive a letter, you are likely to write a reply), and iii) letter-writing times correlate with the hours the writer is awake. My summary here is probably partly wrong, and certainly rather dismissive, and I have no idea about the calculations involved which I expect are the real beef of the article, but there are two points to make from their article, both of which are relevant to my present discussion: (1) number-crunching doesn’t necessarily tell you anything new; and (2) you can only get out what you put in, aka. what, exactly, are you counting? (Actually there’s also a third: (3) the humanities and sciences are interested in different things, ask different questions, take into account different contexts, etc etc. But let’s not go there today).

2. And I have to take this opportunity to boast confess that I’ve edited one previously unpublished letter myself: see Alaric Hall & Samuli Kaislaniemi (2013), “‘You tempt me grievously to a mythological essay’: J. R. R. Tolkien’s correspondence with Arthur Ransome”, in Ex Philologia Lux: Essays in Honour of Leena Kahlas-Tarkka ed. by Jukka Tyrkkö, Olga Timofeeva & Maria Salenius. [Mémoires de la Société Néophilologique XC]. Helsinki: Société Néophilologique. pp. 261-280. Link to pdf.

Editing is Hell, and normalization is an illusion

As a procrastinatory excursion, here are some thoughts about editing historical texts. Rather than an insightful comment on editorial philosophy, the following stems from practical matters and contains nitty-gritty details, and is not written in conversation with other editors (sorry). I’m sure everything I say here has been said before, but repetitio etc.

1. Why normalization is an illusion

Years back, I found I had a problem. I was considering the expansion of abbreviated words, and one of the words in question was merchant. This word was very frequent in the text I was working on, and occurred in various forms, some of them abbreviated: “marchant”, “mrchnt”, etc. I thought it would be a straightforward matter to settle on an expanded form. I decided I would normalize according to most attested usage. That is to say, editorial expansions would reproduce the most frequent forms fully spelled out.

In other words, if for instance “marchant” was the most frequent unabbreviated spelling of merchant, then “mrchnt” would be expanded as “marchant”. (With editorial expansions indicated, e.g. “marchant”.)

My next step was to establish what was the authorial preferred spelling of the word merchant. This is what I found (word stem forms only):

mrchant* 1
mrchnt* 31
mrcht* 1
marchand* 1
marchant* 31
marchnt* 21
marshant* 1

Out of 87 hits of merchant*, only 33 were spelled out fully, i.e. about 62% (54 hits) were abbreviated forms. And the abbreviated form “mrchnt*” was as frequent as the fully spelled out form “marchant*”.
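These figures can be checked with a short Python sketch. The counts are the ones listed above; the split into fully spelled and abbreviated forms is my reading of the list (forms that keep the stem vowels count as spelled out), and the variable names are purely illustrative.

```python
# Stem forms of "merchant" and their frequencies, as listed above.
counts = {
    "mrchant*": 1, "mrchnt*": 31, "mrcht*": 1,
    "marchand*": 1, "marchant*": 31, "marchnt*": 21, "marshant*": 1,
}

# Assumption: these three are the fully spelled-out forms; the rest are abbreviated.
full_forms = {"marchand*", "marchant*", "marshant*"}

total = sum(counts.values())
full = sum(n for form, n in counts.items() if form in full_forms)
abbreviated = total - full
share = abbreviated / total

print(f"{total} hits: {full} spelled out, {abbreviated} abbreviated ({share:.0%})")
```

The share of abbreviated forms comes out a little over 60%, and the tie between “mrchnt*” and “marchant*” at 31 hits each is visible directly in the table of counts.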

This made me stop and think: if the fully-spelled-out form of a (stem of a) word is much rarer than abbreviated forms, can we really claim it represents authorial preference? If we chart the possible spellings of merchant as a sequence of graphs, we get the following:

Note that this image shows both the actual realized spellings in the texts, and also possible spellings which do not occur (such as “mrchand*”). (Note also that other possible contemporary spellings are not given, such as “mrchāt”, or indeed “merchant”).

I consequently completely abandoned the idea that we can ‘reconstruct’ authorial spelling, and also determined to make sure to explicitly indicate editorial intervention at all times. A spelling like “mrchnt” should never be expanded as “marchant”. Instead, modern forms should be used: the only correct expansion here is “merchant”.

Having said that, using modern spellings is of course not an option in dead language varieties, such as Middle English. And further, even in Early Modern English there are graphemic and orthographical features without counterpart in Present-Day English(es). What to do with obsolete suffixes, like in “hath”, or “didst”? And what about graphs which were already obsolete but replaced with other contemporary ones, such as <y> for thorn <þ>, as in “ye” ‘the’ and “yt” ‘that’? And what about graphs and special characters which were common in the period but which we no longer use?

2. Editing is Hell

Here is one example of how just such a special character – a brevigraph – can cause serious problems when deciding on best editorial practice.

The image below is from a document from 1599, written in a late Elizabethan cursive (aka English secretary hand), with a slightly old-fashioned ductus. It reads: “her matꝭ servyce” (or “ſervyce”, if we want to retain the long <s>1). The second word is an abbreviation of majesty’s.

The <e>-shaped graph in the abbreviated second word is what I usually call an -es-graph. Originally a medieval brevigraph2 for (usually word-final) “-is”, “-es”, and “-ys”, in Early Modern English cursive hands <ꝭ> is also used for (word-final) “-s”. Some sources claim it indicates a plural or possessive, but I have found it used in proper names, prepositions, verbs and adverbs too. (Possibly it is more frequent in one rather than another, but I expect that in its distribution scribal preference is more significant than part of speech).

So the question for today’s workshop is, what to do with the abbreviation “matꝭ”?

What indeed.

Depending on your editorial principles, I can come up with 23 different possible outcomes, based on choices regarding i) spelling, ii) contractions (abbreviations), iii) superscripts, and iv) brevigraphs (special characters).

The outcomes – and the choices – are shown in the massive table below. Let me help you read it. The top three rows indicate options when retaining original spelling. The blue section in the middle indicates choices and outcomes when normalizing spelling (since regardless of my personal opinion as stated above, normalization is a common editorial practice), and the bottom ten rows the same for modernized spelling.

The numbers show whether editorial intervention has been indicated or not. “1” indicates that the editorial intervention is marked (e.g. “majesty’s”), and “0” that it is unmarked (e.g. “majesty’s”). And “(1)” means that the feature is marked, but it is so in the original text (i.e. the superscript and the brevigraph).

Some of the features are not strictly speaking divisible – that is to say, the contraction is marked by the superscript, and hence you can’t show that you’ve changed one without doing the same for the other: expanding the contraction and lowering the superscript are in essence the same thing. Thus “(1)ss” means that contraction is marked since what has been done to superscripts is marked, and “(1)c” means that superscript is marked since what has been done to contraction is marked.

Some notes are in order:

* Arguably since the contraction (and superscript) is not marked, this does not qualify as a representation of the original spelling.

† When lowering of the superscript or expansion of the contraction are indicated, indication of the normalization or modernization of the brevigraph gets subsumed – when using italics to mark editorial interventions. If other methods are used – apostrophes, parentheses – it is possible to make this distinction, as seen in the rightmost column.

‡ Some of the possible outcomes are unlikely to be chosen by the editor, because editorial intervention can also make the word more difficult to parse, rather than less. So for instance the edited forms “mates” and “mates” seem to me undesirable outcomes – “mates” is scarcely better, whereas “ma’tes” (or “ma’tes”) at least indicates that the word is an abbreviated form.

3. Um, argh?

Well, yes exactly. And this is without going into the jungle of brackets and other symbols used by editors to distinguish between different kinds of things in the text, such as interlineal insertions, deletions, damage, etc etc (a good discussion of which, with clear examples, can be found in the appendix to Michael Hunter’s Editing Early Modern Texts (2007)).

And also, other cases – other graphs, other methods of abbreviation, other hands – produce different problems, so I’m certain that the above table does not suffice for all editorial problems.3

What can the editor do, then?

I think this question can be answered: The editor can do whatever the hell they please. What they should do, in any case, however, is to make sure that all editorial interventions in the text are visible and the original form is recoverable. How they do this is another matter, as is the extent of their meddling. But there should in any case be a chapter or document setting out very clearly the editorial principles and practices followed in the edition.


1)  Who would want to do that!, you exclaim. Well, I would, for one. Our knowledge of Early Modern English handwriting is ridiculously limited, and in particular quantifiable information is scarce. So you need geeks like me to count them long <s>s.

2)  The -es-graph in Unicode – <ꝭ> – may be okay for medieval texts, but it is quite unlike the <e>-shaped -es-graphs in Early Modern English secretary hands. For a type facsimile edition, I would need to find a better character. And indeed I have done so – as seen by the -es-graph used in the table (which I got from the Electronic Text Edition of Depositions 1560–1760, available on the CD accompanying Merja Kytö, Peter J. Grund & Terry Walker, Testifying to Language and Life in Early Modern England (Benjamins, 2011)).

3) Something to avoid are what might be called hybrid forms. For instance, combining a normalized expanded stem with the word-final brevigraph: “maiestꝭ ”. Or then expanding the contraction and modernizing the stem of the word, but normalizing the expansion of the brevigraph: “majestes”. Expanding and modernizing the contraction but leaving the superscript just looks silly: “majesty’s”.

quantity +/- quality

For a long time, I’ve felt that the pressure to produce MORE publications – more Things To Count, since the system as it is now uses quantitative methods to establish quality of academics – is doing everyone a disservice, with lots of half-formed publications seeing the light of day.* In this publish or(/and) perish world, I’ve been seeing it as a quantity VERSUS quality issue, and have felt that less might be more – a point that has been raised by many others, quite often citing Nobel laureates of yore who only ever published half a dozen articles. Clearly quantity is not an objective measure of academic worth.

However, there is another way of looking at this issue, one which is highly important to anyone writing for a living. To paraphrase various authors, this is the general system:

  1. write
  2. finish what you write
  3. send it out and get it published
  4. revise texts only if and when necessary
  5. repeat from step 1

Which makes well good sense: it all comes down to writing things, finishing them, and getting them out into the world. On a conceptual level, writing a blog post and writing a monograph are on a par. Yes, my interminable monograph is taking forever to get done, but I also have a draft for a post on this blog that dates to December 2012. Things can – and do! – end up hanging in limbo, unfinished for a slew of reasons, all of which come down to one: not writing.

But the other step in the system – where step 1 is To Write – is To Write LOTS.

To produce Quantity.

If you’re a writer trying to make a living, I can see you need to sell your writings in order to feed your kids. But if you’re an academic..? And in any case won’t the quality suffer?

Generally speaking, I’ve tended to equate academic writing with something that takes a long time to accomplish. It takes time to do the background reading, to do the research, to do the thinking required. (To do the lab experiments, to run the programs (and correct the bugs), to process the results…)  Perhaps that’s what I find infuriating about the present state of affairs: here’s this thing which takes A Long Time To Do (never mind Do Well), and we are required – Demanded – to do more of it in less time!

But things can also get written very rapidly.

Now, I’ve written articles that took years to produce. But I’ve also written things at speed: there are a couple pages in one of the things I have published that I well remember writing – longhand (!) – in one sitting, in a pub in London.† And last fall I put together an article within a week – admittedly based on a presentation I gave earlier last year, but my presentations are not article-text read out loud. We all have moments when words just flow out, and it feels like you are channeling something or someone, rather than producing new text. But I can’t help feeling this is rarer in academia than in (some) other fields (I suspect this is something that gets easier the further along your academic career you get, i.e. experience helps, even more so than with many if not most other genres).

..okay, so if I’m willing to concede, after all, that academic writing can be fairly rapid, what am I struggling over?

Maybe it’s the business with killing your darlings. Just now, I read a short but great post on Tumblr. Here’s an excerpt:

Pottery, particularly wheel-throwing, is wonderful for this, incidentally. You fail over and over and you fail fast and you are creating quantity to lead to quality. You throw and throw and throw and things die on the wheel and things die when you take them off the wheel and things explode in the kiln and after you have made a dozen or two dozen or a thousand, none of them are precious any more. There is always more clay.

..but it’s only a page long, so go read it on Squash Tea. I can wait.

Done? Groovy.

Here’s the bit that particularly got me:

after you have made a dozen or two dozen or a thousand, none of them are precious any more

Now, I am apparently able to pull together a blog post like this one in an hour or two, and even write one based on more extensive impromptu (not to mention ad hoc) research over an evening. Am I, then, just being overly precious about my ‘real’ academic publications? (Or even about ‘real’ academic publications overall..?)

To put it another way, perhaps my problem with the perceived Quality vs Quantity issue is just misguided?

Given that the system is what it is, churning out pots and hoping most of them will be serviceable and at least some also beautiful, and also not forgetting to smash the ones that are downright bad, does not appear to be at all a bad approach to academic writing. It’s easy to get stuck on polishing texts – awareness of the impossibility of perfection notwithstanding – but it’s equally difficult to see, years later, what exactly were the faults that you so much wanted to redress.


Anyway, time to stop this rambling. I am aware that I have touched upon a slew of other points related to academia that are worth addressing (and re-addressing), but I will avoid all of them for now. This was supposed to be a short note..

But just one point as a coda. Not all academic writing is done for publications. In fact, probably the vast majority of it is produced when planning and preparing teaching and writing lectures, but also in putting together talks and conference presentations and guest lectures. And most of these are one-off shows. In the past, when I’ve attended dance, theatre or music performances, I’ve wondered about the ratio of preparation versus performance in the arts. Choreographies will be practiced for weeks, scripts rehearsed and polished and rewritten up until curtain up, and bands spend hundreds of hours playing together in preparation for ten performances, or only five, or even just one single performance – to an audience of a thousand, a hundred, or just half a dozen people. But then it struck me that this is what we do as academics: I will spend days putting together a conference paper, and then give it to a roomful of scholars in 20 minutes – and that’s it. Potentially I will write it up for publication later, but this is by no means always the case (although making it so is a worthwhile habit to create).

I guess my point is that these, really, are our pots and dishes: we churn them out by the dozen, and they do include many duds. Sometimes you work for weeks but only on actually presenting it do you see why your paper doesn’t work. But most of the time, you produce a serviceable dish. And then it’s time to reach for more clay.


* This is meant to be a short blog post so pardon me for not engaging with questions relating to quality of academic publications over time, or any other parameter for that matter. As also with issues such as other reasons why texts of dubious quality get published in the first place.

† I actually wrote out twice as much as ended up in the article but had to scrap everything written after the first pint. Alcohol can be a muse but when Clio morphs into Thalia you know you’ve had one too many.


At the beginning of this week, I attended the two-day Big Data Approaches to Intellectual and Linguistic History symposium at the Helsinki Collegium for Advanced Studies, University of Helsinki. Since Tuesday, I’ve found myself pondering on topics that came up at the symposium. So I thought I would write up my thoughts in order to unload them somewhere (and thus hopefully stop thinking about them) (I have a chapter to finish, and not on digital humanities stuff), and also in order to try to articulate, more clearly than the jumbled form inside my head, my reflections upon what was discussed there. I.e. the usual refrain, ‘I need to hear what I say in order to find out what I think’.

So here goes.

NB this is not a conference report, in that I’m not going to talk about specific presentations given at the symposium. For that, check out slides from the presentations linked to from the conference website, and see also the Storify of the tweets from the event (both including those from the workshop that followed on Wednesday, Helsinki Digital Humanities Day).


I’ve been a part of the DH (Digital Humanities) community for about ten years now. I started off working on digital resources – linguistic corpora, digital scholarly editing; I’ve even fiddled with mapping things – but have in recent years not been actively engaged in resource- or tool-creation as such. Yet I use digital and digitised resources on a daily basis: EEBO frequently, the broad palette of resources available on British History Online all the time, and, when I have access to them, State Papers Online and Cecil Papers (Online). (I work on British state papers from around 1600, and am lucky in that much of the material I need has been digitised and put online in one form or another). I also keep an eye on what happens in the DH world: I attend DH-related conferences and seminars and whatnot when I can, subscribe to LLC (Literary & Linguistic Computing, about to be renamed DSH, Digital Scholarship in the Humanities), and hang out with DHers both online (Twitter, mostly) and in real life.

All this goes to say that I feel quite confident about my understanding of DH projects at the macro level. (Details, certainly not: implementation, encoding, programming, etc etc).

Thus, attending a DH symposium on ‘big data’, I expected to hear presentations about things I was already familiar with. And this turned out to be the case: there were descriptions of/results from projects, descriptions of methodologies (explaining to those from other disciplines ‘what is it we do’), and explorations of concepts that keep coming up in DH work.

Don’t get me wrong: I found all the presentations (that I saw) very good, and listening to talks by people in other disciplines does give you new perspectives. Maybe not profound ones, and often you end up thinking/feeling there’s little or no common ground so why do we even bother? But it’s not a completely useless exercise. Yet what I felt to be the take-away points from this symposium were ones I feel keep coming up at DH events that I have attended over the years, and ones that we – meaning the DH community – are well aware of. Such as (by no means a comprehensive list):

1. Issues with the data

  • “Big Data” in the humanities is not very big when compared to Big Data in some other fields
  • We know Big Data is good for quantity, but rubbish for quality
    • We are aware of the importance and value of the nitty-gritty details
  • We know that manual input is a required part both of processing/methodology – in order to fine-tune the automatic parts of the process – and, more importantly, of the analysis of the results (Matti Rissanen’s maxim: “research begins where counting ends”)
  • We know that Our data – however Big it is – is never All data (our results are not God’s Truth)
    • We are aware of the limits of the historical record (“known unknowns, unknown unknowns”)

2. Sharing tools and resources

  • We need to develop better tools, cross-disciplinary ones
    • Our research questions may be different, but we are all accessing and querying text
  • We need to develop our tools as modular “building blocks”: ‘good enough’ is good enough
  • We need to share data/sources/databases/corpora/materials – open access; copyright is an issue, but we’re all (painfully) aware of this

Clearly, these are important points that we need to keep in mind, and challenges that we want to address. And repetitio mater studiorum est. So why do I feel that their reiteration on Monday and Tuesday only served to make me grumpier than usual?*

In the pub after Wednesday’s workshop, we talked a little bit about how pessimistically these points tend to be presented. “We can’t (yet) do XYZ”. “We need to understand that our tools and resources are terrible”. …which now reminds me of what I commented in a previous discussion, on Twitter, early this year:

One element in how I feel about the symposium could be the difficulty of cross-disciplinary communication. This, too, is familiar to me, seeing as I straddle several disciplines, hanging out with historical linguists on the one hand, historians on th’other, and then DHers too. I once attended a three-day conference convened by linguists where the aim was to bring linguists and historians together. I think only one of the presentations was by a historian… So yeah, we don’t talk – as disciplines, that is: I know many individuals who talk across disciplinary borders. …and, come to think of it, I know a number of scholars who straddle such borders. But perhaps it’s just that at interdisciplinary events there’s a required level of dumbing-down on the part of the presenters on the one hand, and inevitable incomprehension on the part of the audience on the other. Admittedly, it is incredibly difficult to give an interdisciplinary paper.

A final point, perhaps, in these meandering reflections, is of course the wee fact that I don’t, in fact, work on research questions that require Big Data.† (At the moment, anyway). So I’m just not particularly interested in learning how to use computers to tell me something interesting about large amounts of text – something that it would be impossible to see without using computational power. It’s not that the methodologies, or indeed the results produced, are not fascinating. It’s just that I guess I lack a personal connection to applying them. ..but then, I suppose this can be filed under the difficulty of interdisciplinary communication! ‘I see what you’re doing but I fail to see how it can help me in what I do’.


So how to conclude? I guess, first of all, kudos to HCAS for putting the symposium together – and, judging from upcoming events, for playing an important part in getting DH in Finland into motion. It’s not as if there’s been nothing previously, and HCAS definitely cannot be credited with ‘starting’ DH activities in Finland in any way – some of us have been doing this for 10 years, some for 30 years or more, and along the way, there have been events which fall under the DH umbrella. But only in the past year or so has DH become established institutionally in the University of Helsinki: we have a professor of DH now, and 4 fully-funded DH-related PhD positions. Perhaps it was the lack of institutional recognition that made previous efforts at organizing DH-related activities here for the most part intermittent and disconnected. But we’ll see how things proceed: certainly many of us are glad to see DH becoming established in Finnish academia as an entity. And judging by the full house at the symposium and the workshop that followed, it would appear that there are many of us in the local scholarly community interested in these topics. The future looks promising.

It should also be said that DH has come a long way from what it was ten years ago. The resources and tools we have today allow us to do amazing things. Just about all of the presentations at the symposium described and discussed projects that use complicated tools to do complex things. I am seriously impressed by what is being done in various fields – and simply, by what can be done today. And there is no denying that there is a Lot of work being done across and between disciplines: DH projects are often multidisciplinary by design, and many are working on and indeed producing tools and resources that can be useful to different disciplines.

Maybe it’s just the season making me cranky. You’ll certainly see me at the next local DH event. Watch this space..



* ..Maybe it’s just conference fatigue that I’m struggling with? There are only so many conference papers one can listen to attentively, and almost without exception there is never time to do much but scratch the surface and provide but a thin sketch of the material/problem/results/etc. (It’s rather like watching popular history/science documentaries/programs on tv: oh look, here’s Galileo and the heliocentric model again, ooh with pictures of the Vatican and dramatic Hollywood-movie music, for chrissakes). (I mean, yes it’s interesting and cool and all that but oh we’re out of time and have to go to commercials/questions). (So in order to retain my interest there needs to be some seriously new and exciting material/results to show, like those baby snow geese jumping off a 400ft cliff (!!!!) in David Attenborough’s fantastic new documentary Life Story, or all the fantastic multilingual multiscriptal stuff in historical manuscripts that we have only just started to look at in more detail. If it’s yet another documentary about Serengeti lions / paper about epistolary formulae in Early Modern English letters, I’m bound to skip it. I’m willing to be surprised, but this is well-trodden ground). /rant

† Incidentally, I disagree with the notion that in the Humanities we don’t have Big Data – I would say that this depends on your definition of “big”. While historical text corpora may at best only be some hundreds of millions of words, and this pales in comparison to the petabytes (or whatever) produced by, say, CERN, or Amazon, every minute, I see (historical) textual data as fractal: the closer you look at it, the more detail emerges. Admittedly, a lot of the detail does not usually get encoded in the digitised corpora (say, material and visual aspects of manuscript texts), but there’s more there per byte than in recordings of the flight paths of electrons or customer transactions. Having said this, I’m sure someone can point out how wrong I am! But really, “my data > your data”? I don’t find spitting contests particularly useful in scholarship, any more than in real life.

No signal, just noise

One of the (oh too many) things I work on is code-switching* in historical texts. Or, more broadly, how multilingual environments are reflected in Early Modern English (merchants’) letter-writing. In particular I’ve done some work on the letters of early English East India Company merchants – some of it published – and then of course a bit more on the focus of my (never-ending) dissertation, the early letters of Richard Cocks. This year, I’ve joined my interest in historical code-switching to my interest in palaeography, and have given papers on if and how and why script and typeface are also switched when there is a code-switch in Early Modern English texts. In short, I have been looking at the historical development of why we still italicise words and passages in aliis linguis – and also at other practices of typographical flagging We Still Employ On A Daily Basis. This is all still work in progress, although I will write it up for publication in due course. But I would here like to share an observation gained from conferences and discussions on these and other topics this year.

Last June, there was an excellent symposium at the University of Tampere on historical code-switching. Really the first meeting of its kind, it was hugely enlightening to have three days of papers on code-switching in historical texts, covering over 1,500 years and, although focussing on English texts (OE, ME, EModE, LModE), also including some papers on texts produced elsewhere in Europe. Although all of it was fascinating and informative, I was naturally primarily interested in finding out about visual flagging of code-switching – whether by script-switching, as in the letters I presented on, or by other means. Only a couple of the papers actually focussed on visual aspects of code-switching, but visuality did crop up often enough to give me an idea of the range of the phenomenon and variation within it.

Overall, though, I was struck particularly by the realization that what we were conferencing on under the rubric of “historical code-switching” actually was/is a hugely diverse … er, thing. Not a practice, but a vast set of practices. Not a phenomenon, but a huge array of phenomena. One of the conclusions of the conference that came up in the closing open discussion was in fact that although most scholars working on historical code-switching have been applying methodologies developed for present-day conversational code-switching to historical texts, we have all been discovering how inadequate such conceptual models and practical approaches are for our purposes. (The same has been realized by scholars working on code-switching in present-day texts). So developing models that are more suitable for the analysis of code-switching in (historical) texts is an important part of future work.

So, practices of code-switching in historical texts vary greatly depending on which period, region, languages, text types, genres, etc etc are involved – and much is down to scribal idiolects. That is to say, code-switching in the Early Modern English letters I work on is very different from code-switching in Late Modern English literary texts, or Early Modern English printed tracts, or practices in present-day northern India, or those in the Jewish community in medieval Cairo. And equally, the code- and script-switching practices of the writer I work on are completely different to those of some of his peers.

Simply put, there appears to be so much variation over time and space and text type, that it is difficult, at this stage, to see any patterns – except when restricting the study to a single place, time, text type, or indeed writer.


Okay. So what? How is the situation different from pretty much any other field?

Well, of course it isn’t. The primary difference from many other fields is the lack of data – historical code-switching is an emerging field and studies are still thin on the ground and cover disparate material. Give it another 20 years and the picture will be clearer. And actually, the fact that we know so little about any of this means that there is unexplored material aplenty, so it is ridiculously easy to come up with further topics and sources to study. Which is exciting!

But this is my point: the variation between texts (types, times, places) is so great as to render generalizations based on a single corpus void. Thus anyone making any general points about historical code-switching is, in my view, bound to be wrong.

And all of this applies equally strongly to script-switching, and also to material aspects of letter-writing: at the moment, we know next to nothing about either of these things.

Which brings me back to my work.

I guess what’s bugging me is the fact that, particularly in PhD work, you quite desperately want to be able to contribute to scholarship, and preferably with A Point: something that can be drawn out of your study and generalized; something that can be applied to other sources. Thus it’s eminently frustrating to cover new ground through painstaking attention to detail in your sources, only to end up with the realization that you have indeed made an important finding in itself, but all you can really say, based on This Material, is how This Material behaves.


* When I say “code-switching”, I use it in the broadest possible sense to mean any use of L2 in L1, including such things as quotations (which arguably require no competence in L2) and lexical borrowings (if cul-de-sac is not French, why do we keep italicising it?) as well as ‘real’ code-switching, be it inter- or intrasentential.