1. Early Modern English spelling variation
Yesterday, rather late in the evening, I followed a link on Twitter, posted by heather froehlich (@heatherfro) on April 24, 2014.
This led to the great Early Modern Print: Text Mining Early Printed English website, which offers an interface like the Google Books Ngram Viewer but for the EEBO-TCP corpus, called EEBO Spelling Browser (or more technically, EEBO-TCP Ngram Browser). With the delight of a researcher falling upon a new toy I started to play with it – but hadn’t even started when I was struck by the figure that is displayed when you navigate to the EEBO-TCP Ngram Browser page. It looks like this:
The idea of an ngram viewer – as per Google – is to look at the frequency of occurrences over time, of a word (a 1-gram) or a phrase (2-, 3-, 4- … N-gram). Frequency here means the proportion of the search phrase to all the words in the corpus, plotted over time. So for instance, the frequency of the word “war” rises during wartime, and falls in peacetime. But things get much more interesting when you look at less obvious things.
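To make that definition of frequency concrete, here is a minimal sketch in Python. It is not how the EEBO-TCP Ngram Browser or Google's viewer is implemented; the function name `relative_frequency` and the toy sentences are made up purely for illustration, and tokenisation is reduced to splitting on whitespace.

```python
from collections import Counter

def relative_frequency(docs, target):
    """docs: iterable of (year, text) pairs.
    Returns {year: share of all tokens in that year equal to target}."""
    hits = Counter()
    totals = Counter()
    for year, text in docs:
        tokens = text.lower().split()  # naive whitespace tokenisation (assumption)
        totals[year] += len(tokens)
        hits[year] += tokens.count(target)
    return {year: hits[year] / totals[year] for year in sorted(totals) if totals[year]}

# Toy, invented data: two "texts" from 1620 and one from 1640
toy_docs = [
    (1620, "the heavens aboue and the earth beneath"),
    (1620, "aboue all things"),
    (1640, "the heavens above and the earth beneath"),
]
freqs = relative_frequency(toy_docs, "aboue")
print(freqs)
```

Plotting such per-year values for two competing spellings side by side gives the kind of curve the browser displays.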
Anyway. The point of the EEBO-TCP spelling variant ngram viewer is to compare the change and development of spelling variants over time: for instance, plotting “spell” against “spelle”, “spel”, etc:
English spelling only became standardized in the 18th century, and anyone who wants to read earlier texts has to learn to deal with the fact that apparently all spellings were equally acceptable, and that writers haphazardly used the first spelling that came to their mind* – one of the most famous (or notorious) examples being how William Shakespeare signed his name in six different ways. Despite eventual standardization, spelling variation in English has not completely disappeared today, for although varying how you spell your name today sounds outrageous and unthinkable, all students of English as a foreign language have to learn that there are British and American spellings for many familiar words: colour and color, standardize and standardise, etc.
2. What the hell happened in 1625?
But to return to the EEBO-TCP Spelling Browser, what struck me was the dramatic change in the 1630s. If you look back to the first figure above, you can see that of the two spellings of the word “above”, the spelling “aboue” is essentially the default form until 1625, when it rapidly loses to the alternative spelling “above”, which is firmly established by about 1640.
..Hang on, what? The centuries-old practice of not differentiating between the graphemes <u> and <v> according to the phonemes they indicate – /u/ and /v/ – is replaced, over the stunningly short period of 15 years – across the board (!?) in printed texts by consistent mapping of <u> to /u/ and <v> to /v/..!?
@heatherfro …That was 90mins in the middle of the night playing with EModE spelling variation. What the hell happened in 1625?!?
— THE POSTDOCTOR (@samklai) April 24, 2014
Sooo many questions.
My very first thought was that it must be an artefact of the dataset. One word, of course, hardly tells the whole story. Did this change hold for other words that show u/v variation? What about i/j variation? Or perhaps the EEBO-TCP material was somehow skewed?
But however much I fiddled with the browser, the period between 1620 and 1640 remained the significant factor. And it also applied to i/j-words:
But I did also check EEBO proper – knowing that the results may well be different from those of the EEBO-TCP Ngram Browser. However, it turned out once again that the Browser had been right:
So what on earth happened in the 1620s and 1630s to explain this dramatic shift into standardized spelling?
…actually, I don’t know. Googling revealed that this is a known phenomenon, although so far I have found neither a definitive study of it nor a good explanation. (Clearly it has something to do with what’s going on in printing houses.) For instance, in her article in the Cambridge History of the English Language vol 3 (2000), Vivian Salmon discusses historical variation in using <u> and <v> to indicate both /u/ and /v/, and then quite casually mentions how “the distinction was made in the 1630s” (p. 39). I think that a corpus-based study of this change remains to be done – although I could be wrong.
Yet rather than starting to look at this point in more detail, I pursued another question that had come to mind: how did this shift in orthographical practices manifest in non-printed texts, such as letters?
3. Non-printed texts and manuscripts
Happily, I am in a perfect position to ask this question, being part of the team who have compiled the Corpus of Early English Correspondence (CEEC). The CEEC is a corpus of English personal letters, spanning 1400-1800 and presently containing about 12,000 letters (5.2m words). It was designed for historical sociolinguistics – to apply modern sociolinguistic methods to historical texts.
Of course, there’s a caveat: CEEC is based on printed editions of letters. “Hang on”, you might say, “is a corpus built from such sources linguistically reliable? Shouldn’t the corpus have been compiled from manuscript texts?” Well, yes – but we have been careful not to use editions that modernize the letter texts, or that normalize them extensively. For the kinds of linguistic queries that the corpus was designed for, the normalization of features such as u/v variation was deemed acceptable. And we have always been careful to stress that the CEEC is not suitable for studying English orthography.
Anyway, I nonetheless rushed right in to see what the CEEC threw up. Not having fancy tools (like DICER) to reveal the proper extent of variant spellings in the corpus, I used a short list of sample words (euer/ever, ouer/over, aboue/above, vp/up). But the results were underwhelming:
In this figure, the ratio of the old form of u/v-spelling variants was far too low through the whole period – it should have been at least around 80%, if the EEBO data was indicative of English spelling practices overall, rather than just those restricted to printed texts.
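For the curious, the kind of count behind such a figure can be sketched roughly as follows. This is an illustration only, not the actual CEEC query: the function `old_form_share`, the grouping by decade and the toy letters are my assumptions, and only the four word pairs come from the sample list above.

```python
import re
from collections import defaultdict

# (old spelling, new spelling) pairs from the short sample list
PAIRS = [("euer", "ever"), ("ouer", "over"), ("aboue", "above"), ("vp", "up")]
OLD = {old for old, _ in PAIRS}
NEW = {new for _, new in PAIRS}

def old_form_share(letters):
    """letters: iterable of (year, text) pairs.
    Returns {decade: old-form tokens / (old-form + new-form tokens)}."""
    old = defaultdict(int)
    total = defaultdict(int)
    for year, text in letters:
        decade = year // 10 * 10
        for token in re.findall(r"[a-z]+", text.lower()):
            if token in OLD:
                old[decade] += 1
                total[decade] += 1
            elif token in NEW:
                total[decade] += 1
    # Decades with no tokens of either form never enter `total`
    return {decade: old[decade] / total[decade] for decade in sorted(total)}

# Invented example letters, not corpus data
toy_letters = [
    (1603, "I haue euer ben yours aboue all; come vp ouer the hill"),
    (1608, "ever yours"),
    (1655, "I am ever yours, above all; come up"),
]
shares = old_form_share(toy_letters)
print(shares)
```

Note that a fixed word list like this only sees the spellings it is told about, which is exactly why a tool that discovers variant spellings would give a fuller picture.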
In order to have better data, I spent some time extracting a subcorpus from the CEEC† consisting of texts only from editions of 17th-century letters in which I could find u/v and i/j-variation. This time, the results were more interesting:
Although the ratio of old spelling variants is still much lower than I had expected, in this figure there is a sharp decline from the 1640s on – which would be in accordance with a prescribed change. For example, if all schoolchildren are taught to spell according to certain rules, it takes a while for the older generations of writers to die out or change their spelling habits. Similarly, it makes sense that the influence of a standard orthography in printed texts would be reflected in manuscript texts with a slight time lag.
Yet I remained unhappy with this data. In EEBO, the shift is from nearly 100% old form to 100% new form. Clearly the texts of the editions used for CEEC were normalized more than I had thought. Even given that this was a quick pilot study, the discrepancy was simply too large to accept as a difference between orthographical practices of manuscript and print.
I had one last trick up my sleeve: I did have a fairly good-sized corpus of letters from the first decade of the 1600s transcribed from manuscript, which retained original spellings and other orthographical features. It wouldn’t show me change over time, but it would give me a control figure for how much, exactly, letter-writers were using the old forms in their letters.‡
The result can be seen in the figure above – it is the red X, marking a whopping 87.9% old forms. Finally, something resembling the situation in EEBO.
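The percentage is simple arithmetic on the token counts given in the footnotes, and the same calculation puts the three datasets side by side. A small sketch follows; the dataset labels are mine, and the two CEEC percentages are derived here from the footnote counts rather than quoted from the figures.

```python
# (old-form tokens, all matched tokens) as reported in the footnotes
samples = {
    "CEEC, first search": (700, 6657),
    "CEEC, 17th-century subcorpus": (887, 3613),
    "manuscript corpus": (2385, 2712),
}

# Share of old forms as a percentage, rounded to one decimal place
percentages = {
    name: round(100 * old / total, 1)
    for name, (old, total) in samples.items()
}

for name, pct in percentages.items():
    print(f"{name}: {pct}% old forms")
```

Over all periods combined, the two edition-based samples come out at roughly a tenth and a quarter old forms, against 87.9% in the manuscript letters.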
There was a fair bit of variation between different words in the manuscript sources, and in some cases the new form was dominant:
The greatest discrepancy between the manuscript sources and CEEC (namely the second extracted subcorpus) could be seen in the fact that in the manuscripts, words beginning /un-/ were spelled with a <v> 99.6% of the time (of 987 tokens), whereas in CEEC, the <v>-form occurred only 31% of the time (of 295 tokens). Even editions which claim to retain original spellings clearly cannot be taken at face value.
4. Summing up
So what can we say about that dramatic end to spelling variation in the 1630s seen in the figures from the EEBO-TCP Ngram Browser? Actually, not much.
1. It would appear that in the EEBO corpus, u/v variation became standardized between 1620 and 1640. However, without a comprehensive survey even this conclusion may be wrong – cf. for instance i/j variation in the proper name James, where the <i>-form takes longer to start declining and has still not disappeared by the end of the century (this might have to do with capitalisation):
2. In manuscript texts, it looks like the spelling standardization process occurred 20 or more years after it took place in print. But without a broader survey, even this estimate may well be wrong.
3. The EEBO Spelling Browser is awesome!
I do remain curious about what happened in the 1620s & 30s, particularly about whether the standardization of spelling was something more than a development in printing house practices. But I think I’ve done my share of midnight rabbit chasing for the moment.
* Students of Early Modern English beware: this is not true! There are methods in the apparent madness, although the rules may be subtle, and they do vary between writers.
† My first search of CEEC material was of c. 5,000 letters (2.2m words), finding 6,657 tokens of which 700 were old spellings. My second dataset consisted of c. 1,900 letters (just under 800k words), and 3,613 tokens of which 887 were old forms (types: euer/ever, ouer/over, aboue/above, vp/up, vs/us).
‡ This manuscript-based corpus contained about 200 letters (130k words). I expanded my sample word list (types: euer/ever, ouer/over, aboue/above, haue/have, giue/give, vp/up, vs/us, adu*/adv*, vn*/un*), extracting 2,712 tokens – of which 2,385 were old forms.