Datamoaning

At the beginning of this week, I attended the two-day Big Data Approaches to Intellectual and Linguistic History symposium at the Helsinki Collegium for Advanced Studies, University of Helsinki. Since Tuesday, I’ve found myself pondering on topics that came up at the symposium. So I thought I would write up my thoughts in order to unload them somewhere (and thus hopefully stop thinking about them) (I have a chapter to finish, and not on digital humanities stuff), and also in order to try to articulate, more clearly than the jumbled form inside my head, my reflections upon what was discussed there. I.e. the usual refrain, ‘I need to hear what I say in order to find out what I think’.

So here goes.

NB this is not a conference report, in that I’m not going to talk about specific presentations given at the symposium. For that, check out slides from the presentations linked to from the conference conference website, and see also the Storify of the tweets from the event (both including those from the workshop that followed on Wednesday, Helsinki Digital Humanities Day).


 

I’ve been a part of the DH (Digital Humanities) community for about ten years now. I started off working on digital resources – linguistic corpora, digital scholarly editing; I’ve even fiddled with mapping things – but have in recent years not been actively engaged in resource- or tool-creation as such. Yet I use digital and digitised resources on a daily basis: EEBO frequently, the broad palette of resources available on British History Online all the time, and, when I have access to them, State Papers Online and Cecil Papers (Online). (I work on British state papers from around 1600, and am lucky in that much of the material I need has been digitised and put online in one form or another). I also keep an eye on what happens in the DH world: I attend DH-related conferences and seminars and whatnot when I can, subscribe to LLC (Literary & Linguistic Computing, about to be renamed DSH, Digital Scholarship in the Humanities), and hang out with DHers both online (Twitter, mostly) and in real life.

All this goes to say that I feel quite confident about my understanding of DH projects at the macro level. (Details, certainly not: implementation, encoding, programming, etc etc).

Thus, attending a DH symposium on ‘big data’, I expected to hear presentations about things I was already familiar with. And this turned out to be the case: there were descriptions of/results from projects, descriptions of methodologies (explaining to those from other disciplines ‘what is it we do’), and explorations of concepts that keep coming up in DH work.

Don’t get me wrong: I found all the presentations (that I saw) very good, and listening to talks by people in other disciplines does give you new perspectives. Maybe not profound ones, and often you end up thinking/feeling there’s little or no common ground so why do we even bother? But it’s not a completely useless exercise. Yet what I felt to be the take-away points from this symposium were ones I feel keep coming up at DH events that I have attended over the years, and ones that we – meaning the DH community – are well aware of. Such as (by no means a comprehensive list):

1. Issues with the data

  • “Big Data” in the humanities is not very big when compared to Big Data in some other fields
  • We know Big Data is good for quantity, but rubbish for quality
    • We are aware of the importance and value of the nitty-gritty details
  • We know that manual input is a required part of both processing/methodology – in order to fine-tune the automatic parts of the process – and more importantly, for the analysis of the results (Matti Rissanen’s maxim: “research begins where counting ends”)
  • We know that Our data – however Big it is – is never All data (our results are not God’s Truth)
    • We are aware of the limits of the historical record (“known unknowns, unknown unknowns”)

2. Sharing tools and resources

  • We need to develop better tools, cross-disciplinary ones
    • Our research questions may be different, but we are all accessing and querying text
  • We need to develop our tools as modular “building blocks”, ‘good enough’ is good enough
  • We need to share data/sources/databases/corpora/materials – open access; copyright is an issue, but we’re all (painfully) aware of this

Clearly, these are important points that we need to keep in mind, and challenges that we want to address. And repetitio mater studiorum est. So why do I feel that their reiteration on Monday and Tuesday only served to make me grumpier than usual?*

In the pub after Wednesday’s workshop, we talked a little bit about how pessimistically these points tend to be presented. “We can’t (yet) do XYZ”. “We need to understand that our tools and resources are terrible”. …which now reminds me of what I commented in a previous discussion, on Twitter, early this year:

One element in how I feel about the symposium could be the difficulty of cross-disciplinary communication. This, too, is familiar to me seeing as I straddle several disciplines, hanging out with historical linguists on the one hand, historians on th’other, and then DHers too. I once attended a three-day conference convened by linguists where the aim was to bring linguists and historians together. I think only one of the presentations was by a historian…  So yeah, we don’t talk – as disciplines, that is: I know many individuals who talk across disciplinary borders.  …and, come to think of it, I know a number of scholars who straddle such borders. But perhaps it’s just that at interdisciplinary events there’s a required level of dumbing-down on the part of the presentators on the one hand, and inevitable incomprehension on the part of the audience on the other. Admittedly, it is incredibly difficult to give a interdisciplinary paper.

A final point, perhaps, in these meandering reflections, is of course the wee fact that I don’t, in fact, work on research questions that require Big Data.† (At the moment, anyway). So I’m just not particularly interested in learning how to use computers to tell me something interesting about large amounts of texts – something that it would be impossible to see without using computational power. It’s not that the methodologies, or indeed the results produced, are not fascinating. It’s just that I guess I lack a personal connection to applying them.  ..but then, I suppose this can be filed under the difficulty of interdisciplinary communication! ‘I see what you’re doing but I fail to see how it can help me in what I do’.

Hmm.

So how to conclude? I guess, first of all, kudos to HCAS for putting the symposium together – and, judging from upcoming events, for playing an important part in getting DH in Finland into motion. It’s not as if there’s been nothing previously, and HCAS definitely cannot be credited for ‘starting’ DH activities in Finland in any way – some of us have been doing this for 10 years, some for 30 years or more, and along the way, there have been events which fall under the DH umbrella. But only in the past year or so has DH become established institutionally in the University of Helsinki: we have a professor of DH now, and 4 fully-funded DH-related PhD positions. Perhaps it was the lack of institutional recognition that made previous efforts at organizing DH-related activities here for the large part intermittent and disconnected. But we’ll see how things proceed: certainly many of us are glad to see DH becoming established in Finnish academia as an entity. And judging by the full house at the symposium and the workshop that followed, it would appear that there are many of us in the local scholarly community interested in these topics. The future looks promising.

It should also be said that DH has come a long way from what it was ten years ago. The resources and tools we have today allow us to do amazing things. Just about all of the presentations at the symposium described and discussed projects that use complicated tools to do complex things. I am seriously impressed by what is being done in various fields – and simply, by what can be done today. And there is no denying that there is a Lot of work being done across and between disciplines: DH projects are often multidisciplinary by design, and many are working on and indeed producing tools and resources that can be useful to different disciplines.

Maybe it’s just the season making me cranky. You’ll certainly see me at the next local DH event. Watch this space..

 


 

* ..Maybe it’s just conference fatigue that I’m struggling with? There are only so many conference papers one can listen to attentively, and almost without exception there is never time to do much but scratch the surface and provide but a thin sketch of the material/problem/results/etc. (It’s rather like watching popular history/science documentaries/programs on tv: oh look, here’s Galileo and the heliocentric model again, ooh with pictures of the Vatican and dramatic Hollywood-movie music, for chrissakes). (I mean, yes it’s interesting and cool and all that but oh we’re out of time and have to go to commercials/questions). (So in order to retain my interest there needs to be some seriously new and exciting material/results to show, like those baby snow geese jumping off a 400ft cliff (!!!!) in David Attenborough’s fantastic new documentary Life Story, or all the fantastic multilingual multiscriptal stuff in historical manuscripts that we have only just started to look at in more detail. If it’s yet another documentary about Serengeti lions / paper about epistolary formulae in Early Modern English letters, I’m bound to skip it. I’m willing to be surprised, but these are well-trodden ground).   /rant

† Incidentally, I disagree with the notion that in the Humanities we don’t have Big Data – I would say that this depends on your definition of “big”. While historical text corpora may at best only be some hundreds of millions of words, and this pales in comparison to the petabytes (or whatever) produced by, say, CERN, or Amazon, every minute, I see (historical) textual data as fractal: the closer you look at it, the more detail emerges. Admittedly, a lot of the detail does not usually get encoded in the digitised corpora (say, material and visual aspects of manuscript texts), but there’s more there than per byte than in recordings of the flight paths of electrons or customer transactions. Having said this, I’m sure someone can point out how wrong I am! But really, “my data > your data”? I don’t find spitting contests particularly useful in scholarship, any more than in real life.

Leave a comment

Your email address will not be published. Required fields are marked *