Big corpus

UK publishers produce over 180,000 books each year. (About one third are in digital formats.) That’s a lot of words, even before taking into account the output of other countries, all the other words generated online (self-published or unpublished), and journal, magazine and newspaper articles.

These large text corpuses are more than big data, but can be treated as such — counted, mined, probed, analysed, compared, correlated and turned into tables, graphs and network diagrams, without the need for anyone to understand any of it.

More precisely, scholars can use computer programs to transform literary content into different formats in order to understand it better, or at least differently. That’s distant reading, as opposed to close reading. The scholar stands back, as if viewing from afar, and reviews a whole corpus (collection) of works, or combinations of corpuses. It’s less about singular texts, and more about whole collections (e.g. the complete works of William Shakespeare, all nineteenth-century English novels, or the Hansard Reports).
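As a rough illustration of what "transforming" a corpus can mean in practice, here is a minimal Python sketch that reduces a collection of texts to a table of word counts. It is not the method of any particular tool or scholar. The two-document corpus and the function names `tokenize` and `corpus_counts` are invented for the example, and the tokenizer is deliberately crude.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and return its word tokens (letters and apostrophes only)."""
    return re.findall(r"[a-z']+", text.lower())


def corpus_counts(corpus):
    """Return a single Counter of word frequencies across every text in the corpus."""
    totals = Counter()
    for text in corpus.values():
        totals.update(tokenize(text))
    return totals


# Hypothetical two-document corpus, invented purely for illustration.
corpus = {
    "doc_a": "The reader reads the text closely.",
    "doc_b": "The scholar reads the corpus from a distance.",
}

counts = corpus_counts(corpus)

# Show the most frequent words across the whole collection.
print(counts.most_common(3))
```

Nothing in the script understands a word of either document, which is the point: the counts describe the collection as a whole rather than any single text.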

Literary theory

Kathryn Schulz in the New York Times (2011) is suitably skeptical about this kind of study. Franco Moretti of the Stanford Literary Lab hopes to find the “unified theory of plot and style” (229), as if gathering data from the natural world. Schulz makes the obvious point that literary data is created “by design,” and not subject to the independent, distant readings science claims to make of natural phenomena. So dispassionate analysis of texts can only get us so far before we have to commit to the meaning of what it is we are reading, or don’t have time to read.

As a trial I ran my last seven blog posts through a free-to-use online tool for analysing corpuses of texts. Here’s some of what it came up with.

[Two screenshots of the tool’s output, taken 3 October 2015]

The postings are ordered, so I guess there’s some sense here to the idea of a trend. I look forward to discovering more, but I’m reluctant to commit whole manuscripts to an online text analysis tool. At present I don’t think automated text analysis provides a substitute for reading, or vicarious reading through other readers’ interpretations.
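A trend of this kind can also be computed locally, without uploading anything. The sketch below is only an assumption about what such a chart might measure, namely the share of words in each post that match a chosen term, taken in posting order. The three short "posts" are placeholders rather than my actual blog posts, and `relative_frequency` and `trend` are names made up for the example.

```python
import re


def relative_frequency(term, text):
    """Return the share of word tokens in `text` that equal `term` (case-insensitive)."""
    tokens = re.findall(r"[a-z']+", text.lower())
    if not tokens:
        return 0.0
    return tokens.count(term.lower()) / len(tokens)


def trend(term, ordered_texts):
    """Return the relative frequency of `term` in each text, keeping the given order."""
    return [relative_frequency(term, text) for text in ordered_texts]


# Placeholder posts in chronological order, invented for illustration.
posts = [
    "reading a text closely",
    "distant reading and more reading",
    "data",
]

# One value per post: how prominent the term is at each point in the sequence.
print(trend("reading", posts))
```

The resulting list of numbers is what gets plotted as a line. Whether the rise and fall means anything still depends on reading the posts.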

For the interpretive scholar any text operates at a distance anyway. See posts tagged hermeneutics.


  • Moretti, Franco. 2013. Distant Reading. London: Verso.
  • Schulz, Kathryn. 2011. What is distant reading? The New York Times, June 24, online.

