The Why of an Intelligence Corpus; and How: Ethical and Construction Issues

Robert Buckmaster
University of Latvia

In the first years of the 21st century, as a result of the 9/11 attacks and the subsequent wars and revolutions, the language of intelligence has become more prominent in everyday discourse: in the mainstream media, on websites and blogs, and in the comments sections of newspapers. This prominence has not been reflected in the literature to any great extent.

Three Cases

On February 4th 2014, a short but explosive tape, four minutes and ten seconds long, of an intercepted mobile/cell phone conversation between Assistant Secretary of State Victoria Nuland and US Ambassador to Ukraine Geoffrey Pyatt was uploaded to YouTube (Nuland and Pyatt, 2014) amidst the growing tensions surrounding the ongoing Maidan demonstrations in Kiev, the capital of Ukraine. Most attention was focused on Nuland's colourful 'Fuck the EU', and this phrase was probably why this particular extract of a longer conversation was leaked by persons unknown, though the FSB, successors to the KGB, are suspected. Of more interest to the linguist were usages such as 'deets' for 'details', 'complicated electron', the need to 'glue' things, the necessity for 'an international personality' to 'help to midwife this thing', idiomatic usage like 'we could er land jelly side up on this one', collocations like 'political homework', and mixed metaphors like 'if it does start to gain altitude the Russians will be working behind the scenes to try to torpedo it'. The conversation offered a fascinating insight into the language of American diplomats at work and caused outrage among the offended Europeans.

On the 20th May 2013 Edward Snowden, a former CIA employee and NSA contractor, discreetly left Hawaii for Hong Kong with four laptop computers and a treasure trove of nearly 2 million documents. He was met in China on the 1st of June by Glenn Greenwald and Ewen MacAskill, Guardian journalists, and Laura Poitras, a documentary film maker. Snowden used a Rubik's cube to make contact with the journalists and was then 'debriefed' or interviewed by them for a week; the first leak was published by the Guardian on the 5th June. Snowden went public on the 9th June and left Hong Kong for Moscow on the 23rd. His US passport was revoked and he was granted temporary asylum by Vladimir Putin, the Russian president (Wikipedia, 2015). The Guardian and other newspapers continue to print reports based on the Snowden documents. On the 3rd February 2014 Greenwald, Poitras and Jeremy Scahill created The Intercept (Greenwald and Scahill, 2015) with a 'short-term mission... to provide a platform and an editorial structure in which to aggressively report on the disclosures provided to us by our source, NSA whistleblower Edward Snowden.'

Looking even further back, PFC Bradley Manning leaked a trove of 250,000 US diplomatic cables and 500,000 Army reports to WikiLeaks. Manning was arrested in May 2010 and found guilty of violations of the Espionage Act in July 2013 (Wikipedia, 2015b).

These three cases illustrate the unprecedented quantity of previously classified material which is now available to the general public. The Manning and Snowden leaks are an order of magnitude greater than previous leaks like the Pentagon Papers (National Archives) and, unlike the Papers, consist of operational language: the language of internal NSA briefings about classified programmes and the language of State Department diplomats reporting back to Washington in cables.

This paper addresses the question of why a corpus of intelligence texts is important and what questions it could answer. It also discusses the important ethical questions raised by collecting and analysing leaked or stolen materials, relating these to copyright law, and highlights some construction issues related to corpus sampling and balance.


Greenwald, G. and Scahill, J. 2015.
Nuland, V. and Pyatt, G. 2014.
National Archives.
Wikipedia. 2015.
Wikipedia. 2015b.