Corpora

A corpus is ‘a collection of pieces of language, selected and ordered according to explicit linguistic criteria in order to be used as a sample of language’ (Sinclair, 1996).

In a sense, and at its simplest, a corpus is a sample of language in use. It consists of all the words, phrases, sentences, and even newspapers, magazines, books, and speech transcripts that have been collected and put into it. Usually, the sample will be taken from real examples of what people have said or written. Corpora can include spoken language and written language; the language of children, women, and men; language from any corner of the globe; language from a particular year or decade; and more.

Sometimes a corpus has a very specific focus on certain types of language – perhaps spoken language from casual conversations between professional women in London, or language from printed newspapers in the UK – but some corpora (the plural of corpus) consist of many different types of language.

Additionally, many corpora take their words from particular times, so they can provide a historical record of how language was used and perhaps how it has changed over time.

Corpora allow us to look at language and see patterns and examples in language use. For example, now that most corpora are digitised, you can search for when a word first appeared in print, or how many times a word appears in one type of writing or speech.
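As a rough illustration of this kind of search, the Python sketch below counts how often a word appears in each type of text and finds the earliest year in which it occurs. The tiny `corpus` list, its `year` and `genre` fields, and the function names are all invented for this example; they are not taken from ICE-GB or any real corpus, and real corpus tools handle tokenisation far more carefully.

```python
import re
from collections import Counter

# Hypothetical toy corpus: each entry is one text with made-up metadata.
corpus = [
    {"year": 1991, "genre": "newspaper",
     "text": "The internet is a new network of networks."},
    {"year": 1992, "genre": "conversation",
     "text": "Did you email her? No, I will email her later."},
    {"year": 1990, "genre": "newspaper",
     "text": "Email is changing how offices work."},
]

def tokenize(text):
    """Split text into lowercase word tokens (a deliberately simple rule)."""
    return re.findall(r"[a-z']+", text.lower())

def count_by_genre(word, texts):
    """Count occurrences of `word` in each genre of text."""
    counts = Counter()
    for entry in texts:
        counts[entry["genre"]] += tokenize(entry["text"]).count(word)
    return counts

def first_year(word, texts):
    """Return the earliest year in which `word` occurs, or None if absent."""
    years = [e["year"] for e in texts if word in tokenize(e["text"])]
    return min(years) if years else None

email_counts = count_by_genre("email", corpus)
email_first = first_year("email", corpus)
print(email_counts, email_first)
```

With only three short texts the numbers mean very little, of course; the point is only that once texts are stored with information about their date and type, questions like these become simple searches.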

Corpora are very helpful for carrying out statistical studies of language, perhaps to work out how common certain expressions are in relation to other expressions, or whether certain words are preferred over others in spoken or written texts.
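One simple statistic of this kind is a relative frequency: how often an expression occurs per million words, which lets us compare samples of different sizes. The sketch below is a minimal illustration, assuming two invented one-sentence samples (`samples`) and a hypothetical helper called `per_million`; it is not how any particular corpus tool works.

```python
import re

# Hypothetical samples; real corpora hold many thousands of words or more.
samples = {
    "spoken": "well I kind of think it is sort of fine and kind of easy",
    "written": "the results are rather clear and the method is rather simple indeed",
}

def tokenize(text):
    """Split text into lowercase word tokens (a deliberately simple rule)."""
    return re.findall(r"[a-z']+", text.lower())

def per_million(phrase, text):
    """Return occurrences of `phrase` per million words of `text`."""
    tokens = tokenize(text)
    target = tokenize(phrase)
    if not tokens or not target:
        return 0.0
    n = len(target)
    # Slide a window of n words along the text and count exact matches.
    hits = sum(1 for i in range(len(tokens) - n + 1)
               if tokens[i:i + n] == target)
    return hits / len(tokens) * 1_000_000

for expression in ("kind of", "rather"):
    for mode, text in samples.items():
        print(expression, mode, round(per_million(expression, text)))
```

In these made-up samples, 'kind of' appears only in the spoken text and 'rather' only in the written one. A real study would need much larger samples before drawing any conclusion about which expression speakers or writers prefer.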

Perhaps most importantly, corpora allow linguists, teachers and students to study real language as it is actually used. Instead of making up rules about what people should say or write, we can see what real people have actually said or written and how they have used their language in its context. We can ask ‘What do people actually say?’ and then explore why.

Much of the content on this website is drawn from the ICE-GB corpus (the International Corpus of English, Great Britain component), which is housed at the Survey of English Usage at University College London. 

Englicious (C) Survey of English Usage, UCL, 2012-17 | Supported by the AHRC and EPSRC.