What is a corpus?

If you study languages or linguistics, you are very likely to come across the word corpus. A corpus (plural: corpora) is “a collection of written or spoken material stored on a computer and used to find out how language is used”. Corpora are the main source of data in academic fields such as corpus linguistics.
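To make “finding out how language is used” a little more concrete, here is a minimal sketch of one of the simplest corpus queries: counting how often each word occurs. The two sentences and the helper names (tokenize, word_frequencies) are invented for illustration; a real corpus would be far larger and would normally need a more careful tokenizer.

```python
import re
from collections import Counter

# A tiny made-up corpus: each string stands in for one text.
corpus = [
    "The cat sat on the mat.",
    "The dog sat on the log.",
]

def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())

def word_frequencies(texts):
    """Count how often each token occurs across all texts."""
    counts = Counter()
    for text in texts:
        counts.update(tokenize(text))
    return counts

frequencies = word_frequencies(corpus)
print(frequencies["the"])  # 4
print(frequencies["sat"])  # 2
```

Frequency lists like this are usually only a starting point, but they show why it matters that a corpus is stored on a computer: the same count over millions of words would be impractical by hand.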

There are several different types of corpora:

Monolingual corpus: made up of texts in only one language, e.g. the British National Corpus (BNC), which contains 100 million words
Parallel corpus: made up of two (or more) monolingual corpora whose texts are translations of each other, e.g. the Europarl corpus
Diachronic corpus: made up of texts from different time periods and used to investigate how language changes over time, e.g. the Diachronic Corpus of Present-Day Spoken English (DCPSE)
Synchronic corpus: made up of texts from the same time period, e.g. the Brown University Standard Corpus of Present-Day American English
Learner corpus: made up of texts produced by language learners, e.g. the ArabCC – Learner Corpus of English Essays
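As a rough illustration of what makes a parallel corpus different from a monolingual one, the sketch below stores each sentence together with its translation and looks up the pairs that contain a given word. The sentence pairs are invented for this example (they are not taken from Europarl), and the function name find_aligned is hypothetical.

```python
# Hypothetical aligned sentence pairs (English, French), invented for illustration.
parallel_corpus = [
    ("The session is open.", "La séance est ouverte."),
    ("I thank the President.", "Je remercie le Président."),
    ("The session is closed.", "La séance est levée."),
]

def find_aligned(pairs, word):
    """Return every (source, target) pair whose source sentence contains `word`."""
    word = word.lower()
    matches = []
    for source, target in pairs:
        # Strip common punctuation so "open." still matches "open".
        tokens = [token.strip(".,;:!?") for token in source.lower().split()]
        if word in tokens:
            matches.append((source, target))
    return matches

for source, target in find_aligned(parallel_corpus, "session"):
    print(source, "|", target)
```

Because every sentence is kept alongside its translation, a search in one language immediately shows how the same idea was expressed in the other.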

There are several web applications which allow you to work with corpora, including Sketch Engine, which will be featured in a separate blog post on this site.
