If you study languages and linguistics, you will very likely come across the word corpus. A corpus (plural corpora) is “a collection of written or spoken material stored on a computer and used to find out how language is used”. Corpora are used in academic fields such as corpus linguistics.
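To make "used to find out how language is used" a little more concrete, here is a minimal Python sketch that counts word frequencies in a tiny, made-up corpus. The three sentences, the `toy_corpus` name and the `word_frequencies` helper are all assumptions for illustration; real corpora hold millions of words and are usually explored with dedicated tools.

```python
import re
from collections import Counter

# A made-up three-sentence "corpus", purely for illustration.
toy_corpus = [
    "The cat sat on the mat.",
    "The dog sat on the log.",
    "A cat and a dog can be friends.",
]


def word_frequencies(texts):
    """Count how often each lowercased word occurs across all texts."""
    counts = Counter()
    for text in texts:
        # Very rough tokenisation: runs of letters and apostrophes.
        counts.update(re.findall(r"[a-z']+", text.lower()))
    return counts


frequencies = word_frequencies(toy_corpus)
print(frequencies["the"], frequencies["cat"])
```

Even in a toy example like this, the most frequent word is a function word ("the"), which is the kind of pattern corpus tools let you check on a much larger scale.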
There are several different types of corpora:
- Monolingual corpus: made up of texts in only one language, e.g. the British National Corpus (BNC), which contains 100 million words
- Parallel corpus: made up of two (or more) monolingual corpora which are translations of each other, e.g. the Europarl corpus
- Diachronic corpus: made up of texts from different time periods and used for investigating language change over time, e.g. the Diachronic Corpus of Present-Day Spoken English (DCPSE)
- Synchronic corpus: made up of texts from the same time period, e.g. the Brown University Standard Corpus of Present-Day American English
- Learner corpus: made up of texts produced by language learners, e.g. the ArabCC – Learner Corpus of English Essays
There are several web applications which allow you to work with corpora, including Sketch Engine, which will be featured in a separate blog post on this site.