
Spoken English Corpus


The Spoken English Corpus (SEC) is a speech corpus used in corpus linguistics. It consists of recordings of spoken British English compiled between 1984 and 1987 through a collaboration, funded by IBM, between the Unit for Computer Research on the English Language (UCREL) at the University of Lancaster and the IBM Scientific Centre in Winchester. The corpus comprises 53 passages, mainly recorded from the BBC, spoken in the accent usually referred to as Received Pronunciation (RP). It covers categories such as commentary, news broadcast, lecture and dialogue. The corpus contains 52,637 words, with a total recording time of 339 minutes. The compilation of the corpus is described by Lita Taylor in her 1996 article "The Compilation of the Spoken English Corpus."
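The size figures quoted above imply some rough averages. The following is a minimal sketch that only does arithmetic on the three published totals; the variable names are illustrative, and the averages say nothing about how individual passages or categories actually vary.

```python
# Totals as stated in the article text.
TOTAL_WORDS = 52_637
TOTAL_MINUTES = 339
PASSAGES = 53

# Simple averages derived from the totals (not figures reported by the compilers).
words_per_minute = TOTAL_WORDS / TOTAL_MINUTES
words_per_passage = TOTAL_WORDS / PASSAGES
minutes_per_passage = TOTAL_MINUTES / PASSAGES

print(f"{words_per_minute:.1f} words per minute")        # 155.3 words per minute
print(f"{words_per_passage:.0f} words per passage")      # 993 words per passage
print(f"{minutes_per_passage:.1f} minutes per passage")  # 6.4 minutes per passage
```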

A system was devised for transcribing the intonation of the recorded material, and two transcribers, Gerry Knowles and Briony Williams, analysed the entire corpus. The transcription system is explained by Williams. To assess the degree of agreement between the two transcribers, Brian Pickering conducted an experiment on a section of the corpus, containing around 1000 tone-units, that both had transcribed. Good agreement was found.
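The article does not say which measure was used to assess agreement between the transcribers. As an illustration only, one common way to quantify agreement between two annotators who label the same items is Cohen's kappa, which corrects the raw agreement rate for agreement expected by chance. The sketch below assumes each tone-unit receives a single categorical label; the label names and the eight-item data set are hypothetical and are not taken from the SEC.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label sequences must be non-empty and of equal length")
    n = len(labels_a)

    # Observed agreement: fraction of items given the same label by both.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Chance agreement: probability that both pick the same label
    # if each labelled at random according to their own label frequencies.
    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    expected = sum(count_a[k] * count_b[k] for k in count_a) / (n * n)

    if expected == 1.0:
        # Both annotators used one identical label throughout.
        return 1.0
    return (observed - expected) / (1 - expected)


# Hypothetical tone labels for eight tone-units (not real SEC data).
transcriber_1 = ["fall", "rise", "fall", "level", "fall-rise", "fall", "rise", "fall"]
transcriber_2 = ["fall", "rise", "fall", "fall", "fall-rise", "fall", "level", "fall"]

kappa = cohen_kappa(transcriber_1, transcriber_2)
print(round(kappa, 2))  # 0.6
```

Here the two hypothetical transcribers agree on 6 of 8 tone-units (0.75), chance agreement is 0.375, and kappa is therefore 0.6. A real comparison would also have to decide how to handle disagreements about where tone-unit boundaries fall, which this sketch ignores.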

Grammatical tagging of each word was added to the text of the SEC by an automatic process. Because this tagging was in machine-readable form, it became possible to relate grammatical and prosodic information in the texts. Subsequent work used probabilistic models to develop the grammatical tagging further and to produce automatic parsing techniques.

Although the text and its associated tagging existed in machine-readable form, the recordings themselves existed only as tape recordings. A collaboration between speech scientists at the Universities of Lancaster and Leeds in the United Kingdom, funded by the Economic and Social Research Council from 1992 to 1994, set out to produce a version of the corpus that contained the recordings in digital form, time-linked to the text. The principal researchers were Gerry Knowles and Tamas Varadi (Lancaster) and Peter Roach and Simon Arnfield (Leeds). The outline of the project is set out in Knowles, and the automatic time-alignment is described by Roach and Arnfield. The digitized recordings were stored on CD-ROM; they were subsequently made available for download for research purposes from Leeds University, though this facility is no longer supported.


Wikipedia