As of March 2009, oral discourse in EANC includes over 3 million tokens with the following distribution:
Oral discourse in EANC is represented by the Yerevan standard. This choice is justified by the fact that the Yerevan standard is the spoken variety closest to Standard Eastern Armenian. Historically, the Yerevan (Araratian) dialect served as the spoken prototype for the Eastern Armenian literary tradition.
The entire EANC oral discourse corpus has been compiled by the EANC team. Raw video and audio data were recorded in MPEG/WAV format and subsequently transcribed. Written permission to record respondents was obtained whenever possible. For ethical reasons, names and other identity markers in the oral spontaneous subcorpus have been replaced by placeholders (randomly chosen capital letters).
A small subcorpus of online communication (internet forum posts, blogs, etc.; 442,399 tokens), included in the oral subcorpus, consists of texts that are linguistically intermediate between oral and written discourse; see the relevant checkbox under Oral in the Subcorpus Selection window.
Oral Public Discourse (currently at 1.9 million tokens) was compiled from video recordings. It includes public debates, talk shows, interviews, etc. broadcast by Armenian TV stations, including PTV1, PTV2, Kentron, Yerkir Media, Armenia TV, and TV5.
Transcription
Once the raw audio/video data has been obtained, it is given a "shallow" transcription, which follows traditional Armenian orthography and punctuation, with the addition of several special tags: == for false starts, = for fragmented words, <> for ambiguous words, ## for comments. A more detailed discourse transcription, of the kind used in some other oral corpora, may be implemented in the future. Three audio samples accompanied by their "shallow" transcriptions are provided for your reference.
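To illustrate how the special tags might be processed automatically, here is a minimal Python sketch that counts tagged items in one transcript line. The text above names the four tags but does not specify their exact placement, so the conventions used here are assumptions made for illustration only: `==` and `=` are taken to be suffixes on a token, while `<>` and `##` are taken to enclose their content. The function name `classify_tokens` and the sample line are hypothetical and are not part of EANC.

```python
import re
from collections import Counter

# ASSUMPTION: comments are enclosed as #...# and ambiguous words as <...>.
# The source text only names the tags; their placement is not specified.
COMMENT_RE = re.compile(r"#([^#]*)#")
AMBIGUOUS_RE = re.compile(r"<([^<>]*)>")


def classify_tokens(line: str) -> Counter:
    """Count tag types in one line of a "shallow" transcription (sketch)."""
    counts = Counter()

    # Count comments, then remove them so their content is not tokenized.
    counts["comment"] = len(COMMENT_RE.findall(line))
    line = COMMENT_RE.sub(" ", line)

    # Count ambiguous words, then remove them for the same reason.
    counts["ambiguous"] = len(AMBIGUOUS_RE.findall(line))
    line = AMBIGUOUS_RE.sub(" ", line)

    for token in line.split():
        # ASSUMPTION: "==" and "=" are written as suffixes on the token.
        # Check "==" first, since a token ending in "==" also ends in "=".
        if token.endswith("=="):
            counts["false_start"] += 1
        elif token.endswith("="):
            counts["fragment"] += 1
        else:
            counts["plain"] += 1
    return counts


# Invented example line, not taken from the corpus.
sample = "I wen== went ho= <home> #laughs#"
print(classify_tokens(sample))
```

If the actual transcripts place the tags differently (for example, as prefixes), only the two regular expressions and the suffix checks would need to change.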
A sample of the Goris dialect is provided as a reference point for Armenian dialectal variety.
Recording and processing