Composition

EANC is designed as a comprehensive corpus whose objective is to include as many Standard Eastern Armenian texts as practicable. As of March 2009, EANC comprises about 110 million tokens. Overall, we have been guided by the goal of comprehensive representation: all literary, scientific and oral texts available to us have been indexed for search. The only exceptions are certain widely available text types, such as electronic press and legal documents, whose share has been limited for the sake of balance among genres.

Because of its comprehensive nature, EANC differs inherently from the corpora of "major" languages, such as the Russian National Corpus or the British National Corpus, which select their collections. The BNC additionally imposes a limit on the number of words per document, truncating longer texts. EANC, on the other hand, includes the great majority of extant Eastern Armenian literary texts. In this respect, EANC is similar to the Czech National Corpus or the Slovak National Corpus.

The written discourse subcorpus of EANC includes 836 fiction texts, both prose and poetry (including 206 translated fiction titles), 7,858 newspaper issues and a sizeable collection of scientific and other non-fiction texts.

The Standard Eastern Armenian (SEA) oral discourse subcorpus (about 3 million tokens) is an important structural element of EANC. It comprises spontaneous dialogs, task-oriented interviews, TV talk shows, films, and other audio recordings, all transcribed for EANC. Recently added samples of online communication are intermediate between the oral and written registers; they have been placed in the oral subcorpus.

Each of the 9,960 document entries in EANC is labeled with metatext information specifying genre and other bibliographic details (e.g., date of creation or publication, name of the author).

EANC Composition
as of March 2009

Written discourse		# tokens	% EANC	# of docs

Fiction
	prose: novels	29 909 172	27,1%	371	incl. 99 translated
	prose: short stories	5 959 142	5,4%	183	incl. 56 translated
	prose: plays	1 411 030	1,3%	55	incl. 8 translated
	prose subtotal	37 279 344	33,8%	609

	poetry	3 648 160	3,3%	227	incl. 43 translated

Press		47 264 735	42,9%	7 858

Non-fiction
	science	13 875 930	12,6%	113	incl. 22 translated
	essays, memoirs, official, religious	4 735 997	4,3%	379	incl. 8 translated

Written discourse total		106 804 166	96,8%	9 186

Oral discourse		# tokens	% EANC	# of docs

	Oral spontaneous discourse	1 029 646	0,94%	208
	Oral public discourse	1 933 899	1,76%	543
	Oral task-oriented discourse	70 010	0,06%	22

+	Online communication	442 399	0,40%	1

Oral subcorpus total		3 475 954	3,2%	774

EANC Total		110 280 120	100%	9 960
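
The subtotals and percentage shares in the table can be recomputed from the individual token counts. The following is a minimal Python sketch of that arithmetic; the figures are copied from the table, while the dictionary names and the `share` helper are illustrative assumptions and not part of EANC itself.

```python
# Token counts copied from the composition table (as of March 2009).
# The names WRITTEN, ORAL and share() are illustrative, not EANC tooling.
WRITTEN = {
    "prose: novels": 29_909_172,
    "prose: short stories": 5_959_142,
    "prose: plays": 1_411_030,
    "poetry": 3_648_160,
    "press": 47_264_735,
    "science": 13_875_930,
    "essays, memoirs, official, religious": 4_735_997,
}
ORAL = {
    "oral spontaneous discourse": 1_029_646,
    "oral public discourse": 1_933_899,
    "oral task-oriented discourse": 70_010,
    "online communication": 442_399,
}


def share(part, whole):
    """Return `part` as a percentage of `whole`, rounded to one decimal."""
    return round(100 * part / whole, 1)


written_total = sum(WRITTEN.values())
oral_total = sum(ORAL.values())
eanc_total = written_total + oral_total

print("written:", written_total, "oral:", oral_total, "total:", eanc_total)
for name, tokens in {**WRITTEN, **ORAL}.items():
    print(f"{name}: {share(tokens, eanc_total)}% of EANC")
```

Run as written, the sums reproduce the written, oral and overall totals given in the table, which is a quick consistency check on the published figures.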

Most of the texts in EANC have been acquired by scanning printed sources and applying optical character recognition. Some fiction titles, however, as well as the modern press, have been downloaded from open internet archives (for more information and credits see Armenian texts online). The entire oral corpus consists of texts transcribed by EANC from 2006 to 2008 and by Victoria Khurshudian from 2003 to 2005. The following table shows the composition of EANC by type of source.

EANC composition - tokens by source type

Written discourse	OCR		downloaded		other
	tokens	% EANC	tokens	% EANC	tokens	% EANC

Fiction	38 672 087	36,2%	1 580 876	1,5%	674 541	0,6%
Press	12 709 536	11,9%	34 555 199	32,4%
Non-fiction	15 571 293	14,6%	2 222 181	2,1%	818 453	0,8%

Written discourse total	66 952 916	62,7%	38 358 256	35,9%	1 492 994	1,4%

Online communication	442 399	100%	downloaded

Oral discourse	3 033 555	100%	transcribed
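
The per-source totals for written discourse can be checked in the same way. The sketch below sums each source column and computes its share; the token counts come from the source-type figures above, the variable and function names are illustrative, and reading the empty "other" cell for Press as zero is an assumption. Judging from the arithmetic alone, the printed percentages appear to be taken relative to the written discourse total rather than to the whole corpus; the source does not state this explicitly.

```python
# Token counts by source type, copied from the figures above.
# Assumption: the blank "other" cell for Press is read as 0.
BY_SOURCE = {
    "Fiction": {"OCR": 38_672_087, "downloaded": 1_580_876, "other": 674_541},
    "Press": {"OCR": 12_709_536, "downloaded": 34_555_199, "other": 0},
    "Non-fiction": {"OCR": 15_571_293, "downloaded": 2_222_181, "other": 818_453},
}


def column_totals(rows):
    """Sum the token counts of every genre row for each source type."""
    totals = {}
    for cells in rows.values():
        for source, tokens in cells.items():
            totals[source] = totals.get(source, 0) + tokens
    return totals


totals = column_totals(BY_SOURCE)
written_total = sum(totals.values())

# Share of each source type within written discourse, one decimal place.
shares = {
    source: round(100 * tokens / written_total, 1)
    for source, tokens in totals.items()
}

for source, tokens in totals.items():
    print(f"{source}: {tokens} tokens, {shares[source]}% of written discourse")
```

The column sums agree with the "Written discourse total" row, and their grand total matches the written discourse total of the composition table.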