Composition
     

EANC is designed as a comprehensive corpus whose aim is to include as many Standard Eastern Armenian (SEA) texts as practicable. As of March 2008, EANC comprises well over 100 million tokens. We have been guided by the goal of comprehensive representation: all literary, scientific and oral texts available to us have been indexed for search. The only exceptions are certain widely available text types, such as electronic press and legal documents, whose presence in the search results has been limited for the sake of balance among genres. The total number of tokens available to queries is currently about 90 million.

Because of its comprehensive nature, EANC differs inherently from the sample corpora of "major" languages, such as the Russian National Corpus or the British National Corpus (BNC), which select the texts in their collections. The BNC additionally imposes a limit on the number of words per document, truncating particularly long texts. EANC, by contrast, includes the great majority of extant East Armenian literary texts. In this respect, EANC is similar to the Czech National Corpus or the Slovak National Corpus.

The written discourse subcorpus of EANC includes 1,039 fiction texts, both prose and poetry (including 184 translated fiction titles), 6,514 newspaper issues, and a sizeable collection of scientific and other nonfiction texts.

The SEA oral discourse subcorpus (1.84 million tokens) is an important structural element of EANC. It consists of spontaneous dialogs, task-oriented interviews, TV talk shows, films, and other audio recordings, all transcribed for EANC.

Each of the 8,113 document entries in EANC is labeled with metatext information specifying genre and other bibliographic details (e.g. date of creation or publication, name of the author).

EANC Composition as of March 2008

Written discourse                           # tokens    % EANC   # of docs
Fiction
  prose: novel                            23 982 848     27.1%         287
  prose: story                             5 318 243      6.0%         104
  prose: play                              1 298 774      1.5%          46
  prose subtotal                          30 599 865     34.6%         437
  poetry                                   2 459 421      2.8%         106
Press                                     35 258 177     39.9%       3 895
Nonfiction
  science                                 13 664 469     15.4%         109
  essays, memoirs, official, religious     4 629 156      5.2%         320
Written discourse total                   86 611 088     97.9%       4 867

Oral discourse                              # tokens    % EANC   # of docs
  Oral spontaneous discourse (OSD)         1 026 222     1.16%         156
  Oral public discourse (OPD)                753 061     0.85%         172
  Oral task-oriented discourse (OTOD)         65 552     0.07%          16
Oral discourse total                       1 844 835      2.1%         344

EANC Total                                88 455 923      100%       5 211
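The subtotals and percentages in the table can be rechecked with a few lines of arithmetic. The following is a minimal sketch, not part of EANC's own tooling: the token counts are copied from the table, while the dictionary names and the `share` helper are hypothetical names introduced for illustration.

```python
# Token counts copied from the March 2008 composition table.
# The variable and function names below are illustrative only.
written = {
    "prose: novel": 23_982_848,
    "prose: story": 5_318_243,
    "prose: play": 1_298_774,
    "poetry": 2_459_421,
    "press": 35_258_177,
    "science": 13_664_469,
    "essays, memoirs, official, religious": 4_629_156,
}
oral = {
    "OSD": 1_026_222,
    "OPD": 753_061,
    "OTOD": 65_552,
}

written_total = sum(written.values())
oral_total = sum(oral.values())
corpus_total = written_total + oral_total


def share(tokens, total=corpus_total):
    """Return tokens as a percentage of total, rounded to one decimal."""
    return round(100 * tokens / total, 1)


for genre, tokens in {**written, **oral}.items():
    print(f"{genre:40s}{tokens:>12,d}{share(tokens):>8.1f}%")
print(f"{'written total':40s}{written_total:>12,d}{share(written_total):>8.1f}%")
print(f"{'oral total':40s}{oral_total:>12,d}{share(oral_total):>8.1f}%")
```

Summing the rows reproduces the written, oral and overall totals given in the table, and each share is computed against the overall total of searchable tokens.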



Most of the texts in EANC were acquired by scanning printed sources and applying optical character recognition. Some fiction titles and the modern press, however, were downloaded from open internet archives (for more information and credits, see Armenian texts online). The entire oral corpus consists of texts transcribed by EANC from 2006 to 2007 and by Victoria Khurshudyan from 2003 to 2005. The following table shows the composition of EANC by type of source.

EANC composition: tokens by source type

                               scanned                downloaded
Written discourse             # tokens  % of row      # tokens  % of row
Fiction                     32 036 156       97%     1 023 130        3%
Press                       12 672 872       36%    22 585 305       64%
Nonfiction                  16 438 291       90%     1 855 334       10%
Written discourse total     61 147 319       71%    25 463 769       29%

                           transcribed
Oral discourse               1 844 835      100%
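The percentages in this table are shares within each row (scanned versus downloaded tokens of one genre), not shares of the whole corpus. A minimal sketch that recomputes them from the token counts above follows; `by_source` and `row_shares` are hypothetical names used only for this illustration.

```python
# (scanned, downloaded) token counts per genre, copied from the table.
by_source = {
    "Fiction": (32_036_156, 1_023_130),
    "Press": (12_672_872, 22_585_305),
    "Nonfiction": (16_438_291, 1_855_334),
}


def row_shares(scanned, downloaded):
    """Return the scanned and downloaded shares of one row, in whole percent."""
    total = scanned + downloaded
    return round(100 * scanned / total), round(100 * downloaded / total)


# Column totals for the written discourse subcorpus.
scanned_total = sum(s for s, _ in by_source.values())
downloaded_total = sum(d for _, d in by_source.values())

for genre, (scanned, downloaded) in by_source.items():
    print(genre, row_shares(scanned, downloaded))
print("Written discourse total", row_shares(scanned_total, downloaded_total))
```

The per-genre sums also match the fiction, press and nonfiction token counts in the composition table, so the two tables describe the same set of searchable texts.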