Composition
EANC is designed as a comprehensive corpus whose objective is to include as many Standard Eastern Armenian (SEA) texts as practicable. As of March 2008, EANC comprises well over 100 million tokens. Overall, we have been guided by the goal of comprehensive representation: all literary, scientific and oral texts available to us have been indexed for search. The only exceptions are certain widely available texts, such as electronic press and legal documents, whose presence in the search results has been limited for the sake of balance among genres. The total number of tokens available to queries is currently about 90 million. Because of its comprehensive nature, EANC is inherently different from the sample corpora of the "major" languages, such as the Russian National Corpus or the British National Corpus, which build their collections selectively. The BNC additionally imposes a limit on the number of words per document, truncating particularly long texts. EANC, on the other hand, includes the great majority of all extant East Armenian literary texts. In this respect, EANC is similar to the Czech National Corpus or the Slovak National Corpus.
Most of the texts in EANC were acquired by scanning printed sources and applying optical character recognition. Some fiction titles, however, as well as modern press, were downloaded from open internet archives (for more information and credits, see Armenian texts online). The entire oral corpus consists of texts transcribed by EANC from 2006 to 2007 and by Victoria Khurshudyan from 2003 to 2005. The following chart represents EANC composition by type of source.
[Chart: EANC composition by type of source]