Texts in the corpus are annotated with morphological tags, with each token carrying a set of
tags, and with special metric and prosodic tags that enable searching within lines of a
specific metre, within rhyming parts, etc. Texts are shown to users with translations of the
words into Russian, which makes the system useful not only to speakers of the Bashkir language,
but also to researchers in the humanities and in prosody, and to linguistic typologists.
3) The Tatar corpus "Tugan Tel" is a linguistic resource of the modern literary Tatar language.
The project is supported by the Basic Research Program of the Presidium of the Russian
Academy of Sciences. The developed corpus is addressed to a wide range of users: linguists,
specialists in the field of Tatar linguistics, typologists, teachers of the Tatar language, cultural
figures, as well as everyone who studies or is interested in the Tatar language. As of
September 2013, the corpus contains more than 26 million word usages. It includes texts
of various genres (fiction, media texts, official documents, textbooks, scientific
publications, etc.). Each document has a metadata description (authors, their gender,
publication details, dates, genres, parts, chapters, etc.). The texts included in the corpus
are provided with
morphological markup (information about the part of speech and the grammatical characteristics
of the word form). Morphological markup of the corpus is carried out automatically using a
module for two-level morphological analysis of the Tatar language, implemented in the
PC-KIMMO software toolkit.
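To make the combination of document metadata and per-token morphological markup more concrete, the following is a minimal sketch in Python. It is an illustration only, not the actual storage format or search engine of any of the corpora described above. The class names, the function find_tokens, the tag labels (N, SG, PL, NOM) and the sample documents are all hypothetical, and the tags are written by hand rather than produced by PC-KIMMO.

```python
from dataclasses import dataclass, field


@dataclass
class Token:
    form: str             # word form as it appears in the text
    tags: frozenset       # hypothetical morphological tags for this word form


@dataclass
class Document:
    title: str            # metadata fields are a small, assumed subset
    genre: str
    author: str
    tokens: list = field(default_factory=list)


def find_tokens(docs, required_tags, genre=None):
    """Yield (document title, word form) for tokens that carry all required tags.

    If genre is given, only documents of that genre are searched.
    """
    required = frozenset(required_tags)
    for doc in docs:
        if genre is not None and doc.genre != genre:
            continue
        for tok in doc.tokens:
            if required <= tok.tags:  # subset test: token has every required tag
                yield doc.title, tok.form


# Illustrative, hand-tagged sample data (not taken from any real corpus).
sample_docs = [
    Document(
        title="Sample A",
        genre="fiction",
        author="(unknown)",
        tokens=[
            Token("китап", frozenset({"N", "SG", "NOM"})),
            Token("китаплар", frozenset({"N", "PL", "NOM"})),
        ],
    ),
    Document(
        title="Sample B",
        genre="media",
        author="(unknown)",
        tokens=[Token("китаплар", frozenset({"N", "PL", "NOM"}))],
    ),
]

plural_nouns = list(find_tokens(sample_docs, {"N", "PL"}))
plural_nouns_in_media = list(find_tokens(sample_docs, {"N", "PL"}, genre="media"))
```

Storing the tags of each token as a set lets a query ask for any combination of grammatical features with a single subset test, while the document-level fields allow the same query to be restricted by genre or other metadata.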
Experience in developing various Turkic corpora has positively influenced the
development of the Kyrgyz Corpus as well. However, the problem of creating the Kyrgyz
Corpus still remains relevant.