2. Taking into account all the morphological and semantic peculiarities of the verbal
adjective (Atoochtuk) in the Kyrgyz language, we suggest the following tagset for verbal
adjectives, relying on the tools of the Apertium Turkic Lexicon:
for ‘past-tense verbal adjective’, e.g., келген конок – the guest who has arrived (past)
for ‘future-tense verbal adjective’, e.g., келер конок – the guest who should come (soon) (future)
for ‘present-tense verbal adjective’, e.g., келүүчү конок – the arriving guest (present)
Acknowledgment
I would like to express my deepest gratitude and appreciation to my supervisor, Aida
Kasieva, whose guidance, support, encouragement, feedback, and advice have been
invaluable throughout this research.
It was a great honour for me to write about ‘Kyrgyz Corpus Linguistics’ as it allows us to
trace the Kyrgyz language.
GLOSSARY
Apertium
a free/open-source platform for developing rule-based machine
translation systems
Annotation
tagging of language data in text or spoken form
Annotated / labeled corpus
a corpus of texts carrying special labels that make it possible
to retrieve data (statistics, language examples, etc.) from the
corpus for any linguistic parameter (part of speech,
grammatical form, syntactic function, etc.)
Balanced corpus
a representative corpus whose components are presented in a
“layered” form, which makes it possible to establish the
pattern of occurrence of the linguistic phenomenon under
investigation against the background of extralinguistic information
Grammar markup, “tagger”
a program that automatically performs grammatical
(morphological) markup of corpus texts
Collocate
a word or word form that occurs as a close neighbor of a given
word (word form)
Collocation
a regular, stable combination of words, taking into account
morphological and syntactic conditions that ensure the
compatibility of linguistic units
Concordance
1) an index that associates each usage with its context; 2) an
automatically obtained set of contexts for a given phenomenon
(word / phrase / grammatical form, etc.)
Corpus
a collection of texts, usually in a machine-readable format,
including information about the situation in which the text was
produced, such as information about the speaker, author,
recipient, or audience
Corpus markup
a system of standard codes inserted into a document stored in
electronic form to provide information about the text itself
Lemma
an initial (dictionary) form for a given word form
Lemmatization
a process of generating initial forms for word forms
Parser
a computer program that performs automatic markup of text at
the syntactic or semantic level
Parsing
analysis of the syntactic structure of a sentence and its
presentation in the form of a tree or structure of components
Subcorpus
a group of texts within the corpus that share some parameter
(language, genre, etc.)
Token
a specific word in the text, word form, text form, word usage
Tokenization
splitting the stream of characters in natural language texts into
separate significant units (tokens)
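
To make the terms Tokenization, Token and Concordance (sense 2) more concrete, here is a minimal Python sketch. It is an illustration only, not a tool used in this work. The function names `tokenize` and `concordance`, the regular-expression rule for word boundaries and the context width are all assumptions made for the example. The sample string reuses the three verbal-adjective phrases quoted earlier.

```python
import re

def tokenize(text):
    """Split a character stream into word tokens (simplified, assumed rule)."""
    # \w+ matches runs of Unicode letters and digits, so Cyrillic words stay whole.
    return re.findall(r"\w+", text.lower())

def concordance(tokens, keyword, width=1):
    """Collect every occurrence of keyword with up to `width` tokens of context per side."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok == keyword:
            left = tokens[max(0, i - width):i]
            right = tokens[i + 1:i + 1 + width]
            lines.append((left, tok, right))
    return lines

# Sample text built only from the phrases cited in this work.
sample = "келген конок, келер конок, келүүчү конок"
tokens = tokenize(sample)
hits = concordance(tokens, "конок", width=1)

for left, key, right in hits:
    print(" ".join(left), "[" + key + "]", " ".join(right))
```

Each printed line is one context of the keyword. In this sample the left-hand neighbours of конок (келген, келер, келүүчү) are its collocates in the sense defined above.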