Diploma thesis topic: Tagging "Participles" in the New Kyrgyz Corpus

The Development of Turkic Corpora

Date: 08.02.2022
1.1.5. The Development of Turkic Corpora
Turkic corpus linguistics began to develop intensively only in the 1990s, so projects to
create national corpora for Turkic languages remain especially relevant. At present, only a
small number of representative text corpora exist for Turkic languages, including:
1) The Turkish National Corpus (TNC) is considered a balanced and representative corpus
of modern Turkish. The TNC generally follows the framework of the British National
Corpus (BNC), with adjustments to the corpus design made wherever needed.
Throughout the process, different types of open-source software were used for specific tasks,
and the resulting corpus is a free resource for non-commercial use.
The TNC, with a size of 50 million words, consists of samples of textual data across a wide
variety of genres covering a period of 24 years (1990-2013). The written component consists of
texts produced in different domains on various topics. Transcriptions of spoken data constitute
2% of the TNC's database and comprise spontaneous everyday conversations and speeches
collected in particular communicative settings. Users will be able to restrict queries over the
50-million-word collection by medium, text sample, domain, derived text type, sex of author,
type of author, text genre, and the audience of the text. In TNC Version 3.0, users will be able
to query by the part of speech (POS) of words and by suffixes. Moreover, they will be able to
search for multiword units and to send queries using regular expressions.
2) The Bashkir National Corpus is the first corpus of the Bashkir language and the second
poetic corpus in the world (after the Russian one). A specific feature of this corpus is its
text collection, which comprises verse by Bashkir poets of the 20th and early 21st centuries.
Texts in the corpus are annotated with morphological tags, each token carrying a set of
tags, and with special metrical and prosodic tags that enable searching for lines of a specific
metre, for rhyming parts, etc. Texts are shown to users with word-by-word translation into
Russian, which makes the system useful not only to speakers of Bashkir but also to researchers
in the humanities, in prosody, and in linguistic typology.
3) The Tatar corpus "Tugan Tel" is a linguistic resource for the modern literary Tatar language.
The project is supported by the Basic Research Program of the Presidium of the Russian
Academy of Sciences. The corpus is addressed to a wide range of users: linguists,
specialists in Tatar linguistics, typologists, teachers of the Tatar language, cultural
figures, and anyone who studies or is interested in the Tatar language. As of September 2013,
the corpus contained more than 26 million word tokens. It includes texts
of various genres (fiction, media texts, official documents, textbooks, scientific
publications, etc.). Each document has a metadata description (authors, their gender,
publication details, creation dates, genres, parts, chapters, etc.). The texts in the corpus
are provided with morphological markup (information about the part of speech and the
grammatical characteristics of the word form). The morphological markup is produced
automatically by a module for two-level morphological analysis of the Tatar language,
implemented in the PC-KIMMO software toolkit.
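The kinds of query described for these corpora (restricting by document metadata, by part of speech, by suffix, or by a regular expression) can be illustrated with a minimal Python sketch. It is not the query engine of the TNC or of any other corpus mentioned here: the class names, the `query` function, the tag labels, and the four Turkish word forms are all assumptions invented for illustration.

```python
import re
from dataclasses import dataclass


@dataclass
class Token:
    form: str        # surface word form
    pos: str         # part-of-speech tag (illustrative tagset, assumed)
    suffixes: tuple  # suffix segments (illustrative segmentation, assumed)


@dataclass
class Document:
    meta: dict       # document-level metadata, e.g. medium and genre
    tokens: list     # annotated tokens of the document


# Toy data invented for this sketch; not taken from any real corpus.
CORPUS = [
    Document(
        meta={"medium": "written", "genre": "fiction"},
        tokens=[Token("evler", "NOUN", ("ler",)), Token("geldi", "VERB", ("di",))],
    ),
    Document(
        meta={"medium": "spoken", "genre": "conversation"},
        tokens=[Token("kitaplar", "NOUN", ("lar",)), Token("okudu", "VERB", ("du",))],
    ),
]


def query(corpus, meta=None, pos=None, suffix=None, pattern=None):
    """Return the word forms that satisfy every restriction that was given."""
    regex = re.compile(pattern) if pattern else None
    hits = []
    for doc in corpus:
        # Metadata restriction: every requested key must match the document.
        if meta and any(doc.meta.get(k) != v for k, v in meta.items()):
            continue
        for tok in doc.tokens:
            if pos and tok.pos != pos:
                continue
            if suffix and suffix not in tok.suffixes:
                continue
            # The regular expression must match the whole word form.
            if regex and not regex.fullmatch(tok.form):
                continue
            hits.append(tok.form)
    return hits


nouns = query(CORPUS, pos="NOUN")
written_verbs = query(CORPUS, meta={"medium": "written"}, pos="VERB")
plural_lar = query(CORPUS, suffix="lar")
k_initial = query(CORPUS, pattern=r"k.*")
```

Storing the annotation on each token and the metadata on each document keeps the two kinds of restriction independent, so they can be combined freely in a single query.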
Experience in developing various Turkic corpora has positively influenced the
development of the Kyrgyz Corpus as well. However, the problem of creating the Kyrgyz
Corpus remains relevant.
