Diploma thesis topic: Tagging "Participles" in the New Kyrgyz Corpus


 The Development of Turkic Corpora



Date: 08.02.2022
1.1.5. The Development of Turkic Corpora
Turkic corpus linguistics began to develop intensively only in the 1990s, so projects to
create national corpora for Turkic languages remain especially relevant. At present, only a
small number of representative text corpora exist for Turkic languages. They include:
1) The Turkish National Corpus (TNC) is considered a balanced and representative corpus
of modern Turkish. The TNC generally follows the framework of the British National
Corpus (BNC), with adjustments to the corpus design made wherever needed.
Throughout the process, different types of open-source software are used for specific tasks, and
the resulting corpus is a free resource for non-commercial use.
The TNC, with a size of 50 million words, is a balanced and representative corpus of
contemporary Turkish. It consists of samples of textual data across a wide variety of genres,
covering a period of 24 years (1990-2013). The written component consists of texts produced in
different domains on various topics. Transcriptions of spoken data constitute 2% of the TNC's
database and comprise spontaneous everyday conversations and speeches collected in
particular communicative settings. Users will be able to query the 50-million-word collection
by defining restrictions on medium, text sample, domain, derived text type, sex of author,
type of author, text genre, and audience of the text. In TNC Version 3.0, users will also be
able to query by the part of speech (POS) of words and by suffixes. Moreover, they will be able
to search for multiword units and to submit queries using regular expressions.
2) The Bashkir National Corpus is the first corpus of the Bashkir language and the second poetic
corpus in the world (after the Russian one). A distinctive feature of this corpus is its text
collection, which comprises verse by Bashkir poets of the 20th and early 21st centuries.
Texts in the corpus are annotated with morphological tags, with each token carrying a set of
tags, and with special metric and prosodic tags, which enable searches for lines of a specific
metre, for rhyming parts, etc. Texts are shown to users with word-by-word translation into
Russian, which makes the system useful not only to speakers of Bashkir but also to researchers
in the humanities and in prosody, and to linguistic typologists.
3) The Tatar corpus "Tugan Tel" is a linguistic resource for the modern literary Tatar language.
The project is supported by the Basic Research Program of the Presidium of the Russian
Academy of Sciences. The corpus is addressed to a wide range of users: linguists,
specialists in the field of Tatar linguistics, typologists, teachers of the Tatar language, cultural
figures, and everyone who studies or is interested in the Tatar language. As of September 2013,
the corpus contains more than 26 million word usages. It includes texts
of various genres (fiction, media texts, texts of official documents, textbooks, scientific
publications, etc.). Each document has a metadata description (authors, their gender, publication
details, creation dates, genres, parts, chapters, etc.). The texts included in the corpus are provided
with morphological markup (information about the part of speech and the grammatical
characteristics of the word form). The morphological markup is carried out automatically using a
module for two-level morphological analysis of the Tatar language, implemented in the
PC-KIMMO software toolkit.
Experience in developing various Turkic corpora has positively influenced the
development of the Kyrgyz Corpus as well. However, the task of creating the Kyrgyz
Corpus remains relevant.
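
The query facilities described above for the TNC and the Tatar corpus (metadata restrictions, search by POS and by suffix, regular expressions) can be illustrated with a minimal Python sketch. It is not the implementation of any of these corpora: the `Token` and `Document` classes, the `search` function, the tag names (`NOUN`, `PL`, `LOC`, etc.) and the metadata keys are all hypothetical and chosen only for illustration.

```python
import re
from dataclasses import dataclass


@dataclass
class Token:
    form: str        # surface word form
    pos: str         # part-of-speech tag (hypothetical tagset)
    suffixes: tuple  # morphological suffix tags (hypothetical labels)


@dataclass
class Document:
    meta: dict       # document-level metadata, e.g. {"genre": "fiction"}
    tokens: list     # list of Token objects


def search(corpus, pattern=None, pos=None, suffix=None, **restrictions):
    """Return (document index, token) pairs that satisfy every given criterion."""
    regex = re.compile(pattern) if pattern else None
    hits = []
    for i, doc in enumerate(corpus):
        # Skip documents whose metadata does not meet all restrictions.
        if any(doc.meta.get(key) != value for key, value in restrictions.items()):
            continue
        for tok in doc.tokens:
            if regex and not regex.fullmatch(tok.form):
                continue
            if pos and tok.pos != pos:
                continue
            if suffix and suffix not in tok.suffixes:
                continue
            hits.append((i, tok))
    return hits


# A toy two-document corpus with Turkish word forms (assumed annotation).
corpus = [
    Document({"genre": "fiction", "medium": "book"},
             [Token("evlerde", "NOUN", ("PL", "LOC")),
              Token("geldi", "VERB", ("PAST",))]),
    Document({"genre": "news", "medium": "newspaper"},
             [Token("kitaplar", "NOUN", ("PL",)),
              Token("evler", "NOUN", ("PL",))]),
]

plural_in_fiction = search(corpus, suffix="PL", genre="fiction")
forms_starting_with_ev = search(corpus, pattern=r"ev.*")
verbs = search(corpus, pos="VERB")
```

In this sketch the metadata restrictions filter whole documents first, and the token-level criteria (regular expression, POS, suffix) are applied only inside the documents that remain. A real corpus of tens of millions of words would rely on an index rather than a linear scan.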

