1.1.2 Brief History of Corpus Linguistics
The Brown Corpus was the first computer-readable general corpus of texts prepared for
linguistic research on modern English. It includes about 500 texts from American
newspapers, books and magazines, all published in the United States in 1961. Each text
consists of 2,000 words, so the whole collection contains 1 million words (500 texts of 2,000
words each). The authors of the corpus are W. Francis and G. Kučera (Zacharov, Bagdanova,
2011).
It is by now customary to distinguish between the pre-electronic and post-electronic eras in
the development of corpus linguistics. Svartvik (2007), for example, notes that the initials BC,
for a corpus linguist, stand for Before Computers. The pre-electronic period refers to corpus
studies that were predecessors of contemporary work and were mostly done before the
1960s. For some, the early studies go back to the thirteenth-century indexing work on the Bible;
for others, they go back only as far as the early twentieth-century work of American
Structuralism in collecting textual samples of language use (Leech, 1992).
Advances in computer technology, such as increased storage capacity and more sophisticated
software, had a major impact on the progress of corpus linguistics. In fact, it is such
advances that have enabled corpus linguistics to achieve its status today. Equally, we may
say that linguistics provided a strong impetus for developing many practical applications in
computing in general, because it demanded new types of software for processing natural language
in its complex manifestations at different levels. Apart from the concordances derived from data
stored on punch cards that appeared in the late 1950s, Francis and Kučera (1964) constructed the
first-ever electronic corpus of written English at Brown University, drawing on texts published
in 1961. With its size of one million words, the Brown Corpus set the standard for corpus design.
The developments following the Brown Corpus are described as five phases or stages in Renouf
(2007, p. 28). The stages are determined on the basis of the periods in which a specific corpus
was constructed as well as the “types, styles and design” of the corpora of the time.
1. 1960s onwards: the one-million-word (or less) Small Corpus (standard, general and
specialized, sampled, multimodal, multidimensional);
2. 1980s onwards: the multimillion-word Large Corpus (standard, general and specialized,
sampled, multimodal, multidimensional);
3. 1990s onwards: the ‘Modern Diachronic’ Corpus (dynamic, open-ended, chronological data
flow);
4. 1998 onwards: The Web as corpus (Web texts as sources of linguistic information);
5. 2005 onwards: The Grid (pathway to distributed corpora, consolidation of existing corpus
types).
1.1.1.1 Early corpus linguistics
The idea of collecting texts for the purpose of language analysis is not a new concept. People
in the Middle Ages began to make lists of all the words occurring in particular texts, together
with their contexts. Other scholars compiled their own lists of the most frequent words in
collections of texts.
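The two kinds of list mentioned above, words shown with their contexts (a concordance) and
words ranked by frequency, can be illustrated with a minimal sketch. It is not a reconstruction
of any historical procedure: the tokenization rule, the function names and the sample sentence
are all assumptions made for illustration only.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into word tokens (a deliberately simple rule)."""
    return re.findall(r"[a-z']+", text.lower())


def frequency_list(tokens, n=5):
    """Return the n most frequent word forms with their counts."""
    return Counter(tokens).most_common(n)


def concordance(tokens, keyword, width=3):
    """Return every occurrence of keyword with up to `width` tokens of context per side."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok == keyword:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append((left, tok, right))
    return lines


# Hypothetical sample text, invented for this illustration.
sample = "The cat sat on the mat. The dog sat on the log."
tokens = tokenize(sample)

print(frequency_list(tokens, 3))
for left, kw, right in concordance(tokens, "sat", width=2):
    print(f"{left:>12} [{kw}] {right}")
```

Each concordance line places the keyword between its left and right context, which is the
keyword-in-context layout that later electronic concordancers produce automatically.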
McEnery and Wilson use the term “early corpus linguistics” for all corpus-based work
done before the advent of Chomsky (McEnery, Wilson, 2001).
Early corpus-based methods served as the foundation for a range of linguistic studies.
Researchers collected and analyzed naturally occurring data in order to describe language change
and other linguistic phenomena. In collecting texts, linguists also looked for objective
material in which answers to their linguistic questions could be found. McEnery and Wilson
observe that if language were finite, it would be easy to collect and enumerate its texts
(McEnery, Wilson, 1996).
1.1.2.1. Corpus-based work up to the end of the 1950s
The early empirical studies played a great role in the development of corpus linguistics.
These studies laid the basis for ideas that would be refined later. In 1897 the German scholar
Käding began to compare the frequencies of letters and sequences of letters in order to derive
spelling conventions from them. He used approximately 11 million German words. Today it seems
remarkable that he could work through such a large number of words without a computer.
Between 1876 and 1926, research on language acquisition was based on diaries in which parents
recorded their children’s language. Interestingly, those findings are still used as a
source of normative information on language more than half a century later (Kennedy, 1998).
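The kind of count attributed to Käding above, frequencies of letters and of letter sequences,
takes only a few lines on a computer. The sketch below is an illustration only, not his actual
procedure: the function names, the choice of two-letter sequences and the short German sample
phrase are assumptions introduced here.

```python
from collections import Counter


def letter_frequencies(text):
    """Count how often each letter occurs, ignoring case, spaces and punctuation."""
    return Counter(ch for ch in text.lower() if ch.isalpha())


def letter_sequence_frequencies(text, n=2):
    """Count sequences of n adjacent letters inside each word (never across words)."""
    counts = Counter()
    for word in text.lower().split():
        letters = "".join(ch for ch in word if ch.isalpha())
        for i in range(len(letters) - n + 1):
            counts[letters[i:i + n]] += 1
    return counts


# Hypothetical sample phrase, invented for this illustration.
sample = "Der Hund und der Mund"

print(letter_frequencies(sample).most_common(3))
print(letter_sequence_frequencies(sample, n=2).most_common(3))
```

Scaling the same count to millions of words is routine on a computer, which gives a sense of
the manual effort the original study must have required.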
In research on foreign language pedagogy, two scholars, Fries and Traver, also used
corpus-based data; their vocabulary lists were derived from corpora and built on the studies
of Thorndike (Kennedy, 1998).
Eaton, who carried out research in comparative linguistics, collated the words
most frequently used in French, Italian and Dutch. His work is still cited as an
example today and is considered among the best of its kind. Other scholars drew on his word
list in their own work; one of them was Lorge, who, like Eaton, used a semantic frequency list.
Another scholar, Fries, based his work in descriptive grammar on telephone conversations. All
these works are still considered sophisticated and paved the way for the development of further
corpora.
1.1.2.2. Corpus Linguistics and its Methodology
A corpus is a body of text; it can also be described as a large collection of texts that have
been gathered and systematized electronically, drawn from different types of texts or selected
according to a specific set of criteria. Four characteristics of a corpus are the most
important: “authentic”, “large”, “electronic” and “specific set”. These features distinguish
corpora from other types of text (Bowker, Pearson, 2002). Corpus linguistics has generated a
number of research methods. According to Wallis and Nelson, there is a 3A perspective, whose
three components are annotation, abstraction and analysis (Wallis, Nelson, 2001).