Lengcom 8.4 (2015). ISSN 2386-7477


General Analysis of an Online Language Corpus


Kerwin A. Livingstone

Universidade do Porto

profesordelenguasmodernas@yahoo.es


Abstract: Corpus-based research is rapidly gaining ground in the field of Applied Linguistics. Of particular interest is the number of online language corpora that can be accessed with a single click of the mouse. A quick search of the Web turns up different kinds of corpora in a vast number of language areas. Given the need to find new and engaging ways to improve the language learning and teaching process, corpus linguistics has the potential to generate significant learning experiences. With this in mind, this paper presents a general analysis of an online language corpus. The corpus chosen is the Corpus del Español, one of the Brigham Young University (BYU) Corpora, created in 2002 by Professor Mark Davies of BYU. The corpus is analysed and its implications for language learning and teaching are highlighted.

 

Keywords: corpus, language corpus, applied linguistics, language learning and teaching, Spanish, corpus-based research, corpus linguistics.

 

Resumen: La investigación basada en corpus está ganando terreno rápidamente en el ámbito de la lingüística aplicada. Interesante aún es la evidencia de muchos córpora de lengua en línea a los cuales pueden accederse fácilmente, con un clic del ratón. Una búsqueda rápida de la Web producirá distintos tipos de córpora, en un gran número de áreas de la lengua. Dada la necesidad de encontrar nuevas y emocionantes maneras de mejorar el proceso de enseñanza-aprendizaje de lenguas, la lingüística de corpus sin duda tiene el potencial para engendrar experiencias de aprendizaje significativas. De acuerdo con lo anteriormente mencionado, este artículo versa sobre el análisis general de un corpus de idioma en línea. El corpus específico seleccionado es el Corpus del Español, uno de los córpora de la Universidad Brigham Young, creado por el profesor Mark Davies en 2002, profesor en esta universidad. Se lleva a cabo un análisis del corpus, destacando sus implicaciones para la didáctica de lenguas.  

 

Palabras clave: corpus, corpus de lengua, lingüística aplicada, enseñanza-aprendizaje de lenguas, español, investigación basada en corpus, lingüística de corpus. 


1. Introduction

Substantial collections of language texts in electronic form have been available to scholars for almost forty years, and they offer a view of language structure that was not available before. While much of what they show confirms and deepens our knowledge of the way language works, there is also a fascinating area of novelty and unexpectedness – ways of making meaning that have not previously been taken seriously (Sinclair, 2004; Biber, Conrad & Reppen, 2004; Frankenberg-Garcia, Flowerdew & Aston, 2011). Bodies of linguistic texts have thus been accessible for some time. Such a body of texts is referred to as a corpus (plural: corpora).

 

A corpus can be defined as a systematic collection of naturally occurring texts of both written and spoken language (Nesselhauf, 2004). It is systematic for two reasons: (1) the structure and contents of the corpus follow certain extralinguistic principles (the principles on the basis of which the included texts were chosen); (2) information on the exact composition of the corpus is available to the researcher, including the number of words in each category and in the whole corpus, how the included texts were sampled, and the like.

 

Maia (2014: 3) postulates that “the main purpose of a corpus is to verify a hypothesis about language – for example, to determine how the usage of a particular sound, word, or syntactic construction varies”. This assertion is important for research in languages and linguistics, since it implies that empirical studies employing quantitative analyses can be conducted with the support of language corpora. It is also important to note that, even though a corpus can refer to any systematic text collection, the term is commonly used in a narrower sense today, and frequently refers only to systematic text collections that have been computerised.

 

Linguistics, for its part, is the scientific study of language. Combining the two terms gives Corpus Linguistics, which can be described as “a linguistic methodology which is founded on the use of electronic collections of naturally occurring language, viz. corpora” (Granger 2002: 3). In other words, it is a method for carrying out linguistic analyses with the help of a computer and specialised software, and one which takes into account the frequency of the phenomenon being investigated. Nesselhauf (2004) establishes that corpus linguistics has become one of the most widely used methods of linguistic research in recent years.


2. General Analysis of the Corpus del Español

2.1 Description of the Corpus

The Corpus del Español contains 100 million words in Spanish, from the 1200s to the 1900s. The interface of the corpus is well organised and easy to follow. In the top left-hand corner of the interface there are a number of sections, for example “Display”. In this section, one can select the option ‘list’, ‘chart’, ‘key word in context (KWIC)’, or ‘compare’. The other sections are “Search String”, “Sections”, “Sorting and Limits”, and “Click to See Options”, which has further subcategories for specific kinds of searches. By clicking any of these options, the researcher can begin to investigate the item of interest.

 

One notable characteristic of this corpus is that, in the section called “Sections”, one has the option of choosing the period in which the word was used, from the 1200s to the 1900s. It is also noteworthy that the corpus has been constructed from four contexts: oral, fiction, news and academic. Under the section called “Search String”, the subcategory ‘Pos List’ allows one to select the syntactic class of the lexical item to be studied. Another fundamental characteristic of the corpus is that it can be accessed in English and Spanish. The instructions in English are likely to serve novice users who are not proficient in Spanish, while the instructions in Spanish are directed at those who are more linguistically competent.

 

2.2 Using the Corpus

To see how this corpus functions, two lexical items (quantifiers) that are sometimes confused in Spanish were selected: bastante and muy. They were tested only for the 1900s.

 

In the case of bastante (rather), a search was run to list the contexts in which the word appeared. For the 1900s, the frequency of this quantifier was 3,894; the search took only 0.523 seconds. Clicking the word ‘bastante’ at the top centre of the page brought up a large number of examples of use, each with the context in which it occurred (oral, fiction, news, or academic). Clicking one of these examples, “Estoy bastante satisfecho de mi vida”, immediately displayed the source information: the date, title, source and author. An expanded context was also provided to show exactly how the phrase or sentence was used.

 

Selecting the option ‘Chart’ in the “Display” section produced a chart giving the frequency of the quantifier ‘bastante’ from the 1200s to the 1900s. The frequency in each of the four contexts was also shown. Figure 1 presents this information:


SECTION    1200s   1300s   1400s   1500s   1600s   1700s   1800s   1900s  |    ACAD    NEWS    FICT    ORAL
FREQ           4       8     155    1013    1039    1945    3682    3894  |     492     422     759    2221
PER MIL     0.60    3.00   18.99   59.47   84.14  198.13  190.80  170.62  |   98.40   85.02  159.12  524.68

Figure 1. Frequency of use of the word ‘bastante’ from 1200s-1900s.
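The PER MIL row in Figure 1 is a normalised frequency. A minimal Python sketch of that normalisation follows, assuming the usual definition (raw count divided by the number of words in the section, multiplied by one million). The section sizes are not reported in this paper, so the sketch back-calculates them from the two published rows; the function names are illustrative only.

```python
# Raw counts and per-million rates for 'bastante', copied from Figure 1.
freq = {"1900s": 3894, "ACAD": 492, "NEWS": 422, "FICT": 759, "ORAL": 2221}
per_mil = {"1900s": 170.62, "ACAD": 98.40, "NEWS": 85.02, "FICT": 159.12, "ORAL": 524.68}


def per_million(count, section_words):
    """Normalised frequency: occurrences per million words of the section."""
    return count / section_words * 1_000_000


def implied_section_words(count, rate_per_million):
    """Invert the normalisation to estimate a section's size.

    This is an inference from the published rows, not a figure
    taken from the corpus documentation.
    """
    return count / rate_per_million * 1_000_000


for name in freq:
    size = implied_section_words(freq[name], per_mil[name])
    print(f"{name}: about {size / 1_000_000:.1f} million words")

# Sanity check: the four register counts add up to the 1900s total.
register_total = sum(freq[r] for r in ("ACAD", "NEWS", "FICT", "ORAL"))
print(register_total == freq["1900s"])
```

Under this assumption the 1900s section works out to roughly 22.8 million words, which is why a raw count from a small section (such as ORAL) can yield a much higher per-million rate than a larger raw count from the whole century.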

 

Searching for the same word with the ‘KWIC’ option displayed the word highlighted in the middle of the contexts in which it appeared. An example is provided below in Table 1:


No.  Section  Source                          Left context                                 Keyword   Right context
82   19-OR    Habla Culta: Mexico: M2  A B C  . . pues , al otro día , con una apariencia  bastante  tersa y pulcra . Enc . - - Bueno . ¿ Y

Table 1. Example of the key word in context.
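The key-word-in-context layout of Table 1 can be illustrated with a short Python sketch. It is not the corpus's own implementation: the function `kwic` and its `width` parameter are hypothetical, and the sample text simply joins two sentences quoted elsewhere in this paper.

```python
import re


def kwic(text, keyword, width=40):
    """Return (left, keyword, right) tuples for each whole-word match."""
    rows = []
    # \b restricts matches to whole words; re.escape guards special characters.
    pattern = re.compile(rf"\b{re.escape(keyword)}\b", re.IGNORECASE)
    for match in pattern.finditer(text):
        left = text[max(0, match.start() - width):match.start()]
        right = text[match.end():match.end() + width]
        rows.append((left, match.group(0), right))
    return rows


sample = (
    "Estoy bastante satisfecho de mi vida. "
    "Es una zona bastante fría y muy húmeda."
)

# Right-align the left context so the keyword lines up in a central column.
for left, key, right in kwic(sample, "bastante", width=20):
    print(f"{left:>20} | {key} | {right:<20}")
```

Aligning the keyword in a central column is what makes recurring patterns to its left and right easy to scan.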

 

The same process was repeated for the quantifier muy. For the 1900s, a total frequency of 35,393 was returned in 0.332 seconds. Clicking the word revealed the four contexts in which it was used, together with a large number of examples of use. One selected phrase was “Es un camino muy largo el que nos queda por recorrer”. As with the other quantifier, the source information and expanded context were presented.

 

Choosing the option ‘Chart’ showed the following for the word ‘muy’ in Figure 2:


SECTION      1200s     1300s     1400s     1500s     1600s     1700s     1800s     1900s  |      ACAD      NEWS      FICT      ORAL
FREQ         22043     13792     26619     49857     19394     16505     29321     35393  |      5053      5475      6933     17932
PER MIL   3,282.30  5,166.39  3,261.57  2,926.86  1,570.59  1,681.32  1,519.44  1,550.81  |  1,010.61  1,102.99  1,453.50  4,236.18

Figure 2. Frequency of use of the word ‘muy’ from 1200s-1900s.

 

Using the ‘KWIC’ method, the following example was highlighted, as in Table 2:


No.  Section  Source                   Left context                                                  Keyword  Right context
2    19-OR    Entrevista (ABC)  A B C  . El reparto se apoya en tres nombres « estilísticamente      muy      adecuados » , como son el barítono Robert Hale….

Table 2. Example of the key word in context.

 

2.3 Comparison of the two quantifiers ‘bastante’ and ‘muy’

Selecting the option ‘Compare’ in the “Display” section of the corpus returns a large number of results, some of which are displayed in Table 3:

WORD 1 (W1): BASTANTE (0.11)

     WORD          W1    W2   W1/W2   SCORE
1    ACEPTABLE      7     0    14.0   127.2
2    MODIFICA       5     0    10.0    90.9
3    VERTICALES     5     0    10.0    90.9
4    APRENDIDO      6     1     6.0    54.5
6    BASTANTE     154    34     4.5    41.2

WORD 2 (W2): MUY (9.09)

     WORD          W2    W1   W2/W1   SCORE
1    ESPECIAL     242     1   242.0    26.6
2    POCOS        322     2   161.0    17.7
3    LINDO        125     1   125.0    13.8
4    DON           61     0   122.0    13.4
5    DESPACIO      60     0   120.0    13.2
6    RESPUESTA     60     0   120.0    13.2

Table 3. Context comparison of ‘muy’ and ‘bastante’.

 

As can be seen from the above, each quantifier is placed in its own table, with a list of the words with which it is used. Word 1 is ‘bastante’ and Word 2 is ‘muy’. In each table, the counts for a given word are compared between Word 1 and Word 2. Table 4 presents examples of contexts in which both quantifiers occur.

5    19-OR    Habla Culta: Caracas: M19    A B C
     -...hacia el norte de Bélgica. Es una zona bastante fría y muy húmeda.
     Inf. A - - Llueve mucho. Inf. B - -

28   19-F     Rayuela    A B C
     ? Le pregunté cuándo se cambiaba de ropa. Qué tontería preguntarle eso.
     Es muy buena, está bastante loca, esa noche creía ver las flores del campo en

8    19-OR    Habla Culta: Havana: M49    A B C
     después de tomar una siesta? Inf.a. - Bueno, bastante bien.
     Muy alivio, mucho alivio espiritual. Enc. - Sí, sí, eso ayuda

22   19-OR    España Oral: CCON013F    A B C
     que me gusta que sí, que el nivel de lingüística es muy - es muy
     Sí. Bastante bueno. Yo lo que veo que además de Generativa,

27   19-OR    Entrevista (PRI)    A B C
     . Creo que Juan Millán es un hombre que tiene una trayectoria bastante amplia,
     muy completa, no es un improvisado, no ha llegado por accidente a la política

Table 4. Comparative uses of ‘muy’ and ‘bastante’.
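The ratio and score columns of Table 3 can be reproduced from the counts reported in this paper. The Python sketch below is a reconstruction, not the corpus's documented formula: it assumes that the figure in parentheses after each word (0.11 and 9.09) is the ratio of the two quantifiers' overall 1900s frequencies (3,894 and 35,393), that the score is the per-word ratio divided by that overall ratio, and that a count of zero is replaced by 0.5 before dividing (inferred from rows such as 7 versus 0 giving 14.0). The name `compare_score` and the parameter `zero_floor` are hypothetical.

```python
# 1900s totals for the two quantifiers, from Figures 1 and 2.
TOTAL_BASTANTE = 3894
TOTAL_MUY = 35393


def compare_score(count_a, count_b, total_a, total_b, zero_floor=0.5):
    """Return (ratio, score) for one row of a 'Compare' table.

    ratio: how much more often the word occurs with A than with B.
    score: that ratio relative to the overall A/B frequency ratio.
    zero_floor is an assumption inferred from Table 3, not a documented value.
    """
    ratio = count_a / max(count_b, zero_floor)
    overall = total_a / total_b
    return ratio, ratio / overall


# Overall ratios shown in parentheses in the Table 3 headings.
print(f"{TOTAL_BASTANTE / TOTAL_MUY:.2f}", f"{TOTAL_MUY / TOTAL_BASTANTE:.2f}")

# ACEPTABLE: 7 with 'bastante', 0 with 'muy'.
ratio, score = compare_score(7, 0, TOTAL_BASTANTE, TOTAL_MUY)
print(f"ACEPTABLE  ratio={ratio:.1f}  score={score:.1f}")

# ESPECIAL: 242 with 'muy', 1 with 'bastante'.
ratio, score = compare_score(242, 1, TOTAL_MUY, TOTAL_BASTANTE)
print(f"ESPECIAL   ratio={ratio:.1f}  score={score:.1f}")
```

Read this way, the score corrects for the fact that ‘muy’ is about nine times more frequent than ‘bastante’ overall, so a word that is only modestly more common with ‘bastante’ in raw terms can still stand out as distinctive of it.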

 

2.4 Analysis of the Corpus

The Corpus del Español presents an ample range of linguistic text in Spanish, covering many aspects of the language. It is a useful, user-friendly corpus that is easy to navigate. To continue using the corpus beyond an initial number of searches, the researcher has to register with the system, providing valid information which, once approved, allows for continued, uninterrupted use.

 

The various contexts from which the corpus has gathered data (oral, fiction, news and academic) present information from Spanish-speaking countries across North America, Central America, South America, Europe, and the Caribbean, thus providing a wealth of rich, diverse linguistic text. This is very important for language pedagogy, since providing rich input fosters language acquisition (Krashen, 1981).

 

Experimenting with the corpus using other lexical items, phrases and grammar concepts has shown that the linguistic data have been carefully compiled to include almost all aspects of language production. It is an excellent tool for both language teachers and learners, and it is recommended that it be used to complement pedagogical practices, with specific reference to the learning and teaching of Spanish as a first language (L1), second language (L2), or foreign language (FL).

 

One recommendation would be to include recorded speech in this corpus to support listening comprehension and pronunciation exercises for language students. In other words, a specific section of the corpus could be reserved for listening and pronunciation drills only.


3. Implications for Language Learning and Teaching

As mentioned in the preceding section, this corpus has implications for Spanish language learning and teaching:

  1. Too often in the language classroom, pedagogic materials do not meet the needs of students. They are either too abstract, or they ignore cultural aspects. Using this corpus can correct that deficiency, since language teachers will be able to provide their students with rich, authentic examples of language use from native speakers in real contexts.
  2. This corpus, extending from the 1200s to the 1900s, lends itself very well to various types of Spanish language study: linguistics, history (the development and use of language over time), literature (the various linguistic styles during literary periods), dialectology (comparisons of dialectal use), translation and interpreting, sociolinguistics, pragmatics, syntax, semantics, language change and variation, among others.
  3. This corpus is valuable for verifying hypotheses about language (Maia, 2014). Conducting quantitative comparisons and analyses of a wide range of linguistic features in this corpus, representing different varieties of the Spanish language, can show how different features cluster together in distinctive distributional patterns, effectively creating different text types. In other words, it can yield much better descriptions of many of the different language registers (informal conversation, formal speech, academic writing, among others) and of the dialects of native Spanish speakers (from Peru, Mexico, Spain, Venezuela, and the like).

As a further example of corpus use in quantitative studies, Livingstone (2014), in a quasi-experimental longitudinal study of Spanish L2 students conducted in 2008, recorded the speech of participants, which was then transcribed and stored electronically. From this corpus of recorded speech, the author was able to determine quantitatively the effectiveness of a mixed methodology for improving linguistic skills, and to identify the language areas that needed further study, testing, and eventual improvement.

 

4. Conclusion

This general analysis of the Corpus del Español has shed light on its potential and its implications for language learning and teaching. Corpus-based studies, since their inception, have opened many doors for detailed documentation of, and research about, language use. In fact, corpus-based research has become increasingly popular in applied linguistics. Corpus linguistics will continue to play a defining role in many areas of language study. It is now up to researchers, teachers, educators, and even students to decide whether they wish to be a part of this process, which can allow for improved pedagogical practices and significant educational experiences.


References

  • Biber, D., Conrad, S. & Reppen, R. (2004): Corpus Linguistics. Investigating Language Structure and Use (4th ed.). Cambridge: Cambridge University Press.
  • Davies, M. (2002): Corpus del Español. Utah: Brigham Young University.
  • Frankenberg-Garcia, A., Flowerdew, L. & Aston, G. (Eds.) (2011): New Trends in Corpora and Language Learning. London: Continuum International Publishing Group.
  • Granger, S. (2002): “A bird’s eye view of learner corpus research”, in Granger, S., Hung, J. & Petch-Tyson, S. (Eds.), Computer Learner Corpora, Second Language Acquisition and Foreign Language Teaching. Amsterdam: John Benjamins Publishing, pp. 3-33.
  • Krashen, S.D. (1981): Second Language Acquisition and Second Language Learning. Oxford: Pergamon.
  • Livingstone, K. A. (2014): La efectividad de un modelo metodológico mixto para la enseñanza-aprendizaje de español como lengua extranjera. Munich: LINCOM.
  • Maia, B. (2014): Corpora para investigação (PPT Presentation). Porto: Universidade do Porto.
  • Nesselhauf, N. (2004): “Learner corpora and their potential for language teaching”, in Sinclair, J. (Ed.), How to Use Corpora in Language Teaching. Amsterdam: John Benjamins Publishing, pp. 125-152.
  • Sinclair, J. (Ed.). (2004): How to Use Corpora in Language Teaching. Amsterdam: John Benjamins Publishing.

