FU9, August 2000, Tartu


5. Since the autumn of 1998 I have been developing a set of corpus-linguistic tools using the Perl programming language (Wall et al. 1996). I have nicknamed the tools CorpTool.

CorpTool contains, among other things, two KWIC concordance programs. One is intended to be language-independent and can deal with plain text only. This general concordancer is nicknamed MyConc (cf. Figure 1). The other is designed specifically for Estonian texts that are processed by ESTMORF, the Estonian-language morphological analyzer. This Estonian-language concordancer equipped with ESTMORF is nicknamed EstCorp.

Figure 1. General Concordance Program MyConc

+--------------------+   +--------------------------------+
| unannotated corpus | + | alphabet definition subroutine |
+--------------------+   +--------------------------------+
          |
          |
     +--------+          +-----------------+
     | MyConc | <------- | sort subroutine |
     +--------+          +-----------------+
          |
          |
+------------------+
| KWIC concordance |
+------------------+

A concordance program must handle language-specific alphabetical sorting properly. The standard sort program (sort) that comes with the operating system orders text by character code. So if you are using Microsoft Windows with the default character set, the sort program arranges the letters in ANSI (Latin-1) order: (1) the uppercase letters of the English alphabet in the familiar alphabetical order (A - B - C - D - E - F - G - H - I - J - K - L - M - N - O - P - Q - R - S - T - U - V - W - X - Y - Z), then (2) the lowercase letters of the English alphabet in the same order (a - b - c - d - e - f - g - h - i - j - k - l - m - n - o - p - q - r - s - t - u - v - w - x - y - z), then (3) the uppercase accented and umlaut letters in ANSI order (À - Á - Â - Ã - Ä - Å - Æ - Ç - È - É - Ê - Ë - Ì - Í - Î - Ï - Ð - Ñ - Ò - Ó - Ô - Õ - Ö - Ø - Ù - Ú - Û - Ü - Ý - Þ), and finally (4) the lowercase accented and umlaut letters in the same order (à - á - â - ã - ä - å - æ - ç - è - é - ê - ë - ì - í - î - ï - ð - ñ - ò - ó - ô - õ - ö - ø - ù - ú - û - ü - ý - þ - ÿ). A word list sorted "alphabetically" by the standard sort program would look like this: Ene – Juku – Tallinn – Venemaa – ema – kala – tema – vesi – Ära – Õde – Öösel – ÜRO – äär – ääres – õun – ööpik. No corpus linguist would be happy with such an ordering.

One of the special features of MyConc is the flexibility with which it performs alphabetic sorting. In addition to a number of preset language-specific alphabetic orders (Estonian, Finnish, German, English, Turkish, Mari, etc.), the program accepts any user-defined alphabetic order on each run. Internally, the Perl program defines each alphabet as a separate subroutine. The sort subroutine is common to all languages, i.e., it works with any alphabet specification.
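The separation between an alphabet definition and a shared sort routine can be sketched roughly as follows. This is an illustration in Python, not CorpTool's Perl code: the names ESTONIAN, make_sort_key and alphabetic_sort are hypothetical, the Estonian letter order is written out as an assumption of the sketch, and the handling of case and of characters outside the alphabet is one possible choice among several.

```python
# Illustrative sketch only; CorpTool itself is written in Perl.
# All names below are hypothetical and not taken from MyConc.

# An "alphabet definition": the letters of the Estonian alphabet in order.
ESTONIAN = "abcdefghijklmnopqrsšzžtuvwõäöüxy"


def make_sort_key(alphabet):
    """Return a key function that ranks words by the given alphabet."""
    rank = {letter: i for i, letter in enumerate(alphabet)}
    fallback = len(alphabet)

    def key(word):
        # Compare case-insensitively; characters outside the alphabet
        # sort after all known letters, by character code.
        ranks = [rank.get(ch, fallback + ord(ch)) for ch in word.lower()]
        # The original word breaks ties between forms differing only in case.
        return (ranks, word)

    return key


def alphabetic_sort(words, alphabet):
    """Language-independent sort routine: works with any alphabet string."""
    return sorted(words, key=make_sort_key(alphabet))


WORDS = ["Ene", "Juku", "Tallinn", "Venemaa", "ema", "kala", "tema", "vesi",
         "Ära", "Õde", "Öösel", "ÜRO", "ääres", "õun", "ööpik", "äär"]

if __name__ == "__main__":
    print(" - ".join(sorted(WORDS)))                     # character-code order
    print(" - ".join(alphabetic_sort(WORDS, ESTONIAN)))  # Estonian order
```

With the Estonian alphabet string, the sample word list comes out as ema - Ene - Juku - kala - Tallinn - tema - Venemaa - vesi - Õde - õun - Ära - äär - ääres - ööpik - Öösel - ÜRO. Supporting another language only requires passing a different alphabet string; the sort routine itself does not change.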

Table 4 is a sample input to the general concordancer MyConc. The program assumes a minimal set of standardized tags. In its default configuration, the program searches only the lines marked with the sentence tag <s>, ignoring all other lines.

Table 4. Sample input to MyConc (line breaks are marked by '↲')

<com> Aleksander Borissov, born 1931, Jõgõperä village,
recorded 1992, interviewers Heinike Heinsoo (int),
Loit Jõekalda (LJ) </com>↲
<int> <est> Aleksander Borissov, Jõgõperä küla, kolmekümnes
 juuni.
</est> </int>↲
<int> siä õõt süntünnü kassõn tšüläZ? </int>↲
<p>↲
<s> kassõn tšüläz õlõn süntünnü . </s>↲
</p>↲
<int> pajata vähäzee maamassa ja taatassa. tšed õltii. </int>↲
<p>↲
<s> maam i taatõ . </s>↲
<s> taatõ maal tetši tüüt kassõn . </s>↲
<s> muutt tämä , toozˆ zˆiivattaa piti ja ... </s>↲
</p>↲
<int> kui pajatti? õmaa tšeelii? </int>↲
<p>↲
<s> õmaa tšeelii . </s>↲
</p>↲
<int> vaissi sis, pajatti? </int>↲
<p>↲
<s> pajatti , koko aikaa ain vain tämä pajatti . </s>↲
</p>↲
<int> ja maam? </int>↲
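
The default behaviour, searching only the <s> lines with a regular expression and printing each hit in context, might look roughly like the sketch below. It is written in Python rather than Perl, and the function names, the bracket notation around the keyword and the context width are all hypothetical choices of the sketch, not MyConc's actual interface. The sample is an excerpt from Table 4.

```python
import re

# Illustrative sketch only; names are hypothetical, not MyConc's own.
S_LINE = re.compile(r"^\s*<s>\s*(.*?)\s*</s>\s*$")


def sentence_lines(text):
    """Keep only the text of lines marked with <s> ... </s>."""
    sentences = []
    for line in text.splitlines():
        match = S_LINE.match(line)
        if match:
            sentences.append(match.group(1))
    return sentences


def kwic(sentences, pattern, width=25):
    """Return KWIC lines: left context, bracketed keyword, right context."""
    regex = re.compile(pattern)
    lines = []
    for sentence in sentences:
        for hit in regex.finditer(sentence):
            left = sentence[:hit.start()][-width:]
            right = sentence[hit.end():][:width]
            lines.append(f"{left:>{width}}[{hit.group()}]{right}")
    return lines


# Excerpt from Table 4, one logical line per physical line.
SAMPLE = """\
<int> kui pajatti? õmaa tšeelii? </int>
<p>
<s> õmaa tšeelii . </s>
</p>
<int> vaissi sis, pajatti? </int>
<p>
<s> pajatti , koko aikaa ain vain tämä pajatti . </s>
</p>
"""

if __name__ == "__main__":
    for line in kwic(sentence_lines(SAMPLE), r"\bpajatti\b"):
        print(line)
```

On this excerpt the search finds the two occurrences of pajatti in the informant's sentence and skips the two in the interviewer's <int> lines.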

EstCorp, on the other hand, operates on morpholexically annotated versions of the Corpus of Estonian Literary Language. It is equipped with a search mechanism designed to process ESTMORF output. Once processed by ESTMORF, any Estonian-language text can be searched by EstCorp.

Though ESTMORF is a DOS program, it can be integrated into (or called by) a Perl program without difficulty. I have therefore written a Perl program, nicknamed EstMorf, that preprocesses the input to ESTMORF, launches ESTMORF, and postprocesses the ESTMORF output for EstCorp. EstMorf thus replaces the original DOS preprocessing and postprocessing programs that come with the morphological analyzer. Table 5 shows a sample of EstMorf output, which in turn is the direct input to the concordance program EstCorp.

The morpholexical annotation provided by ESTMORF enables EstCorp to carry out more sophisticated searches than MyConc, which performs GREP-type searches by applying Perl's powerful machinery of regular expressions to plain text. EstCorp can thus search by morphosyntactic category, such as "terminative" (see Table 6), "ma-infinitive", "numeral", etc. It can also search by lexeme (lemma), say tuba 'room' (see Table 7), mesi 'honey', tegema 'to make', minema 'to go', etc., by making use of the lemma information provided by ESTMORF.

It should be added that these corpus-linguistic tools are designed for the Microsoft Windows environment. Since the Perl programming language is free, you can do corpus linguistics on your laptop computer without extra investment.

Figure 2. Estonian Concordance Program EstCorp

+--------------------------------------+
| Corpus of Estonian Literary Language |
|       (unannotated version)          |
+--------------------------------------+
          |
          |
     +---------+
     | ESTMORF |
     +---------+
          |
          |
 +------------------+
 | morpholexically- |     +------------------------------+
 |    annotated     |  +  | Estonian alphabet subroutine |
 | Estonian corpus  |     +------------------------------+
 +------------------+
          |
          |
     +---------+             +-----------------+
     | EstCorp |   <-------  | sort subroutine |
     +---------+             +-----------------+
          |
          |
+------------------+
| KWIC concordance |
+------------------+

Table 5. Sample of annotated Estonian corpus (line breaks are marked by '↲')

1950I_0002↲
Suur        suur+0{A_sg_n}↲
rõõm        rõõm+0{S_sg_n}↲
täitis      täit+is{V_s}↲
äkki        äkki+0{D}↲
Ülo         Ülo+0{H_sg_g}Ülo+0{H_sg_n}↲
südant      süda+t{S_sg_p}↲
.↲
@↲
1950I_0003↲
Meeli       Meeli+0{H_sg_g}Meeli+0{H_sg_n}meel+i{S_pl_p}↲
elas        ela+s{V_s}↲
alles       alles+0{D}↲
.↲
@↲
1950I_0004↲
Nüüd        nüüd+0{D}↲
ta          tema+0{P_sg_g}tema+0{P_sg_n}↲
oleks       ole+ks{V_ks}↲
tahtnud     taht=nud+0{A_sg_n}taht=nud+d{A_pl_n}taht=nu+d{S_pl_n}
            taht+nud{V_nud}↲
kas         kas+0{D}↲
või         või+0{D}või+0{J}või+0{S_sg_g}või+0{S_sg_n}või+0{V_o}↲
tema        tema+0{P_sg_g}tema+0{P_sg_n}↲
silme       silm+0{S_pl_g}silm+e{S_pl_p}↲
ees         ees+0{D}ees+0{K}esi+s{S_sg_in}↲
surra       sure+a{V_da}↲
.↲
@↲
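
A search over annotation laid out as in Table 5 could be sketched along the following lines. This is a Python illustration, not EstCorp's Perl code: the function names are hypothetical, and the reading of the format (a sentence ID line, one word form per line followed by its analyses in the shape lemma+ending{TAGS}, and "@" closing the sentence) is inferred from the sample alone. The queries use only lemmas and tags that actually appear in Table 5.

```python
import re

# Illustrative sketch only; names are hypothetical, not EstCorp's own.
# One analysis has the shape lemma+ending{TAGS}, e.g. süda+t{S_sg_p}.
ANALYSIS = re.compile(r"([^\s{}+]+)\+([^\s{}]*)\{([^}]*)\}")


def parse_annotated(text):
    """Parse sentences in the Table 5 layout into (id, tokens) pairs."""
    sentences, sent_id, tokens = [], None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line == "@":                  # end-of-sentence marker
            if sent_id is not None:
                sentences.append((sent_id, tokens))
            sent_id, tokens = None, []
        elif sent_id is None:            # first line of a sentence is its ID
            sent_id = line
        else:
            parts = line.split(None, 1)
            form = parts[0]
            analyses = ANALYSIS.findall(parts[1]) if len(parts) > 1 else []
            tokens.append((form, analyses))
    return sentences


def search(sentences, lemma=None, tags=None):
    """Find word forms with at least one analysis matching lemma and/or tags."""
    hits = []
    for sent_id, tokens in sentences:
        for form, analyses in tokens:
            if any((lemma is None or a_lemma == lemma) and
                   (tags is None or a_tags == tags)
                   for a_lemma, _ending, a_tags in analyses):
                hits.append((sent_id, form))
    return hits


# Abridged from Table 5.
SAMPLE = """\
1950I_0002
rõõm        rõõm+0{S_sg_n}
täitis      täit+is{V_s}
Ülo         Ülo+0{H_sg_g}Ülo+0{H_sg_n}
südant      süda+t{S_sg_p}
.
@
1950I_0004
ta          tema+0{P_sg_g}tema+0{P_sg_n}
oleks       ole+ks{V_ks}
tema        tema+0{P_sg_g}tema+0{P_sg_n}
.
@
"""

if __name__ == "__main__":
    corpus = parse_annotated(SAMPLE)
    print(search(corpus, lemma="tema"))    # forms "ta" and "tema"
    print(search(corpus, tags="S_sg_p"))   # form "südant"
```

Searching by lemma retrieves both ta and tema, which a plain-text pattern for "tema" would not do; searching by tag retrieves südant without any reference to its spelling.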

Table 6. Sample KWIC concordance created by EstCorp (1): Estonian Terminative

0 . aastal jõuti 75 000 rublani * , mullu ligi 125 000-ni .
e viisaastaku lõpuks 60 rublani * ühe elaniku kohta ; avatakse 60 te
 tõmbas naine tekinurga kaelani * ja keeras selja , sõnades see juur
ine ühtlane , kaelusest kuklani * ulatuv lõkendus ta näkku tagasi .
eeritud ja pealaest jalatallani * uhiuutes rõivastes .
          Nad olid laevasillani * ja sedamööda lähemate mündrikupaat
ise ulatus vaevalt mehele õlani * .
eiler ulatas professorile õlani * , mistõttu Konstantin kummardus rä
ikul öeldi , et järgmise külani * on kõigest kakskümmend viis , » va
õi läbi heinakuhja tulid temani * ainult katkendlikud : « ... aart ,
 ametikohal töötas kuni surmani * .
staid , tavaliselt kuni surmani * ning nende elu ja haigus olevat ni

Table 7. Sample KWIC concordance created by EstCorp (2): Estonian tuba 'room'

al pärast tütre lahkumist * magamistuppa läks , oli Astrid , tema n
             Võõrastemaja * numbritoad muutusid teatri rõivastumisr
li tagasi , kui Naujocksi * numbritoas helises telefon .
Ta oli korralik tüdruk ja * numbritubadest ei saanud juttugi olla .
neljandas vaatuses Ülesoo * rehetoas .
isa rajatud elud- majad , * rehetuba , põhukuur ja loomalaut – kõik
ina nagu juhuslikult tema * tagatoa uksele – Michel elas nüüd seal
d lõi meister ise kokku , * tagatoas aga liimisid , vooderdasid , p
d , siis istus ta niisama * tagatoas lauanurgal , kõlgutas jalgu ja
meenutas ta , et oli seal * tagatoas täitsa toredasti öelnud : » ..
Rammu tõstis kontoriruumi * tagatoast laua välja .
laste , raskete sammudega * tagatuppa .
 endilegi säherduse kasti * tagatuppa toob .
eelde , teisel korrusel , * tisleritöötoa kohal ...
