Too Big or Not Too Big: Establishing the Minimum Size for a Legal Ad

A corpus can be described as “[a] collection of texts assumed to be representative of a given language, dialect, or other subset of a language, to be used for linguistic analysis” (Francis 1982). However, the concept of representativeness is still surprisingly imprecise considering its acceptance as a central characteristic that distinguishes a corpus from any other kind of collection (Seghiri 2008). In fact, there is no general agreement as to what the size of a corpus should ideally be. In practice, however, “the size of a corpus tends to reflect the ease or difficulty of acquiring the material” (Giouli/Piperidis 2002). For this reason, this paper addresses that key question: we focus on the complex notion of representativeness and ideal size for ad hoc corpora, from both a theoretical and an applied perspective, and we describe a computer application named ReCor, which is used to verify whether a compiled sample of legal contracts may be considered representative from a quantitative point of view.


Introduction
Corpus-driven/based studies rely on the representativeness of each corpus as their true foundation for producing valid results (cf. Biber et al. 1988: 246). However, according to Leech (1991: 2), the assumption of representativeness "must be regarded largely as an act of faith". Indeed, as Tognini-Bonelli (2001: 57) stated, "at present we have no means of ensuring it, or even evaluating it objectively". Unfortunately, faith and beliefs do not seem to ensure quality. For this reason, this paper addresses that key question: we focus on the complex notion of representativeness and ideal size for ad hoc corpora, from both a theoretical and an applied perspective, and we describe a computer application named ReCor, version 2.5, which is used to verify whether a sample legal ad hoc corpus may be considered representative from a quantitative point of view.

The Importance of Being Representative
Numerous definitions have been provided as to what constitutes a corpus, such as the following: "[a] collection of texts assumed to be representative of a given language, dialect, or other subset of a language to be used for linguistic analysis" (Francis 1982: 17); "a corpus is not simply a collection of texts. Rather, a corpus seeks to represent a language or some part of a language" (Biber et al. 1998); "a finite-sized body of machine-readable texts sampled in order to be maximally representative of the language variety under consideration" (McEnery/Wilson 2001[1996]: 24), among others. However, despite the repeated reference to representativeness as the feature that distinguishes corpora from other kinds of textual collections, there appears to be no consensus amongst the experts as to what it means: "[t]he definition of representativeness is a crucial point in the creation of a corpus, but is one of the most controversial aspects among specialists, especially as regards the ambiguity inherent in its use due to the intermingling of quantitative and qualitative connotations" (CORIS/CODIS 2006).

Qualitative representativeness
Dealing with the first concept, quality, the root of the problem here may lie in the low quality of the texts that are included if they come from sources that are insufficiently reliable (Gelbukh et al. 2002: 10). This obstacle can be overcome by designing a system for gauging the quality of digital information (cf. Seghiri 2006: 89-95). Firstly, therefore, it is vital to establish a set of clear design criteria when compiling a corpus. We will illustrate this methodology by creating a corpus of travel insurance contracts.1 This corpus will be monolingual (Spanish) and diatopically restricted to Spain, given the large number of countries in which this language is spoken. It will be a comparable full-text corpus because it will include complete contracts originally written in Spanish, all of them downloaded from the web, so the corpus will also be electronic. Finally, as the corpus will only include travel insurance contracts, it will be homogeneous in genre and topic.
Once the set of design criteria is clear, a compilation protocol divided into four steps, (i) finding data, (ii) downloading, (iii) formatting and (iv) storage, should be followed for the creation of the ad hoc corpus.
The first step, finding data, consists in searching for relevant documents on the web. Two main types of search may be carried out online: institutional searches and thematic searches. On the one hand, an institutional search is carried out on the web sites of international companies, organisations and institutions. The information found on these sites is of a high standard of quality and reliability because the writers are specialists in the field. The contracts in this corpus were mainly downloaded from the web sites of Spanish insurance companies such as MAPFRE and Ocaso, among others. A list of the main insurance companies in Spain can be downloaded from the Spanish Association of Insurance Companies, named Asociación Empresarial del Seguro.2
On the other hand, a thematic search is normally carried out by means of keyword searches on good search engines. There are many search engines on the Internet, such as Google or Yahoo. However, according to a great number of analysts (cf. Radev et al. 2005), Google is the best search engine in terms of the quality of search results. At this point it is essential to establish descriptors (such as travel insurance and contract) and to use Boolean operators (such as AND and OR) in order to prevent a large amount of irrelevant information from being returned. Search engines such as Google also allow results to be restricted to a specific domain. In this case, "pages from Spain" (.es) was selected in order to filter out pages from other Spanish-speaking countries.
Once the Spanish contracts have been found, the second step is downloading the data. This stage can be carried out manually, although it is sometimes possible to automate the task with programmes such as BootCaT,3 which allows groups of contracts to be downloaded from a single webpage.
During the third step, formatting, the wide variety of formats available on the web needs to be considered: there is a noticeable predilection for HTML (.html) and PDF (.pdf) formats on the Internet, but all these documents have to be converted to ASCII or plain text format (.txt) in order to be processed by any corpus management tool such as WordSmith Tools4 or Concordance,5 to name just a few, in accordance with the clean-text policy described by Sinclair (1991): "[t]he safest policy is to keep to the text as it is, unprocessed and clean of any other codes".
1 European consumers have the right to demand translations of this type of document under the auspices of European directives on insurance matters (92/49/EEC and 92/96/EEC). These directives recognize the right of the party taking out insurance to receive the contract written not only in the official language of the member state where the agreement is made, but also in a language which they may specify.
2 http://www.unespa.es/frontend/unespa/buscador_guia.php
3 http://bootcat.sslmit.unibo.it
4 http://www.lexically.net/wordsmith
5 http://www.concordancesoftware.co.uk
Converting from any format to plain text can be as simple as copying the information and pasting it into a plain text document (.txt). For PDF files, Google allows the majority of PDF documents to be viewed as HTML, so the same copy-and-paste procedure can be carried out. When this is not possible, conversion programmes such as ABBYY FineReader6 can be used.
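Where many HTML pages have to be converted, the copy-and-paste step can be scripted. The following Python sketch is only an illustration of the formatting step, not part of the protocol described here: the class `TextExtractor` and the function `html_to_plain_text` are hypothetical names, and the approach (keeping visible text and discarding markup, scripts and styles) is an assumption about what a clean plain-text version should contain.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, skipping the content of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.parts.append(data)


def html_to_plain_text(html):
    """Return the visible text of an HTML string as clean running text."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    # Join the pieces with spaces, then collapse all runs of whitespace.
    # Limitation: a word split by inline markup would gain a space.
    return " ".join(" ".join(parser.parts).split())


sample = "<h1>Seguro de viaje</h1><style>p {color: red}</style><p>Condiciones generales</p>"
print(html_to_plain_text(sample))
```

The result can then be saved with a .txt extension, in line with the clean-text policy quoted above.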
The last stage is the storage of the data, which consists of saving the downloaded documents and correctly identifying and arranging them. One possible way of doing this is through the use of files and subfiles organised by topic (travel insurance), language (Spanish) and format (original format and plain text). The texts have been automatically codified (cf. Figure 1) with the programme Lupas Rename as follows: number (01), language (TO stands for "original text", and ES stands for "Spanish") and genre (CO means "contract"). In the present study an ad hoc corpus of travel insurance contracts in Spanish was compiled, comprising 92 documents and 901,869 words (tokens). Quality has been assured through a set of clear design criteria and a compilation protocol divided into four steps. But is this quantity of documents and words (tokens) enough to cover the terms used in this topic and genre?

Quantitative representativeness
According to Lavid (2005), the size of the corpus is a decisive factor in determining whether the sample is representative in relation to the needs of the translation. However, the concept of representativeness is still surprisingly imprecise, especially if one considers its acceptance as a central characteristic that distinguishes a corpus from any other kind of collection. Moreover, many authors state that there is no general agreement as to what the size of a corpus should ideally be and that, "[u]sually, the availability of material in the particular field of study determines the final size of the corpus" (Giouli/Piperidis 2002).

Zipf's law approach
A great number of papers have addressed the question of quantity as a criterion for representativeness and have suggested formulas for calculating a priori the minimum number of words and documents necessary for a specialist corpus to be considered representative (cf. Heaps 1978; Biber 1988, 1990, 1993, 1994 and 1995; Leech 1991; Biber et al. 1998; and Yang et al. 1999 and 2002, amongst others). Most of these formulas are based on Zipf's law. Zipf's law starts from the idea that all texts contain a number of words that are repeated: the total number of words in any text is referred to as tokens, while the total number of distinct words, without counting repetitions, is known as types. Counting the tokens that belong to each type gives the frequency of each word in the corpus. Words may thereby be ordered according to their frequency, with each word being given a rank. The word with the highest frequency will occupy the first position on the list, or rank one, with the other words following in descending order. Zipf stated that the higher the rank number of a word, the lower its frequency of occurrence in a text, since a higher rank number indicates that the word is further down the list and therefore less frequent. In other words, there is an inverse relationship between frequency and rank, i.e. frequency decreases as rank increases. By using Zipf's law it is therefore possible to establish that the number of occurrences of a word, or its frequency of occurrence f(n), is inversely proportional to its number on the list, or rank (n). Accordingly, Zipf's law can be expressed mathematically as follows:

$$f(n) = \frac{C}{n}$$

where C is a constant.
Zipf's law can, therefore, give us an idea of the breadth of vocabulary used, but it does not yield a particular or even an approximate number because this will depend on how the constant is determined (Braun 2005[1996] and Carrasco Jiménez 2003: 3). Numerous studies have been based on the law, but the conclusions they reach do not specify, even through the use of graphs, the number of texts that are necessary to compile a corpus for a particular specialised field (Almahano Güeto 2002: 281). There have been many attempts to set the size, or at least establish a minimum number of texts, from which a specialised corpus may be compiled. Some of the most important are those put forward by Heaps (1978),7 Young-Mi (1995) and Sánchez Pérez/Cantos Gómez (1997). However, some of these authors, such as Cantos (cf. Yang et al. 2000: 21), subsequently recognised some shortcomings in these works, stating that "Heaps, Young-Mi and Sánchez and Cantos failed by using regression techniques.8 This might be attributed to their preference for Zipf's law".9
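The ranking procedure behind Zipf's law can be illustrated with a short Python sketch. It is a minimal example under stated assumptions: the tokenizer (a regular expression over lower-cased text) and the names `tokenize` and `rank_frequency` are choices made here for illustration and are not taken from any of the works cited.

```python
import re
from collections import Counter


def tokenize(text):
    """Lower-case the text and split it into word tokens (an assumption)."""
    return re.findall(r"\w+", text.lower())


def rank_frequency(text):
    """Return (rank, word, frequency, rank * frequency) rows, most frequent first."""
    counts = Counter(tokenize(text))
    rows = []
    for rank, (word, freq) in enumerate(counts.most_common(), start=1):
        rows.append((rank, word, freq, rank * freq))
    return rows


text = "el seguro cubre el viaje y el equipaje del viaje"
for row in rank_frequency(text):
    print(row)
```

If the inverse relationship held exactly, the product of rank and frequency in the last column would stay close to a single constant. On a toy sentence like this one it does not; the pattern only emerges on large samples.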

Minimum Size Recommendations
It is surprising to observe that, for many authors, no maximum or minimum number of texts or words that a corpus should contain seems to exist (Sinclair 2004), and where an approximate figure is proposed, authors often take extreme positions. Thus, Sinclair (2004) considers that ideally a corpus should be 'big', although the interpretation of this adjective remains open to debate because no approximate figure is given. McEnery/Wilson (2006[2000]), Borja Albi (2000) and Ruiz Antón (2006) suggest that the ideal number of words that any corpus should reach is around a million. Friedbichler/Friedbichler (2000) consider that a figure between "500,000 and 5 million words per language (depending on the target field) will provide sample evidence in 97 % of language queries". Although it is the dream of many linguists to have gigantic corpora of more than ten million words at their disposal to enable them to carry out studies on general language (Wilkinson 2005: 6), it has been shown that smaller corpora give optimum results in specialised areas. In fact, an increasing number of researchers, such as Bowker and Pearson (2002: 48), stress that smaller corpora of between "a few thousand and a few hundred thousand words" are just as useful in the study of languages for specific purposes. Clear (1994), for instance, entitled an article "I Can't See the Sense in a Large Corpus". Other authors have followed this same line of thought and have emphasised that smaller corpora are extremely useful for sketching out specific areas of a language (cf. Murison-Bowie 1993: 50). Haan (1989, 1992) has given a detailed account of the success of a wide variety of analyses based on corpora that contain no more than twenty thousand words. In different linguistic studies carried out using small corpora, Kock (1997 and 2001) also draws the conclusion that these collections (each containing 19 or 20 texts with approximately one hundred thousand occurrences) are more than sufficient, taking into account that "it is not necessary to have such large corpora if they are homogenous in terms of language register, geographical area and historical time, for instance" (Kock 1997: 292). Biber (1995: 131) reduces these figures still further and states that it is possible to represent practically the totality of elements of a specific register with relatively few examples (one thousand words) and a small number of texts belonging to this register (ten, to be exact).
7 Indeed, out of this work came the rule known as Heaps' law. Both Zipf's and Heaps' laws are used to grasp the variability of corpora. Heaps' law is an empirical law which examines the relationship between vocabulary size, in other words the number of different words (types), and the total number of words in a text (tokens). In this way a sequential increase of vocabulary in relation to text type can be observed. The programme ReCor has been validated using this law (cf. Seghiri 2006: 399-403).
8 Simple linear and multiple linear are the most usual regression techniques used. The prototype situations that these techniques are applied to consist primarily of a set of subjects or observations in which two variables, X and Y for instance, can be measured. When the value of one of the variables, that of X for example, is known, the technique is used to predict the value of this subject in the variable Y. A detailed description of different regression techniques and their applications can be found in Lorch/Myers (1990).
9 Conscious of these deficiencies, Yang et al. (2000) attempted to overcome them by taking a new approach: a mathematical tool capable of predicting the relationship between linguistic elements in a text (types) and the size of the corpus (tokens). However, at the end of their study, the authors reflected on some of its limitations: "the critical problem is, however, how to determine the value of tolerance error for positive predictions" (Yang et al. 2000: 30).
If these principles are applied to the particular case under examination here, it may be stated that the ad hoc corpus of travel insurance contracts has been compiled with the objective of analysing the language used by a very limited community, in a very specific communicative situation (the sale of travel insurance) and with only one text type being represented (the contract), whose frequency in general language use is minimal. In addition, Bravo Gozalo/Fernández Nistal (1998: 216) add that size should be in relation to the purpose for which the corpus is going to be used. Since the corpus under examination has a very specific objective, its size could be reduced even further in the light of this consideration.
The fact that no consensus exists as to the number of documents and words that our final collection should include has led us to the conclusion that, before carrying out any kind of analysis, it is essential to ensure that the number of documents and words achieved is sufficient. However, the ranges of figures that have been suggested differ widely and the proposed calculations are not particularly reliable.10 In a previous study (cf. Seghiri 2006) we concluded that a possible solution may be to carry out an analysis of lexical density in relation to the increase in documentary material included. In other words, if the ratio between the actual number of different words in a text and the total number of words (types/tokens) is an indicator of lexical density or richness, it may be possible to create an algorithm, called N-Cor, that can represent increases in the corpus (C) on a document by document (d) basis, for example:

$$C_n = d_1 + d_2 + d_3 + \dots + d_n$$

Following from this, our starting point is the idea put forward by Biber (1993) and subsequently endorsed in studies such as those by Sánchez Pérez/Cantos Gómez (1998) that, once a certain number of texts has been reached, the number of types does not increase in proportion to the number of words the corpus contains. This may make it possible to determine for the first time the minimum size of a corpus a posteriori. With the help of graphs, it should be possible to establish whether the corpus is representative and how many documents and words (tokens) are necessary to achieve this. This theory has become a practical reality in the shape of a software application, named ReCor, which enables accurate evaluation of corpus representativeness,11 as described in the next section. The ReCor programme has been developed on the basis of the N-Cor algorithm (cf. Figure 3), which was patented in 2010 by the Spanish Patent and Trademark Office.12
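The document-by-document analysis described above can be sketched in Python. This is a minimal illustration of the idea, not the patented N-Cor algorithm or the ReCor implementation: the tokenizer and the names `tokenize` and `cumulative_type_token` are assumptions made for the example.

```python
import re


def tokenize(text):
    """Lower-case the text and split it into word tokens (an assumption)."""
    return re.findall(r"\w+", text.lower())


def cumulative_type_token(documents):
    """Return (n_docs, types, tokens, ratio) after adding each document in turn."""
    seen_types = set()
    total_tokens = 0
    rows = []
    for n, doc in enumerate(documents, start=1):
        tokens = tokenize(doc)
        seen_types.update(tokens)      # new distinct words enlarge the set
        total_tokens += len(tokens)    # every word counts towards the size
        ratio = len(seen_types) / total_tokens if total_tokens else 0.0
        rows.append((n, len(seen_types), total_tokens, ratio))
    return rows


docs = [
    "el seguro de viaje",
    "el seguro cubre el viaje",
    "el viaje y el seguro",
]
for row in cumulative_type_token(docs):
    print(row)
```

As documents are added, the token count keeps growing while fewer and fewer new types appear, so the ratio falls and then flattens; that flattening is the behaviour the graphs are meant to reveal.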

ReCor 2.5
ReCor is a software application designed to facilitate the evaluation of the representativeness of corpora in relation to their size. In this study we used version 2.5 of ReCor, which has an improved capacity for working quickly with multiple and very large files and also allows lexical bundles to be identified on the basis of an analysis of the n-grams (1 ≤ n ≤ 10) of the corpus. The programme illustrates the level of representativeness of a corpus in a simple graph form, which shows lines that fall exponentially at first and then stabilise as they approach zero.13 In the first presentation of the corpus in graph form that the programme generates (Graphical Representation A), the number of files selected is shown on the horizontal axis, while the vertical axis shows the types/tokens ratio. The results of two different operations are shown, one with the files ordered alphabetically (the red line) and the other with the files introduced at random (the blue line). In this way the programme double-checks that the order in which the texts are introduced does not have repercussions for the representativeness of the corpus. Both operations show an exponential decrease as the number of texts selected increases. At the point where both the red and blue lines stabilise, it is possible to state that the corpus is representative, and at precisely this point it is possible to see how many texts and words (tokens) will produce this result. At the same time another graph (Graphical Representation B) is generated, in which the number of tokens is shown on the horizontal axis. This graph can be used to determine the total number of words that should be set as the minimum size of the collection.
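The two curves of Graphical Representation A can be approximated with the following Python sketch. It is an assumption-laden illustration rather than ReCor's own code: the names `ngrams`, `ratio_curve` and `two_orderings` are hypothetical, n-grams are counted inside each document only, and the random order is produced with a fixed seed so that the example is repeatable.

```python
import random
import re


def ngrams(tokens, n):
    """Return the list of n-grams (as tuples) found in a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def ratio_curve(documents, n=1):
    """Types/tokens ratio of n-grams after each document is added."""
    seen, total, curve = set(), 0, []
    for doc in documents:
        grams = ngrams(re.findall(r"\w+", doc.lower()), n)
        seen.update(grams)
        total += len(grams)
        curve.append(len(seen) / total if total else 0.0)
    return curve


def two_orderings(named_documents, n=1, seed=0):
    """Return the curves for alphabetical and for random file order.

    named_documents maps a file name to its text.
    """
    alphabetical = [named_documents[name] for name in sorted(named_documents)]
    shuffled = alphabetical[:]
    random.Random(seed).shuffle(shuffled)
    return ratio_curve(alphabetical, n), ratio_curve(shuffled, n)


corpus = {
    "b.txt": "el seguro de viaje",
    "a.txt": "el seguro cubre el viaje",
    "c.txt": "el viaje y el seguro",
}
alpha_curve, random_curve = two_orderings(corpus, n=1)
print(alpha_curve)
print(random_curve)
```

Whatever the order, both curves end at the same value once every file has been added; what the double check compares is the path taken to get there.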
Once these steps have been taken, it is possible to check whether the number of Spanish contracts compiled is sufficient to enable us to affirm that our corpus is representative (with 1-grams). From the data shown in Figure 5 it is possible to deduce, according to Graph A (Estudio gráfico A), that the corpus begins to be representative from the point of the inclusion of 25 documents, since the curve hardly varies either before or after this number; in other words, this is the point where the lines stabilise and are closest to zero. Graph B (Estudio gráfico B) shows the minimum total number of words (tokens) necessary for the corpus to be considered representative, which in this case is approximately 300,000 words (319,494 words exactly, cf. Figure 7).
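In this study the stabilisation point is read off the graphs. As a rough way of making such a reading explicit, the sketch below looks for the first document count after which every further step changes the ratio by less than a tolerance. Both the function `stabilisation_point` and the tolerance value are hypothetical choices made for illustration; they are not a criterion used by ReCor.

```python
def stabilisation_point(curve, tolerance=0.001):
    """Return the 1-based document count after which the curve stays flat.

    `curve` holds one types/tokens ratio per cumulative document. The curve
    counts as flat from a given point if every later step changes the ratio
    by less than `tolerance`. Returns None if that never happens.
    """
    for start in range(len(curve) - 1):
        steps = [abs(curve[i + 1] - curve[i]) for i in range(start, len(curve) - 1)]
        if all(step < tolerance for step in steps):
            return start + 1
    return None


ratios = [1.0, 0.6, 0.4, 0.39, 0.385, 0.384]
print(stabilisation_point(ratios, tolerance=0.02))
```

The result depends entirely on the tolerance chosen, which is one reason a visual check of both curves remains useful.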
We can also check whether the corpus is representative for 2- to 10-grams, in order to carry out collocational and phraseological studies. To illustrate this, we will check whether the corpus is representative with 2-grams (see Figure 6). From the data shown in Figure 6 it is possible to state that, according to Graph A (Estudio gráfico A), the corpus begins to be representative (with 2-grams) from the point of the inclusion of 52 documents, since the curve hardly varies either before or after this number. Graph B (Estudio gráfico B) shows the minimum total number of words (tokens) necessary for the corpus to be considered representative, which in this case is approximately 500,000 words (527,108 words exactly, cf. Figure 8).
At the same time, three output files are created (in plain text and Excel). The first output file, Statistical Analysis, shows the results of two distinct analyses: firstly, with the files ordered alphabetically by name (see Figure 7 for 1-grams and Figure 9 for 2-grams) and, secondly, with the files in random order (see Figure 8 for 1-grams and Figure 10 for 2-grams). The document is structured into five columns, which show the number of types, the number of tokens, the ratio between the number of different words and the total number of words (types/tokens), the number of words that appear at least one more time, i.e. one type plus one token (V1), and the number of words that appear at least twice, i.e. one type plus two tokens (V2). We can see (cf. Figures 7 and 8) that with 6,057 types and 319,494 tokens the corpus continues to grow in size (i.e. tokens) but not in grams (i.e. types), so we can confirm that, as no new 1-gram types are entering our corpus, the minimum size has been reached. As for 2-grams, the ReCor programme creates the corresponding statistical analysis (see Figures 9 and 10).
The same information is shown in the third file, 'Frequency', but this time the words are ordered according to their frequency or, in other words, by their rank. From this list it may be deduced that the words with the highest absolute frequency are those that are 'empty', whilst the least frequent are those that reveal the author's individual style and richness of vocabulary. Words that appear in the middle range in terms of frequency distribution are those that are really representative of the document (see Figures 13 and 14).
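The five columns of the statistical output and the frequency-ordered list can be imitated with a few lines of Python. This sketch rests on an interpretation: V1 is taken to be the number of types that occur exactly once and V2 the number that occur exactly twice, which is one possible reading of the description above. The function names `statistics_row` and `frequency_list` are likewise hypothetical and do not reproduce ReCor's output format.

```python
import re
from collections import Counter


def count_words(text):
    """Count lower-cased word tokens (the tokenizer is an assumption)."""
    return Counter(re.findall(r"\w+", text.lower()))


def statistics_row(text):
    """Return types, tokens, types/tokens ratio, V1 and V2 for a text."""
    counts = count_words(text)
    types = len(counts)
    tokens = sum(counts.values())
    return {
        "types": types,
        "tokens": tokens,
        "ratio": types / tokens if tokens else 0.0,
        # Assumed reading: V1 = types seen exactly once, V2 = exactly twice.
        "V1": sum(1 for c in counts.values() if c == 1),
        "V2": sum(1 for c in counts.values() if c == 2),
    }


def frequency_list(text):
    """Return (word, frequency) pairs ordered from most to least frequent."""
    return count_words(text).most_common()


text = "el seguro cubre el viaje y el equipaje del viaje"
print(statistics_row(text))
print(frequency_list(text)[:3])
```

Even in this tiny sample the top of the frequency list is occupied by a function word, in line with the observation that the most frequent words are the 'empty' ones.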

Conclusions
At present it is not possible to determine a priori the exact total number of words (tokens) or documents that should be included in specialised ad hoc corpora in order for them to be considered representative. However, in this paper we have described a corpus-driven approach to evaluating corpus size a posteriori. To this end, a twofold approach to corpus building has been adopted. Firstly, a qualitative approach has been followed, in which a set of clear design criteria and a four-step compilation protocol are needed in order to ensure corpus representativeness in terms of quality. Secondly, a quantitative approach has been adopted based on the N-Cor algorithm and the ReCor programme. Version 2.5 of the ReCor programme makes it possible to determine whether the corpus contains an adequate number of documents and words (tokens) after it has actually been compiled (or even during analysis), i.e. a posteriori. Once no new types are entering the corpus, the minimum size can be considered to have been reached. This methodology has been illustrated through the compilation of an ad hoc corpus of travel insurance contracts in Spanish; however, it can be used to compile any ad hoc corpus, in any language and covering any topic and genre.

Figure 1. Storage data

Figure 3. N-Cor Algorithm