Log Files as a Tool for Improving Internet Dictionaries

In their advertisements, dictionary publishers often praise their dictionaries for taking into account the exact needs of the users. Until the beginning of the 1980s, however, no contributions on dictionary use were available, either as purely theoretical considerations or as empirical research. Since then, the situation has changed completely. Such a large number of user surveys have been carried out that it is no longer possible to give a complete overview. Nevertheless, this has led to no significant improvement of the situation as the majority of these surveys are not related to concrete examples of dictionary use. The surveys, which have always been concerned with printed dictionaries, have therefore not contributed to substantial improvements of dictionary conception. In the case of internet dictionaries, on the other hand, technical possibilities enable lexicographers to monitor user behaviour in a different and much more precise way. Analyses of log files reveal exactly which lemmas and which types of information have been requested, and, perhaps more significantly, which lemmas and which types of information have been requested but were not found in the dictionary. Furthermore, log files allow lexicographers to see the types of information which have not, or not yet, been searched for. All in all, log files may thus be used as a tool for improving internet dictionaries, and perhaps also printed dictionaries, quite considerably.
Henning Bergenholtz, Centre for Lexicography, Aarhus School of Business, Fuglesangs Allé 4, DK-8210 Aarhus V, hb@asb.dk
Mia Johnsen, Centre for Lexicography, Aarhus School of Business, Fuglesangs Allé 4, DK-8210 Aarhus V, miajohnsen_1@hotmail.com


Internet Dictionaries only?
As new tools are invented, old tools become obsolete. This process has been described in relation to electronic dictionaries vs. paper dictionaries (Simonsen 2000 with references to other scholars). We are not convinced, however; on the contrary, we are certain that paper dictionaries will remain a popular tool for at least the next two or three centuries (Bergenholtz 1996). Already now, CD-ROM dictionaries and even DVD dictionaries have had their time and will only be known by the next generation as a lexicographical medium that is no longer used. We believe that internet dictionaries, on the other hand, will have a much longer life. We are convinced, too, that the next 20-30 years will see not only internet dictionaries free of charge, most of them of quite low quality, but also really high-quality dictionaries for which the user can pay monthly, yearly or per view. But we will still have printed dictionaries, especially those that comprise only one or two volumes. It is more doubtful whether we will have multi-volume printed dictionaries and encyclopaedias as they are too expensive in relation to the limited number of years in which they will be up to date. Here, internet dictionaries and encyclopaedias will offer a version that is updated daily, and the user is not compelled to search for answers in several different volumes if he has different questions and a limited amount of time at his disposal. But for the smaller dictionaries, there is no doubt that there is a demand for paper and internet versions of the same dictionary. We recently experienced this with the free internet dictionary THE DANISH-ENGLISH DICTIONARY OF ACCOUNTING (http://www.regnskabsordbogen.dk/iasdk). This dictionary has been available since August 2003, but very soon after the release there were so many requests from users for a paper version that a publishing company asked for permission to publish one (DICTIONARY OF ACCOUNTING DANISH-ENGLISH 2004).
In this paper, we deal with the use of internet dictionaries, more specifically with THE DANISH INTERNET DICTIONARY (http://www.netordbog.asb.dk), which has been available on the internet since April 2002.

User Surveys in Theory and Practice
Dictionaries are utility products. They are tools designed to help a potential dictionary user solve problems with producing, comprehending or translating a text and to provide cultural, encyclopaedic or linguistic knowledge. The function of a given dictionary is to provide assistance to a specific user group with specific characteristics in order to meet the complex needs that arise in a specific type of user situation. A concrete dictionary can have one or more functions, i.e. it can be mono- or multi-functional. Like any other utility product, dictionaries also have a genuine purpose. This genuine purpose comprises the totality of functions of a given dictionary and the subject field(s) that it covers (Bergenholtz/Tarp 2003). References to users and their needs have been made in dictionary prefaces and other lexicographic contributions for centuries. There is nothing new in that. It is therefore rather a paradox that the German lexicographer Wiegand (1977) was right in his conclusion that the dictionary user is the "known unknown". Similarly, 25 years later, the dictionary user was referred to as a yeti (Bergenholtz 2002). This does not mean that no research has been carried out on dictionary use; on the contrary. From 1985 until today, so many monographs, editions and papers in journals have been published that it is difficult or even impossible to get a complete overview. When Bergenholtz (2002) insists on the comparison between the dictionary user and a yeti, he does not mean that no one has made fatiguing expeditions. Rather, Bergenholtz (2002) argues that, in reality, the main part of the research on dictionary use has only found unclear tracks of the dictionary user. We can add that the great majority of the investigations have not made clear why they want to find out how users use a dictionary. Perhaps it is too evident for those scholars that knowledge about user habits leads to better dictionary conceptions, which, in the end, lead to better and more helpful dictionaries. There could be another, but naturally related, goal: It is indeed interesting to know about the dictionary users' habits and experiences with different dictionaries. It is a goal sui generis, and, at the same time, it could be a contribution to dictionary criticism of a single dictionary or a set of dictionaries.
A distinction exists between two kinds of investigation. The first one is the criticized type, which is also the most practiced kind, and is undertaken without a direct relation to a concrete dictionary use. In such questionnaire surveys, the same methods are employed as in other forms of market analysis: a number of standard questions are asked of a selected sample concerning a certain product or behaviour, e.g. Atkins (1998). However, the answers from the informants do not necessarily reflect a genuine user situation. It cannot be ruled out that the problems, behaviour, etc. described by the informants differ from their real problems. The questions asked have to do with future activities, as in "Under which headword would you look for the following collocations?", or with past activities: "Which types of information do you look for most often?". There is no guarantee that the answers correspond to why and how the informants really have used or will use dictionaries. Such surveys are quite problematic because they presuppose that the informants remember exactly how they have used dictionaries in the past and that they are able to predict how they will do it in the future. And as far as we can see, none of the surveys meet the normal requirements of representativeness: very often the informants are students only, and they are not selected in accordance with the principles applied in the social sciences.
More realistic are the so-called dictionary protocols written by selected dictionary users directly after each dictionary use. They are more realistic because they refer to authentic user situations. The problem leading to the dictionary use is still clearly remembered by the informants. They can describe the result of the lookup and the way they used the dictionary items, and also how they found the wanted pieces of information or that they did not find an item which enabled them to solve the problem. In practice, however, such protocols are insufficient; compare the results from some of the investigations of this kind:
• Wiegand (1985) asked foreign students in Germany whose native language was not German to translate a text from their mother tongue into German. They were allowed to use any bilingual dictionary. He did not ask the informants to write a protocol for this part of the translation. He then asked the students to improve the German translation by using a monolingual German dictionary. Each time they encountered a problematic word or text part, the students were to describe the problem in the protocol, use a dictionary, correct the text and write down in the protocol what they had found or not found in the dictionary, and how they had used it. The results were quite interesting, but in the end not typical because the really interesting aspect of dictionary use, the translation phase, was ignored. Furthermore, it is doubtful whether a small number of language students is representative of other students or of all other kinds of users.
• Another attempt was undertaken by Danish libraries. Next to the shelves with dictionaries, the dictionary user would find a questionnaire to be filled out after each use of a dictionary. The investigation was part of a ministerial report on the need for additional or different dictionaries in Denmark (Vilkår 1982). After a year, however, the ministry had received only some fifty replies, most of them very imprecise.
All those investigations concern printed dictionaries, but in principle, questionnaire surveys can also be applied to internet dictionaries or other electronic dictionaries. We have no knowledge of any dictionary protocols regarding internet dictionaries. Other possibilities exist for internet dictionaries, however. With a log file, you can track every single use of the dictionary, depending, of course, on the search possibilities. If it is only possible to search for the lemma, only data for the first access step in the dictionary will be available. Which lemmas have been looked up, and how often? Which lemmas have never been looked up at all? And which words have been entered in the search field without result, i.e. how many and which lemma lacunas does the dictionary use indicate? It is this kind of user investigation that we will describe in the following section. If there is direct access to every dictionary item class (or field), however, it is possible to get exact data for the use of the semantic item, the grammar item, the collocation item, etc. As far as we know, the use of such exact log files has not yet been described. Obviously, such log files do not reveal exactly the kind of problem which the user had; nor do they reveal whether the user did indeed find the information to fulfil his needs. This could only be done by using a kind of dictionary protocol, e.g. if some of the users were asked or required to fill out a questionnaire after every use of the dictionary (technically, this could be done by not allowing the user to use the dictionary for free unless he fills out such a questionnaire). As far as we know, this possibility has not yet been put into practice either.
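The three questions raised above (which lemmas are looked up and how often, which are never looked up, and which search strings give no result) can be illustrated with a minimal sketch. It assumes, purely for illustration, that the log is available as a plain list of search strings and that the lemma list is a set of lower-case strings; the function name `summarise_log` and the miniature data are hypothetical and do not describe the actual log format of THE DANISH INTERNET DICTIONARY.

```python
from collections import Counter


def summarise_log(searches, lemmas):
    """Count found and not-found search strings and list unused lemmas.

    searches: iterable of raw search strings, one per logged lookup (assumed format)
    lemmas:   set of lower-case lemmas available in the dictionary
    """
    found = Counter()
    not_found = Counter()
    for raw in searches:
        query = raw.strip().lower()
        if not query:
            continue  # "empty" searches are left out of the statistics
        if query in lemmas:
            found[query] += 1
        else:
            not_found[query] += 1
    never_looked_up = lemmas - set(found)
    return found, not_found, never_looked_up


# Hypothetical miniature data, for illustration only.
lemmas = {"bacon", "babysitter", "babysprog"}
searches = ["bacon", "Bacon", "site", "", "babysitter", "site"]
found, not_found, never_looked_up = summarise_log(searches, lemmas)
```

Sorting `not_found` by frequency (for example with `not_found.most_common(100)`) would give a top list of not-founds of the kind discussed later in this paper.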
Several other interesting investigations may be carried out, too: In which way does the use of internet dictionaries differ from the use of paper dictionaries? Do we have user groups who never used paper dictionaries but now use internet dictionaries? To which extent do internet dictionaries function as lexicotainment dictionaries, i.e. only for entertainment? Does the use of paper dictionaries decrease? Or, as we believe, does the total use of dictionaries as an aid in connection with communicative and knowledge-related questions increase? None of this will be discussed in this paper, but it is certainly a relevant topic for further contributions. In this paper, we will analyse log files from THE DANISH INTERNET DICTIONARY. It is a Danish monolingual dictionary with 108,000 dictionary entries and a total of 126,000 different "records", i.e. the dictionary contains 18,000 subentries for polysemy. The genuine purpose of the dictionary is to help users with Danish as their mother tongue or with a good knowledge of Danish when they encounter problems in a text production process and look for help in the dictionary. The results will be used for decisions in the ongoing (and never-ending) work of improving dictionaries: not only this single internet dictionary but also other monolingual and even bilingual dictionaries and, at least to some extent, printed dictionaries.
There are only a few published scholarly descriptions of internet dictionary log files. The most interesting contribution, de Schryver/Joffe (2004), describes the log file for a South African bilingual dictionary, a Sesotho sa Leboa-English dictionary. The number of visitors and the number of lookups are not very high: 21,337 lookups made by 2,530 different visitors. De Schryver/Joffe write that the dictionary is partly used as a lexicotainment dictionary with no less than 17 sexually related words in the top 100 list. The most frequently requested words in both languages are greeting routine formulas like hello, good morning and goodbye, and dumêla, thôbêla and sepela, respectively. The users also look for non-existing words, especially misspellings, but not as often as would be expected, e.g. there is only one misspelling in the top 100 list. Furthermore, they describe emails from the users, most of them thanking the dictionary makers for the free dictionary. De Schryver/Joffe (2004) fail to mention one very interesting point: With 28,000 English lemmas and 25,000 Sesotho sa Leboa lemmas, the users cannot have looked up all lemmas (with only 21,337 lookups). It would be most interesting to know which types of words are not looked up: Is about 90% or 80% of the dictionary never used at all? The very limited number of lookups indicates that no more than 40-50% of the dictionary is actually being used. Will all lemmas in the dictionary be looked up in time when the dictionary has had many more users? Or are there some lemmas that will never be looked up? If future dictionary makers knew the answers to those questions, they would not have to waste time describing words of no interest to the users.
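The upper limit mentioned above follows from simple arithmetic: even if every single lookup had hit a different lemma, the number of lookups caps the number of lemmas that can have been used. A minimal sketch of that calculation, using only the figures quoted from de Schryver/Joffe (2004); the function name is ours and purely illustrative.

```python
def max_share_looked_up(lookups, lemmas):
    """Upper bound on the share of lemmas that can have been looked up.

    Assumes the best case in which no lemma is looked up twice.
    """
    return min(lookups, lemmas) / lemmas


english_lemmas = 28_000
sesotho_lemmas = 25_000
lookups = 21_337

bound = max_share_looked_up(lookups, english_lemmas + sesotho_lemmas)
print(f"At most {bound:.1%} of the lemmas can have been looked up")
```

Since the top list shows that many lookups concern the same few words, the real share is presumably considerably lower than this bound.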

How Frequently Are Internet Dictionaries Used?
To determine how frequently internet dictionaries are used in practice, it may be useful to look at a number of specific internet dictionaries that carry statistics of the search frequency. Unfortunately, only a limited number of internet dictionaries make such information available on the web site, and it has therefore not been possible to carry out a systematic analysis of e.g. the 10 most commonly used online dictionaries according to the Danish telecommunications provider, TDC (see below).
An example of a dictionary that does allow the user to access information on the search frequency is the German-French dictionary, ALLGEMEINES WÖRTERBUCH DEUTSCH-FRANZÖSISCH (http://site.ifrance.com/allinfor/dico/index.htm). According to the counter on the web site, 262,768 searches have been performed since 20 October 2002, which results in an average of approximately 380 searches per day. A French-Swedish online dictionary (FRANSK-SVENSKT LEXIKON; http://www.azoria.com/lexikon/indexsw.shtml) contains statistics of the number of searches per month for the last year, a total of 1,199,122 searches. Thus, approximately 3,406 searches are performed daily. Interestingly, however, the number of searches is lower at the end of the period in question than at the beginning. This may be contrasted with the search frequency of two other internet dictionaries, the German-English-German dictionary QUICKDIC (http://quickdic.org/index_d.html) and the above-mentioned SESOTHO SA LEBOA (NORTHERN SOTHO) - ENGLISH DICTIONARY (http://africanlanguages.com/sdp/). The latter does not provide a counter on the web site, but an article on the dictionary in which statistics on the search frequency are included appeared in the EURALEX Proceedings 2004 (de Schryver/Joffe 2004). The statistics of QUICKDIC show an increase in the number of searches from less than 20,000 in 1997 to almost 100,000 in 2001, and the same trend appears for the Sesotho sa Leboa-English dictionary, which was launched in 2003. This dictionary had a frequency of 1,308 searches in the first month, and 6 months after the release, this number had grown to 3,673 with numbers varying from just over 2,000 to almost 6,000 in the months in between.
As the following sections of this article will focus on the most commonly used Danish online dictionary, THE DANISH INTERNET DICTIONARY (http://www.netordbog.asb.dk), it may be interesting to examine whether this dictionary also shows an increase in search frequency. According to the statistics, an average of 1,631 searches were performed daily in the first period of 2003, whereas the following period of 2004 showed an increase to 2,520 searches per day. The following section will elaborate further on these figures, but it is clear that this dictionary, too, is used more frequently now than 6 months ago.
Although it is difficult to draw any final conclusions on the basis of statistics from randomly selected dictionaries, the trend seems to be that internet dictionaries in general are used ever more frequently. As mentioned above, the Danish telecommunications provider, TDC, carries daily statistics of the 10 most used internet dictionaries on their web site. Previously, TDC also carried a list of the 50 most used internet dictionaries, but this list is no longer available. From the top 10 list, it appears that THE DANISH INTERNET DICTIONARY is number one in the vast majority of cases, e.g. on the very recent list from 29 September 2004 where it is followed by ON-LINE DICTIONARIES, RETSKRIVNINGSORDBOG, CAMBRIDGE, EURODICAUTOM, BRITANNICA.COM, IT-LEKSIKON, SVENSKA AKADEMIENS ORDBOK, ORDBØGER - GYLDENDAL and WORTSCHATZ DEUTSCH. Unfortunately, the only one of these dictionaries for which statistical information is available is THE DANISH INTERNET DICTIONARY, and we therefore contacted TDC in order to determine what they base this top 10 list on and how many users are involved as regards the other dictionaries appearing on the list. However, no one at TDC knew anything about this top 10 list, and we have therefore not been able to determine how it is compiled. Nevertheless, it is still of interest to this article that THE DANISH INTERNET DICTIONARY appears so frequently at the top of the list as it indicates the wide use of this dictionary.
Specifically, THE DANISH INTERNET DICTIONARY had the following contents on 3 August 2004:
Some of the searches may be termed "empty" because the user did not write anything in the search field before hitting the search button. They are not included in the statistics:
"Empty" records: 4,179
In this article, we do not include the number of unique users as it is not possible to tell whether a unique user in the log is a single person. In most cases, the unique user will be a single dictionary user, but this is not always the case, e.g. if a whole school uses the same identity number for all computers connected in a network.
The search string can be used to look for the lemma, for an inflected form of the lexeme represented by the lemma, or for only a part of the lemma. Most users try the "traditional" way, i.e. looking directly for a lemma, but the other possibilities are used too: In total, the users have searched for 104,097 different orthographical forms found in THE DANISH INTERNET DICTIONARY. This figure comprises about 35,000 different lexemes represented by a lemma (a detailed discussion and explanation of this appears in section 4). By comparison, the number of different orthographical forms searched for but not found in the dictionary is higher, namely 116,066, a difference of about 12,000. This result is quite thought-provoking and gives the dictionary makers cause to revise the lemma selection, especially in the case of real lemma lacunas and of unsuccessful searches due to misspellings (more about that in section 4). As was the case in the log files described by de Schryver/Joffe (2004), the number of searches for sexually related expressions on the top 100 list is quite high, e.g. pik (cock) is number 25 and fisse (cunt) is number 29. Such words are mainly looked for in the evening and during the night. Furthermore, the log file enables us to follow individual users linking from one word to another, probably using the synonyms as links. Such use of a dictionary will hardly be due to communicative problems; the user knows the words and how to use them and is thus

Concrete Searches
A study of the log file from THE DANISH INTERNET DICTIONARY also reveals a number of specific problems encountered by the users of the dictionary. These problems fall into different categories which will be discussed in turn below.

The Passive
At present, it is not possible to search for the passive form of verbs in THE DANISH INTERNET DICTIONARY. However, the log file reveals that quite a large number of users actually attempt to do this: a search in the entire log file (1,021,139 searches) for the error string -es, i.e. not-founds that end in the letters -es (one of the two most common passive endings in Danish, the other being -s), returns a total of 4,141 hits. The vast majority of these, probably between 3,000 and 3,500 searches, are passive forms. The top 100 list of not-founds also contains 5 passive forms, i.e. fås (is available), nås (is reached), fåes (is available), nåes (is reached) and gennemgåes (is examined). In addition to this, the top 500 of not-founds contains a further 7 searches for passive forms, i.e. opnås (is achieved), gåes (is walked), foreslås (is suggested), anses (is considered), forståes (is understood) and opnåes (is achieved).
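A search of this kind could be narrowed down automatically by keeping only those not-founds whose stem, after removal of -s or -es, is a known infinitive. The following is a minimal sketch under that assumption; the function name `passive_candidates`, the small infinitive list and the counts are hypothetical, and a real analysis would still require manual inspection of the hits.

```python
from collections import Counter


def passive_candidates(not_found, infinitives):
    """Return not-founds that look like passive forms of known verbs.

    not_found:   mapping from search string to number of searches
    infinitives: set of verb infinitives taken from the dictionary
    """
    hits = {}
    for query, count in not_found.items():
        for ending in ("s", "es"):
            # e.g. anses -> anse (ending -s), fåes -> få (ending -es)
            if query.endswith(ending) and query[: -len(ending)] in infinitives:
                hits[query] = count
                break
    return hits


# Hypothetical counts, for illustration only.
not_found = Counter({"fås": 5, "fåes": 3, "anses": 2, "site": 9, "bus": 1})
infinitives = {"få", "nå", "anse", "opnå"}
candidates = passive_candidates(not_found, infinitives)
```

A filter like this separates likely passives such as fås and fåes from other strings that merely end in -s, but it cannot tell a correct form from a misspelled one.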
In practically all of the above cases, the users have subsequently conducted a search for the infinitive of the word, but the examples nonetheless show that many users are unsure of how to form the passive of certain words (with or without e, i.e. -s or -es). Thus, it may be considered whether it would be relevant to add the passive form to the dictionary to enable the users to search for it in cases of doubt.
The majority of searches for passive forms concern the present tense of the passive, but there are also examples of the past tense, e.g. hjælpedes.
Most of these examples are grammatically correct, although some are used less frequently than others (e.g. gaves, påstodes), whereas yet others do not exist at all, e.g. hjælpedes and fundes. This illustrates that users may also be unsure of how to form the past tense of the passive as the correct form is not always evident. Consequently, this is another argument for including the passive form in the dictionary.

The Imperative
As is the case with the passive, it is presently not possible to search for imperative forms in THE DANISH INTERNET DICTIONARY. Among the 100 most frequent not-founds, no searches for imperative forms appear, whereas 15 searches are registered among the 500 most frequent not-founds. In some of these cases, however, it is impossible to determine for certain whether the user actually searched for the imperative, e.g. in the case of registrer (which may also be a misspelling of register (register) or of the plural form registre (registers)) and ærger (which may also be a misspelling of ærgre (to annoy) or ærgrer (annoys)).
If these two terms are viewed in the context of the entire log file, however, it seems most likely that the user did not search for the imperative form as he subsequently searched for register, registre and ærgre, ærgrer. Even so, the examples mentioned indicate that there is a need among the users for being able to search for the imperative as the formation of it is not always as simple as it may seem.
A search for specific imperatives in the log file (other than those mentioned above) reveals that users have searched for vent (wait) on 17 occasions, 11 times each for find (find) and lyt (listen), 9 times for installer (install), 7 times each for luk (close) and returner (return), and 6 times each for aflever (hand over), angiv (state) and skriv (write). This supports the assumption that it would be relevant to include the imperative form in the dictionary to enable users to search for it. Some imperative forms, such as spis (eat), hør (hear), drik (drink), sig (say), sabl (bill), saboter (sabotage), sagtn (slacken), saliggør (save), saluter (salute), samkør (co-ordinate), køb (buy), skub (push), træk (pull), kast (throw) and kør (drive), do not appear in the log file at all, i.e. no searches have been performed for these words.
The reference book HANDBOOK OF CONTEMPORARY DANISH lists a number of imperative forms that may cause the user problems. A search for those words in the log file yields the following results:
- affjedr (spring): No hits, no hits on affjeder (alternative, though not correct, spelling) either
- behandl (treat): 4 hits, 2 hits on behandel (alternative, though not correct, spelling)
- hamstr (hoard): No hits
- krydr (season): 3 hits, no hits on krydder (alternative, though not correct, spelling)
- pensl (paint): 2 hits
- sagtn (slacken): No hits, no hits on sagten (alternative, though not correct, spelling) either
- saml (collect): 2 hits, 1 hit on sammel (alternative, though not correct, spelling)
- smuldr (crumble): 2 hits, no hits on smulder (alternative, though not correct, spelling)
- åbn (open): 23 hits
Clearly, the problem of not being able to search for imperative forms is not as widespread as the problem of the "missing" passive forms. However, quite a number of examples do appear in the log files, and it might therefore still be relevant to include this form in the dictionary for the reasons mentioned above.

Dyslexia
Among the not-founds, a number of misspellings appear in which the problem seems to be reversal of letters.
Neither the top 100 nor the top 500 of most frequent not-founds register any examples of this problem. A study of 300 consecutive searches in a random place in the log file revealed only 3 examples of the phenomenon, i.e. onanym instead of anonym (anonymous), akapolypse instead of apokalypse (apocalypse) and medmnidre instead of med mindre (unless). Thus, the problem does not seem to be very common. Also, it is impossible to tell whether these misspellings are actually due to dyslexia or whether they are simply typing errors.
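The three examples above share one property: each misspelling contains exactly the same letters as the intended lemma, only in a different order. A minimal sketch of how such cases could be retrieved from a list of not-founds is given below. It is an illustration under our own assumptions (the names `letter_key` and `reversal_candidates` are hypothetical), and an anagram test of this kind is broader than letter reversal proper, so every hit is only a candidate.

```python
from collections import defaultdict


def letter_key(word):
    """Sorted letters of a word, ignoring spaces (e.g. 'med mindre')."""
    return "".join(sorted(word.replace(" ", "")))


def reversal_candidates(not_found, lemmas):
    """Map each not-found to the lemmas that consist of the same letters."""
    by_key = defaultdict(set)
    for lemma in lemmas:
        by_key[letter_key(lemma)].add(lemma)
    result = {}
    for query in not_found:
        matches = by_key.get(letter_key(query))
        if matches:
            result[query] = sorted(matches)
    return result


lemmas = {"anonym", "apokalypse", "med mindre"}
not_found = ["onanym", "akapolypse", "medmnidre", "fornylig"]
reversals = reversal_candidates(not_found, lemmas)
```

As noted above, such a list cannot show whether a given misspelling is due to dyslexia or is simply a typing error.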

Spelling Mistakes Affected by Pronunciation
An extremely large proportion of the misspellings found in the log file can be ascribed to users spelling the word as it is pronounced. Among the 100 most frequent not-founds, 8 examples appear, and a further 48 examples can be found among the 500 most frequent not-founds:
The many examples among the 100 most common not-founds show that this issue is highly relevant. However, not quite as many occurrences are found among the 500 most common not-founds, and a study of 200 consecutive searches in a random place in the log file yields only 10 examples:
- holdspil (team play) (this is actually the correct spelling, but the word is not included in the dictionary; see also the section "Lemma Lacuna" below)
- hold-spil
- hold spil
- lamme koteletter (twice) instead of lammekoteletter (lamb chops)
- fornylig instead of for nylig (recently)
- så som instead of såsom (such as)
- somregel instead of som regel (usually)
- ligemeget instead of lige meget (of no consequence)
The above-mentioned reference book HANDBOOK OF CONTEMPORARY DANISH contains a section that explains the use of one or more words, both in regard to compounds and prepositions. Based on the search results from the log file, it may be relevant to include similar information on this issue in THE DANISH INTERNET DICTIONARY, e.g. under the heading "Sprognormer" (linguistic norms).
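Not-founds that differ from a lemma only in word division (written as one word, as two words, or with a hyphen) can be found by comparing the strings after spaces and hyphens have been removed. The sketch below illustrates this under our own assumptions; the names `squash` and `word_division_candidates` are hypothetical, and the lemma list only contains the correct forms quoted above.

```python
def squash(text):
    """Remove spaces and hyphens so that only the letter sequence remains."""
    return text.replace(" ", "").replace("-", "").lower()


def word_division_candidates(not_found, lemmas):
    """Map each not-found to lemmas with the same letters but another division."""
    index = {}
    for lemma in lemmas:
        index.setdefault(squash(lemma), []).append(lemma)
    return {
        query: sorted(index[squash(query)])
        for query in not_found
        if squash(query) in index
    }


lemmas = {"lammekoteletter", "for nylig", "såsom", "som regel", "lige meget"}
not_found = ["lamme koteletter", "fornylig", "så som", "somregel",
             "ligemeget", "hold-spil"]
divisions = word_division_candidates(not_found, lemmas)
```

In this example hold-spil remains unmatched because holdspil itself is missing from the lemma list, which is exactly the kind of case that points to a lemma lacuna rather than a spelling problem.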

Non-existing Words
A study of the not-founds also reveals that a number of searches have been performed for words that are non-existing. In some of these cases, however, e.g. privilegie and implementation, it is debatable whether the words exist or not. A search on Google shows that these forms are very widely used, although the correct forms are privilegium and implementering.
Other examples of non-existing words occurring in the log file include:
As mentioned above, it may be debatable whether all of the examples listed in group 1 may be defined as non-existing words. For example, a search on Google returns a number of hits for implementation although the correct form is implementering. This means that such words exist in common usage even though they are not officially authorized. Many of these incorrect word formations, particularly those concerning the endings -else/-ing and -ing/-ion, are very common, and it may therefore be a good idea to include them in the dictionary with a reference to the correct term. This is particularly relevant for the words occurring in the top 100 and top 500 of the most common not-founds.
It is much more difficult to make allowances for words like those listed in category 2 as very few users have searched for them. Thus, these words are not part of common usage, and none of them occur in the top 100 or top 500 either.

Linking Morphemes
Neither the top 100 nor the top 500 of most common not-founds contain any examples of problems related to linking morphemes.
A study of 200 consecutive searches in a random place in the list of not-founds from the log file yields only two examples, tidalder instead of tidsalder (age or era) and vidensproget/videnssproget (the language of knowledge). A search for tidalder reveals that two different users have searched for this word on two different occasions, whereas only the one search has been carried out for vidensprog/videnssprog.
Other words related to the issue of linking morphemes that can be found in the log file include kontraktsforhold (contractual relationship), which has also been searched for on two occasions, and isterningmaskine/isterningemaskine/isterningsmaskine (ice machine). According to THE DANISH INTERNET DICTIONARY, the correct form is isterningsmaskine, and the other two alternatives have references to this term.
Thus, it seems that only certain compounds cause problems for users, whereas problems relating to linking morphemes in general do not seem to be very common. In the case of compounds that occur more than once, e.g. tidalder and the compounds containing kontrakt-/kontrakts-, it may be considered whether it would be relevant to include these words in the dictionary with a reference to the correct form, as has been done with isterningsmaskine.
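Not-founds of this type could be located by generating, for each search string, the forms that arise when a linking element is inserted, removed or exchanged, and checking these against the lemma list. The sketch below is an illustration only: it assumes that -s- and -e- are the linking elements of interest, the function names are hypothetical, and since it tries every position in the word it will over-generate, so hits have to be checked manually.

```python
def linking_variants(word, links=("s", "e")):
    """Forms of `word` with one linking element inserted, removed or exchanged."""
    variants = set()
    for i in range(1, len(word)):
        for link in links:
            variants.add(word[:i] + link + word[i:])  # insert a link
            if word[i:].startswith(link):
                rest = word[i + len(link):]
                variants.add(word[:i] + rest)  # remove a link
                for other in links:
                    if other != link:
                        variants.add(word[:i] + other + rest)  # exchange a link
    variants.discard(word)
    return variants


def linking_candidates(not_found, lemmas):
    """Map each not-found to lemmas that differ only by a linking element."""
    result = {}
    for query in not_found:
        matches = linking_variants(query) & lemmas
        if matches:
            result[query] = sorted(matches)
    return result


lemmas = {"tidsalder", "isterningsmaskine"}
not_found = ["tidalder", "isterningmaskine", "isterningemaskine", "site"]
linked = linking_candidates(not_found, lemmas)
```

The resulting pairs could serve as a starting point for the cross-references suggested above, in the same way as the existing references to isterningsmaskine.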

Lemma Lacuna
Undoubtedly, the most obvious way that log fi les can be used to improve internet dictionaries is as a tool to discover lemma lacuna.Particularly the lists of the 100 and 500 most common not-founds are interesting in this connection as a large number of users have searched for the terms included on these lists, i.e. the terms are commonly used.Consequently, it is very relevant to add these words to the dictionary.A large proportion of these terms, more specifi cally 38, are technical terms that may be classifi ed as follows: Computer-related terms: Top 100: site (site), adsl (adsl) Top 500: integrator (integrator), avatar (avatar), e-learning (e-learning), programpakke (programme package)
The main question arising from this analysis is whether such terms should be included in a dictionary like THE DANISH INTERNET DICTIONARY, or whether this dictionary should merely contain words from the common language. We suggest that at least the most common of the technical terms, e.g. site (site), franchise (franchise), e-learning (e-learning) etc., which are part of everyday usage, should be included in THE DANISH INTERNET DICTIONARY, whereas highly technical terms such as nomotetisk (nomothetic) or fokal (focal) might be left to specialised dictionaries.
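The lists of most common not-founds discussed in this section can be derived from a log file with very little code. The sketch below assumes that each log record has already been reduced to a pair of the search string and a flag saying whether an article was found; the actual log format of THE DANISH INTERNET DICTIONARY is not described here. The function name top_not_found is hypothetical, and treating upper and lower case as the same search string is likewise an assumption.

```python
from collections import Counter


def top_not_found(records, n):
    """Return the n most frequent unsuccessful search strings.

    records: iterable of (query, found) pairs, an assumed log format.
    Queries are lower-cased so that 'Site' and 'site' count as one string.
    """
    counts = Counter(query.lower() for query, found in records if not found)
    return counts.most_common(n)


# Toy log records, invented for illustration.
records = [
    ("site", False),
    ("Site", False),
    ("adsl", False),
    ("bacon", True),
    ("site", False),
]
print(top_not_found(records, 2))
```

Calling the function with n=100 and n=500 would give lists corresponding to the top 100 and top 500 used above.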

Lemmas not searched for
Another interesting aspect to consider in connection with dictionary use is whether the dictionary is actually used to its full extent, i.e. whether all dictionary entries have been searched for, and if not, how many of the total number of entries have never been requested by the users.
According to the statistics for THE DANISH INTERNET DICTIONARY, a total of 104,097 orthographical words have been searched for. This figure constitutes approximately one third of all possible searches as it is possible to search for inflections of a particular lexeme, i.e. for the head word itself as well as for all grammatical forms listed in the field of grammatical inflections. As mentioned earlier, however, passive and imperative forms are not included, and neither is the genitive form of nouns. In other words, if only one third of all entries have been searched for, approximately two thirds of the entire dictionary are not used in practice. But this can only be true if the user searches for inflected forms as often as for the non-inflected form, the lemma. Normally, or at least more frequently, the user will use the lemma sign as a search string; therefore we assume that the 104,097 search strings represent more than one third of the lemmas.
In order to establish whether this assumption is true and whether it is possible to discern a pattern in the words that have not been searched for, we examined the first 100 entry words of THE DANISH INTERNET DICTIONARY starting with the letter b. The result was that 52 of these words had been searched for (e.g. B-aktie (B share), babysitter (babysitter) and bacon (bacon)), whereas 48 had not (e.g. B-dur (B major), B-skål (B cup) and babysprog (baby talk)). Clearly, this corresponds to the above estimate that more than one third of the lemmas in the dictionary are actually used, and we believe that this small test is representative of the entire dictionary. The really interesting question is: By how much will the number of used articles in the dictionary increase once a larger number of logged lookups is available? We do not believe that the dictionary will ever be used to its full extent, but this topic is indeed interesting. We intend to make further investigations into "lemmas not searched for".

Obviously, it is not possible to discern a distinct pattern on the basis of the above examination, e.g. that certain types of words, such as semantic or orthographical variants, are never requested. However, it is still unclear whether such an investigation would be of practical use to lexicographers. It depends on whether a systematic description of the requested words compared with the non-requested words is possible.
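The small test on the first 100 entry words under b can be repeated for any sample, or for the whole lemma list, once the set of successfully searched lemmas has been extracted from the log file. The following sketch assumes that both are available as plain collections of strings; the function name lemma_coverage is hypothetical, and the sample below only reuses the six example words quoted above.

```python
def lemma_coverage(lemmas, searched):
    """Return the share of lemmas that occur among the searched strings,
    together with the list of lemmas never searched for."""
    searched = set(searched)
    never_searched = [lemma for lemma in lemmas if lemma not in searched]
    share = (len(lemmas) - len(never_searched)) / len(lemmas)
    return share, never_searched


# The six example words from the sample; three were searched for.
sample = ["B-aktie", "babysitter", "bacon", "B-dur", "B-skål", "babysprog"]
searched = {"B-aktie", "babysitter", "bacon"}
share, never_searched = lemma_coverage(sample, searched)
print(share, never_searched)
```

For the full sample reported above, the same calculation gives 52 / 100 = 0.52. The list of lemmas never searched for is the material on which a systematic comparison of requested and non-requested words would have to be based.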

Perspectives
A detailed report of the approximately 2,000 emails that the dictionary makers behind THE DANISH INTERNET DICTIONARY have received from users is beyond the scope of an interpretation of log files. About half of these users merely express their thanks to the dictionary makers for the dictionary and report that they use it often, whereas less appreciative emails were received during a period of technical problems with the server; the reason being that, even in the case of a dictionary free of charge, the user expects to have access to his tool whenever he needs it. In many other cases, the users propose new lemmas or report spelling mistakes found in the dictionary. All this, however, is not the topic of this paper.

We believe we have shown that log files can be used by dictionary makers for improving their dictionary, in this case for improving the lemma selection. The use of a better search system which includes the possibility of a direct search for synonyms, collocations, antonyms, word formations, grammar items etc. will provide a much more precise way of obtaining knowledge about real dictionary use. On the basis of such data, it will be possible to prepare much better internet dictionaries. However, this is only possible if the necessary funding to do this work is granted by national or private organisations in the future, or if the charging of a pay-per-view fee or a monthly or yearly payment for using quality internet dictionaries becomes more common.
Top 100: volatilitet (volatility), franchise (franchise)
Top 500: debit (debit), efficiens (efficiency)
Legal terms:
Top 100: none
Top 500: per se (per se), praejudicerende (prejudicial), pønal (penal), vanhjemmel (defective title), ankebegaering (notice of appeal), kontraherende (contracting)
Medical terms:
Top 100: none
Top 500: praevalens (prevalence), invasive (invasive), alzheimer (Alzheimer), fokal (focal), resektion (resection), replikation (replication), eradikation (eradication)

We did not install a log file system when THE DANISH INTERNET DICTIONARY was first made available to the public in 2002. During the time in which a log file system was in operation, from 1 January 2003 to 3 August 2004, there was a period from June to September 2003 in which the log file system was switched off for technical reasons. In total, we had 1,021,139 single searches on 456 logging days. Thus, someone looks something up in the dictionary 2,239 times on average each day. We can see an increase in the number of searches over time: In 2003, the average number of searches was 1,631; in 2004 it increased to an average of 2,520. On the first four days of the week, between 4,000 and 4,500 searches are normally performed, and on Fridays, the number is about 3,500 searches. During school holidays and at weekends, between 1,000 and 1,500 searches are carried out. By a single search we mean every new article searched for, either by looking for a new lemma or by linking from one article to another (all items that are identical with lemmas in THE DANISH INTERNET DICTIONARY are also functioning links).
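Figures such as the daily average and the differences between weekdays can be computed directly from the dates of the logged searches. The sketch below assumes that each single search contributes one datetime.date value to a list; this is an assumed representation, not the actual log format, and the function names are hypothetical. Days without any logged search do not appear in the input and are therefore not counted as logging days.

```python
from collections import Counter
from datetime import date


def daily_average(search_dates):
    """Average number of searches per day on which something was logged."""
    per_day = Counter(search_dates)
    return sum(per_day.values()) / len(per_day)


def average_by_weekday(search_dates):
    """Average number of searches per weekday (0 = Monday ... 6 = Sunday)."""
    per_day = Counter(search_dates)
    totals, days = Counter(), Counter()
    for day, n in per_day.items():
        totals[day.weekday()] += n
        days[day.weekday()] += 1
    return {weekday: totals[weekday] / days[weekday] for weekday in totals}


# Toy data: two Mondays (3 and 1 searches) and one Friday (2 searches).
dates = [date(2004, 1, 5)] * 3 + [date(2004, 1, 9)] * 2 + [date(2004, 1, 12)]
print(daily_average(dates))
print(average_by_weekday(dates))
```

Applied to the totals reported above, the same division gives 1,021,139 / 456 ≈ 2,239 searches per logging day.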
The log files also allow us to see if the user looked for a word and did not find it, a so-called lemma lacuna. In total, there are only [...] interested in seeing how they are explained and what kind of exciting collocations they have. More closely related to the intended genuine purpose of THE DANISH INTERNET DICTIONARY, i.e. helping Danish users solve text production problems, is the search for synonymous or almost synonymous expressions, e.g.
Consequently, these two lists alone show that the problem of misspellings affected by pronunciation is the reason for a very large part of the non-successful searches. A study of 200 consecutive searches in a random place in the log file provides the same picture. 19 examples were found:

Another common problem revealed by the log file is whether a particular term should be written in one or two words. The top 100 and top 500 lists of most common not-founds contain the following examples:

The top 100 list contains 6 examples of words that can be classified as lemma lacuna, whereas the top 500 contains a further 53 such words: