Saturday 12 July 2008

Anachronism Isn't What It Used to Be...

I'm not a native English speaker - and sometimes I find myself in a situation where I need to look up a word in the dictionary in order to find more graceful-sounding synonyms, et cetera. But the dictionary is not always very helpful. Consider the following query in a thesaurus:

Synonyms for begin: … inaugurate …
Well - using "to inaugurate" right away would lead to strange results, because it's not really used for anything besides politics nowadays.

Concordance and Collocation

This means that sometimes you have to search not only for the translation or synonyms of a word, but also for usage examples. A good dictionary typically provides some, but is usually limited in the number of examples it can display. This is why looking at occurrences of a word in real text, together with their immediate context, is sometimes unavoidable (and is in fact the way dictionaries are made). Linguists call such a listing of a word in its contexts a concordance. It shows both the grammatical patterns the word takes part in (e.g. the right tense or aspect, or the right preposition) and its collocations, i.e. statistical relatedness (two or more words that form a phrase or a certain recurring pattern in language). To find out how a word is used, linguists search for it in corpora and then classify its contexts and the patterns in them.
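To make the idea concrete, here is a minimal sketch of a keyword-in-context listing plus a crude collocate count. The function names, the window width of three words and the tiny sample text are all made up for illustration; a real corpus tool would do far more (lemmatisation, proper statistics, and so on).

```python
import re
from collections import Counter

# Made-up sample text, standing in for a real corpus.
SAMPLE = (
    "The president will inaugurate the new bridge. "
    "They begin the meeting at noon. "
    "The mayor will inaugurate the new library."
)

def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())

def concordance(tokens, keyword, width=3):
    """Return a (left context, keyword, right context) triple per hit."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok == keyword:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append((left, tok, right))
    return lines

def collocates(tokens, keyword, width=3):
    """Count the words that occur within `width` tokens of the keyword."""
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok == keyword:
            counts.update(tokens[max(0, i - width):i])
            counts.update(tokens[i + 1:i + 1 + width])
    return counts

if __name__ == "__main__":
    tokens = tokenize(SAMPLE)
    for left, word, right in concordance(tokens, "inaugurate"):
        print(f"{left:>25} [{word}] {right}")
    print(collocates(tokens, "inaugurate").most_common(3))
```

Raw counts like these are dominated by function words such as "the"; real collocation measures compare the counts against how often each word occurs overall.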

A Poor Man's Corpus-engine: Google

But such corpora are usually hard to get or expensive, since virtually every accumulation of a non-trivial amount of text will contain copyrighted material (and it is often not easy, read: expensive, to improve the signal-to-noise ratio). So, as a poor man, one has to resort to Google or an equivalent search engine. Searching for a particular word or phrase in Google will often give satisfactory results for its most typical usage patterns. Phrase queries (the phrase enclosed in double quotes) often help to narrow the scope of possible collocations.

… and while doing a bit of research I discovered the following: A word is archaic if the first two pages of Google hits for it consist only of results from dictionaries. Try searching for advertent (heck, it's not even in vim's default word file on Debian anymore…) as an example...
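For fun, the rule of thumb can be written down as a small check. This is only a sketch under loud assumptions: the list of dictionary domains is my own guess and far from complete, the function names are invented, and the result URLs are expected to be collected by hand from the first two result pages (no search API is assumed here).

```python
from urllib.parse import urlparse

# Assumed, incomplete list of dictionary sites; extend as needed.
DICTIONARY_DOMAINS = {
    "merriam-webster.com",
    "dictionary.com",
    "thefreedictionary.com",
}

def is_dictionary_url(url, domains=DICTIONARY_DOMAINS):
    """True if the URL's host is one of the domains or a subdomain of one."""
    host = (urlparse(url).hostname or "").lower()
    return any(host == d or host.endswith("." + d) for d in domains)

def looks_archaic(result_urls, domains=DICTIONARY_DOMAINS):
    """Apply the heuristic: every collected hit comes from a dictionary."""
    return bool(result_urls) and all(
        is_dictionary_url(u, domains) for u in result_urls
    )
```

Calling looks_archaic with the hand-collected hits for a word returns True only if every single one of them points at a dictionary site; an empty list counts as "no evidence" and returns False.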
