Google Books Used to Track Life and Death of Words

By Max Eddy Mar 19th, 2012, 9:45 am

Google’s ongoing attempt to turn the libraries of the world into a single, massive collection of scanned documents has found an unusual use in the hands of some physicists. Venturing into realms usually left to English majors and linguists, the researchers used Google Books’ scanned tomes as a massive data set and announced new findings on the evolutionary life and death of words, including the assertion that English contains about 1,000,000 words, far more than most dictionaries would have you believe.

The paper, which was published in the journal Scientific Reports, looked at words in the Spanish, English, and Hebrew languages, discovering the same remarkable patterns in all of them. For instance, the authors put forward that the death rate of words has increased while the birth rate has slowed dramatically. English, the authors say, seems to be adding 8,500 words a year. However, the authors believe the word-birth rate is slowing as existing words adequately describe most things and the “marginal utility” of new words is limited.

Think of how some web startups often invent their own terminology to describe what users do. For instance, on Twitter users can “retweet” the tweets of other users. Some of these words catch on, while others do not. This jostling for public acceptance is illustrated in the study by looking at “X-ray.”

Personally, I can think of no way to describe the idea of “X-ray” without using the word itself, but it actually competed against “Roentgenogram” (named for the X-ray’s discoverer, Wilhelm Röntgen) and “Radiogram.” It seems it took until about 1980 for X-ray to overtake the others and become the much beloved term we have today.

Though X-ray took about 80 years to achieve dominance, the researchers say that 40 years is a critical turning point in the life of a word. If a word has not become widely adopted within about 30 to 50 years of its invention, it quickly fades into obscurity. The reason for this timeframe may be that it is the typical amount of time it takes for a word to appear in a dictionary. However, the authors also suggest that adoption of the word by a younger generation, perhaps similar to creolization, might be the cause. It may be that if a new word can’t be hip to the kids, it will not survive.

Fascinating as these conclusions are, the study is not perfect. Mark Liberman at the Language Log points out:

One critical consideration, however, is that this paper is not really about words at all: it’s about contiguous letter-strings in optical-character-reader output for scanned printed books. Different inflected forms of a word are different “words”; word spellings are different “words”; word-fragments split typographically across lines are different “words”; typos are different “words”; OCR errors are different “words”.

This may account for the enormous number of English words claimed in the study, which the Wall Street Journal says clashes with Webster’s count of 348,000 English words. Jargon, like the scientific names of animals, may also be further skewing the study. Liberman’s point about typos and odd typography is addressed somewhat in the paper, where the authors concede that the increasing death rate of words likely has to do with improved editing standards. Spellcheck, for instance, is built right into most computers these days, which is almost certainly homogenizing language.
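To get a feel for how much this matters, here is a minimal Python sketch of the effect Liberman describes. It is not the method used in the paper. The sample text, the function names, and the hand-made `CORRECTIONS` table are all assumptions invented for illustration. It simply counts every distinct run of letters as a "word," then counts again after folding case and mapping variants to a base form.

```python
import re


def letter_strings(text):
    # Every maximal run of ASCII letters counts as one "word"; nothing is
    # case-folded, lemmatized, or spell-checked.
    return re.findall(r"[A-Za-z]+", text)


def naive_vocabulary(text):
    # Distinct letter-strings exactly as they appear in the text.
    return set(letter_strings(text))


def folded_vocabulary(text, corrections):
    # `corrections` is a hypothetical, hand-made map from a lowercase
    # variant (inflection, alternate spelling, typo) to its base form.
    return {corrections.get(s.lower(), s.lower()) for s in letter_strings(text)}


# Illustrative sample: one word in five surface forms (assumed data).
SAMPLE = "Color colour colors colur Color"
CORRECTIONS = {"colour": "color", "colors": "color", "colur": "color"}

print(len(naive_vocabulary(SAMPLE)))                 # distinct raw strings
print(len(folded_vocabulary(SAMPLE, CORRECTIONS)))   # after folding variants
```

In this toy sample a single word shows up as four different letter-strings, and all four collapse into one after folding. Real OCR output is far messier, so the sketch only shows the direction of the bias, not its size.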

Grain of salt firmly taken, the research is still a fascinating — and exhaustive — look at language using tools that until now were simply unavailable. As interesting as it is, what is perhaps more exciting is how much more research is almost certainly coming down the line as more and more books are added to Google’s digital collection and researchers find new ways to use this data.

(Wall Street Journal via Slashdot, full text of Statistical Laws Governing Fluctuations in Word Use from Word Birth to Word Death, image via David Flores)