Box 16.1 Zipf’s Law
Zipf’s law is based on the observation
that the most frequent word in English
(“the”) is twice as frequent as the
second-most frequent word (“of”), three
times as frequent as the third-most
frequent word (“and”), and so on. Zipf’s
law holds quite well for at least the
1000 most frequent words in the English
language (Schroeder, 2009). The
frequency of words is then derived from
the harmonic series:
$$1,\ \frac{1}{2},\ \frac{1}{3},\ \frac{1}{4},\ \ldots,\ \frac{1}{N},$$
or more precisely, the frequency f(k; N)
of the k-th-most frequent word is:
$$f(k; N) = \frac{1/k}{\sum_{j=1}^{N} 1/j} \approx \frac{1/k}{\ln N + \gamma},$$
in which N is the number of words in the
English language and γ ≈ 0.57722… is
the Euler-Mascheroni constant. We have
used the fact that:
$$\sum_{k=1}^{N} \frac{1}{k} = \ln N + \gamma + O\!\left(\frac{1}{N}\right).$$
The notation O(1/N) (the “big O”
notation) indicates that this term
decreases at least as fast as 1/N as N
increases. For large N, the last term
can, therefore, be ignored. The
statistical distribution with frequency
f(k; N) is also called the Zipfian
distribution. Note that the distribution
depends on the cutoff N.
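
As a rough numerical illustration of the formulas above, the following
Python sketch computes f(k; N) twice: once with the exact harmonic-sum
normalization and once with the normalization replaced by ln N + γ. The
function names and the cutoff N = 1000 are assumptions chosen for this
example; they do not come from the text.

```python
import math

# Euler-Mascheroni constant (gamma), truncated to double precision.
EULER_GAMMA = 0.5772156649015329


def harmonic(n: int) -> float:
    """Partial sum of the harmonic series: 1 + 1/2 + ... + 1/n."""
    return math.fsum(1.0 / j for j in range(1, n + 1))


def zipf_exact(k: int, n: int) -> float:
    """f(k; N) normalized by the exact harmonic sum."""
    return (1.0 / k) / harmonic(n)


def zipf_approx(k: int, n: int) -> float:
    """f(k; N) with the harmonic sum approximated by ln N + gamma."""
    return (1.0 / k) / (math.log(n) + EULER_GAMMA)


if __name__ == "__main__":
    n = 1000  # hypothetical cutoff, chosen only for illustration
    for k in (1, 2, 3, 10):
        print(k, zipf_exact(k, n), zipf_approx(k, n))
    # Size of the neglected O(1/N) term for this cutoff.
    print(harmonic(n) - (math.log(n) + EULER_GAMMA))
```

With the exact normalization the frequencies sum to 1 over k = 1, …, N.
The last printed value is the neglected term, which for this cutoff
should come out close to 1/(2N), consistent with the O(1/N) bound.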
Zipf’s law describes, in addition to
word usage, the rank distribution of
a surprising number of natural and
sociological phenomena: size of cities,
size of countries (except China and
India), length of rivers, size of sand
grains, wealth among people, and, as we
have just seen, popularity of books.