OK, I think I have an answer to my question about the particular shape of the Stardust Top 100 score vs. ranking curve, and it's more interesting than I'd ever expected, because a fascinatingly wide range of phenomena turn out to exhibit the same distribution, for reasons not always apparent.
Pier was absolutely right that it seems to be a variety of what is sometimes called a ‘shifted’ power-law relationship with the formula y=a(x-b)^c, though I’ve found it more often expressed in the form y=a/((b+x)^c), known as a Zipf-Mandelbrot (ZM) discrete probability distribution (the formulae are mathematically equivalent as long as constant a is positive and b & c are negative in the first, while all are positive in the second).
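To make the equivalence of the two forms concrete, here is a minimal sketch in Python. The constants are purely illustrative (they are not fitted to any data), and the function names are my own, not from any library.

```python
import math

def shifted_power_law(x, a, b, c):
    """First form: y = a * (x - b)**c, used with b and c negative."""
    return a * (x - b) ** c

def zipf_mandelbrot(x, a, b, c):
    """Second form: y = a / (b + x)**c, used with all constants positive."""
    return a / (b + x) ** c

# Illustrative constants only (assumed for this example).
a, b, c = 1000.0, 2.5, 1.1

for rank in (1, 10, 100):
    y1 = shifted_power_law(rank, a, -b, -c)  # flip the signs of b and c
    y2 = zipf_mandelbrot(rank, a, b, c)
    print(rank, y1, y2, math.isclose(y1, y2, rel_tol=1e-9))
```

Flipping the signs of b and c is all it takes to move between the two versions; a is the same in both.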
This is a generalisation of Zipf's law, initially proposed in the 1930s to model word usage in English texts, frequency being almost inversely proportional to rank, i.e. y=a/(x^c) with exponent c ~= 1.0 (or -1.0 in the first formula). It is closely related to the Pareto distribution (initially regarding wealth distribution in Italy), aka the “80-20 rule” or “Law of the vital few” (a widely-used management and quality-control rule-of-thumb, e.g. 80% of income comes from 20% of clients, or 80% of faults from 20% of potential causes, etc.).
If the data are truly ‘Zipfian’, a log-log plot should be entirely linear with slope ~= -1.0. Here are some recent examples regarding (A) word frequency in a variety of English texts, (B) website popularity in Russia, (C) note pitch distribution in Bach’s ‘Air on the G string’(!), and for comparison (D) the SD@H full list of Phase 1 scores. In each case the law indeed seems to apply remarkably well throughout most of the range, except for the ‘droopy’ tops and tails:
In order to generalise the formula, Mandelbrot added another (arbitrary) constant b to the divisor in Zipf’s original (i.e. giving y=a/((b+x)^c)), which assumes more significance the lower the rank number becomes, thus modelling the top-end droop. The authors of graphs A & B suggest further (but different) modifications to account for the tails also consistently evident in their data.
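A quick way to see why b produces the top-end droop: for y=a/((b+x)^c), the slope on a log-log plot works out to d(ln y)/d(ln x) = -c*x/(b+x). The sketch below just evaluates that expression; the values of b and c are hypothetical, chosen only to show the effect.

```python
def zm_loglog_slope(x, b, c):
    """Local log-log slope of y = a/(b + x)**c, i.e. -c*x/(b + x).

    The constant a drops out: it only shifts the curve up or down.
    """
    return -c * x / (b + x)

# Hypothetical constants, for illustration only.
b, c = 5.0, 1.0

for rank in (1, 10, 100, 1000):
    print(rank, round(zm_loglog_slope(rank, b, c), 3))
```

Near the top of the ranking (x small compared with b) the slope is much shallower than -c, and it only settles towards -c once x is large compared with b. With b=0 the slope is -c everywhere, which is plain Zipf.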
Although figure D shows that the top 100 SD@H scores alone fall largely within the top-end drooping region, the log-log plot of all Phase 1 scores (534 dusters) does show an overall slope of almost exactly -1.0 (using the first version of the formula in CurveExpert 1.3 as Pier suggested, with a=1075444.1, b=-6.8524834, c=-0.86919775).
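For anyone who wants to play with the fitted curve, here is a small sketch that plugs the constants quoted above into the first form of the formula. I am assuming x is the rank and y the score, as in the score vs. ranking plot; the function name is my own.

```python
# Constants quoted above from the CurveExpert fit (first form: y = a*(x - b)**c).
A = 1075444.1
B = -6.8524834
C = -0.86919775

def fitted_score(rank):
    """Score given by the fitted curve at a given rank (assumes rank >= 1)."""
    return A * (rank - B) ** C

for rank in (1, 10, 100, 534):
    print(f"rank {rank:>3}: fitted score ~ {fitted_score(rank):,.0f}")
```

Comparing these values against the actual scores at the same ranks is a simple sanity check on how well the fit holds at the top and the tail.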
What this may mean at any deeper level I’m not sure, but I have always felt certain there had to be something to it, and suggest SD@H can now be added to the growing range of ZM phenomena, including not only the examples above but also global manufacturing, earthquake time intervals, the magneto-rheological properties of ferrofluids (whatever those may be), and many many more...