Phase 6 stats and Skill score 'jitter' simulation

jsmaje
Posts: 616
Joined: Tue Aug 15, 2006 8:39 am
Location: Manchester UK

Phase 6 stats and Skill score 'jitter' simulation

Post by jsmaje »

Due to an injury at the beginning of August it hasn’t been comfortable for me to dust effectively, but I’ve used the time to revive the old ‘Top 100’ statistics site that some dusters once found of interest, even though barely more than 20 have been active lately. I’ve also written a simulation of the skill score ‘jitter’ issue that has been causing some frustration, which allows testing of certain means to ameliorate it.

As before, the stats site provides interactive graphs of power & skill score alone and in combination, weekly updates (each Wednesday with luck, except for an unavoidable gap of 6 weeks) of overall and personal duster performance, general activity, and the previous phase results. Keep an eye out for the occasional ISP in the background!

The sim demonstrates the known issue with the moving average window as currently implemented, in which all PM scores in the window contribute equally to the final skill score calculation*. This results in unexpected and unwelcome drops in average skill score when the latest PM track is correctly identified at the same time as the oldest PM in the window (also correctly identified, but of higher value) is excluded. That the same number of rises over time occurs in the opposite circumstances generally goes unremarked, however!

In order to minimise the number and/or mean size of such unexpected ‘events’ the contributions from older PM scores that will eventually drop out of the window would need to be downgraded in some manner, i.e. carry less weight in the latest skill score calculation. To exclude only the oldest PM score(s) would simply be equivalent to narrowing the window.

Many so-called ‘weighting filter’ types have been devised for specific purposes in other applications such as financial market trending, image processing, etc. (see here & endless other web references). Filter-folk refer to the present equally-weighted average as having a (1) ‘simple’ filter (effectively none!), but the sim allows testing of three others that satisfy the above requirement: (2) ‘weighted’ (perhaps better thought of as ‘linear’), (3) ‘exponential’, and its inverse (4) ‘logarithmic’; their respective profiles are plotted in fig 3 of the following post.
Symmetric filters such as ‘gaussian’ & ‘moving-median’ provide superior data-smoothing, but would incur the penalty of a delayed response due to downgrading the most recent as well as oldest scores, perfect for other situations but likely to be even more frustrating for the present purpose.

Each sim run delivers 1000 PMs valued randomly from 5 - 85 in 5-point steps of difficulty, and emulates the skill of a theoretical average duster, with the chance of a correct identification linearly biased from 90% for the easiest (5) downward to 10% for the most difficult (85), plus a superimposed random element of ±10%. No improvement in skill, or memory/record of track coordinates, is taken into account. Skill score therefore approximates 0.5 throughout.
Demo mode reflects the present situation with a 100-PM-wide window and equally-weighted PM score contributions.
Custom mode allows for changes to window width (10 – 500 PMs), weighting filter type and filter %-width (10 – 100%).
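For anyone wanting to experiment outside the browser, a run of the kind described above can be sketched in a few lines of Python. This is not the sim's actual source, only a minimal reconstruction from the description. The names `p_correct`, `skill_score` and `run_sim` are made up for illustration, the ±10% element is assumed to be uniform, the skill formula is taken from the footnote at the end of this post, and an 'unexpected event' is assumed to mean a drop after a correct PM or a rise after an incorrect one.

```python
import random
from collections import deque


def p_correct(value, rng):
    """Chance of a correct answer: 90% at value 5 down to 10% at value 85,
    with an assumed uniform random element of +/-10%."""
    base = 0.9 - 0.8 * (value - 5) / 80
    return min(1.0, max(0.0, base + rng.uniform(-0.10, 0.10)))


def skill_score(window):
    """Equally-weighted ('simple') skill score over (value, correct) pairs."""
    earned = sum(v for v, ok in window if ok)
    adjusted = sum(v if ok else 90 - v for v, ok in window)
    return earned / adjusted


def run_sim(n_pms=1000, window_width=100, seed=0):
    """Return the running skill score plus counts of unexpected drops/rises."""
    rng = random.Random(seed)
    window = deque(maxlen=window_width)  # oldest PM falls out automatically
    history, drops, rises, prev = [], 0, 0, None
    for _ in range(n_pms):
        value = rng.randrange(5, 90, 5)  # 5, 10, ..., 85
        correct = rng.random() < p_correct(value, rng)
        window.append((value, correct))
        score = skill_score(window)
        if prev is not None:
            if correct and score < prev:
                drops += 1  # drop despite a correct PM
            elif not correct and score > prev:
                rises += 1  # rise despite a missed PM
        history.append(score)
        prev = score
    return history, drops, rises


history, drops, rises = run_sim()
print(f"final skill {history[-1]:.4f}, unexpected drops {drops}, rises {rises}")
```

Changing `window_width` here corresponds to the custom-mode window setting; the weighting filters are not included in this sketch.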

My own results and conclusions using the sim are detailed in the following post.

Both programs have been tested using Windows 7 on a 16:9 aspect monitor, and the latest browser versions of Microsoft IE (9 - 11), Mozilla Firefox (22 - 26) & Google Chrome (31) (in which the window flickers annoyingly for some reason).
My apologies for any problems if you use a different OS or other browser/version, and for any unfound bugs.

“Stardust@Home Top 100 – phase 6” is available here
“SD@H Skill Score ‘Jitter’ Simulation” is available here

John

* Present skill score calculation:
skill score = (total of PM track values correctly identified) / (total of 'adjusted' values), where the 'adjusted' value of a PM = its value (5 - 85) if correct, or its inverse value (90 minus value, i.e. 85 - 5) if incorrect.

NB: the ‘jitter’ phenomenon has nothing to do with the skill score formulation as such, merely the use of a ‘moving average window’. This was of course adopted because of other frustrating issues arising from the previous ‘cumulative average’ employed in phase 5.
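To make the footnote and the NB concrete, here is a small worked sketch in Python. It assumes a toy window of just 4 PMs rather than the real 100, with made-up PM values, and `skill_score` is simply a name chosen here for the calculation given above.

```python
from collections import deque


def skill_score(window):
    """window holds (value, correct) pairs, value being 5 - 85."""
    earned = sum(value for value, correct in window if correct)
    adjusted = sum(value if correct else 90 - value for value, correct in window)
    return earned / adjusted


# Toy 4-PM window, oldest first (illustrative values only).
window = deque([(80, True), (20, False), (40, True), (60, True)], maxlen=4)
before = skill_score(window)  # 180 / 250

# A correctly identified 10-point PM arrives and pushes out the
# correctly identified 80-point PM at the old end of the window.
window.append((10, True))
after = skill_score(window)   # 110 / 180

print(f"{before:.4f} -> {after:.4f}")  # 0.7200 -> 0.6111
```

The score falls even though the new track was found, purely because a more valuable correct PM has left the window.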
Last edited by jsmaje on Tue Nov 28, 2017 10:38 am, edited 5 times in total.
jsmaje
Posts: 616
Joined: Tue Aug 15, 2006 8:39 am
Location: Manchester UK

Re: 'Jitter' simulation

Post by jsmaje »

Please read the preceding post and preferably try the ‘jitter’ simulation first in order to make best sense of what follows.

Sim results:
(1) For illustration, fig 1 shows a cropped screen shot from the sim in demo mode of a single random 1000-PM run with the present 100-window and ‘simple’, i.e. equally-weighted filter. PM circles are coloured green for correctly-identified tracks, red for incorrect; notice the general vertical gradation vs PM value. Skill score is shown by the blue plot as the window moved.

Image

The summary panel shows 112 unexpected ‘events’ to have occurred during this particular run, approx. 11%, and though rises & drops are nearly equal, the 5.5% of drops would naturally have been unwelcome.

(2) Window width can be changed in custom mode (10, 25, 50, 100, 250 or 500 PMs), and fig 2 is a plot of the effects on the number and mean size of events when using each width for 10 different runs, again using the simple filter. The bars show the range of results and circles the mean:

Image
Note that the linear reduction of event number vs window width (in orange) is simply an artefact arising from the fixed run length, since wider windows necessarily encounter fewer new PMs before reaching the end. Once this effect is corrected for, the number of events (in red) in fact remains stable at around 120 (12%) regardless of window width.
However, the same is not true of the mean size of events (in blue), which is independent of the overall number and declines in exponential manner with increasing window width, due to the increased data-smoothing effect.

(3) The formulae for the weighting filters provided are as follows, expressed as % of PM ‘adjusted’ value at each position p in the window, where w = filtered width.

Simple: 100 %
Weighted: 100 × (p/w) %
Exponential: 100 ^ (p/w) %
Logarithmic: rather than using math logs as such, this has been implemented by arithmetically inverting the exponential formula, i.e. 100 – 100 ^ ((w – p)/w) %
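As a rough sketch of how these formulae translate into per-position weights, the Python below builds the weight list for a whole window. Two things are assumptions rather than anything stated above: position p is counted from 1 at the oldest PM up to w, and every PM beyond the filtered width keeps its full 100% weight. The function names are invented for illustration.

```python
def filter_weight(kind, p, w):
    """Weight as a fraction (1.0 = 100%) at position p of filtered width w."""
    if kind == "simple":
        return 1.0
    if kind == "weighted":       # linear
        return p / w
    if kind == "exponential":
        return 100 ** (p / w) / 100
    if kind == "logarithmic":    # arithmetic inverse of the exponential
        return (100 - 100 ** ((w - p) / w)) / 100
    raise ValueError(f"unknown filter type: {kind}")


def window_weights(kind, window_width=100, width_pct=25):
    """Weights for every window position, oldest PM first."""
    w = round(window_width * width_pct / 100)
    return [
        filter_weight(kind, p, w) if p <= w else 1.0
        for p in range(1, window_width + 1)
    ]


for kind in ("simple", "weighted", "exponential", "logarithmic"):
    weights = window_weights(kind)
    print(kind, [round(x, 3) for x in weights[:3]], "...", round(weights[24], 3))
```

Note that, as written, the logarithmic formula tops out at 99% at p = w, so there is a small step up to 100% at the edge of the filtered region.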

The weighting factors calculated using each formula are plotted below, using the example of a 100-window and 25%-width filter. In subsequent figures the type of filter used is indicated by the simplified profile above each plot.

Image

Fig 4 below shows the results from just a single run, but using each window width with each non-simple filter type and filtered-width, all corrected for the finite run length. While varying in detail, a general reduction in event number vs window width is evident. In addition, wider filtered widths are seen to have a consistently greater effect.

Image

Meanwhile, fig 5 shows the same single run’s results regarding mean size of events, and although declining in the same exponential manner vs increasing window width as in fig 2, it shows that wider filters can have the undesirable effect of increasing mean size.

Image

Conclusions:
(1) Regarding window width (fig 2), although the current 100-window isn’t perfect, anything narrower would incur significantly increased mean event size. And anything much wider would mean a longer time for a duster to get into the skill rankings initially, or back into them, which is likely to be even more frustrating; there are only 31 at present as it is.

(2) Given the trade-off between number and size of events shown in figs 4 & 5, filtered width of the window appears optimal at around 25 – 50%. Personally I’d choose 25% or thereabouts, in order to retain as many unfiltered PM scores as possible, so as to remain responsive to the latest skill change without overly compromising filter performance.

(3) In any case, the use of any non-simple filter type would considerably reduce the number of events. Examining figs 4 & 5 suggests that the (linear) weighted filter has advantages. Exponential and logarithmic filters tend to increase number & size of events to some degree, probably since the first effectively narrows PM contributions to the window, while the second is closer to the simple filter profile.

(4) Adopting points (1) & (2), fig 6 shows the mean & range of event number & size for 10 separate runs using a 100-window with each filter type of 25% width, emphasising the extent to which the linearly-weighted version could reduce event number by about 75% (30 vs 118 per 1000 PMs) and mean event size by about 30% (0.0040 vs 0.0057):

Image

(5) My vote would therefore be to continue using a 100-window, but to apply a (linear) ‘weighted’ filter of 25% width to the oldest PMs within the window.
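A minimal sketch of what that proposal might look like in code follows. It assumes the weight multiplies both a PM's correct value and its 'adjusted' value, and that positions run from 1 at the oldest PM; neither detail is confirmed above. The 100-PM window is a constructed toy case (alternating hits and misses on 40-point PMs behind one old 80-point hit), not sim output, and the function names are hypothetical.

```python
def linear_weights(window_width=100, width_pct=25):
    """Linear ramp over the oldest width_pct of the window, 100% elsewhere."""
    w = round(window_width * width_pct / 100)
    return [min(p / w, 1.0) for p in range(1, window_width + 1)]


def weighted_skill(window, weights):
    """Skill score with each PM's contribution scaled by its position weight."""
    earned = sum(wt * v for (v, ok), wt in zip(window, weights) if ok)
    adjusted = sum(wt * (v if ok else 90 - v) for (v, ok), wt in zip(window, weights))
    return earned / adjusted


# Constructed window, oldest first: one 80-point hit, then 99 PMs worth 40.
others = [(40, i % 2 == 0) for i in range(99)]
before = [(80, True)] + others
after = others + [(10, True)]  # a new correct PM pushes out the old 80

simple = [1.0] * 100
linear = linear_weights()

simple_change = weighted_skill(after, simple) - weighted_skill(before, simple)
linear_change = weighted_skill(after, linear) - weighted_skill(before, linear)
print(f"simple filter:   {simple_change:+.4f}")
print(f"weighted filter: {linear_change:+.4f}")
```

In this particular case the equally-weighted score drops on a correct PM, while the weighted version moves far less, because the departing 80-pointer had already been faded down to 4% of its value.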

Though the sim’s premise of an ‘average’ duster who doesn’t improve in skill, nor has a good memory for or maintains a record of PM track coordinates, is most likely unrealistic, these findings might be worth consideration for future phases, and perhaps other distributed projects that might consider measuring ‘skill’ with regard to a limited number of calibrations in similar manner.

Please try the sim and post your own conclusions, including if you find any errors in or improvements to the filter formulae, or of course have any other comments & suggestions.

John
Last edited by jsmaje on Tue Nov 28, 2017 4:35 am, edited 2 times in total.
caprarom
Posts: 337
Joined: Thu Aug 02, 2007 7:12 am
Location: Riverview, MI

Re: Phase 6 stats and Skill score 'jitter' simulation

Post by caprarom »

Beautiful work, John. I hope the injury is well on the mend.
eagle
Posts: 14
Joined: Thu Apr 03, 2008 4:10 am
Location: Europe

Re: Phase 6 stats and Skill score 'jitter' simulation

Post by eagle »

It's fantastic work!
Well done John.
voyager1682002
Posts: 47
Joined: Fri Aug 04, 2006 11:27 pm
Location: Singapore

Re: Phase 6 stats and Skill score 'jitter' simulation

Post by voyager1682002 »

Thank you for sharing John. Hope you are getting much better now.
DanZ
Site Admin
Posts: 777
Joined: Fri Feb 27, 2009 2:44 pm
Location: Berkeley, CA

Re: Phase 6 stats and Skill score 'jitter' simulation

Post by DanZ »

Welcome back John! Glad to hear you are on the mend, though I can't imagine anything holding you back for too long :)

Great work here - very impressive! I'll try and get a team response soon.

All the best,

Dan
DanZ
Site Admin
Posts: 777
Joined: Fri Feb 27, 2009 2:44 pm
Location: Berkeley, CA

Re: Phase 6 stats and Skill score 'jitter' simulation

Post by DanZ »

John,

What's your theory on the 5 perfect skill scores and how they seem to be holding on to them?

Dan
caprarom
Posts: 337
Joined: Thu Aug 02, 2007 7:12 am
Location: Riverview, MI

Re: Phase 6 stats and Skill score 'jitter' simulation

Post by caprarom »

Just to interject regarding the "holding on" question posed to John. I'm currently at the 1.0000 skill score for the 12th time. That means I've lost (and recovered) it eleven times. So, maybe we're not so good at holding on as clawing our way back. It does largely depend on the distribution of those more nasty 80-point PMs, which seem to come and go with quite variable frequency.
jsmaje
Posts: 616
Joined: Tue Aug 15, 2006 8:39 am
Location: Manchester UK

Re: Phase 6 stats and Skill score 'jitter' simulation

Post by jsmaje »

Thanks Dan, and to all others who have responded.

So, what is my theory as to how some manage to attain and maintain perfect skill?
Well, Mike has pointed out that it isn’t necessarily maintained all the time, needing up to 100 consecutively correct PM track identifications after even a single miss; "up to" since there are those unexpected rises when a missed PM drops out of the window, and it will also depend heavily on any mathematical rounding elements in the team’s calculation to arrive at a final 4-decimal result. Can we see the actual programming lines concerned?

Obviously a high work rate and good memory will help, by quickly gaining familiarity with the limited number of PM tracks themselves and reducing the time to recover skill score.

So too, of course, would be compiling a database of approximate track coordinates from the ‘red bar’ missed-movies pages, after making inadvertent or even deliberate mistakes. Just how many unique PMs are there in fact?

I don’t know if anyone does this – you’ll need to ask them directly – and I’d hesitate to call it ‘cheating’, since it’s really no more than an ‘aide-memoire’ little different from having a ‘bonne-memoire’!
In any case, it’s the inevitable consequence of having a limited number of unchanging PMs, many of which are only resizings & re-orientations of a single example, as well as having decided to provide feed-back as to track location, itself a response to duster requests.
The only solution would seem to be an infinite series of randomly-generated PMs, though I doubt the practicality of this (on second thoughts, perhaps it’s a more achievable prospect to program than automatic track recognition; I may even give it a whirl!)

Another factor may be that some attained a perfect score fairly early on, and have dusted at a relatively low rate since.

If the team’s concern, if concern they have, is that this indicates a purely video-game approach to dusting, at least it helps to keep people engaged.
And perhaps the true measure of skill will only become apparent in terms of how well our ‘real’ movie track identifications eventually correspond with verified features of interest to the team, which also depends on their progress with extraction & analysis, feeling to me somewhat slow if not actually stalled at present.

John
Last edited by jsmaje on Sun Jan 19, 2014 11:09 am, edited 1 time in total.
caprarom
Posts: 337
Joined: Thu Aug 02, 2007 7:12 am
Location: Riverview, MI

Re: Phase 6 stats and Skill score 'jitter' simulation

Post by caprarom »

Random PM generation - another fine idea, John! Might be hard to pull off. Good luck if you decide to attempt it.

Missed three today (so far), but they were "good" misses, not flubs or questionable PMs, so I don't mind. Working on yet another comeback.
caprarom
Posts: 337
Joined: Thu Aug 02, 2007 7:12 am
Location: Riverview, MI

Re: Phase 6 stats and Skill score 'jitter' simulation

Post by caprarom »

After several false starts, I can attest that my latest return to the 1.0000 skill ranking was accomplished without the help of any "aide memoire." Of course, you'll have to take my word on that. It does seem unfair, however, to leap-frog over other dusters already at the 1.0000 level due to the current skill ranking protocol. I'd suggest that protocol be amended to simply rank those tied at the top by the length of their current string of correct PMs. So, someone who just made it to 100, would be at the bottom of the top tier, while someone with 742 in a row would rank ahead of someone with 721 in a row. If the team wanted to reduce the number of dusters at the 1.0000 ranking, they could simply expand the "window" from 100 to 200 or more, which would also smooth those "jitters" some.
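The tie-break suggested above could be sketched along these lines in Python. The `Duster` record, its `streak` field and the sample names and numbers are all hypothetical; nothing here reflects how the site actually stores or ranks scores.

```python
from typing import NamedTuple


class Duster(NamedTuple):
    name: str
    skill: float  # moving-window skill score
    streak: int   # current run of consecutive correct PMs (hypothetical field)


def rank(dusters):
    """Rank by 4-decimal skill score, breaking ties by the longer streak."""
    return sorted(dusters, key=lambda d: (-round(d.skill, 4), -d.streak))


board = [
    Duster("duster_a", 1.0, 721),
    Duster("duster_b", 0.9912, 350),
    Duster("duster_c", 1.0, 100),
    Duster("duster_d", 1.0, 742),
]
for place, d in enumerate(rank(board), start=1):
    print(place, d.name, f"{d.skill:.4f}", d.streak)
```

With these sample figures the 742-streak duster ranks first, the 721 second, and the one who has just reached 100 sits at the bottom of the top tier, as described.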
DanZ
Site Admin
Posts: 777
Joined: Fri Feb 27, 2009 2:44 pm
Location: Berkeley, CA

Re: Phase 6 stats and Skill score 'jitter' simulation

Post by DanZ »

Good to have John back, isn't it!?

I have Dr. Westphal looking over your simulations and recommendations and will get back to you with his report. Yours too caprarom.

Major papers set to be published, so everyone busy. But we'll keep you posted.
jsmaje wrote:Can we see the actual programming lines concerned?
Oh my! But sure, let me see.
jsmaje wrote:Just how many unique PMs are there in fact?
Good question - ~200? I'll have to ask!

Thanks for the insight on the perfect scores - saves me from having to fret too much.

More to come!

Dan
jsmaje
Posts: 616
Joined: Tue Aug 15, 2006 8:39 am
Location: Manchester UK

Re: Phase 6 stats and Skill score 'jitter' simulation

Post by jsmaje »

In response to the parallel discussion here, the skill score ‘jitter’ simulation has now been extended to allow choice of initial skill level between 0 & 1.0, rather than being set only at 0.5 for an ‘average’ duster, and should be back up & running correctly.

The result is that there are now well over 1000 possible custom-setting combinations, and that’s counting only single-decimal skill choices, though skill can actually be set to 4 decimals! The following results have therefore been limited to the previous best choice of a 100-PM-wide window & 25%-width linearly-‘weighted’ filter compared to the current equally-weighted so-called ‘simple’ filter (i.e. none!).

Results:

(1) Fig 1 shows the numbers (range & mean) of unexpected ‘events’ during 10 runs of 1000 PMs using the simple filter, with separate plots for ‘drops’ (red), ‘rises’ (green), and the sum of these (blue).
Fig 2 shows the same when using the weighted filter:
Image
Both show that unexpected drops & rises increase & decline in symmetrical manner vs skill. They are equal at skill level 0.5, which I’d at first mistakenly generalised to all other levels. As with ERSTRS’ experience, there is of course a marked relation to the proportions of correct & incorrect PMs depending on actual skill level.
Meanwhile, fig 1 confirms a 12% incidence (~120 / 1000) of combined unexpected drops & rises at skill 0.5, but rising to ~20% at low & high levels.
The slight fall-offs at initial skills 0 & 1.0 evident in fig 1, yet not fig 2, are due to the necessarily-reduced range of possible values susceptible to unexpected events that the unfiltered running skill score can attain (i.e. never going below 0 or above 1.0).

Importantly, it’s clear that use of a weighted filter considerably reduces the number of events throughout all the skill range, as originally found using just the 0.5 skill level. Of the mean 578 changes in running skill score that occurred per unfiltered 1000 PM run, 161 (27.9%) were ‘unexpected’. The respective figures with the filter were 595 & 44 (7.4%), representing an overall reduction by 73.5%.
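For anyone checking the arithmetic, the quoted reduction is in the proportion of score changes that were unexpected, not in the raw event count. A short Python check using only the figures above (the variable names are just labels):

```python
unfiltered_changes, unfiltered_events = 578, 161
filtered_changes, filtered_events = 595, 44

unfiltered_rate = unfiltered_events / unfiltered_changes
filtered_rate = filtered_events / filtered_changes
rate_reduction = 1 - filtered_rate / unfiltered_rate

print(f"unexpected, unfiltered: {unfiltered_rate:.1%}")  # 27.9%
print(f"unexpected, filtered:   {filtered_rate:.1%}")    # 7.4%
print(f"reduction in rate:      {rate_reduction:.1%}")   # 73.5%
```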

(2) Figs 3 & 4 show the total amounts of event change without & with the weighted filter, and figs 5 & 6 the amounts per event:
Image
Total skill change is again reduced by use of the filter, though the effect on change per event is significantly less. Both display a fall-off toward the extremes*. The unfiltered mean total change (absolute value of drops plus rises) was 3.15 of which 0.43 (13.6%) was unexpected, and the filtered results 2.37 & 0.125 (5.3%) respectively, an overall reduction by 61.0%. However, the reduction in per-event change was only 42.3%.

(3) Finally, to compare the sim with reality, I have used ERSTRS’ figures with permission, both from here and via personal communication. 749 PMs were viewed, with a partially-estimated 100 unexpected drops but just one rise, and skill 0.9139. But her skill level was lower when starting phase 6, so making a guess for her average skill at 0.8000, the sim would actually have predicted 117 drops & 8 rises, not a million miles off given all the approximations involved. And use of a 25%-width weighted filter would have reduced these to only 24 drops & 1.6 rises, each of lower magnitude.

Conclusions:

(1) Should it be decided to continue with a moving average window in the next phase (albeit possibly optional), I feel the original proposal to use a weighted filter at the start of the window remains valid.

(2) Though not analysed here, I’d be surprised if similar qualitative results weren’t obtained with different window widths, non-simple filter type and width.
If anyone else feels brave enough to make such a comprehensive comparison, I’d be interested in their findings.

(3) And those preferring to return to the previous cumulative average should first review the earlier complaints about even that particular method.

John

* Quite why the fall-offs are more prominent than the total number of events I’m unsure.
And that the peaks of changes per event (fig 5) are reversed compared to total amount (fig 3; i.e. the red plot peaks to the left of the green), yet the filtered are not, seems odd. I’ve checked this over & over again, both by further runs and alternative ways to program the calculations, but with the same result.
This is of no consequence regarding my conclusions, but can anyone provide an explanation, including possible errors in my programming? The code source can be examined via the browser View tab, but is unannotated.
Last edited by jsmaje on Tue Nov 28, 2017 10:56 am, edited 3 times in total.
caprarom
Posts: 337
Joined: Thu Aug 02, 2007 7:12 am
Location: Riverview, MI

Re: Phase 6 stats and Skill score 'jitter' simulation

Post by caprarom »

Well, I see from the link that I am on record as having no problem with the original sensitivity rating. That's still true. And although I personally prefer the cumulative scoring (which I now track on my own), I must admit that with the moving window approach it is "fun" after a missed PM to more quickly claw one's way back up toward whatever target level one aspires, jitters regardless.
jsmaje
Posts: 616
Joined: Tue Aug 15, 2006 8:39 am
Location: Manchester UK

Re: Phase 6 stats and Skill score 'jitter' simulation

Post by jsmaje »

Mike, a weighted filter has no relevance to the expected drops on missing a PM track which you understandably find 'fun' to claw back. Keeps the mind alive!

It’s merely a well-known statistical method to reduce the number and size of unexpected skill score changes when instead the opposite happens, e.g. a drop on finding a track, which so troubles some dusters such as jasonjason & ERSTRS (to the point of wanting to cease participation) and becomes more prominent as one's skill improves, as demonstrated by the sim.
That this phenomenon is a mathematical inevitability, merely a perfectly correct update in skill score over the last window-width number of PMs (currently 100 PMs) with no overall effect on long-term skill assessment, doesn't reduce the immediate frustrating effect (at least for drops if not rises!), nor does it provide enjoyment to those who perceive that this then requires an undeserved claw back.

And I suppose it’s ‘horses for courses’ regarding a return to the cumulative average. I personally have no overwhelming preference but do appreciate the attempt by the team to have addressed the previous complaints about that method being too slow and incomplete to respond to a duster’s current abilities, surely the point of any skill measure if intended to reflect personal improvement (or otherwise) in a timely & more complete manner.

John