
Comments on "Phish.Net Show Ratings, Part 4: Can Rater Weights Improve the Accuracy of Show Ratings?", posted by phishnet

For anyone who is interested, here is a link to the RYM Explanation of Rater Weights

The discussion thread can be found at This Link.
Score: 2
This is all fabulous. Thanks for doing this. Some scattered thoughts:

1. Dropping raters with exceptionally high deviation scores will mechanically increase R² and decrease RMSE, so I would be careful to use those metrics only when comparing weighting schemes that treat high-deviators the same.

2. Using the F stat is a very clever metric. I'm sure you're aware it's imperfect for these purposes, but still it's a really nice way to frame the issue, IMHO. Bravo!

3. I would be curious to see a graph of avg rating by year, comparing the different weighting schemes. Which years win? Which lose? (Ideally extend back to 1.0)

4. You've probably thought of this, but a middle ground option between doing nothing and doing full-on real-time adjusted weights would be to generate the weights at regular intervals, e.g. monthly. This may be easier coding-wise, although of course it would require some human labor with each update. (The idea being that any new accounts created in between weighting updates would get an initial weight of zero or close to zero.)
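A minimal sketch of what point 4 could look like in practice, assuming weights live in a simple lookup refreshed on a schedule; the function and constant names here are hypothetical, not phish.net's actual code:

```python
# Hypothetical sketch of periodically refreshed rater weights. Accounts
# created since the last refresh are absent from the weights table and
# therefore fall back to a near-zero default, as suggested above.

DEFAULT_NEW_ACCOUNT_WEIGHT = 0.05  # assumed value for illustration

def weighted_show_rating(ratings, weights):
    """ratings: {user_id: stars}; weights: {user_id: weight} from the last
    (e.g., monthly) refresh. Returns the weighted average, or None if the
    total weight is zero."""
    total_weight = sum(weights.get(uid, DEFAULT_NEW_ACCOUNT_WEIGHT)
                       for uid in ratings)
    if total_weight == 0:
        return None
    weighted_sum = sum(weights.get(uid, DEFAULT_NEW_ACCOUNT_WEIGHT) * stars
                       for uid, stars in ratings.items())
    return weighted_sum / total_weight
```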
Score: 5
@lysergic said:
This is all fabulous. Thanks for doing this. Some scattered thoughts:

1. Dropping raters with exceptionally high deviation scores will mechanically increase R² and decrease RMSE, so I would be careful to use those metrics only when comparing weighting schemes that treat high-deviators the same.
I've been using a combination of deviation, entropy, and # of shows rated, which seemed most closely aligned with the RYM system. Entropy is the metric that drives most of the weights, followed by deviation and then # of shows. Typically, anomalous raters are flagged by more than one metric; for example, about half of all people identified by deviation also have low (zero or near-zero) entropy. But your point is well-taken.
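For readers unfamiliar with these metrics, here is a rough sketch of how deviation, entropy, and rating volume might combine into a single rater weight. The combining formula and constants are invented for illustration only; paulj's actual formula is not published here.

```python
import math
from collections import Counter

def rater_weight(user_ratings, show_means):
    """user_ratings: {show_id: stars}; show_means: {show_id: community mean}.
    Illustrative only: entropy dominates, then deviation, then # of shows,
    mirroring the ordering described above."""
    n = len(user_ratings)
    if n == 0:
        return 0.0
    # Entropy of the rater's own rating distribution: 0 for someone who
    # only ever gives a single value (e.g., all 5s).
    counts = Counter(user_ratings.values())
    entropy = -sum((c / n) * math.log2(c / n) for c in counts.values())
    entropy_norm = entropy / math.log2(5)  # max entropy on a 5-point scale
    # Mean absolute deviation from each show's community mean.
    deviation = sum(abs(stars - show_means[sid])
                    for sid, stars in user_ratings.items()) / n
    # Volume term saturates after 100 rated shows.
    volume = min(n, 100) / 100
    return entropy_norm * (1.0 / (1.0 + deviation)) * (0.5 + 0.5 * volume)
```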

3. I would be curious to see a graph of avg rating by year, comparing the different weighting schemes. Which years win? Which lose? (Ideally extend back to 1.0)
I haven't done this for every year, but here are a couple of things that aim in that direction. For all of the weighted averages (two of which are depicted below), the distribution remains left-skewed but is squished from the top. There's a bit more mass in each tail, and the overall mean rating (the mean for all shows) actually increases. This graph includes shows from all eras.

[Image: distribution of show ratings under weighted vs. unweighted schemes, all eras]

As for the year-by-year thing, I've done it for selected years, one from each era (Modern, 2.0, and 2 x 1.0). There was no real rhyme or reason as to which years were selected; mostly they seemed like interesting years, and I had the Phish Studies Conference deadline bearing down on me.

[Image: year-by-year comparison of weighted and unweighted average ratings for selected years]

4. You've probably thought of this, but a middle ground option between doing nothing and doing full-on real-time adjusted weights would be to generate the weights at regular intervals, e.g. monthly. (...any new accounts created in between weighting updates would get an initial weight of zero or close to zero.)


This is exactly how RYM handles its weights.
Score: 2
I'll be honest: I still don't like the 5-star rating system, even though I understand the arguments for it.
I just think that since we are seriously considering a change to ratings, we should also consider letting users have more than 5 options to differentiate shows.
Having to rate a good Phish show a 2 is not something that appeals to many fans, I'm sure. With the 5-star rating system, how often are users really giving shows in the 20th-40th percentile a 2? I would guess almost never.

I'm not an expert, I am just going by how I feel about things. I hope those who have all the data make the correct decisions about how to proceed. The ideas presented in this article seem like a step in the right direction.
Score: 3
@Zeron said:
...since we are seriously considering a change to ratings, we should also consider letting users have more than 5 options to differentiate shows.
This was brought up a couple of days ago in the Discussion Thread. You are not alone in this belief, and I'll certainly raise that as an issue in my memo to the Admins.
Score: 4
Thanks for the reply, @paulj

Your Figure 9 looks promising. That's a more plausible distribution of show ratings, rather than the bimodal distribution under the current rating system.

One other idea I just had, which would address the point that @zeron and others have made: we see users mostly rating shows 4 or 5 rather than using the full range. Personally, I think this is fine. However, if we want to "spread out" the resulting scores, the site could display each show's percentile ranking in addition to (or as an alternative to) its avg star score.

So, just making numbers up, a given show might say: "Show rating: 3.914, 42nd percentile among all Phish shows"

That way we would have an absolute measure as well as a relative measure of each show's quality.
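A percentile display like this is cheap to compute. A minimal sketch, assuming we already have the list of every show's average rating (the strictly-below convention used here is one of several ways to define a percentile rank):

```python
def show_percentile(show_rating, all_show_averages):
    """Percentile rank of one show's average among all shows' averages."""
    below = sum(1 for r in all_show_averages if r < show_rating)
    return round(100 * below / len(all_show_averages))

# Using the made-up numbers above: if 42% of shows average below 3.914,
# show_percentile(3.914, all_show_averages) returns 42, displayed as
# "Show rating: 3.914, 42nd percentile among all Phish shows".
```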
Score: 4

A comment by HotPale was voted down.

"My previous research ( https://phish.net/blog/1539388704/setlists-and-show-ratings.html ) demonstrated that show ratings (with equal rater weights) are significantly correlated with setlist elements such as the amount of jamming in a show, the number and type of segues, the relative rarity of the setlist, narrative songs, and other factors."

If we believe we know the majority of setlist elements that typically correlate with a rating, would it make sense to list those elements for users to rate individually when they go to rate a show? A formula could then crunch those numbers together to generate an "overall rating." Personally, I think I would stop and think twice about each category and probably be less impulsive with my rating after attending a show. Yes, it could drive some people away from rating due to the "survey-ness" of the approach, but at least you're able to understand and measure the impact of a rating based on those elements.
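As a sketch of the idea, the crunching formula could be a simple weighted sum of the per-element sub-ratings. The element list and weights below are purely illustrative; they are not derived from the regression results in the linked post:

```python
# Hypothetical element weights; in practice these would be estimated
# from data. They sum to 1.0 so the composite stays on the 1-5 scale.
ELEMENT_WEIGHTS = {
    "jamming": 0.35,
    "segues": 0.20,
    "setlist_rarity": 0.20,
    "song_performance": 0.25,
}

def overall_rating(element_scores):
    """element_scores: {element: 1-5 sub-rating from one user}.
    Returns a 1-5 composite rating."""
    return sum(ELEMENT_WEIGHTS[e] * score
               for e, score in element_scores.items())
```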
Score: 0
None of this explains why Saturday of the fest is still the 3rd best show of all time ????
Score: 2
Why not show the rating and the weighted rating next to each other on setlist pages? For top-rated shows, allow a filter or toggle so users can custom-sort. Both data sets give interesting results and would have value as distinct views.

I am happy to help if you like the idea and need a volunteer. I have 20+ years of experience in eCommerce and data analysis.
Score: 2
@jdonovan said:
"My previous research ( https://phish.net/blog/1539388704/setlists-and-show-ratings.html ) demonstrated that show ratings (with equal rater weights) are significantly correlated with setlist elements such as the amount of jamming in a show, the number and type of segues, the relative rarity of the setlist, narrative songs, and other factors."

If we believe we know the majority of setlist elements that typically correlate with a rating, would it make sense to list those elements for users to rate individually when they go to rate a show? A formula could then crunch those numbers together to generate an "overall rating." Personally, I think I would stop and think twice about each category and probably be less impulsive with my rating after attending a show. Yes, it could drive some people away from rating due to the "survey-ness" of the approach, but at least you're able to understand and measure the impact of a rating based on those elements.
The findings in question were derived from 3.0 shows only, from 3/6/09 through 12/31/17. We don't know whether the same elements would correlate as reliably with higher ratings for shows from 1983 to 2004. We also don't know whether those elements have continued to exert the same effects on ratings from 2018 to present.

Similarly, we don't know whether a mathematical sum of ratings applied to each of the identified elements would be reliably representative of a given fan's global impression of a given show.

Ultimately it might be impossible to deviate from a simple average of user ratings, where each rater gives a single rating to a given show, without imposing distortive biases and introducing new layers of systematic inaccuracy.
Score: 0
@TwiceBitten said:
None of this explains why Saturday of the fest is still the 3rd best show of all time ????
Good shows sometimes take months to come back to reality. Unless you are suggesting that they delay ratings until days or weeks after a show, there's not really a way to combat it.
Score: 1
Thank you, @Pauli, for this. Super interesting (even though I don't understand a significant percentage of it). The one factor that I didn't see discussed in these posts is the timing of ratings. It seems that a lot of the abuse of the system has happened all of a sudden - where a show is sitting at say 4.3, and then in the space of an hour, 50 one-star ratings come in that drop it to a 3.6. And from what I recall, they come in from 50 different accounts. Is there any way to take that into consideration as well as the general rating patterns of a user you've described in the previous posts?

On another note, I like @Lysergic's idea above - keep the current 5 point scale but add percentile. That would keep everyone happy!
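The burst pattern described above (dozens of one-star ratings from distinct accounts within an hour) is straightforward to flag if rating timestamps are stored. A minimal sliding-window sketch, with illustrative thresholds:

```python
from datetime import timedelta

def flag_rating_bursts(events, window=timedelta(hours=1),
                       min_accounts=25, max_stars=1):
    """events: list of (timestamp, user_id, stars) for one show, sorted by
    timestamp. Flags any window in which at least `min_accounts` distinct
    accounts submit ratings of `max_stars` or below. Thresholds are
    illustrative, not tuned."""
    low = [(t, u) for t, u, s in events if s <= max_stars]
    flagged = []
    start = 0
    for end in range(len(low)):
        # Shrink the window from the left until it spans <= one hour.
        while low[end][0] - low[start][0] > window:
            start += 1
        accounts = {u for _, u in low[start:end + 1]}
        if len(accounts) >= min_accounts:
            flagged.append((low[start][0], low[end][0], len(accounts)))
    return flagged
```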
Score: 1
@TwiceBitten said:
None of this explains why Saturday of the fest is still the 3rd best show of all time ????
Whether it is that good or not, there's an entire segment of the fan base that's actually so angry that the band could possibly play a show of such high quality today that they're resorting to burning down a silly rating system. It's awesome that the boys can still bring so much heat 40 years in that we have this kind of passion about it.

I think everything @paulj wrote is incredibly interesting. I also worry that by codifying the ratings system, we're baking in a lot of rules about how you can properly enjoy Phish. The ratings are an incredibly helpful way to find great shows to listen to; taking them more seriously than that kind of discounts the fact that people enjoy Phish for a ton of different reasons. >5% of reviews are potentially from fluffers - just clean out the bots, maybe purge the most egregious review scores, and call it a day?
Score: 1
The following statement presupposes that these "show elements" merit a show being rated more highly, which is a highly debatable proposition:

"If the deviation and entropy measures have successfully identified the raters who contribute the most information and least bias, then weighted show ratings should be even more correlated with show elements relative to the current system. Technically, this analytical approach is a test of convergent validity."

Who is to say that a poorly performed but major "bust out" always merits a higher rating, for example? The Baker's Dozen Izabella was a huge bust out but pretty poorly performed. The first Tela I ever saw made me depressed because of how unlike the 90s it was (subsequent ones were much better).

It seems to me the entire endeavor to ascertain a "reliable" rating system described in this series of posts itself presupposes that (1) there is an objective, verifiable way to measure the quality of a show and (2) the rating system should be used for that purpose instead of to reflect the aggregate of the assessments of the people who have heard or experienced the show.

I would say both of those assumptions are wrong, and the whole effort is misguided in that sense.

To me, a useful rating system should tell us (1) what the people who attended this particular show thought (i.e., the fun factor) and (2) what the people who regularly listen to and attend Phish concerts think about the quality of this show (i.e., replay quality). (If someone went to a show and got puked on and had a bad time, it's not "wrong" for him to rate it poorly.)

Distinguishing ratings on those two dimensions would help a lot.

A third may be to distinguish quality of jamming from quality of composed performances. For me there are shows where the jamming is better and others where the compositions are played perfectly, and they don�t always coincide.

Allowing a 10-point rating scale would also help, for the reasons mentioned above: most Phish shows are amazing, and no one wants to rate one a 3/5. A 6 or 7 out of 10 seems more generous even when it technically is not. In my view, most Phish shows should land in the 6-10 range.

How to verify that the ratings submitted by users reflect genuine beliefs is a separate issue, and I applaud the thinking toward developing some sort of control. Weighting ratings could be one way, although, like others here, I find it rubs me the wrong way.
Score: 2
@cbaratta said:
To me, a useful rating system should tell us (1) what the people who attended this particular show thought (i.e., the fun factor) and (2) what the people who regularly listen to and attend Phish concerts think about the quality of this show (i.e., replay quality).

Distinguishing ratings on those two dimensions would help a lot.
An easily calculable split, based on the existing user-attendance dataset and the current ratings, that would be highly informative and require a few lines of code and maybe a table join? Pump it straight into my veins!!
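For the curious, the join in question really is small. A sketch in pandas, assuming hypothetical `ratings` and `attendance` tables (phish.net's actual schema is not public here):

```python
import pandas as pd

def split_by_attendance(ratings: pd.DataFrame,
                        attendance: pd.DataFrame) -> pd.DataFrame:
    """ratings: columns [show_id, user_id, stars];
    attendance: columns [show_id, user_id].
    Returns one row per show with attendee vs. non-attendee averages."""
    merged = ratings.merge(attendance.assign(attended=True),
                           on=["show_id", "user_id"], how="left")
    merged["attended"] = merged["attended"].fillna(False).astype(bool)
    return (merged.groupby(["show_id", "attended"])["stars"]
                  .mean()
                  .unstack()
                  .rename(columns={True: "attendee_avg",
                                   False: "listener_avg"}))
```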
Score: 2
The rating data is always going to be messy and imprecise, and no alternative system will turn it into a scientific result. The modified RYM approach seems interesting and would be a nice comparison point alongside the pure average.

I'm personally most interested in this data just to understand the human behavior behind it. And the solutions I most like are things that prevent people from gaming the system (i.e. fake/dupe accounts).

One thing that would really nicely augment the ratings system would be a curation system that lets people build and collaborate show lists based on criteria defined by phish.net members (i.e. Top 1.0/2.0/3.0, top SoaMelts, etc). Kind of like we do in the threads, but in a more permanent structure. N00b100's long posts could take on a permanent life of their own.
Score: 0
Okay, folks. A lot of comments have arrived in the time I've been offline. I'm scurrying today to finish household stuff before tomorrow's departure for Dick's; I'll try to get some responses ready before leaving, but that depends on how today goes. And once I get to CO, I can't guarantee any response until next week!

Regardless of when I respond, ALL comments from the blog posts and the discussion thread will be summarized for the .Net Overlords. My goal is to have that ready by mid- to late September, and I'll make a PDF copy available to any netter who requests it.
Score: 2
@TwiceBitten said:
None of this explains why Saturday of the fest is still the 3rd best show of all time ????
Not anymore! It took a big drop after this article came out. Coincidence??
Score: 1
@HotPale said:
Just keep IT the same please. Nobody's rating should outweigh anyone else's rating. That is not cool. I think y'all are thinking a bit too hard on this. Keep IT simple! You know, Phish's anthem...no need to overcomplicate any of this. Obviously, raters should only use one account to make ratings, but giving some people more rating power is simply unethical. That's like the super delegates at the DNC, not that they matter anymore since the Democrats have completely subverted the will of the people in this current "primary." Just let everyone have the same opportunity to vote and vote once.

What can I say? People don't like IT when you tell it like it is. Sorry for the truth bomb. There was nothing wrong with the system, but apparently it needs more thought put into it than anything else in life. Power! All about the power!
Score: 0
This analysis is very interesting and helpful. I love it. Thanks for donating your time and expertise Paul.
Score: 0
@TwiceBitten said:
None of this explains why Saturday of the fest is still the 3rd best show of all time ????
Recency bias. The shows then settle into the overall scope of Phish.
Score: 1
Can someone who is so dead set on determining the exact right methodology to rank shows explain to me why it matters if night 3 of Mondegreen is 5th or 25th or 50th? It's a show everyone should listen to regardless. And it's a good reminder why new Phish fans should still be going to every show they can.

Who are the people who are like, "Oh thank god that rating was adjusted, I really felt like it was wrong"?
Score: 0
@BigJibbooty said:
Thank you, @Pauli, for this. Super interesting (even though I don't understand a significant percentage of it). The one factor that I didn't see discussed in these posts is the timing of ratings. It seems that a lot of the abuse of the system has happened all of a sudden - where a show is sitting at say 4.3, and then in the space of an hour, 50 one-star ratings come in that drop it to a 3.6. And from what I recall, they come in from 50 different accounts. Is there any way to take that into consideration as well as the general rating patterns of a user you've described in the previous posts?

On another note, I like @Lysergic's idea above - keep the current 5 point scale but add percentile. That would keep everyone happy!
Agree with this perspective. If you have the rating timestamps, I'd love to see a time series for a bunch of shows, like 7-14-19 or even the recent Dick's run, per this post.

The most valuable thing that could come out of this work is finding the users who have a bunch of bots pounding the ratings.

https://forum.phish.net/forum/show/1379933076#page=1
Score: 1
@all_things_reconsidered said:
Can someone who is so dead set on determining the exact right methodology to rank shows explain to me why it matters if night 3 of Mondegreen is 5th or 25th or 50th? It's a show everyone should listen to regardless. And it's a good reminder why new Phish fans should still be going to every show they can.

Who are the people who are like, "Oh thank god that rating was adjusted, I really felt like it was wrong"?

Right on. Can't upvote you, but I tried. I agree with your analysis. Much more in tune with the band than the small slice of the fan base that worries about such things as ratings.
Score: 0
@paulj said:
@Zeron said:
...since we are seriously considering a change to ratings, then we should also consider letting users have more than 5 options to differentiate shows.
This was brought up a couple of days ago in the Discussion Thread. You are not alone in this belief, and I'll certainly raise that as an issue in my memo to the Admins.
Thank you @paulj and all for a very interesting, nerdy, and generally considerate discussion here. I'll echo a few points brought up by others and provide an example for reference.

RATING SCALE ~ Rotten Tomatoes uses a 5-star rating system that includes half-stars, effectively making a 10-point scale and allowing for more nuanced ratings. When selecting a rating, RT also provides a broad definition of what each point "means," e.g., a 2.5-star rating is suggested to be "Not Bad, but Not My Favorite" and a 3.5-star rating is suggested to mean "Worth a Watch."

These "meanings" may be problematic in data collection (I'm not sure, would love to hear thoughts) and would almost certainly be contentious in the context of Phish show ratings, but for RT I suspect they are at least broad enough to allow for some basic guidance without overly constricting reviewers' ratings. For RT, they also point toward the "value" of ratings being primarily intended to help potential viewers decide whether or not to watch a given movie.

In this Phish context, beyond community consensus on the all-time "best" shows, the "value" of ratings could be construed as helping fans find quality shows to listen to. Those two things, "best" and "quality listening," probably correlate in general, but they aren't always the same (e.g., highly theatrical shows whose entertainment value might not be fully represented "on tape").

Regardless of "defining" the scale, I do think a 10-point scale would be better for Phish.net.

RATING TIMING ~ Rotten Tomatoes does not allow audience ratings until after the theatrical release of a film. There may be an additional lag, I'm not sure (and streaming releases complicate things). For Phish.net, I'd suggest a 24-hour delay, from end of show time, before ratings open. You know, sleep on it.

SHOW TRACKING ~ Somewhere over the years, probably in the forums, I've gotten the sense that some fans use ratings to track the shows they've attended (or even just those they've listened to, for the completists out there), which could explain why all ratings from a given user are 5-star (or any other single value). This possible usage pattern may also inform the prospect of implementing a weighted scoring system.

This may simply be convenience, because the "I was there!" button is buried in the "Attendance" tab. Making personal attendance tracking more accessible (perhaps elevating it alongside the ratings input) could ultimately improve the ratings data. Attendance could also be better nuanced as "Engagement" and could include "listened to," "livestreamed," etc, thus expanding individual fans' personal tracking and stats, while also giving more overall site data.

If a weighting system is implemented, I like @jr31105's suggestion of making both weighted and unweighted scores available to users. Cheers!
Score: 0
@all_things_reconsidered said:
Can someone who is so dead set on determining the exact right methodology to rank shows explain to me why it matters if night 3 of Mondegreen is 5th or 25th or 50th?
Not sure if this is aimed at me, but I think so.

The sampling and response biases inherent in Phish.Net ratings are such that, short of a full-blown, well-designed survey of all Phish fans, we will never get the methodology "exactly right". What I'm trying to do is identify the low-hanging fruit: are there biases inherent in the ratings data that we can fix relatively easily?

Obviously, I think we can correct for some of the biases we all know are in the data. My approach has been to review the extensive academic literature to see what others have done when they have encountered the same issues as .Net. I'm not proposing anything new here; I am simply advocating that we do what others have done.

As for the ranking of shows, estimating show ratings to the third decimal is highly likely to be false precision, even if a weighting system is adopted! Weights will reduce the overall bias in the data, but not eliminate it. So it is NOT ME who is arguing whether Mondegreen N3 is #5 or #20 or whatever.

Since May of 2010 I've seen countless complaints about biased ratings, and all I'm saying is, "Hey, we can fix at least some of that shit."
Score: 0