Sorting items based on users rating: a more "fair" approach
September 7th, 2007
Some websites allow their visitors to rank the articles or pictures they view, by voting a certain score, for example 1 for "bad" and up to 5 for "excellent". Then someone can request to view the top-rated items, with the ones with the highest score coming first.
The simpler way to implement this should be to have two fields for each item: number_of_votes and total_score. When a user votes for an item with a given score, you increase number_of_votes by one, and you add the selected score to total_score. Then you could just sort the items by the average scoring (total_score divided by number_of_votes) to get the top-rated ones. In terms of SQL that would be something like:
This seems pretty much right, but it may lead into unwanted results. For example, assume two items A and B. Let's say item A has been voted 10 times, with a total score of 40. And item B has been voted only once given a score of 5. If we sort them with the method above, item B will come before item A because item B's average (5) is greater than item A's one (4).
Someone could argue about the validity of these results, claiming that item A with average score 4 should come first because this 4 was formed by a much larger number of opinions than a single one who believed that item B deserves a 5. Also, what would have happened if a third item, let's call it "C", was never voted yet? The aforementioned query would bring this item last, because a division by zero would return NULL. Or an implementer could even decide not to display items with zero votes at all. But is item C really worse than items A and B to deserve to be placed in the last position (or no position at all) in the results? Actually, we don't know.
A workaround to this problem would be to set a minimum number of votes required to consider an item's rating as valid. Until this minimum number of votes is reached, the item will be rated with the mean of the two extremes of the scores range (in our case that mean would be (5+1)/2=3). If we (arbitrarily) choose the minimum number of votes required to be 5, the query would be rewritten as follows (in the MySQL dialect):
A more precise approach would be to pad fake votes that give the item a score of the mean value, until the minimum number of votes is reached. In that case the query could be written as follows:
I did some tests and these seem to work pretty neat. However, I would really like to hear other people's opinions on this.
The simpler way to implement this should be to have two fields for each item: number_of_votes and total_score. When a user votes for an item with a given score, you increase number_of_votes by one, and you add the selected score to total_score. Then you could just sort the items by the average scoring (total_score divided by number_of_votes) to get the top-rated ones. In terms of SQL that would be something like:
SELECT * FROM Items
ORDER BY total_score / number_of_votes DESCThis seems pretty much right, but it may lead into unwanted results. For example, assume two items A and B. Let's say item A has been voted 10 times, with a total score of 40. And item B has been voted only once given a score of 5. If we sort them with the method above, item B will come before item A because item B's average (5) is greater than item A's one (4).
Someone could argue about the validity of these results, claiming that item A with average score 4 should come first because this 4 was formed by a much larger number of opinions than a single one who believed that item B deserves a 5. Also, what would have happened if a third item, let's call it "C", was never voted yet? The aforementioned query would bring this item last, because a division by zero would return NULL. Or an implementer could even decide not to display items with zero votes at all. But is item C really worse than items A and B to deserve to be placed in the last position (or no position at all) in the results? Actually, we don't know.
A workaround to this problem would be to set a minimum number of votes required to consider an item's rating as valid. Until this minimum number of votes is reached, the item will be rated with the mean of the two extremes of the scores range (in our case that mean would be (5+1)/2=3). If we (arbitrarily) choose the minimum number of votes required to be 5, the query would be rewritten as follows (in the MySQL dialect):
SELECT * FROM Items
ORDER BY IF( number_of_votes >= 5,
total_score / number_of_votes, 3 ) DESCA more precise approach would be to pad fake votes that give the item a score of the mean value, until the minimum number of votes is reached. In that case the query could be written as follows:
SELECT * FROM Items
ORDER BY IF( number_of_votes >= 5,
total_score / number_of_votes,
(total_score + 3*(5-number_of_votes) ) / 5 ) DESCI did some tests and these seem to work pretty neat. However, I would really like to hear other people's opinions on this.