Statistical Significance in Playtesting

The Problem

Why care about statistical significance in playtesting? Maybe you are wondering whether your deck is any good. Maybe you want to know whether it consistently beats another deck or type of deck. Or maybe you want to improve your deckbuilding process. For example, can your deck consistently beat the specific Telvanni Conscription deck your friend is always playing?

I found an interesting article by Frank Karsten, a Pro Tour Hall of Famer in Magic: The Gathering, about statistical significance in playtesting. If you are interested, the full article can be found here: Magic Math: How Many Games do You Need for Statistical Significance in Playtesting?

Statistical Significance – The Experiment

Frank runs a thought experiment in which two players bring two different decks to the “table” and play 10 games, each sticking with the same deck throughout. After 10 games, one deck has beaten the other 7-3. What can we derive from this result? Is the deck with 7 wins really better? Does it have a 70% win rate?

Of course not, because the sample size is way, way, way too small. Frank walks us through the details of the calculation, but in a nutshell: after a 7-3 result, we can be at least 95% confident that the true game win probability lies between 34.8% and 93.3%. So the win rate could be 34.8%, 93.3%, or anything in between? So it could be 70%? Yes, but does that tell us anything? Not really, because the interval is far too wide at about 58.5 percentage points. It even includes 50%, so we cannot rule out that the two decks are evenly matched. 10 games are just not enough. We would need many more test games between those two decks.
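For readers who want to check the numbers themselves, here is a minimal sketch in Python. It assumes the interval quoted above is the exact (Clopper-Pearson) binomial interval, which reproduces the 34.8% and 93.3% figures. The function names are my own and are not taken from Frank's article.

```python
from math import comb


def tail_at_least(k, n, p):
    """P(X >= k) when X counts wins in n games with win probability p."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


def solve_increasing(f, target, iterations=100):
    """Bisection on [0, 1]: find p with f(p) = target for an increasing f."""
    lo, hi = 0.0, 1.0
    for _ in range(iterations):
        mid = (lo + hi) / 2
        if f(mid) < target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def exact_interval(wins, games, confidence=0.95):
    """Exact (Clopper-Pearson) confidence interval for the true win rate."""
    alpha = 1 - confidence
    # Lower bound: the win rate at which seeing this many wins or more
    # has probability alpha / 2.
    lower = 0.0 if wins == 0 else solve_increasing(
        lambda p: tail_at_least(wins, games, p), alpha / 2)
    # Upper bound: the win rate at which seeing this many wins or fewer
    # has probability alpha / 2, i.e. P(X >= wins + 1) = 1 - alpha / 2.
    upper = 1.0 if wins == games else solve_increasing(
        lambda p: tail_at_least(wins + 1, games, p), 1 - alpha / 2)
    return lower, upper


lower, upper = exact_interval(7, 10)
print(f"{lower:.1%} to {upper:.1%}")  # 34.8% to 93.3%
```

You can plug in your own playtest record, for example exact_interval(70, 100), to see how the interval narrows as the number of games grows.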

Statistical Significance – The Result

But how many? And here is the “Bad News of Statistical Significance” for all of us:

If you want a 95% confidence interval for the true game win probability no wider than L units, you should play approximately 4/(L^2) games.

This means that if you want to claim with 95% confidence that the win probability lies within a spread no wider than 20 percentage points (L = 0.2, or +/- 10%), you would need to play about 100 games between those two (!) decks. If you wanted to get this spread down to 4 percentage points (L = 0.04, or +/- 2%), you would already need 2,500 games. To put that into perspective: if you could test 6 games per hour and play 14 hours a day, that is 84 games a day, so you would need about 30 days, an entire month, to finish that many games. Saturdays and Sundays included.
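One common way to arrive at a rule like 4/(L^2), assuming the usual normal approximation: the 95% interval reaches about 1.96 * sqrt(p * (1 - p) / n) to either side of the observed win rate. That is largest at p = 0.5, where the full width is roughly 2 / sqrt(n). Setting this equal to L gives n of about 4/(L^2). The short Python sketch below simply redoes the arithmetic from the paragraph above. The function name and the constant are my own, for illustration only.

```python
def games_needed(width):
    """Approximate games for a 95% interval no wider than `width`.

    `width` is a fraction, so 0.20 means 20 percentage points.
    Uses the rule of thumb n = 4 / width^2 quoted above.
    """
    return round(4 / width**2)


# Assumed playtesting pace from the text: 6 games per hour, 14 hours a day.
GAMES_PER_DAY = 6 * 14

for width in (0.20, 0.04):
    games = games_needed(width)
    days = games / GAMES_PER_DAY
    print(f"width {width:.0%}: {games} games, about {days:.1f} days")
# width 20%: 100 games, about 1.2 days
# width 4%: 2500 games, about 29.8 days
```

Because the width shrinks only with the square root of the number of games, halving the spread always costs four times as many games.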

Statistical Significance – Moral of the Story

So be careful when you read that a certain deck has a win rate of 85%. Against what? Against the same deck? Against a random sample of decks? How many samples were taken? How many different decks were played? What was the opponent’s “skill level”? Most often, players just do not have the statistical evidence to back their claims.

Also, be careful when players tell you that a certain card is “broken” or that a decklist or deck archetype is “superior” or “part of Tier X, Y or Z”. Such a claim might rest on personal bias or on too small a set of observations. Of course, these are interesting and valuable opinions, but we should all keep them in appropriate perspective. Most importantly, authors should at least state the number of experiments they ran to support their claim, and the context of their analysis. Some sources already do this, but I do not see it being done consistently, particularly not on Reddit (there may be exceptions, of course).

What can Bethesda and Sparkypants do?

The only ones able to share such information with us are Sparkypants or Bethesda, who have the actual large number of results for the different match-ups across all players in the game. It would be great if Sparkypants and/or Bethesda explained both the objective and the subjective arguments behind each future nerf. This would make it easier for the community to understand and make sense of certain decisions.
