Statwing's approach to statistical testing

# Effect Size (T-Test)

With enough data, even a very small difference between two groups can be statistically significant, while the effect size would show that it’s not practically relevant.

## Short Definition

A t-test’s effect size indicates whether the difference between two groups’ averages is large enough to have practical meaning, regardless of whether it is statistically significant.

## Example

Let’s say you’re interested in whether the average New Yorker spends more than the average Californian per month on movies, and whether the average New Yorker likes movies more than the average Californian.

You ask a sample of 300,000 people from each state how much they spend per month on movies, and you also ask them how much they like movies on a 1 to 7 scale, where 1 = “Not at all” and 7 = “Very much”. You find that the averages for New Yorkers are \$18 and 4.8 and the averages for Californians are \$18.03 and 5.3, and both differences between New Yorkers and Californians are statistically significant.

Since the monthly spend is in units that have concrete meaning (dollars), it’s easy to look at the difference of \$0.03 and conclude that while it is statistically significant, it’s not very meaningful because \$0.03 is not that much money. On the other hand, it’s difficult to understand whether the difference between 4.8 and 5.3 is meaningful, since units on a 1 to 7 scale don’t have concrete meaning in daily life.
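To see how a \$0.03 difference can be statistically significant yet tiny, here is a rough Python sketch. The example above doesn’t give the spread of the spending data, so the \$5 standard deviation is purely an assumption, as is the function name `two_sample_summary`. With 300,000 people per group, a normal approximation stands in for the t distribution.

```python
import math
from statistics import NormalDist


def two_sample_summary(mean_a, mean_b, sd, n):
    """Return (z, two-sided p-value, Cohen's d) for two equal-size groups
    that are assumed to share the same standard deviation."""
    diff = mean_b - mean_a
    # Standard error of the difference between two independent group means.
    standard_error = sd * math.sqrt(2 / n)
    z = diff / standard_error
    # Two-sided p-value from the standard normal distribution
    # (a close stand-in for the t distribution at this sample size).
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    # Effect size: the difference expressed in standard deviations.
    d = diff / sd
    return z, p_value, d


# ASSUMPTION: sd=5.0 is a made-up value, not a figure from the example.
z, p, d = two_sample_summary(18.00, 18.03, sd=5.0, n=300_000)
print(f"z = {z:.2f}, p = {p:.3f}, d = {d:.3f}")
```

Under that assumed spread, the p-value falls below the usual 0.05 threshold while the effect size stays far below even the “small” range, which is the pattern the example describes.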

The goal of effect size is to make it easier to understand whether a difference like that between 4.8 and 5.3 is actually meaningful or not. A low effect size like 0.1 would tell you that there really isn’t much difference between the two, and New Yorkers and Californians are similar in how much they like movies. A high effect size like 1.2 would tell you that there is a meaningful difference there, and Californians like movies noticeably more than New Yorkers.

## Long Definition

The t-test’s effect size is, along with the t-test’s statistical significance, one of the two primary outputs of the test.

Its goal is to give a concrete sense of whether a difference between two groups is meaningfully large, independent of whether the difference is statistically significant.

Statwing uses the most common type of effect size for the t-test, Cohen’s d.
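For reference, Cohen’s d is commonly computed as the difference between the two group averages divided by the pooled standard deviation. The sketch below follows that textbook definition; it is not necessarily the exact calculation Statwing performs, and the function name `cohens_d` and the ratings are hypothetical.

```python
import math
from statistics import mean, variance


def cohens_d(group_a, group_b):
    """Cohen's d using the pooled sample standard deviation."""
    n_a, n_b = len(group_a), len(group_b)
    # Weight each group's sample variance by its degrees of freedom.
    pooled_variance = (
        (n_a - 1) * variance(group_a) + (n_b - 1) * variance(group_b)
    ) / (n_a + n_b - 2)
    return (mean(group_b) - mean(group_a)) / math.sqrt(pooled_variance)


# Hypothetical 1-to-7 ratings, invented for illustration only.
new_york = [4, 5, 5, 6, 4, 5]
california = [5, 6, 5, 7, 5, 6]
print(round(cohens_d(new_york, california), 2))
```

A positive result here means the second group’s average is higher; swapping the arguments flips the sign but not the magnitude.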

There is no strict cutoff that separates a “small” effect size from a “medium” one, but the following are generally accepted guides for how to think about effect size (from Cohen, who developed Cohen’s d):

- 0.2: “Small”, a “hardly visible” effect like the difference between the heights of 15- and 16-year-old females.
- 0.5: “Medium”, an “observable” or “noticeable” effect, as between the heights of 14- and 18-year-old females.
- 0.8: “Large”, a “plainly evident” effect, as between the heights of 13- and 18-year-old females.

The effect size will be larger if:

1. The absolute difference between the averages is larger, like \$5 instead of \$0.03 in the example above, or
2. Responses are consistently close to the average values and not widely spread out (the standard deviation is low). If every New Yorker and Californian liked movies to the exact same degree as every other resident of that state, even relatively small absolute differences would be very noticeable.
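Both drivers can be seen in a small sketch that treats the effect size as the difference in averages divided by a standard deviation shared by both groups. The standard deviations used here are assumptions chosen for illustration, and `effect_size` is a hypothetical helper name.

```python
def effect_size(mean_difference, standard_deviation):
    """Difference between group averages, in units of standard deviations."""
    return mean_difference / standard_deviation


# (1) Same spread (assumed $5), larger gap between the averages.
small_gap = effect_size(0.03, 5.0)
large_gap = effect_size(5.00, 5.0)

# (2) Same gap (4.8 vs. 5.3 on the 1-to-7 scale), tighter spread.
# ASSUMPTION: the standard deviations 1.5 and 0.5 are invented.
wide_spread = effect_size(0.5, 1.5)
tight_spread = effect_size(0.5, 0.5)

print(small_gap, large_gap)
print(round(wide_spread, 2), tight_spread)
```

In the second pair, the 0.5-point gap is identical, yet the effect size moves from roughly a third of a standard deviation to a full standard deviation once responses cluster more tightly around each state’s average.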