I do some hiring from time to time in my job, and I wanted to share a simple two-part question I ask data analyst candidates. I've found that it distinguishes people who have studied their trade from people who have actually practiced it.

1. Why would we want to use a median instead of a mean?  
-> Almost everyone has a reasonable answer to this. Most say that medians are robust against outliers. I'll also accept that medians tell you how a "typical" data point behaves.
2. Then why don't we just always use a median instead of a mean?  
-> I've seen people *really* struggle with this one. Even if they understand the math, it's hard to answer this question if you haven't tried to use a median when you shouldn't. The main answer I'm looking for is that the mean is an unbiased (or linear) operator, and then an example of when that matters.
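
Here's a minimal NumPy sketch of the kind of consequence I have in mind: size-weighted per-group means recombine exactly into the overall mean, but medians don't compose that way.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.exponential(scale=100, size=1_000)  # skewed data, e.g. order values

# Means are linear, so they compose: a size-weighted average of
# per-group means recovers the overall mean exactly.
groups = np.array_split(x, 4)
recombined = sum(len(g) * g.mean() for g in groups) / len(x)
print(np.isclose(recombined, x.mean()))  # True

# Medians don't compose: the median of per-group medians is
# generally not the overall median.
mom = np.median([np.median(g) for g in groups])
print(np.isclose(mom, np.median(x)))  # usually False
```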

Hope this is helpful!

---

I find these questions vague.

Whether you want to use the mean or the median depends on the problem.

1. If you want mathematical justifications, you can also say that you should use the median if you want to minimize the MAE and the mean if you want to minimize the MSE (a quick numerical check is sketched after this list).

2. I would also be puzzled by this question. Maybe you could rephrase it? Just saying this as someone with a PhD in statistics who works as a technical and methodological lead in data science.
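
To make point 1 concrete, here is a minimal sketch (assuming NumPy and SciPy are available) checking that the constant minimizing the MSE is the mean and the constant minimizing the MAE is the median:

```python
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(1)
x = rng.lognormal(sigma=1.0, size=10_001)  # a skewed sample

# Best constant prediction under squared loss vs. absolute loss.
c_mse = minimize_scalar(lambda c: np.mean((x - c) ** 2)).x
c_mae = minimize_scalar(lambda c: np.mean(np.abs(x - c))).x

print(c_mse, x.mean())      # the MSE minimizer matches the sample mean
print(c_mae, np.median(x))  # the MAE minimizer matches the sample median
```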

---

I prefer the harmonic mean, personally.

---

> Then why don't we just always use a median instead of a mean?

> The main answer I'm looking for is that the mean is an unbiased (or linear) operator

Are these good reasons? Obviously (I guess) the sample mean is an unbiased estimator of the population mean, and any operator fulfilling certain conditions is linear...

Surely a major reason to use the mean is that in situations where a KPI (especially for skewed distributions) involves summing a sample from the distribution (e.g. total sales or total costs of an item), the mean will reflect that much better, because the calculation duplicates the one for the KPI.
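
A minimal sketch of that point, with hypothetical skewed "sales" numbers:

```python
import numpy as np

rng = np.random.default_rng(2)
sales = rng.lognormal(mean=3, sigma=1.2, size=5_000)  # hypothetical skewed order values

n = len(sales)
print(sales.sum())           # the KPI: total sales
print(sales.mean() * n)      # identical, by definition of the mean
print(np.median(sales) * n)  # badly understates the total for right-skewed data
```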

---

The reason is that if you have a lot of data, computing the mean is super cheap: it's a single streaming pass with O(1) state. Computing an exact median is much more expensive because you have to keep every single data point around to sort or select over, unless you settle for an approximate sketch.
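
A minimal sketch of the streaming version (the exact-median comparison assumes the whole array fits in memory):

```python
import numpy as np

def streaming_mean(stream):
    """One pass, O(1) state: just a count and a running total."""
    count, total = 0, 0.0
    for value in stream:
        count += 1
        total += value
    return total / count

data = np.random.default_rng(3).normal(size=1_000_000)
print(streaming_mean(iter(data)))  # never needs more than two numbers of state
print(np.median(data))             # needs all 1,000,000 values materialized
```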

---

The irony is that the vast majority of data science folks use the median far too often. I've seen many cases where a model uses the median instead of the mean "to eliminate outliers". But if the distribution isn't normal and the outliers are real data points rather than errors, eliminating them can be disastrous for model accuracy.


This is a dumb example, but say you're building a model for a lottery on how much to pay out. Every ticket you sell nets you $1, and you check the median of the payout per ticket (you wouldn't want to use the mean and capture outliers like jackpot winners, would you?). You get $0. Sweet, you're making 100% profit no matter how much you pay out to the winner.


A more realistic example: say you're a landlord building a model for how much rent to charge. If you look at median costs and revenues, you'll vastly understate how much you need to charge. While the vast majority of tenants will pay on time and not wreck the place, the occasional tenant you have to evict, with 5 to 6 figures in repairs on top, will bankrupt you if you aren't charging more in the median cases to make up for those outliers.
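
A minimal simulation of the rent example (all the dollar figures are hypothetical, just to show the shape of the problem):

```python
import numpy as np

rng = np.random.default_rng(4)
n = 100_000  # simulated tenancies

# Hypothetical numbers: ~$2k/yr in routine costs, plus a 2% chance
# of an eviction-plus-repairs disaster around $50k.
costs = rng.normal(2_000, 300, size=n)
costs += (rng.random(n) < 0.02) * rng.normal(50_000, 10_000, size=n)

print(np.median(costs))  # ~$2k: what the typical tenancy costs
print(costs.mean())      # ~$3k: what the rent actually has to cover
```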

---

Very interesting that all of the responses to this post cite various mathematical differences between the properties of means and medians, and basically no one has mentioned that it really depends on the business problem or process you're analyzing, and on whether skewness and outliers matter in the context of the question you're actually trying to answer.

---

In large samples from a normal distribution, the median has a larger standard error than the mean (larger by a factor approaching sqrt(pi/2), about 1.25). It's also not unimaginable to have a situation where a measure that is influenced by outliers is preferable.
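
A quick simulation of that first point:

```python
import numpy as np

rng = np.random.default_rng(5)
samples = rng.normal(size=(10_000, 1_000))  # 10,000 samples of n = 1,000

se_mean = samples.mean(axis=1).std()
se_median = np.median(samples, axis=1).std()
print(se_median / se_mean)  # ~1.25, i.e. sqrt(pi/2) for large n
```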

Good question!

---

Wouldn't the mean reflect skew in the data, while the median wouldn't?

---

I think I would change the first question from "Why" to "When".

More often than not, we use the mean and not the median where I work because the median doesn't make sense for the work that we're doing.

It was the same when I was in school: most of the applied stats and operations research courses I took treated the median as an accessory to the mean, which was seen as more valuable and leverageable.

---

There's no context for these questions. In which context are we using the mean or the median? For descriptive statistics? For calculating predictions? For modeling?

In #2, are you thinking of the law of large numbers? Or are you thinking of why linear regression is a regression to the mean rather than to the median?

I think these questions would be better with added context: "Suppose you are doing X; why would you want to use the median instead of the mean?" or "Thinking about X, why is it more common to use the mean rather than the median if the median is ...?"
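
On the regression point in #2: "regression to the median" does exist as quantile regression. A minimal sketch, assuming statsmodels is installed:

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.regression.quantile_regression import QuantReg

rng = np.random.default_rng(6)
x = rng.uniform(0, 10, size=500)
y = 2 * x + rng.lognormal(sigma=1.0, size=500)  # right-skewed noise
X = sm.add_constant(x)

ols = sm.OLS(y, X).fit()         # models the conditional mean
lad = QuantReg(y, X).fit(q=0.5)  # models the conditional median
print(ols.params)  # intercept pulled up by the skewed noise
print(lad.params)
```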
