Help with understanding Wasserstein distance

The typical intuition is:

-    imagine your two probability distributions as two piles of sand with the same amount of sand
-    now think of one pile as being the pile before moving it, and the other one as the pile after moving it
-    ask yourself in what way you can move the one pile to the other, such that you have to do the least amount of work, that is, every time you carry sand you want to move it the shortest possible distance, especially if it’s a lot of sand you’re carrying.
-    if the cost of carrying a unit of sand from x to y is the distance d(x, y) raised to the power p, then the p-th root of the smallest possible total cost is the p-Wasserstein distance

To summarize, the Wasserstein distance tells you how lazy you can be when moving stuff around while still getting it done. For a more thorough and less handwavy look at it, I recommend “Computational Optimal Transport” by Peyré and Cuturi, and “Optimal Transport: Old and New” by Villani.
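For reference, here is the formal definition that the sand picture corresponds to, written in the usual textbook notation. The symbols are not from the posts above: μ and ν are the two distributions, d is the ground distance, and Γ(μ, ν) is the set of all joint distributions (transport plans) whose marginals are μ and ν.

$$
W_p(\mu, \nu) = \left( \inf_{\gamma \in \Gamma(\mu, \nu)} \int d(x, y)^p \, \mathrm{d}\gamma(x, y) \right)^{1/p}
$$

Each plan γ says how much sand goes from x to y, the integral is the total work for that plan, and the infimum picks the laziest plan.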
Maybe limit yourself to a discrete example before thinking about the general definition. Imagine I have a line of concrete blocks on the ground. How much effort would I have to exert to lift a block and lay it on top of another? If I'm laying it on the block next to it, not very much, since I don't have to walk far, or maybe don't have to walk at all: I just lift the block and put it down. Compare that to laying the block on top of another one further down the line. I will have to walk over to where I want to place it, and that requires more effort. In this way, the measurement being taken is the effort I have to expend to move a block, where the further I want to move it, the more effort I need to exert.
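The block example can be sketched in a few lines of Python. This is only an illustration under some assumptions: the blocks sit at equally spaced positions on a line, effort is "number of blocks times distance carried", and the function name `earth_mover_1d` is made up for this post. On a line, the least total effort is the sum over every gap between neighbouring positions of how many blocks have to cross that gap.

```python
from itertools import accumulate


def earth_mover_1d(before, after, spacing=1.0):
    """Least total effort (blocks x distance) to turn `before` into `after`.

    `before` and `after` are lists of block counts at equally spaced
    positions on a line; `spacing` is the distance between neighbours.
    """
    if len(before) != len(after):
        raise ValueError("both layouts must cover the same positions")
    if sum(before) != sum(after):
        raise ValueError("both layouts must contain the same number of blocks")
    # Running surplus after position i = number of blocks that must cross
    # the gap between position i and position i + 1 (sign = direction).
    surplus = accumulate(b - a for b, a in zip(before, after))
    return spacing * sum(abs(s) for s in surplus)


# Move one block onto its neighbour versus onto the far end of the line.
near = earth_mover_1d([1, 0, 0, 0], [0, 1, 0, 0])
far = earth_mover_1d([1, 0, 0, 0], [0, 0, 0, 1])
print(near, far)  # 1.0 3.0
```

Dividing by the total number of blocks turns the counts into probabilities and the result into the 1-Wasserstein distance between the two layouts.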
For probability and statistics, the Wasserstein metric defines a distance between probability distributions. You are really looking at _distances between random variables_ (more precisely, between their distributions) instead of distances between points in a Euclidean space.

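You basically apply p-norm ideas from analysis to the distribution functions of random variables (in one dimension, the p-Wasserstein distance is literally the L^p norm of the difference between the two quantile functions), and rely on the nice properties of distribution functions to ensure that your geometric interpretation of distance jibes with reality.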

In practice, to compute actual distances between things, you do end up needing to do a bunch of integrals, as you have noticed.
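One case where the integrals collapse to something easy: two samples of the same size on the real line. Sorting both samples pairs up their empirical quantiles, and the distance is just a p-norm of the paired differences. A minimal sketch, assuming equal-size one-dimensional samples (the name `wasserstein_1d` is made up here, not a library function):

```python
def wasserstein_1d(xs, ys, p=1):
    """p-Wasserstein distance between two equal-size 1-D samples.

    Each sample is treated as an empirical distribution with equal
    weight on every point.
    """
    if not xs or len(xs) != len(ys):
        raise ValueError("samples must be non-empty and of equal size")
    # Sorting matches the i-th smallest x with the i-th smallest y,
    # which is the optimal pairing on the real line for p >= 1.
    paired = zip(sorted(xs), sorted(ys))
    mean_cost = sum(abs(x - y) ** p for x, y in paired) / len(xs)
    return mean_cost ** (1 / p)


# The second sample is the first one shifted by 5 (and shuffled).
w1 = wasserstein_1d([0, 1, 3], [8, 5, 6], p=1)
w2 = wasserstein_1d([0, 1, 3], [8, 5, 6], p=2)
print(w1, w2)
```

Shifting a distribution by a constant moves every grain of sand by that constant, so both distances come out as the size of the shift.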

This is sort of related to how we can use _covariance_ as an inner product on zero-mean random variables, get a distance out of it, and then use that for ordinary least squares stuff.
Thank you, everyone. The replies gave me a better intuition for the Wasserstein metric.
In addition to the geometric explanations here, I also like the fact that, if you use the squared L2 distance as the cost function, then the optimal transport “map” is the joint distribution (with the two given marginals) that maximizes the correlation.
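The reason is the expansion (x - y)^2 = x^2 + y^2 - 2xy: the marginals are fixed, so the x^2 and y^2 terms cost the same under every plan, and minimizing the squared cost is the same as maximizing the expected product, hence the covariance and the correlation. Here is a small brute-force check on a toy example. The numbers and helper names are made up for illustration, and it only tries one-to-one pairings of three points, not general plans.

```python
from itertools import permutations

xs = [0.0, 1.0, 4.0]
ys = [2.0, 7.0, 3.0]


def squared_cost(pairing):
    """Total squared-distance transport cost of a list of (x, y) pairs."""
    return sum((x - y) ** 2 for x, y in pairing)


def cross_term(pairing):
    """Sum of x * y over the pairs; larger means more correlated."""
    return sum(x * y for x, y in pairing)


# Every way of sending each x to a distinct y.
pairings = [list(zip(xs, perm)) for perm in permutations(ys)]

cheapest = min(pairings, key=squared_cost)
most_correlated = max(pairings, key=cross_term)

print(cheapest)
print(cheapest == most_correlated)
```

In this example the cheapest pairing and the most correlated pairing coincide, and both match small x with small y and large x with large y.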
