The typical intuition is:
- imagine your two probability distributions as two piles of sand containing the same amount of sand
- now think of one pile as being the pile before moving it, and the other one as the pile after moving it
- ask yourself how you can move the first pile onto the second with the least total amount of work, where work is the amount of sand you carry times how far you carry it; that is, every time you carry sand you want to move it the shortest possible distance, especially if it’s a lot of sand you’re carrying.
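- if carrying one unit of sand over a distance d costs d^p (for some p ≥ 1), then the p-th root of the least possible total cost is the p-Wasserstein distance; for p = 1 this is also known as the earth mover’s distance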
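For reference, the same picture written as a formula. The symbols are my own notation, not something fixed by the intuition above: μ and ν are the two piles, d is the distance on the ground, and γ ranges over the set Γ(μ, ν) of all transport plans, meaning all ways of saying how much sand goes from x to y that use up exactly μ and build exactly ν.

$$
W_p(\mu, \nu) = \left( \inf_{\gamma \in \Gamma(\mu, \nu)} \int d(x, y)^p \, \mathrm{d}\gamma(x, y) \right)^{1/p}
$$

The integral is the total work of one particular plan, and the infimum picks the laziest plan.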
To summarize, the Wasserstein distance tells you how lazy you can be when moving stuff around while still getting it done. For a more thorough and less handwavy look at it I recommend “Computational Optimal Transport” by Peyré and Cuturi, and “Optimal Transport: Old and New” by Villani.
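If you want to play with the sand-pile picture, here is a minimal sketch for the easiest case: both piles live on a line and consist of the same number of equally heavy grains. In that setting (and for p ≥ 1) the laziest plan is known to be "match the i-th leftmost grain of one pile with the i-th leftmost grain of the other", so sorting is enough. The function name `wasserstein_1d` and the example piles are made up for illustration; in higher dimensions or with unequal weights you would need a proper optimal transport solver instead.

```python
def wasserstein_1d(xs, ys, p=1):
    """p-Wasserstein distance between two 1-D piles of equally heavy grains.

    Assumes both piles have the same number of grains and p >= 1.
    """
    if len(xs) != len(ys):
        raise ValueError("this sketch assumes the same number of grains in both piles")
    if not xs:
        raise ValueError("piles must not be empty")
    n = len(xs)
    # On a line, with cost |x - y|**p and p >= 1, the cheapest plan pairs
    # grains in sorted order: leftmost with leftmost, and so on.
    total_cost = sum(abs(a - b) ** p for a, b in zip(sorted(xs), sorted(ys)))
    # Each grain carries mass 1/n; take the p-th root of the average cost.
    return (total_cost / n) ** (1 / p)


before = [0.0, 1.0, 2.0]   # pile before moving
after = [3.0, 4.0, 5.0]    # same pile, shifted 3 units to the right

print(wasserstein_1d(before, after, p=1))  # 3.0
print(wasserstein_1d(before, after, p=2))  # 3.0
```

Shifting the whole pile by 3 gives a distance of 3 for every p, which matches the intuition: every grain has to travel exactly 3 units and no amount of cleverness can shorten that.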