Which is the bigger evil: endogeneity or multicollinearity?

Can the problems arising from endogeneity or multicollinearity be ignored if the purpose is only prediction and not inference?

The worst thing that can happen with multicollinearity is perfect collinearity, in which case you cannot estimate your model at all because X'X is not invertible. Short of that, multicollinearity is in and of itself not an issue: it leaves you with less precise estimates, but that is roughly all that happens. Note that in the presence of multicollinearity your point estimates remain unbiased; only the estimated standard errors inflate. Endogeneity is a much bigger problem, as your estimates will be biased. In my opinion, endogeneity can never be ignored.
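
A minimal simulation makes the contrast concrete (the toy setup below is my own illustration, not the poster's): with two nearly collinear regressors, the OLS point estimates stay centered on the truth while their standard errors balloon.

```python
# Illustration only: near-collinear regressors inflate standard errors
# without biasing the point estimates.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 500
x1 = rng.normal(size=n)
x2 = 0.95 * x1 + 0.05 * rng.normal(size=n)   # almost a copy of x1
y = 1.0 + 2.0 * x1 + 3.0 * x2 + rng.normal(size=n)

fit = sm.OLS(y, sm.add_constant(np.column_stack([x1, x2]))).fit()
print(fit.params)  # still centered near [1, 2, 3] -> no bias
print(fit.bse)     # large standard errors on x1 and x2 -> lost precision
```
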
Endogeneity, and it's not close.

I'm an undergrad and hadn't heard of endogeneity yet. This is useful to look for in my regression models. Apparently a Hausman test is how you check whether variables are endogenous. Thanks for the new word!
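
For anyone else new to this, here is a sketch of the regression-based (Durbin-Wu-Hausman) version of that test. It assumes you have a valid instrument z for the suspect regressor x; all variable names and the data-generating process below are hypothetical.

```python
# Sketch of a Durbin-Wu-Hausman (control function) test for endogeneity.
# Assumes a valid instrument z: correlated with x, unrelated to the error.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 1000
z = rng.normal(size=n)                 # instrument (hypothetical)
u = rng.normal(size=n)                 # unobserved confounder
x = 0.8 * z + u + rng.normal(size=n)   # x is endogenous via u
y = 1.0 + 2.0 * x + u + rng.normal(size=n)

# Stage 1: regress the suspect regressor on the instrument, keep residuals.
v_hat = sm.OLS(x, sm.add_constant(z)).fit().resid

# Stage 2: add those residuals to the main regression. A significant
# coefficient on v_hat is evidence that x is endogenous.
fit = sm.OLS(y, sm.add_constant(np.column_stack([x, v_hat]))).fit()
print(fit.pvalues[2])                  # p-value on v_hat: small => endogenous
```
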
Multicollinearity can be countered with larger datasets.
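
A quick simulation of this claim (my own sketch, not the commenter's): with a near-collinear design like the one earlier in the thread, the standard error on x1 shrinks roughly like 1/sqrt(n) as the sample grows.

```python
# Quick check: under fixed near-collinearity, standard errors still
# shrink as the sample size grows.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
for n in (100, 1_000, 10_000):
    x1 = rng.normal(size=n)
    x2 = 0.95 * x1 + 0.05 * rng.normal(size=n)
    y = 1.0 + 2.0 * x1 + 3.0 * x2 + rng.normal(size=n)
    fit = sm.OLS(y, sm.add_constant(np.column_stack([x1, x2]))).fit()
    print(n, fit.bse[1])               # SE on x1 keeps falling with n
```
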
Multicollinearity has some theoretical implications in Sociology and Demography. For instance, studies of individuals' social origin, destination, and mobility, or of age-period-cohort effects, involve perfectly collinear variables (one is a linear transformation of the other two), yet all three are substantively meaningful. Endogeneity in the social sciences is something we are always aware of, but given the nature of social reality it is extremely difficult (as pointed out in this thread) to find a definitive solution.
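
To see the age-period-cohort identification problem concretely (my own minimal example, not the commenter's): because age = period - cohort exactly, a design matrix containing all three terms is rank-deficient, so X'X cannot be inverted without extra constraints.

```python
# Illustration: the exact identity age = period - cohort makes the
# design matrix rank-deficient, so OLS has no unique solution.
import numpy as np

rng = np.random.default_rng(1)
period = rng.integers(1950, 2020, size=200).astype(float)
cohort = rng.integers(1900, 1950, size=200).astype(float)
age = period - cohort                      # exact linear combination

X = np.column_stack([np.ones(200), age, period, cohort])
print(np.linalg.matrix_rank(X))            # 3, not 4: X'X is singular
```
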
In a broad sense I feel like endogeneity is harder to escape than multicollinearity, but any model will likely have both. How do you know you're missing an important regressor unless you overspecify the model and dig around? It's more practical to look into multicollinearity. With too much multicollinearity you lose precision, but with endogeneity you have a misspecified model. If you know a variable accounts for some of the error, don't throw it out, even if it's collinear.

I’m on team endogeneity. Large enough samples can make issues around collinearity less salient; the same cannot be said for endogeneity.

If all you care about is prediction, then you presumably do not care about coefficients. Endogeneity isn’t an issue here, per se, though it’s still possible to have biased predictions if you don’t do it correctly. This is a different problem, however.
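
A small simulation of that point (my construction, under an assumed data-generating process): the OLS coefficient is biased away from the causal effect, yet predictions on fresh data from the same distribution are as good as they can be.

```python
# Illustration: endogeneity biases the coefficient but not necessarily
# the predictions, as long as new data come from the same distribution.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)

def draw(n):
    u = rng.normal(size=n)                  # unobserved confounder
    x = u + rng.normal(size=n)              # endogenous regressor
    y = 2.0 * x + u + rng.normal(size=n)    # true causal effect of x is 2
    return x, y

x_tr, y_tr = draw(5_000)
fit = sm.OLS(y_tr, sm.add_constant(x_tr)).fit()
print(fit.params[1])                        # ~2.5: biased for the causal effect

x_te, y_te = draw(5_000)                    # fresh data, same distribution
mse = np.mean((y_te - fit.predict(sm.add_constant(x_te))) ** 2)
print(mse)                                  # near the irreducible error (~1.5)
```
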
Endogeneity leads to biased estimates, i.e. your statistical analysis/inference can become invalid. That said, the usual culprit is omitted variable bias, so you can often attempt to resolve the issue by finding and including the confounding variables.

Multicollinearity doesn't make your inference invalid; you just have a hard time getting statistical significance (or coefficients with the correct sign) in the first place. There are some workarounds, like aggregating collinear variables or taking a ratio of collinear variables. However, if you want to keep all of your input variables as-is in the model, then the only real solution is to get more observations from the data-generating process (i.e. get more samples).
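
Before choosing a workaround, it helps to know which regressors are actually collinear. A common diagnostic (my addition, not part of the original answer) is the variance inflation factor:

```python
# Diagnostic: variance inflation factors flag the collinear regressors
# before you decide whether to aggregate them or take ratios.
import numpy as np
from statsmodels.stats.outliers_influence import variance_inflation_factor
from statsmodels.tools import add_constant

rng = np.random.default_rng(0)
n = 500
x1 = rng.normal(size=n)
x2 = 0.9 * x1 + 0.1 * rng.normal(size=n)   # collinear with x1
x3 = rng.normal(size=n)                    # unrelated regressor
X = add_constant(np.column_stack([x1, x2, x3]))

for i, name in enumerate(["const", "x1", "x2", "x3"]):
    print(name, variance_inflation_factor(X, i))  # x1, x2 >> 10; x3 near 1
```
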

Whether endogeneity or multicollinearity is the more problematic issue depends on what you are trying to do. Personally, my work is generally post-hoc analysis of historical data, so getting more observations is usually impossible; hence I tend to have a harder time resolving multicollinearity issues in my variables.

On the other hand, you might realize you have omitted variable bias but find it infeasible to get data on the omitted variable(s). For example, in market research you might realize you need to control for competitor spending, but that data is generally very hard to obtain. In such cases, omitted variable bias is the more problematic issue.

So in true data science/statistics fashion, the answer is “it depends” :)

As for the second question: the answer is "yes". If all you are trying to do is minimize mean squared error (i.e. get the most accurate predictions), then you don't actually care about the estimated coefficients or their standard errors. In that case neither endogeneity (biased estimates) nor multicollinearity (large standard errors) is an issue, since you're not even looking at those things. For prediction, all you care about is the model's ability to output a number on unseen data, so what matters most is finding variables that give you the lowest prediction error on unseen data (usually measured with some kind of cross-validation score).
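
As a sketch of that prediction-first workflow (the library choice and setup are mine, not the poster's), you can judge a model purely on its out-of-fold error and never look at a coefficient:

```python
# Prediction-first workflow: score the model on out-of-fold error only;
# coefficients and standard errors never enter the picture.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n = 1_000
X = rng.normal(size=(n, 3))
X[:, 1] = 0.95 * X[:, 0] + 0.05 * rng.normal(size=n)  # collinear pair
y = X @ np.array([2.0, 3.0, -1.0]) + rng.normal(size=n)

scores = cross_val_score(LinearRegression(), X, y,
                         scoring="neg_mean_squared_error", cv=5)
print(-scores.mean())                      # average out-of-fold MSE
```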