Stats notes (1)

I was asked several statistics-related questions, which I found quite bizarre. The following are my reflections on them.

Contrary to popular belief, an ordinary linear regression model (i.e. one fitted by ordinary least squares, or OLS) does NOT require the assumption of normality (http://www.ats.ucla.edu/stat/stata/webbooks/reg/chapter2/statareg2.htm). Neither the dependent nor the independent variables are required to be normal. For the residuals, “normality is necessary only for hypothesis tests to be valid”.
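
A quick simulation illustrates the point. This is only a minimal sketch (it assumes numpy and statsmodels; the coefficients and the skewed error distribution are made up): OLS still recovers the coefficients when the errors are clearly non-normal, and only the hypothesis tests would lean on normality in small samples.

```python
# Sketch: OLS does not need normal errors to estimate coefficients well.
# Assumes numpy and statsmodels; the true coefficients (2.0, -1.5) are invented.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 5000
x = rng.normal(size=n)
# Skewed (exponential) errors, centred to mean zero -- clearly non-normal.
eps = rng.exponential(scale=1.0, size=n) - 1.0
y = 2.0 - 1.5 * x + eps

X = sm.add_constant(x)      # design matrix with intercept
fit = sm.OLS(y, X).fit()    # ordinary least squares
print(fit.params)           # close to [2.0, -1.5] despite non-normal residuals
```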

With this in mind, you cannot frame the difference between ordinary linear and logistic regression models as “one requires normality and the other does not”; that is a false claim. The main difference between the two is that one is for continuous dependent variables and the other is for binary (categorical) dependent variables.
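
To make the distinction concrete, here is a minimal sketch (again assuming numpy and statsmodels, with an invented data-generating setup): the same predictors, but an ordinary linear model for a continuous response and a logistic model for a binary response.

```python
# Sketch: same predictors, different kinds of dependent variable.
# Assumes numpy and statsmodels; the data-generating setup is invented.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
x = rng.normal(size=1000)
X = sm.add_constant(x)

# Continuous outcome -> ordinary linear regression (OLS).
y_cont = 1.0 + 0.5 * x + rng.normal(size=1000)
linear_fit = sm.OLS(y_cont, X).fit()

# Binary outcome -> logistic regression.
p = 1.0 / (1.0 + np.exp(-(0.5 + 1.2 * x)))
y_bin = rng.binomial(1, p)
logit_fit = sm.Logit(y_bin, X).fit(disp=0)

print(linear_fit.params, logit_fit.params)
```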

Now, logistic regression. Sure, it can be viewed as a special case of a generalised linear model (GLM) (as can the ordinary linear model), but the GLM is just one of several formulations. The logistic regression model can also be derived as a latent variable model (http://en.wikipedia.org/wiki/Logistic_regression#Formal_mathematical_specification). So if someone asks you “what do ordinary linear and logistic models have in common”, s/he needs to be specific about what s/he means (i.e. in common in terms of what?). One would never conclude or guess that “they are both GLMs” if the person being questioned thinks of a logistic model as a latent variable model.
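
For reference, a standard sketch of the latent-variable derivation (in the usual notation, as in the Wikipedia article linked above, not anything specific to the questions I was asked) is:

$$
y^{*} = x^{\top}\beta + \varepsilon, \qquad \varepsilon \sim \mathrm{Logistic}(0, 1), \qquad y = \mathbf{1}\{y^{*} > 0\},
$$

so that

$$
P(y = 1 \mid x) = P(\varepsilon > -x^{\top}\beta) = \frac{1}{1 + e^{-x^{\top}\beta}},
$$

which coincides with the GLM formulation $\operatorname{logit}\, P(y = 1 \mid x) = x^{\top}\beta$. The two views describe the same model, just from different starting points.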

Finally, how to assess whether a model is a good or bad one. Let’s be clear: the most important thing is model inference (statistics is all about making sense of data). Some are more interested in prediction. Fine. Then measures of goodness of fit should be used, such as R-squared, AIC, BIC, MAD, etc. Some then suggest cross-validation. Let me say this: cross-validation is generally not a good idea in statistics and in many cases is wrong. When you can “cross-validate”, it usually means that you have more data on hand. If you have more data, why not build your model on the WHOLE sample rather than estimate it on a crippled subsample? Even non-statisticians understand that the more data, the better the model. If you use a subsample, you get a worse model, which is less useful. As stated by Antonakis and Dietz (2011), “Nobel prizes have been earned in econometrics for methods to correct for truncated samples, among other contributions (e.g., Heckman, 1979; Tobin, 1958). In short, researchers must avoid or correct sample bias instead of creating it.” Cross-validation might be justifiable if, for a given model, your computer is too slow to use the whole sample, so you build the model on a subsample and test it on the remaining data. However, if we are talking about, and committed to, “big data” analysis, this is no excuse, and there is even less reason to conduct cross-validation.
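
For concreteness, here is a minimal sketch of pulling the usual goodness-of-fit measures from a fitted model (it assumes numpy and statsmodels; the data and model are placeholders, and “MAD” is read here as the mean absolute deviation of the residuals).

```python
# Sketch: common goodness-of-fit measures for a fitted model.
# Assumes numpy and statsmodels; data and model are placeholders.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(2)
x = rng.normal(size=500)
y = 1.0 + 0.8 * x + rng.normal(size=500)
fit = sm.OLS(y, sm.add_constant(x)).fit()

print("R-squared:", fit.rsquared)
print("AIC:      ", fit.aic)
print("BIC:      ", fit.bic)
# Mean absolute deviation of the residuals (one reading of "MAD").
print("MAD:      ", np.mean(np.abs(fit.resid)))
```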

It occurred to me that the questions you ask can actually reflect the level of your knowledge. For example, if one only knows that a logistic model is a GLM, s/he would design a question as confused as the one above, leaving others wondering what the heck?!

This entry was posted in Subjects.

One Response to Stats notes (1)

  1. Little Rabit the milk chocolate says:

    Although I have to admit I didn’t follow all the terminology, I cannot agree more with the last paragraph…
