Models, models, models…

Fitting a model is sometimes easy, sometimes difficult or even frustrating. This is especially true of the kind I am using, full Bayesian models. Yes, I know they are extremely flexible and powerful, and can do almost anything. But they are not very convenient, and they are slow (no surprise, given that they rely on simulation). The funny part is that the complex models often give you results similar to the simple ones (and complex models are the very reason to use Bayesian methods). So the question is: is it worth the effort?

For example, say you want to model count data. You may use Poisson, negative binomial (NB), or Poisson-lognormal with a conditional autoregressive (CAR) prior; a first-order or second-order neighbour structure; fixed or random effects for time, or even autocorrelation… Maybe NB is good enough. Lots of researchers around the world play around with the rest of these “complex” models and present them, though I guess they initially used NB to test different parameter combinations before finalising their models. Then they “found” that the results are similar for the complex and the less complex models. This is expected, and it is the very reason they used the less complex, faster models for testing, isn’t it? I am not saying this is purely “showing off”, but as one presenter at the TRB 2009 meeting stated, full Bayes is only worth it if the data come in small units that are correlated, and if you have a good PhD student (like me :)) who is willing to do the analysis…

Still, there are numerous problems with models. Sometimes the problem seems very simple and basic. Let’s look at the following example that my supervisor gave me the other day:

The Simpson’s Paradox:
Let’s say there are two people, Sleve and Mark. They are working to produce some “products”:

            Year 1          Year 2
Sleve       500/10 = 50     320/4 = 80
Mark        270/6 = 45      700/10 = 70

Here 500, 320, etc. are the numbers of products, and they are divided by, say, the number of months worked to get the “outcome per month”, so that the two people can be compared with each other.

So Sleve claims that he is more productive than Mark, because his values (50, 80) are indeed higher than Mark’s (45, 70) in each year.

Then Mark says: hold on, I am more productive than you, Sleve. Why? Let’s look at the whole period: over the two years, Sleve got (500+320)/(10+4)=58.6, and Mark got (270+700)/(6+10)=60.6. Clearly Mark is more productive than Sleve.

So whose claim is correct? Note that this is not a matter of sample size or sampling noise, as we are comparing exact ratios.
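The arithmetic above can be checked with a minimal Python sketch. The names `data`, `yearly_rates`, `pooled_rate` and `month_weights` are made up for illustration; the numbers are the ones from the example. It also shows that each person's pooled rate is simply his yearly rates averaged with the months worked as weights, which is where the reversal comes from.

```python
# (products, months) per person and year, taken from the example above.
data = {
    "Sleve": {"Year 1": (500, 10), "Year 2": (320, 4)},
    "Mark": {"Year 1": (270, 6), "Year 2": (700, 10)},
}


def yearly_rates(records):
    """Products per month, computed separately for each year."""
    return {year: p / m for year, (p, m) in records.items()}


def pooled_rate(records):
    """Total products divided by total months over all years."""
    total_products = sum(p for p, _ in records.values())
    total_months = sum(m for _, m in records.values())
    return total_products / total_months


def month_weights(records):
    """Share of a person's total months that falls in each year."""
    total_months = sum(m for _, m in records.values())
    return {year: m / total_months for year, (_, m) in records.items()}


for name, records in data.items():
    rates = yearly_rates(records)
    weights = month_weights(records)
    # The pooled rate equals the month-weighted average of the yearly rates.
    weighted = sum(weights[year] * rates[year] for year in records)
    print(name, rates, round(pooled_rate(records), 1), round(weighted, 1))
```

Sleve puts most of his weight (10 of 14 months) on Year 1, when rates were low for both, while Mark puts most of his (10 of 16 months) on Year 2, when rates were high for both. The two pooled figures therefore average the years with different weights.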

That said, models and data are not easy to understand. Sometimes the question is as simple as this: say you get a “more police, more crime” relationship from your model. Is it that police encourage more crime, or that crime actually decreases while recorded crime increases because more police are in place?

Maybe we should always keep this in mind when playing with the models: “All models are wrong, but some are useful.” – George E. P. Box.


10 Responses to Models, models, models…

  1. says:

    so in practice, we always prefer the averaging method as it’s stable.

  2. says:

    i mean i would say overall the two years, Mark is more productive.

  3. TAO says:


  4. Chao says:

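    @Tao: A model under which people all over the world envy North Korea would have to count as a non-"useful" model. It is like university rankings: a model that ranks Dalian University above Tsinghua University is not very useful, whereas under a useful model we could actually look at which of Dalian University of Technology and Renmin University of China ranks higher. The key is the complexity of the data analysis itself and how to interpret it. And of course it also has to stand up to scrutiny by researchers around the world.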

  5. Chao says:

    PS. The “Condor Heroes” couple (神雕侠侣) has shown up again :)

  6. L says:

    I didn’t use any models in my master thesis, simply becoz realise it’s not suitable for practical issues, though model may bring higher marks.. Yes, don’t think the ones who apply models really understand what they are using…

  7. WYG says:


  8. Augustine Karho says:

    I want to make a remark here, as this phenomenon, known as the Pearson-Yule-Simpson effect, is very common in probability theory, but rather paradoxical. Using the probability triple (O_i, F_i, P_i), where i = 1 or 2 for Sleve and Mark respectively, it is easy to see that Sleve performs better. However, when you try to use the average to look at them, you are creating a new probability triple (O, F, P), where O = O_1 \cup O_2, but F >> F_1 \cup F_2, with the same P. As a result, what P measures is totally different. Further, the influence function (in fact, the Fréchet derivative) of the common mean shows that the common mean varies greatly when F increases or decreases too much; this indicates that the common mean is not a good indicator in such a situation… Maybe you can try another estimator, such as the Winsorized mean or the trimean….

  9. Chao says:

    An appropriate answer is:
    Although Sleve is more productive than Mark in each year separately, a high proportion of the products were produced by Mark in Year 2, when his productivity was high (70); by contrast, most of Sleve’s products were produced in Year 1, when his productivity was low (50). This means that many products were “shifted” (or “assigned”) from Sleve to Mark in Year 2, when overall productivity improved (say, due to technical advances), so that most of Mark’s products were made at a high productivity rate. Overall, this raises Mark’s productivity much more than Sleve’s.

    To answer the question: if there is no particular reason why more products were assigned to Mark in Year 2 (i.e. products were shifted regardless), it is better to look at each year individually. The correct way to calculate overall productivity would then involve some weighting scheme. However, if the technological advances (for example) affect how many products Sleve or Mark can produce, then it is better to use the aggregated data (i.e. the two years’ data).

    In this particular case, since there is no other information on the shift of the products, we could assume that products were assigned in a purely arbitrary manner, and therefore the conclusion is that Sleve is more productive.

  10. Pingback: A solution to the Simpson’s Paradox | Chao's home
