Jeff_Perkinson

Diagnose This!

This blog post was written by a blogger who is no longer at SAS

Frequently, statisticians have to act like doctors. We see statistical reports that try to describe something: how fast rumors spread based on how large a company is, the relationship between nitrogen content and crop yield, or the link between speed and gas usage. Almost anything you can think of.


So today, put on your diagnostician's cap and look at the four relationships I show you here. To keep you from guessing, I've hidden the labels for the two variables, so you'll be looking at Y1 and X1, Y2 and X2, Y3 and X3, and so on. Here's the data:


X1    Y1      X2    Y2      X3    Y3      X4    Y4
10     8.04   10     9.14   10     7.46    8     6.58
 8     6.95    8     8.14    8     6.77    8     5.76
13     7.58   13     8.74   13    12.74    8     7.71
 9     8.81    9     8.77    9     7.11    8     8.84
11     8.33   11     9.26   11     7.81    8     8.47
14     9.96   14     8.10   14     8.84    8     7.04
 6     7.24    6     6.13    6     6.08    8     5.25
 4     4.26    4     3.10    4     5.39   19    12.50
12    10.84   12     9.13   12     8.15    8     5.56
 7     4.82    7     7.26    7     6.42    8     7.91
 5     5.68    5     4.74    5     5.73    8     6.89


I've even taken some advice I heard at a conference and added a plot to the statistics so that you can better see the relationship. I fit the least-squares line to each set and attached the plot of the line.


Here's Y1 vs. X1:


Y1 by X1


I highlighted some typical statistics that statisticians might use in discussing how well this line fits. Circles in the picture show the equation of the line (essentially y = 3 + ½x), the R² (≈ 0.666), and the p-value for the F-statistic (≈ 0.002). If you don't know what they are, bear with me. You'll still get the joke.
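
For readers who want to see where numbers like these come from, here is a minimal Python sketch that computes the least-squares line, R², and the F ratio for the first data set by hand. The function name `fit_line` and the variable names are illustrative assumptions, not anything taken from JMP. The sketch stops at the F ratio itself; turning that ratio into a probability needs the F distribution, which is left out here.

```python
# First data set from the table above (X1, Y1).
X1 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
Y1 = [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]


def fit_line(xs, ys):
    """Return (slope, intercept, r_squared, f_ratio) for a simple linear fit."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n

    # Sums of squares and cross-products around the means.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))

    slope = sxy / sxx
    intercept = mean_y - slope * mean_x

    ss_model = slope * sxy          # variation explained by the line
    ss_error = syy - ss_model       # variation left over
    r_squared = ss_model / syy
    f_ratio = ss_model / (ss_error / (n - 2))  # 1 and n-2 degrees of freedom
    return slope, intercept, r_squared, f_ratio


slope, intercept, r_squared, f_ratio = fit_line(X1, Y1)
print(f"y = {intercept:.2f} + {slope:.2f}x, R^2 = {r_squared:.3f}, F = {f_ratio:.2f}")
```

The slope comes out very close to ½ and the intercept very close to 3, matching the line quoted above.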


Here's Y2 by X2. Check the labels if you don't believe me:


Y2 by X2


Here's Y3 vs. X3:


Y3 by X3


And Y4 by X4:


Y4 by X4


You should have noticed that all the statistics are essentially identical. The graphs are identical too; in each case the line of best fit is pretty much y = 3 + ½x.
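
The claim is easy to check numerically. The following Python sketch, with made-up names such as `summary` and `SETS`, fits a least-squares line to each of the four data sets from the table and rounds the slope, intercept, and R² to two decimal places. It is only an illustration of the comparison, not a reproduction of the JMP reports.

```python
# The four data sets from the table above.
X123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
SETS = {
    "set 1": (X123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "set 2": (X123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "set 3": (X123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "set 4": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
              [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}


def summary(xs, ys):
    """Return (slope, intercept, r_squared), each rounded to two decimals."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r_squared = sxy ** 2 / (sxx * syy)
    return round(slope, 2), round(intercept, 2), round(r_squared, 2)


summaries = {name: summary(xs, ys) for name, (xs, ys) in SETS.items()}
for name, stats in summaries.items():
    print(name, stats)

# All four rounded summaries collapse to a single value.
all_match = len(set(summaries.values())) == 1
print("identical to two decimals:", all_match)
```

Rounding to two decimals is a deliberate choice: beyond that, the four sets differ slightly, which is why the statistics are best described as matching to the precision a typical report shows.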


Here's the playing-doctor part. Consider the fact that you've got four patients (graphs) exhibiting identical symptoms, numerically and graphically. What can you tell me about the underlying causes? It turns out, not much. Although I blindly followed the "put a graph in there" rule, it turns out I left out the most important graph, that of the data itself.


Here are the four graphs again, with the data points turned on.

[Figure: the four graphs with the data points shown]
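
If you have no plotting tool at hand, even a crude text plot shows the raw points. This is a rough Python sketch using only the standard library; the function `ascii_scatter` and its `width` and `height` parameters are hypothetical names chosen for illustration, and points that land in the same character cell are drawn as a single star.

```python
# Fourth data set from the table above (X4, Y4).
X4 = [8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8]
Y4 = [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]


def ascii_scatter(xs, ys, width=40, height=12):
    """Return a list of text rows with '*' marking each data point."""
    x_lo, x_hi = min(xs), max(xs)
    y_lo, y_hi = min(ys), max(ys)
    grid = [[" "] * width for _ in range(height)]
    for x, y in zip(xs, ys):
        # Scale x to a column and y to a row (row 0 is the top of the plot).
        col = round((x - x_lo) / (x_hi - x_lo) * (width - 1))
        row = round((y_hi - y) / (y_hi - y_lo) * (height - 1))
        grid[row][col] = "*"
    return ["".join(cells) for cells in grid]


rows = ascii_scatter(X4, Y4)
for line in rows:
    print("|" + line + "|")
```

For the fourth set, the output is a column of stars down the left edge and one lone star in the top right corner, which is exactly the pattern the fitted line alone hides.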



  • The first graph is exactly what you want to see in a regression. Points are reasonably dispersed around the line of best fit.

  • The second graph is clearly not a linear relationship. If I wanted to show off, I'd say that you could fit a second-degree polynomial, a parabola, a conic section, to this data. But I don't want to show off. Just say that if your data isn't linear, you shouldn't fit a line to it. Either transform the original data to make it linear (it's not cheating, really) or use another kind of model.

  • The third graph shows what happens when one point is an outlier. The single point is essentially pulling the least-squares line upward. I would check that data point, since anyone can have a bad day when transcribing data.

  • The fourth graph is an even more extreme case of the third one. All the points line up along the vertical line x = 8. Except one. And it completely determines the equation of the line. Move it anywhere, and the least-squares line follows.
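
The last three diagnoses can each be checked with a little arithmetic. The Python sketch below is one possible way to do it, with all names (`fit_line`, `signs2`, `slope3_trim`, and so on) invented for illustration: it looks at the sign pattern of the residuals in the second set, refits the third set without its worst-fitting point, and moves the lone point in the fourth set to an arbitrary new height of 1.5.

```python
# Data from the table above; sets 2 and 3 share the same x values.
X = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
Y2 = [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]
Y3 = [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]
X4 = [8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8]
Y4 = [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]


def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares line."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Set 2: residuals ordered by x. A curved relationship shows up as a
# run of one sign, then the other, then the first again.
slope2, icpt2 = fit_line(X, Y2)
resid2 = [y - (icpt2 + slope2 * x) for x, y in sorted(zip(X, Y2))]
signs2 = "".join("+" if r > 0 else "-" for r in resid2)
print("set 2 residual signs:", signs2)

# Set 3: drop the point with the largest absolute residual and refit.
slope3, icpt3 = fit_line(X, Y3)
worst = max(range(len(X)), key=lambda i: abs(Y3[i] - (icpt3 + slope3 * X[i])))
x3_trim = [x for i, x in enumerate(X) if i != worst]
y3_trim = [y for i, y in enumerate(Y3) if i != worst]
slope3_trim, icpt3_trim = fit_line(x3_trim, y3_trim)
print(f"set 3 without ({X[worst]}, {Y3[worst]}): "
      f"y = {icpt3_trim:.2f} + {slope3_trim:.3f}x")

# Set 4: move the lone point at x = 19 (new height 1.5 is arbitrary) and refit.
lone = X4.index(19)
y4_moved = list(Y4)
y4_moved[lone] = 1.5
slope4_moved, icpt4_moved = fit_line(X4, y4_moved)
print(f"set 4 with the point moved: y = {icpt4_moved:.2f} + {slope4_moved:.3f}x")
```

In the fourth set the refitted line still passes through the moved point, because the other ten points share a single x value and so cannot say anything about the slope on their own.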



  • 2 Comments

    Daniel wrote:

    I saw this data set presented in an old Tufte book a while back and it opened my eyes to the power and value of being able to visualize my data with JMP. Because JMP lets us examine ALL the data at once, we don't have to rely solely on summary statistics. I wonder how many calculations and decisions have been based on "lines" that are not true to the underlying data and process...
