Byron_JMP

Staff

Joined:

Apr 26, 2012

Fitting Curves to the XKCD Data Set

Randall Munroe publishes one of my all-time favorite comics, xkcd. Last week's comic was on curve fitting, and it was, arguably, kind of hilarious. @Randall Munroe, dude, you're hilarious. Anyone who was at Discovery and heard about his Thing Explainer concept would surely agree.

 

So anyway, I read the comic and then dove into the data. There is an amazing tool that @brady_brady originally published for generating Homerian distributions; it is buried in this tremendous add-in, a must-have collection of other useful scripts and add-ins. With the "Get Data Points By Clicking On Image" tool I extracted the exact data from the comic (attached below) and then started fitting all kinds of models: conventional, less conventional, and some possibly dishonest. I paid no regard to how badly overfit the models were, even though I could (and should) have with JMP Pro.

 

Screen Shot 2018-09-25 at 3.52.22 PM.png

Quick Note on Axis Scale. This one seemed good to me, although it was missing on the original figure. I could have done that too, but then there would have been a lot of clicking and I just didn't have it in me today.

 

 

Here is a ranked table of fits

Measures of Fit for Y sorted by RSquare

| Predictor | Creator | RSquare | RASE | AAE | Notes |
|---|---|---|---|---|---|
| regression 10 lags | Fit Least Squares | 0.9808 | 2.9103 | 2.3275 | Not legal: fit X along with a series of 10 sequential lags to the data. Works great when there is any hint of a trend in the data. Might overfit, but only just a little, plus there are 11 parameters and I didn't bother to report adjusted R-square here. |
| partition BF | Bootstrap Forest | 0.8469 | 7.4494 | 5.4207 | 500 trees, and some of them fell in the right direction |
| partition BT | Boosted Tree | 0.6737 | 10.876 | 8.1600 | 50 layers, used defaults, un-tuned |
| partition over fit | Partition | 0.6607 | 11.091 | 8.3713 | Made the minimum split size small and split to the end |
| regression ^10 | Fit Least Squares | 0.5687 | 12.505 | 9.3912 | 10th order polynomial, kind of wiggly |
| regression ^8 | Fit Least Squares | 0.5479 | 12.803 | 9.6483 | 8th order polynomial, not quite as wiggly as 10th order |
| regression 2 pieces | Fit Least Squares | 0.4459 | 14.173 | 9.8317 | Used partition to split the data in two and fit two regression models |
| fit curve 4pR | Fit Curve Logistic 4P Rodbard | 0.4277 | 14.404 | 10.496 | 4-parameter logistic regression, Rodbard model |
| partition 2s | Partition | 0.4249 | 14.439 | 10.745 | Partition with only one split, which results in two levels; not likely overfit |
| fit curve 5p | Fit Curve Logistic 5P | 0.4104 | 14.620 | 10.826 | 5-parameter logistic regression |
| regression ^4 | Fit Least Squares | 0.3932 | 14.832 | 10.402 | 4th order polynomial |
| regression ^2 | Fit Least Squares | 0.3682 | 15.134 | 10.541 | Regular old 2nd order polynomial |
| fit curve 3p growth | Fit Curve Mechanistic Growth | 0.3393 | 15.477 | 11.080 | Fancy non-linear model |
| regression | Fit Least Squares | 0.2779 | 16.180 | 12.209 | Straight line |
| regression log | Fit Least Squares | 0.2472 | 16.520 | 12.695 | X is log transformed |
| regression exp | Fit Least Squares | 0.0691 | 18.371 | 15.429 | X is exponentiated (exp transform) |
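For anyone curious what the "partition 2s" row amounts to, here is a rough Python sketch of a one-split fit: try every cut point on X, predict the mean of Y on each side, and keep the cut with the smallest total squared error. This is an illustration only. The function names are made up for this sketch, and the squared-error criterion is an assumption; it is not necessarily the rule the JMP Partition platform uses to pick its split.

```python
def best_single_split(xs, ys):
    """Return (sse, cut, left_mean, right_mean) for the best single cut on x.

    Assumption: "best" means smallest total squared error when each side
    is predicted by its own mean. Hypothetical helper, not JMP's algorithm.
    """
    pairs = sorted(zip(xs, ys))
    best = None
    for i in range(1, len(pairs)):
        if pairs[i][0] == pairs[i - 1][0]:
            continue  # cannot cut between two identical x values
        left = [y for _, y in pairs[:i]]
        right = [y for _, y in pairs[i:]]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((y - left_mean) ** 2 for y in left)
               + sum((y - right_mean) ** 2 for y in right))
        cut = (pairs[i - 1][0] + pairs[i][0]) / 2  # midpoint between neighbors
        if best is None or sse < best[0]:
            best = (sse, cut, left_mean, right_mean)
    if best is None:
        raise ValueError("need at least two distinct x values to split")
    return best


def predict(x, cut, left_mean, right_mean):
    """Two-level prediction: one mean below the cut, another at or above it."""
    return left_mean if x < cut else right_mean
```

With only two possible predicted values, a model like this has very little room to chase noise, which is the point of the "not likely overfit" note.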

 

You might expect to find some rationale behind why I tried each of the different models, but at the end of the day this is a blog, not an academic paper, so don't burn too much time looking for that. I got this table from the Model Comparison platform in JMP Pro. All I had to do was find all the methods and save the prediction formula from each one; the fancy platform did all the math.
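If you want to check the math outside JMP, the three measures in the table are easy to reproduce from a column of actual values and a column of predictions. The sketch below uses the usual definitions: RSquare as 1 - SSE/SST, RASE as the root average squared error, and AAE as the average absolute error. The function names are hypothetical, and the adjusted R-square helper assumes p counts predictors excluding the intercept.

```python
import math


def fit_measures(actual, predicted):
    """Return (rsquare, rase, aae) for paired actual and predicted values."""
    n = len(actual)
    mean_y = sum(actual) / n
    residuals = [y - p for y, p in zip(actual, predicted)]
    sse = sum(r * r for r in residuals)           # sum of squared errors
    sst = sum((y - mean_y) ** 2 for y in actual)  # total sum of squares
    rsquare = 1.0 - sse / sst
    rase = math.sqrt(sse / n)                     # root average squared error
    aae = sum(abs(r) for r in residuals) / n      # average absolute error
    return rsquare, rase, aae


def adjusted_rsquare(rsquare, n, p):
    """Penalize R-square for p predictors (intercept not counted) and n rows."""
    return 1.0 - (1.0 - rsquare) * (n - 1) / (n - p - 1)
```

The adjusted version is the number the "regression 10 lags" note admits to skipping: with that many parameters, the penalty grows quickly when the data set is small.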

 

Just in case you wanted to play with the data (and seriously, I do mean play), it should be attached below this article.

 

 

 

 

 
