Ruth Hummel, JMP Academic Ambassador, SAS
Rob Carver, Professor Emeritus, Stonehill College / Brandeis University
Statistics educators have long recognized the value of projects and case studies as a way to integrate the topics in a course. Whether introducing novice students to statistical reasoning or training employees in analytic techniques, it is valuable for students to learn that analysis occurs within the context of a larger process that should follow a predictable workflow.
In this presentation, we’ll demonstrate how the JMP Project tool supports each stage of an analysis of Airbnb listings data. Using Journals, Graph Builder, Query Builder and many other JMP tools within the JMP Project environment, students learn to document the process. The process looks like this:
- Ask a question.
- Specify the data needs and analysis plan.
- Get the data.
- Clean the data.
- Do the analysis.
- Tell your story.
We do our students a great favor by teaching a reliable workflow, so that they begin to follow the logic of statistical thinking and develop good habits of mind. Without the workflow orientation, a statistics course looks like a series of unconnected and unmotivated techniques. When students adopt a project workflow perspective, the pieces come together in an exciting way.
Speaker | Transcript |
So welcome everyone. My name is | |
Ambassador with JMP. I am now a | |
retired professor of Business | |
between a student and a | |
professor working on a project. | |
engage students in statistical | |
reasoning, teach that | |
to that, current thinking is | |
that students should be learning | |
about reproducible workflows, | |
elementary data management. And, | |
again, viewing statistics as | |
wanted to join you today on this | |
virtual call. Thanks for having | |
and specifically in Manhattan, | |
and you'd asked us so you | |
And we chose to do the Airbnb | |
renter perspective. So we're | |
expensive. | |
So we | |
started filling out...you gave us | |
separate issue, from your main | |
focus of finding a place in | |
you get...if you get through the | |
first three questions, you've | |
know, is there a part of | |
Manhattan you're interested in? | |
repository that you sent us to. | |
And we downloaded the really | |
thing we found, there were like | |
four columns in this data set | |
figured out so that was this | |
one, the host neighborhood. So | |
figured out that the first two | |
just have tons of little tiny | |
Manhattan. So we selected | |
Manhattan. And then when we had | |
that and then that's how we got | |
our Manhattan listings. So | |
data is that you run into these | |
issues like why are there four | |
restricted it to Manhattan, I'll | |
go back and clean up some | |
data will describe everything we | |
did to get the data, we'll talk | |
know I'm supposed to combine | |
them based on zip, the zip code, | |
107 columns, | |
it's just hard to find the | |
them, so we knew we had to clean | |
that up. All right, we also had | |
journal of notes. In order to | |
clean this up, we use the Recode | |
Exactly. Cool. | |
Okay, so we did the cleanup | |
Manhattan tax data has this zip | |
code. So I have this zip code | |
day of class, when we talked | |
about | |
data types. And notice in the | |
the...analyze the distribution of | |
that column, it'll make a funny | |
Manhattan doesn't really tell | |
you a thing. | |
But the zip code clean data in | |
just a label, an identifier, and | |
more to the point, | |
when you want to join or merge | |
important. It's not just an | |
abstract idea. You can't merge | |
nominal was the modeling type, | |
we just made sure. | |
about the main table is the | |
listings. I want to keep | |
to combine it with Manhattan tax | |
data. | |
Yeah. Then what? Then we need to | |
tell it that the column called | |
zip clean, | |
zip code clean... | |
Almost. There we go. | |
And the column called zip, which | |
Airbnb listing | |
and match it up with anything in | |
them in table every row, whether | |
it matches with the other or | |
main table, and then only the stuff | |
that overlaps from the second | |
another name like, Airbnb IRS | |
or something? Yeah, it's a lot | |
do one more thing | |
because I noticed these are just | |
data tables scattered around | |
running. Okay. So I'll save this | |
data table. Now what? | |
And really, this is the data | |
anything else, before we lose | |
track of where we are, let's | |
or Oak Team? | |
And then | |
part of the idea of a project | |
thing. So if you | |
grab, I would say, take the | |
two original data sets, and then | |
my final merged. Okay. Now | |
them as tabs. | |
And as you generate graphs and | |
even when I have it in these | |
tabs. Okay, that's really cool. | |
right, go Oak Team. | |
Well, hi, Dr. Carver, thanks so | |
you would just glance at some of | |
these things, and let me know if | |
we used Graph Builder to look at | |
the price per neighborhood. And | |
help it be a little easier to | |
compare between them. So we kind | |
have a lot of experience with | |
New York City. So we plotted | |
stand in front of the UN and | |
take a picture with all the | |
saying in Gramercy Park or | |
Murray Hill. | |
If we look back at the | |
thought we should expand our | |
search beyond that neighborhood to | |
just plotted what the averages | |
were for the neighborhoods but | |
the modeling, and to model the | |
prediction. So if we could put | |
expected price. We started | |
building a model and what we've | |
factors. And so then when we put | |
those factors into just a | |
more, some of the fit statistics | |
you've told us about in class. | |
but mostly it's a cloud around | |
that residual zero line. So | |
which was way bigger than any of | |
our other models. So we know | |
reasons we use real data. | |
Sometimes, this is real. This is | |
looking? | |
Like this is residual values. | |
is good. Ah, cool. | |
Cool. Okay, so I'll look for | |
is sort of how we're answering | |
our few important questions. And | |
was really difficult to clean | |
the data and to join the data. | |
wanted to demonstrate how JMP | |
in combination with a real world | |
Number one in a real project, | |
scoping is important. We want to | |
hope to bring to the | |
group. Pitfall number two, | |
it's vital to explore the | |
the area of linking data, | |
combining data from multiple | |
recoding | |
and making sure that linkable | |
reproducible research is vital, | |
especially in a team context, | |
especially for projects that may | |
habits of guaranteeing | |
reproducibility. And finally, | |
we hope you notice that in these | |
on the computation and | |
interpretation falls by the | |
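The join discussed in the transcript (keep every row of the main listings table, attach only the matching rows from the tax table, and match on a cleaned zip code treated as a label rather than a number) can be sketched outside JMP. This is a minimal Python illustration under stated assumptions, not the presenters' JMP workflow: the column names (`zip_raw`, `zip_clean`, `zip`, `returns`), the helper functions, and all sample values are hypothetical.

```python
# Hypothetical sample rows; NOT taken from the Airbnb or tax data sets.
listings = [
    {"id": 1, "zip_raw": "10016", "price": 150},
    {"id": 2, "zip_raw": "10016-1234", "price": 220},
    {"id": 3, "zip_raw": " 10003 ", "price": 180},
    {"id": 4, "zip_raw": "", "price": 95},
]
tax_data = [
    {"zip": "10016", "returns": 1000},
    {"zip": "10003", "returns": 800},
    {"zip": "10280", "returns": 300},
]


def clean_zip(raw):
    """Return a 5-character zip string, or None if no usable code is present."""
    candidate = raw.strip()[:5]
    return candidate if len(candidate) == 5 and candidate.isdigit() else None


def left_join(main_rows, other_rows, main_key, other_key):
    """Keep every main row; add fields from the matching other row if one exists."""
    lookup = {row[other_key]: row for row in other_rows}
    joined = []
    for row in main_rows:
        match = lookup.get(row[main_key], {})
        extra = {k: v for k, v in match.items() if k != other_key}
        joined.append({**row, **extra})
    return joined


# Store the cleaned code as text so both tables hold the same kind of value.
for row in listings:
    row["zip_clean"] = clean_zip(row["zip_raw"])

merged = left_join(listings, tax_data, "zip_clean", "zip")
```

Keeping the zip code as text on both sides is what makes the keys comparable. Rows of the second table with no matching listing (here the `10280` row) do not appear in `merged`, while a listing with no usable zip code is kept without the extra fields.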