Teaching Workflows from Day 1: Using JMP® Projects in the Classroom (2020-US-30MP-608)

Ruth Hummel, JMP Academic Ambassador, SAS
Rob Carver, Professor Emeritus, Stonehill College / Brandeis University

 

Statistics educators have long recognized the value of projects and case studies as a way to integrate the topics in a course. Whether introducing novice students to statistical reasoning or training employees in analytic techniques, it is valuable for students to learn that analysis occurs within the context of a larger process that should follow a predictable workflow.

In this presentation, we’ll demonstrate the JMP Project tool to support each stage of an analysis of Airbnb listings data. Using Journals, Graph Builder, Query Builder and many other JMP tools within the JMP Project environment, students learn to document the process. The process looks like this:

  • Ask a question.
  • Specify the data needs and analysis plan.
  • Get the data.
  • Clean the data.
  • Do the analysis.
  • Tell your story.

We do our students a great favor by teaching a reliable workflow, so that they begin to follow the logic of statistical thinking and develop good habits of mind. Without the workflow orientation, a statistics course looks like a series of unconnected and unmotivated techniques. When students adopt a project workflow perspective, the pieces come together in an exciting way.
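
As a rough illustration of the "Clean the data" stage and the join the transcript goes on to describe (recoding ZIP codes, treating them as labels rather than numbers, and keeping every listing while attaching only the matching rows from the tax table), here is a minimal Python sketch. It is not the presenters' JMP workflow. The sample rows, the column names (`zipcode`, `zip_clean`, `zip`, `returns`) and the helper functions `clean_zip` and `left_join` are all hypothetical and exist only for this example.

```python
import re


def clean_zip(raw):
    """Return the first 5-digit run in raw as a string, or None if there is none.

    The ZIP code stays a string because it is a label, not a quantity.
    """
    match = re.search(r"\d{5}", str(raw))
    return match.group(0) if match else None


def left_join(main_rows, other_rows, main_key, other_key):
    """Keep every row of main_rows and attach the fields of the matching other row.

    Rows in main_rows with no match are kept unchanged; unmatched rows in
    other_rows are dropped. If other_rows repeats a key, the last one wins.
    """
    lookup = {row[other_key]: row for row in other_rows}
    joined = []
    for row in main_rows:
        match = lookup.get(row[main_key], {})
        extra = {k: v for k, v in match.items() if k != other_key}
        joined.append({**row, **extra})
    return joined


# Made-up sample data (assumption): messy ZIP entries in the listings table.
listings = [
    {"id": 1, "zipcode": "10016-1234", "price": 150},
    {"id": 2, "zipcode": "NY 10003", "price": 200},
    {"id": 3, "zipcode": "10044", "price": 120},
]
# Made-up tax table (assumption) keyed by a clean 5-digit ZIP string.
tax = [
    {"zip": "10016", "returns": 1000},
    {"zip": "10003", "returns": 2000},
]

# Clean step: add a recoded key column without overwriting the original.
for row in listings:
    row["zip_clean"] = clean_zip(row["zipcode"])

# Join step: the listings table is the main table, so all three rows survive.
merged = left_join(listings, tax, "zip_clean", "zip")
for row in merged:
    print(row)
```

The join only works because both key columns hold the same kind of value (5-digit strings), which is the data-type point the transcript raises about ZIP codes.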

Auto-generated transcript...

 

Speaker

Transcript

So welcome everyone. My name is
00 07.933
3
Ambassador with JMP. I am now a
retired professor of Business
00 30.566
7
between a student and a
professor working on a project.
00 49.700
11
12
engage students in statistical
reasoning, teach that
00 12.433
16
to that, current thinking is
that students should be learning
about reproducible workflows,
00 36.266
21
elementary data management. And,
again, viewing statistics as
00 58.800
25
26
wanted to join you today on this
virtual call. Thanks for having
00 20.600
30
and specifically in Manhattan,
and you'd asked us so you
00 36.433
34
And we chose to do the Airbnb
renter perspective. So we're
00 51.733
38
expensive.
So we
started filling out...you gave us
00 09.166
43
44
separate issue, from your main
focus of finding a place in
00 36.066
49
you get...if you get through the
first three questions, you've
00 54.100
53
know, is there a part of
Manhattan, you're interested in?
00 11.133
58
repository that you sent us to.
And we downloaded the really
00 26.433
32.866
63
thing we found, there were like
four columns in this data set
00 46.766
67
figured out so that was this
one, the host neighborhood. So
00 58.100
71
72
figured out that the first two
just have tons of little tiny
00 13.300
76
Manhattan. So we selected
Manhattan. And then when we had
00 29.700
80
that and then that's how we got
our Manhattan listings. So
00 44.033
84
data is that you run into these
issues like why are there four
00 03.300
88
restricted it to Manhattan, I'll
go back and clean up some
00 18.033
92
data will describe everything we
did to get the data, we'll talk
00 28.400
33.200
97
know I'm supposed to combine
them based on zip, the zip code,
00 47.166
101
102
107 columns,
it's just hard to find the
00 09.366
106
them, so we knew we had to clean
that up. All right, we also had
00 27.366
111
journal of notes. In order to
clean this up, we use the recode
00 45.500
115
Exactly. Cool.
Okay, so we did the cleanup
00 02.200
119
Manhattan tax data has this zip
code. So I have this zip code
00 19.300
123
day of class, when we talked
about
data types. And notice in the
00 42.300
128
the...analyze the distribution of
that column, it'll make a funny
00 03.200
133
Manhattan doesn't really tell
you a thing.
But the zip code clean data in
00 18.466
23.266
139
just a label, an identifier, and
more to the point,
when you want to join or merge
00 41.833
48.766
145
important. It's not just an
abstract idea. You can't merge
00 03.166
11.266
150
nominal was the modeling type,
we just made sure.
00 26.200
31.033
155
about the main table is the
listings. I want to keep
00 45.533
159
to combine it with Manhattan tax
data.
Yeah. Then what? Then we need to
00 03.266
164
tell it that the column called
zip clean,
zip code clean...
Almost. There we go.
And the column called zip, which
00 33.200
171
172
Airbnb listing
and match it up with anything in
00 57.033
177
178
them in table every row, whether
it matches with the other or
00 13.233
182
main table, and then only the stuff
that overlaps from the second
00 29.600
186
another name like, Airbnb IRS
or something? Yeah, it's a lot
00 50.966
190
do one more thing
because I noticed these are just
data tables scattered around
00 06.666
195
running. Okay. So I'll save this
data table. Now what?
And really, this is the data
00 19.833
22.033
26.266
35.466
203
anything else, before we lose
track of where we are, let's
00 49.733
58.800
01.833
209
or Oak Team?
And then
part of the idea of a project
00 23.700
214
thing. So if you
grab, I would say, take the
00 50.100
218
219
220
two original data sets, and then
my final merged. Okay. Now
00 16.200
225
them as tabs.
And as you generate graphs and
00 36.566
229
230
231
even when I have it in these
tabs. Okay, that's really cool.
00 58.833
02.500
236
right, go Oak Team.
Well, hi, Dr. Carver, thanks so
00 19.233
240
you would just glance at some of
these things, and let me know if
00 32.300
244
we used Graph Builder to look at
the price per neighborhood. And
00 45.400
248
help it be a little easier to
compare between them. So we kind
00 01.000
252
have a lot of experience with
New York City. So we plotted
00 18.166
256
stand in front of the UN and
take a picture with all the
00 31.733
260
saying in Gramercy Park or
Murray Hill.
If we look back at the
00 46.566
265
thought we should expand our
search beyond that neighborhood to
00 58.766
269
270
just plotted what the averages
were for the neighborhoods but
00 14.533
274
the modeling, and to model the
prediction. So if we could put
00 30.766
279
expected price. We started
building a model and what we've
00 42.800
283
factors. And so then when we put
those factors into just a
00 58.833
287
more, some of the fit statistics
you've told us about in class.
00 15.466
292
but mostly it's a cloud around
that residual zero line. So
00 30.766
296
which was way bigger than any of
our other models. So we know
00 45.800
300
reasons we use real data.
Sometimes, this is real. This is
00 58.266
304
looking?
Like this is residual values.
00 19.266
309
is good. Ah, cool.
Cool. Okay, so I'll look for
00 34.966
313
is sort of how we're answering
our few important questions. And
00 47.300
317
was really difficult to clean
the data and to join the data.
00 57.866
03.500
322
wanted to demonstrate how JMP
in combination with a real world
00 28.700
327
Number one in a real project,
scoping is important. We want to
00 47.600
331
hope to bring to the
group. Pitfall number two,
it's vital to explore the
00 08.033
336
the area of linking data
combining data from multiple
00 27.800
341
recoding
and making sure that linkable
00 45.100
345
346
reproducible research is vital,
especially in a team context,
especially for projects that may
00 05.966
351
habits of guaranteeing
reproducibility. And finally,
we hope you notice that in these
00 32.633
356
on the computation and
interpretation falls by the
00 51.900
360