A Prediction Is Worth a Thousand Words: An Introduction to Term Selection in JMP® Pro (2021-EU-30MP-747)

Level: Intermediate

Clay Barker, JMP Principal Research Statistician Developer, SAS
Paris Faison, JMP Statistical Tester, SAS
Ernest Pasour, JMP Principal Software Developer, SAS

The Text Explorer platform in JMP is a powerful tool for gaining insight into text data. New in JMP Pro 16, the Term Selection feature brings the power of the Generalized Regression platform to Text Explorer. Term Selection makes it easy to build and refine regression models using text data as predictors, whether the goal is to predict new observations or to gain understanding about a process. In this talk, we will provide an overview of this new feature and look at some fun examples.

Auto-generated transcript...

Speaker	Transcript
Clay Barker	Thank you. My name is Clay Barker. I'm a statistical developer in the JMP group, and today I'm going to be talking about a new feature in JMP Pro 16. It's called Term Selection, and I've worked on it with my colleagues Paris Faison and Ernest Pasour.
	So text data are becoming more and more common in practice. We may have customer reviews of a product we make, we may have descriptions of some events or maintenance logs for some of our equipment.
	And this example here on my slides is from the aircraft incidents data set that we have in the sample data.
	Every row of data is a description of some airline incident, a crash or some other kind of malfunction.
	So if you've never used the Text Explorer platform, it was introduced in JMP 13 and we primarily use it to help summarize and analyze text data.
	So it makes it easy to do common tasks like looking at the most frequent or most commonly occurring words and phrases, and it makes it easy to do things like cluster the documents or look for themes, like topic analysis. And everything is driven by what's called the document term matrix.
	So what is the document term matrix? It's easiest to think of it just as a matrix of indicator variables for whether or not each document includes a particular word or phrase.
	And we may weight that by how often the word occurs in each document. So each document is a row in our document term matrix and each column is a word.
	So for a really simple example here, that first line is "the pizza packaging was frustrating." So we have a 1 for packaging,
	a 1 for pizza and a 1 for frustrating and 0 for all the other words. And, likewise, for the second line, we have a 1 for smell, great, taste, and pizza and 0s elsewhere. It's just a simple summary of what words occur in each document.
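As a minimal sketch of the binary document term matrix just described (not JMP's implementation), the Python below builds one for two toy sentences. The second sentence and the stop-word list are assumptions made for illustration; the talk only names the terms that end up with a 1.

```python
# Hypothetical toy documents. The second sentence is invented so that it
# contains the terms named in the talk (smell, great, taste, pizza).
docs = [
    "the pizza packaging was frustrating",
    "the smell and taste of the pizza was great",
]

# Assumed stop-word list: common words that are left out of the matrix.
STOP_WORDS = {"the", "was", "and", "of"}


def tokenize(text):
    """Lowercase, split on whitespace, and drop stop words."""
    return [w for w in text.lower().split() if w not in STOP_WORDS]


def binary_dtm(documents):
    """Return (sorted terms, 0/1 matrix) with one row per document."""
    tokens = [tokenize(d) for d in documents]
    terms = sorted({w for doc in tokens for w in doc})
    matrix = [[1 if t in doc else 0 for t in terms] for doc in tokens]
    return terms, matrix


terms, dtm = binary_dtm(docs)
print(terms)
for row in dtm:
    print(row)
```

Each row is a document and each column is a term, so the first row has a 1 only under frustrating, packaging, and pizza.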
	I've also mentioned that there are a couple of variations of the document term matrix. The easiest is the binary; it's just ones and zeros.
	But we may want to include information about how often each word occurs, so here's another slightly longer sentence where pizza appears multiple times.
	In the binary document term matrix, that pizza column is still a 1 because we don't care how often it occurs. For ternary it's 2, because the ternary is zeros, ones and twos. It's 0 if the word doesn't occur, it's 1 if it occurs once, and it's 2 if it occurs more than once.
	Even though it occurs four times here, we still code it as a 2. And then frequency is really simple. It's just the number of times each word occurs in that document.
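The three simple weightings can be sketched in a few lines of Python. This is an illustration of the definitions given in the talk, not JMP code; the function name `weight_counts` and the example tokens are hypothetical.

```python
from collections import Counter


def weight_counts(tokens, scheme):
    """Weight one document's term counts; absent terms are implicitly 0."""
    counts = Counter(tokens)
    if scheme == "binary":      # 1 if the term occurs at all
        return {t: 1 for t in counts}
    if scheme == "ternary":     # 1 if it occurs once, 2 if more than once
        return {t: min(c, 2) for t, c in counts.items()}
    if scheme == "frequency":   # raw number of occurrences
        return dict(counts)
    raise ValueError(f"unknown scheme: {scheme}")


# Hypothetical tokenized document in which "pizza" appears four times.
tokens = ["pizza", "pizza", "great", "pizza", "smell", "pizza"]

for scheme in ("binary", "ternary", "frequency"):
    print(scheme, weight_counts(tokens, scheme))
```

For this document the pizza entry is 1 under binary, 2 under ternary, and 4 under frequency, while great and smell are 1 under all three.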
	So those are three really simple ones. We also offer TF-IDF weighting, which is kind of like a relative frequency, and you can learn more about these weighting schemes in the Text Explorer platform.
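For readers who have not met TF-IDF, here is a small sketch of one common textbook variant: term frequency times the natural log of (number of documents / number of documents containing the term). The exact formula used by Text Explorer is not stated in the talk, so treat the definition below as an assumption.

```python
import math
from collections import Counter


def tf_idf(token_docs):
    """Return one {term: weight} dict per document.

    Assumed variant: weight = count * ln(n_docs / n_docs_containing_term).
    """
    n_docs = len(token_docs)
    doc_freq = Counter(t for doc in token_docs for t in set(doc))
    rows = []
    for doc in token_docs:
        counts = Counter(doc)
        rows.append(
            {t: c * math.log(n_docs / doc_freq[t]) for t, c in counts.items()}
        )
    return rows


# Hypothetical tokenized documents (stop words already removed).
token_docs = [
    ["pizza", "packaging", "frustrating"],
    ["pizza", "smell", "great", "taste"],
]
weights = tf_idf(token_docs)
print(weights[0])
```

A term that appears in every document, like pizza here, gets weight 0 because it does not help tell documents apart, which is the sense in which the weighting behaves like a relative frequency.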
	So what's the next step? What is the next thing you might want to do?
	You might want to use the text to help with some outcome. So here I've made up some furniture reviews, and there's a star rating that's associated with every review, right.
	So we might think that those reviews give us some clues about why someone would rate a product higher or lower. So in this very first line, "the instructions were frustrating" and the user only rated us a 3. If that pattern happens a lot, and a lot of the lower ratings are associated with instructions, that might tell us that we need to improve the instructions that we ship with our furniture.
	So the simple idea is to use those in a regression model. We're going to take our document term matrix, combine it with our response information, and just do a regression.
	And that will help us understand why customers like or don't like our furniture, or whatever product we're making.
	It can help us classify objects, based on the descriptions, we'll see some of that later, or maybe we want to understand why some machines fail or not based on some of their maintenance records.
	And the way we do this is really easy.
	We're really just making a regression model, where each of our X, or predictor, variables is one of the columns in the document term matrix. So if we're modeling product ratings, that's a simple linear regression, and if we're modeling a binary outcome, like whether or not a customer would recommend our product, that's a logistic regression. But the document term matrix is possibly hundreds of words, and it would make sense to not use them all. Not all of the data are going to be useful, so we're going to apply a variable selection technique to our regression model and we'll get a simpler model that's easy to interpret and that fits well.
	And here at the bottom it's easy just to visualize combining those indicator variables with our response rating, our star rating.
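To make the idea concrete, here is a hedged sketch of variable selection on a document term matrix: a plain-Python lasso fit by coordinate descent. It is not the Generalized Regression implementation; the terms, the made-up star ratings, and the penalty value `lam` are all assumptions chosen for illustration.

```python
def soft_threshold(value, threshold):
    """Shrink value toward zero by threshold; return 0.0 inside the band."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def lasso(X, y, lam, n_sweeps=200):
    """Minimize (1/(2n))*||y - b0 - X b||^2 + lam*sum|b_j| by coordinate descent."""
    n, p = len(X), len(X[0])
    col_means = [sum(row[j] for row in X) / n for j in range(p)]
    y_mean = sum(y) / n
    Xc = [[row[j] - col_means[j] for j in range(p)] for row in X]
    resid = [yi - y_mean for yi in y]  # residuals with all coefficients at 0
    beta = [0.0] * p
    for _ in range(n_sweeps):
        for j in range(p):
            scale = sum(Xc[i][j] ** 2 for i in range(n)) / n
            if scale == 0.0:
                continue  # a constant column carries no information
            rho = sum(
                Xc[i][j] * (resid[i] + Xc[i][j] * beta[j]) for i in range(n)
            ) / n
            new_beta = soft_threshold(rho, lam) / scale
            delta = new_beta - beta[j]
            for i in range(n):
                resid[i] -= Xc[i][j] * delta
            beta[j] = new_beta
    intercept = y_mean - sum(m * b for m, b in zip(col_means, beta))
    return intercept, beta


# Hypothetical binary document term matrix (rows = reviews) and star ratings.
terms = ["instructions", "sturdy", "table"]
dtm = [
    [1, 1, 1], [1, 0, 1], [0, 1, 1], [0, 0, 1],
    [1, 1, 0], [1, 0, 0], [0, 1, 0], [0, 0, 0],
]
stars = [2, 1, 4, 3, 2, 1, 4, 3]

intercept, beta = lasso(dtm, stars, lam=0.1)
for term, coef in zip(terms, beta):
    print(f"{term:>12}: {coef:+.3f}")
```

In this made-up data, instructions ends up with a negative coefficient, sturdy with a positive one, and table is dropped from the model (its coefficient is exactly zero), which is the kind of simpler, interpretable model the talk describes.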
	So our solution, which is going to be in JMP 16, is to bring regression models into the Text Explorer platform. If you've ever used the Generalized Regression platform to do variable selection, we're essentially embedding that platform inside of Text Explorer.
	So if you have JMP Pro, you'll see a Term Selection item in the red triangle menu for Text Explorer.
	This is what we're really excited about. It makes it easy and quick to build and refine these kinds of regression models.
	So when you launch this platform, and we'll go through some demos in just a minute, but just quickly, at the very beginning of the launch, it's just asking for information about our response and the kind of model that we want to fit.
	It can handle both continuous and nominal responses. When you specify a continuous response, like a star rating, it provides a filter so that you can filter out some of the rows based on the response.
	And when we have a nominal response column, we select the target level. So in this case, would you recommend our product, yes or no, we're going to be modeling the recommendation equal to no.
	And if we have a multiple level response like blue, green, yellow, we'll be picking the level that we want to model and we'll see an example of this in just a minute.
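One way to picture the target level is as a recoding of the nominal response into a 0/1 indicator, with 1 meaning "equals the target level." The sketch below assumes that is how the response is prepared; the talk does not show the internal coding, and the helper name `indicator_response` is hypothetical.

```python
def indicator_response(values, target_level):
    """Code a nominal response as 1 for the target level and 0 otherwise."""
    return [1 if v == target_level else 0 for v in values]


# Two-level response: model "recommend equals no".
recommend = ["yes", "no", "yes", "no", "no"]
print(indicator_response(recommend, "no"))

# Multi-level response: pick one level to model against all the others.
colors = ["blue", "green", "yellow", "green"]
print(indicator_response(colors, "green"))
```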
	Then, after you specify the response, we're going to give the platform some information about the kind of model we want to fit.
	So when we do variable selection, are we going to use the elastic net or the lasso? Those are both variable selection techniques built into Generalized Regression. And how are we going to do validation? Do we want to use the AIC or BIC?
	Additionally, we can specify details about our document term matrix. We can select the weighting and also the maximum number of terms that we want to consider. So if we want to consider the 200 most frequently occurring words, that's what it's set to now, 200.
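Limiting the matrix to the most frequent terms can be sketched as below. Ranking by total occurrences is an assumption; the platform may rank terms differently (for example, by the number of documents that contain them).

```python
from collections import Counter


def top_terms(token_docs, max_terms):
    """Return up to max_terms terms, most frequent first (by total count)."""
    counts = Counter(t for doc in token_docs for t in doc)
    return [term for term, _ in counts.most_common(max_terms)]


# Hypothetical tokenized documents.
token_docs = [
    ["pizza", "great", "pizza"],
    ["pizza", "smell"],
    ["great", "taste"],
]
print(top_terms(token_docs, max_terms=2))
```

Only the columns for the returned terms would then be offered to the regression as candidate predictors.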
	So what happens when you launch it and you hit that Run button? Basically it does everything you need to do. It sets up the data properly behind the scenes so that you don't have to worry about it. It does variable selection, so now we have a small subset of words that we think are useful for predicting our response, and it presents the results in an easy-to-interpret report that's quite interactive, as we'll see in a moment. You used to be able to do this by saving the document term matrix to your data set and launching the Generalized Regression platform yourself, but you don't have to do that anymore. It's all in Term Selection now.
	So that model specification, how do we know what to select?
	The estimation methods available are the elastic net and the lasso. The easy way to remember the difference between those
	is that the elastic net tends to select groups of correlated predictors. So in this setting, when our predictors are all words,
	that means that it will tend to select groups of words that tend to occur together. So if you think of instruction and manual,
	those words occur together frequently because of instruction manuals. So those two predictors would be highly correlated and the elastic net would probably select both of them, whereas the lasso would probably select instruction or manual but not necessarily both.
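The grouping behavior can be illustrated with an extreme toy case in which "instruction" and "manual" always appear together, so their columns are identical. The sketch below is a plain-Python coordinate descent for the elastic net penalty lam*(alpha*|b| + (1-alpha)/2*b^2), with alpha=1 giving the lasso. The data, `lam`, and `alpha` values are assumptions. With exactly identical columns the lasso solution is not unique, and this particular algorithm happens to give everything to the first column; real terms are correlated rather than identical.

```python
def soft_threshold(value, threshold):
    """Shrink value toward zero by threshold; return 0.0 inside the band."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def elastic_net(X, y, lam, alpha, n_sweeps=200):
    """Coordinate descent for centered data; alpha=1.0 is the lasso."""
    n, p = len(X), len(X[0])
    col_means = [sum(row[j] for row in X) / n for j in range(p)]
    y_mean = sum(y) / n
    Xc = [[row[j] - col_means[j] for j in range(p)] for row in X]
    resid = [yi - y_mean for yi in y]
    beta = [0.0] * p
    for _ in range(n_sweeps):
        for j in range(p):
            scale = sum(Xc[i][j] ** 2 for i in range(n)) / n
            rho = sum(
                Xc[i][j] * (resid[i] + Xc[i][j] * beta[j]) for i in range(n)
            ) / n
            # L1 part soft-thresholds; L2 part adds to the denominator.
            new_beta = soft_threshold(rho, lam * alpha) / (
                scale + lam * (1.0 - alpha)
            )
            delta = new_beta - beta[j]
            for i in range(n):
                resid[i] -= Xc[i][j] * delta
            beta[j] = new_beta
    return beta


# Hypothetical data: columns are "instruction" and "manual" (always together).
dtm = [[1, 1], [1, 1], [0, 0], [0, 0]]
stars = [1, 1, 3, 3]

lasso_beta = elastic_net(dtm, stars, lam=0.125, alpha=1.0)
enet_beta = elastic_net(dtm, stars, lam=0.125, alpha=0.5)
print("lasso      :", lasso_beta)
print("elastic net:", enet_beta)
```

Here the lasso keeps one of the two words and zeroes out the other, while the elastic net keeps both with equal coefficients.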
	Validation methods, we have the AIC and the BIC.
	So sort of the rule of thumb is that the AIC tends to overselect and the BIC tends to underselect. So in our setting, that just means the AIC will tend to select a bigger set of words, while the BIC will select a smaller set of words.
	And personally I tend to use the AIC a lot in this setting because I would rather have more words than necessary than fewer, but really that's a matter of preference.
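The rule of thumb follows from the penalty each criterion puts on model size. The sketch below uses the standard textbook definitions, AIC = 2k - 2 ln L and BIC = k ln(n) - 2 ln L; the platform's exact criteria may include corrections not shown here, and the log-likelihoods and model sizes are invented numbers.

```python
import math


def aic(log_likelihood, n_params):
    """Akaike information criterion (smaller is better)."""
    return 2 * n_params - 2 * log_likelihood


def bic(log_likelihood, n_params, n_obs):
    """Bayesian information criterion (smaller is better)."""
    return n_params * math.log(n_obs) - 2 * log_likelihood


# Hypothetical candidates fit to 200 documents.
n_obs = 200
small = {"log_likelihood": -310.0, "n_params": 5}   # 5 selected words
large = {"log_likelihood": -298.0, "n_params": 15}  # 15 selected words

for name, m in (("small", small), ("large", large)):
    print(
        name,
        "AIC =", round(aic(m["log_likelihood"], m["n_params"]), 2),
        "BIC =", round(bic(m["log_likelihood"], m["n_params"], n_obs), 2),
    )
```

With these numbers the AIC prefers the larger model and the BIC prefers the smaller one, because ln(200) is well above 2, so each extra word costs more under the BIC.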
	And the document term matrix weighting really depends on your problem. So in this example, the word paint occurs in the review multiple times. That could mean that the paint was very important to that reviewer, and that may be meaningful in our regression model, so we would want to use a weighting like frequency instead of binary.
	So then, once you launch Term Selection, you'll end up in a place like this, where we have a summary of all the documents on the left and all of the words that we've selected on the right. And I'm just going to skip over this for now; it's easier to see when we start doing a demo.
	Another thing that we think is very useful is that we have a summary, so you can use Term Selection to fit a sequence of models, and then they're all summarized at the top, and you can switch back and forth between them. And again we'll see that in just a moment.
	So let's just take a look at the platform.
	So first we'll take a look at this aircraft incident data set that's in the JMP sample data folder.
	So every row in our table is an incident. And we know how much damage there was to the aircraft, what kind of injuries there were, and we have a description. So this last column is a description of exactly what happened in the incident.
	So we'll launch Text Explorer, and if you've never used Text Explorer before, on the left we have the most frequently occurring terms, and on the right we have the most frequently occurring phrases, so groups of words.
	So we want to use these words to maybe understand what causes a crash to result in more damage or maybe more serious injuries. So we're going to go to the red triangle menu and ask for Term Selection.
	So now we see this launch that we were looking at just a moment ago, and I want to learn more about damage, right. In particular, I might want to discriminate between incidents where the aircraft was destroyed versus those with less damage, so we'll select the target level to be destroyed.
	I'll mention that this early stopping is kind of a time-saving feature. When we do variable selection, it can be quite time consuming for much bigger problems. And if you leave this checkbox checked, it'll sort of say, we think we have a model that's good enough; we're going to go ahead and stop early. I tend to uncheck that unless I know I have a much bigger problem.
	So I'll leave it with elastic net, AIC and we'll hit run.
	So now what's happening in the background is it created the data that I needed to fit this regression model, and it's fitting the regression model behind the scenes, and now I have my summary. So every document is summarized on this left panel, so this is the first aircraft incident, this is the second, which is blank.
	So this first one is, the airplane contacted a snowbank on the side of the runway.
	And we have all the important words highlighted, so the blue words have negative coefficients and the orange words have positive coefficients.
	And if we're interested in the words, we can look at this panel on the right. I tend to like to sort by the coefficient so you can just click at the top.
	So, in this instance, incidents whose descriptions contain words with negative coefficients are less likely to have ended with the aircraft destroyed. So if the document contains the word landing, if the plane has made it to the point where it's landing, it's probably not a very serious incident. So these are less likely to be destroyed. On the flip side, we can look at the most positive terms, and these are the words that are most associated with the aircraft being destroyed.
	And those are words like fire, witnesses. If they're interviewing witnesses to describe the incident that probably means whoever was in the plane wasn't able to do an interview.
	So fire, radar, spinning, night, these are all words that are highly associated with being destroyed.
	Now, maybe we're also interested in the injury level, and specifically, we're going to look at the worst injury levels.
	And we'll do term selection again.
	And I'll just accept that to keep it quick.
	So now we can do the same sorts of things, but now looking at injury level instead.
	So if the interview includes the captain or landing, these are probably not very serious in terms of the injury level.
	But again, if it mentions radar, night, mountains, these are much more serious words. So we can kind of quickly go back and forth between these two responses and see which words are predictive of injury level and damage level.
	So let's take a look at another example.
	So this is one that I really enjoy. Earlier, I was talking about pizza in my examples. Now we're going to talk about beer. So these are obviously things that are near and dear to my heart.
	So what I've got here is a description of every beer style that I downloaded, according to some beer review body.
	So every single beer has a description.
	And it says where the beer came from. So we have the United States, Germany, Ireland.
	Use term selection to see