Expanding Our Text Mining Toolkit with Sentiment Analysis and Term Selection in JMP® Pro 16 (2021-EU-45MP-790)
Ross Metusalem, JMP Systems Engineer, SAS
Text mining techniques enable extraction of quantitative information from unstructured text data. JMP Pro 16 has expanded the information it can extract from text thanks to two additions to Text Explorer’s capabilities: Sentiment Analysis and Term Selection. Sentiment Analysis quantifies the degree of positive or negative emotion expressed in a text by mapping that text’s words to a customizable dictionary of emotion-carrying terms and associated scores (e.g., “wonderful”: +90; “terrible”: -90). Resulting sentiment scores can be used for assessing people’s subjective feelings at scale, exploring subjective feelings in relation to objective concepts and enriching further analyses. Term Selection automatically finds terms most strongly predictive of a response variable by applying regularized regression capabilities from the Generalized Regression platform in JMP Pro, which is called from inside Text Explorer. Term Selection is useful for easily identifying relationships between an important outcome measure and the occurrence of specific terms in associated unstructured texts.
This presentation will provide an overview of both Sentiment Analysis and Term Selection techniques, demonstrate their application to real-world data and share some best practices for using each effectively.
Transcript
Ross Metusalem: Hello everyone, and thanks for taking some time to learn how JMP is expanding its text mining toolkit in JMP Pro 16 with sentiment analysis and term selection.
I'm Ross Metusalem, a JMP systems engineer in the southeastern US, and I'm going to give you a sneak preview of these two new analyses, explain a little bit about how they work, and provide some best practices for using them. | |
So both of these analysis techniques are coming in Text Explorer, which, for those who aren't familiar, is JMP's tool for analyzing what we call free or unstructured text, that is, natural language text.
And it's what we call a text mining tool, that is, a tool for deriving quantitative information from free text so that we can use other types of statistical or analytical tools to derive insights from the free text, or maybe even use that text as inputs to other analyses.
So let's take a look at these two new text mining techniques that are coming to Text Explorer in JMP Pro 16, and we'll start with sentiment analysis. | |
Sentiment analysis at its core answers the question how emotionally positive or negative is a text. | |
And we're going to perform a sentiment analysis on the Beige Book, which is a recurring report from the US Federal Reserve Bank. | |
Now apologies for using a United States example at JMP Discovery Europe, but the Beige Book does provide a nice demonstration of what sentiment analysis is all about.
So this is a monthly publication; it contains a national-level report, as well as 12 district-level reports, that summarize economic conditions in those districts, and all of these are based on qualitative information, things like interviews and surveys.
And US policymakers can use the qualitative information in the Beige Book, along with quantitative information, you know, in traditional economic indicators to drive economic policy. | |
So you might think, well, we're talking about sentiment or emotion right now; I don't know that I expect economic reports to contain much emotion. But the Beige Book reports, like much language, do actually contain words that can carry or convey emotion.
So let's take a look at an excerpt from one report. Here's a screenshot straight out of the new sentiment analysis platform. | |
You'll notice some words highlighted, and these are what we'll call sentiment terms, that is, | |
terms that we would argue have some emotional content to them. For example at the top, "computer sales, on the other hand, have been severely depressed," | |
where "severely depressed" is highlighted in purple, indicating that we consider that to convey negative emotion, which it seems to if somebody describes computer sales as "severely depressed" it sounds like they mean for us to interpret that as as certainly a bad thing. | |
If we look down, we see in orange a couple positive sentiment terms highlighted, like "improve" or "favorable." So we can look for words that we believe have positive or negative emotional content and highlight them,
purple for negative, orange for positive. And some sentiment analysis keeps things at that level, so just a binary distinction, a positive text or a negative text.
There are additional ways of performing sentiment analysis and, in particular, | |
many ways try to quantify the degree of positivity or negativity, not just whether something is positive or negative. So consider this other example and I'll point our attention right to the bottom here, where we can see a report of "poor sales." | |
And I'm going to compare that with where we said "computer sales are severely depressed." | |
So both of these are negative statements, but I think we would all agree that "severely depressed" sounds a lot more negative than just "poor" is. | |
So we want to figure out not only is a sentiment expressed positive or negative, but how positive or negative is it, and that's what sentiment analysis in | |
Text Explorer does. So how does it do it? Well, it uses a technique called lexical sentiment analysis that's based on some sentiment terms and associated scores. | |
So what we're seeing right now is an excerpt from what we'd call a sentiment dictionary that contains the terms and their scores. | |
For example, the term "fantastic" has a score of positive 90 and the term "grim" at the bottom has a score of -75.
So what we do is specify which terms we believe carry emotional content and the degree to which they're positive or negative on an arbitrary scale, here -100 to 100. | |
And then we can find these terms in all of our documents and use them to score how positive or negative those documents are overall. | |
If you think back to our example "severely depressed," that word "severely" takes the word "depressed" and, as we call it, intensifies it.
It is an intensifier or a multiplier of the expressed sentiment, so we also have a dictionary of intensifiers and what they do to the sentiment expressed by a sentiment term.
For example, we say "incredibly" multiplies the sentiment by a factor of 1.4, whereas "little" multiplies the sentiment by a factor of 0.3, so it actually, kind of, attenuates the sentiment expressed a little.
Now, finally there's one other type of word we want to take into account and that is negators, things like "no" and "not," and we treat these basically as polarity reversals. So | |
"not fantastic" would be taking the score for "fantastic" and multiplying it by -1. | |
And so, this is a common way of doing sentiment analysis, again called lexical sentiment analysis. So what we do is we take the sentiment terms that we find, we multiply them by any associated intensifiers
or negators, and then for each document, when we have all the sentiment scores for the individual terms that appear, we just average across all of them to get a sentiment score for that document.
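As a rough illustration of that arithmetic, here is a minimal Python sketch of lexical sentiment scoring. It is only a sketch of the general idea, not JMP's implementation, and the dictionary entries, multipliers, and scale are invented for the example.

```python
# Minimal sketch of lexical sentiment scoring (illustrative; not JMP's implementation).
# Dictionary entries and multipliers below are invented examples on a -100..100 scale.

SENTIMENT = {"fantastic": 90, "favorable": 60, "grim": -75, "depressed": -70}
INTENSIFIERS = {"incredibly": 1.4, "severely": 1.3, "little": 0.3}
NEGATORS = {"no", "not"}

def score_document(text):
    """Average the scores of the sentiment terms found in the text, applying any
    intensifier or negator that immediately precedes a sentiment term."""
    words = text.lower().split()
    scores = []
    for i, word in enumerate(words):
        if word in SENTIMENT:
            score = SENTIMENT[word]
            prev = words[i - 1] if i > 0 else ""
            if prev in INTENSIFIERS:
                score *= INTENSIFIERS[prev]   # e.g., "severely depressed" -> -70 * 1.3
            elif prev in NEGATORS:
                score *= -1                   # e.g., "not fantastic" -> -90
            scores.append(score)
    return sum(scores) / len(scores) if scores else None   # None = no sentiment terms

print(score_document("computer sales were severely depressed"))        # -91.0
print(score_document("conditions were not fantastic but favorable"))   # (-90 + 60) / 2 = -15.0
```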
And JMP returns these scores to us in a number of useful ways. So this is a screenshot out of the sentiment analysis tool and we're going to be, you know, using this in just a moment. | |
But you can see, among other things, it gives us a distribution of sentiment scores across all of our documents. It gives us a list of all the sentiment terms and how frequently they occur. | |
And we even have the ability, as we'll see, to export the sentiment scores to use them in graphs or analyses. | |
And so I've actually made a couple graphs here to try to see, as a first pass, does the sentiment in the Beige Book reports actually align with economic events in ways that we think it should? You know, do we really have some validity to this
sentiment as some kind of economic indicator? And the answer looks like, yeah, probably.
Here I have a plot that I've made in Graph Builder; it's sentiment over time, so all the little gray dots are the individual reports | |
and the blue smoother kind of shows the trend of sentiment over time with that black line at zero showing neutral sentiment, at least according to our scoring scheme. | |
The red areas are | |
times of economic recession as officially listed by the Federal Reserve. | |
So you might notice sentiment seems to bottom out or there are troughs around recessions, but another thing you might notice | |
is that actually preceding each recession, we see a drop in sentiment either months or, in some cases, looks like even a couple years, in advance. And we don't see these big drops in sentiment | |
in situations where there wasn't a recession to follow. So maybe there's some validity to Beige Book sentiment as a leading indicator of a recession. | |
If we look at it geographically, we see things that make sense too. This is just one example from the analysis. We're looking at sentiment in the 12 | |
Federal Reserve districts across time from 1999 to 2000 to 2001. This was the time of what's commonly called the Dotcom Bust, so this is when | |
there was a big bubble of tech companies and online businesses that were popping up and, eventually, many of them went under and there were some pretty severe economic consequences. | |
'99 to 2000 sentiment is growing, in fact sentiment is growing pretty strongly, it would appear, | |
in the San Francisco Federal Reserve district, which is where many of these companies are headquartered. And then in 2001 after the bust, the biggest drop we see all the way to negative sentiment in red here, again occurring in that San Francisco district. | |
So, just a quick graphical check on these Beige Book sentiment scores suggests that there's some real validity to them in terms of their ability to track with, maybe predict, some economic events, though of course, the latter, we need to look into more carefully. | |
But this is just one example of the potential use cases of sentiment analysis and there are a lot more. | |
One of the biggest application areas where it's being used right now is in consumer research, where people might, let's say, analyze | |
some consumer survey responses to identify what drives positive or negative opinions or reactions. | |
But sentiment analysis can also be used in, say, product improvement where analyzing product reviews or warranty claims might help us find product features or issues that elicit strong emotional responses in our customers. | |
Looking at, say, customer support, we could analyze call center or chats...chat transcripts to | |
find some support practices that result in happy or unhappy customers. Maybe even public policy, we analyze open commentary to gauge the public's reaction to proposed or existing policies. | |
These are just a few domains where sentiment analysis can be applied. It's really applicable anywhere you have texts that convey some emotion and that emotion might be informative.
So that's all I want to say up front. Now I'd like to spend a little bit of time walking you through how it works in JMP, so let's move on over to JMP. | |
Here we have the Beige Book data, so down at the bottom, you can see we have a little over 5,000 reports here, and we have the date of each report, from
May 1972 to October 2020, which of the 12 districts it's from, and then the text of the report. And you can see that these reports,
they're not just quick statements of, you know, the economy is doing well or poorly; they can get into some level of detail.
Now, before we dive into these data, I do just want to say thank you to somebody for the idea to analyze the Beige Book and for actually pulling down the data and getting it into JMP, in a | |
format ready to analyze. And that thanks goes to Xan Gregg who, if you don't know, is a director on the JMP development team and the creator of Graph Builder, so thanks, Xan, for your help.
Alright, so let's quantify the degree of positive and negative emotion in all these reports. We'll use Text Explorer under the Analyze menu.
Here we have our launch window. I'll take our text data, put it in the text columns role. A couple things to highlight before we get going. | |
Text Explorer supports multiple languages, but right now, sentiment analysis is available only in English, and one other thing I want to draw attention to is stemming right here. | |
So for those who do text analysis you're probably familiar with what stemming is, but for those who aren't familiar, stemming is a process whereby we kind of collapse multiple... | |
well, to keep it nontechnical...multiple versions of the same word together. Take "strong," "stronger," and "strongest." So these are three | |
versions of the same word "strong" and in some text mining techniques, you'd want to actually combine all those into one term and just say, oh, they all mean "strong" because that's kind of conceptually the same thing. | |
I'm going to leave stemming off here, actually, and it's because...take "strongest," that describes something as strong as it can get,
versus "stronger," which says that, you know, it's strong, but there is still room for it to be even stronger.
So "strongest" should probably get a higher sentiment score than "stronger" should, and if I were to stem, I would lose the distinction between those words. Because I don't want to lose that distinction and I want to give them different sentiment scores, I'm going to keep stemming off here.
So I'll click OK. | |
And JMP now is going to tokenize our text, that is break it into all of its individual words and then count up how often each one occurs. | |
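Here is a small Python sketch of that tokenize-and-count step, which also counts frequent two-word sequences, roughly how a candidate phrase list like the one coming up next can be built. It is illustrative only, not Text Explorer's actual tokenizer, and the example documents are made up.

```python
# Sketch of tokenizing documents, counting term frequencies, and counting
# two-word sequences as candidate phrases. Illustrative only.
import re
from collections import Counter

docs = [
    "Real estate activity improved while computer sales remained depressed.",
    "Contacts report favorable real estate conditions across the district.",
]

term_counts = Counter()
phrase_counts = Counter()
for doc in docs:
    tokens = re.findall(r"[a-z']+", doc.lower())    # split the text into word tokens
    term_counts.update(tokens)                       # how often each term occurs
    phrase_counts.update(zip(tokens, tokens[1:]))    # adjacent word pairs = candidate phrases

print(term_counts.most_common(3))
print(phrase_counts.most_common(2))   # ('real', 'estate') appears in both documents
```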
And here we have a list of all the terms and how frequent they are. So "sales" occurs over 46,000 times and we also have our phrase list over here. So the phrases are | |
sequences of two or more words that occur a lot, and sometimes we actually want to count those as terms in our | |
analysis. And for sentiment analysis, you would want to go through your phrase list and, let's say, maybe add "real estate," which is two words, but it really refers to, you know, property. | |
And I could add that. Now normally in text analysis, we'd also add what are called stop words, | |
that's words that don't really carry meaning in the context of our analysis and that we'd want to exclude. Take "district." The Beige Book uses
the word "district" frequently, just saying, you know, "this report from the Richmond district," something like that; it's not really meaningful.
But I'm actually not going to bother with stop words right here and that's because, if you remember, | |
back from our slides, we said that all of our sentiment scores are based on a dictionary, where we choose sentiment words and what score they should get. | |
And if we just leave "district" out, it inherently gets a score of 0 and doesn't affect our sentiment score, so I don't really need to declare it as a stop word. | |
So once we're ready, we would invoke text or, excuse me, sentiment analysis under the red triangle here. | |
So what JMP is doing right now, because we haven't declared any sentiment terms or otherwise, it's using a built-in sentiment dictionary to score all the documents. Here we get our scores out. | |
Now before navigating these results, we probably should customize our sentiment dictionary, so the sentiment bearing words and their scores. And that's because in different domains, | |
maybe with different people generating the text, certain words are going to bear different levels of sentiment or bear sentiment in one case and not another. So we want to | |
really pretty carefully and thoroughly create a sentiment dictionary that we feel accurately captures sentiment as it's conveyed by the language we're analyzing right now.
So JMP, like I said, uses a built-in dictionary and it's pretty sparse. So you can see it right here, it has some emoticons, | |
like these little smileys and whatnot, but then we have some pretty clear sentiment bearing language, like "abysmal" at -90. | |
Now it's probably not the case that somebody's going to use the word "abysmal" and not mean it in a negative sense, so we feel pretty good about that. But, you know, it's not a terribly long list and we may want to add some terms to it.
So let's see how we do that, and one thing I can recommend is just looking at your data. You know, read some of the documents that you have and try to find some words that you think might be indicative of sentiment. | |
We actually have down here a tool that lets us peruse our documents and directly add sentiment terms from them. So here, I have a document list. You can see Document 1 is highlighted and then Document 1 is displayed below. I could select different documents to view them. | |
Now, if we look at Document 1, right off the bat, you might notice a couple potential sentiment terms like "pessimism" and "optimism." | |
Now you can see these aren't highlighted. These actually aren't included in the standard sentiment dictionary. | |
And a lot of nouns, you'll find, actually aren't, and that's because nouns like "pessimism" or "optimism" can be described in ways that suggest their presence or their absence, basically. So I could say, you know, "pessimism is declining" or
"there's a distinct lack of pessimism," "pessimism is no longer reported." | |
And, in those cases, we wouldn't say "pessimism" is a negative thing. It's...so you want to be careful and think about words in context and how they're actually used before adding them to a sentiment dictionary. | |
For example, I could go back up to our term list here. I'm just going to show the filter, | |
look for "pessimism" and then show the text to have a look at how it's used. So we can see in this first example, "the mood of our directors varies from pessimism to optimism." | |
And the next one | |
"private conversations a deep mood of pessimism." If you actually read through, this is the typical use, so actually in the Beige Book, they don't seem to use the word pessimism in the ways that I might fear, | |
"optimism is increasing." | |
So I actually feel okay about adding "pessimism" here, so let's add it to our sentiment dictionary. | |
So if I just hover right over it, | |
you can see we bring up some options of what to do with it. So here I can, let's say, give it a score of -60. | |
And so now that will be added to our dictionary with that corresponding score, and it's triggering updated sentiment scoring in JMP. So that is, it's now looking for the word "pessimism" and adjusting all the document scores where it finds it. | |
So let's go back up now to take a look at our sentiment terms in more detail. If I scroll on down, you will find "pessimism" | |
right here with the score of -60 that I just gave it. Now I might want to adjust that: if you notice, "pessimistic" by default has a score of -50, so maybe I just type -50 in here to make that consistent.
And I could but I'm not going to, just so that we don't trigger another update. | |
You'll also notice, right here, this list of possible sentiment terms. So these are terms that JMP has identified as maybe bearing sentiment, and you might want to take a look at them and consider them for inclusion in your dictionary. | |
For example, the word "strong" here, if you look at some of the document texts to the right, you might say, okay, this is clearly a positive thing. And if you've looked at a lot of these examples, it really stands out that | |
the word "strong" and correspondingly "weak" are words that these economists use a whole lot to talk about things that are good or bad about the current economy. | |
So I could add them, or add "strong" here by clicking on, let's say, positive 60 in the buttons up there. Again, I won't right now, just for the sake of expediting our look at sentiment analysis. | |
So we could go through, just like our texts down below, we could go through our sentiment term list here to choose some good candidates. | |
Under the red triangle, we also can just manage the sentiment terms more directly, that is, in the full
term management list that we might be used to as Text Explorer users, like the phrase management and the stop word management.
You can see we've added one new term local to this particular analysis, in addition to all of our built-in terms. Of course, we could declare exceptions too, if we want to just maybe not actually include some of those. | |
And importantly, you can import and export your sentiment dictionary as well. Another way to declare sentiment terms is to consult with subject matter experts. You know, | |
economists would probably have a whole lot to say about the types of words they would expect to see that would clearly convey | |
positive or negative emotion in these reports. And if we could talk to them, we would want to know what they have to say about that, and we might even be able to derive a dictionary in, say, a tab separated file with them and then just import it here. | |
And then, of course, once we make a dictionary we feel good about, we should export it and save it so that it's easy to apply again in the future. | |
So that's a little bit about how you would actually curate your sentiment dictionary. It would also be important to | |
curate your intensifier terms and your negation terms, and for the negators you don't see scores here, because these are just polarity reversals.
Just to show you a little bit more about what that actually looks like, if we...let's take a look at sentiment here, so we can see instances in which | |
JMP has found the word "not" before some of these sentiment terms and actually applied the appropriate score. So at the top there, "not acceptable" gets a score of -25. | |
So I show you that just to, kind of, draw your attention to the fact that these negators and intensifiers, they are kind of being applied automatically by JMP. | |
But anyways let's let's move on from talking about how to set the analysis up to actually looking at the results. So I'm going to bring up | |
a version of the analysis that's already done, that is, I've already curated the sentiment dictionary, and we can actually start to interpret the results that we get out. | |
So we have our high level summary up here, so we have more positive than negative documents. As we discussed before we can see, you know, how many of each. In fact, at the bottom of that table on the left, you see that we have one document that has no sentiment expressed in it whatsoever. | |
"strong" occurring 14,000 times, "weak" occurring 4,500 times approximately | |
and looking at these either by their counts or their scores, looking at the most negative and positive, | |
even looking at them in context can be pretty informative in and of itself. I mean, especially in, say, a domain like consumer research, if you want to know when people are feeling positively or expressing positivity | |
about our brand or some products that we have, what type of language are they using, maybe that would inform marketing efforts, let's say. This list of sentiment terms can be highly informative in that regard. | |
Now, these reports cover many different areas of the economy, like manufacturing,
tourism, investments. And sometimes we want to zero in on one of those subdomains in particular, what we might call a feature.
And if I go to this features tab in sentiment analysis, I'll click search. JMP finds some words that commonly occur with sentiment terms and asks if you want to maybe dive into the sentiment with respect to that feature. | |
So take, for example, "sales." We can see "sales were reported weak," "weakening in sales," "sales are reported to be holding up well" and so forth. | |
So if I just score this selected feature, now what JMP will do is provide sentiment scores only with respect to mention of "sales" inside these larger reports, and this is going to help us refine our analysis or focus it on a really well-defined subdomain. | |
And that's particularly important. | |
It could be the case that the domain in the language that we're analyzing isn't, you know, so well-restricted. Take, for example, | |
product reviews. You're interested in how positive or negative people feel about the product, but they might also talk about, say, the shipping and you don't necessarily care if they weren't too happy with the shipping, mainly because it's beyond your control. | |
You wouldn't want to just include a bunch of reviews that also comment on that shipping. And so it's important to consider the domain of analysis and restrict it appropriately and the feature finder here is one way of doing that. | |
So you can see now that I've scored on "sales," we have a very different distribution of positive and negative documents. We have more documents that don't have a sentiment score because they | |
simply don't talk about sales or don't use emotional language to discuss it, and we have a different list of sentiment terms now capturing sales in particular. | |
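As a sketch of what that feature scoring is doing conceptually, the snippet below keeps only the sentences that mention the feature (here "sales") and averages sentiment over just those. Again, this is an illustration under made-up dictionary entries, not JMP's actual method.

```python
# Sketch of feature-scoped sentiment: score only the sentences mentioning a feature.
import re

SENTIMENT = {"weak": -60, "strong": 60, "well": 40}   # invented example scores

def feature_sentiment(text, feature):
    sentences = re.split(r"[.!?]", text.lower())
    scores = [SENTIMENT[w]
              for s in sentences if feature in s           # keep sentences about the feature
              for w in re.findall(r"[a-z']+", s) if w in SENTIMENT]
    return sum(scores) / len(scores) if scores else None   # None = feature not scored

report = "Sales were reported weak. Tourism remained strong. Retail sales held up well."
print(feature_sentiment(report, "sales"))   # (-60 + 40) / 2 = -10.0
```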
Let me remove this. | |
One thing I realized I forgot to mention, or mentioned only briefly, is how these overall document scores that we've been looking at are calculated, and I said that they're the average of the scores of all the sentiment terms
that occurred in a particular document. So let's look at Document 1. I'd just like to show you that
if you're ever wondering where does this score come from, let's say, -20, you can just run your mouse right over and it'll show you a list of all the sentiment terms that appeared. | |
And you can see, here we have 16 of them, including, at the bottom, "real bright spot," which was a +78. And then, if you add all those scores up and divide by 78...
or divide by 16, excuse me, then you get an average sentiment of -20. And this is one of two ways to provide overall scores. Another one is a min-max scoring, so differences between the minimum and maximum
sentiments expressed in the text. | |
Now we can get a lot of information from looking at just this report, you know, obviously sentiment scores, the most common terms. | |
But oftentimes we want to take the sentiments and use them in some other way, like | |
look at them graphically, like we did back in the slides. So when it comes time for that part of your analysis, just head on up to the red triangle here | |
and save the document scores. And these will save those scores back to your data table so that you can enter them into further analyses or graph them, whatever it is you want to do. | |
So that's a sneak preview of sentiment analysis coming to Text Explorer in JMP Pro 16. The take-home idea is that sentiment analysis uses a sentiment dictionary that | |
you set up to provide scores corresponding to the positive and negative emotional content of each document, and then from there, you can use that information in any way that's informative to you. | |
So we'll leave sentiment analysis behind now and I'm going to move on back to our slides to talk about the other technique coming to Text Explorer soon. | |
And that is term selection, where term selection answers a different question, and that is, which terms are most strongly associated with some important variable or variable that I care about? | |
We're going to stick with the Beige Book. | |
We're going to ask which words are statistically associated with recessions. So in the graph here, we have, over time, the percent change in
GDP (gross domestic product) quarter by quarter, where blue shows economic growth and red shows economic contraction. And we might wonder, well, what
terms, as they appear in the Beige Book, might be statistically associated with these periods of economic downturn? For example, a few of them right here. | |
You know, why would we want to associate specific terms in the Beige Book with periods of economic downturn? | |
Well, it could potentially be informative in and of itself to obtain a list of terms. You know, I might find some potentially, you know, subtle | |
drivers of or effects of recessions that I might not be aware of or easily capture in quantitative data. | |
I might also find some words for further analysis. I might...I might find some | |
potential sentiment terms, some terms that are being used when the economy is doing particularly poorly that I missed my first time around when I was doing my sentiment analysis. | |
Or maybe I could find some words that are so strongly associated with recessions that I think I might be able to use them in a predictive model to try to figure out when recessions might happen in the future. | |
So there are a few different reasons why we might want to know which words are most strongly associated with recessions. | |
So, how does this work in JMP? Well, we basically build a regression model where the outcome variable is that variable we care about, recessions, and the inputs are all the different words.
The data as entered into the model takes the form of a document term matrix, where each row corresponds to | |
one document or one Beige Book report, and then the columns capture the words that occur in that report. Here we have the column "weak" highlighted and it says "binary," which means that | |
it's going to contain 1s and 0s; a 1 indicates that that report contained the word "weak" and a 0 indicates that it didn't. And this is one of several ways we could score the documents, but we'll stick with this binary scoring for now.
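For illustration, here is one way such a binary document-term matrix could be built in Python with scikit-learn; the example reports are invented, and this only mimics the structure described above rather than reproducing JMP's document-term matrix exactly.

```python
# Sketch of a binary document-term matrix: one row per report, one column per term,
# 1 if the term occurs in that report and 0 if not. Example data is invented.
from sklearn.feature_extraction.text import CountVectorizer

reports = [
    "Computer sales remained weak across the district.",
    "Contacts report strong gains in manufacturing and tourism.",
]

vectorizer = CountVectorizer(binary=True)   # binary=True records occurrence, not counts
dtm = vectorizer.fit_transform(reports)

print(vectorizer.get_feature_names_out())   # the term columns
print(dtm.toarray())                        # the "weak" column is 1 for report 1, 0 for report 2
```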
So we take this information and we enter it into a regression model. So here's what the launch would look like.
We have our recession as our Y variable and that's just kind of a yes or no variable, and then we have all of these binary term predictors entered in as our model effects. | |
And then we're going to use JMP Pro's generalized regression tool | |
in order to build the model, and that's because generalized regression or GenReg, as we call it, includes automatic variable selection techniques. So if you're familiar with | |
regularized regression, we're talking like the Lasso or the elastic net. And if you don't know what that means, that's totally fine. The idea is that it will automatically | |
find which terms are most strongly associated with the outcome variable "recession," and then ones that it doesn't think are associated with it, it will zero those out. | |
And this allows us to look at relationships between "recession" and perhaps, you know, hundreds or thousands of possible words that would be associated with it.
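To make the idea concrete, here is a small Python sketch using an L1-penalized (lasso) logistic regression, which plays the role the transcript describes for JMP Pro's Generalized Regression platform: terms whose coefficients are not driven to zero are the "selected" ones. The data, terms, and settings are invented for illustration, and this only approximates what the platform does.

```python
# Sketch of term selection via lasso logistic regression: regress recession (yes/no)
# on binary term indicators; nonzero coefficients are the selected terms.
# Toy data; not JMP's Generalized Regression implementation.
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

reports = [
    "postponements and cancellations as activity deteriorated",
    "strong gains and improving sales reported",
    "foreclosures rose and contacts were pessimistic",
    "gains strengthened in manufacturing and tourism",
]
recession = np.array([1, 0, 1, 0])   # 1 = recession at the time of the report, 0 = not

vec = CountVectorizer(binary=True).fit(reports)
dtm = vec.transform(reports)

# The L1 penalty shrinks most coefficients exactly to zero, keeping only terms
# with a reliable association with the response.
model = LogisticRegression(penalty="l1", solver="liblinear", C=1.0).fit(dtm, recession)

selected = [(term, round(coef, 2))
            for term, coef in zip(vec.get_feature_names_out(), model.coef_[0])
            if coef != 0]
print(sorted(selected, key=lambda tc: -abs(tc[1])))   # strongest associations first
```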
So what do we get when we run the analysis? | |
We get a model out. So what we have here is the equation for that model. Don't worry about it too much; the idea is that we say
the log odds of recession, which is just a function of the probability that we're in a recession when the Beige Book is issued, is a function of
all the different words that might occur in the Beige Book report.
And you can see, we have, you know, the effect of the occurrence of the word "pandemic" with a coefficient of 1.94. | |
That just means that the log odds of "recession" go up by 1.94 if the Beige Book report mentions the word "pandemic." Then we see minus 1.02 times "gain." Well, that means if the Beige Book report mentions the word "gain," then the probability of recession... | |
or the log odds of recession drops by 1.02. | |
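Written out, the fitted model has roughly this form, shown with just the two coefficients mentioned above and the rest elided:

$$
\log\frac{P(\text{recession})}{1 - P(\text{recession})}
= \beta_0 + 1.94\,\mathbb{1}[\text{"pandemic" occurs}] - 1.02\,\mathbb{1}[\text{"gain" occurs}] + \cdots
$$

On the odds scale, mentioning "pandemic" multiplies the odds of recession by about $e^{1.94} \approx 7.0$, while mentioning "gain" multiplies them by about $e^{-1.02} \approx 0.36$.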
So what we get out of that is a list of terms that are strongly associated with an increase in the probability of recession, things like "pandemic," "postponement," "cancellation," "foreclosures."
And we also get a list of terms that are associated with a decreasing probability of recession, so like "gain," "strengthen," "competition." | |
We also see "manufacturing" right there, but it's got a relatively small coefficient, about -0.2.
And you'll actually notice here, if we look at a graphical representation of all the terms that are selected in this analysis, you don't see too many specific domains like "manufacturing,"
"tourism," "investments" and all that. That's because those things are always talked about, whether we're in a recession or not, so what we're really looking for is words that are used,
you know, when we're in a recession more often than you would expect by chance. So, for example, we have
"pandemic" being the most predictive. Makes a lot of sense. We weren't talking about pandemics at all until pretty recently, and then we've also experienced a recession recently, so we've picked up on that pretty clearly.
Then we have a few others in this, kind of, second tier, so that's "postponed," "cancel," "foreclosed," "deteriorate," "pessimistic."
And it's kind of interesting, this "postponement" and "cancellation" being associated with recessions. It makes sense, you really want to talk about postponing, say, economic activity | |
when a recession is happening, or at least that's perhaps a reliable trend, so that's insight in and of itself. In fact, I
mean, I couldn't tell you how the Federal Reserve tracks postponing or canceling of economic activities, but the fact that those terms get flagged in this analysis suggests that's something probably worth tracking.
Alright, so that's term selection. We actually get this nice list of terms associated with recessions out and we can see the relative strength of association. Now let's actually see that briefly here, how it's done in JMP. | |
So I'm gonna head back on over to JMP and what we're going to do is pull up a slightly different data table. It's still Beige Book data, though, now we have just the national reports. | |
And we have this accompanying variable, whether or not the US was in a recession at the time. And of course there's some autocorrelation in these data; I mean, if we were in a recession last month, it's more likely we're going to be in a recession this month than if we weren't.
And, you know, that typically could be an issue for regression-based analyses, but this is purely exploratory, so we're not too concerned with it.
So I'm going to just pull Text Explorer up from a table script, just because we've kind of seen how it's launched before.
Note though that I've done some phrasing already, as we did before. I've also added some stop words, which you can't see here, but those are words that I don't necessarily want returned by this analysis.
And I've turned on stemming, which is what those little dots in the term list mean. For example, this one for "increase" is now actually collapsing across "increases," "increasing," "increasingly."
And that's because now I, kind of, consider those all the same concept, and I just want to know if, you know, that concept is associated with recessions. | |
So to invoke term selection, we'll just go to the red triangle, and I'll select it here. | |
We get a control window popping up first, or I should say section, where we select which variable we care about, that's recessions. Select the target level, so I want this to be in terms of the probability of recession, as opposed to the probability of no recession. | |
I can specify some model settings. If you're familiar with GenReg, you'll see that you can choose whether you want early stopping, which of two different
penalizations you want to perform variable selection, and what statistic you want to use for validation. And if that stuff is new to you, don't worry about it too much. The default settings are a good way to go at first.
We have our weighting, if you remember, we had the 1s and 0s in that column for "weak," just saying whether the word occurred in a document or not, but you can select what you want. So | |
for example, frequency is not, did "weak" occur or not, it's how many times did it occur. And this kind of affects the way you would interpret your results. We're going to stick with binary for now. | |
But I'm going to say I want to check out the 1,000 most frequent terms, instead of the 400 by default. You can see that's a lot more than our 436 rows, and normally you can't fit a model with 1,000 Xs but only 436
observations, but thanks to the
automatic variable selection in generalized regression, this isn't a problem. So once again it selects which of these thousand terms are most strongly related, hence the name term selection. | |
So I'm gonna run this. You can see what has happened is JMP has called out to the generalized regression platform and returned these results, where up here, we see some information about the model as it was fit. For example, we have 37 terms that were returned. | |
Let me just move that over, because over here on the right is where we find some really valuable information. This is the list of terms most strongly associated with recession.
Now I'll sort these by the coefficient to find those most strongly associated with the probability of recession, so once again that's "pandemic" "postponement" "cancellations" and, as you might expect, at this point if I click on one of these, it'll update these previews | |
or these text snippets down below, so we can actually see this word in context. | |
So this list of terms in itself could be incredibly valuable. You, you might learn some things about specific terms or concepts that are important that you might not have known before. You can also use these terms in predictive models. | |
Now a few other things to note. | |
You can see down here, we have once again a table by each individual document, but instead of sentiment scores, we now have basically scores from the model. We have for each one
what we call the positive contribution, so this is the contribution of the positive coefficients predicting the probability of recession, and here we have the ones on the negative end.
And then we even have the probability of recession from the model, 71.8% here and then what we actually observed. | |
And we're not building a predictive model here necessarily, that is, I'm not going to use this model to try to predict recessions. I mean, I have all kinds of economic indicator variables I would use | |
too, but this is a good way to basically sanity check your model. Does it look like it's getting some of its predictions right?
Because if it's not, then you might not trust the terms that it returns. You also have plenty of other information to try to make that judgment. I mean, we have some fit statistics, like the area under the curve up here.
Or we can even go into the generalized regression platform, with all of its normal options for assessing your model there further as well. | |
I'm not going to get into the details there, but all of that is available to you so that you can assess your model, tweak your model how you like, to make sure you have a list of terms that you feel good about. | |
Now you see right here, we have this, under the summary, this list of models and that's because you might actually want to run multiple models. So if I go back to the model...oh, excuse me...if we go back to our | |
settings up here, I could actually run a slightly different model. Maybe, for example, I know that the BIC is a little more conservative than the AICc and I want to return fewer terms; maybe you did an analysis that returned 900 terms and you're a little overwhelmed.
So I'll click run and build another model using that instead. | |
And now we have that model added to our list here, and I can switch between these models to see the results for each one. In this case, we've returned only 14 terms, instead of 37 and I would go down to assess them below. | |
So two big outputs you would get from this. One, of course, is this term list. If you want to save that out, because these are important terms to you and you want to keep track of them, just right-click and make this into a data table. Now I have a JMP table
with the terms, their coefficients and other information. | |
And | |
if what you want to do is actually kind of write this back to your original data table, maybe, so that you can use the information in some other kind of analysis or predictive model, | |
just head up to term selection and say that you want to save the term scores as a document term matrix. If I bring our data table back here, you can see it's now written
columns for each of these terms that have been selected, in this case filled in with their coefficient values, and now I can use this
quantitative information however I like. | |
That's just a bit then about term selection. Again, the big idea here is I have found a list of terms that are related to a variable I care about and I even have, through their coefficients, information about how strong that relationship might be. | |
So let's just wrap up then. We've covered two techniques. We just saw term selection, finding those important words. Before that we reviewed sentiment analysis, all about | |
quantifying the degree of positive or negative emotion in a text. These are two new text mining techniques coming to JMP Pro 16's Text Explorer. We're really excited for you to get your hands on them and look forward to your feedback. |