
Expanding Our Text Mining Toolkit: Sentiment Analysis and Term Selection in JMP Pro 16 (2021-US-45MP-851)

Level: Intermediate

 

Ross Metusalem, Systems Engineer, JMP

 

Text mining techniques enable extraction of quantitative information from unstructured text data. JMP Pro 16 has expanded the information it can extract from text, thanks to two additions to Text Explorer’s capabilities: Sentiment Analysis and Term Selection. Sentiment Analysis quantifies the degree of positive or negative emotion expressed in a text by mapping that text’s words to a customizable dictionary of emotion-carrying terms and associated scores (e.g., “wonderful”, +90; “terrible”, -90). Resulting sentiment scores can be used for assessing people’s subjective feelings at scale, exploring subjective feelings in relation to objective concepts, and enriching further analyses. Term Selection automatically finds the terms most strongly predictive of a response variable by applying regularized regression capabilities from JMP Pro’s Generalized Regression platform, which is called from inside Text Explorer. Term Selection is useful for easily identifying relationships between an important outcome measure and the occurrence of specific terms in associated unstructured texts.

 

This presentation provides an overview of both Sentiment Analysis and Term Selection techniques, demonstrates their application to real-world data, and shares some best practices for using each effectively.

 

 

Auto-generated transcript...

 


Speaker

Transcript

Ross Metusalem Hello, everybody, and welcome to this presentation on how JMP has expanded its text mining toolkit in JMP Pro 16, which just came out this past spring.
  I'm Ross Metusalem. I am an academic ambassador for JMP and I'm going to share with you these two new text mining capabilities
  sentiment analysis and term selection. We're going to talk about what they are and how to use them in JMP, and then we'll see two examples of performing these analyses on real-world, publicly available data.
  So sentiment analysis and term selection are two new additions to JMP's text explorer, which is a general purpose text mining tool.
  So you can use this to derive quantitative information from what we commonly call free or unstructured text data, so this might be anything from deriving word frequencies, so that we can make a word cloud,
  to performing latent semantic analysis, so that we can cluster semantically similar documents. So the two capabilities we're going to see today are capabilities within the text explorer tool.
  We're going to start with sentiment analysis, a popular and often requested feature that we're happy to introduce in JMP Pro 16.
  For those who aren't familiar, sentiment analysis is all about quantifying the positive to negative emotion in unstructured text data.
  Now the example that we're going to look at is from the Beige Book, which is a narrative report issued by the Federal Reserve.
  And as we can see in the map, the Federal Reserve is broken into 12 different districts and approximately eight times a year, each district issues a narrative Beige Book report
  that summarizes the current and potential future economic conditions in that geographic region itself. And these reports are based largely on qualitative data sources, so think
  conversations with business leaders, surveys, things like that. So I think the Beige Book provides a good example for sentiment analysis, not only because
  it's publicly available data, and I can share it with you all, but also because it'd be pretty interesting if you were, let's say, an investor, or you managed a financial institution, to know how strongly positive or negative the directors of the Federal Reserve feel about economic conditions in different parts of the country. You could actually use that to
  perhaps guide decision making, or even if you can quantify it, use it in some statistical analyses, maybe even some predictive models.
  So let's take a look at an example of some text from the Beige Book.
  Here we just have a snippet from one of the reports, and we can see, starting at the top, that this report says that computer sales, on the other hand, have been severely depressed.
  Whereas, if we go down to the bottom, an individual also reported that there are a number of favorable straws in the wind, so we have "severely depressed" and "favorable" and also "improve" highlighted here,
  because these are what we would call sentiment phrases. So it's one or more words that denote some kind of positive or negative emotional content.
  So at the top, we have severely depressed, which would be strongly negative, and then down below we have favorable, which would be generally positive. And we use this color coding, orange versus this blue or purple (depending on how you see it),
  to indicate positive and negative sentiment respectively.
  And many forms of sentiment analysis do just this. They look for positive phrases and negative phrases in text.
  The form that JMP uses, and that you might encounter elsewhere, actually quantifies positive and negative sentiment on a continuous scale.
  So take, for example, down at the bottom of this second snippet of text, where we see that somebody says that large department stores in the region report poor sales.
  So up at the top, we have sales being severely depressed and down below, we see that sales are poor. Well, I think we all agree that severely depressed sounds worse
  with respect to sales than poor does. And so we might want to,
  while they're both negative, quantify severely depressed as more negative than poor, and so that's what we do in sentiment analysis in JMP. And we're going to talk
  very shortly here about how those scores are calculated, in particular, but first I think it's worthwhile to actually look at
  potential demonstrations of the validity of quantifying sentiment by looking at individual terms and phrases like this.
  What we're talking about is essentially scoring documents based on the words that occur in them that we think denote emotion, and to what degree. And that's kind of an internal and subjective thing, but this approach, sentiment analysis,
  can really demonstrate some predictive validity. For example, if we look at how Beige Book sentiment tracks over time, we can see it actually tracks pretty reliably with recessions. So here we have sentiment
  on the y axis. You can see it extends from negative 75 to 75, that's just the range that we plotted. The true scale goes from negative 100
  to 100, where zero is neutral. And you can see, with the smoother over time capturing the general trend in sentiment across all the Beige Book reports,
  you can see an interesting pattern with respect to economic recessions, which themselves are highlighted in red. In particular, we can see that sentiment seems to drop reliably preceding each recession.
  So, even though this is a relatively subjective measure of what we believe to be the emotional state of the directors of the US Federal Reserve, I mean, that's tracking with economic recessions as one might expect.
  It also tracks actually geographically, interestingly enough. Take for example the dot-com bust around 2000-2001 when a bunch of Internet startups in the San Francisco Bay area
  went under. We can see, and I have the hover labels for San Francisco highlighted here, so we can see the scores for that region. We have positive sentiment in orange, as we saw before, and negative in that blue.
  We can see that building up to 2000, sentiment
  was pretty high in the Bay Area, 38.65, and then it plummeted over 60 points in 2001, on average, to approximately negative 24.7. And that's a much bigger swing than we saw elsewhere in the country, presumably because that's the region that was most strongly affected.
  And we see something similar with the great recession. So we see the crash of the housing market in 2008 corresponding with a drop in sentiment and not uniformly.
  It looks like the biggest drops were evidenced in the West Coast, so that's the San Francisco region again, the New York region, and also the Atlanta region, which includes
  Florida, where I live, which was a particularly hard hit area of the country. And, in fact, if we look down in the bottom row, we can see that Florida...
  the sentiment doesn't seem to recover in that general area until about 2012, behind some of the other geographies, and that seems to make sense. The housing market in Florida was one of the last to recover nationwide.
  So we have this subjective measure of emotion, but it does track with, in this case, external events...economic events in a way that we would expect.
  And you know again, we're looking at the Beige Book because it's a particularly good publicly available example. If you
  are talking about more common domains to apply sentiment analysis, like consumer market research, where you're dealing with the voice of the customer in the form of text data, so product reviews, survey responses, things like that,
  you may be able to use sentiment analysis in order to better understand, maybe even predict, say, customer behaviors, much like we can use sentiment here to better understand or predict, maybe, economic recessions.
  So with that said, I realize we've been looking at sentiment scores for the past few minutes here, and it really raises the question of where they come from.
  So JMP uses a form of sentiment analysis called lexical sentiment analysis. So it uses a sentiment lexicon or
  sentiment dictionary
  to assign scores to all the individual words that we believe carry some sentiment. So a word like fantastic might get a score of 90, which is highly positive, almost at the positive maximum of 100, whereas a couple rows down, feared gets a score of negative 60.
  Now we can assign scores to these terms and basically count them up or take their average in any individual document.
  But we do want to account for a couple other things. If you remember that first sentiment phrase we saw in the Beige Book example was severely depressed.
  In other words, severely is an example of something that we call an intensifier, which is basically a sentiment multiplier. So you can see here, we could declare incredibly to be a multiplier with a value of 1.4, whereas a little actually scales the sentiment by .3, so it decreases the sentiment's magnitude a bit.
  So, if an intensifier appears with a sentiment term, we perform that multiplication, and the same goes if a negator occurs. So if we say something is no or not or shouldn't or wasn't, we multiply by negative one, so that we basically flip the sign; positive becomes negative and vice versa.
  And so for any sentiment phrase, we just perform this simple multiplication and then across a document,
  we can take the average score or maybe the difference between the minimum and maximum sentiment scores to
  compute a sentiment value for those documents. And that's what we were looking at in those in those graphs, so the graph of sentiment across time in those maps.
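To make that arithmetic concrete, here is a minimal sketch of lexical scoring in plain Python. The dictionary entries, scores, and multipliers are illustrative assumptions, not JMP's built-in dictionary, and the one-token lookback is far simpler than Text Explorer's actual phrase matching:

```python
# Minimal sketch of lexical sentiment scoring. All entries and scores
# below are illustrative assumptions, not JMP's built-in dictionary.
SENTIMENT = {"fantastic": 90, "favorable": 60, "poor": -60, "depressed": -60}
INTENSIFIERS = {"severely": 1.4, "incredibly": 1.4, "slightly": 0.3}
NEGATORS = {"no", "not", "wasn't", "shouldn't"}

def score_document(tokens):
    """Multiply each sentiment term by a preceding intensifier (or by -1
    for a preceding negator), then average the phrase scores."""
    scores = []
    for i, tok in enumerate(tokens):
        if tok in SENTIMENT:
            s = SENTIMENT[tok]
            if i > 0 and tokens[i - 1] in INTENSIFIERS:
                s *= INTENSIFIERS[tokens[i - 1]]
            elif i > 0 and tokens[i - 1] in NEGATORS:
                s *= -1
            scores.append(s)
    # Document score: mean of phrase scores; 0 (neutral) if none found.
    return sum(scores) / len(scores) if scores else 0

print(score_document("computer sales were severely depressed".split()))  # -84.0
print(score_document("conditions were not favorable".split()))           # -60.0
```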
  So this is lexical sentiment analysis. It's a very common practice and it's the type of sentiment analysis that is implemented in JMP. And as we'll see, when you have JMP
  compute all the sentiment scores, you get a lot of handy information out. So here's an example of what the sentiment analysis report looks like.
  To just preview one element of the results, at the top here, we have a simple summary of sentiment across all of our documents.
  So we have a histogram that's handily color coded, showing us that the distribution of overall sentiment seems to lean positive, but that we have a pretty strong negative sentiment tail.
  We can see that, across all the documents, the average sentiment score is 4.1, so just slightly above neutral.
  For all the documents that are overall positive, of which there are 3,900, the average score is 22.9, whereas among the roughly 1,400 negative documents, the
  average score is negative 15.4, so the negative documents aren't quite as strongly negative as the positive documents are strongly positive.
  Now, of course, here with the Beige Book, which in this case runs from 1970 to the present, I might not care as much about the overall average sentiment score as about the scores of individual documents, or maybe the distribution of certain sentiment terms. But again, if you're dealing with data in, say, consumer market research, or any text data where you have the voice of the customer, a high-level summary like this can tell us how positive or negative our customers are feeling based on the data that we have, and
  even just the simple report can be really valuable. But we're about to switch over to JMP and we're going to see that the sentiment analysis report comes with a whole lot of additional ways of exploring the data to gain insight.
  So let's head on over to JMP and see how it's done.
  So here we are I'm in JMP Pro 16.1, and before we dive into these Beige Book data, I do want to give a shout out to Xan Gregg.
  For those of you who don't know him, Xan is a director on our development team. He is the creator of Graph Builder
  and he is also the person that originally came up with the idea to look at the Beige Book data, and even was kind enough to scrape the html pages in the archives of the Beige Book and provide a nice JMP data table for me. So thank you very much, Xan.
  All right, now let's take a look at sentiment analysis. Instead of going straight into the results,
  you know, I do want to show you how to set the analysis up briefly, because there are some important considerations to be aware of,
  with respect to not just, you know, how JMP implements it, but how to do sentiment analysis or lexical sentiment analysis in general.
  And to demonstrate that setup, I'm actually going to look at just a subset of the data from 2010 to the present, in this case, just to keep things snappy, because we're going to be asking
  text explorer to kind of auto update its analysis a number of times, and we want that to be, you know, as near instantaneous as possible.
  So here, I have that subset of the data, so we have our Beige Book reports across all 12 districts just starting in January 2010 up until the most recent report last month.
  All right, so to launch our sentiment analysis, I'm going to go to Analyze and choose Text Explorer, then take our text and put it in the Text Columns role. A couple things to note.
  One is that while text explorer generally supports a number of languages,
  sentiment analysis is currently available only in English.
  And the other is that, at least if you're an experienced text explorer user, you might be used to turning stemming on, and you'll note that I'm going to leave it off right now.
  For those who don't know stemming, it's a way of collapsing across multiple forms of the same word, or in linguistic speak, the same lemma. So take strong, stronger, and strongest.
  We would collapse across all of them, and just say, well, they're all strong for our purposes.
  And obviously we don't want to do that here because strongest is more positive than stronger, which is more positive than strong.
  And so we leave stemming off and, in fact, even if you turned it on once you invoke sentiment analysis, it's going to ignore that stemming, because, again, you really shouldn't be using it.
  So I'm going to click OK. And it looks like as I did that, we got a slight technical hiccup. I apologize; there's been something weird happening with my computer lately. But that's okay, we can simply restart our screen share and bring my video back. Hi, everybody.
  Okay, so we should still be going; it looks like the recording is still going. Let me make sure JMP is assigned to the right monitor here. Looks like it is. All right, sorry about that, everyone. Anyway, let's get back to it.
  So here we have our text explorer window. Now again if you're used to doing
  text analysis, you would know that you typically begin by declaring stop words. For example, people are saying things like "my contacts in industry told me that...," and I wouldn't want to use the word contacts in my analysis, so normally I would add it as a stop word, and then go through, potentially for hours, declaring stop words before I actually did my analysis. You don't have to do that with sentiment analysis.
  It's all based on that sentiment dictionary, and a word like contacts isn't in the dictionary, so it's functionally excluded anyway. It's a nice way to save some time.
  However, you probably do want to declare phrases, for example, real estate or one that we're going to look at, which is loan demand, which is sitting down here.
  So I click loan demand and add it as a phrase. It gets placed over in our term list, so let's just find it real fast.
  There it is, 1,139 instances. And this is to help us actually drill down into subdomains in this data set. So imagine I want to calculate sentiment, not on the whole, but only with respect to loan demand.
  To do that as we'll see, I need to first declare it as a phrase, so it is reasonable to go through and declare some phrases at least for really prominent domains that you might be interested in.
  Alright, so now let's close that down and actually launch sentiment analysis.
  And you'll see when I do, I'm going to get a full blown report out, and so it might look to me like it's time to start exploring my results, because this looks like a completed analysis, but I do want to encourage anyone using this to pump the brakes and consider your sentiment dictionary.
  Different, you know, groups of people generating texts in different contexts about different things, with different vernaculars
  can use language in a lot of different ways. And it's important to tailor your sentiment and intensifier negation term lists
  to reflect that. So that is...I hesitate to use the word never, but I'm going to...I'm going to say never just use the default sentiment dictionary. You will always get more valid results if you curate it yourself.
  So if I go to the sentiment terms here, you can see that the scores are based on this built-in dictionary, which is relatively sparse.
  But I want to actually read through some of the text and see if I can identify some more words that should be included in the dictionary. I should mention you might even want to go through the built-in list and exclude some, though that list is relatively sparse, so ideally you shouldn't have to exclude too many, or perhaps any.
  So down in our report, you actually have some tools for curating your dictionary, not just exploring the results. I can select the document in this document table and bring up the text
  right below. So I'll get off the bottom of the screen briefly. You can see highlighted in that text, the sentiment terms and phrases have already been found, like slightly better.
  But if you look, say, here we have the word weak, current conditions remained weak, and it turns out, as far as I can tell, this is a very common
  term in the Beige Book reports indicating that something is essentially bad. I don't know, I'm not an economist, but maybe it's the case that in economics,
  you know, weak is a synonym for bad, strong is a synonym for good, and that's my take, having looked at these data a fair bit.
  Now you'll see, it's not highlighted. That means weak is not currently in my dictionary, but I can add it really easily. I just hover over that
  term right in the text, and I can add it as a negator, give it an intensifier value, or give it a sentiment value. It's moderately negative, so I'm going to click negative 60.
  And what happens is that it's now added to my dictionary, and all the sentiment scores are recomputed in real time here.
  So weak is in my sentiment term list now; I can even select it and see all the instances of it. It looks like, if we go back to that table in the count column, that it occurs 1,012 times in this subset of the Beige Book data.
  So certainly, go through and read, perhaps, a random subset or a stratified random subset of your documents and try to identify some sentiment terms that belong in your dictionary.
  Now, you may also rely on a handy feature, this possible sentiment list,
  which actually serves up potential sentiment terms for your consideration. I can note, say, that strong is here, and since I put weak in, I certainly want strong, so I'm going to declare that to be a positive 60.
  I've been using the shortcut buttons, you know, positive 60, negative 60, but you can enter any score you like between negative 100 and 100, so maybe I actually only want a score of 50 for strong. So I'll do that, and if you look down below right now, you can see everything again auto updates.
  So take care with your sentiment term list. I won't go through the mechanics of setting up your negator and intensifier terms, but the same ideas apply.
  Make sure you get those right to really reflect your domain. And then, under the red triangle, you can always go into the full term management tools to manually enter some terms, so maybe I know that I want strongest with a score of 90, and I can add terms in kind of a batch format here without auto updating every time. I can also import and export sentiment term lists,
  which is great for sharing them with other people, with making them portable, with helping to track our work and make it replicable.
  Because of how much your results depend on the sentiment term list, and also the intensifiers and the negators, I highly recommend that when you have a set list that you export them so that you have it fully documented and also able to be imported into a new analysis and replicated.
  So that's enough on setting the analysis up, I really just want to drive home that point that if you invoke sentiment analysis, you're going to get a report, but don't dive right into that report. Curate your dictionary first.
  So, now that I've
  beaten that point mercilessly home, let's actually take a look at that report.
  And what I'm going to do now is pull in, you know, a fully baked report using, in this case, all 5,300 plus
  Beige Book reports here. So, as I mentioned before, in the results section, you have this overall summary. We won't run through that again. We've seen that you also have
  all the individual documents, so you can view any document to see how it was scored, what appears in it. Explore the use of sentiment terms and phrases in context.
  You can also in this table see a number of handy measures, for example, the sum total positive score for a document,
  the average score of the positive sentiment phrases in the document, the same for negative, and then the overall score. And you'll notice, as I hover over these values, it actually shows me how they're calculated. So if I look at the negative score mean, I can see that there are seven total terms, I can see what their scores were, and then I can see what the average is.
  I can also see that the average, in this case, of negative one, is just the average score across all sentiment terms, positive and negative.
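If you want to check those summary measures by hand, the arithmetic is simple. Here's a tiny sketch with purely illustrative phrase scores for a single document (not JMP's code, just the same averages):

```python
# Illustrative phrase scores found in one document (made-up numbers).
phrase_scores = [60, 30, -60, -45, -30, -30, -20, -15, -10]

pos = [s for s in phrase_scores if s > 0]
neg = [s for s in phrase_scores if s < 0]
print("positive score mean:", sum(pos) / len(pos))  # 45.0
print("negative score mean:", sum(neg) / len(neg))  # -30.0 (7 terms)
print("overall score mean:", sum(phrase_scores) / len(phrase_scores))  # ~-13.3
```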
  There are a lot of handy ways in which hovering over elements in this report will reveal helpful information. I highly recommend, you know, just try it out. Hover with your mouse in different places. You'll probably find something useful.
  Now, typically we don't look too hard at individual documents when we're doing text mining like this. I mean, if we could just read them all,
  then we would. I mean, we're better at understanding language than computers. So we certainly aren't just going to go through this entire collection of 5,300 documents, but you could, say, sort by score and have a look at just the most negative
  reports, and actually read through those to get an idea of, okay, well,
  when people are particularly negative, what are they negative about? What can I learn about what's driving that negativity? Again, especially in, say, consumer research, looking at product reviews, surveys, things like that can be really valuable.
  We also saw that you have your sentiment term list, where we can see all the different sentiment terms that appear, how often they appear, and their associated scores. And if I grab any of them, I can see those terms in context as well. So if there's a particular sentiment term that's used
  quite commonly in your data, you might want to understand it better in context to see why people are using it or what's driving them to use it. And it's pretty easy to do that just by selecting the term and then browsing the text below.
  Now, finally, what I think is one of the most useful features in sentiment analysis is under this features tab.
  What this does when I click search is find non-sentiment terms that frequently occur with sentiment terms, and it allows us to do a drill-down analysis, where we assess sentiment only with respect to that selected term, or what we'll call a feature; you could also call it a subdomain. So maybe sentiment just for sales, and we can see sales were reported weak, weakening in sales.
  Or to go back to earlier when I said you do want to do some phrasing potentially. Let's find loan demand in this list,
  and let's score all of our documents only with respect to the sentiment regarding loan demand. Of course this is using some natural language processing to try to determine which sentiment phrases do and don't apply to loan demand, and as with any NLP method, it's not going to be 100% accurate. But especially in well-formed text written the way the Beige Book reports are, it seems to function quite well. And so now, I can see, for example, how do they feel about loan demand
  across all these reports? And it looks like generally positive, though we do have some particularly negative reports in there.
  Last thing to call out here, just to bring us back full circle: you probably noticed we have two spikes when I did this, at what looks like positive 60 and negative 60.
  And this is just highlighting again how important it is to curate your sentiment dictionary when doing this analysis and account for how you built it when you're interpreting your results.
  So here, this spike at positive 60 could very well be because when I initially entered a lot of sentiment terms into the dictionary, I was using these shortcut buttons.
  So I have these 30, 60, 90 shortcut buttons, and so I'm kind of setting myself up to see spikes in scores at positive 60 and negative 60. And if I had taken a little more care to make the scores more fine grained, then maybe I wouldn't see quite this interesting, we'll call it, distribution.
  So I call that out only to again just stress how important it is to get your sentiment dictionary
  together up front and then, once you do, you have all this useful information about global sentiment, sentiment by document, sentiment by individual term or even by sub domain, using this features
  section here. And then, of course, as you're probably used to in JMP, when it's time to take what you've done and save it out, so you can use it elsewhere under the red triangle, you'll find those options, so I might save my document scores.
  If we go back to my data table, which is right here, you can see I've actually saved, in this case, it's the scores for that loan demand sub domain, back to my data table so that I could use them, say, in further analysis or graphing, like some of the graphs we built for the slides.
  So that's sentiment analysis, new in JMP Pro 16 for quantifying positive and negative emotion in text. And
  with that, let's transition over to term selection, the other new text mining technique I'd like to talk about. As it says, term selection is all about identifying words that are associated with some outcome, that is, some other measure. For example here, you know, we have,
  the Beige Book report texts, but we also could incorporate
  economic variables, for example, GDP, and see how certain terms in the Beige Book reports track with GDP, or maybe with a binary variable, like whether the economy is in recession or not.
  Term selection is a general purpose tool for finding words that are associated with some other variable, which can be useful
  in an interpretive context, to find maybe the drivers of some other variable, or perhaps in a predictive modeling exercise, so we can identify terms that are predictive of something so that in the future, we can use those terms to help predict it.
  So let's see how term selection works, and we're actually going to look at a different example here from the Consumer Financial Protection Bureau's consumer complaint
  database. So consumers, if they have a complaint regarding a financial product or service, can enter that complaint with the CFPB, and the CFPB mediates between that consumer and the provider of the financial product or service. Here's an example of what you can find in these data. So here is somebody saying that they were monitoring their credit report, and there was a collection notice filed for $3,900, and they claim to not know anything about it.
  And you can see as well, in the data table, we have a column right next to it indicating whether the consumer disputed, and here the consumer did dispute. What that means is that they received a response from the
  financial services provider and they were not satisfied with that response, so they disputed it.
  You can see in the header graph in the data table that it's not all that common; roughly one in five complaints in this data set result in a consumer dispute. But if you're the financial company that is dealing with this customer, you certainly don't want to issue some response and then have them continue to dispute that response. You may want to understand what types of complaints
  are coming in that actually lead to consumer disputes, so that you might be able to, say, identify the ones likely to result in dispute in advance,
  or just better understand your dispute resolution practices to cut down on the number of disputes. There are a lot of reasons you might want to mine that consumer complaint narrative data to see what might be driving these disputes in the first place. And what term selection does is it finds the individual terms in the texts that are associated with that outcome, that dispute variable.
  So here, I have a graph I've made of the results of doing a term selection analysis, where we have a collection of terms along the Y axis. On the X, we have the logworth, which is
  a measure of the strength of statistical evidence of the association. So this is actually ordered by logworth values, so at the top, we have
  violate, and you can see the little dot there. That actually shows that we've had stemming done here, so violate could mean violate, violated, violation, and so forth.
  So we have strong statistical evidence that that term violate is associated with consumer disputes, and actually in color coding,
  I have the strength of the effect in terms of an odds ratio here. So when people...when somebody mentions that word, violate,
  the odds of a dispute go up by a factor of about 1.5. So if I am
  a financial services provider, I might want to be on the lookout for that term. I might want to investigate how and when people claim certain violations, and as we'll see, it's really violations of federal law.
  When they claim that in a complaint, how is it that we're handling those, because it seems like maybe we're not handling them satisfactorily.
  Maybe you want to use the presence or absence of that term as part of a predictive model to help flag potentially problematic complaints up front, so that we can escalate their resolution.
  So, how does all this work? Well we're building a model that starts with the document term matrix, which is a representation of the text, where each
  row is one document and then each column is one term that could appear. And then
  in this case, we have what we call binary weighting: if the term appears in that document, we enter a one; if it doesn't, we enter a zero. There are other weightings, but we're going to stick with binary for our example.
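If it helps to see the same idea outside JMP, here is a hedged scikit-learn sketch of a binary-weighted document-term matrix. The example documents are made up, and this mirrors only the weighting scheme, not Text Explorer's tokenization or phrase handling:

```python
# Sketch of a binary-weighted document-term matrix with scikit-learn.
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "they violated the fair credit reporting act",  # made-up complaint texts
    "the representative hung up on me twice",
]
# binary=True records presence/absence (1/0) rather than raw counts;
# max_features=200 would keep only the 200 most frequent terms.
vec = CountVectorizer(binary=True, max_features=200)
X = vec.fit_transform(docs)        # rows = documents, columns = terms
print(vec.get_feature_names_out())
print(X.toarray())                 # entries are 0 or 1
```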
  Then we enter those binary columns as predictors in a model, and in particular we use JMP's generalized regression, and we'll see why in a moment. First, I just want to specify that what we're doing here is regression; in fact, you'll probably hear some folks refer to term selection as text regression.
  Don't worry too much about the form of the model. I'm just trying to highlight that what we're doing is we're taking some outcome variable.
  Because we have a binary outcome, it's the log odds for logistic regression, and we're saying that that's equal to some intercept, plus a coefficient times some variable (in this case, a binary one or zero: did violate appear or not), plus another coefficient value, in this case negative .2, times another binary variable (here, whether somebody mentioned the term hung up or not), and so forth.
  And so we enter a bunch of terms into this model and we use generalized regression to select which terms belong in the model and which don't. And
  what we see here is an example of JMP computing that solution, which terms belong and which don't for us.
  Over here in the solution path, we can see in blue all the terms that entered the model, and in black all the terms that were not included in the model, because they don't really add to its predictive value.
  And so what we end up with is a list of terms with their model coefficient values. So here, for example, violate got that .4 value because that's actually the value of the coefficient when we fit the model. And as we mentioned before, we also get a measure of the strength of the statistical evidence,
  this logworth value, 16.58 here. If you want a rule of thumb, any value above 2 might be considered strong statistical evidence, but really, logworth is the negative log base 10 of the p-value, so a value of 2 just corresponds to a p-value of .01. So we have both the coefficient, telling us how strongly this term increases or decreases the odds of a dispute, and the logworth, telling us how reliable our evidence really is.
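The arithmetic behind those two quantities is easy to verify by hand. A quick sketch, using the illustrative numbers for violate from the slides:

```python
import math

# Logworth is -log10(p); the rule-of-thumb threshold of 2 is p = .01.
p_value = 10 ** -16.58                 # p-value implied by logworth 16.58
print(round(-math.log10(p_value), 2))  # 16.58

# Exponentiating a logistic-regression coefficient gives an odds ratio.
beta = 0.4                             # coefficient for "violate"
print(round(math.exp(beta), 2))        # 1.49: dispute odds up ~1.5x
```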
  And with that information, we can do a whole lot, and we're about to see how as we move on over to JMP and actually perform this analysis.
  So I'm back on my other desktop here. I'm going to pull up our consumer complaint data.
  And
  instead of going into text explorer from scratch, I'm going to just launch it using a table script here.
  And as we'll see, once this pulls up...sorry, popped up on the wrong screen. There we go.
  We always start out with a list of terms that will actually be entered into the analysis. Then we have our phrase list, where we can declare that certain phrases belong in the term list and should also be included.
  I've included plenty of phrases, like fair credit reporting act, because if the mention of that phrase or that concept is associated with consumer dispute, I want to find that.
  And you'll also notice if I go to our...let's go to manage stop words here, that I have declared a lot of stop words. So these are all words that
  I don't want to include in the analysis, mainly because they might swamp out other words.
  They might be really frequent, and if I only ask to analyze the, say, 500 most frequent words, I don't want a word like able to muck that up. Especially because, even if able were returned as a term associated with dispute, I don't know exactly what I would do with that, or how I might interpret such a generic word.
  So you want to take your time to do your stop words and phrases, and when you're ready, under the red triangle, invoke term selection.
  So when we do, we will first select our
  outcome variable and that's this consumer disputed. And we want to be talking about things in terms of the probability of a dispute happening.
  You can set continuous, as well as categorical, outcome variables here. They both work. And I should mention, too, that we're talking about a dispute, which,
  you know, you might think could actually be some kind of proxy for sentiment, that is, if somebody disputes something, it's probably because they're not happy.
  But some folks may have data where you have an external variable that is a more direct measure of sentiment, say, a star rating on a product that really indicates satisfaction.
  And so, if you happen to have a good measure of sentiment, along with your text data, you could use a tool like this to do what we would call
  data derived sentiment analysis, where you use this tool to mine for terms that are predictive of a sentiment measure, thereby essentially selecting sentiment terms and automatically
  finding what their scores should be. So it can be pretty handy. That's just not quite what we're doing here; we're using it in a more general way.
  Okay, so for that document term matrix, I'm going to stick with binary weighting. To keep things snappy, I'm going to say, let's look at just the 200 most frequent terms.
  Beyond that, starting at term 201, I'm not going to include it in the analysis. In practice I'd probably include more, you know, it's just that I want this to be nice and snappy for the demo.
  And if you are a GenReg user, you will see that you have
  some options for specifying the estimation and validation methods up front. I'm going to leave those at the defaults and click run.
  And so what's happening is JMP is calling out to GenReg. It's building a logistic regression model and doing variable selection using the elastic net method.
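As a rough outside-of-JMP analogue of this step, here is a hedged scikit-learn sketch of elastic net logistic regression on a binary document-term matrix. The data are random placeholders and the penalty settings are arbitrary, so this mirrors only the general approach, not GenReg's estimation or validation details:

```python
# Sketch: elastic net logistic regression for term selection.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(1000, 200))   # placeholder binary DTM
y = rng.integers(0, 2, size=1000)          # placeholder dispute indicator

# penalty="elasticnet" mixes L1 (which zeroes out coefficients, i.e.
# drops terms from the model) with L2; l1_ratio sets the mix and C the
# overall penalty strength. Both values here are arbitrary choices.
model = LogisticRegression(penalty="elasticnet", solver="saga",
                           l1_ratio=0.5, C=0.1, max_iter=5000)
model.fit(X, y)
selected = np.flatnonzero(model.coef_[0])  # indices of terms that entered
print(f"{selected.size} of 200 terms entered the model")
```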
  It looks like we just had a second technical glitch. I'm terribly sorry about that. We will get it addressed as this model is fitting. So let's see. We will bring my screen share back.
  We will bring me back. It looks like we are still recording, so we'll carry right along. So here is the result.
  Before we go into the report in detail, we will first take a look at just this one row in the summary table. This summarizes the single model I built so far.
  So, for example, I can see it returned 145 terms. I used AICc as my validation and there's the AICc value. We can see some other fit statistics.
  And, just like any model, we might want to try multiple runs in order to build the model that best fits our data.
  So if I go up to this generalized regression node here, you'll see that we have GenReg actually embedded in text explorer.
  I could explore the model that we have fit in more detail, or in my case, I'm actually going to fit one more model to see if I can do any better. Let's do an adaptive elastic net.
  And we'll fit that model instead and see if it does any better than the regular elastic net that we have already fit.
  Okay, so it has fit that model as well.
  If we go to our summary, we can see this one has returned 98 terms and actually has a lower AICc, so it might be a better fit.
  And so I'm going to go with that model. Now I could, if I wanted to go with the original, just click on that. As you can see, the report below updates to whichever model I have selected in this table.
  I have my list of terms down here. We've already spoken about this, so I could look at it in terms of the coefficients. So here are the terms that have the largest effect in increasing the odds of a consumer dispute.
  Here are those that have the largest effect in decreasing the odds of a consumer dispute. For any one of them, for example, I could look at hung up and see that, of course, it's talking about them hanging up on somebody or somebody hanging up on them. And maybe it makes sense to us that this would be predictive of a decrease in the odds of dispute, because it sounds like it's really a communication problem, and that might be a little easier to resolve.
  Again, we want to pay attention to our logworth values here, 2.883, so that's a little above that rule of thumb, 2, that I stated before, so we might say okay, I think there's reasonably strong statistical evidence that this is a real association.
  Just to drive that point home, if we go down here, we see the word police. And my initial impression is, well, people are saying, I filed a police report, or I called the police. They're probably not going to be so likely to just go away if we issue some response; the probability of dispute has to go up if somebody has already escalated to the level of calling the cops.
  But if we look at the
  logworth, we can see that the statistical evidence of this really isn't that strong, so while it violates my intuition, maybe it's not entirely real.
  It's certainly not to the point that I would want to run this up the chain to management and say, hey if people say they called the cops on us, it's actually a good thing. We're certainly not saying that. It's just been selected for inclusion in the model here because it did, ultimately,
  result in a better fitting model, but it's still up to us to apply our own knowledge and common sense in interpreting the results.
  So if what you're really trying to do is just identify the terms and quantify their effects in some capacity, this table might be the end goal of your analysis. And so I might right-click and choose Make into Data Table to obtain a data table of these terms and their coefficients and logworths, and then use this to guide some analysis of our previous responses or improvement efforts, or even flag these as terms that should be included in some kind of predictive model going forward.
  But, you know, we did build a regression model here, so at the individual document level, we have the sum total of the coefficients predicting an increase in the odds of dispute, the sum total predicting a decrease, the overall model value, the probability of a dispute that the model assigned, and what we actually observed. So all the things you might expect when you're assessing, at the level of the individual document, how well this model performs.
  Up at the top, we also have a general summary of the individual documents and what we're calling their contribution, that is, the sum or average of those coefficient values. So it looks like, for example, 30,412 of our documents have a positive sum total of their coefficient values, and the mean contribution is at about .57.
  So, again, just a flag: we're talking about regression coefficients here, and so these contribution values and the coefficient table should all be interpreted as you would interpret any such regression model, in our case, logistic regression. That is, if we exponentiate a coefficient, we're dealing with the odds ratio, the factor by which the odds would increase or decrease if this word appears in the text.
  Though if you're not huge on regression, these values can still be good for you in a more general sense, in that you can look to see, well, okay, which terms have the strongest effect or don't. And same with logworth, well, which have the strongest evidence or don't.
  Now, finally, as you might imagine, we already saw how to save the term table out, but if your goal is not that term table but really the model
  that you've built, you have the option to save that out as well. And so I've selected the model that I intend to use, that adaptive elastic net, and under the red triangle, I can save the prediction formulas.
  And I think my computer will probably churn here for just a second. So it's saving the document term matrix out, so that was for the top 200 terms as we specified, and then also the formula
  that operates on those columns or that document term matrix to actually compute our model. So here, I have the document term matrix
  and then the formula columns at the end here, so now, I can actually use
  this model directly if I want to.
  So that's term selection. Again high level, we are finding which terms are most strongly associated with an outcome variable that we care about, either to better interpret what's driving that outcome, or maybe to even inform a modeling practice to try to predict that outcome.
  And that wraps it up: term selection and sentiment analysis, the two new text mining features in JMP Pro 16, again just introduced this past spring.
  I'm going to hang out now for a little bit and field any questions that anyone has about,
  you know, these two new features. And also know that later on, you can always post an additional comment or question on the user Community page for this presentation and I'll always be keeping my eyes
  open to answer any questions that come through. So thanks for taking the time, everybody. Please take care.
Comments

One of the attendees at the semi-live presentation asked for a reference for sentiment analysis. I didn't have a good answer at the time, but some quick internet searching suggests Prof. Bing Liu's work might be a good starting point. His 2012 book Sentiment Analysis and Opinion Mining has over 7,000 citations on Google Scholar, and he's also authored a more recent book titled Sentiment Analysis: Mining Sentiments, Opinions, and Emotions. Prof. Liu's Sentiment Analysis webpage includes information on these books as well as a PDF book chapter on the topic. I haven't read Prof. Liu's work but it could be worth checking out.

jgrayson

Great!  Looking forward to looking at these.  I now have only 16 and it seems that others are 15. Is it possible to also have 16?

Thanks,

Jim

Thanks, @jgrayson. I'm afraid I don't understand the question about 15 vs 16. Can you please clarify?