Finding the Source of Grandma’s Chili: Investigative Text Mining (2020-US-45MP-532)
Scott Wise, Senior Manager, JMP Education Team, SAS
Text mining is a great tool for investigating all kinds of unstructured text that commonly resides in our collected data. From notes captured on warranty issues and lab testing/experimental comments to food recipes, this method opens a lot of opportunity to better understand our world. In this presentation, we will show how to use the latest text analytic methods to help solve a family mystery as to the regional source of my Grandma’s delicious chili recipe. Along the way, we will see how to use text mining to create leading term and phrase lists and word cloud reports. Then we will utilize the resulting document term matrix to perform topic analysis (via latent class analysis clustering) that will enable us to find a solution to our question. You will be left with an understanding of the powerful text mining approaches that you can add to your own toolbox and start solving your own text data challenges!
Speaker |
Transcript |
Scott Wise | Investigative Text Exploration. My name is Scott Wise and I'm Senior Manager of the JMP Education Team. |
And I've got a really fun presentation for us to view today and the goals of this presentation are to give you a little more familiarity | |
with the capabilities and how easy it is to do text exploration in JMP and JMP Pro, as well as show you a different way of looking at text exploration, like, what can I do with it to investigate something like a detective would? | |
Okay, so we're going to talk about my grandma's chili and how that relates to a Texas chili cook off. Before we begin, | |
let's just debunk some of the terminology that is around text mining, and to me, there's really five simple steps. You spend your time summarizing the data, literally, finding out what words in text occur the most often, even what combination of words occur the most often. | |
So the ??? call this looking through unstructured text right and then this, this could be anything. This could be a sentence your customer gave you about the performance of your product or your service. It could be... | |
it could be social media, right, where you know you figure out what people like or dislike based on comments, | |
something about your product or service or it could be this guy, something like a recipe. | |
I've even seen it done on patents when people are researching what are popular things people are applying patents for and should we be doing these. | |
So really, summarizing is finding out which of those words are the most important. Now there's some preparation that comes next, which is just getting down to the smaller list of the words you care about. | |
Then of course we wouldn't be JMP if we aren't going to visualize it and analyze it and as well we can model it. We can even do some advanced modeling. So I'm going to be using JMP Pro to do this, version 15.1, | |
and I will be sure to point out what you can do in JMP and where I actually throw in a little bit of Pro that helps further answer my question. | |
There are three terms to know: document, corpus and term. Document is going to be those... | |
those things you're analyzing, basically the individual body of text, each row of the text. So it's my recipes for my example. | |
It could be different customers who have commented back to you on your product or service. Corpus is actually that unstructured text that you're trying to handle, the body of that text which you're going to analyze. | |
And then the terms are going to be those words or those combinations of words that you care about that can help you answer a question. So let's talk about the story. You know, why did I come up with looking at chili recipes? Well, this comes back to | |
almost 25 years ago when I first moved to Texas and I came to work for a big company and like most big companies in Texas, they had a big Texas chili cook off. | |
And I was asked to participate. Everybody participated. But I was also asked to do even more than that, I was asked to be a judge. | |
So I should have turned down the judging. I thought it was an honor for the new guy, but I think I was the only one they could give it to, and I found out why. | |
But let's talk first about the reaction to the chili I brought. So I had a chili recipe that my family's always used and it came from my grandmother, Grandma Lillian. | |
This is not chili. This is not what Texans considered chili. | |
Now the good news was I'd have to bring it home because I enjoy eating it. They considered it something else besides chili. Their chili didn't have beans in it. Their chili was not hearty. It was very hot, very soupy. | |
Mine had beef. Theirs had mostly pork. They had all in. Here's the other thing that got me in trouble. Not only did they not like my grandma's chili, I almost didn't survive the judging | |
because the real badge of honor of a real Texan is to make the hottest bowl of chili. So you want to beat your neighbor. You want to beat your coworker. So they were throwing all kinds of ungodly hot chiles and spices in here. | |
And I just about put a hole in my stomach just trying to taste it, and you're drinking all this milk, trying to put the heat out. So | |
it was a baptism by fire. So my recommendation is unless you like the heat, don't enter in as a judge, right. Not a good idea. But I wanted to place; that always bothered me. | |
What, why didn't my grandma's chili do well? Are there really different types of chili? Where do they come from? | |
And it turns out chili's got a really cool history. You can see some actually really cool history blogs and papers on it and it most likely came out of | |
Central America, mostly Mexico. And there are some like dishes, but in San Antonio it was first observed being sold in kind of the state we would know now as chili, | |
on the old San Antonio square there by the Alamo. And it was used on cattle drives and it started to get popular and then it worked its way up the middle of the country all up through the Midwest. | |
And I was told that, you know, many food innovations got created at the St. Louis World's Fair, you know, really took off when they were shown, this was one of them. Chili made it into the | |
St. Louis World's Fair and it really got popular. | |
So there's different varieties. And so the idea was if I...could I use text exploration on ingredients and recipes? Just take the whole recipe and dump it in, see what happens. | |
And what do I want to compare it to? So I looked up what the traditional regions were; we had several, all the way from Texas up to Michigan. | |
So in Texas, we've got different varieties. Texas bowl red is the one I was tasting. That's chili con carne. A | |
Frito pie is something you'll get at a football game, but very popular. Louisiana has their own version. New Mexico, with green chilies and chicken, has its own version. Oklahoma, Kansas City, Springfield, | |
Illinois, Cincinnati with the Skyline chili; they put it over spaghetti. Michigan. Coney Island, the hot dogs, you know, the chili sauce they put on there is serious chili. | |
White chili, unknown where it comes from. Vegetarian chili. But there's a lot of styles that are out there now. | |
So I said, if I took three recipes from each of these styles and I compared them to my control, my grandma's chili, can I find what I'm looking for? | |
So we're going to do this and I'm going to show you, as I walk through the steps, I'm going to show you a summary of the steps then I'll go right behind and show you how I do it in JMP. | |
First step's going to be summarizing the data. We want to find those words we care about. | |
Now what happens when you enter in any of this text analytics, this text data, like my ingredients, into a text explorer, it's going to run it through a library | |
of regular expressions, and think about the word like regular, just things that are just part of everyday speech. | |
And they're not that helpful. It's not helpful if I have "the" and "and" in my list. So it tries to help pull out words you don't care about. And yes, it can be customized. | |
And it's got a pretty strong one built in already, and that's what I used. And then stemming. Stemming is where you go and | |
say how you want to treat like words -- so "dice" is "dice," "dicing," you've got plurals, different versions of the root word. | |
Do you want to just count it all in the root word or do you want it separated? So that's a consideration. | |
And then after we summarize, we get these words. You can see I've got a little list here of words. The one behind is unedited and then the one in the front | |
is actually one where we've gone through and kind of sized down that word list. So what you do is, you look for words you don't care about, you call them stop words, basically say, remove these. Add them back to my library of things I don't care about. | |
You can also bring over phrases that really matter. And once you have these, we are going to be able to visualize some. | |
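To make the summarizing and preparation steps concrete, here is a minimal Python sketch. It is not JMP's Text Explorer and does not use the talk's data: the two ingredient lists, the stop word list, the phrase list, and the deliberately naive plural-stripping `stem` function are all hypothetical choices made for illustration.

```python
import re
from collections import Counter

# Hypothetical inputs: made-up ingredient text, stop words, and phrases.
RECIPES = [
    "2 pounds ground beef, 1 onion diced, 2 tablespoons chili powder, 1 can kidney beans",
    "1 pound pork, 2 onions, 3 dried red chiles, 1 teaspoon cumin, 1 teaspoon salt",
]
STOP_WORDS = {"pound", "tablespoon", "teaspoon", "can", "diced", "dried"}
PHRASES = ["chili powder", "kidney beans", "ground beef"]


def stem(word):
    """Naive stand-in for stemming: strip a trailing plural 's'."""
    if word.endswith("s") and len(word) > 3:
        return word[:-1]
    return word


def terms_in(text):
    """Tokenize one document, keep phrases together, stem, drop stop words."""
    text = text.lower()
    for phrase in PHRASES:
        # Join each phrase with an underscore so it survives as one term.
        text = text.replace(phrase, phrase.replace(" ", "_"))
    tokens = re.findall(r"[a-z_]+", text)  # numbers and punctuation are dropped
    stems = [stem(token) for token in tokens]
    return [s for s in stems if s not in STOP_WORDS]


def term_counts(documents):
    """Count how often each kept term occurs across all documents."""
    counts = Counter()
    for doc in documents:
        counts.update(terms_in(doc))
    return counts


if __name__ == "__main__":
    for term, count in term_counts(RECIPES).most_common():
        print(term, count)
```

Adding a word to `STOP_WORDS` or a phrase to `PHRASES` and rerunning mirrors the loop of trimming the list until only the terms you care about remain.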
And we're going to be able to see, in my case, I got really interested in ingredients. So we're going to be able to see what ingredients came out the most often. | |
And you often see word clouds here, and word clouds, you know, the bigger the word, the more frequent it is. | |
And you can look at it in a cloud, as just a sporadic layout, or you can look at it in order where, you know, the first thing up there is the biggest word. That's like the | |
graph that's in front of your slide. And after that, we are going to do something with it. And so we can create some basic models. | |
And one of the easiest ways to do that is, once we know what words on our list that we care about, | |
we can add it back to our data table. It's called saving out the document term matrix. In this case, a simple way of doing it is just binaries. So I've got a separate column here, you can see to my right, where | |
"onion" has a one where it's in that row's, you know, recipe. It has zero if it's not, so you can get zeros and ones here in columns. And you can see what's the most important. | |
And a lot of times you're trying to model for something like, if I was stuck, | |
maybe I had the judging scores by this and I can say, well, here's a numeric score tied to each recipe. I want to see what ingredients | |
are common that result in a high score. That might be something I'm doing, so try to do it a little more predictive. | |
But in this case, I'm kind of going to look at grouping. I'm really interested in grouping, kind of, like recipes together and seeing where my grandma's chili falls. So I'll show you how we're going to do that. But let's first go to JMP and show you how we do these steps in JMP. | |
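As a rough illustration of the binary document term matrix described above, the following Python sketch builds one column per term and puts a 1 where the recipe contains that term and a 0 where it does not. The recipe names and term sets are hypothetical placeholders, not the data from the talk, and `binary_dtm` is a name invented for this example.

```python
def binary_dtm(recipes):
    """Return (vocab, rows); rows[name][j] is 1 if vocab[j] is in that recipe."""
    vocab = sorted({term for terms in recipes.values() for term in terms})
    rows = {
        name: [1 if term in terms else 0 for term in vocab]
        for name, terms in recipes.items()
    }
    return vocab, rows


# Hypothetical, already-cleaned term sets (one per document).
recipes = {
    "recipe_a": {"onion", "chili powder", "tomato"},
    "recipe_b": {"onion", "green chile", "chicken broth"},
    "recipe_c": {"onion", "chili powder", "kidney beans", "tomato"},
}

vocab, rows = binary_dtm(recipes)

if __name__ == "__main__":
    print(vocab)
    for name, row in rows.items():
        print(name, row)
```

Each row is one document and each column is one term, which is the shape a clustering or regression routine would expect.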
So here is my raw data. | |
Every document here, every row, has got my unstructured text, and in my case it's just the raw ingredients. | |
And if I click on any one of these cells here, you can see it is just literally the copied in ingredients. So there's the one...the first one for Cajun chili from Louisiana. | |
It's got words I care about, like "tomato" and "chili powder" and "honey." It's got words I don't care about like "can." | |
It's probably ingredient measures I don't care about so much, like "one," "two pounds," "teaspoons." So how are we going to take care of this? So what we're going to do | |
is we're going to go to Analyze and we're going to go to Text Explorer and we're just going to put those ingredients up into the text column. | |
I'm going to ask for stemming, to stem all terms. I find that very helpful. And then I'm going to use the built-in regular expression. | |
I say okay and now here is my initial list. | |
So what I can do now is I can go and select those things I don't care about. I don't care about numbers. So maybe I can go in here and highlight them, right click, and say add a stop word, and then it gets added to the list of things I don't care about. | |
What about "chili powder"? It sounds like something that needs to be on its own. So I'm going to right click on that phrase. And I'm going to say add phrase and it adds it in. | |
So you go and you do this until you get a streamlined sized-down list. | |
I'm just going to run my finished list by it. And here's all the words that came off the regular expression that were found, and also things I added and stop words. And now, here is my finished term and phrase list. And so I've added these phrases I care about. So "onion" came out the highest, then "salt," then "cumin," then "chili powder." | |
As you can guess, this would be really good to visualize and we do have a word cloud. So here is the word cloud for everything in this one. And again under my red triangle options here, I can change that to an order to make it crisper on what comes out. | |
If I keep it centered, something that's fun to do. You know you can add filters to your data and something you can do is, sometimes you can find your answer visually without having to do anything else. And so in that case, I'm going to go into this to my | |
red triangle. I'm going to get a local data filter. I'm just going to look at the type and I'm going to say, well, let's take a look at | |
what Grandma Lillian's chili looks like, you know "tomatoes" and "kidney beans" and "beef," that type of thing. And how would that compare to the Cajun chili? | |
Well they shared chili powder and beef, but, you know, there might be some different things on there. You know, how that compared to the chili verde, | |
you know, which is more of the green chili, you know. And they've got raw chilies in there and jalapenos, chicken stock, all that type of thing, chicken broth. | |
So this is really interesting, but probably not enough for me to figure out what's going on. So I did go | |
(and this is another...you've got all the options here under your red triangle) I did go make sure (and I probably need to make sure here) that I turn off my local data filter, make sure everything is selected, that I'm looking at all my terms. | |
I got 299 here. I'm going to right click and I'm going to say "save document term matrix." | |
And when I do that, it asks me what kind of way I want to save it, with weights, what kind of weighting. It's basically binary, there's frequencies I can use, how many terms, you know, the minimum term frequency to actually get a place in your data table. And I have already done that. | |
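The save options mentioned here (frequency weighting instead of binary, plus a minimum term frequency) can be sketched in Python as follows. This is only an analogue of the idea, not JMP's dialog: `frequency_dtm` and its `min_term_freq` parameter are hypothetical names, and the sketch assumes the minimum is applied to a term's total count across all documents, which may differ from how JMP defines it.

```python
from collections import Counter


def frequency_dtm(docs, min_term_freq=1):
    """Build a count-weighted document term matrix.

    docs: list of term lists, one per document.
    Terms whose total count is below min_term_freq get no column.
    """
    totals = Counter(term for terms in docs for term in terms)
    vocab = sorted(term for term, n in totals.items() if n >= min_term_freq)
    matrix = []
    for terms in docs:
        counts = Counter(terms)
        matrix.append([counts[term] for term in vocab])
    return vocab, matrix


# Hypothetical cleaned documents; "honey" appears only once overall.
docs = [
    ["onion", "onion", "salt", "honey"],
    ["onion", "salt", "cumin"],
    ["onion", "cumin"],
]

vocab, matrix = frequency_dtm(docs, min_term_freq=2)

if __name__ == "__main__":
    print(vocab)
    for row in matrix:
        print(row)
```

Raising `min_term_freq` drops rare terms and keeps the saved table narrow; lowering it keeps every term at the cost of many sparse columns.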
So if I slide across and look... Actually, I'll go ahead and do that and show you what it looks like. So I'll just say "save document term matrix" and say "okay." | |
And now, as I showed you on the slide, now I get all the terms I care about. And there's that first one for "onion." Here's the one for "chili powder," and as it relates to their respective | |
recipe, you know, rows. | |
So I know where there's a one, I know this Cajun chili had chili powder. | |
When there's a zero, this one here looks like Oklahoman chili, or no, I'm sorry, New Mexican chili did not have chili powder, so that's just how that works. And this can be used in modeling. So you can go to Analyze, Fit Model in JMP and actually apply this to some type of model. | |
But what I'm going to do is, since I'm not really trying to predict, you know, what ingredients will give me a higher score. I don't have any like, you know, | |
output data here. I really want to group them together and I heard in JMP Pro, I can actually do this. So I'm going to go right back to my slides here. | |
And I'm going to talk about analyzing the data. So to do further analysis | |
in JMP Pro, it enables you to do some really good grouping techniques and these are multivariate methods and they're specialized for handling text analytics and working with those document term matrices about text. | |
And it uses something called latent class analysis. It's one of the terms. And this is similar to the principal components, if you're used to doing that technique. But basically it's going to | |
ask us how many groupings or clusters of data do we want to look at. It's going to look across that multidimensional space | |
between everything that's in those columns and your document term matrix for the important terms in your model here, the important terms we got on the word | |
list, right? And it's going to group them. And in my case, I was able to get it down to three groups. | |
So there's a cluster one, which seemed to have a lot of chili recipes with ground beef, tomato sauce, chili powder and beans. | |
There's a cluster two, which had a lot to do with chicken and green chilies, raw chilies here. | |
And then there's cluster three, which had a lot of chilies again, but they were more of the red chilies and they were kind of pork based and this made a lot of sense. | |
Okay, so when I created these clusters, I was able to use a cluster probability by row, this kind of gave me how strong those individual | |
recipes, my rows in my document, right, these original...my control recipe, where did they fall and how strongly did they belong. Why did I assign them into whatever cluster? And when I did this, | |
22 was my grandma's cluster...was my grandma's control. And I found that she clustered in cluster one along with some other recipes, including those that came from Kansas City, Missouri. | |
There is one on number 24 which was very close. Now the Texas recipes for cluster three, they had hot chilis, spice...a lot of spice in them, and often pork and no beans, right? Beans were something that showed up in cluster one, but not in cluster three and then... | |
The cluster two was more for, you know, green chili, | |
more for those things you see in New Mexico, you know, chicken-based chili, things with green chili. | |
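For readers who want to see the idea behind the clustering and the cluster probabilities by row, here is a minimal Python sketch of a latent class model fitted with the EM algorithm on a binary document term matrix. It is a textbook-style sketch under stated assumptions, not JMP Pro's implementation: the toy matrix, the function name `fit_latent_classes`, the deterministic starting values, and the add-one smoothing are all choices made for this example.

```python
import math


def fit_latent_classes(rows, k, n_iter=50):
    """Fit k latent classes to 0/1 rows with EM.

    Returns (probs, labels): probs[i][c] is the probability that row i
    belongs to class c, and labels[i] is the most probable class.
    """
    n, m = len(rows), len(rows[0])
    # Assumed start: seed each class from an evenly spaced row, softened
    # so that no term probability is exactly 0 or 1.
    theta = [[0.25 + 0.5 * x for x in rows[c * n // k]] for c in range(k)]
    prior = [1.0 / k] * k
    probs = []
    for _ in range(n_iter):
        # E-step: class probabilities for every row, computed in log space.
        probs = []
        for row in rows:
            log_post = []
            for c in range(k):
                total = math.log(prior[c])
                for j, x in enumerate(row):
                    total += math.log(theta[c][j] if x else 1.0 - theta[c][j])
                log_post.append(total)
            top = max(log_post)
            weights = [math.exp(v - top) for v in log_post]
            norm = sum(weights)
            probs.append([w / norm for w in weights])
        # M-step: update class sizes and per-class term probabilities,
        # with add-one smoothing to keep every probability inside (0, 1).
        for c in range(k):
            size = sum(p[c] for p in probs)
            prior[c] = (size + 1.0) / (n + k)
            theta[c] = [
                (sum(p[c] * row[j] for p, row in zip(probs, rows)) + 1.0)
                / (size + 2.0)
                for j in range(m)
            ]
    labels = [max(range(k), key=lambda c: p[c]) for p in probs]
    return probs, labels


# Hypothetical binary matrix. Columns: beans, ground beef, tomato sauce,
# chicken, green chile, pork, red chile.
DTM = [
    [1, 1, 1, 0, 0, 0, 0],
    [1, 1, 1, 0, 0, 0, 1],
    [0, 0, 0, 1, 1, 0, 0],
    [1, 0, 0, 1, 1, 0, 0],
    [0, 0, 0, 0, 0, 1, 1],
    [0, 0, 1, 0, 0, 1, 1],
]

if __name__ == "__main__":
    probs, labels = fit_latent_classes(DTM, k=3)
    for i, (p, label) in enumerate(zip(probs, labels), start=1):
        print(i, label, [round(v, 3) for v in p])
```

Reading across one row of `probs` is the analogue of the cluster probability by row: a value near 1 means the recipe sits firmly in that cluster, while a more even split means it sits between clusters.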
Alright, so what happened was I was able to make the match, and I found a recipe. And one of the three representative ones from Missouri | |
that actually was called Kansas City chili, and it almost matched exactly Grandma Lillian's chili. | |
So when I asked my mother about this, I said, "Well, why could this happen. I didn't think it came from Missouri." And she said, "All this makes sense." She said, "Grandma Lillian, | |
she grew up on a farm in St. Joseph, Missouri, and she was the only girl and she had like 11-12 brothers. So she did a lot of the cooking." | |
By the way, she was the only one to get a college education and so she was quite progressive for the time that she lived and was one of my favorite relatives, but her recipe was very indicative of this. So let's show you what this looks like live. | |
So if I go to...at this point, I go back to the data I had made and under that hotspot, I'm going to ask for these additional models that JMP Pro can give. There's latent class analysis, which clusters documents using that method, | |
based on the binary weighted document term matrix. So it does use that document term matrix, yeah, so you don't even have to save it out, it automatically generates it. | |
There's also a latent semantic analysis, which does...which does a little more math, a little more advanced method, but both of these are basically doing the same thing, and I particularly liked this latent class analysis. So that's the one I selected. | |
And I asked for three clusters, you can play with it to see if it makes a difference. And I did try more clusters and I went back to three. | |
And within its options, you can look at the cluster probabilities by row. | |
And of all the output, this is the one that made the most sense. So remember, back to my slide, this helped me look at where my grandma's chili fell, | |
which was 22, row 22. And then what else it combined well with. And so that's how I was able to do this analysis. | |
It's that simple. | |
So, that was a really quick run through the capabilities of doing text exploration in JMP and then how I was able to use JMP Pro and | |
find these clusters and place my grandma's chili and find a matching recipe. So if you're hungry for more, I do have a link to the blog | |
in the presentation that you can go click on or you can just go right to the Community and you can just type in "grandma's chili" and you can find that blog. And along with that, we will give you as well | |
the recipe. So you too can make Grandma Lillian's chili. | |
So we appreciate being able to show this to you today. | |
Please be sure to leave any questions in the Q&A that we can answer. And try this, try something | |
that you have around you at work, at home, wherever that has some unstructured text data, where you would like to explore and ask the question, and you'll find it's a fantastic, fantastic method, very powerful and really helps you attack that third dimension of data. |