A Walk Through Text Explorer to Understand Customers’ Needs (2021-EU-30MP-748)

Level: Intermediate

Dr. Kira Alhorn, Statistician, W. L. Gore & Associates GmbH

How can we understand our end users better and make better products? On their web pages, sellers provide the end users with descriptions of our products. Could those descriptions provide us with potential information about most wanted product features?

In this talk we present how we used product descriptions for shoes in web shops to analyze what properties end users expect and need in different types of shoes. More explicitly, we used the JMP Text Explorer platform to process the data and to visualize the most important shoe features via word clouds. Using the advanced features of JMP Text Explorer, we can analyze the data set for topics in the shoe descriptions. We compare those topics to market categories of shoes. This enables us to find a meaningful number of topics that serve as “prototypes” for shoe descriptions in certain market categories. In order to complement the topic analysis, which can be viewed as an unsupervised technique, we explore and apply supervised predictive modelling techniques for predicting the market category based on the findings from Text Explorer. More explicitly, we use a bootstrap forest to extract the most important terms and predict a market category based on a shoe description.

Auto-generated transcript...

Speaker	Transcript
Kira Alhorn	Okay, so welcome to my talk, A Walk Through the Text Explorer to Understand Customers' Needs. I'm very happy to
	present at this JMP Discovery Summit in virtual format. My name is Kira Alhorn and I am a statistician at W. L. Gore & Associates.
	And, before starting with the presentation itself, I would quickly like to introduce you to our company. W. L. Gore & Associates is a global materials science company. We produce
	very different solutions for very different industries, ranging from aerospace to medical products, and what might be familiar to most of you
	is our Gore-tex brand. So that is what we will focus on today. The Gore-tex brand produces durable, waterproof, and breathable fabric laminates.
	And those laminates are sold to our customers or brand partners who make products out of that. More specifically, we will focus today on shoes.
	So this is a short overview about my talk. I will start with the background and motivation. So what are we looking at? We're looking at shoe descriptions and why are we interested in that.
	Afterwards the rest of the talk will be done in JMP. I will use JMP 15 Pro, and I will show you there how to use the Text Explorer
	for analyzing our shoe descriptions. How to find topics in those texts, or in the shoe descriptions, so those topics can be used to find common shoe descriptions or common topics in the shoe descriptions and compare that to different categories of shoes that we have.
	And finally, and this will only be a very brief introduction, we use some predictive models to validate our findings and to get a deeper look into how the different shoe descriptions categorize the shoes.
	So just imagine your own cupboard and the shoes that you have in your cupboard.
	And every one of you will have at least a small or a rather large collection of different shoes in his or her cupboard, and the question is, why do you have so many shoes?
	And that your shoes have to fit your style, and that maybe you need different colors and so on, is not the only reason why you have different shoes.
	The other reason is that you need different shoes. So just imagine, you have one pair of running shoes, you have several pairs of shoes for casual end use and you might have hiking shoes.
	And you have those because each of the shoes has different or slightly different features that make this shoe usable for this end use, and you require those features for that end use.
	So what we were wondering is, if we are looking at shoe descriptions from our brand partners, can we somehow find what makes different shoes unique? So you will see here a collection
	of different shoe categories, which for us correspond to different end uses of shoes.
	And the question is now, if a customer wants to produce a new hiking shoe, for example, what feature configuration should we offer in our laminate that we sell for a shoe that is intended to be used for hiking?
	So can those shoe descriptions give us a prototype that we could offer to our customers?
	And, of course, another aspect that is interesting here is, how do other shoes differ? So what is the difference between that hiking shoe and the street shoe? Why do we need those differences?
	And that was our intention when we started analyzing the data.
	So with that, I would like to go to JMP.
	And in JMP, I first want to show you what the data that we collected looks like.
	So what we did was we simply visited the web pages of our brand partners and we collected shoe descriptions from those pages.
	For each category of shoe we have a certain number of shoe descriptions and you will see, we simply copied the whole description into our data table.
	So this might be rather long, it might be rather short but we did not do more than simply copying.
	But on the other hand, you can imagine, with that text field, we cannot do a lot.
	Or at least it seems like we cannot, because JMP offers great possibilities to do something. And what does that mean? We want to extract information out of that long text and turn it into somehow more digestible chunks.
	And those digestible chunks
	can be extracted using the Text Explorer. So I want to go to the Text Explorer and put the shoe description into the text columns here.
	And here are a lot of options. Those options feed into the algorithms that cut up our shoe descriptions, or documents, as the descriptions are called in text analysis language.
	And this is cutting the documents into smaller pieces of information, as I mentioned, those digestible chunks.
	And what you will see are digestible chunks, namely, we have those terms, which are the smallest unit of information, and we have those phrases that are terms that often occur together and might have a meaning if you combine them.
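The split into terms and phrases described above can be sketched outside JMP. This is a minimal Python illustration, not what the Text Explorer does internally: the two example descriptions are invented, and treating every pair of adjacent terms as a candidate phrase is a simplifying assumption.

```python
import re
from collections import Counter

# Hypothetical example descriptions (not taken from the talk's data set).
documents = [
    "Lightweight Gore tex trail shoe with great comfort",
    "Warm winter boot with Gore tex lining and comfort fit",
]

def tokenize(text):
    """Lowercase a document and split it into alphabetic terms."""
    return re.findall(r"[a-z]+", text.lower())

term_counts = Counter()
phrase_counts = Counter()
for doc in documents:
    tokens = tokenize(doc)
    term_counts.update(tokens)
    # Assumption: a candidate phrase is simply two adjacent terms (a bigram).
    phrase_counts.update(zip(tokens, tokens[1:]))

print(term_counts.most_common(3))
print(phrase_counts.most_common(1))
```

In this toy input, "gore" and "tex" only ever occur next to each other, which is the kind of pattern that suggests promoting a phrase to a single term.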
	So you can see here the most commonly used term
	in all the documents, so in all shoe descriptions, is the term comfort.
	So first what we did was we went through all the phrases here and had a look. Are there any that make sense?
	And the first one already makes sense. So you see here in the terms Gore and tex appear as single terms, but of course, it makes sense that the term Gore-tex should be analyzed and not Gore and tex.
	So you can add all phrases that do make sense to the terms list. So we add the phrase, and you will see here now, we have the term Gore-tex instead of the single term Gore.
	So that was step one of our processing of the data. Step two was to check the term list if there are any duplicates or
	if there are any terms that do not make sense. And JMP already has really great processing, because JMP already excludes a lot of words that do not make sense, such as and, a,
	or, which do not give you any information. However, there will always be some terms left that do not make sense, such as this S, for example.
	What would you do with an S? What does that tell you about a shoe? So we had to go through all of the terms, and if there is a term that doesn't make sense, we call that a stop word, and we added it to the list of stop words to exclude it from the analysis.
	That was the one thing. The other thing I mentioned was, we need to check if there are duplicates and you will see here, for example, we have the term Gore-tex
	and the term GTX. And GTX is only an abbreviation of the term Gore-tex, so we should not analyze both of them separately, we should combine them.
	So you have the Recode function, which is pretty much the same as for columns, and this is a great tool for combining those
	??? data. So you see here is the function Recode, and I group both of them.
	And you will see now that this is only a single term now (Gore-tex), and this is actually the most frequent term in our data.
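The two cleaning steps, removing stop words and recoding duplicates, can be sketched in a few lines of Python. This is an illustration under assumptions, not JMP's Recode implementation: the token lists, the stop word set, and the `recode` mapping are hypothetical examples.

```python
from collections import Counter

# Hypothetical token lists, as they might look after phrases were added as terms.
docs_tokens = [
    ["gore-tex", "s", "comfort", "and", "lightweight"],
    ["gtx", "warm", "winter", "comfort"],
]

stop_words = {"s", "and", "a", "or"}   # terms that carry no information
recode = {"gtx": "gore-tex"}           # abbreviation -> canonical term

def clean(tokens):
    """Map each token to its canonical form, then drop stop words."""
    recoded = (recode.get(t, t) for t in tokens)
    return [t for t in recoded if t not in stop_words]

cleaned = [clean(tokens) for tokens in docs_tokens]
counts = Counter(t for tokens in cleaned for t in tokens)
print(counts.most_common())
```

Recoding before counting is what merges the two spellings into one term, so the combined frequency is counted under a single entry.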
	So.
	We started with 4,000 terms and what we ended up with after all the cleaning were only 100 terms. So you see, we have 100 terms left and with those, we can do a lot of analysis. Those are very useful for us to get a sense of what a shoe description is about.
	So you can use that list to see what are the most frequent terms, but JMP also offers a very easy visualization tool.
	Clicking on the red triangle, you can add a word cloud. And this word cloud is giving you, by the size of the word, its frequency in all documents, so in all shoe descriptions. So Gore-tex, comfort, and lightweight are the most commonly used terms in the shoe descriptions that we have collected.
	But you might remember this is not exactly what we were wondering about, right? So we were wondering, what about shoes for different end uses?
	And for that we can easily add the local data filter here, and use the end use of the shoe as a filter. So I could be, for example, interested in what are the most frequent terms in all of the descriptions of street winter shoes.
	And you will see no wonder, the terms winter and warm appear frequently in the descriptions of street winter shoes.
	On the other hand, taking, for example, the trail running shoes, you see a different emphasis here. You will see those don't need to be warm, but they need to be breathable, lightweight and fit well.
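The effect of filtering the word cloud by end use can be mimicked by counting terms separately per category. The following Python sketch assumes already cleaned terms; the rows and the helper `top_terms` are invented for illustration.

```python
from collections import Counter, defaultdict

# Hypothetical (category, cleaned terms) pairs, one per shoe description.
rows = [
    ("street winter", ["winter", "warm", "gore-tex", "comfort"]),
    ("street winter", ["warm", "winter", "lining"]),
    ("trail running", ["breathable", "lightweight", "fit", "gore-tex"]),
    ("trail running", ["lightweight", "breathable", "grip"]),
]

# One term counter per category, analogous to filtering by end use.
counts_by_category = defaultdict(Counter)
for category, terms in rows:
    counts_by_category[category].update(terms)

def top_terms(category, n=3):
    """Return the n most frequent terms for one category."""
    return [term for term, _ in counts_by_category[category].most_common(n)]

for category in counts_by_category:
    print(category, top_terms(category))
```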
	So these word clouds are very, very good for getting a first grip on our initial question, that is, what describes the feature configuration of different shoe types.
	And we did not end with that. We wanted to go one step further.
	The one step further is: if you have already prepared all of your data, can you somehow do modeling to find this prototype of shoe configurations? And for that modeling, there are different options in JMP and we decided to take the SVD, the singular value decomposition.
	SVD is performing a principal component analysis of our document term matrix.
	And what does that mean? The document term matrix includes, for each document, so for each shoe description,
	the frequency of each term. So it would count, for example, for the first shoe description, how often do we have the term Gore-tex, how often do we have the term comfort, and so on and so forth. So you can imagine that is a very huge matrix.
	And the idea is now, through principal component analysis, which is reducing the dimension of the document term matrix, to get a better grip on your data.
	And to be better able to visualize the data.
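The idea of building a document term matrix and reducing its dimension can be sketched with NumPy. This is an illustration under stated assumptions, not JMP's implementation: the documents are invented, the matrix holds raw counts, and the columns are centered before the decomposition so that the SVD corresponds to a principal component analysis. The Text Explorer offers its own weighting and scaling options, which are not reproduced here.

```python
import numpy as np

# Hypothetical cleaned documents (one list of terms per shoe description).
docs = [
    ["gore-tex", "warm", "winter", "warm"],
    ["gore-tex", "winter", "warm"],
    ["gore-tex", "lightweight", "breathable"],
    ["lightweight", "breathable", "breathable"],
]

vocab = sorted({t for d in docs for t in d})
col = {t: j for j, t in enumerate(vocab)}

# Document term matrix: rows are documents, columns are terms, entries are counts.
dtm = np.zeros((len(docs), len(vocab)))
for i, d in enumerate(docs):
    for t in d:
        dtm[i, col[t]] += 1

# Assumption: center each column so the SVD matches a PCA of the matrix.
centered = dtm - dtm.mean(axis=0)
U, s, Vt = np.linalg.svd(centered, full_matrices=False)

k = 2
scores = U[:, :k] * s[:k]   # document coordinates in k dimensions
loadings = Vt[:k]           # term weights for each dimension
print(np.round(scores, 2))
```

Each document is now described by `k` numbers instead of one count per term, which is what makes plotting and further modeling practical.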
	And based on that reduction of the dimensions, we can now do some modeling. And what kind of modeling is interesting for us? So we were interested in what describes a shoe for a certain end use.
	And to get to that point, we analyzed, are there any topics in the shoe descriptions? And if there are topics, are those topics corresponding to certain end uses?
	So, in an ideal world, we would find nine topics in our data and each of those nine topics would represent exactly one shoe category. So we would have a prototype shoe configuration or feature configuration for each shoe category.
	So let's see what's happening if I search for nine topics. I will click here on the topic analysis and enter a 9, since I am interested in nine different shoe categories.
	And what I get then is a list of topics with the terms that occur frequently in those topics. So topic one, for example, includes shoes
	that have in their shoe descriptions the terms superfit, textile, warm and so on. So my first gut feel here would be these have to be street or street winter shoes.
	Topic two, for example, consists of shoe descriptions that include the terms hut, trekking, multiday and so on, so this seems to be like trekking shoes.
	But of course as a statistician, I'm not happy with only giving a gut feel. I really want to know what category of shoe belongs to which topic. And for that, I am doing a simple distribution of the shoe categories and I will add the local data filter here to show only shoes that are selected.
	What does that mean? How can I use that? JMP offers here the topic scores plots and these topic scores plots
	have one point for each shoe description for each topic.
	So the higher the score for a shoe description is, the more representative this shoe is of that certain topic.
	So what I do is, I simply select all of those shoes that have high scores for topic one and I will see in that
	histogram here that those are street and street winter shoes. So basically my gut feel was right. Those shoes that are made of textile, that have superfit, those are the street and the street winter shoes.
	And topic two, remember, we expected to be trekking shoes. You will see in that histogram here, those are trekking shoes.
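The check described above, selecting the descriptions with high scores for a topic and looking at the distribution of their categories, can be sketched in Python. All scores and categories below are invented for illustration, and the threshold of 1.5 is an arbitrary assumption; in the talk the selection is done interactively in the topic scores plots.

```python
from collections import Counter

# Hypothetical topic scores: one score per description per topic.
# All values and categories are invented for illustration only.
descriptions = [
    {"category": "street",        "scores": {"topic 1": 2.1, "topic 2": -0.3}},
    {"category": "street winter", "scores": {"topic 1": 1.8, "topic 2": 0.1}},
    {"category": "trekking",      "scores": {"topic 1": -0.2, "topic 2": 2.4}},
    {"category": "trekking",      "scores": {"topic 1": 0.0, "topic 2": 1.9}},
    {"category": "road running",  "scores": {"topic 1": 0.2, "topic 2": 0.3}},
]

def categories_for_topic(topic, threshold=1.5):
    """Count the categories of all descriptions scoring above the threshold."""
    selected = [d["category"] for d in descriptions
                if d["scores"][topic] > threshold]
    return Counter(selected)

print(categories_for_topic("topic 1"))
print(categories_for_topic("topic 2"))
```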
	And you can do that now for all of the topics, and I put that into the journal.
	And you will see some difficulties doing all of that. So assigning one or more shoe categories to each topic, you will see two problems.
	Problem number one, represented here, you would see there are certain shoe categories, such as mountaineering, that are not represented by one topic, but by several.
	And you will also see that those topics are overlapping, so the word crampon, for example, is occurring in both topics.
	Which means, it seems like we have too many topics here.
	So, ideally, you want to have one topic per shoe category, and if we see there are topics that are overlapping, that correspond to multiple categories, it seems like nine is too many.
	But this, of course, has to have another ???, so if the mountaineering shoes correspond to more than one topic, there need to be topics that are not that clearly identifiable.
	So look, for example, at this topic number five, and select the most representative shoes. You will see this is a huge collection.
	There are progressive shoes, road running shoes, street shoes, and so on. So those are mostly categories that are kind of overlapping and not differentiating from each other. So just imagine where's the difference between the trail running shoe and the road running shoe?
	Or then go to the next step, what is the difference between trail running and hiking?
	There are certain overlaps in those shoes and if there are those overlaps, you cannot clearly identify the one feature configuration for that one shoe category.
	So that is also a very valuable learning. It is, of course, about managing your portfolio of different laminates. You want to know, do I need to have a specific laminate for that specific end use or is that rather similar to other laminates that I already have?
	So what we ended up doing was, we wanted to have a final collection of topics where we do not have any topics corresponding to more than one category, and we did that by hand. So we analyzed different numbers of topics and we ended up having five.
	Those five topics, you will see, correspond to a certain number of categories, but we will see that not all shoe categories can be represented by them.
	So we had two learnings. Learning number one was we have some prototypes of shoe descriptions for those
	shoe categories. And on the other hand, all shoe categories not named here might not be very differentiated or might not have this one single configuration
	in their shoe description.
	And that was only one option that the Text Explorer is offering. There are other options as well, but we focused on this and it worked pretty well.
	And with that, I would like to leave Text Explorer and give you only a brief outlook on what you could further do with the topic document term matrix.
	So here, I have the same data set as before, but I have exported the document term matrix. So you see a lot of new columns that show for each term
	the frequency of that term in a certain shoe description.
	And we tried different versions of that, so we had a document term matrix indicating frequencies; we had one indicating only whether a term is contained or not; and so on. And we ended up with that version.
	And using the possibilities of JMP Pro, we created a validation column to split our data set into training, validation, and test data. The training and validation data are used to train our predictive model and the test data is used for testing.
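A validation column of this kind can be sketched in plain Python. The talk does not state the split proportions, so the 60/20/20 fractions, the seed, and the function name `make_validation_column` are assumptions for illustration only.

```python
import random

def make_validation_column(n_rows, fractions=(0.6, 0.2, 0.2), seed=1):
    """Return a shuffled list of 'training'/'validation'/'test' labels.

    fractions gives the assumed shares for training and validation;
    whatever remains is assigned to the test set.
    """
    n_train = round(n_rows * fractions[0])
    n_valid = round(n_rows * fractions[1])
    labels = (["training"] * n_train
              + ["validation"] * n_valid
              + ["test"] * (n_rows - n_train - n_valid))
    # A seeded generator makes the random assignment reproducible.
    random.Random(seed).shuffle(labels)
    return labels

validation = make_validation_column(10)
print(validation)
```

Keeping the test rows out of model fitting entirely is what makes the later comparison on the test data an honest check.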
	And when talking about predictive modeling, what is our goal here? So what we did before was kind of an unsupervised technique because
	we just searched for topics and compared afterwards to the shoe categories. And now we want to be a bit more proactive, so we have the shoe description, can we predict a category explicitly?
	And we tried out different predictive modeling techniques that JMP 15 Pro offers and ended up with a bootstrap forest. And here I would like to give you only an idea of what the results could look like and what we could use the results for.
	So you will see the results in the confusion matrix for the test data set: what was the category of a certain shoe description, and what did I predict using my bootstrap forest?
	And, interestingly, you will see pretty much the same results as we saw before. So there are certain categories that can be clearly differentiated from others,
	that already have a unique feature configuration, and there are others that do not. So take a look, for example, at the road running shoes.
	We already discovered in our first analysis that those are kind of not uniquely defined. And we will see here as well, I predict a lot of shoes to be road running shoes that are in fact other shoes.
	So that means this confirms the findings we had before. We see exactly the shoe categories that were well described using that unsupervised technique are also well described here and those that are not well differentiated are also not that differentiated here.
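How a confusion matrix exposes a poorly differentiated category can be sketched in Python. The actual and predicted labels below are invented and do not come from the talk's bootstrap forest; the sketch only shows how the table is read, with `recall` and `precision` as hypothetical helper names.

```python
from collections import Counter

# Hypothetical actual and predicted categories for a small test set.
actual    = ["trekking", "trekking", "road running", "street", "trail running"]
predicted = ["trekking", "trekking", "road running", "road running", "road running"]

# Confusion matrix stored as counts of (actual, predicted) pairs.
confusion = Counter(zip(actual, predicted))

def recall(category):
    """Share of descriptions of a category that were predicted correctly."""
    total = sum(n for (a, _), n in confusion.items() if a == category)
    return confusion[(category, category)] / total

def precision(category):
    """Share of predictions of a category that were actually correct."""
    total = sum(n for (_, p), n in confusion.items() if p == category)
    return confusion[(category, category)] / total

print(recall("trekking"), precision("road running"))
```

A category that attracts many wrong predictions, like road running in this toy data, shows up as low precision even when its own descriptions are classified correctly.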
	So that was pretty interesting to see, and there are a lot of further things you could do now using this analysis. You could do, for example,
	an analysis of what are the most important problems when I want to differentiate certain shoes. So that is the column contribution: which terms contribute the most when differentiating shoe categories?
	Or you could further explore which shoe categories can be well differentiated, and which cannot, using the ROC curves.
	And last but not least, you could use this model for prediction. So you could enter here certain feature configurations or a shoe description
	and predict then the probability that this description belongs to a certain shoe category. In this case, for example, we would predict this shoe might be a road running shoe.
	Due to the time I will not go into further details, but would quickly like to summarize. So what were our key findings? The key findings were that
	the Text Explorer is a very great tool that is absolutely intuitive, and you could use it for quickly extracting information out of unstructured text data.
	where do they differentiate, where do they not? What is...what features are wanted in those shoes? But somehow, the findings confirmed what we expected.
	And this confirmation, we could do that using either unsupervised techniques or a supervised technique, so we compared the topic analysis using the SVD and a predictive model, here a bootstrap forest, and others might work similarly well.
	Finally, I would like to thank the group ??? who came up with that problem. They collected all of the data, cleaned all of the data, and had great discussions on
	the sense or nonsense of different techniques. So without them, you would not have listened to that talk now. And additionally,
	I would like to thank Jonas, a former colleague from TU Dortmund University, who consulted with me on text analysis.
	And finally, I would like to thank you for your attention and I'm happy to answer any questions, might be via email or in the JMP User Community. Just drop me a note.