Using JMP® Pro Text Explorer To Improve Patent Intelligence (2023-EU-30MP-1309)

Florent MATHIEU, Chemical Process Engineer, R&D, Michelin

Technology intelligence, especially patent intelligence, is a crucial step in the development of new technologies or new processes. This task is often tedious because it requires reading many patents. JMP Pro's text exploration and predictive modeling tools facilitate patent intelligence. First, patents must be identified as critical for a corpus of patents. The patent title, summary, and claims are analyzed with JMP's Text Explorer to associate each patent with a vector representing its lexical field (the Document Term Matrix). Next, a predictive modeling tool, such as neural networks, is used to predict the probability that a patent is critical. Once the machine learning is complete, it is possible to identify, in a fraction of a second, the relevant patents to be read within an extended patent corpus. Once implemented, this methodology accelerates the patent intelligence process, broadens the search scope and limits the risk of missing critical patents. This presentation demonstrates the process of developing this methodology for patent intelligence.

This is [inaudible 00:00:00], and I will show you how JMP Pro can help you with patent intelligence. Patent intelligence is very strategic for the development of new processes and new products. But this task is often tedious because reading patents is very hard work. Those who have already read a patent know how incredible a patent is.

The fewer patents you read, the better it is for you. This is exactly where machine learning can help you. Using JMP, we are trying to mimic Netflix. Netflix, based on what you have already seen, suggests what you will like. For patent intelligence, it's much the same thing. Based on which patents you have already read, we will try to predict which patents you have to read.

Before I show you how to do that with JMP, I just want to introduce the case study that I work on with patent intelligence at Michelin, and then we can go. Michelin is a well-known tire manufacturer that aims at producing best-in-class performance tires. But now the new [inaudible 00:01:25] mission is to produce the best-performing tires, but this time with sustainable materials only. A tire is made of many types of materials, such as natural rubber, plasticizers, fillers, synthetic rubber, and so on.

For now, almost none of these raw materials are sustainable. They all come from [inaudible 00:01:56] resources. Our mission is to develop new chemical processes that will produce these new sustainable materials, and new processes mean patent intelligence. The case study that we will talk about today is about producing recycled styrene, because styrene is used in synthetic rubber.

To recycle styrene, what we do is collect styrene waste. Then we have to precondition the styrene to remove all the [inaudible 00:02:42]. Then we go through this process, which is developed with [inaudible 00:02:48] Pyrowave. The whole process is composed of three main parts. The first is the liquefaction of the styrene and the removal of the [inaudible 00:03:02]. Then we put the styrene in an industrial microwave. This microwave is not the kind that you have in your kitchen. I mean, it's an enormous microwave. This microwave cuts the polymer chains into styrene monomer.

But at the exit of the reactor, the styrene monomer is not [inaudible 00:03:28]. It contains a lot of [inaudible 00:03:33] materials. We have to purify it. We do that by distillation. [inaudible 00:03:37] of the distillation, we obtain recycled styrene, which has the same properties as [inaudible 00:03:47] styrene. We can use this recycled styrene to produce any kind of object containing polystyrene, and especially tires.

This is a very innovative process, and we need a lot of patent intelligence. But patent intelligence has two issues. The first is that expertizing patents takes a lot of time, and only a few patents are [inaudible 00:04:18]. The second issue is that we strongly depend on the search engine and keywords. Say we want to search for patents on our process. We would type some keywords into Google Patents or [inaudible 00:04:40], whatever. What we would read is only the first page of the search results and, very rarely, the second page, but it is certain that we won't read all the patents that appear after the second page.

There is a high risk of missing a very important patent, which is why we need a tool that will read a large number of patents for us. That's what we will do now with JMP. I switch to JMP, and I will show you how to do that. Here is a table that I imported from [inaudible 00:05:22]. Here, we have the patent number, so we can identify each patent, and we have the assignees here. We have here the title of the patent, the abstract, which describes what the patent contains, and the claims.

For all the patents that we have read, we tagged the patent as critical or not. The idea is now to set up a machine learning model that will predict whether a patent is critical or not. This table has two parts: patents that have been manually expertized, that is, patents that we read, and some patents that we haven't read. We will use the second part in the second part of the presentation. For now, we will build the model, so we exclude the data that we haven't read.

The first thing to do is to use Text Explorer. We will focus on the abstract only for this presentation. Of course, we can use the claims and title and whatever. We will use the ID here, the publication number. Here we say, okay, just use these settings: English. We say the minimum number of characters per word is three. We will use stemming. I will tell you what stemming is a bit later. Here we will use basic words. Now, I just have to launch to [inaudible 00:07:14].

Here is what we have. On the left part of the screen, you have a list of the terms that JMP Pro has read in the abstracts. Here is the frequency with which each term appears in the [inaudible 00:07:33]. Here you can notice what stemming is, for example, for materials. It just says, okay, material and materials have the same meaning. We group all the words by the root of the word.
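The tokenizing and stemming settings described above can be illustrated with a minimal Python sketch. This is not how JMP Pro implements Text Explorer: the function names, the sample abstracts, and the deliberately crude plural-stripping rule are all assumptions made for illustration.

```python
import re
from collections import Counter


def crude_stem(word: str) -> str:
    """Strip a plural 's' so 'materials' and 'material' share one root.

    Hypothetical stand-in for a real stemmer; it only handles simple plurals.
    """
    if word.endswith(("ss", "is", "us")):
        return word
    if word.endswith("s") and len(word) > 3:
        return word[:-1]
    return word


def tokenize(text: str, min_chars: int = 3) -> list[str]:
    """Lowercase, split on non-letters, drop short words, then stem."""
    words = re.findall(r"[a-z]+", text.lower())
    return [crude_stem(w) for w in words if len(w) >= min_chars]


# Hypothetical abstracts keyed by publication number.
abstracts = {
    "P1": "A process for recycling polystyrene materials by pyrolysis.",
    "P2": "The material is heated in a microwave reactor.",
}

# Term list with corpus-wide frequencies, like the left panel of the report.
term_counts = Counter(t for text in abstracts.values() for t in tokenize(text))
print(term_counts.most_common(3))
```

With the minimum length set to three, words such as "by" and "is" never reach the term list, and the two spellings of "material" are counted together.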

On the right part of the screen, what we have is phrases. Phrases are two or three or four words that can have a meaning together. For example, pyrolysis oil has a meaning, and the meaning is different if we separate pyrolysis and oil. So what we do is select the phrases that are interesting and say, "Okay, add phrase."

What we have done now is consider a phrase as a term. For example, this one, hydrocarbons [inaudible 00:08:32], is now considered a term. We can do some more cleaning, for example, with the word provide. Provide does not give you information on what [inaudible 00:08:46] patents.

What we can do is right-click and say, "Okay, this is a stop word." I don't want to take provide, providing, or whatever in [inaudible 00:08:56]. We removed it. With JMP Pro, you can manage all the stop words and all the phrases here. As you can see, all the stop words are stored here in local. But if you want to keep them by default, you can just put them in user.
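The two cleaning steps described here, promoting a phrase to a single term and discarding stop words, can be sketched in a few lines of Python. The stop-word set, the phrase set, and the function names below are illustrative assumptions, not JMP's stored lists or API.

```python
import re
from collections import Counter

# Hypothetical curated lists, playing the role of the local/user lists.
STOP_WORDS = {"provide", "provides", "providing", "the", "and", "for"}
PHRASES = {("pyrolysis", "oil")}


def words(text: str) -> list[str]:
    return re.findall(r"[a-z]+", text.lower())


def bigram_counts(texts: list[str]) -> Counter:
    """Count adjacent word pairs to surface candidate phrases."""
    counts: Counter = Counter()
    for text in texts:
        w = words(text)
        counts.update(zip(w, w[1:]))
    return counts


def to_terms(text: str) -> list[str]:
    """Emit accepted phrases as single terms and drop stop words."""
    w = words(text)
    terms, i = [], 0
    while i < len(w):
        if i + 1 < len(w) and (w[i], w[i + 1]) in PHRASES:
            terms.append(f"{w[i]} {w[i + 1]}")
            i += 2
        elif w[i] in STOP_WORDS:
            i += 1
        else:
            terms.append(w[i])
            i += 1
    return terms


print(to_terms("The method provides pyrolysis oil and oil"))
```

The phrase check runs before the stop-word check so that a phrase is kept whole even if one of its words would otherwise be filtered.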

Then I do not have to exclude that word each time I use Text Explorer. I can also export and import all the stop words, phrases, and recodes. Now, the first tool that we will use to analyze patents and try to predict whether a patent is critical is Term Selection. What we want is to predict whether the patent is critical or not. I want patent critical. Here are some settings, and the only setting that we will change here is to use the log frequency of the terms instead of [inaudible 00:10:03]. Now, I click on Run.

Here, we have the first model. The results of the model are here. What we can see is that the RSquare is not that bad, because it's almost 0.5. The misclassification rate is also very good. I mean, less than 1% of patents are misclassified. So this is pretty good. Here we have all the terms that the model used to predict whether a patent is critical or not. We can sort them by the most important terms to [inaudible 00:10:50].

If I click on polystyrene, which is not surprising, you can see all the patents where the word polystyrene appears, which is very useful [inaudible 00:11:04] how the model is built. On the left table, what we have is the prediction of the model. Here, we can sort by the maximum probability of being critical. Then if I click on it, I can read why a patent is critical for us or not, and I can compare whether it's actually critical. For those whose probability of being critical by [inaudible 00:11:34] is high, actually, it is critical.

There are some patents that are actually not critical, but the model says they maybe could be critical. We should read them again, because maybe we were wrong when we read them the first time. By rereading these patents, we can find something that we misunderstood the first time. That is the first tool that we can use for patent classification and prediction. But from my point of view, this is not enough, because we are not confident that this model will predict well and that [inaudible 00:12:18] if patents will be critical or not.

To gain that confidence, and also to improve the RSquare, we will use another machine learning model. To use a machine learning model, what we will do is export the Document Term Matrix (DTM), which contains all the numerical data. In this DTM, each patent is represented by the frequencies of all the terms. I already did that before. We will just have a look at the DTM of the abstracts. [inaudible 00:13:04], which has something like 400 columns.
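As a rough picture of what the exported table contains, the sketch below builds a small document-term matrix in plain Python: one row per patent, one column per term, each cell a raw count. The patent identifiers and term lists are made up, and raw counts are only one possible weighting.

```python
from collections import Counter


def document_term_matrix(docs: dict[str, list[str]]):
    """Return (vocabulary, rows) where rows[doc_id][j] counts vocabulary[j]."""
    vocabulary = sorted({term for terms in docs.values() for term in terms})
    rows = {}
    for doc_id, terms in docs.items():
        counts = Counter(terms)
        rows[doc_id] = [counts[term] for term in vocabulary]
    return vocabulary, rows


# Hypothetical, already tokenized abstracts.
docs = {
    "P1": ["styrene", "styrene", "microwave"],
    "P2": ["tire"],
}
vocabulary, rows = document_term_matrix(docs)
print(vocabulary)   # column names
print(rows["P1"])   # term frequencies for patent P1
```

Each row is the numerical vector that a downstream model receives in place of the original text, which is why a real corpus ends up with hundreds of columns.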

We can use a cell plot to better visualize what the DTM looks like. Well, it only takes the 20 most frequent terms, and here, when it's green, it means this term does not appear in the patent. When it's red, it appears a lot of times. Here in blue, you have the critical patents. If I click on all the critical patents, the aim of the machine learning model will be to understand which pattern leads to critical patents. Before using a machine learning model, we have to do two things that are very important.

The first thing is data of [inaudible 00:14:06]. I mean, the number of critical patents is very small compared to the number of patents that we have. It's like searching for a needle in a haystack. We have to rebalance all of the data. To do that, what we did is use a frequency column, Freq Critical. Freq Critical has a formula that is inversely proportional to the number of patents that are critical or not. Then the second thing that we will do is use a validation set. Using validation, we will have better confidence in the predictive ability of the model. Also, we set a frequency in validation. Now we have pretty good data. We chose [inaudible 00:15:06] validation set.
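A minimal sketch of both preparation steps follows, written in Python rather than as a JMP column formula. The exact formula used in the talk is not shown, so the weight n / (number of classes x class count) is an assumption; it is one common choice that is inversely proportional to class size. The stratified split and the 25% validation fraction are likewise assumptions.

```python
import random
from collections import Counter


def balance_weights(labels: list[str]) -> list[float]:
    """Weight each row inversely to the size of its class."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return [n / (k * counts[y]) for y in labels]


def stratified_validation(labels: list[str], fraction: float = 0.25, seed: int = 0) -> list[bool]:
    """Flag a share of each class as validation so rare classes appear in both sets."""
    rng = random.Random(seed)
    is_validation = [False] * len(labels)
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        for i in idx[: max(1, round(fraction * len(idx)))]:
            is_validation[i] = True
    return is_validation


# Hypothetical corpus: 2 critical patents among 10.
labels = ["critical"] * 2 + ["not critical"] * 8
weights = balance_weights(labels)
validation = stratified_validation(labels)
print(weights[0], weights[-1])
```

With these weights each class carries the same total weight, so a model can no longer score well by ignoring the rare critical patents.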

Now, we can launch the machine learning model. I will first use the neural network. The first thing that I want is to predict whether the patent is [inaudible 00:15:23]. I will use all those factors, the DTM. I do not forget to use the frequency to balance the data and the validation column to be sure whether the model is predictive or not. Let's go.

Now, I will just configure some parameters, and we will use [inaudible 00:15:50]. That's all. Now, the model is fitting. We just wait [inaudible 00:15:58] to have the model. Here we are. That's it for [inaudible 00:16:08]. If you remember the RSquare that we had with the first model, it was 0.5. Now, with this neural network, we have an RSquare of 0.94. For the validation set, we are at 0.86. So, a big improvement.

The misclassification rate is roughly the same, but we have rebalanced the data. Now, it is pretty good. Here we have the confusion matrix, and you can see that critical patents are predicted as critical. For the validation set, it's still the same. Almost a [inaudible 00:16:56]. Now what we can do is save the formula of the model. I have already done that. Here is the formula. The formula predicts the probability of being critical or not critical.
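To make the two diagnostics mentioned here concrete, the Python sketch below computes a confusion matrix and a misclassification rate on made-up labels. It also shows why the misclassification rate alone can flatter a model on unbalanced data; the recall helper and the example numbers are illustrative assumptions, not results from the talk.

```python
from collections import Counter


def confusion_matrix(actual: list[str], predicted: list[str]) -> Counter:
    """Count (actual, predicted) pairs."""
    return Counter(zip(actual, predicted))


def misclassification_rate(actual: list[str], predicted: list[str]) -> float:
    return sum(a != p for a, p in zip(actual, predicted)) / len(actual)


def recall(actual: list[str], predicted: list[str], positive: str = "critical") -> float:
    """Share of truly positive rows that the model also flags as positive."""
    total = sum(a == positive for a in actual)
    hits = sum(a == positive and p == positive for a, p in zip(actual, predicted))
    return hits / total if total else 0.0


# Hypothetical worst case: a model that never predicts "critical".
actual = ["critical"] * 2 + ["not critical"] * 98
predicted = ["not critical"] * 100

print(misclassification_rate(actual, predicted))  # looks small
print(recall(actual, predicted))                  # yet no critical patent is found
```

This is why reading the confusion matrix row for the critical class matters more than the overall error rate when critical patents are rare.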

We can now reinclude all the data that we excluded, the patents that we haven't read, to apply this model formula. We just wait for it to calculate the probabilities. Here we are: the prediction of the neural network. Here is what we have. In blue are the patents that we have read and [inaudible 00:18:01] the patents that are actually critical. In orange are the patents that we didn't read but [inaudible 00:18:15] has worked for us.

As you can see, the probability of being critical predicted by the model for all the critical patents is pretty high [inaudible 00:18:27] does a good job. Now, if we want to know which patents to read, we just have to select those patents and take a subset of them. Here we have a short list of very important patents. Here, if you look at which kind of patents we have, we have a lot of patents that are important for us. They [inaudible 00:19:03] because styrene purification or I don't know.
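The selection step, keeping only unread patents with a high predicted probability and reading them in descending order, can be sketched as follows. The record layout, the field names, and the 0.8 threshold are assumptions for illustration; in the talk the selection is done interactively in JMP.

```python
def shortlist(patents: list[dict], threshold: float = 0.8) -> list[dict]:
    """Return unread patents whose predicted probability meets the threshold."""
    unread = [p for p in patents if p["critical"] is None]
    picked = [p for p in unread if p["prob_critical"] >= threshold]
    return sorted(picked, key=lambda p: p["prob_critical"], reverse=True)


# Hypothetical scored table: "critical" is None for patents nobody has read yet.
patents = [
    {"id": "A1", "critical": True, "prob_critical": 0.97},
    {"id": "B2", "critical": None, "prob_critical": 0.85},
    {"id": "C3", "critical": None, "prob_critical": 0.10},
    {"id": "D4", "critical": None, "prob_critical": 0.93},
    {"id": "E5", "critical": False, "prob_critical": 0.05},
]

for p in shortlist(patents):
    print(p["id"], p["prob_critical"])
```

Raising the threshold shortens the reading list at the cost of possibly missing borderline patents, so the cut-off is a judgment call rather than a fixed value.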

Maybe this one: it is about recycled styrene monomer. Those patents are very important for us. When we read all of these patents, in fact, we had a very good surprise. Many of those patents are [inaudible 00:19:29] for us now. But sometimes it can happen that the model is [inaudible 00:19:36] because the words that were used for the model aren't precise. Maybe a good thing is to fit different kinds of models. That's what I did.

I fit models like Lasso, Boosted Tree, and Neural Networks that take into account not only the abstract but also [inaudible 00:20:07]. We can also include IPCs. I did an ensemble model, which is an average of all those models. Now the selection is more precise. We can also see what patents we have in just a subset. Now the chance of these patents being critical is very high. The number of critical patents that we have to read is smaller than with only one model. We increased the performance of the machine learning model.
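Assuming the ensemble is a plain average of each model's predicted probability (the talk does not spell out the exact combination rule), the idea can be sketched in Python as below. The model names and probabilities are invented for illustration.

```python
def ensemble_average(*model_probs: list[float]) -> list[float]:
    """Average the predicted probabilities of several models, patent by patent."""
    return [sum(ps) / len(ps) for ps in zip(*model_probs)]


# Hypothetical probabilities of being critical for three unread patents.
lasso = [0.9, 0.2, 0.6]
boosted_tree = [0.8, 0.1, 0.9]
neural_net = [1.0, 0.3, 0.3]

combined = ensemble_average(lasso, boosted_tree, neural_net)
threshold = 0.8

# A single model may flag a patent that the ensemble does not.
flagged_by_tree = [i for i, p in enumerate(boosted_tree) if p >= threshold]
flagged_by_ensemble = [i for i, p in enumerate(combined) if p >= threshold]
print(flagged_by_tree, flagged_by_ensemble)
```

In this toy case the third patent is flagged by one model only, so averaging removes it from the reading list, which matches the observation that the ensemble selection is smaller than with a single model.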

Now let's go back to the presentation. What we did is import data from our patent intelligence. We converted the textual data, and we fit a machine learning model. Here is the method that we use, the [inaudible 00:21:08] method. We did a manual ranking of patents on a small corpus by hand. We cleaned up and imported the data into JMP. Then we converted the textual data into a matrix. We created a validation set. We balanced all the data using a frequency field. Then we fit machine learning models such as Lasso, Neural Network, and Boosted Tree. We did an ensemble model. It averages all the supervised machine learning models that we set up. That was the first step.

The second step was production. We use that model on a wide corpus of patents that we have in [inaudible 00:21:56]. We import all of the patents and convert them the same way as in the first step. We do some prediction. Now we are able to select only a few patents that are probably critical. The number of patents that we will read is quite small, and we have a high chance of reading only critical patents and spending time on the right patents.

The conclusion of this work is that machine learning can automate time-consuming tasks. It's a tool that will help you, but under no circumstances will it replace human expertise. This methodology is very useful applied to patents and could probably be applied to many kinds of textual data. Using JMP Pro, you have seen that it's very easy to do, and JMP Pro is a very powerful program for setting up machine learning like that.

What we can say is that this machine learning model just mimics our choices. If we want to go further, or if we include new patents that are critical, either we will need to build a new model or try another approach, which is based on deep learning or whatever.

I have finished the presentation, and I hope you enjoyed it. I hope it will give inspiration for future work, maybe on patents or any textual [inaudible 00:23:48]. Do not hesitate to contact me on the JMP Community or to post some comments.

Thank you for watching. Bye-bye.

Emmanuel_Romeu · ‎04-18-2023

I love it!!