Imagine That: Predicting Chemical Formulation Targets from Images
At Syngenta, image classification is key to the modeling of samples within an automated chemical formulation system that has produced more than 100,000 pictures. The images are used to determine the next steps in the development of products but their interpretation by eye alone is very time-consuming and subjective to the evaluator.
The Torch Deep Learning Add-In for JMP Pro uses the power of convolutional neural networks and GPU computing to dramatically simplify the analysis and give consistent results across an agreed set of control images. We show how easy it is in the add-in to import and predict a rather complicated set of pictures and return meaningful, interactive output in minutes for tasks that normally take hours or even days with manual processing and Python coding.
Hello, everyone. Welcome to our presentation today. I'm Russ Wolfinger from JMP R&D, and I'm really pleased to be joined by my friend and colleague, David Barnett from Syngenta. We're really excited to talk with you today about a relatively new application of JMP and JMP Pro: using and analyzing images. It's a new direction for us within JMP that we're very excited about. I remember we spent a lot of the early COVID years working on a brand-new add-in for JMP called Torch Deep Learning, which we had been hearing a lot of interest in from customers. As you know, there's been an explosion worldwide of interest in AI, ML, and deep learning. Probably the two most popular packages out there these days are PyTorch and TensorFlow. We settled on PyTorch, and it turns out Torch has a nice C++ interface that we were able to tap into at a low level. We've been able to put together a nice add-in that seamlessly integrates with JMP Pro, and we put it out there.
I remember being excited. I think it had only been out a few weeks, and somehow David found out about it.
David, I think you were one of the very first ones to take a look at it. I remember hearing about your early successes. Really happy today to hear your story and how things came about and also some applications and where things stand for you now after having been able to work with these data and applications over the past few months. David, let me pass it over to you now and hear about what you've got going.
Thank you, Russ. It's really good. Sorry, I've just got to share my screen. Hopefully everybody can see this. Thank you, Russ, for that. As Russ said, I saw this quite early on in one of the early adopter programs. It's something that we've wanted to look at for quite a while on the image analysis front. I'm just going to explain the story behind that on our high throughput system that we've got.
A bit of blurb of the agribusiness, that it is a growing concern about how we're going to feed the world in the future. By 2050, there's going to be approximately nine billion people, and the requirement for food is going to increase much more than that. As a business, we are looking to grow more and more. Syngenta on its own has about 30,000 employees over, I think it's more than 90 countries, and we've got quite a big sales there, so nearly 17 billion 3 years ago. The largest R&D site that we have, and it's the one I'm based at, is at Jealott's Hill in the UK, it's just outside London. We spend a huge amount of money on research and development within the company.
We have all different types of chemists: synthetic, computational, analytical. I sit in formulation chemistry, which is basically putting all of our knowledge together and selling products. The formulation trends that we have, because of regulatory pressures, cost pressures, the speed that we have to get things to market, and, as I said, all the regulatory pressures that we have to make safer products, mean that the complexity of formulations is growing nearly exponentially. We have to make things safe, not only to manufacture, but also for the farmers that use our products, and also for the consumers who buy the products which our products are used on. Because of this, we saw a huge opportunity to use automation to do a lot of the background work for us. As I said, because of the complexity of the formulations, the complexity of the composition, what actually goes into the bottle, has increased a lot. The process behind that, the order of addition as well, can be very important. The responses that we have, whether it settles within the bottle over time, whether it thickens over time as well, are often nonlinear because of the different things that we have to put into the bottle.
The rewards that we saw, a larger, more complete data set. Actually understanding the products that we sell to the farmer better. It also can reduce the repetitive nature of the work that we have to do in the laboratory to get our products to market. We can also build in some things around about the cost as well. If we can buy our materials cheaper, we can sell our products either cheaper to market or we can make more profit, which is usually the way it's meant.
Also, the improved product performance. We can build into things some adjuvants, as we call them, to actually help things get either into the plants or into the insects better. Because of the complexity, this is just a very simple explanation of the number of components that we can put in there. If we have, top row, we have 20 ingredients, and we just want to do a one-to-one blend, that's 190 samples. A formulation chemist could do that in the lab. If you want to build in more complexity and just do a straight one-to-one blend, that's 1,100 samples. That isn't something that a formulation chemist can do in a reasonable amount of time.
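The counting behind the 190 figure is an "n choose k" calculation. Here is a minimal sketch in Python, assuming equal-ratio blends where only the choice of ingredients matters; the function name is ours, for illustration only. Choosing three of the 20 ingredients gives 1,140, which is in the region of the second figure quoted, though how the slide was set up is an assumption here.

```python
from math import comb

def blend_count(n_ingredients: int, n_in_blend: int) -> int:
    """Number of distinct blends when picking n_in_blend of n_ingredients.

    Order of addition is ignored and every blend uses equal ratios
    (both are simplifying assumptions for this sketch).
    """
    return comb(n_ingredients, n_in_blend)

pairs = blend_count(20, 2)     # two-component, one-to-one blends
triples = blend_count(20, 3)   # three-component blends
print(pairs, triples)          # prints: 190 1140
```

Adding ratio levels or order of addition multiplies these counts further, which is how the numbers run away from what can be made by hand.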
You can see if we've got more and more number of ingredients, we can get to nearly 2 million samples, and we're not going to get anywhere near that in the lab. A few years ago now, about 15 years ago, we invested in what we call ARTEMIS, which is the formulation high throughput system at Jealott's Hill. There, you can see the scale that we were working on. This is something which can make the samples and also characterize the samples automatically. The system can make about 200 samples a day and characterize them fully as well, a step change from what the formulation chemist can do in the lab.
One of the tests that we have to do is to see how the sample would react when it's poured into a spray tank. We use a standard test, it's an industrial test, which we carry out in a Crow receiver. We make up normally a 1% solution in that. We have to manually check how well that's bloomed. When you pour it straight in, how does it bloom? Also, the stability over a number of hours. It can be up to 24 hours that we have to do that test.
Again, it's a very subjective test. One chemist might see that it's fine, another one might be a bit more discriminating and actually see that there's a bit of cream up the top of the sample or a bit of sediment down the bottom. Here is an example of what we're looking for, as the outputs from it. This is all very visual. On the left, we have the best samples. Still not brilliant, but it's very good. Going from left to right, it goes on to bad. Basically, your sample is then dropping straight to the bottom of the cylinder and there's no dispersion at all. If a farmer was to put this into a spray tank without any agitation, then it wouldn't be sprayed successfully onto the field, and therefore onto the plants.
Now on ARTEMIS, going back to the beast, we have a visual analysis module, VAM, which is probably the strongest aspect of the characterization of it. What it does, it takes a number of pictures. I'll give you a demonstration. I'll show you the image in a minute. Using laser, back light, and front light from it, and it's all contained within a nice shiny box.
This is an example of what we have. This is one sample, and we have, in blocks of three, we've got different exposure times. The first three, there's a front lit, back lit, and then a laser which comes in into a plane, into the sample, into the front, and you're seeing what's reflected back. The second three is a different exposure, but again, it's a front-lit, back-lit laser, coming into a plane like that. The third section is, again, the same, a very long exposure time. You can see that the intensity of the color tends to increase, as you would expect, with longer exposure times. The next block of three is the laser coming from the back. You're seeing what's actually detected by a camera from the laser. Again, it's a plane of a laser coming through. Down the bottom, you see some little red dots, and that's what's coming through the sample. It's not quite as opaque down the bottom.
The one on the right side, we tend to ignore because it doesn't tell us anything at all. Hopefully, that was clear. This is one example. This is a creaming sample. If you look at the images, you tend to see a white bit across the top on the left-hand side of number 3. It's quite clear on the back lit that there is something at the top. That's what we term a creaming sample.
What we're looking for is a homogenous sample through all of that. What we've had for a number of years, is something based on a program called ImageJ. It gives us a numerical value of the sample, and it doesn't really care what the sample is, it just gives us a numerical value back. Anything from 0 up to 255 is one of the numbers for color. We get red, green, blue as an example for that. We do not get a human-readable output from that. It's very much up to the individual, normally myself, to actually put all these numbers together and tell the chemist what the sample is. Is it a good white homogenous sample, which is the ideal, or is it a sedimented yellow sample? As part of the numbers, we can get the color, but it's quite time-consuming to do that. All I wanted, for a number of years, is a program that could actually tell me what a sample looks like. Is it a good white homogenous sample?
I wouldn't have to interpret anything at all. Maybe a few checks in the background, but I wouldn't have to go into the laborious nature of the numbers.
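To make the gap between raw 0 to 255 numbers and a readable label concrete, here is a small sketch in plain Python. It is not the ImageJ workflow or anything the add-in does: the band split, the threshold of 30, and the rule that a brighter top band means cream are all hypothetical choices for illustration.

```python
from statistics import mean

def band_brightness(image, n_bands=3):
    """Mean grey level (0-255) of each horizontal band, top to bottom.

    `image` is a list of rows; each row is a list of (r, g, b) tuples.
    """
    rows_per_band = len(image) // n_bands
    bands = []
    for b in range(n_bands):
        start = b * rows_per_band
        # The last band absorbs any leftover rows.
        stop = len(image) if b == n_bands - 1 else start + rows_per_band
        pixels = [px for row in image[start:stop] for px in row]
        bands.append(mean(sum(px) / 3 for px in pixels))
    return bands

def describe(image, threshold=30):
    """Hypothetical rule: compare the top and bottom bands with the middle."""
    top, middle, bottom = band_brightness(image, 3)
    if top - middle > threshold:
        return "cream"
    if bottom - middle > threshold:
        return "sediment"
    return "homogenous"

# Synthetic 6x4 images: one with a bright layer on top, one uniform.
creaming = [[(240, 240, 240)] * 4] * 2 + [[(120, 120, 120)] * 4] * 4
uniform = [[(120, 120, 120)] * 4] * 6
print(describe(creaming), describe(uniform))
```

A hand-written rule like this needs retuning for every lighting mode and exposure, which is the laborious part that a trained classifier is meant to take away.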
As a part of this, we did a design of experiments to actually screen out a number of emulsifiers and also then get the images back that we need to work through all this Torch information, and for some other reasons as well. We've got a number of samples. We had about 800 samples that we had to prepare for an experiment, which we used for this image analysis. All the samples were prepared in triplicate. The image analysis was performed on each of these samples as well. There are actually, in total, seven images per sample, which I then went through and did my own interpretation of. Whether it's a sediment sample, a cream sample, whether it's homogenous, whether there's a bit of foam, various different things on there, which I hope you can imagine was quite time-consuming to do. We had over 6,000 images to look through. It did take quite a while to do that. I didn't get it right first time. Probably still not quite right, but I tried my best on this.
There are various samples I took out for various different dosing reasons. Images are retained for any other technique you may want to do. When you look at a sample in the lab, your analysis of that is basically disposed of straight away because you just pour the sample down the sink, and you can't look at it again. With any images, it's there forevermore. What would we use to do this image analysis? This is a thing called Torch in JMP, so that's quite good. I'm sure Russ will go into more details, but this is basically saying, as Russ has just said anyway, that this program can actually look at an image and interpret it for me based on some prior knowledge which I've put into there. My laptop, which I do all of this on, is a high spec, and I would highly recommend you get a high spec PC for this because it does use a lot of memory to actually perform this. I'm not going to do anything live here because it can take quite a while to do it, especially in the more intricate type of analysis.
I have a number of examples where I can see slight differences. We are honestly about to get onto the good stuff, but as a caveat to this, I'm not an expert in this at all. I know what the numbers mean when they come out, but how it works, I don't really know. I've got a quote there from Socrates, which I was quite proud of finding. I know that I know nothing about this, so I'd leave it up to experts to try to explain it to me. I'll leave you to it then. I'm also only showing very simplified work. I've done a lot more work in the background to do this, and the data will be supplied to you as well. Hopefully, you'll be able to look at that and get some even better information out from that. As I said before, the results aren't perfect. I'm happy with an 80% correlation between what I think the sample looks like and what JMP says it looks like.
I'm getting higher than that. There are other tools available to do this. You can actually use the Python script in its native environment.
What we have thought about doing is actually putting this as part of our data net's pipeline. We will be hopefully putting this in as an automated thing, once we've got more knowledge and more confidence in what the outputs really mean and that they are true. We are nearly there.
A few thank-yous before I go on to the good stuff. We've got various. Thank you, Sam. A few people from JMP. Liz Easton from Jealott's Hill, who looks after all the JMP licenses, and Alan Brown, who some of you may remember, who's a retired statistician at Syngenta who's been a great influence on using JMP at Syngenta over the past few years. As I say, he has retired now, but his legacy lives on. Hopefully, I'll be able to get on to JMP.
The design of the experiment, which I'm not really going to go into, is across the top here. It is a simplified version. We have the AI system, three emulsifiers, and that's the amount there, those three. The solvent is there with the levels applicable to those, which make up nearly 1,400 samples. We have three different types of cuvette, which is what we prepare the samples in.
We have the neat cuvette, which is the neat sample. We have dilution, which is about five minutes after it's prepared and inverted, when the image is taken. That's t equals 0. Then next day, so after approximately 24 hours, because we want to see how stable these samples are over a number of hours. The visual analysis and the color are my interpretation of them. There's homogenous, which is the ideal situation. Cream, which is where the sample has risen to the top. Neat, which should always correlate with the neat cuvette. Sediment, which is where the samples have sedimented to the bottom. Foam, where your sample has been inverted and it's created foam, which isn't an ideal situation for the samples. As part of the Torch add-in, you can actually make k-fold columns. As is recommended, we have three columns here, which are basically the validation columns, plus, there, the image snips. They're not the full resolution images. They're put into the data table there.
You can quite clearly see there are different qualities of sample in there. As I said before, I'm not going to run through this because it can take quite a while to do.
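For readers who have not met k-fold columns before, the idea can be sketched in a few lines of Python. This is a generic illustration, not how the add-in builds its columns: the function name, the number of folds, and the seeds are all assumptions.

```python
import random

def kfold_column(n_rows, k=5, seed=0):
    """Return one fold label (0..k-1) per row, in shuffled order."""
    folds = [i % k for i in range(n_rows)]   # balanced fold sizes
    random.Random(seed).shuffle(folds)       # reproducible shuffle
    return folds

# Three independent validation columns for a small table of 12 images.
n_images = 12
columns = {f"Fold{c + 1}": kfold_column(n_images, k=3, seed=c) for c in range(3)}
for name, col in columns.items():
    print(name, col)
```

Each column assigns every row to one fold; a model is then fitted k times, each time holding out one fold for validation. Because the samples here were prepared in triplicate, one might prefer to assign folds per sample rather than per image so that replicates stay together, a design choice the talk does not go into.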
Hopefully, you can see a number of analyses that I have actually done previously. The main figures that I look at are these ones here, from the training Rsquare to the validation accuracy. As I said before, I'm looking for about 80% agreement between what I've said they are and what the model says they are. You can see up here, you've got 96%, 97 there. There are some very good correlations in there. There are loads of different types of image models you can go through. The lower you go down this list, the longer they take, but I have tried a number of these in the past, and I found that this LeNet5 is actually a good compromise between the speed of the analysis and the actual outputs. You can get better as you go down the list, but only very slightly. I am very happy with the 95% we've had there. You can also alter the number of layers, layer sizes, batch size, and the image size as well. Again, I'm not an expert.
I just fiddle around with these numbers and get different outputs. You can see I've tried a few here. There's no great improvement over the default, which is this top one here.
One of the very many nice things is you get this plot here about the training history. You can see that what you're looking for is for it to come down a lot lower, potentially quicker there. You can click through different models to see how they perform against each other. You can get some really good ones there. You can also get some suboptimal ones which are very noisy. You've got quite a lot of noise there, but you can improve this as you go along. Another thing that I quite like looking at are these plots here. You can actually see what probability you've got of getting a correct answer. Although it may look quite messy, look at the actual number that it's not predicting correctly. Again, this could be that I haven't interpreted them properly. It's only 33 there, out of 1,500. Actually, a very good correlation there. As a proof of that, you can also look at the confusion matrices here. It's getting most of them correct.
Yellow is 100% of one there. As you may expect, you can detect different colors if you want to. White is at 97% and very good correlation there against the validation as well is also very good.
You can try different models there. Sediment is not very good, foam in particular, but it's not an important test. Cream, that's not predicting with great confidence, about 84 out of 1,500, so not too bad. This is actually showing me that there is a lot of confidence behind this, and we can build this into our workflow for analysis. We can automatically, hopefully, pick up the image, do this analysis, and have it tell us what it looks like in a human-readable form. Again, it's very basic testing. We can put a lot more information into this if we need to, different colors; we could put in blue if we have blue formulations as well.
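The agreement figures and confusion matrices described above come down to counting label pairs. Here is a minimal sketch in Python, using a tiny made-up set of labels rather than data from the study; the function names are ours and the 80% target is the one stated in the talk.

```python
from collections import Counter

def confusion_counts(actual, predicted):
    """Count (actual, predicted) label pairs: the cells of a confusion matrix."""
    return Counter(zip(actual, predicted))

def accuracy(actual, predicted):
    """Fraction of rows where the model agrees with the human label."""
    hits = sum(a == p for a, p in zip(actual, predicted))
    return hits / len(actual)

# Tiny made-up example, not data from the study.
human = ["homogenous", "cream", "cream", "sediment", "homogenous", "foam"]
model = ["homogenous", "cream", "homogenous", "sediment", "homogenous", "foam"]

counts = confusion_counts(human, model)
acc = accuracy(human, model)
print(counts[("cream", "homogenous")])   # cream samples called homogenous
print(acc >= 0.80)                       # meets the 80% agreement target?

# The figure quoted in the talk: 33 misses out of 1,500 images.
print(round(1 - 33 / 1500, 3))           # prints: 0.978
```

Read this way, 33 misses out of 1,500 is about 97.8% agreement, well above the 80% target.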
That's a very quick basic run-through of my playing around with a Torch add-in for JMP, and it's been quite fun, actually. For more information, don't come to me. Go to people like Russ who can actually give you a lot more information. It's over to you, Russ. I'll stop sharing.
Thank you, David. That's such an impressive example. Really exciting to see, especially since I know you're an expert formulation chemist, but you're not a deep learning expert, but you're able to get in there and get some results. This is exactly how we've designed the software, and even JMP in general, designed for scientists and engineers, and now engineer data scientists. We're a new thing coming down the pike. I wanted to mention too technically, it looks like, if I'm not mistaken, David, that LeNet default model is also resizing your laser images to be 28 by 28 pixels only. Fairly strong reduction in resolution. I have some examples actually where that tends to really degrade performance. In your case, it's pretty impressive.
It's working really well.
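To see why a 28 by 28 input is such a strong reduction, it helps to trace how quickly the feature maps shrink. The sketch below uses the textbook LeNet-5 layer pattern in a common adaptation for 28 by 28 inputs; the exact layers in the add-in's LeNet model are not stated in the talk, so the kernel sizes, padding, and the 16 channels are assumptions.

```python
def conv_out(size, kernel, stride=1, padding=0):
    """Output width/height of a square convolution or pooling layer."""
    return (size + 2 * padding - kernel) // stride + 1

def lenet5_feature_sizes(input_size=28):
    """Trace the feature-map size through classic LeNet-5 style layers.

    Assumed layers (textbook LeNet-5 pattern, not read from the add-in):
    conv 5x5 with padding 2 -> 2x2 pool -> conv 5x5 -> 2x2 pool.
    """
    s1 = conv_out(input_size, kernel=5, padding=2)   # first convolution
    p1 = conv_out(s1, kernel=2, stride=2)            # first pooling
    s2 = conv_out(p1, kernel=5)                      # second convolution
    p2 = conv_out(s2, kernel=2, stride=2)            # second pooling
    return [input_size, s1, p1, s2, p2]

sizes = lenet5_feature_sizes(28)
print(sizes)                  # prints: [28, 28, 14, 10, 5]
print(16 * sizes[-1] ** 2)    # flattened length with 16 channels: 400
```

Under these assumptions the whole image ends up summarized by a 5 by 5 grid per channel, so fine detail such as a thin cream layer has to survive the initial resize to be usable.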
For other applications, there's the flexibility to use bigger images, which will slow things down computationally, but sometimes it's well worth the trouble. That's where those Nvidia GPUs on Windows can be very handy. I echo your recommendation. You need some decent hardware to run these models. Fortunately, they're fairly affordable, at least on Windows now, thanks to the gaming world. They often use the same chips. The problem is if you've got teenagers or something, you may be fighting over the machine, but it's really fun to see the application here, David. Congratulations. I'm looking forward to working with you more on these applications. I think in a way, we're just getting started. I wanted to mention a couple of additional things to build on what you were talking about.
First of all, just so folks are aware, we're going to be doing a second presentation that will be complementary to this one. Depending on when you're watching this video, it's 2024. If you're watching before October 17, there's still time to sign up. If you want to watch live, go to LinkedIn, and you can find this advertisement. You can go to my page or Alexander Beck's and sign up.
If it's past October 17th, I think you can likely still find the recording to see some more details, and we'll be going into more depth on a couple of topics, too, as a follow-on to the presentation today. I also wanted to mention that this is not just an isolated example. For example, this paper was just published this year from a group of researchers from Singapore, Cambridge, and BASF, and they've got this nice paper on shampoo formulations. Very similar aspects, where they have a nice design of experiments approach, which we love within JMP. This is just right in our wheelhouse, experiments and analysis that we like. Then they take pictures of their formulations and again, try to build classifier models to do that. They've also got some of these nice automated robots that help collect the images as well. It's really great to see this. I think with this tech we're able to really leverage the capabilities of this high-quality data now and get the most out of it.
Thankfully, they actually made their data public. Here's just a quick view of it. We also ran this through the Torch add-in, these images to build a classifier.
We were even able to improve a little bit on the results published in the paper. There's a lot of room with these models. It's pretty incredible what they're able to do. They're not foolproof, and sometimes they even make silly mistakes. For example, here, I think this was a case where it really helped to do cropping. You've got to be careful about artifacts around the edges of the images. Sometimes the models, they're so greedy. Sometimes they'll focus in on aspects that are not what you want. In this case, we did some cropping up front, and then we were able to get really nice results. If you'd like to give the add-in itself a try, it's available on the JMP community for free. Maybe the easiest way is to do a web search for Torch Deep Learning JMP Pro, and you should be able to find this page. You do need a working copy of JMP Pro 18, or the 19 early adopter, which is also available now.
Either one of those should work. Then you just download the add-in here and you're good to go. To make it even more compelling, we've got a nice collection of examples put together now called the Torch Storybook.
This is actually just a JMP table, which you can even download without the add-in, just to take a look if you want. We're up to 47 examples now across a bunch of different types of applications. It does tend to have a little bit of a focus on chemistry, which is probably, arguably, our most important industry or field of science, including, for example, this one, which I thought might build on this, where we've got data from... This was actually a paper from Discovery Europe, last year, maybe earlier this year, on a fermentation batch process. They actually had time data. The interesting thing here is we were able to build images from this data using JMP Graph Builder, as a heat map, and then pipe those into Torch Deep Learning. It's something you may not have thought about at first. In fact, if there's maybe one theme today, it's that this idea of fitting images within JMP with deep learning models just opens up a whole brand-new avenue and all kinds of different applications.
A lot of the applications in this table are from images.
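The idea of turning time data into an image can be sketched outside JMP as well. The Python below is a stand-in for illustration, not the Graph Builder workflow from the Storybook example: each variable becomes one image row, each time point one column, and readings are scaled to grey levels from 0 to 255. The function name and the batch readings are made up.

```python
def to_grey_image(series_by_variable):
    """Scale each variable's time series to 0-255 and stack them as image rows.

    `series_by_variable` maps a variable name to a list of readings over time.
    Each row of the result is one variable; each column is one time point.
    """
    image = []
    for name in sorted(series_by_variable):
        values = series_by_variable[name]
        lo, hi = min(values), max(values)
        span = (hi - lo) or 1.0   # avoid dividing by zero for a flat series
        image.append([round(255 * (v - lo) / span) for v in values])
    return image

# Made-up batch readings, purely for illustration.
batch = {
    "pH": [7.0, 6.8, 6.5, 6.4],
    "temperature": [30.0, 32.0, 34.0, 34.0],
}
for row in to_grey_image(batch):
    print(row)
```

Scaling each variable separately keeps a small-range signal such as pH visible next to a large-range one; whether that is the right choice depends on the process, and a shared scale is the obvious alternative.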
I wanted to mention also that we're doing unstructured text, for example, using SMILES strings of molecules to predict properties. Pretty incredible. I hope we'll talk about that more in a different presentation, but I just wanted to whet your appetite. Please feel free to download this table and look through examples that might catch your interest. Check them out. The table is set up with links where you can download the data, all of which are public data sets. There are also example outputs, references, and background. If you want to get up to speed quickly, this table is designed to help you do that with a bunch of really nice examples. Always good to learn by example. We're very excited about these applications. I'm hoping, David, as we go along, folks will be able to follow nicely in your footsteps, and will tune in for other applications that we've got. Thanks again, David. I really appreciate the chance to work with you on this. What are you thinking, David, for next steps with your analysis?
It's about trying on different samples and ensuring that everybody is happy with the results that we get. It's building on the knowledge we already have.
Very nice. I know we also have customers who are already… They've done the same thing, and they're actually starting to deploy their models in practice. A lot of them are using Python, and we've got applications now where you can train models in Torch, save it as a PyTorch object, and then load it into a Python program using JMP's Python integration many times, too. We've got nice connections to that now. A lot of exciting things happening. Just a way to do this. This is what I consider standard non-generative AI ML, very popular and practical. In your case, David, that might be saving you some trips to the optometrist and really accelerating and improving your workflows and day-to-day activities.
Definitely, yes.
All right. Thank you, everybody, for tuning in. We hope to see you at the Discovery meetings or other chances we get. Do feel free to reach out to us any way you can, LinkedIn or JMP email. We'd be happy to talk with you further about all the exciting things going on with this and other things going on with JMP Pro.