Narrowing Down Choice: Using Bayesian Methods to Determine the Right Yeast
In Novonesis’ pursuit of understanding how industrial yeast products perform in specific customer conditions, our application research and development groups excel at generating insightful data sets for screening top yeast candidates at customer scale. Scientists regularly employ statistics to address complex questions related to both broad and specific customer scenarios.
The availability of Bayesian methods in Python libraries has significantly enhanced our ability to compute models and make probability-based decisions, even when traditional statistically significant results are not available. Additionally, the new capabilities in JMP 18 enable seamless integration with Python, empowering our scientists to generate large data sets and address customer-specific inquiries without leaving JMP. By leveraging application research data, customer trial data, and prior information, we are embarking on a journey to characterize our decisions with informed data and Bayesian statistics, rather than relying on statistically insignificant trends. This approach aims to provide a more comprehensive understanding and quantification of performance, ultimately leading to more informed decision making.
Hi, my name is Matt Watts. I'm a senior scientist here at Novonesis, working in bioenergy research and development. Bioenergy here is basically taking corn to ethanol. In my role, I am responsible for the yeast and saccharification enzymes that make that process efficient and give the most yield possible for our customers.
Today, I have the privilege of talking to you a little bit about how we have been using Bayesian analysis, or Bayesian statistics, to help us make data-driven decisions. We'll get into a little bit of background on what Bayesian statistics is. It's fairly new for me; it's something I've been learning over this past year. I'm by no means an expert in the field, but it's a journey I continue on. With JMP 18 comes even more Python integration. We'll look at some learnings I found with that integration, how it has actually improved things, and some parts that were a little bit difficult at first.
First, an intro to Bayesian statistics. It's really all about getting probability statements to make better data-driven decisions. It's also really useful because you can inform the estimation with prior knowledge. You can constrain different variables based on prior information, for example, that ethanol concentration can never go below zero. This is really important when you're building models, and it's something Bayesian methods do a really good job at.
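As a minimal, hypothetical sketch of that idea (this is not the model from the talk, and the numbers are made up), a PyMC prior can build that kind of constraint in directly:

```python
import pymc as pm

# Illustration only: encode the prior knowledge that an ethanol titer can
# never be negative by choosing priors that are bounded at zero.
with pm.Model() as model:
    # Truncated normal: centered on a plausible value but cut off below 0
    ethanol = pm.TruncatedNormal("ethanol", mu=14.0, sigma=2.0, lower=0.0)
    # Half-normal: a common choice for scale parameters, which must be positive
    noise_sd = pm.HalfNormal("noise_sd", sigma=1.0)
```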
I referenced a course from Richard McElreath called Statistical Rethinking. The link is in the bottom right of the slide. You can't click it, but it's pretty short, so hopefully you can see it. This course has been super helpful for me in understanding Bayesian statistics, and we're actually going to grab an example from it to walk through the Bayesian workflow, which helps frame what goes on when you're using Bayesian methodology.
Here we will explore an example of finding the proportion of the globe covered in water. There is a Bayesian workflow to follow, shown on the right side of the slide. You can think of it like an artist. An artist doesn't go from a blank piece of paper to a fully drawn owl. There are different steps they take to add more and more detail. They start with shapes, then lines, then the eye structure, the feathers, and so on.
This is really important when you're thinking about the Bayesian workflow. Step one is to define a generative or causal model of picking land or water. We can take a globe and pick multiple samples, each one landing on either land or water. The second step of the workflow is to define an estimate. Here, the estimate is the proportion of the globe covered in water. Step three is to get a little more detailed: we develop a statistical model that produces an estimate of that proportion from multiple land-or-water samples.
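To make step one concrete, here is a minimal sketch of that generative model in Python (a stand-in, not the talk's code); the true water proportion of 0.71 is only there to drive the simulation:

```python
import numpy as np

rng = np.random.default_rng(seed=1)

def toss_globe(n_samples: int, true_water: float = 0.71) -> np.ndarray:
    """Generative model: each pick lands on water (1) or land (0)."""
    return rng.binomial(n=1, p=true_water, size=n_samples)

samples = toss_globe(9)
print(samples, "-> observed proportion of water:", samples.mean())
```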
This is where Bayesian gets really hard, because you might need a different statistical model for every estimate you're asking about. You have to have a lot of knowledge about statistical models, what they tell you, and how they can work together to give you that information. I found a lot of different sources to help me along the way, and I'll get into some of those later.
In step four, we add even more detail: we test the statistical model using our generative model. This is basically what we did at the beginning by picking samples from the globe. We put those samples into the statistical model and see if we get back the same proportion of the globe covered in water. If that works and makes sense, then we can sample the statistical model a lot of times.
Usually, we do this 10,000 times across multiple chains, and that gives us a posterior distribution. This posterior distribution helps us find the probability of each possible proportion of the globe being covered in water. We can look at this chart and see that the highest probability leads us to believe the globe is about 75% covered in water, which is pretty awesome to see; you get a visual representation of the model. Across these workflow steps, it is crucial to test and to understand what you're testing. You also have to document your work, which not only reduces errors but allows repeatability and helps others understand your work.
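Purely as a hedged sketch of how those sampling steps might look in PyMC (this is not the talk's script, and the land/water picks are hard-coded just for illustration):

```python
import numpy as np
import pymc as pm
import arviz as az

# Simulated land/water picks (1 = water); in the workflow these come from
# the generative model, here they are hard-coded for illustration.
samples = np.array([1, 0, 1, 1, 1, 0, 1, 0, 1])

with pm.Model() as globe_model:
    # Prior: any proportion of water between 0 and 1 is equally plausible
    p_water = pm.Uniform("p_water", lower=0.0, upper=1.0)
    # Likelihood: each pick is a Bernoulli trial with probability p_water
    pm.Bernoulli("pick", p=p_water, observed=samples)
    # Draw many samples across multiple chains to build the posterior
    trace = pm.sample(draws=10_000, chains=4, random_seed=1)

print(az.summary(trace, var_names=["p_water"]))
```

Four chains of 10,000 draws each gives a posterior for `p_water` analogous to the distribution visualized on the slide.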
We briefly discussed the concept of posterior distributions, but what is that? A posterior distribution is the distribution of the posterior probabilities. This is where I come back to Bayesian being all about probabilities. These are calculated by taking the likelihood times the prior, that prior information, divided by the evidence, the actual data you measured in your experiment.
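In symbols, that is Bayes' rule, with the evidence acting as the normalizing constant:

$$ P(\theta \mid \text{data}) \;=\; \frac{P(\text{data} \mid \theta)\,P(\theta)}{P(\text{data})} \qquad \text{posterior} = \frac{\text{likelihood} \times \text{prior}}{\text{evidence}} $$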
You can see in red that the distribution helps us understand that the Earth has a high probability of being about 75% covered by water. These are the probability statements I mentioned at the beginning that help us make data-driven decisions. That's a little background on Bayesian. Here I wanted to highlight the learning curve I found with the JMP and Python integration. There were some steep inclines and some valleys, some pain points and struggles, but in the end I found a lot of benefits from using Python in JMP, especially when you merge JSL and Python scripts together.
I don't have time to go through this entire slide, but really, I just wanted to give a plug for PyMC. This Python library was not only crucial in the code for the Bayesian analysis; its website also does a really good job of giving you examples of statistical models that you can use for a multitude of tests. It also leads you to another visualization library, which we'll get into, that helps you really test your model before you even run the algorithm to get the final result.
PyMC is built for Jupyter Notebooks, through and through. So people ask me, "Why are you using JMP's Python interpreter to do this?" It came back to lowering the activation energy for my colleagues. Our company uses JMP through and through. If we can build a point-and-click workflow within JMP where the back end is Python, then my colleagues are able to use a new tool that can help them make decisions moving forward. That's basically why we went on this journey.
Coding with Python in JMP. Visualization is one of the things I love about JMP: you can see things, you can explore data. I would say that's also one of the things I found the most clunky, because Python has native libraries that help you with visualizations, but they don't render in JMP. You basically just have to report an image. We've written some code here to output data tables from the information these libraries give out, which helps you explore the data a little bit, but that's something I found to be a bit of a struggle.
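As one hedged example of that workaround (not the exact code from the talk): an ArviZ summary is already a pandas DataFrame, which is easy to hand off as a table. The CSV step below is just a neutral fallback; the specific JMP 18 call for building a data table from Python is not reproduced here.

```python
import arviz as az
import pandas as pd

# 'trace' is assumed to be an ArviZ InferenceData object from pm.sample().
# az.summary() returns a pandas DataFrame of means, HDIs, and diagnostics,
# which is already table-shaped and easy to explore in JMP.
def posterior_table(trace) -> pd.DataFrame:
    summary = az.summary(trace, hdi_prob=0.94).reset_index()
    return summary.rename(columns={"index": "parameter"})

# Fallback hand-off: write a CSV that JMP can open directly. The talk's
# script instead builds a JMP data table through JMP 18's Python
# integration, which is not shown here.
# posterior_table(trace).to_csv("posterior_summary.csv", index=False)
```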
On the reverse side, JMP really helps with user inputs, things like lists of columns or determining whether the user wants to continue with the script. These are just pop-up windows that people can interact with, versus figuring out where in the code to type in my column names, how to name them, and so on.
Then lastly, where I found the real benefit was, again, in being able to integrate JSL into the Python script. Now we're able to have panel boxes showing an image, with descriptions underneath the image to help the user understand what they're looking at. These have been amazing jumps forward, at least for me.
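Purely as an illustration of that JSL-plus-Python idea, and with the caveat that the exact JMP 18 entry points and JSL image idiom should be verified against the scripting documentation, a sketch might look like this:

```python
# Illustrative only: this assumes JMP 18's embedded `jmp` Python module can
# execute JSL (e.g., via a run_jsl()-style helper) -- check the JMP 18
# scripting documentation for the exact entry point in your version.
import jmp  # only available inside JMP's embedded Python interpreter

jsl = """
// Panel box with a saved diagnostic image and an explanation underneath.
// Verify the image-loading idiom against the JSL Scripting Index.
New Window( "Bayesian diagnostics",
    Panel Box( "Trace plot",
        Picture Box( New Image( "$TEMP/trace_plot.png" ) ),
        Text Box( "Chains should overlap and wander around a stable mean." )
    )
);
"""
jmp.run_jsl( jsl )
```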
Now we're getting into the meat of the talk today: the script. We start with selecting some factors that are going to affect our observed outcome. Our observed outcome here is ethanol production. This just uses a sample fermentation process data set from JMP, so anyone can grab it when they get the script.
First of all, with the Bayesian workflow, again, it comes back to testing, testing, testing. We're going to look at simulated data against our observed data to see whether the statistical model we've chosen works. We can see that it does: it overlaps our observed data well without going too wide on the axes.
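Since the talk's script itself isn't shown here, the following is only a sketch of that kind of prior predictive check in PyMC; the data frame, column names (Ethanol, TankLevel, Air), and priors are all placeholders, not the real fermentation process data:

```python
import numpy as np
import pandas as pd
import pymc as pm
import arviz as az
import matplotlib.pyplot as plt

# Hypothetical stand-in for the JMP fermentation sample data; the talk's
# script pulls the real columns from the JMP data table instead.
rng = np.random.default_rng(0)
data = pd.DataFrame({
    "Ethanol": rng.normal(14, 1, 50),
    "TankLevel": rng.normal(0, 1, 50),
    "Air": rng.normal(0, 1, 50),
})

with pm.Model() as ferm_model:
    intercept = pm.Normal("intercept", mu=data["Ethanol"].mean(), sigma=2)
    b_tank = pm.Normal("b_tank", mu=0, sigma=1)
    b_air = pm.Normal("b_air", mu=0, sigma=1)
    sigma = pm.HalfNormal("sigma", sigma=1)
    mu = intercept + b_tank * data["TankLevel"].to_numpy() + b_air * data["Air"].to_numpy()
    pm.Normal("Ethanol_obs", mu=mu, sigma=sigma, observed=data["Ethanol"].to_numpy())

    # Prior predictive check: simulate ethanol from the priors alone and
    # compare the simulations against the observed values.
    prior_pred = pm.sample_prior_predictive(500, random_seed=1)

    # Posterior sampling (used for the diagnostics discussed next).
    trace = pm.sample(draws=2_000, chains=4, random_seed=1)

# Overlay simulated vs. observed ethanol; since plots don't render natively
# in JMP's Python output, save the figure and report it as an image.
az.plot_ppc(prior_pred, group="prior")
plt.savefig("prior_predictive_check.png")
```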
Then we're able to test the model further. We can look at the univariate parameters within the model to see how their densities compare across chains. If they line up on top of each other, that's a really good sign that your model is working. We can use trace plots here, from another Python library called ArviZ, to help us get there.
Then the last test we did was an energy plot diagram, looking at the No-U-Turn Sampler (NUTS), the main algorithm that produces the answer. Again, we're looking for those curves, those distributions, to sit on top of each other, and they did.
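A hedged sketch of those two diagnostics with ArviZ, assuming `trace` is the InferenceData returned by `pm.sample()` on the fermentation model sketched earlier:

```python
import arviz as az
import matplotlib.pyplot as plt

# Trace plot: per-chain densities should sit on top of each other, and the
# traces should look like stationary noise around a stable mean.
az.plot_trace(trace)
plt.savefig("trace_plot.png")

# Energy plot for the NUTS sampler: the marginal energy and energy
# transition distributions should largely overlap; a big mismatch flags
# sampling problems.
az.plot_energy(trace)
plt.savefig("energy_plot.png")
```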
Then we get to the last piece. This is the final data set. This is really important, because I wanted to show what JMP does with Fit Model, standard least squares: we get both tank level and air to be very significant for ethanol production. If we look at the Bayesian methodology, we get the same thing. Tank level and air are significant; they're the furthest from zero that we can get.
But what's cool about the Bayesian approach, what gives you a little bit more insight, is that we can see positive or negative effects. Here, since tank level is large and negative, the higher the fill, the lower the ethanol. Since air is large and positive, the higher the air level, the higher the ethanol.
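One way to read those effect directions straight from the posterior (again a sketch, using the placeholder coefficient names from the earlier model, not the talk's actual script):

```python
import arviz as az

# Posterior summary: mean and 94% HDI for each coefficient. A coefficient
# whose HDI sits far from zero plays the role of a "significant" effect;
# its sign gives the direction of the effect.
print(az.summary(trace, var_names=["b_tank", "b_air"], hdi_prob=0.94))

# Probability that each effect is negative/positive, straight from the
# posterior draws.
post = trace.posterior
print("P(tank level effect < 0):", float((post["b_tank"] < 0).mean()))
print("P(air effect > 0):", float((post["b_air"] > 0).mean()))
```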
As a fermentation scientist, I don't know if I jibe with that conclusion, but this is a sample data set, of course. It's really cool that we got the same result from both, and we get a little added information from the Bayesian analysis to help us make decisions moving forward. But as my title describes, we're making yeast decisions here at Novozymes, now Novonesis.
We chose a fictional character as a product name, so we have Clint Yeastwood. Using Bayesian analysis, we were able to make probability statements to help us make decisions. Here in this Bayesian analysis, Clint Yeastwood reaches 1.1% higher ethanol than BadFerm 78% of the time. This is really crucial information that we can use to drive products forward.
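The exact calculation behind that statement isn't shown here, but one common way to produce such a probability is to compare posterior draws for the two products; the numbers below are simulated placeholders and the variable names are hypothetical:

```python
import numpy as np

# Hypothetical posterior draws of mean ethanol (% v/v) for two yeasts.
# In practice these would come from the fitted model's posterior, e.g.
# per-product mean parameters in trace.posterior.
rng = np.random.default_rng(7)
clint_yeastwood = rng.normal(15.6, 0.4, 40_000)
badferm = rng.normal(14.3, 0.4, 40_000)

# Probability statement: how often does Clint Yeastwood beat BadFerm by
# at least 1.1 percentage points of ethanol?
prob = np.mean(clint_yeastwood - badferm >= 1.1)
print(f"P(Clint Yeastwood >= BadFerm + 1.1): {prob:.0%}")
```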
In summary, today we talked about Python integration in JMP and explored Bayesian methods. Hopefully, we've unlocked some potential for everyone here to go exploring, to go looking at new ways of thinking, but remember to always come back to test, test, test. With Bayesian, you need to test and you need to document, so that you understand what you're doing, and so do your colleagues and whoever else you hand this off to.
Then finally, it's all about decision-making. Using this, hopefully you're able to make data-driven decisions, and we've opened your minds to new possibilities. I really want to thank you for listening to me today. Hopefully this was beneficial. Please go to the website; the script and all the information are there. And if you don't find what you need, please reach out to me. Thank you.