Within agricultural businesses, the ability to accurately predict each year's crop yield is critical for improving the efficiency, profitability, and sustainability of the business. The earlier the yield can be predicted, the more efficiently resources can be allocated, the supply chain managed, the harvest scheduled, and the storage logistics planned.
A current challenge in the sugar beet industry is climate change, which is causing increased variability in yields from year to year. Rapidly changing weather conditions make yields harder to estimate and ultimately increase costs for all stakeholders.
Harnessing the power of data analytics and machine learning is one way to improve the accuracy and timeliness of yield predictions. A predictive model was built in JMP to predict sugar beet yields. The whole process was possible in JMP alone: data collection, cleaning, preprocessing, exploratory data analysis, feature engineering, model selection, training, evaluation, tuning, and subsequent deployment and maintenance.

Hi, everybody. We're here today to look at JMP for predictive modeling, specifically for agricultural yields. It's worth pointing out that the process I'm going to show doesn't just apply to agricultural yields; you can use it for almost anything. Hopefully it's a bit of a how-to on how you could do this yourself, with some of the little tricks in JMP that are particularly helpful for agricultural yields. But as I said, whatever you're trying to predict, this process stands.
I'm just going to provide a rough guide, with some examples, of how you can build predictive models in JMP, specifically for agricultural yields; as I just mentioned, you can apply it to any prediction problem. The specific problem here is that early, accurate prediction of crop yield is very important for business decision-making in agriculture. One thing that makes it difficult to achieve is the large year-to-year variation in sugar beet yield and quality. That variation is actually increasing over time in frequency and severity, which makes yields harder to predict, so applications like JMP and their predictive capabilities are increasingly important and necessary for accurate yield prediction.
One of the main reasons the variation seems to be increasing is climate change. As we'll see here, I just wanted to give you an overview of sugar beet farming. When I say yield, what exactly does that mean? Here's a sugar beet, and there are two aspects to this. The first is the size of the sugar beet; here you can see it says the cleaned root weight. As you can see, there is some dirt on that beet. It's then cleaned, and the weight of the cleaned beet is one component of the yield.
The second component is the quality component, which is what concentration of sugar lies within this beet. Then, quite simply, the concentration of sugar multiplied by the size of the beet is the yield that you're going to achieve.
How do we go about accurately predicting the yield? Well, first things first. The concept is the same regardless of business type: we want to gather as much data as possible, over as many locations and years as possible, to produce one large tabular dataset containing every factor that might affect the agricultural yield, along with the results, the yield values we can then compare against. We produce a large historical dataset with many independent variables alongside our dependent variable.
For agriculture, some examples: you really need the yield data, and ideally data on soil health, the weather, pests and disease, farm management practices, seed variety, soil and terrain data, remote sensing data, and so on. The more data you have and the better its quality, the more accurately you can predict your yield, the more of the variation in yield you can assign to one of these factors, and the better you can distinguish the effects of these variables from one another.
In agriculture, there are so many potential variables that can affect the final yield that it's almost impossible to capture all of the information in one dataset. But the goal is to get as much data as you can to get a good prediction of the yield. First, you want to find all the data you can get, combine it into one dataset, and clean and process it. Then, in JMP, we can do the exploratory data analysis, the goal being to discover the hidden patterns and trends within the dataset. From there, we can identify the most important predictors of sugar beet quality and yield, and then build predictive models using those predictors to forecast yields into the future.
Just quickly, the process overview in JMP. The first step is data collection, and this is actually the most important step; I can't emphasize this enough. The quality of the data is incredibly important. One of my mentors often says, "Rubbish in, rubbish out." If you don't have good data, you're not going to build a good model, no matter how good your process is. You really need to focus on getting good-quality data.
One benefit in agriculture is that one of the important variables for yield is the weather, and with today's technology we have very good methods of retroactively acquiring weather data for a field, a region, whatever it may be. We can very accurately take that weather data from another source and use it.
In agriculture, that's one piece of the puzzle that is easy to fill in, because otherwise you rely on the business you're working with, or your own company, already having good data management practices in place: tracking the important variables, collecting as much data as possible, and storing it well so it can easily be accessed. If that's not the case, the likelihood of being able to build a good predictive model is lower, and it may take many years to gather the data you actually need before you have a really good model in place.
Then we have data cleaning, pre-processing, feature engineering, exploratory data analysis, model screening, training, tuning, and evaluation, and finally model deployment and, potentially, maintenance. The first step here is collecting data. As I said, it's often the case that data is stored in many different locations, many different databases, potentially different sheets, and the goal in JMP is to draw it all together.
In this particular case, with the agricultural yields, there were three datasets: in-season sample data, weather data, and the final yield data. The important thing is to find a common thread between all of your datasets, which we'll show in a minute. You can then use the JMP Query Builder to pull all of these data sources into one place very quickly, easily, and accurately. In this particular instance, the common thread was contract number and year, on which we could combine the data.
Let's go straight in. The first step I'll show you is some data pre-processing. Something you may or may not need to do, depending on the data you have and the analysis you intend to run, is data normalization, which may increase your model performance. Let's have a look at that in JMP. I've got JMP here with some weather data. If we open that up and go into the Distribution platform, we have a few weather variables we can throw into the distribution.
Now, if we look here, let me just extend this out. What you can see with this precipitation is that the data are skewed: the values pile up at the low end with a long tail out to the right. With data that isn't normalized, there may be problems for the model; it can be more difficult for the model to discern relationships in skewed data. One thing we can do to correct this is normalize the data. With this accumulated precipitation, we go into JMP, find that dataset, and insert a new column with a formula. The column in question was precipitation.
To normalize this skewed data, we want to take the square root. Add that to the formula in our column and click Apply. Now we have this new column here; let's rename it so we know what it is: Normalized Precipitation. Now let's go back to the distribution and look at the two columns together. Ignore these; they will come in later. We can see the original data here; let's just paste those axis settings over here.
As you can see on the distribution, the original data has this clear skew, and with the normalized data, which we square-rooted, you should find that modeling will be much easier. One thing to remember, though, is that when you are working with a client, precipitation, for example, is very easily interpretable for them. If you then show them a model built on this square-rooted data, the square root of the precipitation is no longer interpretable, and people often really want explainable data. So at the end of this process you may need to back-transform these values, squaring them again, so they make sense to the client.
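For anyone who prefers scripting this step rather than clicking through the formula editor, here is a minimal JSL sketch of the same square-root transform. The :Precipitation column name comes from the demo; treating the current table as the weather table is an assumption.

```jsl
// Assumes the weather table is the active table and has a column named "Precipitation"
dt = Current Data Table();

// New column holding the square-root transform to reduce the skew
dt << New Column( "Normalized Precipitation", Numeric, Continuous,
	Formula( Sqrt( :Precipitation ) )
);

// Compare the original and transformed distributions side by side
Distribution( Column( :Precipitation ), Column( :Normalized Precipitation ) );
```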
Anyway, that's one thing we can do. Let's go back to the slideshow. Next step: some data cleaning and formatting, which is often very helpful. The first thing we can look at is table splitting. In this dataset, you'll notice we've got different columns for precipitation, temperature, min temp, and max temp, and we have a column for month. For our analysis, it may not be ideal to have a month column alongside one column per weather variable; we may want to switch that around. So we go to Tables > Split: we want to split by the month, split the weather data columns, and keep all of the remaining columns.
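If you'd rather script the same reshaping, a sketch of the equivalent Tables > Split call is below. The column names are assumptions based on the demo, so swap in your own.

```jsl
// Assumes a long-format weather table with one row per contract per month
dt = Current Data Table();

dt << Split(
	Split By( :Month ),                              // month values become the new column headers
	Split( :Precipitation, :Min Temp, :Max Temp ),   // these values are spread into one column per month
	Remaining Columns( Keep All ),                   // carry the other columns across unchanged
	Output Table( "Weather by Month" )
);
```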
Now, in this new dataset, you can see that, very quickly, we have a column for each month, which is much better for us to work with. See that there: month 1, month 2. We now have a column for each month. That's one thing we can look at. Let's close this and keep it for now. The next thing we can do is combine data, and there's a really neat part of JMP that very quickly lets you combine two datasets. Let's quickly look at that.
Here you can see we have the weather data we just split open, and we have a second table here with the contract ID, the yield, and also the site. We want to combine this dataset with the weather. How do we do that? We take our yield dataset and go to Tables > JMP Query Builder. We have the yield data here, and we want to incorporate the weather. Already you can see that JMP has located two columns that look identical to it, contract ID, and that is what we want.
If that wasn't the case, you might need to remove the suggested match by clicking the X here; in this case we don't need to. You can then add your own match, contract ID to contract ID, which we're going to use as the common thread. Then we decide which join we want; this is a SQL query, and it offers a Left Outer Join. We don't want that; we only want an Inner Join, where the data matches in both datasets. Build the query, add all columns, run the query, and we end up with a dataset that combines both tables: here is the yield data, and here we have all of our weather data, which we can now start analyzing.
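The demo uses Query Builder; a plain Tables > Join can do the same inner join from a script. A sketch, assuming the two open tables are called "Yield" and "Weather" and share a :Contract ID column (add :Year to the match if both tables carry it):

```jsl
yield_dt   = Data Table( "Yield" );
weather_dt = Data Table( "Weather" );

// Inner join: keep only rows where the contract ID matches in both tables
joined = yield_dt << Join(
	With( weather_dt ),
	By Matching Columns( :Contract ID = :Contract ID ),
	Drop Multiples( 0, 0 ),
	Include Nonmatches( 0, 0 )   // 0, 0 = no unmatched rows from either side, i.e. an inner join
);
```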
The next step is to explore outliers. In a small dataset with a low number of columns, it's very easy in JMP to check these quickly by throwing all of the data we're looking at into a Distribution analysis; let's just show an example here. Oftentimes it's very easy to pick out where individual outliers may or may not be. You can see here we've clearly got an outlier.
However, in a large dataset with potentially hundreds of columns, like we have here, that's too time-consuming. What we can do in JMP instead is go to Analyze > Screening > Explore Outliers. In here, we can add all of our columns and run the analysis; let's go with Quantile Range Outliers. Instantly, we have a list of all the columns and the number of outliers flagged by this test. There are many ways of looking at these outliers, but this is a quick one. From here, we can select the outliers and exclude or remove them, or select them and color the cells to look at later. There are many options here, and we can exclude all of our outliers very quickly in this manner, even with large datasets.
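A scripted launch of the same platform might look like the sketch below. The column list is illustrative only, and the Quantile Range Outliers choice is then made in the report rather than in the script.

```jsl
dt = Current Data Table();

// Launch Analyze > Screening > Explore Outliers on the columns of interest
// (in practice, pass every continuous column you want screened)
Explore Outliers( Y( :Precipitation, :Min Temp, :Max Temp ) );

// In the resulting report, choose "Quantile Range Outliers", then use the
// buttons there to select, color, exclude, or change the flagged rows.
```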
The next step is feature engineering, where we may be able to help our future model's performance by giving it better information. This may require domain knowledge, or from the findings of your exploratory data analysis you may have a hunch that combining certain columns in certain ways can improve predictive capability. In this example, we might want to create a feature for summer rainfall in millimeters. How do we do that?
Very simply, let's go back to JMP. Here we have our dataset; we add a new column and call it Summer Rainfall (mm). In the formula editor, we go to our summer rainfall columns and, under Statistical, we can easily take the mean, dragging in the variables we want to include. Now we have a summer rainfall feature, as we can see here (this is the square-rooted data we had earlier). Feature engineering is something we should be looking at: trying to improve the predictive capability of the model by building features that add information to the data.
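Scripted, the engineered feature is just another formula column. A sketch, assuming the monthly summer rainfall columns are named :Rain Month 6 through :Rain Month 8 (those names are hypothetical; adjust to your own):

```jsl
dt = Current Data Table();

// Engineered feature: mean of the summer-month rainfall columns,
// mirroring the Mean() formula built in the formula editor
dt << New Column( "Summer Rainfall (mm)", Numeric, Continuous,
	Formula( Mean( :Rain Month 6, :Rain Month 7, :Rain Month 8 ) )
);
```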
There are some more things we can do in the exploratory data analysis. Response Screening and Predictor Screening are really great tools in JMP. Response Screening we won't look at; you get to it the same way as Predictor Screening. It looks at individual variables with Y-by-X type analyses, so you can see how each independent variable affects the dependent variable.
Predictor Screening we'll look at in more detail now. This is a really good tool for finding the hidden trends that may be lying within a large dataset. In JMP, we go to Analyze > Screening > Predictor Screening, which uses a random forest machine learning process. We want to see which are the most important predictors of our response, the yield (the Y). The site, for example, may be important. We want to include as many variables as we can find, all of our data, plus the site.
Now, this is all sample data, so let's not expect too much here. We can also run it by site for speed today; let's just put that in here. We run this, and in the background a bootstrap forest runs, and we can look at the most important predictors of yield. You have to be a little wary because this is literally sample data, so take things with a pinch of salt, but you can see that, according to this analysis, the most important predictors of yield in this case are the site, of which we've got three, and minimum temperature 11.
From this, we may identify some important variables that we want to include in our model later, so we can highlight them and use Copy Selected; we may want those later. Predictor Screening is a really good platform for quickly identifying the most important predictors of a given response variable in a dataset.
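The scripted equivalent of that launch is short. A sketch, with a handful of illustrative predictor names standing in for the full column list (the names are assumptions from the demo):

```jsl
dt = Current Data Table();

Random Reset( 123 );   // fix JSL's random seed; the platform may also offer its own seed option

// Analyze > Screening > Predictor Screening: ranks predictors of Yield
// using a bootstrap (random) forest behind the scenes
Predictor Screening(
	Y( :Yield ),
	X( :Site, :Min Temp 11, :Max Temp 4 )   // in practice, pass every candidate predictor column
);
```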
Now, with those predictors we've located, the other thing we can look at very quickly is Principal Components Analysis. Here we're adding in the variables we just isolated as potentially important predictors of the yield, and we're throwing the yield in there too. We run the analysis and remove the labels. The output we get here can quickly tell us a lot about the dataset: in a very complex dataset, we can start to see very simply the relationships that may be occurring.
You can see here a strong correlation: when these columns, what do we have here, max soil temp 4 and max temp 4, are very high, we're also going to see low global radiation, according to this. And where we see a right angle between variables, as in this case, there is essentially no relationship between this group of variables and that one.
For the modeling process, this is helpful because we can see that these two variables, max soil temp 2 and max temperature 2, are, unsurprisingly, very closely positively correlated. It's very unlikely that our model is going to gain much more information by using both of them, in which case we can select one for the modeling process and exclude the other. The same applies in this instance here; it's just a way to refine our model. Principal Components Analysis is a really neat little tool for refining the modeling process.
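Scripted, the PCA launch is a one-liner on the shortlisted predictors plus the response (column names again illustrative):

```jsl
// Analyze > Multivariate Methods > Principal Components
// on the shortlisted predictors plus the response
Principal Components(
	Y( :Yield, :Max Soil Temp 2, :Max Temp 2, :Global Radiation, :Min Temp 11 )
);
```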
Let's go back to the slide deck. Model screening. Now that we have our important variables, we might want to find which model is best for us. We simply go back to JMP: Analyze > Predictive Modeling > Model Screening. Open up the platform, and we can paste in the important variables we found earlier, then put in the outcome we're looking at. We can choose all of the different models we want to compare. And, very important in all scenarios, try to set the random seed for reproducibility.
If you want to come back to this later and you haven't put a random seed in, you may get different results, whereas if we put in a random seed value here, say one-two-three, the results are repeatable every time. We can then look at all of these different options: we can ask JMP to build a validation column, and we can add quadratic terms. I won't run this today as it can take a little bit of time, but we can then choose the best-performing model based on different metrics, usually the RMSE and the model R², and within the platform there are other options we can choose. It's another good tool for finding which model is going to fit our data best.
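In JMP Pro, the Model Screening launch can also be scripted. This is only a sketch: the option names below mirror the launch-dialog fields and are assumptions, so check the Scripting Index if anything errors.

```jsl
// Analyze > Predictive Modeling > Model Screening (JMP Pro)
// Compares several model families on the same response and predictors
Model Screening(
	Y( :Yield ),
	X( :Site, :Min Temp 11, :Max Temp 4 ),   // the shortlisted predictors (illustrative names)
	Set Random Seed( 123 ),                  // reproducible results, as emphasized above
	Validation Portion( 0.3 )                // hold back a portion for validation (argument name assumed)
);
```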
The next step is validation. This is going to help model training; it's essential to make sure you're not overfitting the dataset. We can build validation columns in JMP very quickly, and I'll show you how. Here we have our dataset; we go to Analyze > Predictive Modeling > Make Validation Column. In here, it gives you the option to stratify based on a column.
In this example, there are three sites in the dataset that these fields come from. We'd like to stratify based on the site because we're about to split the rows up, randomly assigning each one to training, validation, or test. If we don't stratify by site and there's an unequal number of rows for each site in the dataset, we may disproportionately bias the split, and therefore the model, in one direction.
It's important to think about which columns you may need to stratify on in your dataset. We click OK. We then have multiple options here for how we'd like to split the data: say 70% of the data for training, 20% for validation, and 10% for testing.
Now you can see the adjusted rates and how many rows each set is going to get. It's important to consider how large your dataset is when deciding whether to use training, validation, and test, or perhaps just training and validation, and what split you need. Depending on your dataset, these values may vary slightly.
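The whole dialog can also be scripted. A rough sketch of the same stratified 70/20/10 split is below; the exact argument names are assumptions (verify them against the Scripting Index for Make Validation Column), but the idea is a stratified random split stored in a new Validation column.

```jsl
dt = Current Data Table();

// Analyze > Predictive Modeling > Make Validation Column
// Argument names below are assumptions -- check the Scripting Index
dt << Make Validation Column(
	Set Random Seed( 123 ),
	Stratification Columns( :Site ),   // keep the three sites proportionally represented in each set
	Training Set( 0.7 ),
	Validation Set( 0.2 ),
	Test Set( 0.1 )
);
```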
From there, click Go. As you can see over here, we've built a validation column very quickly, which gives us the option, when we build our model, to add that validation column to it. Another thing, as we can see on the right, is that this would be a boosted tree model. One thing worth mentioning is tuning the model: the optimal number of splits, the learning rate, the minimum split size, everything you can see here. You can also use a tuning design table. It's worth taking some time to try different options and see how you get the best performance in terms of error and R² by changing these.
However, JMP often gives you recommended values for these based on the dataset you have, and I've found them to be very good, often performing the best even when you go and test other options. That's another thing to consider.
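Fitting the boosted tree with the validation column can likewise be scripted. A sketch: the tuning option names mirror the launch-dialog fields and are assumptions, and the values shown are just placeholder defaults, not recommendations.

```jsl
// Analyze > Predictive Modeling > Boosted Tree, using the validation column made above
Boosted Tree(
	Y( :Yield ),
	X( :Site, :Min Temp 11, :Max Temp 4 ),   // shortlisted predictors (illustrative names)
	Validation( :Validation ),
	Number of Layers( 50 ),                  // number of trees ("layers") in the boost
	Splits per Tree( 3 ),
	Learning Rate( 0.1 )
);
```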
Now we go to model evaluation, and as I said, that's the R² and RMSE here. We're going to look at some ways we can evaluate the performance of the model. If we go back to JMP, here we have the model we just created. We can go to Factor Profiling > Profiler; it's already on. Here is our model in the Profiler, and something we can do very quickly is Save Columns to produce the prediction formula.
The formula that runs behind the scenes to make the predictions is then stored in a column in the dataset. That's been done now; go to our dataset and you can see the predicted values for our yield. Now what we can do is Analyze > Fit Y by X; there are many ways to do this in JMP, and this is just one. Our response, the yield, goes in Y, and the predicted value goes in X. OK; not a very good model, as we can see. We can fit a line, and from that fitted line we can open a Profiler. Now we have this platform here, and we can see our prediction for yield.
Let's do 50. We can say, "Okay, when we predict 50, what is the true value likely to be?" 49.694, so we're over-predicting slightly. Two nifty numbers you get instantly are the lower and upper 95% confidence intervals. This is just one way of looking at how well your model is predicting, in real time, with your data. It's a very useful tool.
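The actual-versus-predicted check can also be scripted once the prediction formula has been saved. A sketch, assuming the saved column is named "Pred Formula Yield" (JMP's usual naming for Save Prediction Formula); adjust if yours differs.

```jsl
dt = Current Data Table();

// The prediction formula column was saved from the model's red-triangle
// Save Columns option. Actual vs. predicted: Fit Y by X with the response
// on Y and the saved prediction column on X, then fit a line through it.
Bivariate(
	Y( :Yield ),
	X( :Pred Formula Yield ),
	Fit Line()
);
```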
Now, model deployment. The Profiler has some really cool components that are helpful for agricultural yields in particular, and for anyone looking to predict something in the future where various factors come in at different times of the year. We can use the model Simulator for forecasting possible future scenarios, which I'm going to show you here.
Here we have our model; let's go down to the Profiler. Now, how can we use this in terms of agriculture? You can see we've got the three different sites, and we have different variables coming in at different times of the year. Let's go to Simulator. We can change all of these factors to random: hold Control, click, and set each one to Random.
You can see we now have the option, let me just make it a bit smaller, to randomly generate runs of the model down below here. Number of runs: 10,000; you can change this to whatever number you see fit. You also have the option to add noise. Let's say we're in the month of March. Some of these variables occur before March, so in March we can assess them and fix them.
Say the accumulated global radiation in March was actually 60,000. We fix that value in, leave all of the rest of the values random, and then run 10,000 iterations of the model using each variable's mean and standard deviation, so we'll generate more values around the mean and fewer at the extremes. If we do that, we can then simulate to a table and make a huge dataset. We can also, and this can be very useful, run potential scenarios. Let's say we go to Normal Censored here because we want to exclude certain parts of the distribution; here we want to exclude values below 20 and values above 30. Enter.
You can see we've now censored the distribution, removing any possibility of reaching those extreme values. We can do that even more aggressively: we can chop the data off, go through scenarios of what may occur, and simulate them. Now we've got all of our potential outcomes: Make Table. From there, Analyze > Distribution, add in the yield, and OK. Now we have an output of model-generated values from those 10,000 runs of what may occur.
Now you can see the mean is 53 and the median 52.96, so the most likely scenario, with the values we entered, was around 53, and you can see the probabilities nicely displayed here. You can copy this column, paste it wherever you need it as an output, and start graphing different outcomes based on different inputs to the model, which is very useful for predicting future yields under different summer weather scenarios and for making sure that all possible scenarios are included in the analysis. Let's go back to the slide deck.
Just to summarize: JMP is an all-in-one platform in which you can build predictive models for many different business cases; we've shown some agricultural insights here, but this whole process is very applicable to any use case where we're trying to predict something in the future. It allows businesses that often have vast datasets they aren't utilizing to the full extent to combine and explore them, and to uncover the hidden trends that are really in there, allowing for data-driven decision-making in the future. One thing that's really important, though: it relies on having enough good-quality data to produce a good model. I hope some of that was useful to you. If you have any questions, feel free to reach out. That was everything. Thank you.