
Stan Siranovich 
Okay, so starting off, we have the Observation in the first column. And right away, we see we have a little bit of a weird format. 

We'll address that later. The next column is YKappa, and that is what's known as the target variable; that is what we'd like to predict. It is the measured value, and we'd like to keep it within a narrow range. 

And in another minute or two, I will explain why. And as we scroll through the data, we see a couple of things here. We've got some oddball labels. We have the TupperExt, whatever that is. We have a Tlower, we have a WhiteFlow. And we notice that 

some of the columns, as a matter of fact most of the columns, have a number at the end of them. It turns out, after some investigation, that that is the lag time. 

And I'll get to that when I explain the process. So we have things like chip moisture and steam heat, and we also see that we have some missing values. There's a few here and there, but mainly it looks like they're confined to the AAWhite, 

excuse me, at the very end, the sulphidity. So we make note of that. 

Now 

since we're not familiar with any of this sort of thing, probably the smart thing for us to do would be to do some research and see exactly what it is that we're looking at. (Let me get the PowerPoint up here.) 

All right. 

Let's begin. Well, the Kamyr digester is a continuous four-step process for the manufacture of pulp and paper. 

Now these 

processes are quite large, and you can think of one as a giant pressure cooker. What we do is chuck some heat, steam, and pressure, 

excuse me, into the reactor. And these can be 18 feet in diameter and up to 200 feet tall. And what we want to do is separate out the lignin while controlling the Kappa number. 

And the feedstock is a little bit unusual. I came out of the chemical industry, and I'm used to working with 

a liquid feedstock, but here our feedstock is wood chips or sawdust from a 

sawmill, whatever. And we don't have any information as to whether these feedstocks are dumped in together, whether they go in one on top of another, whether they're 

campaigned, but right away we know we have some variability here. No matter what the physical form is, the feedstock is composed of 50% water, 25% pulp, and 25% lignin. And, of course, the lignin is what we want to dissolve out of there. 

Now the process looks like this. It is complex, and within this single unit are four separate unit operations. We have the top, where the chips are dumped in, and the reaction mixture, which is referred to as a liquor. 

It's got some caustic...some alkalinity to it, but you can think of it as like a soap solution. It's going to separate the lignin from the wood. 

And we also have the fluid flows, both the effluent and also the liquors, and there are three of them. 

And it can be either concurrent or countercurrent, so there's another variable we introduce. As I mentioned before, the target variable is a Kappa number, and that is an ISO determination, and we've got another variable here, in that the fluids are recirculated and regenerated. 

So let's start off with some data exploration. 

And here's what the JMP data table looks like immediately after I brought it into JMP. Let's start off with the observation, YKappa, 

which we want to measure, and then the chip rate. Here are some other ratios. And we can see that we have some missing data; most of it seems to be confined to that column there and this column over here, the sulphidity. 

So let's take a look at those to see if it's anything that we should be concerned about. And the best way to do that, I think, is to go up here to tables and look at the missing data pattern. 

And what I'll do is click drag and select all those. And I'll add those in. 

And we get this table. 

Now, you should see the missing data table 

right on top of my JMP table. 

And let me expand that first column a little bit. Now, if we look at the first column here, that says Count, 

and the number of columns missing is zero, and we see zeroes all the way across here in the next columns, that's our pattern, and that tells us we have 131 rows with no missing data. But we've got 132 rows 

in which two columns have some missing data. And where you see the 1 and the 1, that is where the data is missing. And we have another pattern here with 19, but the rest of them are 

fairly fully populated, and if you've ever dealt with production data or quality monitoring data, you see that this is not unusual. As a matter of fact, this is 

rather complete for this type of data. So let's look at that a little more. We have 131 with one missing, and we know we have those two columns. 

Let's see if they match up. That's it. It is the same records, with the same two columns with missing data. And what we can do is select that first row 

and we can look across here. 

Let me pull this out of the way. And we can scroll. 

And sure enough, 

we see the blue highlights there, 

which highlight the non-missing data, and we see that the missing data tends to line up. There's a dot here, a dot here, so it looks like all the missing data in those two columns is in the same rows. 
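For readers following along outside JMP, the missing data pattern report can be approximated in a few lines of pandas. This is a sketch on made-up data; the column names AAWhiteSt-4 and SulphidityL-4 are assumed stand-ins for the two columns discussed above.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the digester table: two columns share
# missing values on exactly the same rows.
df = pd.DataFrame({
    "ChipRate": [11.2, 12.1, 10.8, 11.5, 12.0, 11.9],
    "AAWhiteSt-4": [6.1, np.nan, 6.3, np.nan, 6.0, 6.2],
    "SulphidityL-4": [29.0, np.nan, 29.5, np.nan, 28.8, 29.1],
})

# One indicator row per record: 1 where a value is missing, 0 otherwise.
pattern = df.isna().astype(int)

# Count how many records share each missingness pattern.
counts = pattern.groupby(list(df.columns)).size()
print(counts)
```

Each distinct row of zeros and ones is one pattern, and the count tells you how many records share it, much like the Count column in JMP's missing data pattern table.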

So let me go back 

and stop the share. 

And we're now going to do some data exploration, and I will close that window. 

And we are back here to the original JMP data table. 

And let's see, where to start? Well, we've got a couple of choices here, but one thing we can do is click on this icon right here, and it says show header graphs. 

So we click on the icon and we can see some header graphs here. And what that does is show us some histograms of the distribution 

in the column underneath. And these are draggable, so we can make them larger in either direction. And why don't we do a little bit of exploration here? Since we want to control the YKappa, 

let's select some of those high values and see what we can see. Now scroll on, scroll on... it looks like it's 

somewhat related to what's going on in the BlowFlow that... 

that is at the top of the range here, and if we go to TupperExt2, 

we can see that it tends to be in the bottom half of the distribution. And other than that, nothing really sticks out for us, so let's click right there 

and there, to clean that up a little bit. 

And let's do a little bit more exploration. Let's go to Graph Builder. 

So you go to graph... 

graph builder and this window pops up. 

And what I'm going to do is look at the YKappa 

There we go, and I can put...I can put a smoother line in there. 

And what I see is 

some variability in our process, but it's rather stable, maybe a slight upward trend here, but stable and that's going to be important a little bit later. 
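As a rough stand-in for Graph Builder's smoother, a centered rolling mean shows the same idea: the noise averages out and any mild trend is left visible. This is a sketch on synthetic data, not the actual plant series.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Hypothetical YKappa series: stable around 20 with a slight upward drift.
hours = np.arange(301)
ykappa = 20 + 0.005 * hours + rng.normal(0, 0.5, size=hours.size)

# A centered rolling mean is a crude stand-in for a smoother line:
# it damps the hour-to-hour noise while preserving the slow trend.
smooth = pd.Series(ykappa).rolling(window=25, center=True, min_periods=1).mean()

# The smoothed series should be flatter than the raw one, with its end
# sitting slightly above its start, reflecting the mild upward trend.
print(smooth.iloc[0], smooth.iloc[-1])
```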

But if I look down here, where it says Observation, what do I see? I see it starts with 100 and then 01, and it keeps going up. But if I look at the data table here, it starts with 31. So what they apparently did here 

on the unit was start with the day of the month, and then the hour, and they just continued with that format. But that's not what we want, so let me close this. I'll stop the share and bring up another JMP data table. 

Okay, so what we're looking at is basically the same data; I cleaned it up a little bit and made some changes. And as we look at that, I have it arranged the way I usually want to arrange it, with the YKappa, which is our 

target value, followed by the observation. And here I added an hour sequence, and let me show you what happens here. 

And I've saved all the scripts to this data table. It's the same data, it's just that I opened up a different table, which has been cleaned up. And you notice that we have the same number of decimal places all the way across all our columns here, if I scroll over. 

And let's repeat 

what I just did before. So we're going to go to Graph, Graph Builder. 

Put the YKappa here, and now I've added the hour sequence to 

our data table, and now we get a different pattern. If I open up the original, 

we see that we have a different smoother line, and what I try to do is look for some constellations, as I call them. So here we have these three data points. 

I'll close this and open up the second one. 

And the three data points are here. 

Let me shut that. 

And notice down here with the hour sequence, 

we have the data points in the order that they occurred, and another thing we can do with JMP 

is we can go up here, 

switch 

observation, which we had before, and now move that down here. 

And we'll put our smoother line in again, you can see, we have the same pattern, but now if I take the hour sequence 

and drag it down here, we have the best of both worlds. They're in order, which is how we wanted them. 

And we have, if you notice at the bottom here, it's observation ordered by hour sequence descending, which is the default. And I could click done, but for right now, I'll just minimize that. 

And we'll move on to the next step. Check my notes, make sure I didn't leave anything out. 

Oh, and as for how I put the new column in there: let me stop the share here, and rather than going through the entire process, let me switch back to the PowerPoint. 

And I gave the column a new name, Hour Sequence, and for data type, I chose numeric, because that's what it is. For 

modeling type, I chose continuous, because it is continuous data, even though it's taken at regular hourly intervals. And I chose the format here because it's easy for a human being to read. 

And to initialize the data for the column, I chose sequence data and ran it from 1 to 301, because that is our last row. And let's see: repeat each value one time, number of columns to add, just one. 

And for column properties, I chose time frequency from the drop down, and also chose hourly. So that is how I got that column. 
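Outside JMP, the same Hour Sequence column can be rebuilt in a couple of lines of pandas: a numeric counter from 1 to 301, plus, if you want it, a true hourly timestamp. The start date below is an arbitrary placeholder, since the talk never gives one.

```python
import pandas as pd

n_rows = 301  # the table's last row, as noted above

# Numeric 1..301 counter, matching the Hour Sequence column built in JMP.
df = pd.DataFrame({"HourSequence": range(1, n_rows + 1)})

# An optional real hourly timestamp; the start date is a placeholder.
df["Timestamp"] = pd.date_range("2000-01-01", periods=n_rows, freq="h")

print(df.head(3))
```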

So let me stop the share on that 

and minimize it, and I will bring up the JMP data table again. 

And here we are, back in the data table. Now we start with the fun part and move on to the analysis. 

Let's go 

to analyze. 

And I got the wrong data table. Let me open up the other one. Here we go. 

Go to analyze, 

multivariate methods 

multivariate. 

And we're presented with this screen. Well, we want to look at YKappa so put the Y in there. 

We don't want any weights, frequencies, or By variables, so let's 

click drag 

and put all those in there. Click OK, and here is the result. And to start off with... let me close that... 

and we're presented with the scatterplot matrix. And here we can look for some patterns in the data. We don't get to see a whole lot of them, but if we look at YKappa over here in the leftmost column, 

start looking over, looks like we may have some sort of pattern here with... 

with whatever this column is. Scroll down a little bit and looks like it is 

one of the liquor flows, etc., etc. So we can do that to get a rough idea. Let me close that. 

And we can look at the correlations here; the blue ones are positive correlations and the red ones are negative, and none of them are really high. 

In some ways that's good and in some ways that's bad, but we won't talk about that a whole lot for right now; we just look at it. In some situations we get more information from this than in others, so let me close that. 
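The Multivariate platform's correlation table is a matrix of pairwise Pearson correlations, which is easy to reproduce. This sketch uses synthetic columns; the relationship between BlowFlow and YKappa here is invented purely for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)

# Hypothetical process columns: BlowFlow mildly tracks YKappa, ChipRate does not.
n = 300
blow_flow = rng.normal(1200, 50, n)
ykappa = 20 + 0.01 * (blow_flow - 1200) + rng.normal(0, 1, n)
chip_rate = rng.normal(14, 1, n)

df = pd.DataFrame({"YKappa": ykappa, "BlowFlow": blow_flow, "ChipRate": chip_rate})

# Pairwise Pearson correlations, as in JMP's Multivariate platform.
corr = df.corr()
print(corr.round(2))
```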

Next let's go to model screening, so we go to analyze, 

screening, 

predictor screening. And we're presented with this window and once again, 

YKappa is our Y response. And here we're presented with the next window. So for right now, we'll ignore the observations, because we have hour sequence here, and it's a whole lot easier to read, so we'll put that in here and click OK. 

And this is what we get. 

So we have our predictors here, and of course, the ones with the higher values and the bigger bars are probably going to be the most important predictors, ranked for us. 

And we see a bunch of variables here. Now we come to a decision point. Well, these first two are obviously 

something we want to look at, probably the third one too, and the fourth. Now the rest of these are somewhat minor, but they're all about the same, so let's do this. 

We'll click drag... oh, and before I go on, about these two columns: Contribution is the contribution to the model, and Portion, which is the column that I usually look at, is the percentage of the model that this 

predictor contributed, so if you added all these up, it'd come to 100%. And we have a link here that says Copy Selected. 

So let's do that. 
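JMP's Predictor Screening is built on a bootstrap forest, so a random forest's normalized feature importances play much the same role as the Portion column: each predictor's share of the explained variation, summing to 100%. A sketch on synthetic data; the dominant effect of BF-CMratio here is invented for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(2)

# Hypothetical predictors: BF-CMratio drives YKappa, the others are noise.
n = 400
X = pd.DataFrame({
    "BF-CMratio": rng.normal(0, 1, n),
    "ChipLevel": rng.normal(0, 1, n),
    "T-upperExt-2": rng.normal(0, 1, n),
})
y = 20 + 3.0 * X["BF-CMratio"] + rng.normal(0, 1, n)

# Normalized importances sum to 1, i.e. 100%, like the Portion column.
forest = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)
portion = pd.Series(forest.feature_importances_, index=X.columns)
portion = portion.sort_values(ascending=False)
print((100 * portion).round(1))
```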

And I'll minimize that window and go back up here to our analyze window. 

And let's see. How about if I fit models? Fit linear regression models. 

And I'll click on that, and notice that the selected columns are already highlighted for us. 

So let's add them and, of course, we want the YKappa again. 

And we'll select that, and when I did that, it opened up 

an extension of our window and a couple more choices here with the drop-downs. The first one is Personality, so let me click on that, and it gives us a number of choices. But for right now, let's just stick with standard least squares. And then for Emphasis, 

we have three choices. The Emphasis determines what is revealed to us in the initial report window. For right now, I'll just select 

minimal report. And let's see: Degree, 2, okay, so it'll look at squared terms. 

Let's see...and I guess we're ready to go, so I'll click run. 
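What standard least squares with degree 2 amounts to is solving a least squares problem whose design matrix carries each continuous effect and its square. A minimal sketch on synthetic data (one predictor, invented coefficients):

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical data: the response depends on one predictor
# with a genuine squared term.
n = 300
x = rng.normal(0, 1, n)
y = 20 - 2.0 * x + 0.8 * x**2 + rng.normal(0, 0.5, n)

# Degree 2 means the design matrix carries both the linear and
# the squared term for each continuous effect.
design = np.column_stack([np.ones(n), x, x**2])
coef, *_ = np.linalg.lstsq(design, y, rcond=None)
print(coef.round(2))  # intercept, linear, and quadratic estimates
```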

And this is what I get. And looks like this BFCMratio is of primary importance, 

followed by some 

other ones here of less importance, but they are significant. And the reason they are significant is, number one, the p value... by the way, this is at the .05 level and the blue... 

excuse me, the .01 level; the blue line here marks the .01 level of significance. And the reason we're 

doing it like this with the LogWorth... oh, by the way, I should mention that the LogWorth is minus the log of the p value, and that's log base 10. So if we have the blue line at a value of 2, the 

LogWorth of a p value of .01 would be a 2, and that's why the line is in there. 
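That transform is simply LogWorth = -log10(p), which is worth a two-line sanity check:

```python
import math

def logworth(p_value: float) -> float:
    """JMP's LogWorth transform: minus the base-10 log of the p-value."""
    return -math.log10(p_value)

# A p-value of .01 maps to a LogWorth of 2, which is why the reference
# line at 2 marks the .01 significance level; .05 maps to about 1.3.
print(logworth(0.01), logworth(0.05))
```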

Here, you don't really need it, but if you're doing an analysis with a lot of variables, it definitely comes in handy. And we look down here to the Summary 

of Fit and find an R squared of .5, and that's about the most we can get out of that window right there. 

And we come down here to the estimates, and we can see that the estimates are the coefficients for our multiple linear regression. 

So we see that these two may not be needed, so, in the interest of parsimony, we'll 

select those, come over here, and remove them. And now we have two fewer variables. We could probably remove chip level too, but for right now we'll just leave it there, because our R squared dropped a little bit. 

And everything else remained pretty much the same. So that is our model here. We can come down here and open some other windows. 

And we see T ratios of the individual terms, some effect tests, and that is where we will leave it for right now. So let me minimize that window, and we have one more analysis to do and that is time series. So we come up here to analyze again. 

And we'll come down here to specialized modeling, and we'll go to time series. Click on that and everything is still selected. 

We'll select that, 

we'll put Y in there. 

It will do the same thing. 

One too many, hang on. 

Put our target list here, and we have this button here, X Time; we'll put the hour sequence in again because it's easier to read. 

Put that in there. 

Click the run. 

And for right now, we'll ignore all that sort of thing; we'll come down here and look at our time series diagnostics. Now scroll down a little bit more. And the blue line here is the 

significance limit, at the .05 level, and this is looking at the autocorrelations and the partial autocorrelations at different lag times. 

So I could spend the whole 45 minutes talking about this, but we don't have all that time, so I'll just give a brief summary. What we want to do is bring all of these lags 

within that blue line. And it looks like, well, for this one, we have to go up to around maybe five or six, maybe even seven, somewhere around there, 

and hit that point around three, four, or five with the partial autocorrelations. 
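The bars in those diagnostics are sample autocorrelations, and the blue limits are roughly the plus-or-minus 2/sqrt(n) band. A small self-contained sketch, on a synthetic AR(1)-style series rather than the digester data:

```python
import numpy as np

def acf(x: np.ndarray, max_lag: int) -> np.ndarray:
    """Sample autocorrelation at lags 1..max_lag."""
    x = x - x.mean()
    denom = np.dot(x, x)
    return np.array([np.dot(x[:-k], x[k:]) / denom
                     for k in range(1, max_lag + 1)])

rng = np.random.default_rng(4)

# Hypothetical series: each hour carries over most of the previous hour.
n = 300
y = np.zeros(n)
for t in range(1, n):
    y[t] = 0.7 * y[t - 1] + rng.normal()

r = acf(y, 10)

# The +/- 2/sqrt(n) band is the usual stand-in for the blue limits:
# lags whose bars stay inside it are indistinguishable from zero.
band = 2 / np.sqrt(n)
print(np.round(r, 2), round(band, 3))
```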

So 

what we do is come up here to the red triangle, 

and we want to do an ARIMA model. We get this window, and ARIMA stands for autoregressive integrated moving average. Now, we don't want to do differencing here, 

because we've already done some differencing. Those numbers that were appended to the ends of the columns were the lag times, so that all the data lined up. So what we want to do is just make a guess with the number here. 

Let me make that a 4, so lag 4 periods. 

And we'll do the same down here for moving average. 

And we could choose different values for those and let's see what the intercept, constrain fit. 

We'll hit estimate and we come down here again. 

And this is what we get. We're doing a little bit better than we did before. 

Let's come down here. 

And we'll select this again. In the interest of time I'll put in six periods. 

Check everything. Hit the estimate again. 

And I'll scroll down, and now that looks a little bit better. We probably want it to be within the limits at four or five and six lag periods here, so there's one more thing to check. 

If we look at the graphs... 

And we could save the scripts to the table. Nick, can we pause the recording for a second. 

Okay, so I did that a couple of times. 

And here I have a summary, in the model comparison. 

So we have the first model, the ARMA. Again, we didn't do the differencing, so it's called ARMA, not ARIMA. And here's the five-period one, and let me expand this 

just a little bit more, and this is the table. I worked on the table a little bit; I took out some columns that I didn't want to discuss, for reasons of time. And 

what we can do is look at the results. So it shows us the AIC, and lower AIC is better; it looks like the best model here is the 

first one we did, the (5,5). We have 

1268, and then it goes up to 1270, 1271, 1271. Then we have the R squared, and here, of course, larger is better, and it looks like the (6,6) lag model was the best one there. It has a slightly larger R squared, but certainly not significantly so. And 

here we look at the MAPE, which is the mean absolute percentage error, and that, of course, we want to minimize. It looks like the middle model there, the second one, with the six-period lags, is the best one, and that gets a 7.55. 
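The MAPE column is straightforward to compute, and the model comparison logic is just "lower AIC, lower MAPE, higher R squared." The numbers below are illustrative only, not the values from the talk:

```python
import numpy as np

def mape(actual: np.ndarray, predicted: np.ndarray) -> float:
    """Mean absolute percentage error, as in the MAPE column."""
    return 100 * np.mean(np.abs((actual - predicted) / actual))

# Invented actuals and two candidate models' predictions.
actual = np.array([20.0, 21.0, 19.5, 20.5])
pred_a = np.array([19.0, 21.5, 19.0, 21.0])
pred_b = np.array([18.0, 23.0, 18.0, 22.0])

# Model A's predictions sit closer to the actuals, so its MAPE is lower.
print(round(mape(actual, pred_a), 2), round(mape(actual, pred_b), 2))
```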

And then we have some more data down here. We have the estimates and you can look at various and sundry graphics down there. And all these have been saved to the data table. Let me clean this up a little bit. 

And so what we have here, in summary, is the day-to-day operation of a very complicated plant, which we were able to explain and understand. 

We started off with a visual exploration of the data, and then we went on to 

some research to find out what we needed to know about the process, so we knew what was going on with our analysis, and we were able to save everything to our data table. 

And one quick thing: as we explored our data, we were able to move easily from one platform to another for our analysis. And that concludes the presentation. 