Welcome, everybody, to this presentation about a use case of curve alignment. Experienced analysts often say that in a larger analytical project, roughly 60% of the total time goes into the preparation of data. If curves play a role, and especially if the alignment of curves is needed, then that is certainly close to the truth.
Curves are a very specific type of data, and JMP has some tools to work with curves and to address the related problems. In the sample library, there is the Algae Mitscherlich data, which is one of my favorite data sets in that respect, because it lets you deal with many aspects of fitting curves.
This is just an example of the development of algae density under different treatments. The types of curves that I'm going to talk about are typically observations or measurements over time, but this doesn't mean any loss of generality: the presented concepts work for all kinds of curve relationships.
This is an example, an algae measurement over time, and one focus of the analysis for this data set is to specify curves, specific types of curves that have a known shape and are driven by certain parameters, and then to estimate those parameters from the data.
In those cases, the parameters very often have a technical meaning, like a slope, an inflection point, or the limit that gets approached. The platform here also has sliders that let you explore how changing one of those parameters affects the shape of the curve.
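The idea of estimating such technically meaningful parameters from data can be sketched outside JMP as well. Here is a small, hedged Python illustration, not JMP's fitting platform: the model form, the simulated data, and the parameter names (`limit`, `rate`, `t0`) are my own assumptions for a Mitscherlich-type growth curve that approaches a limit.

```python
import numpy as np
from scipy.optimize import curve_fit

# Mitscherlich-type growth model: 'limit' is the asymptote that is
# approached, 'rate' controls how fast it is approached.
def mitscherlich(t, limit, rate, t0):
    return limit * (1.0 - np.exp(-rate * (t - t0)))

# Simulated density measurements over time (hypothetical data).
t = np.linspace(1.0, 20.0, 40)
rng = np.random.default_rng(1)
y = mitscherlich(t, 10.0, 0.3, 0.0) + rng.normal(0.0, 0.1, t.size)

# Estimate the parameters from the noisy data.
params, _ = curve_fit(mitscherlich, t, y, p0=[8.0, 0.2, 0.0])
limit, rate, t0 = params
print(f"limit ~ {limit:.2f}, rate ~ {rate:.2f}")
```

The fitted values come back close to the true ones (limit near 10, rate near 0.3), which is exactly the kind of parameter recovery the platform does interactively with its sliders.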
In the specific case that we are going to talk about, we are not actually interested in the curve itself. The curve is only a help, because we are facing another problem.
This is the data set, or a part of the data set, that goes back to the real problem. We had one series of measurements and another series of measurements, and they belong to two different devices. Unfortunately, the clocks of these devices were not in sync. But luckily, each of the devices had one sensor that measured the same substance. So we could look for times where those measurements were very close to each other, then try to find out how to correct one of the clocks, so to say, so that we get aligned measurements, and then use that correction to evaluate the data from all the sensors available in that data set.
What was the problem with the task? You see the curves here. The red curve is the one that we took as the reference curve, and the blue one is the one that we wanted to shift. You see not only that the curves are quite some distance apart, although they should theoretically have measured the same substance at the same time, but also that the time points of the two series are completely unrelated. With the naked eye, we don't see any lag that we could use to correct one of the data sets.
Therefore, I compared the time points of the two series, not the Y measurements. If there were just a constant shift, then we would expect to see all the points exactly on one line. But here you see ups and downs, so this is obviously not the case.
Perhaps we can understand more if we calculate, row by row in the data table, the difference of the two times and look at those. Here, with some imagination, we see a little bit of curvature: toward the end, the two series seem more closely related than at the beginning. But at the beginning, this looks like truly random data. So this does not help us much either to figure out how to relate the data.
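To make this diagnostic concrete, here is a tiny Python sketch with hypothetical numbers (the real data set is not shown in the talk): if the two clocks differed only by a constant offset, the row-wise differences of the two time columns would all be (nearly) the same value, and the varying values tell us that a simple constant shift is not enough.

```python
import numpy as np

# Hypothetical time stamps (seconds) from the two devices, row by row.
time_ref = np.array([0.0, 5.1, 10.2, 15.0, 20.3, 25.1])
time_obj = np.array([2.0, 7.4, 12.1, 17.3, 22.0, 27.2])

# With a pure constant clock offset, all these differences would be equal.
diff = time_obj - time_ref
print(diff)          # values jump up and down -> not a constant offset
print(np.ptp(diff))  # spread (max - min) of the differences
```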
I thought I had the link to the data table, but we can look at this screenshot as well. This is the data set that you have seen before, a little bit annotated. We see that two rows have specific markers, the star and the circle. This is due to the fact that the whole measurement project had a ramp-up phase, and at the star point, the measurement series, the real process time, started.
The circle marks where, after visual or manual inspection, we saw the starting point in the second time series, and we want to align both. We need to change the relationship of the data and the rows: we want to shift one of the data sets. That reminded me of the paternoster. Many years ago, I was working for a company that had a very old administrative building, and we had a paternoster in there.
It came to me that the strategy we are following is a sort of paternoster shift, which gives the word elevator pitch a completely new meaning, by the way. How do we find the right steps, the right place to fit? We do not have similar or identical time points in both series. We need to construct those time points somehow, and of course, this is done through interpolation.
The first thing that comes to mind is linear interpolation, and if I zoom in on only a part of the data set, it becomes evident that linear interpolation, so to say a regression between neighboring time points, has some problems. Especially if we look at areas with horizontal stretches, which may easily happen, any time point in that range leads to the same result, so the interpolated value is quite arbitrary. The opposite is true in areas with a very steep ascent, where a little change on the X-axis, the time, may lead to significant changes in the Y value. So this is not a very good technique. Instead, we can use splines to interpolate between the values.
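Both problems can be seen in a few lines of Python (hypothetical data; scipy's `CubicSpline` stands in here for the spline idea, not for any specific JMP platform): in a flat stretch, linear interpolation returns the identical value for every time point, while on a steep rise, small time changes move the interpolated Y a lot.

```python
import numpy as np
from scipy.interpolate import CubicSpline

# Hypothetical measurement series with a flat stretch and a steep rise.
t = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([1.0, 1.0, 1.0, 1.0, 4.0, 7.0])

# Linear interpolation: inside the flat stretch, every time point maps
# to the same value; on the steep part, nearby times map far apart.
y_flat  = np.interp([1.2, 1.5, 1.8], t, y)   # all identical
y_steep = np.interp([3.4, 3.6], t, y)        # noticeably different
print(y_flat, y_steep)

# A cubic spline also uses points further away and yields a smooth
# curve, while still passing exactly through the observed points.
cs = CubicSpline(t, y)
print(float(cs(3.0)), float(cs(4.0)))        # reproduces the data points
```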
You certainly know splines from the Graph Builder. If you make a scatter plot, the smoother is switched on by default, and the smoother uses splines. You can even change the stiffness, the closeness of the fit to the data, with a slider in the Graph Builder.
The advantage of using a spline as an interpolation tool is that it also takes points further away into consideration, and the result is a smooth curve; that is why in the Graph Builder it's called the smoother. This makes splines well suited for interpolation and as a basis for our alignment process.
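The stiffness slider behaves much like the smoothing factor of a smoothing spline. Here is a hedged Python sketch of that idea using scipy's `UnivariateSpline` (simulated data; this is an analogy to, not an implementation of, JMP's smoother):

```python
import numpy as np
from scipy.interpolate import UnivariateSpline

# Simulated noisy curve measurements.
rng = np.random.default_rng(7)
t = np.linspace(0.0, 10.0, 50)
y = np.sin(t) + rng.normal(0.0, 0.1, t.size)

# The smoothing factor s acts like the stiffness slider:
# s = 0 forces the spline through every point (no smoothing),
# larger s gives a stiffer, smoother curve.
exact  = UnivariateSpline(t, y, s=0.0)   # interpolates the noise
smooth = UnivariateSpline(t, y, s=1.0)   # follows the underlying trend
print(float(smooth(np.pi / 2)))          # close to sin(pi/2) = 1
```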
So we need to fit splines. How can we do this, and which platforms help? First of all, a simple tool, Fit Y by X, comes to mind very quickly when you work with JMP and do data analysis. This is the data, one of the curves. There is the spline fit, and here is the slider that lets me choose how closely I want to fit my data. Very good, very easy to use, and you can save the spline, but only the values, not the formula. And we are keen on getting the formula for the spline.
Next stop: Fit Model. If you have a continuous variable, you can select it and give it the attribute of being a knotted spline effect. When you do so, you are prompted for how many knots the spline should have; the more knots, the more flexible. I accept the default and say Run. We get the typical report from Fit Model, and Fit Model also has functions for saving formulas, so we can use Fit Model and save the formula.
A little disadvantage here is that I need to specify the number of knots before I start the analysis. Once the analysis is done, I don't have the option to play with it or change it within the platform, as there is, for example, in Fit Y by X.
Another tool is the Functional Data Explorer. The Functional Data Explorer has splines as a core function, and it also has functionality to find optimal definitions, optimal fits, for the splines. You can export everything. But it's a bit much, because simple tasks like this are not what the Functional Data Explorer is made for; you need some more clicks to come to a result. Also, it's only available to people who have JMP Pro.
That leaves the Graph Builder. You have seen it before, and this time I want to show the spline control as well. As I said, we can use the slider to determine the fit. A very nice feature, by the way, is that when you check this box, a confidence interval for the smoother is calculated, or estimated, through a bootstrap sampling method. You see how that changes when I move the slider. Now you can see it better: I have a lot of data, and there is not too much variability.
Here the confidence band is quite small. But if we zoom in on one of these areas, for example that place, and look at what happens when I change this, then we see that the line of the smoother can even walk out of its own confidence band. This is another visual help to find a good fit for the smoother, for the spline: it should stay within its own confidence limits. Then comes the very important option here: we can save the formula. Then we have a formula for this spline.
The Graph Builder surprises as a modeling tool. Who would have expected that? How does that help? This is again a small part of my data table. You see that I now have two columns here where I saved the formulas for the two smoothers. Down here, in the colored rows, I put in some arbitrary time points, and that leads to an interpolated response for each time point that I have given. It only works for interpolation; we cannot extrapolate this way.
But this way I can, for example, manually add different time points. I take this one here plus X seconds, in that case, and then I can see the difference of the interpolated values. Now I can put reference times in, and I see exactly what the expected value is, plus or minus a little bit, for both measurements.
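The same probing can be sketched in Python (hypothetical data with an assumed 8-second clock lag; interpolating splines stand in for the saved smoother formulas): evaluate both spline formulas at a reference time plus a candidate lag and look at the difference of the interpolated values.

```python
import numpy as np
from scipy.interpolate import CubicSpline

# Two hypothetical series measuring the same substance, out of sync.
t_ref = np.linspace(0.0, 60.0, 13)
y_ref = np.log1p(t_ref)
t_obj = t_ref + 8.0          # the objective clock runs 8 s late
y_obj = np.log1p(t_ref)      # same substance values, shifted time stamps

ref_spline = CubicSpline(t_ref, y_ref)
obj_spline = CubicSpline(t_obj, y_obj)

# Probe a reference time plus a candidate lag (interpolation only:
# stay inside the time range covered by both series).
t0 = 30.0
for lag in (0.0, 8.0, 10.0):
    d = float(obj_spline(t0 + lag) - ref_spline(t0))
    print(f"lag {lag:4.1f} s -> difference {d:+.4f}")
```

The difference collapses to (near) zero at the correct lag of 8 seconds, which is the visual signal one looks for when trying candidate offsets by hand.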
I did this for two different phases, and I can go in here and experiment some more. In my journal, you see that in the yellow rows I added eight seconds, and in the orange ones, ten seconds. Depending on what you want to do, this is the principle of how you can work with this. If your task is a one-off task, this is good enough: you can go in here, play with the data, and see the difference.
Our task was more regular, and the good thing is that everything can be controlled with JSL. As usual, for many commands that you execute manually, there is a corresponding JSL statement, and I have just listed some; this is not a working program. First of all, you need to set up the graph, of course. Then there are commands that you can send to the graph, and specifically to the smoother element in your graph. We can change the smoother, so we could even interactively try to determine good fits.
We can also give the command to save the formula in the data table; that is the command that plays an important role in our solution here. And you can read out the current settings of the Lambda slider, and a few more things.
How did we want to use this? The concept was: first, of course, you determine which is the reference curve and which is the objective curve, the one that needs to be shifted. Then you calculate the spline function for the reference curve and determine the direction of the shift: where are we, do we need to shift our time up or down?
Then we move the Y values of the objective curve one row in the desired direction and calculate the spline function for this new curve. We save that, evaluate it at the reference values, and calculate the difference in Y for each row. We take the total sum of those differences as the criterion for when to stop the process: after every step, we calculate that difference, save it, do the next step, and check whether there was an improvement.
If yes, we move up or down one row more, and we repeat the whole procedure until there is no improvement anymore. In the real project, the whole program runs behind the scenes, of course; you wouldn't see anything. But I added some graphs to make it visual and to demonstrate how it works step by step.
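The loop just described can be sketched as a small Python function. This is a simplified re-implementation of the idea, not the original JSL program: the data are made up, the shift goes in one fixed direction (the direction would be determined first, as described above), and the sum of absolute Y differences serves as the stopping criterion.

```python
import numpy as np
from scipy.interpolate import CubicSpline

def best_row_shift(t_ref, y_ref, t_obj, y_obj, max_shift=20):
    """Shift the objective series row by row against the reference
    series and keep the shift with the smallest total |difference|.
    Stops, and steps back, as soon as a step no longer improves."""
    ref_spline = CubicSpline(t_ref, y_ref)

    def criterion(shift):
        # Move the objective Y values 'shift' rows up, then compare
        # each remaining row with the reference spline at that time.
        n = len(t_obj) - shift
        return np.sum(np.abs(y_obj[shift:shift + n] - ref_spline(t_obj[:n])))

    best, best_crit = 0, criterion(0)
    for shift in range(1, max_shift + 1):
        c = criterion(shift)
        if c >= best_crit:   # no improvement -> step back and stop
            break
        best, best_crit = shift, c
    return best

# Hypothetical demo: the objective curve lags the reference by 3 rows.
t = np.arange(0.0, 40.0)
y_ref = np.sin(t / 6.0)
y_obj = np.sin((t - 3.0) / 6.0)          # same signal, 3 rows late
print(best_row_shift(t, y_ref, t, y_obj))  # -> 3
```

The criterion decreases step by step until the curves line up, and the first non-improving step terminates the search, just like the paternoster shift in the talk.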
The starting situation is this one. On the left-hand graph, you see a dashed line and a solid line. The dashed line is the reference line; the solid line needs to be moved. On the right side, you see the differences per row.
In the beginning, here in the starting area, the differences are pretty small. Then they get larger and larger, and they are negative; that is why the curve goes down here on a negative scale: very small differences in the beginning, and then they grow larger.
This is the starting situation; you will see this picture again. Then the program will start shifting the objective curve one cell up, in our situation, our case, and you will see how these graphs update with every step. First, we need to tell JMP what the time and measurement values for the reference and the objective curve are. Here we go. It takes a little while in the beginning; afterwards, the steps come faster.
You see how, with every step, the blue curve approaches the dashed curve and the differences decrease. The last step did not improve the situation anymore, so the program stepped one step back.
Now we have the data table in a state where we have shifted the objective curve up. Now we can use this shift for all the other measurements, all the other sensor results that we had from this device, and start the analysis.
That was it. I hope I could inspire you a little bit and that it was an interesting presentation for you. If you have any questions, please don't hesitate to contact me; my email is at the top of the presentation: bernd.heinen@stabero.com. Thank you very much.