
In Pursuit of the “Golden Curve:” A Comparison of FDA and PLS Analyses of Data Series (2021-EU-30MP-742)

Level: Intermediate

 

Beatrice Blum, Senior Statistician, Procter & Gamble Service GmbH
Phil Bowtell, Principal Statistician, Data and Modeling Sciences

 

With sensors now economically available, P&G is massively expanding its use of sensors to develop new and better test methods. Sensors deliver discrete measures over a continuum such as time or location, often resulting in smooth curves. However, the metrics that we extract from these sensor data are blunt summary statistics like averages, sums and integrals. These are believed to represent different consumer-relevant product features, but we struggle to establish robust mathematical links. Using historic approaches, a lot of the information about product performance that we measure along the way is not leveraged. We propose to apply Functional Data Analysis (FDA), a mathematical approach to spline-fit any type of curve, to extract discriminating curve characteristics representing product features. Using case studies from Baby Care, we show how to turn sensor data into meaningful information. In addition, we compare FDA with PLS in SIMCA to understand when to use each method.

We envision that matching these fits with consumer data will enable creation of a product portfolio landscape, empowering us to understand what optimal product performance, the so-called Golden Curve, looks like. Eventually, our goal is to design diapers, pads, razors and more against identified consumer-relevant Golden Curves by optimizing product composition.

 

 

Auto-generated transcript...

 


Speaker

Transcript

Beatrice Blum Hello, and thanks for joining Phil's and my Discovery presentation today, with a glimpse of my fabulous 2021 lockdown hairstyle.
  We will be talking about how we approached some sensor data and why we used functional data analysis (FDA) and partial least squares
  (PLS) in our pursuit to catch the golden curve. My name is Beatrice Blum and I'm a statistician in the Data and Modeling Sciences department of Procter & Gamble, supporting Baby and Fem Care R&D in Germany. My co-author is Phil Bowtell from the UK. Phil, do you want to introduce yourself?
Phil Bowtell Thank you very much, Bea. Hello, my name is Phil. I'm based in the UK and like Bea, I'm a statistician as part of the data and modeling sciences group. And I support a variety of technical sciences in Europe, including baby care with Bea. Thank you.
Beatrice Blum So what we want to cover today is a quick introduction to the data that we have collected and how we're trying to figure out the meaning of the different curve shapes with respect to our consumer responses, Yield 1 and Yield 2.
  We will pay particular attention to comparing two analysis approaches to these data (PLS and FDA) and try to understand when to use which.
  Note that we assume some knowledge of PCA, PLS and FDA for this talk, but all you really need to know are the general concepts and how the data are organized.
  So it's very likely that you will still be able to follow this talk even if you're not familiar with them.
  So you may be aware that Procter and Gamble is developing and manufacturing diapers. To improve these diapers and their product performance in the eye of the consumer,
  we try to capture and understand the important features of a diaper. In particular, in some of our test methods, we apply fluid to the diaper in different locations and under different
  protocols or conditions, and measure K data curves, as seen here on the left, and P data curves, as seen on the right.
  We assume that these K data curves are somewhat linked to our consumer response called Yield 1, and that these P data curves are related to our consumer response, Yield 2.
  So let's first look into the K data curves and analyze these or try to fit these with the help of Functional Data Explorer in JMP. With that, I'll switch to JMP.
  So here is my data table in JMP.
  It's a very limited data table in terms of columns. We have one column for the 10 products that we have been investigating, A to K.
  For each of those products we have run three replicates in our method. I combined the two columns into an ID column consisting of the sample name and the replicate number.
  We collect the data over time, a continuous variable, and our raw signal is called K raw.
  So let's have a look at what the K raw looks like.
  I just picked two products, in this case G and I, because their profiles seem to be quite different. What you can see is that we have pieces where the curve jumps up and then flattens down in quite a smooth way.
  The stepping up is no big issue for PLS, which was created to model spectral data, while FDA (functional data analysis)
  expects smooth curves and also smooth derivatives, which are probably not given if the curve just jumps up.
  These jump-ups are related to a sauce(?) that we apply to the diaper. You can also see the three replicates, so the method seems to be nicely reproducible.
  Quite nice. However, we see a lot of noise in our raw data. It sinks again, oscillating down here, and we assume that we would model a lot of noise and overfit
  just because we have so much oscillation over here. So we found it advisable to smooth the curves prior to fitting, and that's what we see down here. We have smoothed the curves using a moving average super sample(?) with a window size of 20.
  With that I'll go over and try to fit
  functional
  components to this. So I put my variables into the corresponding roles. Instead of the raw data, I use the smoothed K. I use my ID variable as the ID function. My X is the time over which we measure our K, and eventually what we want to achieve is linking these data to our Yield 1,
  a continuous variable, to understand how our predicted curves are related to this consumer response. So I put this into the supplementary role.
  I run it, get the initial output from the FDE, and usually I just start by fitting B-splines because that's nice and easy and relatively fast.
  You can see that this takes only a couple of seconds, despite us having a good few thousand rows. So we get a result. It doesn't really look that bad from afar; however, let's drill in a little.
  As already mentioned, when talking about what functional data analysis was developed for, it is expecting smooth curves.
  And the B-splines in this particular case just stitch together cubic spline pieces,
  and to get around corners like here, where there is certainly no cubic curve but a real change in behavior and a turning point,
  it has to try to somewhat capture that behavior, and you can see that it's doing a really poor job. It's also not doing a good job of representing these plateaus that we observe at the top.
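Why cubic B-splines struggle at such corners can be illustrated outside JMP. Here is a minimal SciPy sketch on an invented step-then-decay curve (a stand-in shape, not the real K data), where the fit error piles up at the jump:

```python
import numpy as np
from scipy.interpolate import UnivariateSpline

# Hypothetical stand-in for one smoothed K curve: a step up at t = 2
# followed by a smooth decay, mimicking the shape described in the talk.
t = np.linspace(0, 10, 400)
k = np.where(t < 2, 0.1, 1.0) * np.exp(-0.1 * np.maximum(t - 2, 0))

# Cubic smoothing spline (k=3), loosely analogous to FDE's B-spline fit;
# s trades off smoothness against fidelity to the data.
spline = UnivariateSpline(t, k, k=3, s=0.5)
fit = spline(t)

# The worst residuals concentrate around the jump at t = 2: the spline
# tracks the smooth decay well but cannot follow the step change.
resid = np.abs(k - fit)
```

The same diagnostic (where the residual concentrates) is a quick way to decide whether a smooth basis is appropriate for a trace with step changes.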
  So, despite being very fast, simple and, in most cases, a really good approach, in this particular
  context it's probably not best to go with B-splines. Instead,
  I read a bit in Help and did a bit of research and found that we should use P-splines if we have profile data, something like spectral data. That's what JMP recommends, so we went for the P-splines.
  And because I really saw that we have step changes here, the only way to track these is by using the step functions.
  And since the P-splines take a much longer time to fit, I prefit those, and we will just have a look at what the results look like.
  So again we look at the actual and predicted plots here and see that they're doing a much better job at getting these turning points and
  at capturing the step upwards. They also somewhat capture the plateaus around here but, as already assumed, they still
  capture quite a bit of the noise. So this is not really smooth. It would actually be super nice to have maybe a B-spline fit this
  degradation-style downward slope and just a step P-spline in this area where it's really needed, but at least this is a lot better than our B-spline fits. So let's see what's happening.
  So it did quite a good job in placing our different products on a two-dimensional score plot. We can clearly see how the replicates of the different products group together.
  There are the A's; we have the H's, and so on. Some of them are not so good; the I's are spread a little further and overlap with others.
  However, we can see that the good reproducibility that we saw in the raw data seems to be playing out well here. We decided to go for four FPCs,
  as seen here. And we can see that they're quite nicely predicting our curves.
  But eventually our goal is obviously to see how our consumer responses relate to different curves. So what JMP is doing here in the background, in the generalized regression, is modeling each of those four FPC scores as a function of Yield 1.
  And with the results from that, we can now see how changes in Yield 1 change the shape of the curves; we can clearly see an upward trend.
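The mechanics JMP applies behind the scenes, extracting FPC scores and linking each score to the supplementary Yield 1, can be sketched with plain SVD. The sine-shaped curves and yield values below are invented for illustration; JMP's actual FPCA and generalized regression are more sophisticated:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical stand-in data: 30 curves (10 products x 3 replicates),
# each sampled at 100 time points, with shape driven by a latent yield.
yield1 = np.repeat(np.linspace(0.2, 0.8, 10), 3)
t = np.linspace(0, 1, 100)
curves = np.array([y * np.sin(np.pi * t) + rng.normal(0, 0.02, t.size)
                   for y in yield1])

# "FPCA" as PCA on the discretized curves: center, then SVD.
mean_curve = curves.mean(axis=0)
U, S, Vt = np.linalg.svd(curves - mean_curve, full_matrices=False)
scores = U * S            # FPC scores, one row per curve
fpcs = Vt                 # FPC shape functions (eigenfunctions)

# Link the first FPC score to the consumer response, analogous to what
# the generalized regression does for each FPC behind the scenes.
slope, intercept = np.polyfit(yield1, scores[:, 0], 1)

def predict_curve(y_new):
    """Predicted curve shape for a given yield, from the fitted link."""
    return mean_curve + (slope * y_new + intercept) * fpcs[0]
```

Sweeping `y_new` over a range of yields and plotting `predict_curve` reproduces the kind of "curve shape versus Yield 1" display shown in the demo.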
  So
  it seems relatively easy to capture what's going on: this is a really bad product, so it's certainly far up, with a lot of plateau up here,
  and this one seems a lot further down. So we have found something that we think may be close to a golden curve for our Yield 1.
  However, when we look at the data that we actually collected from the consumers, now not on a continuous scale but simply put in order on a categorical scale,
  we have to see that these four products came out almost identical, from 0.031 to 0.034. But if we look at the curve shapes on the left and how they change, we see they're quite different.
  So it's not entirely in line, you could even say it's not at all in line, with what we've seen in fitting the continuous Yield 1. The very best one is down here, and these look so similar despite having quite a big difference in consumer response.
  And again, this one is also not so different with respect to what we've seen from the continuous one.
  With that, I return to my slide deck.
  So back to my slide deck. Here we can see how we fit the data that we extracted from the FDA fits to our Yield 1.
  And you can actually see that this is a very, very good model. It's so good that we have to doubt it will hold true on new data. We did the fit using auto-validation and model averaging, as promoted by Phil Ramsey and Tiffany Rao at Discovery Summit Americas 2020.
  With an R squared of 97 and a press R squared (the cross-validation R squared) of 90, it's just too good for us to believe it's true.
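For readers unfamiliar with the press R squared quoted here: it replaces ordinary residuals with leave-one-out prediction errors (the PRESS statistic), which is why it drops when a model overfits. Below is a small Python sketch for ordinary least squares using the hat-matrix shortcut; the talk's actual model uses auto-validation and model averaging, so this only illustrates the metric itself:

```python
import numpy as np

def r2_and_press_r2(X, y):
    """Ordinary R squared and leave-one-out press R squared for OLS.

    The leave-one-out residuals come from the hat-matrix identity
    e_i / (1 - h_ii), so no refitting loop is needed.
    """
    X1 = np.column_stack([np.ones(len(y)), X])
    H = X1 @ np.linalg.pinv(X1)                  # hat matrix
    resid = y - H @ y
    ss_tot = (y - y.mean()) @ (y - y.mean())
    r2 = 1 - resid @ resid / ss_tot
    loo = resid / (1 - np.diag(H))               # leave-one-out residuals
    press_r2 = 1 - loo @ loo / ss_tot
    return r2, press_r2

# Demo on made-up data: a clean linear relationship with mild noise.
rng = np.random.default_rng(0)
x = rng.uniform(0, 1, 40)
y_demo = 2 * x + rng.normal(0, 0.1, 40)
r2, press_r2 = r2_and_press_r2(x.reshape(-1, 1), y_demo)
```

Because each leave-one-out residual is at least as large as the ordinary one, press R squared is always below R squared; a big gap between the two is the classic overfitting signal the speakers are worried about.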
  With that let's look at what Phil found when looking at the same data with PLS.
Phil Bowtell So, as you say, we have this R squared of 97% with the press R squared of 90. All looks very nice. Let's just see how partial least squares compares with this.
  So I've been looking at principal components analysis and partial least squares, a tool that we use when we have spectra or curves. It's commonly used because all our inputs are going to be highly correlated, and
  traditional regression techniques don't deal with that so well. And the first thing we noted was that the score plot that Bea had in the previous slide
  and the demo looks almost exactly the same as the one you get from principal components analysis, so that's where we see some common links.
  When I run the partial least squares, I get an R squared of 73%,
  not quite as good as 97. And also, if you look at the observed against the predicted,
  we do actually see what looks to be an okay fit, but then obviously Product B is having a bit of an impact and undue influence.
  And in JMP and in SIMCA we've got the cross-validation measure Q squared, which is low at 33%. So this isn't really a good model.
  This was done on the smoothed data that we had. There are other transformations you can try, but really we weren't able to build a good model. It's certainly nothing that competes with the FDA.
  However, one thing we do get from the model is coefficient estimates. We also get a quantity called VIP, and these in tandem give us an idea of which particular regions of the curves drive the predictions. So if I just overlay
  the VIPs and the coefficients on the raw data plot here, the green highlights areas that have a really big impact on the predictions, that contribute towards the model.
  The orange is medium, not so much. And the gray is low, which is actually telling us that the first peak has hardly any impact from a prediction point of view.
  So moving on, I'm looking at another set of data. This is the p data curves, and here we have these curves that have been collected.
  Four conditions, call them protocols or conditions if you like, at three locations. We also have a fifth protocol or condition, but this is only taken at Location 1 and is not plotted here.
  And what we have is Location 1 on the left, Location 3 on the right and Condition 1 on top going down to Condition 4 at the bottom.
  And one thing to note that these curves are quite similar. We do see some slight deviations.
  But one question that was asked is: do we need all of these curves? Are they all needed? Or could we take a subset and use those to help us understand the data? So what I've done is taken all the products and sequentially plotted them, from Location 1,
  Condition 1, all the way up to Location 3, Condition 4.
  And we can see straight away that there are some common trends; we can also see some differences. We see that there are three products here that seem to lie away from the others.
  So we've got some product differentiation. If I look at the different conditions, I can see that these products here are certainly changing as we change our conditions.
  Looking at location, there doesn't seem to be a huge impact. Let's see if we can look into this in a little more detail.
  So with any multivariate data, normally the first thing you do is literally throw it into principal components analysis and see if anything comes out of that, as an exploratory data analysis tool.
  And if I look at the score plots: I've taken all the data you've just seen and put it into the package, and it's come up with principal components, color coded by product.
  And we can see straight away that products D, G and H seem to lie away from the rest of the products. We've got three products here, seven products over here.
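The exploratory step described here, PCA on the stacked curves with a score plot separating a few products from the rest, is easy to sketch. The curves below are fabricated so that three of ten products (stand-ins for D, G and H) carry an extra trend:

```python
import numpy as np

rng = np.random.default_rng(4)

# Hypothetical stand-in: 10 products x 3 replicates of a 60-point curve.
t = np.linspace(0, 1, 60)
offset_products = {3, 6, 7}              # stand-ins for products D, G, H
curves, labels = [], []
for prod in range(10):
    for rep in range(3):
        shape = np.sin(2 * np.pi * t)
        if prod in offset_products:
            shape = shape + 0.8 * t      # extra trend that sets these apart
        curves.append(shape + rng.normal(0, 0.05, t.size))
        labels.append(prod)
curves = np.array(curves)

# PCA via SVD on the centered data; plotting the first two columns of
# `scores`, colored by product, is analogous to the talk's score plot.
Xc = curves - curves.mean(axis=0)
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
scores = U * S
```

In this toy setup the first principal component alone separates the offset products from the rest, which is exactly the kind of grouping the speakers read off the score plot.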
  And when we talked to the people who develop and make these products, it made perfect sense. So it's good that the data are actually highlighting something we would expect to see.
  I then highlight by the different locations, and I'm not really seeing a pattern here. I think you'd have to be quite adventurous to say there's something going on there.
  However, when I color by the different conditions, I do see some pattern emerging. And if I look at the three products that we have here, D, G and H,
  I can see, as you go from right to left, a shift from Condition 1 to Condition 4, and likewise for the seven products here. Condition 5 sits in the middle, and again, that's something we would expect, because it's actually a different measuring device.
  So from an exploratory point of view, we can see these differences. Let's see if we can look at this from a more statistical point of view. For the example, I'm just going to focus on Location 3 and the four conditions within Location 3.
  To do this, I'm going to be using multiblock orthogonal component analysis, which is a bit of a mouthful, so it's just reduced to MOCA.
  And I'm also going to be looking at hierarchical modeling, but I'm not going to be discussing that too much in the context of this talk; I'm going to be focusing on the MOCA. These are two techniques that we find in the stats package SIMCA.
  Now the idea here is that we look at blocks of data, and traditionally each block represents a different way of measuring some kind of chemical or some kind of product. So as an example, we've got near-infrared, infrared and Raman spectroscopy.
  And the two things that we aim to do with MOCA and with the hierarchical modeling is first of all, assess redundancy.
  It might be that I just need near-infrared and Raman spectroscopy for the prediction and, in fact, if I know the near-infrared and I know the Raman, I can actually predict the infrared, so it's redundant.
  So that's what we're going to be looking at, but one thing to note is that when I talk about redundancy, it doesn't automatically mean I can throw that particular block out,
  because in its own right it may add to the prediction. So it's a balancing act between redundancy and predictability. As we've seen already, if I look at these charts here, it looks like there may be some redundancy.
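The redundancy idea, that a block is expendable for description when the other blocks can predict it, can be made concrete with a deliberately simplified stand-in: ordinary least squares of one block on the rest, not SIMCA's actual MOCA algorithm. The blocks below are fabricated with a globally joint signal and one unique signal:

```python
import numpy as np

def block_redundancy_r2(blocks, target):
    """R squared of predicting one block of curves from all the others.

    A high value suggests the target block is largely redundant, in the
    spirit of MOCA's joint-versus-unique split (this is a plain
    least-squares stand-in, not SIMCA's algorithm).
    """
    others = np.hstack([b for i, b in enumerate(blocks) if i != target])
    Y = blocks[target]
    X = np.column_stack([np.ones(len(Y)), others])
    B, *_ = np.linalg.lstsq(X, Y, rcond=None)
    resid = Y - X @ B
    return 1 - (resid ** 2).sum() / ((Y - Y.mean(axis=0)) ** 2).sum()

# Made-up demo: two blocks share only a globally joint signal g, while
# the third block also carries its own unique information u.
rng = np.random.default_rng(3)
n = 100
g = rng.normal(size=(n, 1))
u = rng.normal(size=(n, 1))
blocks = [g + rng.normal(0, 0.1, (n, 5)),
          g + rng.normal(0, 0.1, (n, 5)),
          g + u + rng.normal(0, 0.1, (n, 5))]

r2_joint = block_redundancy_r2(blocks, 0)   # high: block 0 is redundant
r2_unique = block_redundancy_r2(blocks, 2)  # lower: unique info remains
```

The block carrying unique information resists prediction from its peers, which is the same signal the MOCA bar charts express with the green, orange and unique components.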
  So what's actually going on? Well, let's think about this in terms of overlap. I've got Location 3 and I've got my four conditions. And to express overlap I've got a wonderful Venn diagram, so we can really start trying to understand what kind of information we've got.
  The first point of information is globally joint information, which is where we have information that's common to all four conditions.
  Then I can have information that's only common to two or three of the conditions, locally joint information, as highlighted in the orange here.
  And finally, whatever's left over is the unique information; this is what the different conditions bring to the party in their own right. So what might we expect to see here?
  Well, the left-hand side would indicate a situation where the four conditions actually have quite large amounts of independent information,
  and we'd probably think, no, there's not going to be any redundancy here; we have to keep all four.
  The image on the right, where we have a large amount of globally joint and locally joint information and not so much unique information, may indicate a situation where we could have redundancy, and it might be that we get away with looking at one or two of the conditions and not all four.
  In terms of what SIMCA does,
  it does some modeling and we've got our four conditions and it's fitted four components, two of which are looking at joint components and two of which are looking at unique components.
  Overall, we explain a lot of the variability in the data. We can see from the numbers at the top here that we're explaining nearly 100% of the variability. That's a good thing.
  The green bars show the globally joint information, and we can see these are really quite large; we've got a high number. So
  this is just telling us there's a large amount of overlap between these four different conditions.
  If I look at the locally joint information in orange, there is some, just between Conditions 2, 3 and 4. It's not huge.
  And finally, we can look at the unique information that each of the conditions has, and that's quite small. So here we are pretty much certain there's going to be some redundancy.
  We can also investigate where the uniqueness comes from in terms of products. So the size of the bubble here indicates whether it's unique or not.
  Now, if we have no uniqueness, we get very small bubbles; so, for example, Products I and F are very small. If we looked at these individually, we would expect to see just lots of green bars here.
  If we look at Product H, we see a little bit more independence. It's telling us there's a little bit of independent information across the conditions.
  It's not a big bubble though, and in reality it looks like there's a lot of redundancy. So what was then done is we took all 13 possible condition-location combinations and ran the MOCA analysis and also the hierarchical model,
  and from this, it's telling us, ideally, we need the first condition at Locations 1 and 3, and the fifth condition at Location 1,
  which really sort of goes against what we've been saying. We were saying earlier on that
  the conditions differ a bit and the locations don't differ at all. Well, not quite; there's obviously something going on in terms of predictions,
  but we have been able to go from 13 different combinations down to four, with a fairly good model, an R squared of 85.5
  and a Q squared of 69.3. The cross-validation measure isn't bad; it would be nice if these two were closer, but at least we are modeling our data here, in this case Yield 2, better than we have been able to in the past using this P data.
  So with that, I shall pass back to Bea.
Beatrice Blum Thank you, Phil, for these super interesting insights. Now let's look at what FDA does with the P data and what we can get from that.
  Similarly to what Phil showed us, we can see that location as a factor, if you will, does not have as much of an impact as the four different conditions do. They vary the results much more.
  And we can also see, which is quite interesting, when we look at the two best-performing products over here, with a 74 and a 72,
  that their corresponding prediction curves look quite different, perhaps even more different than what we saw in the K curves.
  So if we are going for the golden curve and trying to figure out what the best-performing profiles really are, this gives us a hard time, because it's quite difficult to find commonalities between these two curves.
  So the question really is: are there several golden curves, or might an average curve between the ones that are already performing well be what we have to go after?
  We can't answer that yet. We also can't answer whether there is any redundancy or which location-condition combinations we really need to measure.
  However, again, when I extract the summaries from the FDA and try to build a model to predict Yield 2, I again get a super good model, with an R squared of 98 and a press R squared of 97. Yes, we already know that this is too good to be true.
  On the other hand, we did try modeling the same Yield 2 with other extracted values from our curves, as they were provided to us by the measurement department.
  But trying to model those extracted values to predict Yield 2 did not work out at all. The R squares we could achieve were not even close to the ones we get here.
  So we do think something is going on and we made quite a huge step forward in trying to understand how we can model Yield 2.
  So let's wrap up what we found. We've seen that both methods result in very similar principal components and that they agree in terms of what commonalities they extract from the K and the P curves.
  FDA is probably a little easier to use, with little data prep. It allows us to predict curve shapes from Yield 1 and Yield 2, and that gives us some idea of what our golden curves may look like.
  It also indicates which practical measurement factors seem more important to keep.
  We are able to get really good models for our yields but do profoundly question these models.
  At the moment we couldn't extract which location-condition combinations are most relevant to keep. That's something we would like to follow up with JMP.
  PLS, on the other hand, does give us very useful information about which sections of the curves are most informative in terms of discriminating our products.
  MOCA and hierarchical PLS additionally point us to the measurement protocols that we need to keep to capture the most relevant information from our P curves. The PLS models to predict yield appear somewhat more reasonable in terms of goodness-of-fit metrics than the FDA models did.
  Our combined efforts helped us find clear patterns to differentiate our products. Both PLS and FDA enable us
  to extract essential features from the traces and to calculate good prediction models for our yields. We also learned which aspects of the curves and which protocols are most meaningful.
  We managed to model or predict our yields much better than we could in the past. That's a huge step forward.
  This is actually a very good example where too many cooks did not spoil the broth. Both methods agree to a certain point, but where they differ, they each give different additional information. The work is ongoing; we will expand from here.
  Particularly,
  we will add new independent data to validate or improve the models. We still need to fine tune which protocols we need to keep to give us the most relevant and the least redundant information.
  That's something where we hope JMP will help, enabling that in the FDE platform. Our final goal is to understand what material composition of diapers
  will result in which curve shapes and how those curve shapes relate to consumer yield. In our pursuit of the golden curve, we made good progress and are excited to eventually fully capture it.
  With that, we thank you for your attention and are open to questions now.