Solved: Incorrect output for cross-correlation in time series platform?

SDF1 · Mar 3, 2021 10:20 AM

Hi JMP Community,

(W10, 64-bit, JMP Pro 15.2.1)

I'm hoping to find out either what I'm doing wrong, or what I misunderstand about how JMP performs the cross-correlation calculations in the Time Series platform. In short, when I compute the cross-correlation of Sin(x) with Cos(x), I don't get a lag of pi/2 (90 degrees) as I would expect; I get 79 degrees instead.

I am trying to put together a simplified explanation and description of what the Time Series > Cross-correlation output means for some colleagues. To do that, I have made a simple table with 361 rows (values 0-360, each row is a degree, or 0.01745 radians) and several other columns that calculate the Sin(x), Sin(x-5), Sin(x-45), Cos(x), and a random uniform distribution (see attached table Periodic_series.jmp). The purpose is to demonstrate cross-correlation using simple trig functions that people can understand more directly than a complex time dependent function, like the SeriesJ data set.

If you run the first script in the data table, you get a graph of the trig functions (random uniform is left out on purpose). Everything looks as it should.

You can then run the time-series script, where Sin(x) is the Y input, and the other functions are the Input List columns so that you can do the cross-correlation analysis. Turning the cross correlation output into a data table (see cross-correlation.jmp data table), you can plot the correlation values vs. lag (run first script in this data table).

At first glance, this looks as it should: the cross-correlation of Sin(x) with Sin(x-5) has its peak near -5, Sin(x) with Sin(x-45) looks to be near -45, and Sin(x) with Cos(x) is near 90. However, if you use the crosshairs, you'll see that the peak of the cross-correlation between Sin(x) and Cos(x) is closer to 80.

In fact, if you run the script: Get lags in the data table, it prints out the lags in the Log window:

"Corr vs. Sin(x-5) lag in deg = [-5]"
"Corr vs. Sin(x-45) lag in deg = [-40]"
"Corr vs. Cos(x) lag in deg = [79]"
"Corr vs. Rondom Uniform lag in deg = [18]"

The only one that is correctly calculated is the lag of -5. Lagging Sin(x) by -45-degrees results in a lag of -40, not -45 as expected. Similarly, with Sin(x) and Cos(x), the lag is 79, when I would expect the value to be 90. I've tried this multiple different ways and with many more rows to see if it's a "resolution" issue, but I keep getting a lag of around 80.

Either the calculation is not returning the right lag, or I'm missing something. I'm guessing I'm missing something, but I'm not sure what it is. I've followed some suggestions from other posts and tried reading up on it in the JMP help here, but I can't find any details of the calculation for the cross-correlation that would explain why a trig function of Sin(x) doesn't have a lag of 90 with Cos(x).
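For anyone wanting to reproduce these numbers outside JMP, here is a minimal Python sketch. It assumes the common textbook estimator that divides every lag's cross-product sum by the full series length n; whether JMP's Time Series platform uses exactly this normalization is an assumption, not something confirmed in this thread. The names `ccf` and `peak_lag` are made up for illustration.

```python
import math

def ccf(x, y, k):
    """Sample cross-correlation of x[t+k] with y[t].

    Assumption: the sum is divided by n (full length), not by the
    number of overlapping pairs.
    """
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sx = math.sqrt(sum((v - mx) ** 2 for v in x) / n)
    sy = math.sqrt(sum((v - my) ** 2 for v in y) / n)
    # Only pairs where both t and t+k fall inside the series contribute.
    t_lo, t_hi = max(0, -k), min(n, n - k)
    cov = sum((x[t + k] - mx) * (y[t] - my) for t in range(t_lo, t_hi)) / n
    return cov / (sx * sy)

def peak_lag(x, y, max_lag):
    """Lag in [-max_lag, max_lag] where the cross-correlation is largest."""
    return max(range(-max_lag, max_lag + 1), key=lambda k: ccf(x, y, k))

# One row per degree, 0..360, as in the table described above.
deg = range(361)
sin_x = [math.sin(math.radians(d)) for d in deg]
sin_x5 = [math.sin(math.radians(d - 5)) for d in deg]
cos_x = [math.cos(math.radians(d)) for d in deg]

peak_sin5 = peak_lag(sin_x, sin_x5, 180)
peak_cos = peak_lag(sin_x, cos_x, 180)
print("peak vs Sin(x-5):", peak_sin5)
print("peak vs Cos(x):  ", peak_cos)
```

Under that assumption, the Sin(x-5) peak should stay close to -5 while the Cos(x) peak falls well short of 90, which is the same pattern as the log output quoted above.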

Any thoughts/feedback on this is much appreciated.

Thanks!,

DS

peng_liu · Mar 3, 2021 12:19 PM

I believe it is due to the finite sample size. For consecutive points on the CCF, the sample sizes differ by 1. I added 3600 more rows and did the plot again; the peak is closer to 90, and the amplitude at the peak is also closer to 1.

SDF1 · Mar 3, 2021 02:58 PM

Hi @peng_liu ,

Thanks for the feedback. I had tried more rows, but didn't try that many. Although I understand how the finite sample size might influence the calculation, I'm a little surprised that it impacts it so much. I can imagine a small difference, maybe 88 or 89 with a smaller finite set, but to be 12 off seems pretty large. I guess my concern is what about real world situations where you might have a limited time-series data set.

By the way, I also noticed a mistake in my Deg column that didn't correctly account for adding more rows. The formula should be:

This way, the degree data will always run from -180 to 180 (well, with the small difference on one end or the other because of the increment size).

As an example of a real-world limited data set, maybe you have a device that collects data every second for only a fifteen-minute period. That's only 900 data points. In that situation, one would conclude the two signals have a lag of 78, when in fact it's 90. Concluding the signal lag is 78 would be incorrect and could lead someone to analyze the data wrongly. No matter what

In this particular case, we know the real lag a priori, but in real situations that's not always known. I can imagine data sets where the same event is measured at different locations and the lag in signal needs to be determined, but the data set is rather limited. Based on how the cross-correlation function seems to work, this might result in an incorrect conclusion.

Is there a way to compensate for this or for the calculation to account for the limited data set? Could you have some other functional column that you swap out and do a simulation of the lag amount in order to get some confidence interval on the lag? What kind of functional column would that be?
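One possible compensation, offered only as a sketch: if the estimator divides the cross-product sum by the full length n at every lag (an assumption about the calculation, not confirmed here), then dividing by the number of overlapping pairs, n - |k|, removes that taper. This is not a JMP option mentioned in the thread; `ccf_curve` and its `rescale` flag are hypothetical names.

```python
import math

def ccf_curve(x, y, max_lag, rescale):
    """Cross-correlation of x[t+k] with y[t] for k in [-max_lag, max_lag].

    rescale=False: divide each lag's sum by n (assumed default estimator).
    rescale=True:  divide by the overlap n - |k| instead.
    """
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    xc = [v - mx for v in x]
    yc = [v - my for v in y]
    # Product of the two sample standard deviations (constant over lags).
    scale = math.sqrt(sum(v * v for v in xc) / n) * math.sqrt(sum(v * v for v in yc) / n)
    curve = {}
    for k in range(-max_lag, max_lag + 1):
        t_lo, t_hi = max(0, -k), min(n, n - k)
        s = sum(xc[t + k] * yc[t] for t in range(t_lo, t_hi))
        denom = (n - abs(k)) if rescale else n
        curve[k] = s / denom / scale
    return curve

# One cycle, one row per degree: Sin(x) against Cos(x).
sin_x = [math.sin(math.radians(d)) for d in range(361)]
cos_x = [math.cos(math.radians(d)) for d in range(361)]

plain = ccf_curve(sin_x, cos_x, 180, rescale=False)
rescaled = ccf_curve(sin_x, cos_x, 180, rescale=True)
peak_plain = max(plain, key=plain.get)
peak_rescaled = max(rescaled, key=rescaled.get)
print("peak lag, divide by n:      ", peak_plain)
print("peak lag, divide by overlap:", peak_rescaled)
```

The trade-off is that estimates at large lags then rest on few pairs, so they get noisier and can exceed 1 in magnitude. On clean sinusoids that does not matter, but on noisy data it can create spurious peaks far from lag 0.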

If I re-do the evaluation with the correct formula to allow for an arbitrary number of rows, I still don't get the right answer -- even when converting back from a lag value to a degree value. For example, with 3601 rows (values from 0 to 3600), I get the lag plots like this:

And when I run the Get lags script -- but now looking at the column where lag is converted back into degrees (formula below based on the 3601 rows in the original table), I get:

"Corr vs. Sin(x-5) lag in deg = [-4.89863926687031]"
"Corr vs. Sin(x-5) lag in lag = [-49]"
"Corr vs. Sin(x-45) lag in deg = [-40.2888086642599]"
"Corr vs. Sin(x-45) lag in lag = [-403]"
"Corr vs. Cos(x) lag in deg = [78.4782004998612]"
"Corr vs. Cos(x) lag in lag = [785]"

Below is a plot of Sin(x) with Cos(x) that is lagged by 900 and 785 rows. As can be seen, 785 is not at the correct lag value, which occurs on row 900 (red curve).

So, I am still not getting the right values. New data tables are attached. I still have a hunch that I'm not doing something right -- maybe a wrong formula somewhere? I can't figure out why I'm not getting the right values.

Thanks for the help!,

DS

peng_liu · Mar 3, 2021 04:00 PM

There are problems in the second set of data. This is what Sin(x) in the second data looks like:

This is what my 3601 rows look like:

That explains why your SinX and CosX CCF does not change.

The sample size that matters here is not the number of observations but the number of cycles. In the original data, there is just one cycle. To calculate the CCF, one needs to shift the series, which creates missing values in one series, which in turn causes the sample means to swing wildly. If there are many cycles, the mean will stay close to zero, the true value.

If, in a real situation, one has just one cycle of data, I believe the CCF will be misleading, but that is a data problem, and there is no way to fix it.
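The dependence on the number of cycles can be checked with a short Python sketch. It keeps one row per degree and only varies how many full cycles the table covers. The estimator (sum over overlapping pairs, divided by the full length n) is an assumption about the calculation, and `peak_lag_sin_vs_cos` is a made-up helper name.

```python
import math

def peak_lag_sin_vs_cos(cycles, max_lag=180):
    """Peak lag (in rows = degrees) of the Sin(x) vs Cos(x) cross-covariance."""
    n = 360 * cycles + 1                      # one row per degree
    x = [math.sin(math.radians(d)) for d in range(n)]
    y = [math.cos(math.radians(d)) for d in range(n)]
    mx, my = sum(x) / n, sum(y) / n
    xc = [v - mx for v in x]
    yc = [v - my for v in y]

    def cov(k):
        # Assumed estimator: overlapping pairs only, divided by n.
        return sum(xc[t + k] * yc[t] for t in range(n - k)) / n

    # The standard deviations do not depend on k, so the covariance
    # and the correlation peak at the same lag.
    return max(range(max_lag + 1), key=cov)

peaks = {c: peak_lag_sin_vs_cos(c) for c in (1, 2, 5, 10)}
for c, k in peaks.items():
    print(f"{c:2d} cycle(s): peak at lag {k}")
```

With more cycles in the window, the fraction of rows lost to the shift gets smaller, and the peak should move toward 90.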

SDF1 · Mar 3, 2021 06:54 PM

Hi @peng_liu ,

Thanks for your further explanations. I understand now what the issue was and it makes sense. I can confirm that I get the right values if I include more cycles within the time "window".

Now for the real conundrum:

My production colleagues want to find the time lag between data from their production historian system (it records things like oven temp, belt speeds, flow rates etc.) and QC measurements on the material. Production data can be exported from the system at any given interval, 15 min, 30 min, 4 hours, etc. We have a fairly regular QC testing schedule of every 4 hours. It is believed that the production historian data can be correlated with QC data if the correct time lag can be found. For example, the dryer temp in production is believed to correlate with a QC test, but the QC test is done maybe 120 min after the material has already gone through the dryer. We know from lab-scale studies that certain processing steps (values like oven temp) correlate with certain testing outcomes, but finding this in production data is proving more challenging than in the lab.

I have tried using the cross-correlation in the time series to see if it can find any kind of lag in the data as well as manually creating multiple lag columns and using the multivariate platform to look for correlations between the lagged columns and QC data, but I do not see anything. Naturally, it could be that any correlation gets washed out from noise in the process, but the other option is that neither the time series nor multivariate platforms are the right ones for trying to determine the correct time lag.
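The manual lag-column approach can be sketched in Python as a scan over candidate lags: for each QC time, look up the historian value recorded that many minutes earlier and correlate it with the QC result. Everything here is synthetic and hypothetical: the 15-minute export interval, the 4-hour QC schedule, and the 120-minute lag are taken from the example figures in this post, the QC values are built as an exact delayed copy of the historian signal, and the names (`pearson`, `corr_at_lag`) are invented for illustration.

```python
import math
import random

def pearson(a, b):
    """Plain Pearson correlation of two equal-length lists."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    sab = sum((p - ma) * (q - mb) for p, q in zip(a, b))
    saa = sum((p - ma) ** 2 for p in a)
    sbb = sum((q - mb) ** 2 for q in b)
    return sab / math.sqrt(saa * sbb)

STEP_MIN = 15          # historian export interval (assumed)
QC_EVERY_MIN = 240     # QC test every 4 hours
TRUE_LAG_MIN = 120     # lag built into the synthetic data

rng = random.Random(0)
historian = [rng.gauss(0.0, 1.0) for _ in range(2000)]   # fake dryer temperature

# QC times in minutes; start late enough that every candidate lag has data.
qc_times = range(480, len(historian) * STEP_MIN, QC_EVERY_MIN)
# Synthetic QC result: the historian value from TRUE_LAG_MIN earlier, no noise.
qc = [historian[(t - TRUE_LAG_MIN) // STEP_MIN] for t in qc_times]

def corr_at_lag(lag_min):
    """Correlation between QC results and historian values lag_min earlier."""
    shifted = [historian[(t - lag_min) // STEP_MIN] for t in qc_times]
    return pearson(shifted, qc)

candidates = range(0, 241, STEP_MIN)
best_lag = max(candidates, key=corr_at_lag)
print("best candidate lag (min):", best_lag)
```

Because the synthetic QC series is a noise-free copy, the scan recovers the built-in lag. With real process noise and only one QC point every 4 hours, the correlation at the true lag can be much weaker and harder to separate from chance.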

Thank you VERY much for the feedback and thoughts/input, it has been very helpful.

Thanks!,

DS

peng_liu · Mar 3, 2021 08:24 PM

Some thoughts...

It is possible that there is no exact lag and that your data is noisy, e.g. the effective lag is random within a range. Try smoothing your series with a simple moving average before calculating the CCF. You can use Simple Moving Average in the Time Series platform, or "Moving Average..." in a column formula.

It is also possible that the correlated signals are skipped due to your sampling scheme. In that case there is not much in the data for you to find the lag.
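For readers working outside JMP, the smoothing suggestion can be sketched in Python as a trailing simple moving average. This is a generic illustration, not the JMP implementation; `simple_moving_average` and the window length are hypothetical choices.

```python
def simple_moving_average(values, window):
    """Trailing simple moving average.

    Returns len(values) - window + 1 points; output[i] is the mean of
    values[i : i + window].
    """
    if window < 1 or window > len(values):
        raise ValueError("window must be between 1 and len(values)")
    running = sum(values[:window])
    out = [running / window]
    for i in range(window, len(values)):
        # Slide the window: add the new value, drop the oldest one.
        running += values[i] - values[i - window]
        out.append(running / window)
    return out

smoothed = simple_moving_average([1, 2, 3, 4, 5], 3)
print(smoothed)
```

If both series are smoothed before computing the CCF, using the same window and alignment for each avoids introducing an artificial shift between them.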

SDF1 · Mar 4, 2021 11:23 AM

Hi @peng_liu ,

Thank you for the additional suggestions. I will try the smoother to see if this can help find a reasonable time lag. Hopefully I can find something useful.

My concern is that either the data is noisy in time and the correlations are lost, or as you mentioned, the sampling/testing scheme erases any correlation.

I really appreciate your feedback and input.

Thanks!,

Diedrich
