I have a question, and I only have a vague idea of where to start with it.
I have a data set in the standard 2D flat-table form, with 3 potential outputs and many, many inputs.
Each output consists of a single continuous number (e.g. weight gain, materials consumed) that is collected at the end of a variable time period (30-50 days).
At present this isn't a problem, as all the input variables (such as geographic location, supervisor) do not change during this time period, and thus one row of data represents one complete time period.
The problem I am about to run into is that I am due to receive data that changes continually during this time period (e.g. temperature from a logger).
There is no sensible way to handle this in the usual manner of taking a few constructed summary variables (for example min, max and average temperature over the complete time period) and making a new variable for each, as these are unlikely to capture what makes the difference to the output.
Another suggestion was to break this down into a more granular form and make a column for each interval (e.g. a min and max for every day), but I feel that is also nonviable.
Even if you could avoid every column being highly correlated with every other, I strongly suspect that daily maximums and minimums on their own are not what makes the difference (a max of 50°C for one second is less impactful than a max of 49°C for one hour).
To me this needs to be handled with a different method.
I have been pointed to the 'Mixed Models' platform in JMP but have no experience with it.
To me, I am dealing with a repeated measure in time, with heavy correlation throughout the measure.
From what I gather (I may be wrong), I THINK this method needs my output variable to also be measured over this time.
This is not the case for me. I get all my outputs at the end, and only at the end.
What I am looking to achieve is to find out what in this logged data makes a difference to my output.
Can anyone help with what I can do about this problem?
I think a useful approach could be to manipulate the data into a "wide" format, so that the temperature at each time point becomes a new variable.
Long format (one row per time point per run):

t    T    Y
0    23   50
1    25   50
...
100  24   50
0    21   51
1    22   51
...
100  25   51

Wide format (one row per run):

Y    t0   t1   ...  t100
50   23   25   ...  24
51   21   22   ...  25
You can then use methods that are good for wide, correlated data: PCA, clustering, PLS, Gen Reg. In JMP and JMP Pro there are methods that can deal with really very wide data.
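To make the reshape concrete, here is a minimal sketch in Python with pandas (the column names t, T and Y match the toy example above; in real data you would pivot on a run identifier rather than on Y, since two runs could share the same output value):

```python
import pandas as pd

# Long format: one row per (run, time point), as logged.
long_df = pd.DataFrame({
    "t": [0, 1, 100, 0, 1, 100],
    "T": [23, 25, 24, 21, 22, 25],
    "Y": [50, 50, 50, 51, 51, 51],
})

# Wide format: one row per run, one column per time point.
wide_df = long_df.pivot(index="Y", columns="t", values="T")
wide_df.columns = [f"t{c}" for c in wide_df.columns]
print(wide_df)
```

The same reshape is available in JMP via Tables > Split; the point is only that each time point becomes its own column.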
This is relatively simple if the number of time points is the same for each "run". However, it sounds like the total time period will be different each time. It might be 30 days or it might be 50 days.
So you then need to decide how to cope with this problem. And I think that depends on your domain knowledge.
Is it best to just look over the time period that is complete for all runs? That is, discard data beyond 30 days.
Or should you normalise over time? That is, treat the time points not as absolute but as fractions of the total time, so they can be aligned across runs with different total time periods. In that case you would perhaps also include a variable for the total time.
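A hedged sketch of the second option, assuming Python with NumPy: resample each run onto a common grid of fractions of its own total duration, so a 30-day run and a 50-day run end up with the same number of columns. The function name and grid size are illustrative:

```python
import numpy as np

def to_fractional_grid(times, values, n_points=101):
    """Resample one run's logged series onto n_points evenly spaced
    fractions (0.0 .. 1.0) of that run's own total duration."""
    times = np.asarray(times, dtype=float)
    frac = times / times[-1]                 # 0..1 within this run
    grid = np.linspace(0.0, 1.0, n_points)
    return np.interp(grid, frac, values)     # linear interpolation

# A 30-day run and a 50-day run, both resampled to 101 columns.
run_a = to_fractional_grid(np.arange(31), 20 + np.arange(31) * 0.1)
run_b = to_fractional_grid(np.arange(51), 22 + np.arange(51) * 0.05)
assert run_a.shape == run_b.shape == (101,)
```

Whether fraction-of-total-time is the right alignment is a domain question; if the process is driven by absolute elapsed time, truncating to the shortest common period may be more honest.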
Thanks for your response!
In terms of the challenges
CORRECTING FOR VARIABLE TIME SPANS
We can usually correct to a base day if the actual finish day isn't too far away. For example, 40 days can be the base finish day; if a run actually ends at 42 days, the output is corrected back (knowing that a chicken usually changes by a specific amount per day in good conditions). If the actual day is too far from 39 days (say 50), that data will probably just be omitted, and I don't think that even happens in reality (50 days). At worst, at least to get this working, I will use only data that ends very near a common time point (i.e. 38-41 days) and correct it to 39 days.
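The correction described above could be sketched like this in Python; the growth rate, base day and cut-off are illustrative assumptions standing in for the real domain constants, not values from the data:

```python
BASE_DAY = 39
GROWTH_PER_DAY = 0.06   # assumed change per day in good conditions (illustrative)
MAX_DEVIATION = 2       # only keep runs finishing within 38-41 days

def correct_to_base(weight_gain, finish_day):
    """Adjust an observed output back to the common base day,
    or return None if the run finished too far from it."""
    if abs(finish_day - BASE_DAY) > MAX_DEVIATION:
        return None                          # omit runs too far from base
    extra_days = finish_day - BASE_DAY
    return weight_gain - extra_days * GROWTH_PER_DAY

print(correct_to_base(2.5, 41))   # 41-day run corrected down by 2 days' growth
print(correct_to_base(2.5, 50))   # too far from base: omitted (None)
```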
If we go with your idea (which is what I thought we might have to use initially) of some form of variable reduction, how do I retain real-world meaning? This is a problem I have generally had with PCA in the past: PCA works well, but my new factors don't really resemble anything 'real'.
PCA (I'm most familiar with PCA) will factorise my (now time-indexed) input variables into factors that only vaguely resemble the originals.
How will I know what the new factors relate to? (i.e. if Factor 1 is now an important variable, made up of many original variables, how will I know what it really relates to?)
Also, the number of input variables will be massive: up to one variable for every 60 seconds over 39 days (which can of course be reduced, but it is still a very high number). Is this appropriate for these methods?
As I will have many different original variables (temperature, humidity, etc.), would these (at least initially) be better off being analysed completely separately and/or combined? Of course I can try both!
tn = temperature at time point n
hn = humidity at time point n

Analysed separately:
Y - t0 - t1 - t2 - ... - tn
Y - h0 - h1 - h2 - ... - hn

Combined:
Y - t0 - t1 - t2 - ... - tn - h0 - h1 - h2 - ... - hn
Thanks for your help
Yes, interpretation of PCs is a challenge. It is possible to "see" what each component "looks like". This might give you useful insight.
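One way to "see" a component, sketched in Python with scikit-learn on synthetic data (all names and data here are illustrative): each principal component is a vector of loadings with one entry per time point, so plotting those loadings against the time axis shows which part of the time course the component weights most heavily.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
n_runs, n_times = 40, 100

# Synthetic wide data: each run is noise plus a random multiple of a
# shared time-shaped signal (a half sine wave over the time course).
pattern = np.sin(np.linspace(0, np.pi, n_times))
X = rng.normal(scale=0.1, size=(n_runs, n_times)) \
    + np.outer(rng.normal(size=n_runs), pattern)

pca = PCA(n_components=3).fit(X)
loadings = pca.components_   # shape (3, n_times): one curve per component

# Plotting loadings[0] against time would reveal the half-sine shape,
# i.e. "this component is about mid-period temperature level".
assert loadings.shape == (3, n_times)
```

The same idea works in JMP by saving the loadings from the PCA report and plotting them against the time index.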
PLS will most likely give you better models. The PLS platform in JMP Pro includes plots that enable you to see which of the variables are contributing most.
I would recommend having a look at the book on PLS by Cox and Gaudard. It also gives a good explanation of PCA.
As I mentioned above, many other techniques could be used (Clustering, Gen Reg, Bootstrap Forest, Neural Nets).
Regarding the challenge on the number of variables: this is where your domain knowledge comes in. Do you really think there is value in knowing temperature at every 60 s? JMP can handle really very wide data. You need to decide if this will be useful.
As always this will be a process of exploration. Good luck!