Need help in determining what data should be treated as random in repeated measurements

Jun 10, 2023 1:48 PM

Hi.

Can someone help me with this one, or verify that I am on the right path here?

I have data in long format that contains month (January to March), day of the month (1 to 31), shift (1 to 3), total energy consumption within the shift (the dependent variable), and the working time in hours of 12 different systems (machines) within a shift (0 to 8 hours).

I want to find out the energy consumption of each system and be able to predict future expected energy consumption based on the expected working hours of each system.

As there are repeated measures, a mixed model seems appropriate to me. My issue now is what I should treat as fixed, random, and repeated.

I am thinking of month as random, as the data contain only a sample (3 months) of all the months in a year. Also, the observations are not random but sequential. Can this be a problem?

I'm looking at day and shift as fixed, as they cover all the days and shifts within a month. Is this an appropriate approach?

As for the repeated structure, I think day should be the repeated measure, but which structure is appropriate?

Also, can you please explain what should be nested, and how? I am thinking of nesting shifts within days, and days within months. Can this be done? How should I do it?

Well, this is my intuitive thinking, and I would appreciate any feedback and help. Sharing any tutorial on how to nest would also be appreciated, as this is my first experience with JMP.

Thank you all

SDF1 · Jan 27, 2022 02:50 PM

Hi @MachineHippo109 ,

Sounds like an interesting problem. Are you able to share the data in a JMP file? The structure of your data table sounds a bit complicated to make up something on my own and ensure it has the right data structure.

It sounds like you have a time series data collection on 12 different machines that operate between 0 and 8 hours in each shift. This could be modeled using many different approaches. One approach could be time series forecasting. Another might be NN or Boosted Trees, or even a GenReg model might fit the data well. But it sounds like you are trying to predict the future power consumption for each machine based on the total hours of operation, is that right?

I'd be interested in giving it a try and see what modeling method works the best. One thing you'll definitely need to do is partition the data into training/validation/test sets in order to make sure you have the best model possible. Either that, or you'll have to simulate data with a similar correlation structure so that you can use that simulated data to train the model and test it on the real data.

If you can share the data, I might try a few things and let you know what I find.

Sounds fun!,

DS

MachineHippo109 · Jan 28, 2022 03:46 AM

Hi SDF1 and thank you for your fast reply.

Unfortunately, I can't share the data, as it is actual data from the company where I am doing my internship. However, the data look like this:

Month | Day | Shift | Consumption (MWh) | System1 (Hr) | System2 (Hr) | System3 (Hr) | .......
1 | 1 | 1 | 3.8 | 8 | 2 | 5 | .......
1 | 1 | 2 | 3.2 | 8 | 6 | 6 | .......

I am more interested in how much energy each system consumes and, of course, in predicting future consumption based on planned production/working times.
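One way to make this goal concrete is an additive model, where the shift total is the sum of each system's hours times a per-hour rate. The following is only a minimal Python sketch under that assumption, using invented numbers and a hand-rolled least-squares solver. The names `fit_rates` and `solve_linear_system` and the rates are hypothetical, and this is not what JMP's Fit Model does internally.

```python
# Sketch: estimate a per-hour energy rate for each system by ordinary least squares.
# All numbers below are invented for illustration; they are not the poster's data.

def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting.

    Assumes the system is non-singular (no zero pivot after row swaps).
    """
    n = len(b)
    # Build the augmented matrix so row operations touch both sides.
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    # Back substitution on the upper-triangular system.
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x

def fit_rates(hours, consumption):
    """Least-squares fit of consumption = sum_j rate_j * hours_j (no intercept)."""
    k = len(hours[0])
    xtx = [[sum(row[i] * row[j] for row in hours) for j in range(k)] for i in range(k)]
    xty = [sum(row[i] * y for row, y in zip(hours, consumption)) for i in range(k)]
    return solve_linear_system(xtx, xty)

# Hypothetical shifts: hours run by three systems, and total MWh for each shift.
hours = [[8, 2, 5], [8, 6, 6], [8, 0, 1], [8, 4, 0], [8, 7, 3]]
true_rates = [0.06, 0.25, 0.30]  # assumed MWh per hour, for the demo only
consumption = [sum(r * h for r, h in zip(true_rates, row)) for row in hours]

rates = fit_rates(hours, consumption)

# Forecast for a future shift from its planned working hours.
planned = [8, 3, 4]
forecast = sum(r * h for r, h in zip(rates, planned))
```

Note that a system that runs the full 8 hours in every shift contributes a constant column, so its rate cannot be separated from a fixed baseline load. Adding an intercept to such a model would make the two confounded.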

martindemel · Jan 28, 2022 07:58 AM

Some years ago I helped a customer with a resource planning model. They had a project plan where the project lead could enter things like complexity or innovation level, and they wanted an estimate of the resources and time needed to finish the project. Previously this was estimated from the project lead's experience, and it turned out that the new model outperformed those estimates by a wide margin in almost all cases.

From previous data they had the times for the different phases of the project, the resources (people) involved in each phase, and the actual length of it.

A key to success was a clustering step (here it was a latent class analysis) to find good predictors for the different scenarios, as an overall model did not work well.

They then built a model for each cluster scenario and used validation and test sets to find the best model and test it against unseen data.

For the modeling process we compared different strategies in JMP Pro, such as Bootstrap Forest, Generalized Regression with various methods (lasso, elastic net, ...), Neural Nets, and others. Comparing these was important for understanding possible flaws, or the situations where (and why) one model performs better. Sometimes a model did better only in a setting they were not much interested in, so a model that performed better in the other settings would be chosen, even if its overall performance was a bit lower.

Finally they had models for the clustered scenarios and used the appropriate model to predict time scale and resource plan based on their project setup.

I believe the transfer to your situation should not be that difficult. The point I want to make is that you need to understand your data before going directly into the modeling process. Maybe there is a scenario (maybe not in the data yet, such as line or test equipment, ...) that generates different output for the same parameter settings due to that lurking variable. Putting all the data into one big basket would then probably lead to a bad prediction.

/****NeverStopLearning****/

MachineHippo109 · Jan 29, 2022 03:02 PM

Hi Martin and thanks for your reply.

Your approach is sound. In fact, each shift has a rather distinctive pattern. System 1 runs 24/7/365, System 2 runs whenever at least one of the other systems is running, and in 95% of cases only System 12 is running in the third shift.

I thought of fitting 3 separate models, one for each shift, but then thought that a mixed model could solve this problem, especially when I recalled from the back of my head that repeated measurements = mixed model.

SDF1 · Jan 28, 2022 09:32 AM

Hi @MachineHippo109 ,

The problem sounds pretty straightforward, but it will require you to explore your data using many different platforms within JMP, evaluating things like GenReg, decision trees, SVM, NN, KNN, PLS, and standard least squares, for example. I do not think your data structure is in the right format for a time-series analysis/prediction. The time series platform requires a unique time identifier for each observation (row). Hence, having shifts 1, 2, 3 all under Day 1 (and so on) wouldn't work.
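If a time-series view is wanted anyway, one option is to derive a unique time identifier from month, day, and shift. This is a minimal Python sketch, assuming 8-hour shifts, a hypothetical 06:00 start for shift 1, and the year 2022. None of these is stated in the thread, and `shift_start` is a made-up helper name.

```python
from datetime import datetime, timedelta

SHIFT_HOURS = 8        # assumed shift length
FIRST_SHIFT_START = 6  # assumed start hour of shift 1 (hypothetical)
YEAR = 2022            # assumed; the thread does not state the year

def shift_start(month, day, shift):
    """Return one unique timestamp for a (month, day, shift) row."""
    base = datetime(YEAR, month, day, FIRST_SHIFT_START)
    return base + timedelta(hours=SHIFT_HOURS * (shift - 1))

# Hypothetical rows in the same (month, day, shift) layout as the data table.
rows = [(1, 1, 1), (1, 1, 2), (1, 1, 3), (1, 2, 1)]
stamps = [shift_start(m, d, s) for m, d, s in rows]
```

With this layout, shift 3 of one day and shift 1 of the next are 8 hours apart, so the rows form one evenly spaced sequence.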

Based on the structure that you shared, it appears that each system (1-12) is in operation for 0-8 hours during a given shift on a given day and the power consumption is the total power consumption for all the systems.

When developing a model, if you use the standard Fit Model platform, you might want to think about, or consult with your coworkers on, whether there is any reason to include crossed (interaction) terms in your model. For example, perhaps you know that there's some crossed term between SystemX Hrs and the shift, such as System5 always running for at least 6 hrs during shift 3 but less than 4 hrs during the other two shifts. This would introduce an effect where the shift and the system usage hours are intermixed. I do not recommend throwing that in unless you have prior evidence that such interactions exist.

As mentioned previously, you'll want to partition the data (Power in MWh) into at least training and validation sets, and you'll probably want to stratify it by both Power and Shift so that the training and validation sets are equally represented across shifts and maintain the same distribution for the System Hr usage. I've attached a mock data table with the same structure as yours that would be for Jan and Feb. I stratified the validation column I made by both Power and Shift, and you can explore the distributions and see how the data is partitioned -- 75% training and 25% validation. You might even consider a 60/20/20 split (60% training, 20% validation, and 20% test) and then split off the test data as a subset for comparing the different models.
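Outside JMP, the idea of a stratified partition can be sketched in a few lines of Python. This is only an illustration under assumptions: the data are randomly generated, consumption is binned at its median to serve as a stratum, and `stratified_split` is a hypothetical helper, not the routine behind JMP's validation column.

```python
import random
import statistics
from collections import defaultdict

def stratified_split(rows, key, fractions, seed=0):
    """Label each row so that every stratum is divided according to `fractions`."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for i, row in enumerate(rows):
        strata[key(row)].append(i)
    labels = [None] * len(rows)
    names = list(fractions)
    for indices in strata.values():
        rng.shuffle(indices)
        start = 0
        cumulative = 0.0
        for name in names:
            cumulative += fractions[name]
            # The last set takes whatever remains, so no row is left unlabeled.
            end = len(indices) if name == names[-1] else round(cumulative * len(indices))
            for i in indices[start:end]:
                labels[i] = name
            start = end
    return labels

# Hypothetical table: 90 days x 3 shifts with invented consumption values.
rng = random.Random(1)
rows = [{"shift": s, "mwh": rng.uniform(0.5, 4.0)} for _ in range(90) for s in (1, 2, 3)]
median_mwh = statistics.median(r["mwh"] for r in rows)

# Stratify on shift and on a coarse consumption bin (above or below the median).
labels = stratified_split(
    rows,
    key=lambda r: (r["shift"], r["mwh"] > median_mwh),
    fractions={"training": 0.75, "validation": 0.25},
)
```

Passing `{"training": 0.6, "validation": 0.2, "test": 0.2}` as `fractions` would give the three-way split instead.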

Ultimately, you'll need to explore models with several different platforms and optimize the fit with each model. Once you've generated several models, you'll probably want to see which one performs the best by evaluating each model on the withheld "test" set -- this would be data that is not used in training or validating models. The best performing model would be the one you would want to deploy for your company.
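The comparison on the withheld test set can be as simple as computing one error measure per model. A minimal Python sketch, assuming RMSE as the measure and using invented test values and predictions (the model names are placeholders, not results from any real fit):

```python
import math

def rmse(actual, predicted):
    """Root mean squared error between two equally long sequences."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))

# Invented test-set consumption (MWh) and predictions from two hypothetical models.
test_actual = [3.8, 3.2, 1.1, 4.0]
test_predictions = {
    "least_squares": [3.6, 3.3, 1.0, 4.2],
    "neural_net": [3.9, 2.7, 1.5, 3.5],
}

scores = {name: rmse(test_actual, preds) for name, preds in test_predictions.items()}
best = min(scores, key=scores.get)  # model with the lowest test RMSE
```

The test rows should play no part in fitting or tuning any of the candidates, otherwise the comparison is optimistic.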

To come back to your thread's title, it doesn't sound like you need to have a random effect included in your model. I don't think you need to treat any of the data as random.

Hope this helps!,

DS

MachineHippo109 · Jan 29, 2022 03:31 PM

Hi SDF1 and thanks for your reply.

There is no doubt that I will definitely need to try different modeling techniques.

Well, based on the data, when only System 1 is working, consumption is steady at around 0.5 MWh. System 2 works whenever at least one other system works (except System 1, which works all the time). In shift 3, usually only System 12 is running, together with System 1 and System 2. In shift 1 there are cases when all systems work. What I have noticed when I plot the data is that shifts 2 and 3 are almost identical, with values of around 80% of those in shift 1.

Anyway, I turned toward a mixed model because repeated measures were almost synonymous with mixed models.

However, now I am questioning whether the observations are independent or correlated. It looks to me that they are correlated, and that working in shift 3 will influence, to a certain degree, the consumption in shift 1 the next day, as the systems have already reached working temperature and conditioning. On the other hand, the work done in shift 3 does not influence the work scheduled in shift 1, as the production schedule is determined by other factors.
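One quick way to probe this question is to fit a model first and then look at the lag-1 autocorrelation of its residuals in time order. A minimal Python sketch with invented residual series (the function name and the numbers are illustrative assumptions):

```python
def lag1_autocorrelation(values):
    """Sample lag-1 autocorrelation of a sequence kept in time order."""
    n = len(values)
    mean = sum(values) / n
    num = sum((values[i] - mean) * (values[i + 1] - mean) for i in range(n - 1))
    den = sum((v - mean) ** 2 for v in values)
    return num / den

# Invented residuals (MWh), ordered shift by shift.
drifting = [0.3, 0.25, 0.2, 0.1, 0.0, -0.1, -0.2, -0.25, -0.3]  # neighbours look alike
alternating = [0.2, -0.2, 0.2, -0.2, 0.2, -0.2]                 # neighbours flip sign

r_drifting = lag1_autocorrelation(drifting)
r_alternating = lag1_autocorrelation(alternating)
```

A value near zero is consistent with independent errors, while a clearly positive value suggests that consecutive shifts carry over something the model does not capture.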

In any case, there is no doubt that I will need to try different methods.

martindemel · Feb 2, 2022 05:35 AM

I'm with SDF1 that a mixed model approach would not be necessary here. You have to think about what your random effect is. Just as a simple example:

You have two species, and each season you measure some key quantity on the same subjects within each species. This is repeated measures over the course of the seasons.

Applied to your scenario, it would only make sense if you wanted to understand whether there is a month-to-month or day-to-day difference, but your projects are all different, so each row (observation) is independent of the other rows. There is no real repeated measure unless you do exactly the same project twice, and even then you might argue that the two are independent of each other and just help you assess the error/variance.

The system measures do not look like repeated measurements to me, as they sum up to the overall time. They may have some interaction, but again, I can see nothing about repeated measures here.

/****NeverStopLearning****/
