Simulating data to generate larger data set for building models

SDF1 — Wed, 27 Feb 2019 15:32:04 GMT

Dear JMP community,

I'm trying to work out a way to generate simulated data to better train a model I'm building.

I have a decent sized data set -- a few thousand data points, but would like to generate simulated data that maintains a similar structure as the original data in order to improve upon the model. I'd like to have somewhere around 20K-50K data points with a similar structure in order to have a larger set to train, validate, and test the prediction model.

Just for visual purposes, the data looks something like this:

I've tried to work with the "Simulator" feature from the Profiler platform, but the data that is generated there doesn't keep a similar structure (variation) as the original data -- it gets too "washed out" with the standard deviations from the inputs and response. When I try to build a model off this set and compare it to the original model from the source data, the original model outscores it because that model maintains the structure of the data in its prediction. The model generated from the simulated data set doesn't, since all structure is essentially lost.

Also, the "Simulate" feature (right clicking a column in a report window) doesn't really get me where I want either.

I've done a bit of research into what people have done with JMP and found a few potential leads, however so far none of them really go after what I'm looking for. If anyone out there has a solution to this, or can point me in the right direction on how to code for it, I would be grateful.

Thanks

Re: Simulating data to generate larger data set for building models

Peter_Bartell — Wed, 27 Feb 2019 16:49:22 GMT

If you have JMP Pro version 14.2 your problem sounds like a perfect application for the Functional Data Explorer. Another option might be time series modeling approach using some kind of ARIMA construct perhaps? No need for JMP Pro for the time series approach.

Re: Simulating data to generate larger data set for building models

Mark_Bailey — Wed, 27 Feb 2019 17:12:03 GMT

First of all, how would you use the simulated data to improve the model? Is it to have sufficient data to demonstrate the feasibility of a particular modeling approach? Is it to assess the sample size that might be sufficient for a particular model? Is it just a matter of increasing the sample size?

Second of all, it seems like you have a single time series. Is that interpretation correct?

Third of all, the simulation feature of the Profiler is meant to work with a fitted or saved model. You can describe the variation in the predictors and in the responses in terms of parametric distribution models. I don't know if that mechanism works with time series models. (Other experts can chime in here.)

The platform simulation is intended to resample data based on a saved model with a stochastic component to generate the empirical sampling distribution of the chosen statistic in a platform.

I think that I can get a SAS program to simulate time series data that could be adapted to a JMP column formula or script. (No promises.)

Am I on the right track?

Re: Simulating data to generate larger data set for building models

SDF1 — Wed, 27 Feb 2019 17:10:14 GMT

Hi Peter,

Thanks for the comments. I forgot to add in the original post that I am running JMP Pro 14.1.0.

I might have to wait until the next update roll-out, but I can at least do some reading and follow-up.

Thanks!

Re: Simulating data to generate larger data set for building models

SDF1 — Wed, 27 Feb 2019 17:21:31 GMT

Hi Mark,

Thanks for your comments.

The data are generated very slowly -- meaning it takes a long time to actually get the original data, days and weeks. So, building a real data set that is 50K in size is not feasible. I am using some of the Pro (running Pro 14.1.0) modeling functions to try and generate a model that not only predicts the data well, but also captures the structure within the data. Having a larger data set that mimics the original structure would be very helpful in improving the model. Increasing the sample size (including more train, validation, and test size as well) and capturing the structure of the data would help with training a better model -- at least I think.

It's not strictly a time series, although one could view it that way. The data events are not tied to each other in time, necessarily. Another poster suggested a time series analysis, which I will look into.

Unfortunately, when I try to use the simulation feature of the Profiler and generate the larger data sets, the new output data doesn't show any of the related structure of the original data. I can build a new model on it that fits very well, but when I compare the new model to the original model (built on the original data), using the original data, the new model is not as good of a fit.

Sorry I can't be more specific, but I can't discuss the details of the data. I hope my comments make sense to the points you brought up.

Thanks!

For what I want to do, the simulation feature in the Profiler seems the right way to go, but it's not giving me what I'm trying to get. The Platform Simulation is definitely not the right way to go, or at least I can't see how to adjust it to give me what I'm after.

Re: Simulating data to generate larger data set for building models

Mark_Bailey — Wed, 27 Feb 2019 18:09:36 GMT

So the response is a time-dependent signal? Are there predictors or is it the auto-correlation that is important?

What do you mean by "Unfortunately, when I try to use the simulation feature of the Profiler and generate the larger data sets, the new output data doesn't show any of the related structure of the original data." What is the nature of the desired structure?

How is the data different from a time series?

I don't think that the simulator in the Profiler or the JMP Pro Simulate platform feature will satisfy your needs. A column formula or a script will likely be the most satisfying approach.

You can't disclose some details (understood) so we have to play "twenty questions!"

Re: Simulating data to generate larger data set for building models

SDF1 — Wed, 27 Feb 2019 18:52:27 GMT

Hi Mark,

The data are not time-dependent. Each measurement is independent of the previous one, but in general the data are collected one after another. They could be considered a time series, but not strictly speaking, as they might not technically be sequential. More importantly though, each measurement is an independent measurement of a response.

Hopefully, the visual below will explain what I mean about the simulated data losing any structure:

I generated that by opening the simulator in Profiler, giving each predictor (I have several) a normal distribution with a realistic SD (derived from the actual data), and generating a table with 50K responses. Although the average, maxs, and mins might be close to what the original data looks like, the other structure is lost. This new data set doesn't contain the same kind of "long range" structure the original one does, and it's this structure that is equally important to capture in a simulated data set as is the actual values of the data. It doesn't have to match exactly, but at least have a similar structure (see original post for comparison).

I agree that the two simulator options built into JMP are not appropriate. Any suggestions on a specific topic to read up on for the column formula or writing a script? I am not familiar with any of the built-in formulas that could take care of this for me. I am fine with scripting such a thing, but would need to know where to start. I still need to look into the other suggestion from Peter Bartell about the functional data explorer and ARIMA modeling to see if those would handle it.

Thanks!

Re: Simulating data to generate larger data set for building models

Peter_Bartell — Wed, 27 Feb 2019 19:20:48 GMT

Given the observations are not a time series I withdraw my earlier recommendations regarding Functional Data Explorer or time series analysis. Both techniques assume some structure ACROSS time between the observations...and since these observations are independent of each other...using FDE or time series is not recommended.

Re: Simulating data to generate larger data set for building models

SDF1 — Wed, 27 Feb 2019 19:28:06 GMT

Hi Peter,

Do either of those platforms allow for maintaining the long range structure? I'll still read up on them, it might give me some ideas on how to code for my own solution.

Thanks!

Re: Simulating data to generate larger data set for building models

Peter_Bartell — Wed, 27 Feb 2019 19:48:13 GMT

@SDF1 : Your phrase 'long range structure' can have many flavors, nuances, and issues attached. For example seasonality can be one form of structure. ARIMA time series methods are VERY well suited for modeling seasonality effects and long term trend or cyclical movement in a process. See the classic seriesg.jmp data table in the JMP Sample Data Directory.

My suggestion for starters is to read up on JMP/JMP Pro's capabilities within the JMP documentation. Here's a link for FDE:

https://www.jmp.com/support/help/14/functional-data-explorer.shtml

And here's a link for time series analysis:

https://www.jmp.com/support/help/14/time-series-analysis.shtml

Re: Simulating data to generate larger data set for building models

Mark_Bailey — Thu, 28 Feb 2019 11:06:15 GMT

So the data are not plotted in time order? They are not from the same source (single source)?

When you say that each measurement is independent, have you estimated the auto-correlation, even just for a lag of 1? While 'independent' might mean that the data were collected separately, I am talking about statistical independence, or lack of correlation.
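As a rough illustration of that check, here is a minimal Python sketch of a lag-1 sample autocorrelation. It is not a JMP feature or script; the function name `autocorrelation` and the two toy series are made up for the example, and the real response column would be substituted in row order.

```python
from statistics import fmean

def autocorrelation(values, lag=1):
    """Sample autocorrelation of `values` at the given lag."""
    n = len(values)
    if not 0 < lag < n:
        raise ValueError("lag must be between 1 and len(values) - 1")
    mean = fmean(values)
    denom = sum((v - mean) ** 2 for v in values)
    if denom == 0:
        raise ValueError("series is constant; autocorrelation is undefined")
    # Sum of products of deviations for pairs that are `lag` rows apart.
    numer = sum((values[i] - mean) * (values[i + lag] - mean)
                for i in range(n - lag))
    return numer / denom

# Hypothetical illustration data, not the poster's measurements.
drifting = [1, 2, 3, 4, 5, 6, 7, 8]            # neighbours resemble each other
alternating = [1, -1, 1, -1, 1, -1, 1, -1]     # neighbours oppose each other

print(autocorrelation(drifting))      # positive value
print(autocorrelation(alternating))   # negative value
```

A value near zero would be consistent with statistical independence between consecutive rows; a clearly positive value would suggest that the row order carries information even if the measurements were collected separately.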

The simulated data is what I would expect. There is a single population with a stable mean and variance. The Monte Carlo simulation includes random perturbations to both the factors and the response.

The 'structure' that you see in your data but not in the simulation is the result of other 'assignable causes.' The causes can occur randomly in time and in magnitude. The simulation must include their contribution. Do you know what they are? How they occur? How to model them?

I understand that you cannot disclose the true nature and detail of the measurements, but perhaps you could use an analogy to give us a sense of the data and what you are doing. For example, perhaps you are measuring the pressure in three torpedo tubes inside a nuclear submarine from time to time...

Assuming that we can help you to simulate the real data, what kind of analysis or modeling do you plan to use with it?

Peter is very helpful but as we learn more about your problem, I doubt that those suggestions will help. The functional data analysis is for functions, profiles, or curves. Imagine a sample of hundreds of data series and you want to model the shape and either how the shape is influenced by factors or how the shape predicts outcomes. You apparently have a single data series that you want to extend. The FDA does not predict beyond the original domain. The ARIMA model assumes equally spaced points. It can model the auto-correlation, perturbations, and seasonality to predict ahead but not far. On the other hand, a Monte Carlo simulation based on the ARIMA model is possible.
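To make the last point concrete, here is a minimal Python sketch of a Monte Carlo simulation from the simplest member of the ARIMA family, an AR(1) process, i.e. ARIMA(1,0,0). It is only an illustration under stated assumptions: the function name `simulate_ar1` and the parameter values (phi = 0.8, mean 10, noise SD 1) are invented, and in practice they would come from a model fitted to the real series.

```python
import random

def simulate_ar1(n, phi, mean, sigma, seed=None):
    """Simulate n points of x[t] = mean + phi * (x[t-1] - mean) + noise."""
    if not -1 < phi < 1:
        raise ValueError("phi must lie strictly between -1 and 1 for a stationary series")
    rng = random.Random(seed)
    # Start from the stationary distribution so there is no warm-up transient.
    stationary_sd = sigma / (1 - phi ** 2) ** 0.5
    x = rng.gauss(mean, stationary_sd)
    series = []
    for _ in range(n):
        series.append(x)
        # Each new point keeps a fraction phi of the previous deviation
        # from the mean and adds a fresh random perturbation.
        x = mean + phi * (x - mean) + rng.gauss(0, sigma)
    return series

# Assumed parameter values for illustration only.
simulated = simulate_ar1(n=20000, phi=0.8, mean=10.0, sigma=1.0, seed=1)
print(len(simulated), sum(simulated) / len(simulated))
```

Unlike a forecast, which decays toward the mean after a few steps, each simulated path keeps injecting new random shocks, so it can be made as long as needed while retaining the fitted short-range correlation.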

Re: Simulating data to generate larger data set for building models

Peter_Bartell — Thu, 28 Feb 2019 14:04:20 GMT

I agree with @Mark_Bailey's last paragraph above...and as I stated in an earlier post...from what you've shared so far...I doubt that FDE or time series analysis is the most appropriate approach for your goals.

Re: Simulating data to generate larger data set for building models

SDF1 — Fri, 01 Mar 2019 21:19:56 GMT

Hi @Peter_Bartell and @Mark_Bailey,

I really appreciate the discussion on this matter, as it's giving me some ideas on how to improve my analysis.

The data come from a process, but are not in real time. They are collected in a database, that is then later accessed. I can't guarantee that each point is truly sequential after the previous. There is a typical time spacing between points, but if there is a pause in the process, which could be hours or days, the database just picks up where it left off. Hence, it's not strictly speaking a time-series. However, in practice if the ARIMA model can duplicate a similar response as the real data, this should be sufficient (I think).

The data are from the same source. Samples are pulled from the process and a measurement done on them.

I have not yet tried estimating the autocorrelation. The Monte Carlo simulation might be a way to go in order to generate input/output data that mimics the real data. I didn't know JMP could do MC simulations. I've done those before on other problems, but not using JMP. These are things I'll have to look into.

The goal is to develop a model that can well-predict the response. With that, we could monitor just the factors (easier). If the model predicts a warning, then the more challenging efforts can be done.

I still think it would be worth reading about the FDE and ARIMA, it might spark some ideas. I hope my somewhat vague explanations make sense. Thanks for understanding the disclosure concerns.

Thanks

Re: Simulating data to generate larger data set for building models

Mark_Bailey — Sat, 02 Mar 2019 11:39:10 GMT

It is not clear that simulating more data will help. If you can simulate the data, then you have the model of the data. The 'structure' is a shift in the population (not reproduced by the Monte Carlo simulation with a constant mean).
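The difference between a constant-mean simulation and one with a shifting population can be sketched in a few lines of Python. This is only a toy mechanism, assumed for illustration: the function `simulate_with_level_shifts`, the shift probability, and the shift size are all hypothetical, and the real causes of the shifts would have to be known to model them properly.

```python
import random

def simulate_with_level_shifts(n, base_mean, noise_sd, shift_prob, shift_sd, seed=None):
    """Return (series, levels): noisy observations around a mean that
    occasionally jumps. With shift_prob = 0 this reduces to the
    constant-mean Monte Carlo case."""
    rng = random.Random(seed)
    level = base_mean
    series, levels = [], []
    for _ in range(n):
        # With a small probability, an 'assignable cause' moves the process level.
        if rng.random() < shift_prob:
            level += rng.gauss(0, shift_sd)
        levels.append(level)
        series.append(level + rng.gauss(0, noise_sd))
    return series, levels

# Assumed values for illustration only.
flat, flat_levels = simulate_with_level_shifts(500, 10.0, 1.0, 0.0, 5.0, seed=3)
shifted, shifted_levels = simulate_with_level_shifts(500, 10.0, 1.0, 0.05, 5.0, seed=3)
print(len(set(flat_levels)), len(set(shifted_levels)))
```

Writing the shift mechanism down is the hard part: once it is specified well enough to simulate, it is already a model of the data.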

Are you simultaneously measuring the candidates for predictor variables (factors) with this response? Have you explored correlations, associations, and possible models?

Re: Simulating data to generate larger data set for building models

SDF1 — Thu, 07 Mar 2019 16:53:28 GMT

Hi @Mark_Bailey,

Ultimately, I'd like to build a strong model that predicts future values given the incoming factors. I have looked into the ARIMA and FDE. I agree the FDE is probably not a suitable approach. However, I think the ARIMA modeling might provide a pathway for generating new data for training the model.

Several factors are measured more or less in parallel with the response. All measurements use up source material, but there is a large quantity that is sampled to be able to run the other measurements. We have identified a set of factors that have the strongest predictive capability. I have used those factors and the response to check several different models using generalized regression methods, boosted tree, bootstrap forest, neural nets, and other partitioning methods. Comparing the models, I find that the bootstrap forest consistently gives better modeling results than the others. I can improve upon this model a little further by holding back data for testing as well as running the model with a tuning table to optimize the fit parameters.

What I still find, though, is that the model doesn't really capture the highs and lows of the data as much as we believe it should (or could). It does capture some of the structure, but it's not so great at getting those highs and lows. Hence, I'd like to simulate a larger data set that includes more of those events in order to build a better model. And since it takes a long time to actually build up a data set of real values, simulation of data to train a model would be helpful.

I do think that the ARIMA platform can be of help in this regard. I've tested it out to generate predicted values of the factors and response. This generates a new set of values for the factors and responses that mimic the real data. I can then concatenate the simulated and real data into a new data table to generate a new data set that is twice as big.

Unfortunately, the ARIMA modeling does not do a good job of predicting many new values for the factors or response, so I can't really use it to generate forecast periods. Still, I think this alternative approach to build a simulated data set might prove useful.

Re: Simulating data to generate larger data set for building models

Mark_Bailey — Thu, 07 Mar 2019 17:43:42 GMT

First of all, ARIMA and other time series models assume short-term perturbations and some periodic trends, so they are not good for long-range predictions.

The 'lack of fit' issue with your present models (prediction lacks the extreme excursions of the observed response) might not be aided by more data. You might be lacking sufficient predictors. One direction to reduce bias is to include more terms in the basis expansion using the current feature set. You mention several modeling techniques but you don't say anything about the complexity of these models that you explored. Higher complexity, with the risk of over-fitting addressed by cross-validation, might solve the bias problem. The other direction to reduce bias is to search for missing features that would finally explain the departures from the observed response.

Re: Simulating data to generate larger data set for building models

SDF1 — Thu, 07 Mar 2019 20:50:59 GMT

Hi @Mark_Bailey,

There definitely might be some deficiencies in adequate predictors. However, based on the other analyses we've done, the set that we have definitely appear to be better suited than many of the others we've looked into.

With regards to complexity and cross-validation, I definitely take that into account. I generate a train/validation/test column to help with the risk of overfitting. The bootstrap forest modeling generates a rather complex decision-tree prediction formula that is much more extensive than, e.g. the generalized regression models or neural net modeling. It performs better too when comparing the models to one another. Typically, I find that after running the bootstrap forest platform with a tuning matrix, that there are on the order of 60+ trees, with ~4 terms sampled per split. The number of splits per term can vary, but is often in the 1,000's to 3,000's range.

Re: Simulating data to generate larger data set for building models

GM — Sat, 09 Mar 2019 16:01:08 GMT

Based on the discussion, it sounds like you have an outcome variable that you want to model with other predictor variables, where time, or the order in which the data are generated, is not a predictor variable. You may want to simulate your NxP (N = rows, P = variables (y and x's)) matrix using a multivariate normal distribution. This link provides JSL to simulate multivariate normal distributions: http://www.jmp.com/support/notes/36/140.html You need to modify it based on your estimated covariance matrix, but it should work if your data follow a multivariate normal distribution.
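For readers who want to see the idea outside JSL, here is a minimal Python sketch of multivariate normal simulation via a Cholesky factor of the covariance matrix. It is not the script from the linked note; the function names and the 2x2 covariance matrix are hypothetical, and the real means and covariances would be estimated from the original table.

```python
import random

def cholesky(cov):
    """Lower-triangular L with L * L^T == cov (cov must be symmetric positive definite)."""
    n = len(cov)
    lower = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(lower[i][k] * lower[j][k] for k in range(j))
            if i == j:
                d = cov[i][i] - s
                if d <= 0:
                    raise ValueError("covariance matrix is not positive definite")
                lower[i][j] = d ** 0.5
            else:
                lower[i][j] = (cov[i][j] - s) / lower[j][j]
    return lower

def simulate_mvn(n_rows, means, cov, seed=None):
    """Draw n_rows rows from a multivariate normal with the given means and covariance."""
    rng = random.Random(seed)
    lower = cholesky(cov)
    p = len(means)
    rows = []
    for _ in range(n_rows):
        z = [rng.gauss(0, 1) for _ in range(p)]   # independent standard normals
        # Multiplying by L introduces the desired correlations.
        rows.append([means[i] + sum(lower[i][k] * z[k] for k in range(i + 1))
                     for i in range(p)])
    return rows

# Hypothetical means and covariance for two variables (y and one x).
example_cov = [[4.0, 2.0], [2.0, 3.0]]
rows = simulate_mvn(20000, [1.0, -1.0], example_cov, seed=0)
print(len(rows), rows[0])
```

This preserves the means, variances, and pairwise linear correlations of the columns, but nothing else: any nonlinearity, clustering, or shift in the original data is not reproduced.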

Re: Simulating data to generate larger data set for building models

SDF1 — Mon, 11 Mar 2019 20:07:36 GMT

Hi @GM,

Thanks for the input and suggestion. You are correct about the intended goal and description of the multivariate data. The order the data is generated does not appear to be a good predictor. There are other properties that are measured which, when used collectively, provide a reasonable prediction of the "y" I want to model/predict. The difficulty comes when trying to train the model to be a better fit. I'd like to do this by simulating data, but this is proving more challenging than I anticipated.

The MVN approach certainly does do a better job at preserving some of the multivariate correlations of the data than the simulate option does in profiler. That option just makes an average with a given standard deviation.

Unfortunately when I try to build a model using the MVN script and generated data, it performs much worse than a similar model built off of just the original data.

So far, I find that the ARIMA model does a better job at capturing the short-term swings in the data, as well as some of the longer range structure. It also does a poor job of forecasting, but what I'm thinking is to use it to generate new data that is added to the original data to build a larger data set and then train a model with that larger data set.

What I think I'll try now is to "Save Columns" from the ARIMA report, extract out the predicted values and then concatenate that with the original data table. Then, I'd like to randomize the rows in the new data table and build a model with that larger data set. I guess another option might be to generate multiple validation columns, run the same model with different cross validation rows and average the resulting models.
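The concatenate-and-randomize step could look roughly like the Python sketch below. It is an illustration, not JSL: the function `combine_and_shuffle`, the column names `x` and `y`, and the added `source` column are all assumptions. Tagging each row with its origin is a design choice made here so that the simulated rows can later be kept out of the test set, which would otherwise partly measure how well the model reproduces the simulator rather than the real process.

```python
import random

def combine_and_shuffle(original_rows, simulated_rows, seed=None):
    """Concatenate two lists of row dicts, tag each row with its origin,
    and return the rows in random order. The inputs are not modified."""
    rng = random.Random(seed)
    combined = [dict(row, source="original") for row in original_rows]
    combined += [dict(row, source="simulated") for row in simulated_rows]
    rng.shuffle(combined)
    return combined

# Hypothetical rows standing in for the real and ARIMA-predicted tables.
original = [{"x": 1.0, "y": 2.1}, {"x": 2.0, "y": 3.9}]
simulated = [{"x": 1.0, "y": 2.0}, {"x": 2.0, "y": 4.0}]

table = combine_and_shuffle(original, simulated, seed=0)
real_only = [row for row in table if row["source"] == "original"]
print(len(table), len(real_only))
```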

I appreciate all the contributions to this discussion. Although I haven't found an appropriate solution yet, I'm learning a lot and coming up with some new ideas to try.

Thanks!

Re: Simulating data to generate larger data set for building models

SDF1 — Thu, 21 Mar 2019 20:44:09 GMT

Hi @Mark_Bailey and @Peter_Bartell,

Just to provide some feedback, and continue the discussion on this topic, I was able to get some reasonable model data using the ARIMA platform. It does a good job of recreating a similar data set, although not good at forecasting, but that's OK. Applying the prediction formula from this model works fairly well with my actual data set, so it does provide a potentially useful path forward.

I'm not convinced this is ultimately how I wish to proceed, so I want to compare it to the other simulation options within JMP. I know that the stochastic simulation from the profiler doesn't help for reasons we've discussed before. Additionally, it's not the best approach for tuning the model for improved fitting, which I would like to do.

For my specific modeling, I'm using a bootstrap forest model and would like to improve it by tuning the model. I have done so in part with a tuning table. There are some limitations in this regard though, particularly with the number of runs I can afford (memory-wise) in the tuning table. I typically run my modeling with as many runs in my tuning table that I can afford. This narrows down on the number of trees in the forest, number of terms sampled per split, bootstrap samples, minimum splits per tree, and minimum size split. This might help to optimize the parameters of the model platform for finding a solution, but it doesn't necessarily mean an optimized model based on the data I have.

When I build the model, I create a validation column that I break into a training and test set, stratified by my response column. I then build the model on only the training data with a certain % held back for validation of the model. The model output depends on the validation column I choose (I've made several). A typical output is seen below (sorry I can't share what the info is for the response or factor columns). The details of the bootstrap forest specifications change depending on how the original training and validation data was stratified.
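On the validation column itself, a stratified train/validation/test assignment can be sketched in Python as follows. This is not how JMP builds its validation column; the function name, the 60/20/20 split, and the choice of five response bins are assumptions made for the example.

```python
import random

def stratified_validation_column(response, fractions=(0.6, 0.2, 0.2), n_bins=5, seed=None):
    """Return one label per row ('train', 'validation', or 'test'),
    assigned within bins of the sorted response so that each set
    covers the low, middle, and high ends of the response."""
    if not response:
        return []
    if n_bins < 1:
        raise ValueError("n_bins must be at least 1")
    rng = random.Random(seed)
    order = sorted(range(len(response)), key=lambda i: response[i])
    labels = [None] * len(response)
    bin_size = -(-len(order) // n_bins)  # ceiling division
    for start in range(0, len(order), bin_size):
        bin_rows = order[start:start + bin_size]
        rng.shuffle(bin_rows)            # random assignment within the bin
        n_train = round(fractions[0] * len(bin_rows))
        n_valid = round(fractions[1] * len(bin_rows))
        for pos, row in enumerate(bin_rows):
            if pos < n_train:
                labels[row] = "train"
            elif pos < n_train + n_valid:
                labels[row] = "validation"
            else:
                labels[row] = "test"
    return labels

# Hypothetical response values for illustration.
response = [float(i) for i in range(100)]
labels = stratified_validation_column(response, seed=0)
print(labels.count("train"), labels.count("validation"), labels.count("test"))
```

Changing `seed` gives a different column with the same proportions, which is one way to see how much the fitted model depends on the particular split.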

What I'm interested in doing now is tuning the model for an improved fit possibly by using the simulate option for a model statistic. I could do this by changing the N for the training and validation sets, or by RMSE of the individual trees, etc. I know that I won't be able to use this approach to change the specifications of the bootstrap forest (the tuning table does that), but I would like to use it in such a way that I can tune the model, e.g. the number of samples used for training and validation, or the number of splits, in order to maximize r^2.

To summarize: I would like to use a simulation method to help train the data. I know the simulate platform from the profiler doesn't work, which leaves me with ARIMA (possible) or the bootstrap simulate option for a model statistic. I'm not sure how to best use the latter simulation method for tuning the model.
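For reference, the general idea behind bootstrapping a model statistic can be sketched in Python as below. This is a generic illustration and not the JMP Pro Simulate implementation; the function names and the small observed/predicted lists are hypothetical. It resamples rows with replacement and recomputes R^2 each time, which shows how variable the statistic is rather than tuning the model.

```python
import random
from statistics import fmean

def r_squared(observed, predicted):
    """Coefficient of determination, 1 - SS_residual / SS_total."""
    mean_obs = fmean(observed)
    ss_total = sum((y - mean_obs) ** 2 for y in observed)
    if ss_total == 0:
        raise ValueError("observed values are constant; R^2 is undefined")
    ss_resid = sum((y - p) ** 2 for y, p in zip(observed, predicted))
    return 1 - ss_resid / ss_total

def bootstrap_r_squared(observed, predicted, n_boot=1000, seed=None):
    """Sorted list of R^2 values from n_boot resamples of the rows."""
    if len(set(observed)) < 2:
        raise ValueError("need at least two distinct observed values")
    rng = random.Random(seed)
    n = len(observed)
    stats = []
    while len(stats) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]   # rows drawn with replacement
        try:
            stats.append(r_squared([observed[i] for i in idx],
                                   [predicted[i] for i in idx]))
        except ValueError:
            continue   # skip the rare resample with a constant response
    return sorted(stats)

# Hypothetical observed and predicted responses.
observed = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
predicted = [1.2, 1.8, 3.3, 3.9, 5.4, 5.7, 7.1, 8.2]

stats = bootstrap_r_squared(observed, predicted, n_boot=500, seed=2)
# Approximate 95% percentile interval for R^2.
print(stats[int(0.025 * len(stats))], stats[int(0.975 * len(stats))])
```

A wide interval would suggest that differences in R^2 between validation columns are within sampling noise, in which case further tuning against that statistic is unlikely to be meaningful.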

Any thoughts or feedback you might have on tuning models through simulations would be appreciated.

Thanks in advance for your time!