Hello,
I've got an extensive dataset and I'd like to try a consensus modelling approach to pick out the likely patterns within it. Since this is going to be a fairly involved process, I have a few logistical questions about the scripting. First, I'll break down the workflow the modelling is meant to follow.
Workflow
1. Subset the dataset by random sampling at a rate of 0.1.
2. Run factor analysis for "y" columns.
3. Using the eigenvalues from the factor analysis, run K-means clustering once for each number of clusters in the following range:
- from 3 up to the number of factors with eigenvalues above 1.
4. Collect all cluster means into a data table, excluding standard deviations.
5. Concatenate step 4 with the subset of random data sampled in step 1.
6. Create an indicator column that distinguishes the sampled rows from the K-means cluster means.
7. Loop to run steps 1 through 6 n times.
8. Create an indicator column to distinguish each loop iteration in the final concatenated table.
All of this can be done one step at a time, but what I'm trying to do is automate the process over many iterations and create a dashboard that shows how the results change from one random sample to the next.
Any input on how to chain these steps together in code would be appreciated. The loop is what is giving me grief, as all the other steps (with the exception of step 3) can be saved to the data table and the code extracted.
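As a rough illustration of how the eight steps could nest, here is a minimal sketch in Python (pandas, NumPy, scikit-learn). It is not tied to any particular tool and rests on several assumptions: the eigenvalues are taken from the correlation matrix of the "y" columns as a stand-in for the factor analysis output, only k = 3 is run when fewer than three eigenvalues exceed 1, and the names `run_iteration`, `run_consensus`, `source`, `k` and `iteration` are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans


def run_iteration(df, y_cols, rate=0.1, seed=0):
    """Steps 1-6 for a single random subset."""
    # Step 1: random subset at the given sampling rate.
    sample = df.sample(frac=rate, random_state=seed)
    y = sample[y_cols]

    # Step 2 (assumption): eigenvalues of the correlation matrix of the
    # "y" columns stand in for the factor-analysis eigenvalues.
    eigenvalues = np.linalg.eigvalsh(y.corr().to_numpy())
    n_factors = int((eigenvalues > 1).sum())

    # Step 6: indicator column; k=0 marks the sampled rows themselves.
    parts = [sample.assign(source="sample", k=0)]

    # Step 3: k-means for k = 3 .. n_factors (assumption: if fewer than
    # three eigenvalues exceed 1, only k = 3 is run).
    for k in range(3, max(3, n_factors) + 1):
        km = KMeans(n_clusters=k, n_init=10, random_state=seed).fit(y.to_numpy())
        # Step 4: keep only the cluster means (no standard deviations).
        means = pd.DataFrame(km.cluster_centers_, columns=y_cols)
        parts.append(means.assign(source="cluster_mean", k=k))

    # Step 5: stack the cluster means under the sampled rows.
    return pd.concat(parts, ignore_index=True)


def run_consensus(df, y_cols, n_iter=10, rate=0.1, seed=0):
    """Steps 7-8: repeat the iteration and tag each pass."""
    runs = [
        # A different seed per pass gives a different random subset.
        run_iteration(df, y_cols, rate, seed + i).assign(iteration=i + 1)
        for i in range(n_iter)
    ]
    return pd.concat(runs, ignore_index=True)
```

Calling `run_consensus(df, y_cols, n_iter=100)` would return one long table in which `iteration` identifies the pass, `source` separates sampled rows from cluster means, and `k` records the number of clusters, which is the shape a dashboard filter would typically need.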
M. Dereviankin