
ABCs of Structural Equations Models (2020-US-45MP-590)

Level: Intermediate


Laura Castro-Schilo, JMP Research Statistician Developer, SAS Institute, JMP Division
James Koepfler, Research Statistician Tester, SAS Institute, JMP Division


This presentation provides a detailed introduction to Structural Equation Modeling (SEM) by covering key foundational concepts that enable analysts, from all backgrounds, to use this statistical technique. We start with comparisons to regression analysis to facilitate understanding of the SEM framework. We show how to leverage observed variables to estimate latent variables, account for measurement error, improve future measurement and improve estimates of linear models. Moreover, we emphasize key questions analysts can tackle with SEM and show how to answer those questions with examples using real data. Attendees will learn how to perform path analysis and confirmatory factor analysis, assess model fit, compare alternative models and interpret all the results provided in the SEM platform of JMP Pro.



The slides and supplemental materials from this presentation are available for download here.

Check out the list of resources at the end of this blog post to learn more about SEM.

The article and data used for the examples of this presentation are available at this link.



Auto-generated transcript...




Laura Castro-Schilo Hello everyone. Welcome. This session is all about the ABCs of Structural Equation modeling and what I'm going to try is to leave you with enough tools to be able to feel comfortable
  to specify models and interpret the results of models fit using the structural equation modeling platform in JMP Pro.
  Now what we're going to do is start by giving you an introduction of what structural equation modeling is and particularly drawing on the connections it has to factor analysis and regression analysis.
  And then we're going to talk a little bit about how path diagrams are used in SEM and their important role within this modeling framework.
  I'm going to try to keep that intro short so that we can really spend time on our hands-on examples. So after the intro, I'm going to introduce the data that I'm going to be using for demonstration. It turns out these data are
  about perceptions of threats of the Covid 19 virus. So after introducing those data,
  we're going to start looking at how to specify models and interpret their results within the platform, specifically by answering
  a few questions. Now, these questions are going to allow us to touch on two very important techniques that can be done with SEM. One is confirmatory factor analysis and the other is multivariate regression.
  And to wrap it all up, I'm going to show you just a brief
  model in which we bring together both the confirmatory factor model and regression models, and that way you can really see the potential that SEM has for using it with your own data for your own work.
  Alright, so what is SEM? Structural equation modeling is a framework where factor analysis and regression analysis come together.
  And from the factor analysis side, what we're able to get is the ability to measure things that we do not observe directly.
  And on the regression side, we're able to examine relations across variables, whether they're observed or unobserved. So when you bring those two tools together, we end up with a very flexible framework, which is SEM, where we can fit a number of different models.
  Path diagrams are a unique tool within SEM, because all statistical structural equation models, the actual models, can be depicted through a diagram. And so we have to learn just some
  notation as to how those diagrams are drawn. And so squares represent observed variables,
  circles represent latent variables, variances or covariances are represented with double-headed arrows, and regressions and loadings are
  represented with one-headed arrows. Now, as a side note, there's also a triangle that is used for path diagrams.
  But that's outside the scope of what we're going to talk about today. The triangle is used to represent means and intercepts. And unfortunately, we just don't have enough time to talk about all of the awesome things we can do with means as well.
  Alright. But I also want to show you just some fundamental...kind of the building blocks of SEM. The first is a simple regression, where
  x, which is in a box, is predicting y,
  which is also in a box. So we know those are observed variables and each of them have double-headed arrows that start and end on themselves, meaning those are variances.
  For x, this arrow is simply its variance, but for y, the double-headed arrow represents a residual variance.
  Now, of course, you might know that in SEM, an outcome can be both an outcome and a predictor. So you can imagine having another variable z that y could also predict, so we can put together as many regressions as we are interested in within one model.
  The second basic block for building SEMs are confirmatory factor models, and the most basic example of that is shown right here, where we're specifying one factor, or one latent variable, which is unobserved but
  is shown here with arrows pointing to w, x, and y because the latent variable is thought to cause the common variability we see across w, x, and y.
  Now, this right here is called a confirmatory factor model; it's only one factor. And I think it's really important to understand the distinction between a factor from the factor analytic
  perspective and principal components, or principal component analysis, which are sometimes easy to confuse.
  So I'll take a quick tangent to show you what's different about a factor from a factor analytic perspective and from PCA.
  So here these squares are meant to represent observed variables, the things that we're measuring, and I colored here in blue
  different amounts from each observed variable, which represent the variance that those variables are intended to measure. So it's kind of like the signal;
  it's the stuff we're really interested in. And then we have these gray areas, which represent a proportion of variance that comes from other sources. It can be systematic variance, but it's simply variance that is not
  what we want to pick up from our measuring instrument. So, it contains
  sources of variance that are unique to each of these variables, and it also contains measurement error.
  So what's the difference between factor analysis and PCA is that in factor analysis, a latent variable is going to capture only the common variance that
  exists across all of these observed variables, and that is the part that the latent variable accounts for.
  And that is in contrast to PCA, where a principal component represents the maximal amount of variance that can be explained in the dimensions of the data.
  And so you can see that the principal component is going to pick up as much variance as it can explain
  and that means that there's going to be an amalgamation of variance due to what we intended to measure and perhaps other sources of variance as well. So this is a very important distinction,
  because when we want to measure unobserved variables, factor analysis is indeed a better choice; if, on the other hand, our goal truly is dimension reduction, then PCA is an ideal tool for that.
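  The contrast just described can be made concrete with a small numeric sketch (hypothetical simulated data, not from the talk): a first principal component soaks up as much total variance as possible, common and unique sources alike.

```python
import numpy as np

# Hypothetical data: 200 observations of three indicators (think w, x, y),
# each built from a shared "signal" plus item-specific noise.
rng = np.random.default_rng(0)
common = rng.normal(size=(200, 1))   # common variance (the "blue" part)
unique = rng.normal(size=(200, 3))   # unique variance + measurement error (the "gray" part)
data = common + unique

# The first principal component is the eigenvector of the sample covariance
# matrix with the largest eigenvalue; it maximizes total explained variance,
# mixing the common and unique sources together.
cov = np.cov(data, rowvar=False)
eigvals, _ = np.linalg.eigh(cov)     # eigh returns eigenvalues in ascending order
pc1_share = eigvals[-1] / eigvals.sum()
print(f"PC1 explains {pc1_share:.0%} of total variance")
```

  A factor model, by contrast, would attribute only the covariance among the three columns (driven here by `common`) to the latent variable, leaving the unique portions in the residuals.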
  Also notice here that the arrows are pointing in different directions. And that's because in factor analysis, there really is an underlying assumption that the unobserved variable is causing the variability we observe.
  And so that is not the case in PCA, so you can see that distinction here from the diagram. So anytime we talk about
  a factor or a latent variable in SEM, we're most likely talking about a latent variable from this perspective of factor analysis. Now here's a
  large structural equation model where I have put together a number of different elements from those building blocks I showed before. We see numbers because we've estimated this model.
  And you see that there's observed variables that are predicting other variables. We have some sequential relations, meaning that
  one variable leads to another, which then leads to another. This is also an interesting example because we have a latent variable here that's being predicted by some observed variable. But this latent variable also
  in turn, predicts other variables. And so this diagram illustrates nicely a number of reasons we can have for using SEM, including the fact that we have unobserved variables that we want to model within a larger
  context. We want to perhaps account for measurement error. And that's important because latent variables are purged of measurement error because they only account for the variance that's in common across their indicators.
  We...if you also have the need to study sequential relations across variables, whether they're observed or unobserved, SEM is an excellent tool for that.
  And lastly, if you have missing data, SEM can also be really helpful even if you just have a regression, because
  all of the data that are available to you will be used in SEM during estimation, at least within the algorithms that we have in JMP Pro.
  So that...those are really great reasons for using SEM. Now I want to use this diagram as well to introduce some important terminology that you'll for sure come across multiple times if you decide that SEM is a
  a tool that you are going to use in your own work. So we talked about observed variables. Those are also called manifest variables in the SEM jargon.
  There's latent variables. There are also variables called exogenous. In this example, there's only two of them.
  And those are variables that only predict other variables and they are in contrast to endogenous variables, which actually have variables predicting them. So here, all of these other variables are endogenous.
  And the manifest variables that latent variables point to, that they predict, are also called latent variable indicators.
  And lastly, we talked about the unique factor variance from a factor analytic perspective. And those are these residual variances from a factor,
  from a latent variable, which is variance that is not explained by the latent variable, and represents both the combination of systematic variance that's unique to that variable plus measurement error.
  All right.
  I have found that in order to understand structural equation modeling in a bit easier way, it's important to shift our focus into realizing that the data that we're really
  modeling under the hood is the covariance structure of our data. We also model the means, the means structure, but again, that's outside the scope of the presentation today.
  But it's important to think about this because it has implications for what we think our data are.
  You know, we're used to seeing our data tables, where we have rows and our variables are columns, and, yes, that is...
  these data can be used to launch our SEM platform. However, our algorithms, deep down, are actually analyzing the covariance structure of those variables. So when you think about the data, these are really the data that are
  being modeled under the hood. That also has implications for when we think about residuals, because residuals in SEM are with respect to variances and covariances of
  you know, those that we estimate, in contrast to those that we have in the sample. So residuals, again, we need a little bit of a shift in focus to what we're used to, from other standard statistical techniques to really
  wrap our heads around what structural equation models are. And things...concepts such as degrees of freedom are also going to be degrees of freedom with respect to the covariance
  matrix. And so once we make this shift in focus, I think it's a lot easier to understand structural equation models.
  Now I want to go over a brief example here of how is it that SEM estimation works. And so usually what we do is we start by specifying a model and the most
  exciting thing about JMP Pro and our SEM platform is that we can specify that model directly through a path diagram, which makes it so much more intuitive.
  And so that path diagram, as we're drawing it, what we're doing is we're specifying a statistical model that implies the
  structure of the covariance matrix of the data. So the model implies that covariance structure, but then of course we also have access to the sample covariance matrix.
  And so what happens during estimation is that we try to match the values from the sample covariance as close as possible, given what the model is telling us the relations among the variables are.
  And once we have our model estimates, then we use those to get an actual
  model-implied covariance matrix that we can compare against the sample covariance matrix. So, by looking at the difference between those two matrices,
  we are able to obtain residuals, which allows us to get a sense of how good our models fit or don't fit. So in a nutshell, that is how structural equation modeling works.
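  As a rough numeric sketch of that idea (with made-up loadings and a made-up sample matrix, not the study data), the residuals in SEM are a difference of covariance matrices rather than case-level errors:

```python
import numpy as np

# Hypothetical one-factor model for three indicators: the model-implied
# covariance has lam_i * lam_j off the diagonal and lam_i**2 + psi_i on it.
lam = np.array([1.0, 0.8, 0.9])    # assumed factor loadings
psi = np.array([0.5, 0.6, 0.4])    # assumed unique (residual) variances
implied = np.outer(lam, lam) + np.diag(psi)   # model-implied covariance matrix

# Hypothetical sample covariance matrix for the same three variables.
sample = np.array([[1.55, 0.78, 0.92],
                   [0.78, 1.20, 0.70],
                   [0.92, 0.70, 1.22]])

# Residuals compare the two matrices; small values mean the model-implied
# structure reproduces the sample covariances well.
residuals = sample - implied
print(np.round(residuals, 2))
```

  Estimation, in essence, searches for the loadings and variances that make this residual matrix as close to zero as possible.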
  Now we are done with the intro. And so I want to introduce to you, tell you a little bit of context for the data I'm going to be using for our demo.
  These data come from a very recently published article in the journal Social Psychological and Personality Science.
  And the authors in this paper wanted to answer a straightforward question. They said, "How do perceived threats of Covid 19 impact our well being and public health behaviors?" Now what's really interesting about this question is that perceived threats of
  Covid 19 is a very abstract concept. It is a construct for which we don't have an instrument to measure it. It's something we do not observe directly.
  And that is why in this article, they had to figure out first, how to measure those perceived threats.
  And the article focused on two specific types of threats. They call the first realistic threats, because they're related to physical or financial safety.
  And also symbolic threats, which are those threats that are posed on one's sociocultural identity.
  And so they actually came up with this final threat scale, they went through a very rigorous process to develop this survey, this scale to measure realistic threat and symbolic threat.
  Here you can see that people had to answer how much of a threat, if any, is the coronavirus outbreak for your personal health,
  you know, the US economy, what it means to be an American, American values and traditions. So basically, these questions are focusing on two different aspects of the
  threat of the virus, one that is they labeled realistic, because it deals with personal and financial health issues.
  And then the symbolic threat is more about that social cultural component. So you can see all of those questions here.
  And we got the data from one of the three studies that they did. Those data come from 550 participants who answered all of these questions in addition to a number of other surveys, and so we'll be able to use those data in our demo.
  And we're going to answer some very specific questions. The very first one is, how do we go about measuring perceptions of Covid 19 threats? There's two types of threats we're interested in, and the question is, how do we do this, given that it's such an abstract concept?
  And this will take us to talk about confirmatory factor analysis and ways in which we can assess the validity and reliability of our surveys.
  One thing we're not going to talk about though is the very first step, which is exploratory factor analysis. That is something that we do outside of SEM
  and it is something that should be done as the very first step to develop a new scale or a new survey, but here we're going to pick up from the confirmatory factor analysis step.
  A second question is do perceptions of Covid 19 threats predict well being and public health behaviors? And this will take us to talk more about regression concepts.
  And lastly, are effects of each type of threat on outcomes equal? And this is where we're going to learn about a very unique
  helpful feature of SEM, which is allowing us to impose equality constraints on different effects within our models and being able to do systematic model comparisons to answer these types of questions.
  Okay, so it's time for our demo. So let me go ahead and show you the data, which I've already have open right here.
  Notice my data table has a ton of columns, because there's a number of different surveys that participants
  responded to, but the first 10 columns here are those questions, or the answers to the questions I showed you in that survey.
  And so what we're going to do is go to Analyze > Multivariate Methods > Structural Equation Models, and we're going to use those
  answers from the survey. All of the 10 items, we're going to click model variables so that we can launch our platform with those data.
  Now, right now I have a data table that has one observation per row. That's what the wide data format is, and so that's what I'm going to use.
  But notice there's another tab for summarize data. So if you have only the correlation or covariance matrix, you can now input that in...well, you will be able to do it in JMP 16,
  JMP Pro 16 and so that's another option because, remember that at the end of the day, what we're really doing here is modeling covariance structures. So you can use summarize data to launch the platform.
  Alright, so let's go ahead and click OK. And the first thing we see is this model specification window which allows us to do all sorts of fun things.
  Let's see, on the far right here we have the diagram and notice our diagram has a default specification. So our variables
  all have double-headed arrows, which means they all have a variance.
  They also have a mean, but notice, if I right-click on the canvas here, I get some options to show the means or intercepts. So
  again, this is outside the scope of today, so I'm going to keep those hidden but do know that the default model in the platform has variances and means for every variable that we observe.
  The list tab contains the same information as the diagram, but in a list format and it will split up your
  paths or effects based on their type. We also have a Status tab, which gives you a ton of information about the model that you have at that very moment. So right now, it tells us,
  you know, the degrees of freedom we have, given that this is the model we have specified here is just the default model. And it also tells us, you know, data details and other useful things.
  Notice that this little check mark here changes if there is a problem with your model. So as you're specifying your model, if we encounter something that looks problematic or an error, this tab will change in color and type
  and so it will be helpful to hopefully help you solve any type of issues with the specification of the model.
  Okay, so on the far left, we have an area for having a model name. And we also have from and to lists. And so this makes it very easy to select variables here,
  in the from role and then in the to role, wherever those might be. And we can connect them with single-headed arrows or double-headed arrows, which we know represent regressions or loadings, or variances or covariances.
  Now for our case right now, we really want to figure out how do we measure this unobserved construct of perceptions of Covid 19
  threat. And I know that the first five items that I have here are the answers to questions that the authors labeled realistic threats. So I'm going to select those
  variables and now here we're going to change the default name of latent one to realistic because that's going to be the realistic threat latent variable. So I click the plus button. And notice, we automatically give you
  this confirmatory factor model with one factor for realistic threat. An interesting observation here is that there is a 1 on this first loading
  that indicates that this path, this effect of the latent variable on the first observed variable is fixed to one.
  And we need to have that, because otherwise our model would not be identified.
  So we will, by default, fix your first loading to one in order to identify the model and be able to get all of your estimates.
  An alternative would be to fix the variance of the latent variable to one, which would also help identify the model, but it's a matter of choice which you do.
  Alright, so we have a model for realistic threat. And now I'm going to select those that are symbolic threat, and I will call this symbolic, and we're going to go ahead and add
  our latent variable for that. I changed my resolution, and so now we are seeing a little bit less than what I was expecting. But here we go. There's our model. Now
  we might want to
  specify, and this is actually very important,
  realistic and symbolic threats. We expect those to vary, to co-vary and therefore, we would select them in our from and to list and click on our double-headed arrow to add that covariance.
  And so notice here, this is our full two factor confirmatory factor analysis. So we can name our model and we're ready to run. So let's go ahead and click Run.
  And we have all of our estimates very quickly. Everything is here. Now
  what I want to draw your attention to though, is the model comparison table. And the reason is because we want to make sure that our model fits well before we jump into trying to interpret the results.
  So let's talk about what shows up in this table. First let's note that there's two models here that we did not fit but the platform fits them by default upon launching.
  And we use those as sort of a reference, a baseline that we can use to compare a model against. The unrestricted model, I will show you here
  what it is: if we had every single variable covarying with each other, that right there is the unrestricted model. In other words, it is a baseline for what is the best we can do in terms of fit.
  Now the chi square statistic represents the amount of misfit in a model. And because we are estimating every possible variance and covariance without any restrictions here,
  that's why the chi square is zero. Now, we also have zero degrees of freedom because we're estimating everything. And so it's a perfectly fitting model, and it
  serves as a baseline for understanding what's the very best we can do.
  All right, then the independence model is actually the model that we have here as a default when we first launched the platform. So that is a baseline for the worst, maybe not the worst, but a pretty bad model.
  It's one where the variables are completely uncorrelated with each other. And you can see indeed that the chi square statistic jumps to about 2000 units. But of course, we now have 45 degrees of freedom because we're not estimating much at all in this model.
  And then lastly we have our own two factor confirmatory factor model and we also see that the chi square is large, is 147
  with 34 degrees of freedom. It's a lot smaller than the independence model, so we're doing better, thankfully, but it's still a significant chi square, suggesting that there's significant misfit in the data.
  Now, here's one big challenge with SEM back in the day: people realized that the chi square is impacted, it's influenced, by the sample size. And here we have 550 observations. So it's very likely that even well-fitting models are
  going to turn out to be significant because of the large sample size. So what has happened is that fit indices that are unique to SEM have been developed to allow us to assess model fit, irrespective of the chi square, and that's where these baseline models come in.
  The first one right here is called the comparative fit index. It actually ranges from zero to one. You can see here that one is the value for a perfectly fitting model and zero is the value for the really poor fitting model, the independence model.
  I keep sorting this by accident. Okay, so what this index for our own model, .9395, means is the proportion of how much better we are doing with our model
  in comparison to the independence model. So we're about, you know, 94% better than the independence model, which is pretty good.
  The guidelines are that CFI of .9 or greater is acceptable. We usually want it to be as close to one as possible and .95 is ideal.
  We also then have the RMSEA, the root mean square error of approximation, which represents the amount of misfit per degree of freedom.
  And so we want this to be very small. It also ranges from zero to one. And you see here, our model has .07, about .08,
  and generally speaking, .1 or lower is adequate, and that's what we want to see. And then on the right, we get some confidence limits around this statistic. So
  what this suggests is that indeed the model is an acceptable-fitting model, and therefore we're going to go ahead and try and interpret it.
  But it's really important to assess model fit before really getting into the details of the model because we're not going to learn very much, or at least not a lot of useful information if our model doesn't fit well from the get go.
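  Both fit indices are simple functions of the chi-square statistics just discussed. As a sketch, here they are computed with values approximating the ones reported in the talk (chi-square 147 with 34 degrees of freedom, an independence-model chi-square of about 2000 with 45 degrees of freedom, and N = 550):

```python
import math

def cfi(chi2_m, df_m, chi2_b, df_b):
    """Comparative fit index: proportional improvement over the baseline
    (independence) model, based on noncentrality (chi-square minus df)."""
    num = max(chi2_m - df_m, 0.0)
    den = max(chi2_b - df_b, chi2_m - df_m, 0.0)
    return 1.0 - num / den

def rmsea(chi2, df, n):
    """Root mean square error of approximation: misfit per degree of freedom,
    adjusted for sample size."""
    return math.sqrt(max(chi2 - df, 0.0) / (df * (n - 1)))

print(round(cfi(147, 34, 2000, 45), 3))   # 0.942 -- above the .90 guideline
print(round(rmsea(147, 34, 550), 3))      # 0.078 -- below the .10 guideline
```

  Note how the baseline chi-square enters the CFI directly: that is why the platform fits the independence model by default.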
  Alright, so, because this is a confirmatory factor analysis, we're going to find it very useful to show (I'm right clicking here), and now I'm going to show the standardized estimates on the diagram. All right. This means that we're going to have
  estimates here for the factor loadings that are in the correlation metric, which is really useful and important for interpreting these loadings in factor analysis.
  This value here is also going to be rescaled so that it represents the actual correlation between these two latent variables, which in this case is
  a substantial .4 correlation. In the red triangle menu, we can also find the standardized parameter estimates in table form, so we can get more details about standard errors, Z statistics, and so on.
  But you can see here that all of these values are, you know, pretty acceptable. They are the correlation between the observed variable and the latent variable.
  And they're generally about .48 to about .8 or so around here. So those are pretty decent values. We want them to be as high as possible.
  And another thing you're going to notice here about our diagrams, which is a JMP Pro 16 feature, is that
  these observed variables have a little bit of shading, a little gray area here, which actually represents the amount or portion of
  variance explained by the latent variable. So it's really cool, because you can really see just visually from the little shading that the variables for symbolic threats
  actually have more of their variance explained by the latent variable. So perhaps this is a better factor than the realistic threat factor, just based on looking at how much variance is explained.
  Now I want to draw your attention to an option called assess measurement model and that is going to be super helpful to allow us to understand whether
  the questions, the individual questions in that survey are actually good questions.
  We know that based on the indicator reliability; we want our questions to be reliable, and that's what we are plotting over on this side. So notice we give a little
  line here for a suggested threshold of what's acceptable reliability for any one of these questions, and you can see that the symbolic threat factor
  seems to be doing a little better there; its questions are more reliable in comparison to the realistic threat. But generally speaking, they're both fairly good.
  The fact that a couple of questions don't cross the line is not terrible; those two questions are around the
  adequate threshold that we would expect for indicator reliability. We also have statistics here that tell us about the reliability of the composite. In other words, if we were to
  grab all of these questions for realistic threat and get an average score for all of those answers
  per individual, that would be a composite for realistic threat, and we could do the same for symbolic.
  And so what we have here is that index of reliability. All of these indices, by the way, range from zero to one.
  And so we want them to be as close to one as possible because we want them to be very reliable and we see here that both of these
  composites have adequate reliability. So they are good in terms of using an average score across them for other analyses.
  We also have construct maximal reliability, and these are the values of reliability for the latent variables themselves rather than for
  averages. So we're always going to have these values be a bit higher, because when you're using latent variables, you're going to have better measures.
  The construct validity matrix gives us a ton of useful information. The key here is that the lower triangular simply has the correlation between the two factors in this case.
  But the values in the diagonal represent the average variance extracted across all of the indicators of the factor.
  And so here you see that symbolic threats have more explained variance on average than realistic threat, but they both have substantial values here, which is good.
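  Both composite reliability and the average variance extracted can be computed from standardized loadings. Here is a sketch with hypothetical loadings in the range mentioned earlier (roughly .48 to .8), not the actual estimates from the study:

```python
import numpy as np

# Hypothetical standardized loadings for one factor's five indicators.
lam = np.array([0.48, 0.60, 0.70, 0.75, 0.80])

# Composite reliability: reliability of a sum/average score built from the
# items -- squared total loading over squared total loading plus total
# unique variance (1 - lambda^2 per item, in the standardized metric).
cr = lam.sum()**2 / (lam.sum()**2 + (1 - lam**2).sum())

# Average variance extracted: mean squared standardized loading; these are
# the diagonal values of the construct validity matrix described above.
ave = np.mean(lam**2)

print(round(cr, 2), round(ave, 2))   # roughly 0.80 and 0.46
```

  Higher loadings push both numbers toward one, which is why a factor with more reliable indicators shows both a higher composite reliability and a larger diagonal value in the construct validity matrix.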
  And most importantly, we want these diagonal values to be larger than the upper triangular, because the upper triangular represents the overlapping variance between the two factors.
  And you can imagine, we want the factors to have more overlapping variance with their own indicators than with some other
  construct, a different factor. And so this right here, together with all of these other statistics,
  are good evidence that the survey is giving us valid and reliable answers and that we can in fact use it to pursue other questions. And so that's what we're going to do here. We're going to close this and I'm going to
  run a different model, we're going to relaunch our platform, but this time I'm going to use...
  I have a couple of variables here that I created. These two are
  composites, they're averages for all of the questions that were related to realistic and symbolic threats. So I have those composite scores right here.
  And I'm going to model those along with...I have a measure for anxiety. I have a measure for negative affect. And we can also add a little bit of the CDC
  adherence. So these are whether people are adhering to the recommendations from the CDC, the public health behaviors. And so we're going to launch the platform with all of these variables.
  And what I want to do here is focus perhaps on fitting a regression model. So I selected those variables, my predictors, my outcomes. And I just click the one-headed arrow to set it up. Now the model's not fully...
  correctly specified yet, because we want to make sure that both our
  threats here are covarying with each other, because we don't have any reason to assume they don't covary.
  Same with the outcome, they need to covary because we don't want to impose any weird restrictions about them being orthogonal.
  And so this right here is essentially a path analysis. It's a multivariate multiple regression analysis. And so we can just put it here, multivariate regression, and we are going to go with this and run our model.
  Okay. So notice that
  we have zero degrees of freedom, because we've estimated every variance and covariance amongst the data. So even though this suggests a perfect fit,
  all we've done so far is fit a regression model. And what I want to do to really emphasize that is show you...I'm going to change the layout of my diagram
  to make it easier to show...the associations between both types of threat. I want to hide variances and covariances and you can see here, just...
  I've hidden the other edges so that we can just focus on the relations of these two threats have on our three outcomes. Now in my data table, I've already fit a, using fit model, I used anxiety that same
  variable as my outcome and the same two threats as the predictors. And I want to put them side by side because I want to show you that in fact
  the estimates for the regression of the prediction of anxiety here are exactly the same values that we have here in the SEM platform for both of those predictions.
  And that's what we should expect. Fit model, I mean, this is regression. So far we're doing regression.
  Technically, you could say, well, if I'm comfortable running three of those fit models with these three different outcomes, then what is SEM buying me that's better?
  Well, you might not need anything else, and this might be great for your needs, but one unique feature of SEM is that we can test directly
  whether equality constraints in the model are tenable. So what we mean by that is that I can take these two effects (for example, realistic and symbolic threat effects on anxiety) and I can use this Set Equal button to impose an equality constraint. Notice here,
  these little labels indicate that these two paths will have an equal effect. And we can then
  run the model and we can now select those models in our model comparison table and compare them with this compare selected models option.
  And what we'll get is a change in chi square. So, we see just the change of chi square going from one model to the next. So this basically says, okay,
  the model is going to get worse because you've now added a constraint. So you gained a degree of freedom, you have now more misfit, and the question is, is that misfit significant?
  The answer in this case is, yes. Of course, this is influenced by sample size. So we also look at the difference in the CFI and RMSEA and anything that's .01 or larger suggests that
  the misfit is too much to ignore. It's a significant additional misfit added by this constraint.
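The model-comparison logic described here is a chi-square difference test between nested models, which can be sketched outside of JMP. The fit statistics below are made-up numbers for illustration only, not values from the presentation:

```python
from scipy.stats import chi2

# Hypothetical fit statistics for two nested models (illustrative only).
chisq_free, df_free = 52.3, 24                # model without the equality constraint
chisq_constrained, df_constrained = 61.8, 25  # model with the paths set equal

# For nested models, the difference in chi-square follows a chi-square
# distribution with df equal to the gained degrees of freedom.
delta_chisq = chisq_constrained - chisq_free
delta_df = df_constrained - df_free
p_value = chi2.sf(delta_chisq, delta_df)

# A small p-value means the constraint adds significant misfit,
# so the less constrained model is preferred.
print(f"delta chi-square = {delta_chisq:.1f} on {delta_df} df, p = {p_value:.4f}")
```

Because the chi-square difference test is sensitive to sample size, the change in CFI and RMSEA (with the .01 rule of thumb mentioned above) is checked alongside it.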
  So now that we know that, we know that this is the better-fitting model, the one that did not have that constraint, and we can assert that realistic threats have a greater positive
  association with anxiety in comparison to the symbolic threats, which also have a positive, significant effect, but one that is statistically not as strong as that of the realistic threats. All right. And there are other
  interesting effects that we have here. So what I'm going to do as we are approaching the end of the session is just draw your attention to this interesting
  effect down here where both types of threats have different effects on the adherence to CDC behaviors. And the article really
  pays a lot of attention to this finding because, you know, these are both threats. But as you might imagine, those who feel threats to their personal
  health or threats to their financial safety, they're more likely to adhere to the CDC guidelines of social distancing and avoiding social gatherings, whereas those who
  are feeling that the threat is symbolic, a threat to their sociocultural values,
  those folks are significantly less likely to adhere to those CDC behaviors, perhaps because they are feeling those social cultural norms being threatened. And so it's an interesting finding and we see that here we can of course test equivalence of a number of other paths in this model.
  Okay, so the last thing I wanted to do is just show you (we're not going to describe this full model), but I did want to show you what happens when you bring together both
  (let's make this a little bigger) regression and factor analysis.
  To really use the full potential of structural equation models, you ought to model your latent variables, we have here both threats as latent variables which allow us to really purge the measurement error from those
  survey items, and model associations between latent variables, which allows us to have unbiased, unattenuated effects, because we are accounting for measurement error when we measure
  the latent variables directly. And notice, we are taking advantage of SEM, because we're looking at sequential associations across a number of different factors, and so down here you can see our cool diagram, which
  I can move around to try and show you all the cool effects that we have and also highlight the fact that our diagrams are
  fully interactive, really visually appealing, and very quickly we can identify, you know, significant effects based on the type of line versus non significant effects, in this case, these dashed lines. And so
  again to really have the full power of SEM, you can see how here we're looking at those threats of latent variables and looking at their associations with a number of other
  public health behaviors and with well being variables. And so with that, I am going to stop the demo here and
  let you know that we have a really useful document in addition to the slides, we have a really great document that the tester for our platform, James Koepfler, put together where he gives a bunch of really great tips on how to use our platform,
  from specifying the model, to tips on how to compare models, what is appropriate to look at, what's a nested model, all of this information I think you're going to find super valuable if you decide to use
  the platform. So definitely suggest that you go on to JMP Community to get these materials which are supplementary to the presentation. And with that, I'm going to wrap it up and we are going to take questions once this all goes live. Thank you very much.



This webinar is super pertinent and complete. But I still have a question about how to create a factor grouping variables (questions), like Dr. Laura Castro-Schilo did with her two factors.


Thank you for your help!




Hi @NatalyLevesque,


Glad to know you enjoyed the webinar! 


The first step for creating a measurement scale, after you've collected your data, is to do an exploratory factor analysis. You can do this in JMP by going to Analyze > Multivariate Methods > Factor Analysis, selecting all your measured variables, and launching the platform. The platform allows you to explore the adequate number of factors (i.e., latent variables) to extract. Once you've found a good factor solution (e.g., you've accounted for a good amount of variance in the data, you have factors with a combination of strong and weak standardized factor loadings, known as simple structure, acceptable fit, etc.), you should use an independent dataset to fit a confirmatory factor analysis in SEM that helps you validate the factorial structure (the pattern of standardized factor loadings). You can find more info on exploratory factor analysis in our documentation: 


In sum, the standardized factor loadings of an exploratory factor analysis help you identify which variables group together to define the latent variable in SEM. Theory and domain expertise should also guide this process, as you probably have an expectation of which variables are hypothesized to be caused by the same unobserved (latent) variable.




Thank you Dr Castro-Schilo,


But I still have a question on how to create a factor (variable) grouping the elements of one dimension (the questions) into one column, and then test the effect of that dimension on the others. It's like when you go to see the fit measurements -> Composite reliability / construct.


For example, I have 3 dimensions:

- Cognitive: 2 sub-dimensions (13 items)

- Affective: 1 sub-dimension (6 items)

- Behavior: 3 sub-dimensions (14 items)


How do I create a new variable comprising the 2 cognitive sub-dimensions, another with the 6 items of the affective dimension, and one with the 14 that make up the behavior dimension? In your tutorial, you mention having done it, and I see the variable with a « star » after its name. If you watch your video at 34 minutes 40 seconds, you will find what I mean. I hope I am clear...


Thank you very much for your collaboration!



Dear Dr. Castro-Schilo,


I forgot to mention that I have already done my exploratory factor analysis, which allowed me to reduce from 100 questions to 33. Now I want to do confirmatory factor analysis and SEM, including path analysis and convergent, discriminant, and nomological validity.


So, how do we create a model like this in the SEM of JMP Pro? See the photo attached (Sorry for the quality of my drawing).


Thank you so much for your help.



That's a great diagram! =)


Composite variables, as those I show in the video at 34 minutes 40 seconds, are just averages across the variables that load onto a latent variable. JMP makes it easy to create these averages by selecting the columns in the data table, right-clicking, and finding the New Formula Column > Combine > Average option:
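For reference, the same row-wise averaging can be sketched in Python with pandas; the item names and response values below are invented for illustration:

```python
import pandas as pd

# Hypothetical responses to three survey items from one sub-dimension.
df = pd.DataFrame({
    "Threat1": [4, 2, 5],
    "Threat2": [3, 2, 4],
    "Threat3": [5, 1, 4],
})

# A composite score is the row-wise average of its items, the same
# computation as New Formula Column > Combine > Average in JMP.
df["RealisticThreat"] = df[["Threat1", "Threat2", "Threat3"]].mean(axis=1)
print(df)
```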



Once you have the composite variables, you can use those to create latent variables as described in the video at 22 minutes 19 seconds. However, you might want to use your original variables and specify a "higher-order confirmatory factor model" by first using the original questions to create your sub-dimensions and then selecting those latent variables to create an overall latent variable that represents your higher-order construct:




Moreover, the statistics from "Assess Measurement Model" can be computed from the current JMP 15.2 output (the option will be available in the upcoming JMP Pro v16.0). Here's a brief description to help you obtain a few of those statistics:


Indicator Reliability: these are squared standardized loadings. You can obtain the values by selecting "Standardized Parameter Estimates" from the red triangle menu. Then, hover over the standardized loadings, right-click on the table, and select "Make into Data Table." In the new data table, right-click over "Estimate" and select New Formula Column > Transform > Square. The new column has the indicator reliabilities, which you can take to Graph Builder to make an Estimate^2 by Loading barplot.
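The squaring step can be sketched in Python with pandas; the loading names and values below are hypothetical, standing in for the exported "Standardized Parameter Estimates" table:

```python
import pandas as pd

# Hypothetical standardized loadings, as exported via "Make into Data Table".
loadings = pd.DataFrame({
    "Loading":  ["Realistic1", "Realistic2", "Symbolic1", "Symbolic2"],
    "Estimate": [0.81, 0.77, 0.74, 0.68],
})

# Indicator reliability = squared standardized loading: the proportion of
# each item's variance accounted for by its latent variable.
loadings["Estimate^2"] = loadings["Estimate"] ** 2
print(loadings)
```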


Construct Validity Matrix: the lower triangle has standardized covariances (correlations) among latent variables, which are displayed when you get the table of standardized estimates. The upper triangle has squared correlations among latent variables, so simply square the correlation values. Lastly, the diagonal has the average variance extracted (AVE) by each latent variable; you can obtain these values by averaging the squared standardized loadings for each latent variable. In the data table that has the Estimate^2 column, select the "Loadings" column, right-click on it, and go to New Formula Column > Character > First Word. Then go to the Tables menu and select Summary. Place "First[Loadings]" under "Group" and "Mean(Estimate^2)" under "Statistics." After clicking OK, you'll see a new data table with the average variance extracted by each latent variable.
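The group-and-average step for the AVE follows the same pattern; again, the factor names and squared loadings below are hypothetical, with the "Factor" column playing the role of First[Loadings]:

```python
import pandas as pd

# Hypothetical squared standardized loadings (Estimate^2) per indicator.
df = pd.DataFrame({
    "Factor":     ["Realistic", "Realistic", "Realistic",
                   "Symbolic", "Symbolic", "Symbolic"],
    "Estimate^2": [0.66, 0.59, 0.48, 0.55, 0.46, 0.52],
})

# Average variance extracted (AVE): the mean squared standardized loading
# for each latent variable; these values go on the matrix diagonal.
ave = df.groupby("Factor")["Estimate^2"].mean()
print(ave)
```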


Exploring reliability and validity takes a bit longer in version 15.2, but I certainly hope you upgrade to v16 this March so you can take advantage of these and other great new features!




Thank you very much: it's clear and answers all my questions.


Best regards!


Nataly :)