lujc07 · Feb 2, 2022 09:32 PM

Hi!

I am using structural equation modeling (SEM) for some analyses. I have 75 samples and several variables, one of which has 7 missing values. The SEM platform still runs with missing values. Could anyone tell me what technique JMP uses to handle missing values when fitting a structural equation model (e.g., pairwise deletion, listwise deletion)? Thanks.

LauraCS · Feb 3, 2022 08:51 AM

Hello @lujc07,

When missing data are present, the SEM platform uses all available data to obtain a maximum likelihood estimate. This method is known in the literature as full information maximum likelihood (FIML), and has been shown to perform very well. You can see a recent comparison between FIML and Multiple Imputation (MI) in this article (https://psycnet.apa.org/record/2021-12018-001), where the authors affirm that their study "confirmed well-known knowledge that the two estimators [FIML and MI] tend to yield essentially equivalent results."
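To make "uses all available data" concrete, here is a minimal Python sketch of the casewise likelihood idea behind FIML for two variables. It is only an illustration under my own assumptions, not JMP's implementation: the function names (`fiml_loglik`, `_normal_logpdf`) are hypothetical, missing values are marked with `None`, and `mu` and `sigma` stand in for the means and covariances that an SEM would derive from its model parameters. Each row contributes the normal log-density of whatever it has observed, so no row with partial data is thrown away.

```python
import math

def _normal_logpdf(value, mean, variance):
    """Log-density of a univariate normal distribution."""
    return -0.5 * (math.log(2 * math.pi * variance) + (value - mean) ** 2 / variance)

def fiml_loglik(rows, mu, sigma):
    """Sum of casewise normal log-likelihoods for two variables.

    rows  : list of (x, y) pairs, with None marking a missing value
    mu    : (mean_x, mean_y)
    sigma : 2x2 covariance matrix as nested lists or tuples
    """
    (s11, s12), (_, s22) = sigma
    total = 0.0
    for x, y in rows:
        if x is not None and y is not None:
            # Complete row: bivariate normal log-density.
            dx, dy = x - mu[0], y - mu[1]
            det = s11 * s22 - s12 ** 2
            quad = (s22 * dx * dx - 2 * s12 * dx * dy + s11 * dy * dy) / det
            total += -math.log(2 * math.pi) - 0.5 * math.log(det) - 0.5 * quad
        elif x is not None:
            # Only x observed: use the marginal distribution of x.
            total += _normal_logpdf(x, mu[0], s11)
        elif y is not None:
            # Only y observed: use the marginal distribution of y.
            total += _normal_logpdf(y, mu[1], s22)
        # Rows with nothing observed contribute nothing.
    return total

# Hypothetical data: the second and third rows are each missing one value.
rows = [(1.2, 0.7), (0.3, None), (None, -0.4)]
print(fiml_loglik(rows, mu=(0.0, 0.0), sigma=[[1.0, 0.5], [0.5, 1.0]]))
```

An estimation routine would then search for the model parameters that maximize this sum.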

FIML is a cutting-edge method for handling missing data that will give you unbiased estimates but, just like MI, it requires the assumption that your data are missing at random (known as "MAR" in the literature... but if they're missing "completely at random" aka MCAR, that's even better!). The MAR assumption means that your missing data can have systematic differences from your observed data, as long as those differences depend only on other observed variables that are accounted for in your model. In contrast, the MCAR assumption means the data that are missing are simply a random subset of what the full data would have been.

When listwise or pairwise deletion is used, analysts are assuming that their data are MCAR. If the data are instead MAR, the resulting estimates can be biased. Pairwise deletion has other potential issues too.
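A small simulation can illustrate the difference. The Python sketch below is a toy example under my own assumptions (simulated data, hypothetical function names, made-up parameter values), not output from JMP. It generates a fully observed x and an outcome y that is missing more often when x is low, which is a MAR mechanism. It then compares the listwise-deletion mean of y with the maximum likelihood estimate, which for this special case (bivariate normal data, x fully observed) has a well-known closed form based on the regression of y on x.

```python
import random

def simulate_mar(n=20000, seed=7):
    """Simulate (x, y) pairs where y is missing at random given x."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        x = rng.gauss(0, 1)
        y = 0.8 * x + rng.gauss(0, 0.6)  # the true mean of y is 0
        # MAR: y is usually missing when the observed x is below zero.
        if x < 0 and rng.random() < 0.8:
            y = None
        data.append((x, y))
    return data

def mean(values):
    return sum(values) / len(values)

def listwise_mean_y(data):
    """Mean of y using only the rows where y was observed."""
    return mean([y for _, y in data if y is not None])

def ml_mean_y(data):
    """ML estimate of the mean of y when x is complete (bivariate normal)."""
    complete = [(x, y) for x, y in data if y is not None]
    xbar_c = mean([x for x, _ in complete])
    ybar_c = mean([y for _, y in complete])
    # Slope of y on x, estimated from the complete rows.
    sxy = sum((x - xbar_c) * (y - ybar_c) for x, y in complete)
    sxx = sum((x - xbar_c) ** 2 for x, _ in complete)
    slope = sxy / sxx
    # Adjust the complete-case mean using every observed x.
    xbar_all = mean([x for x, _ in data])
    return ybar_c + slope * (xbar_all - xbar_c)

data = simulate_mar()
print("listwise deletion:", listwise_mean_y(data))
print("maximum likelihood:", ml_mean_y(data))
```

In this setup the listwise mean should land well above the true value of 0 because the low-x rows are underrepresented, while the likelihood-based estimate should stay close to 0 because it borrows information from the observed x values.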

In sum, FIML and MI are two cutting-edge methods for missing data but FIML is much easier to implement and that's what is used in our SEM platform.

HTH,

~Laura

Laura C-S

LauraCS · Feb 4, 2022 11:45 AM

Hi Juncheng,

You're not alone in the confusion! This is a very common question because the labels MCAR, MAR, and MNAR are not intuitive. Based on your description, it sounds like your 7 observations are indeed MCAR. As long as there isn't anything peculiar about these 7 sites that relates to your outcome of interest, then you should be okay with the MCAR assumption.

Also, here's my attempt to clarify the differences between all these missing data mechanisms (this is oversimplified, so I definitely recommend more reading to get full clarity):

MCAR: Missing data are a random subset of what you would've observed if you had zero missing.

MAR: Missing data are not random (totally counterintuitive given the "MAR" label--this leads to much of the confusion!). There's a systematic relation between the missing observations and factors you measured.

MNAR: Missing data are systematically related to factors you didn't measure (and you might not even know about).

I made an illustration that attempts to show these differences:

The first data set shows you the responses to three hypothetical questions from a survey. The image shows what your data would look like if you didn't have any missing values. To the right, the data shows a pattern of missing values (in orange) in the outcome (Y) that are missing completely at random. Next over, you can see the MAR mechanism; that is, the missing values tend to be related to the Age and Sex variables such that mostly younger males failed to provide an answer to the outcome. Lastly, you can see the MNAR pattern, where the missing values are related to neither Age nor Sex (they're spread out across younger/older males/females)--however, let's say we failed to collect data on income and if we had observed those unmeasured data, we would see a relation between the values that are missing and the unobserved variable (notice it's the low income folks that didn't respond).
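The same three patterns can be mimicked with a short simulation. This Python sketch is a toy example under my own assumptions: the function names, the age range, and the missingness probabilities are made up for illustration and are not taken from the illustration. It flags which Y values would be missing under each mechanism and then compares missing rates between groups.

```python
import random

def simulate(n=20000, seed=1):
    """Simulate survey rows with a missing-Y flag for each mechanism."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        age = rng.randint(18, 70)
        sex = rng.choice(["M", "F"])
        low_income = rng.random() < 0.3  # pretend income was never collected
        young_male = age < 30 and sex == "M"
        rows.append({
            "young_male": young_male,
            "low_income": low_income,
            # MCAR: everyone has the same chance of a missing Y.
            "mcar": rng.random() < 0.2,
            # MAR: the chance depends on the observed Age and Sex.
            "mar": rng.random() < (0.6 if young_male else 0.05),
            # MNAR: the chance depends on the unobserved income.
            "mnar": rng.random() < (0.6 if low_income else 0.05),
        })
    return rows

def missing_rate(rows, mechanism, group, value):
    """Share of rows with missing Y among rows where rows[group] == value."""
    subset = [r for r in rows if r[group] == value]
    return sum(r[mechanism] for r in subset) / len(subset)

rows = simulate()
for mech in ("mcar", "mar", "mnar"):
    print(
        mech,
        "young males:", round(missing_rate(rows, mech, "young_male", True), 3),
        "everyone else:", round(missing_rate(rows, mech, "young_male", False), 3),
    )
```

Only the MAR flag should show a clear gap between young males and everyone else. Under MNAR the gap appears only when you split by `low_income`, which is exactly the variable an analyst would not have.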

Craig Enders has an excellent book titled "Applied Missing Data Analysis" so I encourage you to check it out if you want to learn more about these issues.

Best,

~Laura

Laura C-S

lujc07 · Feb 3, 2022 03:13 PM

Hi Laura,

Thank you so much for your answer. I have a question about determining whether the data are MAR, MCAR, or MNAR. I tried to understand the definitions but am still confused and don't know which category my data belong to. Could you please give me the answer?

I have 75 samples (75 sampling sites). The variable with 7 missing values is a streamflow metric. This metric is calculated from continuous streamflow data collected by a specific device. The 7 values are missing because this type of device failed or was washed away at 7 of the sampling sites (the failures and washouts were random), so we were not able to collect the data. I am wondering whether the data are MAR or MCAR, but I am not sure which is right. Thanks.

Juncheng
