Novice_Hector
Level II

Analysis of Data Using Mean and SD

I wanted to know if there is a way to run models and perform data analysis when the data is only available as means and standard deviations. Normally, JMP calculates the mean and standard deviation from the raw data and then uses them for your chosen method. Unfortunately, my partners only have the data available as the means and SD per formulation. Thank you very much, and have a great day. I appreciate your time and expertise.

Hector
8 REPLIES
statman
Super User

Re: Analysis of Data Using Mean and SD

I'm sorry if I don't understand your question. I think it is appropriate to look at the raw data before summarizing and describing it with enumerative statistics (e.g., mean and standard deviation). Are those the appropriate statistics to use? Once you have the summary statistics, you can certainly model them. I would first look to see if they correlate, then run separate fit models to assess model effects.

"All models are wrong, some are useful" G.E.P. Box
Novice_Hector
Level II

Re: Analysis of Data Using Mean and SD

Unfortunately, I only received my data as the summary statistics rather than the individual raw data. I may be able to get them if I keep asking my partner, but they may not have stored them. Thank you for the suggestion.

Hector
Potcner
Staff

Re: Analysis of Data Using Mean and SD

Yes, you can do the analysis on the means and std devs much as you would if you had the raw data. In your example, however, 'Formulation' isn't a possible factor to use, as that's just an ID for each set of data values. You would just want to use 'Percent'. That's a factor that has 3 levels, with 2 values at 0, 3 at 12.71, and 4 at 30.

If you have 'Percent' as a categorical variable, then the Fit Y by X platform will run ANOVAs. Note: Some people like to use Log(SD) as the response instead of just the SD, as that transforms the values so they're more normally distributed. That isn't necessary, though, unless you're trying to do some very precise inference. You're probably fine as is.
If you have 'Percent' as a continuous variable, then the Fit Y by X platform will fit a linear regression.
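
For anyone following along outside JMP, here is a minimal Python sketch of those two options: a one-way ANOVA F statistic when 'Percent' is categorical, and a least-squares line when it is continuous. The numbers are invented for illustration (only the 2/3/4 split across the three Percent levels follows the example), and the helper names `one_way_f` and `fit_line` are hypothetical, not JMP functions.

```python
import math
from statistics import mean

# Hypothetical summary rows: (percent filler, reported mean, reported SD).
# Values are made up; only the 2/3/4 layout across levels follows the thread.
rows = [
    (0.0, 10.0, 1.0), (0.0, 12.0, 1.5),
    (12.71, 15.0, 2.0), (12.71, 16.0, 1.8), (12.71, 17.0, 2.2),
    (30.0, 20.0, 3.0), (30.0, 22.0, 2.5), (30.0, 21.0, 2.8), (30.0, 23.0, 3.2),
]
percent = [r[0] for r in rows]
means = [r[1] for r in rows]
log_sds = [math.log(r[2]) for r in rows]  # Log(SD) as an alternative response


def one_way_f(x, y):
    """One-way ANOVA treating x as categorical: (F, df_between, df_within)."""
    groups = {}
    for xi, yi in zip(x, y):
        groups.setdefault(xi, []).append(yi)
    grand = mean(y)
    ss_between = sum(len(g) * (mean(g) - grand) ** 2 for g in groups.values())
    ss_within = sum((v - mean(g)) ** 2 for g in groups.values() for v in g)
    df_between = len(groups) - 1
    df_within = len(y) - len(groups)
    return (ss_between / df_between) / (ss_within / df_within), df_between, df_within


def fit_line(x, y):
    """Simple least-squares line treating x as continuous: (slope, intercept)."""
    xbar, ybar = mean(x), mean(y)
    sxx = sum((xi - xbar) ** 2 for xi in x)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, ybar - slope * xbar


# Each summary row counts as one observation, so there are 9 data points.
f_mean, df_between, df_within = one_way_f(percent, means)
f_logsd, _, _ = one_way_f(percent, log_sds)
slope, intercept = fit_line(percent, means)

print("F for the means:", f_mean, "with df", df_between, df_within)
print("F for Log(SD):", f_logsd)
print("Line for the means: slope", slope, "intercept", intercept)
```

Note that the error degrees of freedom come from the number of summary rows, not from however many raw measurements sit behind each row.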

Just make sure when interpreting results you do so knowing the data analyzed are the means and std deviations and not individual data values for whatever attribute was measured. A nice graph to do with this kind of data is to plot the means on one axis, the std dev on the other, and then use a symbol or color to show the 3 % levels. This allows you to simultaneously view the central tendency and variation for each of the 3 % levels.

I attached a .jmp file with the analyses and graphs.

dlehman1
Level VI

Re: Analysis of Data Using Mean and SD

Your response confuses me a bit. I'll admit I hadn't thought of doing the analysis you suggest, and it might be what Hector is looking for. But my initial reaction was that more information would be required, at least something about the sample sizes that lie behind these summary statistics. My inclination would have been to say that some simulated data would be necessary. If you make an assumption about the distribution of the individual data points behind these means and standard deviations, then you can create a simulated data set (perhaps several, based on different distributional assumptions) to analyze rather than just focusing on the available summary statistics.
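
To make the simulation idea concrete, here is a minimal Python sketch under two loud assumptions: the individual values are normally distributed, and the sample size behind each summary is known (the n = 5 below is purely hypothetical). The function name `simulate_group` is mine, not something from JMP. It draws normal values and rescales them so the simulated sample reproduces the reported mean and SD exactly.

```python
import random
from statistics import mean, stdev


def simulate_group(target_mean, target_sd, n, seed=0):
    """Simulate n values whose sample mean and sample SD equal the targets.

    Assumes normally distributed raw data; other distributions would need
    a different generator but the same rescaling step.
    """
    if n < 2:
        raise ValueError("need at least 2 values to define a sample SD")
    rng = random.Random(seed)
    draws = [rng.gauss(0.0, 1.0) for _ in range(n)]
    m, s = mean(draws), stdev(draws)
    # Standardize the draws, then shift and scale to the reported summary.
    return [target_mean + target_sd * (d - m) / s for d in draws]


# Hypothetical formulation: reported mean 15.0, SD 2.0, assumed n = 5.
sim = simulate_group(15.0, 2.0, n=5)
print(sim)
print(mean(sim), stdev(sim))
```

Repeating this with several seeds, or with a different assumed distribution, gives a feel for how sensitive any downstream conclusion is to those assumptions.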

Now, based on your suggestion, I see that such an analysis might be unnecessary.  However, it seems to me that your analysis might be making some implicit assumptions about the data distributions behind these summary statistics.  Can you comment on this?

Potcner
Staff

Re: Analysis of Data Using Mean and SD

Hi Dale.

Yes, it's always nice to have as much info as you can about how the data was generated, and I certainly wouldn't pass on looking at that if available. But it's not necessary to know those details in order to do a formal analysis that can estimate how the factor(s) being studied affect both the central tendency and variation in the attribute being measured, as in Hector's example.


For many analyses, we often take an individual data value to be just that. However, in most engineering applications, that single data value is typically some kind of summary statistic from a measurement system that may be taking many measurements (possibly not even available to see) in order to produce that one value. Example: think of some kind of optical measurement system scanning a substrate and making hundreds of light reflectance measurements. The system may spit out a few numbers (the min and max reflectance, or different percentiles, the mean or the median, or some kind of measure of variation, etc.). And we go forward and run analyses on any of those, treating one set of summary statistics as one experimental unit (n=1). We just have to keep in mind which characteristic the values we're using for our analyses actually measure. If we measure a second piece of substrate that received a treatment (i.e., another experimental unit), then we would treat it as n=2.
If that measurement system instead made a million light reflectance measurements on each of those substrates in order to come up with that set of summary statistics, our analysis would still treat it as n=2.

Again, I still get your point that having those behind-the-scenes details would be nice, but Hector can still do a one-way ANOVA or regression, knowing his analyses are tracking two components (central tendency and variation) in whatever is being measured.

Novice_Hector
Level II

Re: Analysis of Data Using Mean and SD

Potcner!

Thank you very much for your detailed response. Sorry about the confusion with the factors shown in the screenshot, but I left out as much detail as possible from this randomized example to avoid giving away the components of my formulation. Yes, indeed, every formulation comprises various numerical and categorical factors not shown above. The "percent" factor is the percentage of filler used in the formulation and scales with my response.

My question is as follows: when running a model such as standard least squares or generalized linear regression, wouldn't the degrees of freedom, R-squared values, analysis of variance, and lack-of-fit analysis be artificially low and useless?

Hector

Re: Analysis of Data Using Mean and SD

Hello,

Looking at your data set, there are a lot of questions you first need to answer, before you will know if an analysis is possible. They clearly did 9 formulations of something for a certain purpose. In my experience these formulations will consist of some base ingredients, most likely consistent over such a study. So if you could get your partners to give you that information behind the formulation column, then you might be able to extract some model approximating the impact of the formulation ingredients. At the moment if the nine formulations are seen as categorical, there is no way of interpreting the data in a sensible way.

Also what is the purpose of the "percent" column? Percent of what? It has three different values and if you have the formulations as categorical variables, you can at best compare within those three groups.

So a deeper analysis only makes sense if you can see behind the different formulations and add some information there. You also need to know the number of observations behind your means and standard deviations, because they may not really be comparable! Remember that in both formulas the number of observations is in the denominator... And while you are at it, be aware that only a well-designed experiment allows proper insights; otherwise it is just correlation, and there might be no connection, or a completely different one, behind the observations.
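
To show why the number of observations matters, here is a minimal Python sketch, assuming each reported SD is a sample standard deviation (n - 1 in the denominator) and that n is known for every formulation. Under those assumptions, the standard error of each mean and even a one-way ANOVA across groups can be rebuilt from the summaries alone. The function names and the example values are hypothetical.

```python
import math


def standard_error(sd, n):
    """Standard error of a mean: the same SD means less with small n."""
    return sd / math.sqrt(n)


def anova_from_summary(groups):
    """One-way ANOVA from (mean, sd, n) triples: (F, df_between, df_within).

    Assumes each sd is a sample SD computed with n - 1 in the denominator.
    """
    total_n = sum(n for _, _, n in groups)
    grand = sum(m * n for m, _, n in groups) / total_n
    ss_between = sum(n * (m - grand) ** 2 for m, _, n in groups)
    ss_within = sum((n - 1) * sd ** 2 for _, sd, n in groups)
    df_between = len(groups) - 1
    df_within = total_n - len(groups)
    return (ss_between / df_between) / (ss_within / df_within), df_between, df_within


# Hypothetical summaries: (mean, SD, number of observations).
summaries = [(10.0, 1.0, 5), (12.0, 1.5, 5), (15.0, 2.0, 3)]
for m, sd, n in summaries:
    print("mean", m, "SE", standard_error(sd, n))
print(anova_from_summary(summaries))
```

Without n, neither quantity can be computed, which is why two means with the same SD are not necessarily equally trustworthy.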

Novice_Hector
Level II

Re: Analysis of Data Using Mean and SD

Winfried!

Thank you for the comment. I appreciate your experience and questions. In the example above, I tried to hide my formulation by referring to them by their assigned generic value to avoid revealing what I was working on. As for the percent, it is just the percent of the filler I am studying in my experimentation. 

The experiment is part of a DOE I conducted, and the 9 formulations above are part of a larger scheme studying multiple factors. I randomized the information and extracted 9 values to represent my problem. Sorry for the confusion. 

Hector
