bjoern_arnold
Level III

Distribution fitting given cumulative (quantile) data?

Hi,

 

I have data in the form of quantiles (percentiles, actually) and would like to estimate distribution parameters and draw a Q-Q plot from them. I once used the Life Distribution platform for fitting, but that only worked for probability data, by using the probability as a frequency variable.

 

Maybe that's a trivial task in JMP but I don't see how I could do it. 

 

Björn

4 REPLIES
peng_liu
Staff

Re: Distribution fitting given cumulative (quantile) data?

Let me guess what your data look like: you have 1% at x01, 2% at x02, ..., 50% at x50, ..., 99% at x99.

In that case, simply put x01, x02, ..., x99 in a column and use Life Distribution to fit a distribution to that column, with nothing else.

What result do you get? You can do the same with Distribution, too.

Now, what if the percentiles are not evenly spaced? Say there is no x02, x03, or x04, so the data go straight from x01 to x05. How should you treat x05? Should it count the same as x01 and x06? It shouldn't.

The above strategy still applies, but the final result depends on how you treat x05, and there is no unique way to do so. One way is to create a new x05 = (x01 + x06)/2, so that it lies midway between x01 and x06. You also need a new frequency column, filled with 0.01 except for the entry corresponding to x05, which should be 0.04. Then use the x values together with the frequency column to fit a distribution, using either Life Distribution or Distribution.
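To make the frequency-column idea concrete outside JMP, here is a minimal Python sketch. It is not what Life Distribution or Distribution does internally. It assumes a normal distribution, uses synthetic percentiles generated from a made-up normal with mean 10 and standard deviation 2, and the helper names `frequency_weights` and `weighted_normal_fit` are hypothetical. Each listed percentile gets a weight equal to the probability gap to the previous listed percentile, so the entry at 5% that follows 1% gets 0.04.

```python
from math import sqrt
from statistics import NormalDist


def frequency_weights(probs):
    """Weight of each percentile = probability gap to the previous one."""
    weights, previous = [], 0.0
    for p in probs:
        weights.append(p - previous)
        previous = p
    return weights


def weighted_normal_fit(values, weights):
    """Weighted mean and standard deviation (normal fit with frequencies)."""
    total = sum(weights)
    mu = sum(w * v for w, v in zip(weights, values)) / total
    var = sum(w * (v - mu) ** 2 for w, v in zip(weights, values)) / total
    return mu, sqrt(var)


# Synthetic, unevenly spaced percentiles (assumption: made-up example data).
probs = [0.01, 0.05, 0.06, 0.10, 0.25, 0.50, 0.75, 0.90, 0.99]
true_dist = NormalDist(mu=10.0, sigma=2.0)
values = [true_dist.inv_cdf(p) for p in probs]

weights = frequency_weights(probs)
mu_hat, sigma_hat = weighted_normal_fit(values, weights)
print(weights)
print(mu_hat, sigma_hat)
```

Note that this sketch leaves every x value where it is. Placing a whole probability block at the upper end of its interval can bias the estimates, which is the reason for moving the value toward the middle of the gap as described above.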

bjoern_arnold
Level III

Re: Distribution fitting given cumulative (quantile) data?

Your assessment of how my data looks is right. And the quantiles are not evenly spaced.

 

Regarding Life Distribution, that is what I was uncertain about. If I did a failure analysis, I might record for each item how long it took until it broke, giving a column of N items where each row contains the lifetime of one item. That is the sort of data I expected the Life Distribution platform to take as input.

 

Regarding your point about unevenly spaced quantiles: in my opinion, one should be able to derive the distribution mathematically. For my N percentile (PC) values, I have N equations of the form P(x <= xPC | dist. params) = PC. By whatever method, one could then run an optimisation to find the best-fit dist. params. So I am wondering whether there is a JMP platform that does this directly, without reconstructing the evenly spaced quantiles?

 

Do you think you could set up a sample JMP table showing this?

 

peng_liu
Staff (Accepted Solution)

Re: Distribution fitting given cumulative (quantile) data?

What you described in your second paragraph concerns me. If your data are time-to-event data, you do not have the original observations, and someone provided only a summary in "quantile" form, you should not trust that summary if you have even the slightest suspicion that the original data were censored. My previous suggestion and what follows assume no censoring.

 

With data in "quantile" form, it does not matter whether the original data had an N column recording the number of observations with the same value. Whoever gave you the "quantile" form should have handled that column properly; otherwise, the "quantile" data are of no use to you.

What you described in your third paragraph is an appropriate approach. Think of the Kolmogorov test; the idea is similar.

 

But as you suspected, you need to set something up in JMP. The answer is to use the Nonlinear platform. Assume your data have a column of observed values (the quantiles) and a column of the probabilities associated with those quantiles. To use the Nonlinear platform for this purpose, write a formula column that takes a quantile value and returns the calculated probability. Then use the new column together with the original probability column to fit a nonlinear model. I attach a data table that illustrates this.
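For readers without the attached table, the same idea can be sketched in Python. This is not the JMP Nonlinear platform and not the contents of the attachment. It assumes a normal distribution, a plain least-squares criterion, and a hand-written Gauss-Newton loop with step halving; the data are synthetic, and the names `sse` and `fit_normal_to_quantiles` are hypothetical. The role of the formula column is played by the model CDF evaluated at each quantile value.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal, used for cdf and pdf


def sse(values, probs, mu, sigma):
    """Sum of squared gaps between model CDF and the given probabilities."""
    return sum((STD_NORMAL.cdf((x - mu) / sigma) - p) ** 2
               for x, p in zip(values, probs))


def fit_normal_to_quantiles(values, probs, mu, sigma, max_iter=100):
    """Least-squares fit by Gauss-Newton; mu and sigma are starting values."""
    for _ in range(max_iter):
        a11 = a12 = a22 = g1 = g2 = 0.0
        for x, p in zip(values, probs):
            z = (x - mu) / sigma
            resid = STD_NORMAL.cdf(z) - p          # model minus target
            d_mu = -STD_NORMAL.pdf(z) / sigma      # d(cdf)/d(mu)
            d_sigma = z * d_mu                     # d(cdf)/d(sigma)
            a11 += d_mu * d_mu
            a12 += d_mu * d_sigma
            a22 += d_sigma * d_sigma
            g1 += d_mu * resid
            g2 += d_sigma * resid
        det = a11 * a22 - a12 * a12
        if det <= 0.0:
            break
        step_mu = (a22 * g1 - a12 * g2) / det
        step_sigma = (a11 * g2 - a12 * g1) / det
        current = sse(values, probs, mu, sigma)
        t, improved = 1.0, False
        while t > 1e-8:  # halve the step until the fit improves
            new_mu, new_sigma = mu - t * step_mu, sigma - t * step_sigma
            if new_sigma > 0 and sse(values, probs, new_mu, new_sigma) < current:
                mu, sigma, improved = new_mu, new_sigma, True
                break
            t /= 2
        if not improved:
            break
    return mu, sigma


# Synthetic quantile data (assumption: generated from a normal(10, 2)).
probs = [0.01, 0.05, 0.06, 0.10, 0.25, 0.50, 0.75, 0.90, 0.99]
values = [NormalDist(10.0, 2.0).inv_cdf(p) for p in probs]

# Crude starting values taken from the data.
start_mu = sum(values) / len(values)
start_sigma = (max(values) - min(values)) / 4
mu_fit, sigma_fit = fit_normal_to_quantiles(values, probs, start_mu, start_sigma)
print(mu_fit, sigma_fit)
```

Because the fit works on the probability scale, unevenly spaced quantiles need no special treatment: each (quantile, probability) pair contributes one residual.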

 

 

 

bjoern_arnold
Level III

Re: Distribution fitting given cumulative (quantile) data?

Thank you for the sample file! The distribution fit gives the result I was after. However, I was hoping for a platform that could (a) understand this form of data directly and fit a distribution, and (b) subsequently offer a Q-Q plot in which I could select the estimated distribution, to get a visual representation of the data.

I suppose I need to write a script to get a convenient solution. Would it make sense to propose this as a new feature for JMP?
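As a rough illustration of what such a script would compute, here is a short Python sketch of the Q-Q points for quantile-form data. It is not JSL and not an existing JMP feature. It assumes a normal distribution whose parameters have already been estimated (the values 10 and 2 below are placeholders), synthetic input data, and a hypothetical helper name `qq_points`. Each point pairs the theoretical quantile at a given probability with the supplied quantile at that probability.

```python
from statistics import NormalDist


def qq_points(values, probs, mu, sigma):
    """Return (theoretical quantile, observed quantile) pairs."""
    fitted = NormalDist(mu, sigma)
    return [(fitted.inv_cdf(p), x) for x, p in zip(values, probs)]


# Synthetic quantile data (assumption: generated from a normal(10, 2)).
probs = [0.01, 0.05, 0.06, 0.10, 0.25, 0.50, 0.75, 0.90, 0.99]
values = [NormalDist(10.0, 2.0).inv_cdf(p) for p in probs]

# Placeholder parameter estimates (assumption: taken from an earlier fit).
points = qq_points(values, probs, mu=10.0, sigma=2.0)

# Largest vertical distance from the reference line y = x.
max_deviation = max(abs(observed - theoretical)
                    for theoretical, observed in points)
for theoretical, observed in points:
    print(f"{theoretical:8.3f} {observed:8.3f}")
```

Plotting the pairs against the line y = x gives the Q-Q plot; points that fall close to the line indicate that the chosen distribution describes the quantiles well.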

 

Regarding data quality, I'm OK. I only drew the comparison to life distribution data to stay within the language of the platform. The original data is a discrete probability distribution computed from a physically continuous PDF. Because the probability distribution cannot be stored in full, it has to be aggregated, and what I get is those quantiles. Of course, information is lost, but for the application that's sufficient.

 

Thanks for your help!