fitting truncated data to a continuous distribution

Suppose I have a sample of data that I believe are sampled from a Cauchy distribution. Let's suppose it's symmetrical about 0. 

However, let's suppose my observations are truncated to the range (0,2). So, I am seeing only a relatively limited view of data. I generated some Cauchy (0,2) random variates and truncated the data table to show only those between 0 and 2: 

GregChesterton_0-1720198848792.png

Is there any way to fit this data to a Cauchy (0, gamma) to get a best estimate of gamma? More generally, does JMP support fitting truncated data?

7 REPLIES
Mauro_Gerber
Level IV

Re: fitting truncated data to a continuous distribution

Not sure if this is what you mean, but I think you are talking about censored data. JMP needs to know how many data points were cut off in order to make a better estimate. By adding the column property "Detection Limits", you can tell JMP that the data were cut:

 

image.png

 

When you then fit your distribution, it can estimate the values better based on this information:

 

Mauro_Gerber_0-1720437202200.png

The problem at the moment is that this does not seem to work if you pre-select a distribution. You must fit a new distribution within the platform.

"I thought about our dilemma, and I came up with a solution that I honestly think works out best for one of both of us"
- GLaDOS

Re: fitting truncated data to a continuous distribution

Thanks for the response, but my data are not censored. In your example, you have a known number of observations that are greater than or equal to some value (in your case, 1900). That's not my situation.

In my example, I do not know how many exceeded 2. I only have observations lying between 0 and 2, with no knowledge of how many observations were outside that range (because my data collection mechanism can not capture them or even know of them). So I do not have a bunch of censored observations. 
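The distinction can also be written in terms of the likelihoods. The following is a sketch in standard notation, not something taken from JMP; the symbols n (observed points), m (known count of censored points), c (censoring limit), a and b (truncation limits), and \theta (distribution parameters) are introduced here for illustration only.

```latex
% Right-censored at c: m observations are known to exceed c
L_{\text{cens}}(\theta) = \prod_{i=1}^{n} f(x_i \mid \theta)\,\bigl[1 - F(c \mid \theta)\bigr]^{m}

% Truncated to (a, b): the number of unseen observations is unknown,
% so each observed density is renormalized over the visible range
L_{\text{trunc}}(\theta) = \prod_{i=1}^{n} \frac{f(x_i \mid \theta)}{F(b \mid \theta) - F(a \mid \theta)}
```

The censored form needs m; the truncated form does not, which is why it matches the situation described here.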

If you consider my data-generating mechanism, this should be more clear. Suppose I have an explosive device sending fragments in every direction with angular uniformity. Impact location data were collected from a flat panel in the x-y plane, where x is lateral and y is vertical. I have no idea how many fragments exceeded the limits of my data collection panel. But it's not unreasonable to think that the distribution of x is Cauchy. I am trying to estimate the parameters of this Cauchy from my limited range of observations. With this estimate, I could characterize the distribution of fragments over some larger surface.
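For readers wondering why a Cauchy is plausible here, a short derivation under an idealized assumption: the source sits at perpendicular distance d from the panel, directly opposite x = 0, and the lateral launch angle \theta is uniform on (-\pi/2, \pi/2). The symbols d and \theta are hypothetical and not part of the original post.

```latex
x = d \tan\theta, \qquad \theta \sim \mathrm{Uniform}\!\left(-\tfrac{\pi}{2}, \tfrac{\pi}{2}\right)

F(x) = P\!\left(\theta \le \arctan\tfrac{x}{d}\right) = \frac{1}{2} + \frac{1}{\pi}\arctan\frac{x}{d}

f(x) = \frac{dF}{dx} = \frac{1}{\pi}\,\frac{d}{d^{2} + x^{2}}
```

This is the Cauchy density with location 0 and scale gamma = d, so under these assumptions estimating gamma amounts to estimating the stand-off distance.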

 

MRB3855
Super User

Re: fitting truncated data to a continuous distribution

Hi @GregChesterton : This may not be very satisfying, but it may be worth exploring; assuming the distribution is Cauchy as you describe, you know the functional form of the pdf = f(x| 0, gamma).  So, you could use the nonlinear platform to estimate gamma?

Mauro_Gerber
Level IV

Re: fitting truncated data to a continuous distribution

Next try: this script adds between 1 and 200 censored data points and, for each count, sums the deviation between the fitted normal CDF and the ECDF.

At the end, the minimum deviation is displayed along with the setting that produced it.

simulated data: Norm(0,10)

missing: 163 datapoints above 10

 

script estimation:

Mean: 0.015

Std:     9.166

Missing: 133

 

Now you have to adapt it to your needs (your own distribution).

 

Simply run the table script "search n_mising" in the Cut_Normal.jmp file, or run the JSL itself.

I added the original simulation as Cut_Normal_original.jmp.

 

This is the idea behind it:

I first fit a normal distribution with one additional censored data point and sum up all differences from the ECDF:

Mauro_Gerber_0-1720583344375.png

 

Then I iterate through all numbers of additional censored data points, fitting the normal distribution again each time, until I get to the last one.
This is the sum of absolute differences plotted against the number of additional points:

Mauro_Gerber_1-1720583543385.png

The lowest point now gives me the estimation of how many points are missing:

This then gives me a "best" fit for the distribution:

Mauro_Gerber_2-1720583714749.png

 

I hope this helps.

 

 
MRB3855
Super User

Re: fitting truncated data to a continuous distribution

Hi @Mauro_Gerber :  Your proposed method is very interesting; can you give more detail about it please? 

 

And when I run your script, I get the following error. The data set for the first plot is generated, but the second data set has 200 rows and is full of missing values.  

MRB3855_0-1720689273194.png

 

Mauro_Gerber
Level IV

Re: fitting truncated data to a continuous distribution

Hi @MRB3855 

The script within Cut_Normal should work, but JMP may have some problems with language or presets.

I work with the English JMP 18.0.1 version and I may have some other Preferences (like I have my plots always stacked).

I did not search for the Display Box with a fixed value, since the censor limit is in the title. Maybe you can search for the title once you know how JMP will parse it in the title:

Mauro_Gerber_1-1720694265984.png

So those two numbers can be different in your JMP:

Mauro_Gerber_0-1720694028221.png

 

When I teach my internal statistics courses, I try to emphasize that the distribution curve (PDF) should fit the data, i.e., the histogram.

 

Mauro_Gerber_2-1720694975364.png

 

You can say the same thing about the ECDF and the CDF of the distribution, and how much they deviate from each other.

Mauro_Gerber_3-1720695089647.png

@GregChesterton has only part of the curve, so I fit different numbers of missing values and compare how well each estimate fits the data that are there.

This is the best fit (least summed deviation) for the normal data set:

Mauro_Gerber_8-1720695644468.png --> Mauro_Gerber_7-1720695597182.png

 

In JMP you can also have a look at the PP Plot under the fit, which is the direct comparison.

Mauro_Gerber_5-1720695433867.png

I iterate through possible numbers of missing values and record the summed difference between the ECDF and the fitted CDF.

It's "brute force" and not very elegant, but it gives an OK result.

Another possible option is to build a simple numerical optimizer that takes the summed difference between the ECDF tail and the fitted function as its target and minimizes it.
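For readers without the attached files, the brute-force idea can be sketched in plain Python. This is not the JSL from Cut_Normal.jmp: the EM routine below is an assumed stand-in for JMP's censored normal fit, and every function name, the seed, and the sample size are hypothetical choices for illustration.

```python
import math
import random

def normal_cdf(x, mu, sigma):
    """CDF of Normal(mu, sigma) via the complementary error function."""
    return 0.5 * math.erfc(-(x - mu) / (sigma * math.sqrt(2.0)))

def fit_censored_normal(xs, m, limit, iters=200):
    """EM estimate of (mu, sigma) from xs plus m values known only to exceed limit."""
    n, sx, sxx = len(xs), sum(xs), sum(x * x for x in xs)
    mu = sx / n
    sigma = math.sqrt(max(sxx / n - mu * mu, 1e-12))
    for _ in range(iters):
        alpha = (limit - mu) / sigma
        tail = max(0.5 * math.erfc(alpha / math.sqrt(2.0)), 1e-300)  # P(Z > alpha)
        h = math.exp(-0.5 * alpha * alpha) / math.sqrt(2.0 * math.pi) / tail
        e1 = mu + sigma * h                                   # E[X | X > limit]
        e2 = mu * mu + 2.0 * mu * sigma * h + sigma * sigma * (1.0 + alpha * h)  # E[X^2 | X > limit]
        mu_new = (sx + m * e1) / (n + m)
        var_new = (sxx + m * e2) / (n + m) - mu_new * mu_new
        mu, sigma = mu_new, math.sqrt(max(var_new, 1e-12))
    return mu, sigma

def ecdf_gap(xs_sorted, m, mu, sigma):
    """Sum of |ECDF - fitted CDF| at the observed points, counting m unseen points."""
    total = len(xs_sorted) + m
    return sum(abs((i + 1) / total - normal_cdf(x, mu, sigma))
               for i, x in enumerate(xs_sorted))

def search_n_missing(xs, limit, max_missing=200):
    """Try 1..max_missing added censored points; keep the smallest ECDF gap."""
    xs_sorted = sorted(xs)
    best = None
    for m in range(1, max_missing + 1):
        mu, sigma = fit_censored_normal(xs_sorted, m, limit)
        gap = ecdf_gap(xs_sorted, m, mu, sigma)
        if best is None or gap < best[0]:
            best = (gap, m, mu, sigma)
    return best

# Simulated stand-in for the example: Norm(0, 10) with everything above 10 removed
rng = random.Random(2024)
sample = [rng.gauss(0.0, 10.0) for _ in range(1000)]
kept = [x for x in sample if x < 10.0]
gap, m_hat, mu_hat, sigma_hat = search_n_missing(kept, 10.0)
print("truly missing:", len(sample) - len(kept))
print("estimated missing:", m_hat, "mean:", round(mu_hat, 3), "std:", round(sigma_hat, 3))
```

The criterion is a heuristic rather than a likelihood, so the estimated count should be read as a rough guide.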

 

 

 

 
peng_liu
Staff

Re: fitting truncated data to a continuous distribution

If you are going to use the Nonlinear platform to fit the data by maximum likelihood, you need the following pieces.

First, the density function of your truncated Cauchy(0, gamma).

Refer to the PDF formula on this page https://en.wikipedia.org/wiki/Truncated_distribution

peng_liu_0-1720702459059.png

In your case, b = 2, and a = 0.

The numerator is the Cauchy(0, gamma) density function, in your case. Find it in the JMP Scripting Index:

peng_liu_1-1720702553078.png

In the denominator, F is the Cauchy(0, gamma) cumulative distribution function, in your case. Find it in the index:

peng_liu_2-1720702629479.png
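Written out for a = 0 and b = 2, the pieces combine as follows. This is a worked sketch using the standard Cauchy(0, \gamma) density and CDF; the subscript T and the sample size n are notation introduced here.

```latex
f(x \mid 0, \gamma) = \frac{\gamma}{\pi\,(\gamma^{2} + x^{2})}, \qquad
F(x \mid 0, \gamma) = \frac{1}{2} + \frac{1}{\pi}\arctan\frac{x}{\gamma}

f_T(x \mid \gamma) = \frac{f(x \mid 0, \gamma)}{F(2 \mid 0, \gamma) - F(0 \mid 0, \gamma)}
                   = \frac{\gamma}{(\gamma^{2} + x^{2})\,\arctan(2/\gamma)}, \qquad 0 < x < 2

-\log L(\gamma) = \sum_{i=1}^{n} \bigl[\log(\gamma^{2} + x_i^{2}) - \log\gamma\bigr] + n \log\arctan(2/\gamma)
```

The factors of \pi cancel between numerator and denominator, which is why they do not appear in the final expressions.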

Now you need to piece this information together to create a formula column for the negative log-likelihood loss function.

The formula is something like the following. (Assume your data column is called "Y".)

peng_liu_0-1720703900782.png

The formula is a call to the JSL function "Parameter". The first argument is a list of the parameters to fit. In this case there is only one: gamma. The value 1 is its initial value. It should be as close to the final estimate as possible, so use your best judgement, or find one by trial and error. Don't worry if it is not close initially; the Nonlinear platform allows you to try different values after launching.

The second argument of the "Parameter" call is the negative log-likelihood function. Take a look at it and see how it comes from the density function of your truncated Cauchy.

Name the new column "loss", configure the Nonlinear launch dialog, and click "OK".

peng_liu_1-1720704334845.png

After launching, you can click "Go", or change the value of "gamma" and then click "Go". If the fit fails, change "gamma" and click "Go" again.

peng_liu_2-1720704822530.png
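As a cross-check outside JMP, the same truncated-Cauchy negative log-likelihood can be minimized in a few lines of Python. This is a minimal sketch, not the formula from the screenshots: the function names, the seed, the sample size, and the grid-plus-golden-section search are assumptions chosen for illustration. Only the limits a = 0, b = 2 and the Cauchy(0, gamma) model come from the thread.

```python
import math
import random

def truncated_cauchy_nll(gamma, xs, a=0.0, b=2.0):
    """Negative log-likelihood of Cauchy(0, gamma) truncated to (a, b)."""
    # log(F(b) - F(a)) without the 1/pi factor, which cancels against the density's 1/pi
    log_mass = math.log(math.atan(b / gamma) - math.atan(a / gamma))
    return len(xs) * (log_mass - math.log(gamma)) + sum(
        math.log(gamma * gamma + x * x) for x in xs
    )

def fit_gamma(xs, a=0.0, b=2.0, lo=1e-3, hi=1e3, grid=60, tol=1e-6):
    """Minimize the NLL over gamma: coarse log-spaced grid, then golden-section search."""
    f = lambda t: truncated_cauchy_nll(math.exp(t), xs, a, b)  # t = log(gamma)
    step = (math.log(hi) - math.log(lo)) / (grid - 1)
    ts = [math.log(lo) + i * step for i in range(grid)]
    vals = [f(t) for t in ts]
    k = min(range(grid), key=vals.__getitem__)
    left, right = ts[max(k - 1, 0)], ts[min(k + 1, grid - 1)]
    invphi = (math.sqrt(5.0) - 1.0) / 2.0
    c = right - invphi * (right - left)
    d = left + invphi * (right - left)
    fc, fd = f(c), f(d)
    while right - left > tol:
        if fc < fd:  # minimum lies in [left, d]
            right, d, fd = d, c, fc
            c = right - invphi * (right - left)
            fc = f(c)
        else:        # minimum lies in [c, right]
            left, c, fc = c, d, fd
            d = left + invphi * (right - left)
            fd = f(d)
    return math.exp(0.5 * (left + right))

# Simulated stand-in for the data table: Cauchy(0, 2) draws, keeping only 0 < x < 2
rng = random.Random(12345)
true_gamma = 2.0
raw = [true_gamma * math.tan(math.pi * (rng.random() - 0.5)) for _ in range(40000)]
ys = [x for x in raw if 0.0 < x < 2.0]
gamma_hat = fit_gamma(ys)
print("kept:", len(ys), "gamma estimate:", round(gamma_hat, 3))
```

A Cauchy with large gamma looks almost flat on (0, 2), so the likelihood is fairly shallow there. With small samples, expect the estimate to vary noticeably from run to run.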

 

Here are links to some relevant documentation:

https://www.jmp.com/support/help/en/18.0/index.shtml#page/jmp/create-formulas-in-jmp.shtml#ww96570

https://www.jmp.com/support/help/en/18.0/index.shtml#page/jmp/nonlinear-regression.shtml#

https://www.jmp.com/support/help/en/18.0/index.shtml#page/jmp/create-formula-columns-for-multiple-co...

https://www.jmp.com/support/help/en/18.0/index.shtml#page/jmp/create-a-formula-column.shtml