Fitting truncated data to a continuous distribution

GregChesterton · Jul 5, 2024 01:10 PM

Suppose I have a sample of data that I believe are sampled from a Cauchy distribution. Let's suppose it's symmetrical about 0.

However, let's suppose my observations are truncated to the range (0,2). So, I am seeing only a relatively limited view of data. I generated some Cauchy (0,2) random variates and truncated the data table to show only those between 0 and 2:

Is there any way to fit this data to a Cauchy (0, gamma) to get a best estimate of gamma? More generally, does JMP support fitting truncated data?

Mauro_Gerber · Jul 8, 2024 07:16 AM

Not sure if this is what you mean, but I think you are talking about censored data. JMP must know how many data points were cut off in order to make a better estimate. By adding the column property "Detection Limits", you can tell JMP that the data were cut:

When you then fit your distribution, it can estimate the values better based on this information:

The problem at the moment is that this does not seem to work if you pre-select a distribution. You must fit a new distribution within the platform.

"I thought about our dilemma, and I came up with a solution that I honestly think works out best for one of both of us"
- GLaDOS

GregChesterton · Jul 8, 2024 09:12 AM

Thanks for the response, but my data are not censored. In your example, you have a known number of observations that are greater than or equal to some value (in your case, 1900). That's not my situation.

In my example, I do not know how many exceeded 2. I only have observations lying between 0 and 2, with no knowledge of how many observations were outside that range (because my data collection mechanism can not capture them or even know of them). So I do not have a bunch of censored observations.
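For comparison, the two likelihoods can be written side by side. This is a generic sketch rather than anything taken from JMP: f and F stand for the density and CDF of the assumed distribution with parameter θ, (a, b) is the observable window, c is a right-censoring limit, and m is the known count of censored observations. All of these symbols are introduced here for illustration.

```latex
% Truncated: nothing outside (a, b) is ever recorded,
% so each observed density is renormalized to the window.
L_{\text{trunc}}(\theta) = \prod_{i=1}^{n} \frac{f(x_i;\theta)}{F(b;\theta) - F(a;\theta)}

% Right-censored at c: the m censored observations are counted,
% only their exact values are unknown.
L_{\text{cens}}(\theta) = \prod_{i=1}^{n} f(x_i;\theta) \cdot \bigl[1 - F(c;\theta)\bigr]^{m}
```

The truncated form never needs m, which is the situation here.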

If you consider my data-generating mechanism, this should be clearer. Suppose I have an explosive device sending fragments in every direction with angular uniformity. Impact location data were collected from a flat panel in the x-y plane, where x is lateral and y is vertical. I have no idea how many fragments exceeded the limits of my data collection panel. But it's not unreasonable to think that the distribution of x is Cauchy. I am trying to estimate the parameters of this Cauchy from my limited range of observations. With this estimate, I could characterize the distribution of fragments over some larger surface.
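A minimal simulation of this mechanism, under simplifying assumptions that are not in the post: a two-dimensional slice, a point source at a perpendicular distance `standoff` from the panel plane, and emission angles uniform on (-π/2, π/2). Under those assumptions x = standoff · tan(θ) is Cauchy(0, standoff), and only hits inside the panel limits are recorded. The function and parameter names are hypothetical.

```python
import math
import random

def simulate_panel_hits(n_fragments, standoff, x_min=0.0, x_max=2.0, seed=0):
    """Return lateral impact positions that fall inside the panel limits.

    Assumes a 2-D slice with emission angles uniform on (-pi/2, pi/2), so
    x = standoff * tan(theta) follows Cauchy(0, standoff). Fragments landing
    outside (x_min, x_max) are dropped without being counted (truncation).
    """
    rng = random.Random(seed)
    hits = []
    for _ in range(n_fragments):
        theta = rng.uniform(-math.pi / 2, math.pi / 2)
        x = standoff * math.tan(theta)
        if x_min < x < x_max:
            hits.append(x)
    return hits

hits = simulate_panel_hits(10_000, standoff=2.0)
print(len(hits), "of 10000 fragments landed on the panel")
```

In a real test the total fragment count (10000 here) would be unknown, which is what makes the sample truncated rather than censored.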

MRB3855 · Jul 9, 2024 06:56 AM

Hi @GregChesterton : This may not be very satisfying, but it may be worth exploring; assuming the distribution is Cauchy as you describe, you know the functional form of the pdf = f(x| 0, gamma). So, you could use the nonlinear platform to estimate gamma?

Mauro_Gerber · Jul 9, 2024 9:04 PM

Next try: this script adds between 1 and 200 censored data points and, for each count, sums up the deviation between the fitted normal distribution and the ECDF.

At the end, it displays the minimum deviation together with the setting that produced it.

simulated data: Norm(0,10)

missing: 163 datapoints above 10

script estimation:

Mean: 0.015

Std: 9.166

Missing: 133

Now you have to adapt it to your needs (your own distribution).

Simply run the table script "search n_mising" in the Cut_Normal.jmp file, or run the JSL itself.

I added the original simulation as Cut_Normal_original.jmp.

This is the idea behind it:

I first fit a normal distribution with one additional censored data point and sum up all differences from the ECDF:

Then I iterate through all the additional censored data points, fitting the normal distribution again each time, until I get to the last one:
This is the sum of absolute differences plotted against the number of additional points:

The lowest point now gives me the estimation of how many points are missing:

This then gives me a "best" fit for the distribution:

I hope this helps.

MRB3855 · Jul 11, 2024 05:19 AM

Hi @Mauro_Gerber : Your proposed method is very interesting; can you give more detail about it please?

And when I run your script, I get the following error. The data set for the first plot is generated, but the second data set has 200 rows and is full of missing values.

Mauro_Gerber · Jul 11, 2024 07:12 AM

Hi @MRB3855

The script within Cut_Normal should work, but JMP may have some problems with the language or presets.

I work with the English version of JMP 18.0.1, and I may have some different Preferences (for example, my plots are always stacked).

I did not search for the Display Box by a fixed value, since the censor limit is in the title. Maybe you can search for the title once you know how JMP will parse it:

So those two numbers can be different in your JMP:

When I teach my internal statistics courses, I try to emphasize how important it is that the distribution curve (PDF) fits the data, i.e. the histogram.

The same can be said of the ECDF and the CDF of the distribution, and how much they deviate from each other.

@GregChesterton has only part of the curve, so I fit different numbers of missing values and compare how well each estimate fits the data that are there.

This is the best fit (least summed deviation) for the normal data set:

In JMP you can also look at the PP Plot under the fit, which is the direct comparison.

I iterate through the possible numbers of missing values and record the summed difference between the ECDF and the fitted CDF of the distribution.

It's "brute force" and not very elegant, but it gives an OK result.

Another possible option is to build a simple numerical optimizer that takes the summed difference between the ECDF tail and the fitted function as its target and minimizes it.
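One possible reading of that last option, sketched in plain Python and adapted to the truncated Cauchy from the original question. This is not the attached JSL script, which varies the number of added censored points for a normal fit. Here the only free parameter is gamma, the model CDF is the Cauchy(0, gamma) CDF renormalized to the window (a, b), and the target is the summed absolute difference from the ECDF. The function names, the (i + 0.5)/n plotting position, and the search grid are assumptions made for illustration.

```python
import math
import random

def truncated_cauchy_cdf(x, gamma, a=0.0, b=2.0):
    """CDF of Cauchy(0, gamma) renormalized to the window (a, b)."""
    ta, tb = math.atan(a / gamma), math.atan(b / gamma)
    return (math.atan(x / gamma) - ta) / (tb - ta)

def ecdf_distance(gamma, xs_sorted, a=0.0, b=2.0):
    """Sum of absolute differences between the ECDF and the model CDF."""
    n = len(xs_sorted)
    return sum(abs((i + 0.5) / n - truncated_cauchy_cdf(x, gamma, a, b))
               for i, x in enumerate(xs_sorted))

def fit_gamma_by_ecdf(xs, a=0.0, b=2.0, lo=0.05, hi=50.0, grid=400):
    """Brute-force search over a log-spaced grid of candidate gamma values."""
    xs_sorted = sorted(xs)
    step = (math.log(hi) - math.log(lo)) / (grid - 1)
    candidates = [math.exp(math.log(lo) + i * step) for i in range(grid)]
    return min(candidates, key=lambda g: ecdf_distance(g, xs_sorted, a, b))

def draw_truncated_cauchy(n, gamma, a=0.0, b=2.0, seed=0):
    """Inverse-CDF sampling from Cauchy(0, gamma) restricted to (a, b)."""
    rng = random.Random(seed)
    ta, tb = math.atan(a / gamma), math.atan(b / gamma)
    return [gamma * math.tan(ta + rng.random() * (tb - ta)) for _ in range(n)]

sample = draw_truncated_cauchy(2000, gamma=2.0)
print("gamma minimizing the ECDF distance:", fit_gamma_by_ecdf(sample))
```

Because the window (0, 2) covers only a small part of a Cauchy with gamma near 2, the distance curve is fairly flat, so the estimate can move noticeably from sample to sample.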

peng_liu · Jul 11, 2024 09:34 AM

If you are going to use the Nonlinear platform to fit the data using the maximum likelihood method, you need the following pieces.

First, the density function of your truncated Cauchy(0, gamma).

Refer to the PDF formula on this page https://en.wikipedia.org/wiki/Truncated_distribution

In your case, b = 2, and a = 0.

In the numerator is the Cauchy(0, gamma) density function, in your case. Find it in the JMP Scripting Index:

In the denominator, F is the Cauchy(0, gamma) distribution function, in your case. Find it in the index:

Now you need to piece this information together to create a negative log-likelihood loss function formula column.
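Putting the numerator and denominator together for a = 0 and b = 2 gives the following, assuming the standard Cauchy(0, gamma) density and CDF. Treat it as a sketch to check against the Wikipedia formula linked above.

```latex
f(x;\gamma) = \frac{1}{\pi\gamma\,\bigl[1 + (x/\gamma)^2\bigr]}, \qquad
F(x;\gamma) = \frac{1}{2} + \frac{1}{\pi}\arctan\!\left(\frac{x}{\gamma}\right)

g(x;\gamma) = \frac{f(x;\gamma)}{F(2;\gamma) - F(0;\gamma)}
            = \frac{\gamma}{(\gamma^2 + x^2)\,\arctan(2/\gamma)}, \qquad 0 < x \le 2
```

The factors of π cancel, so the truncated density depends on gamma only through the ratio above.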

The formula is something like the following. (Assume your data column is called "Y".)

The formula uses the JSL function "Parameter". Its first argument is a list of the parameters to fit. In this case there is only one: gamma. The value 1 is its initial value, which should be as close to the final estimate as possible, so use your best judgment or find one by trial and error. Don't worry if it is not close initially; the Nonlinear platform lets you try different values after launching.

The second argument of the "Parameter" call is the negative log-likelihood function. Take a look at it and see how it follows from the density function of your truncated Cauchy.

Name the new column "loss", then configure the Nonlinear launch dialog and click "OK".

After launching, you can click "Go", or change the value of "gamma" and then click "Go". If the fit fails, change "gamma" and click "Go" again.
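Outside JMP, the same negative log-likelihood fit can be sketched in a few lines of plain Python. This is an illustration, not the JSL formula from this post: the function names, the search bounds, and the coarse-grid-plus-golden-section search are assumptions made here, and the Nonlinear platform uses its own optimizer.

```python
import math
import random

def truncated_cauchy_nll(gamma, xs, a=0.0, b=2.0):
    """Negative log-likelihood of Cauchy(0, gamma) truncated to (a, b)."""
    # F(b) - F(a) without the 1/pi factor, which cancels against the density.
    mass = math.atan(b / gamma) - math.atan(a / gamma)
    log_density_sum = sum(math.log(gamma) - math.log(gamma * gamma + x * x) for x in xs)
    return -(log_density_sum - len(xs) * math.log(mass))

def truncated_cauchy_quantile(u, gamma, a=0.0, b=2.0):
    """Inverse CDF of the truncated distribution, used to generate test data."""
    ta, tb = math.atan(a / gamma), math.atan(b / gamma)
    return gamma * math.tan(ta + u * (tb - ta))

def fit_gamma(xs, a=0.0, b=2.0, lo=0.01, hi=100.0, grid=200, iters=60):
    """Minimize the NLL over log(gamma): coarse grid, then golden-section refinement."""
    log_lo, log_hi = math.log(lo), math.log(hi)
    step = (log_hi - log_lo) / (grid - 1)
    logs = [log_lo + i * step for i in range(grid)]

    def objective(t):
        return truncated_cauchy_nll(math.exp(t), xs, a, b)

    values = [objective(t) for t in logs]
    k = min(range(grid), key=values.__getitem__)
    # Bracket the best grid point with its neighbors, then shrink the bracket.
    left, right = logs[max(k - 1, 0)], logs[min(k + 1, grid - 1)]
    ratio = (math.sqrt(5.0) - 1.0) / 2.0
    for _ in range(iters):
        c = right - ratio * (right - left)
        d = left + ratio * (right - left)
        if objective(c) < objective(d):
            right = d
        else:
            left = c
    return math.exp(0.5 * (left + right))

rng = random.Random(0)
sample = [truncated_cauchy_quantile(rng.random(), 2.0) for _ in range(5000)]
print("estimated gamma:", fit_gamma(sample))
```

Searching over log(gamma) keeps gamma positive without a constraint. With the window (0, 2) and gamma near 2 the likelihood is fairly flat, so a large sample is needed for a tight estimate.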

Here are links to some relevant documentation:

https://www.jmp.com/support/help/en/18.0/index.shtml#page/jmp/create-formulas-in-jmp.shtml#ww96570

https://www.jmp.com/support/help/en/18.0/index.shtml#page/jmp/nonlinear-regression.shtml#

https://www.jmp.com/support/help/en/18.0/index.shtml#page/jmp/create-formula-columns-for-multiple-co...

https://www.jmp.com/support/help/en/18.0/index.shtml#page/jmp/create-a-formula-column.shtml