Aug 7, 2020 2:31 AM
| Last Modified: Aug 7, 2020 8:47 AM
I'm trying to find a correlation between some X predictors and a continuous response variable that is strongly right-skewed. The dataset contains only 140 instances.
First of all, I transformed the response variable using the function ln(x + 1).
Then I normalized all the predictors to the range 0 to 1, reduced the dimensionality with the screening option, and removed the correlated X predictors. At the end of this process I had only 4 predictors left. I ran a random forest without any weights, and the R² on the training data was very low (see first plot below). Then I applied a roughly parabolic weight function (see second plot) and re-ran the random forest. The result was better, as shown in the last plot. It looks like the data rotated around the center of the plot. Can someone explain why?
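If I've understood the workflow correctly, it could be sketched in Python with scikit-learn roughly as below. This is only an illustration on synthetic stand-in data (the real dataset isn't shared here), and the parabolic weight formula is my guess at the kind of weighting described, not the exact function used:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.preprocessing import MinMaxScaler

rng = np.random.default_rng(0)

# Synthetic stand-in: 140 rows, 4 surviving predictors,
# right-skewed (lognormal) response.
X = rng.random((140, 4))
y = rng.lognormal(mean=0.0, sigma=1.0, size=140)

# Transform the response with ln(x + 1); scale predictors to [0, 1].
y_log = np.log1p(y)
X_scaled = MinMaxScaler().fit_transform(X)

# Unweighted random forest fit.
rf = RandomForestRegressor(n_estimators=500, random_state=0)
rf.fit(X_scaled, y_log)

# A roughly parabolic weight function (hypothetical form): weights grow
# quadratically with distance from the center of the response, so the
# extreme observations count more -- the step being questioned below.
center = y_log.mean()
weights = 1.0 + (y_log - center) ** 2

rf_weighted = RandomForestRegressor(n_estimators=500, random_state=0)
rf_weighted.fit(X_scaled, y_log, sample_weight=weights)
```

Because the up-weighted rows sit at both tails of the response, the weighted forest pulls its predictions toward the extremes, which on a predicted-vs-actual plot can look like the cloud "rotating" toward the 45-degree line.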
I think it would be helpful for you to provide more details on the features in your data, and possibly a sample of the data itself. In general, I think it is strange to weight an analysis unless the weights are clearly appropriate, such as with surveys where the sampling is specifically designed for weights to represent the population being sampled. The other place weights would be appropriate is a regression analysis where heteroscedasticity is present. But using weights in a random forest strikes me as a strange thing to do. For that matter, I'm not sure why you need to reduce the number of features when using a random forest - validation should prevent overfitting, and the technique should focus on the most relevant features. Without more context on why you are using weights, I am skeptical of applying a weighting function just because it produces a better-fitting model.
I don't think this is legitimate. It sounds like you feel that the extreme values may have been under-represented in your data. If that is the case, then I think the correct thing to do is to sample differently - oversample likely extreme values. Then weighting might be justified. But to weight extreme values after you have collected your data sounds like inventing data that might tell you what you are hoping for. I haven't heard a good reason why those values should be weighted more heavily.
There are other options as well. Quantile regression is a technique aimed at estimating particular quantiles, such as extreme ones, rather than the average response. But that is different from weighting those extreme values more heavily.
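To make the distinction concrete, here is a minimal sketch of quantile regression using scikit-learn's gradient boosting with a pinball (quantile) loss, again on synthetic stand-in data. The model targets a specific quantile of the response directly instead of reweighting the extreme observations:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(1)
X = rng.random((140, 4))
y = rng.lognormal(mean=0.0, sigma=1.0, size=140)

# Estimate the conditional 90th percentile of the response:
# loss="quantile" with alpha=0.9 optimizes the pinball loss for q = 0.9.
q90 = GradientBoostingRegressor(loss="quantile", alpha=0.9,
                                n_estimators=200, random_state=1)
q90.fit(X, y)

# For comparison, the conditional median (q = 0.5).
q50 = GradientBoostingRegressor(loss="quantile", alpha=0.5,
                                n_estimators=200, random_state=1)
q50.fit(X, y)
```

Each model answers a different question ("what is the 90th percentile of y given x?" versus "what is the typical y given x?"), whereas weighting the extremes still fits a single averaged response, just a distorted one.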