There is a document describing JMP's implementation of the robust option for the neural network platform JMP ANN that notes it maximizes the liklihood of a Laplacian distribution, which [of course] is the same as minimizing the absolute deviations. However, there is no analytic gradient in this case (the function is not differentiable), so how does JMP actually run the optimization in this case? Do they use a smooth approximation to the absolute value function and then apply a quasi-newton method? Or do they calculate a numeric gradient and still apply a quasi-newton method, but just be especially careful around zero?
... View more