These criteria are based on information theory. The quantity -2L (twice the negative log-likelihood) is a measure of model bias: the smaller this value, the less bias in the model. We generally want the model with minimal bias, but without regard to model variance you could simply increase the complexity of the model (e.g., add more terms to a linear model) until you have a perfect fit. That result is fine if you only want to describe the current data set. Such a model does not 'generalize' to represent new data, though. So -2L provides information about model fit but is generally regarded as a poor criterion for model selection on its own.
AIC and BIC are both based on -2L plus a penalty that guards against variance. The minimum AIC or BIC is intended to trade off bias and variance. The difference between the two criteria is in the definition of the penalty: AIC's penalty (2k) depends only on the number of estimated parameters k, while BIC's penalty (k ln n) grows with both the parameter count and the sample size n. AIC is generally favored over BIC, but some types of models seem to be selected better with BIC, depending on how complexity is interpreted.
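As a sketch of how the two criteria compare in practice, the snippet below fits nested polynomial models by least squares, computes the Gaussian log-likelihood of each fit, and evaluates AIC = 2k - 2L and BIC = k ln n - 2L. The data set and the `gaussian_ic` helper are hypothetical, invented for illustration; they are not from the text above.

```python
import numpy as np

def gaussian_ic(y, y_hat, n_params):
    """AIC and BIC for a least-squares fit, assuming i.i.d. Gaussian errors.

    The error variance is profiled out at its MLE (RSS / n), so the
    effective parameter count is k = n_params + 1 (coefficients + variance).
    """
    n = len(y)
    rss = np.sum((y - y_hat) ** 2)
    sigma2 = rss / n                       # MLE of the error variance
    log_lik = -0.5 * n * (np.log(2 * np.pi * sigma2) + 1)
    k = n_params + 1
    aic = 2 * k - 2 * log_lik              # penalty depends only on k
    bic = k * np.log(n) - 2 * log_lik      # penalty also grows with n
    return aic, bic

# Hypothetical data: a quadratic trend plus noise.
rng = np.random.default_rng(0)
x = np.linspace(-2, 2, 60)
y = 1.0 + 0.5 * x - 1.5 * x**2 + rng.normal(scale=0.4, size=x.size)

# Fit nested polynomial models of increasing complexity.
results = {}
for degree in range(6):
    coeffs = np.polyfit(x, y, degree)
    y_hat = np.polyval(coeffs, x)
    results[degree] = gaussian_ic(y, y_hat, n_params=degree + 1)

best_aic = min(results, key=lambda d: results[d][0])
best_bic = min(results, key=lambda d: results[d][1])
print(f"best degree by AIC: {best_aic}, by BIC: {best_bic}")
```

Because BIC charges more per parameter than AIC whenever ln n > 2, its chosen model is never more complex than AIC's when the candidates are nested, which is one reason BIC is sometimes preferred for selecting parsimonious models.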
AIC differences between the best candidate and the others can be helpful when assessing the candidate models. An AIC difference of less than 4 indicates that the second-best model still has substantial support from the data, a difference of 4 to 10 indicates considerably less support, and a difference greater than 10 indicates essentially no support.
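The rule of thumb above can be sketched as a small ranking routine. The AIC values here are made-up numbers for three hypothetical candidates, and the thresholds follow the guideline stated in the text.

```python
# Hypothetical AIC values for three candidate models.
aics = {"A": 100.0, "B": 102.5, "C": 114.0}

best = min(aics.values())
for name, aic in sorted(aics.items(), key=lambda kv: kv[1]):
    delta = aic - best                 # AIC difference from the best model
    if delta < 4:
        support = "substantial support"
    elif delta <= 10:
        support = "considerably less support"
    else:
        support = "essentially no support"
    print(f"model {name}: dAIC = {delta:.1f} ({support})")
```

Note that only the differences matter; the absolute AIC values carry no meaning across data sets.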
NOTE: these criteria depend on the training data, so they are only meaningful for comparing models fit to the same data set.
It is best when the choice of model also incorporates available knowledge about the observed system. How do the data arise? Can that information guide the choice of model? Model selection should not be just about fitting the data.