I am still confused: your example includes more than one response, but only one type of model (a binary GLM with a probit link function).
One approach for comparing models is to form a hierarchy of terms, test sub-models within that hierarchy, and evaluate either the overall change in model fit or the significance of individual terms. For example, assessing the linear degradation of a drug substance might involve postulating the model potency = intercept + batch + time + batch*time + error. This model is hierarchical: if the batch*time term is deemed not significant, it can be removed to give a simpler model in which batch-dependent slopes are unnecessary.
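As a sketch of that hierarchical comparison, here is how the full and reduced stability models could be fit and compared with a partial F-test in Python using statsmodels. The data frame, column names, and degradation rates are all hypothetical, invented for illustration only:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.stats.anova import anova_lm

# Hypothetical stability data: 3 batches assayed at several time points.
# A common slope of -0.1 %/month is simulated, i.e. no true batch*time effect.
rng = np.random.default_rng(0)
rows = []
for batch in ["A", "B", "C"]:
    for t in [0, 3, 6, 9, 12, 18, 24]:
        potency = 100.0 - 0.1 * t + rng.normal(0, 0.3)
        rows.append({"batch": batch, "time": t, "potency": potency})
df = pd.DataFrame(rows)

# Full hierarchical model: intercept + batch + time + batch*time
full = smf.ols("potency ~ C(batch) * time", data=df).fit()

# Reduced model drops the interaction, forcing a common slope across batches.
reduced = smf.ols("potency ~ C(batch) + time", data=df).fit()

# Partial F-test for the batch:time term: a large p-value suggests the
# batch-dependent slopes can be dropped from the model.
table = anova_lm(reduced, full)
print(table)
```

If the test does not reject, the reduced common-slope model is retained, which is exactly the simplification described above.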
Another approach is to use a model selection criterion such as AICc for comparison (smaller is better). It is not a statistic for inference, though. The only requirement is that all candidate models use the same response Y; the model forms themselves can vary (linear regression, partition, neural network, et cetera).
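To illustrate, here is a minimal sketch of an AICc comparison between two candidate mean structures for the same Y, again with invented data. The `aicc` helper and its parameter count `k` are my own assumptions (statsmodels reports plain AIC, so the small-sample correction is applied manually):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

def aicc(fit):
    """Small-sample corrected AIC: AICc = AIC + 2k(k+1)/(n-k-1).
    Assumes k = number of fitted mean-model coefficients incl. intercept."""
    k = fit.df_model + 1
    n = fit.nobs
    return fit.aic + 2 * k * (k + 1) / (n - k - 1)

# Hypothetical data: the true relationship is linear in x.
rng = np.random.default_rng(1)
x = np.linspace(0, 10, 30)
y = 2.0 + 0.5 * x + rng.normal(0, 1.0, size=x.size)
df = pd.DataFrame({"x": x, "y": y})

# Two candidate models for the same response y; smaller AICc is better.
linear = smf.ols("y ~ x", data=df).fit()
quadratic = smf.ols("y ~ x + I(x**2)", data=df).fit()

print(f"linear AICc:    {aicc(linear):.2f}")
print(f"quadratic AICc: {aicc(quadratic):.2f}")
```

Because the correction term penalizes extra parameters more heavily at small n, AICc tends to favor the simpler model here; but the comparison is purely relative and carries no significance level.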