Thank you for the answer. I had the same concerns about the dispersion, but I had not thought of your point 1. Great input.
I had also considered changing it to a pass/fail test, with the downside that it would increase the required sample size. But given the challenges you mentioned, that looks like the most suitable option.
Unfortunately, there are no other metrics I can use, so basically, I only have pass/fail data.
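To get a rough feel for the sample-size cost of a pass/fail test, here is a minimal sketch based on the zero-failure (success-run) relation n = ln(1 - C) / ln(R). It assumes every unit must pass, that units are independent, and that the target reliability R and confidence C are chosen by you. The function name and the example values are hypothetical and not taken from this discussion.

```python
import math


def zero_failure_sample_size(reliability: float, confidence: float) -> int:
    """Smallest n such that n passes out of n demonstrates `reliability`
    at the given `confidence` (success-run relation, zero failures allowed).

    Assumption: independent units with the same pass probability.
    """
    if not (0.0 < reliability < 1.0 and 0.0 < confidence < 1.0):
        raise ValueError("reliability and confidence must be strictly between 0 and 1")
    # Require reliability**n <= 1 - confidence and solve for n.
    return math.ceil(math.log(1.0 - confidence) / math.log(reliability))


if __name__ == "__main__":
    # Hypothetical targets, for illustration only.
    for r, c in [(0.90, 0.90), (0.95, 0.95)]:
        print(f"R={r:.2f}, C={c:.2f}: n={zero_failure_sample_size(r, c)}")
```

With these illustrative targets the sketch gives 22 units for 90 % reliability at 90 % confidence and 59 units for 95 % at 95 %, which shows how quickly the requirement grows when only pass/fail data are available.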