Ah ok. If you do run it again, that would be a worthwhile change. I know I personally have biases about models, and I have seen others commenting the same; it seems likely it would skew the results at least a little.
Nonetheless you've convinced me to try an even wider variety of models, thanks!
In fact, this makes me think I should add this as a feature to my AI dev tooling: compare responses side by side and pick the best one.
But that is a good point. Perhaps each model should be mapped to an unidentifiable label instead.