Please help me understand: you train a transformer that takes image patches as input, like in ViT, and predicts the pose (i.e., the orientation) of the object. The image itself may admit multiple solutions, since some poses are ambiguous. By sampling 10,000 times you obtain the probability distribution over all possible solutions.
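To check my understanding of the sampling part, here is a tiny sketch of how I picture it — everything here is my own guess, and `sample_pose` just stands in for one stochastic forward pass of your model (with a toy ambiguous object that has two equally likely orientations):

```python
from collections import Counter
import random

def sample_pose(rng: random.Random) -> int:
    """Stand-in for one stochastic forward pass of the model.

    Toy example: an ambiguous object with two equally likely
    orientations, represented as bins 100 and 200 of some
    discretized rotation. Purely illustrative."""
    return rng.choice([100, 200])

def estimate_distribution(num_samples: int = 10_000, seed: int = 0) -> dict:
    """Approximate the pose distribution by sampling many times
    and normalizing the counts into relative frequencies."""
    rng = random.Random(seed)
    counts = Counter(sample_pose(rng) for _ in range(num_samples))
    return {pose: n / num_samples for pose, n in counts.items()}

dist = estimate_distribution()
# with two equally likely modes, each estimated probability is near 0.5
```

Is that roughly the idea — the multimodality of the ambiguous pose shows up as multiple modes in the sampled histogram?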

Hope this is about right. Very cool!

1) I don't quite get what the output of the transformer is. Instead of directly predicting the orientation, you feed in parts of the rotation (the x and y components of the quaternion q) and classify the bins? There are over 50,000 bins, so does that mean this is a multi-class classification with 50k classes?
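To make the question concrete, is the binning something like this? (Entirely my guess — the bin count, the [-1, 1] range, and the function names are mine, not from the paper.)

```python
NUM_BINS = 50_000  # hypothetical; I'm assuming the >50k figure is the class count

def value_to_bin(x: float, num_bins: int = NUM_BINS) -> int:
    """Discretize one quaternion component in [-1, 1] into a bin index,
    which would then be the class label for cross-entropy training."""
    idx = round((x + 1.0) / 2.0 * (num_bins - 1))
    return max(0, min(num_bins - 1, idx))  # clamp to valid range

def bin_to_value(idx: int, num_bins: int = NUM_BINS) -> float:
    """Map a predicted bin index back to a value in [-1, 1]."""
    return idx / (num_bins - 1) * 2.0 - 1.0
```

So at inference you would take a softmax over all ~50k bins per component and decode the sampled index back to a real value?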

2) How do you get the ground-truth labels? As I understand it, you have positive samples (image–orientation pairs) and negative samples with a random orientation assigned? Or is there something in the math I'm overlooking?

3) Why do we need a start token?