As illustrated in Section 2.4, the AlignEvaluator is trained on (reprompt, image) pairs. However, in the GRPO training loop in Section 2.3, the AlignEvaluator computes a scalar reward $r_i$ for each pair $(p_i, I_i)$, where $p_i$ denotes the user prompt, not the reprompt.
Could you please explain how this discrepancy was handled?