Text-only pre-training transfers poorly to visual grounding, and the AG-ALICE integration requires careful tuning of the attention temperature.
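
To make the temperature sensitivity concrete, here is a minimal sketch of temperature-scaled dot-product attention. It is not the AG-ALICE implementation, which the text does not describe. The function names (`attention_weights`, `entropy`), the toy vectors, and the choice to apply the temperature as an extra divisor on the logits are all assumptions made for illustration.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def attention_weights(query, keys, temperature=1.0):
    """Attention weights of one query over several keys.

    Assumption: the temperature divides the usual sqrt(d)-scaled logits.
    Values below 1 sharpen the distribution; values above 1 flatten it.
    """
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scale = math.sqrt(len(query)) * temperature
    logits = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    return softmax(logits)


def entropy(probs):
    """Shannon entropy in nats; higher means more diffuse attention."""
    return -sum(p * math.log(p) for p in probs if p > 0)


# Hypothetical toy data: one text-token query attending over three image-patch keys.
query = [1.0, 0.5, -0.5, 2.0]
keys = [
    [1.0, 0.0, 0.0, 1.0],
    [0.0, 1.0, 1.0, 0.0],
    [2.0, 1.0, -1.0, 2.0],
]

for t in (0.25, 1.0, 4.0):
    w = attention_weights(query, keys, temperature=t)
    print(f"T={t}: weights={[round(x, 3) for x in w]} entropy={entropy(w):.3f}")
```

Under these assumptions, a low temperature concentrates nearly all weight on the best-matching key, while a high temperature spreads it almost uniformly. Either extreme could plausibly hurt grounding, which is one reason the setting may need tuning.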