LASeR
New preprint! LASeR: Learning to Adaptively Select Reward Models with Multi-Armed Bandits, led by Duy Nguyen and Archiki Prasad with Mohit Bansal. The work uses bandit methods to select the best-suited reward model (RM) to optimize for each instance, improving LLMs on reasoning, instruction-following, and long-context understanding.
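To give a feel for what "using bandit methods to pick an RM" can look like, here is a minimal sketch of a generic UCB1 bandit choosing among several arms, where each arm stands in for one reward model. This is an illustration only and is not taken from the preprint: the class name `UCB1Selector`, the function `simulate`, the exploration constant `c`, and the simulated Bernoulli feedback are all assumptions made for the example, and the actual bandit algorithm and feedback signal used in LASeR may differ.

```python
import math
import random


class UCB1Selector:
    """Generic UCB1 bandit over a fixed set of arms (hypothetical: one arm per RM)."""

    def __init__(self, n_arms, c=2.0):
        self.counts = [0] * n_arms   # how often each arm was chosen
        self.means = [0.0] * n_arms  # running mean of observed feedback per arm
        self.c = c                   # exploration constant (assumed value)

    def select(self):
        # Try every arm once before using confidence bounds.
        for arm, n in enumerate(self.counts):
            if n == 0:
                return arm
        total = sum(self.counts)
        scores = [
            mean + math.sqrt(self.c * math.log(total) / n)
            for mean, n in zip(self.means, self.counts)
        ]
        return max(range(len(scores)), key=scores.__getitem__)

    def update(self, arm, feedback):
        # Incremental update of the running mean for the chosen arm.
        self.counts[arm] += 1
        self.means[arm] += (feedback - self.means[arm]) / self.counts[arm]


def simulate(true_means, steps=2000, seed=0):
    """Simulated loop: feedback is a Bernoulli draw, a stand-in for a real training signal."""
    rng = random.Random(seed)
    selector = UCB1Selector(len(true_means))
    for _ in range(steps):
        arm = selector.select()
        feedback = 1.0 if rng.random() < true_means[arm] else 0.0
        selector.update(arm, feedback)
    return selector.counts
```

Calling `simulate([0.2, 0.5, 0.8])` returns how many times each simulated arm was selected; over enough steps the selector concentrates on the arm with the highest average feedback while still occasionally exploring the others.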