Rewards, which form a large part of any RL system, are tricky to design. A smarter reward design leads to more accurate outcomes.
In the context of reinforcement learning, a reward is the bridge that connects the model's motivations with the task objective. Reward design decides the robustness of an RL system. Designing a reward function comes with few restrictions, and developers are free to formulate their own functions. The challenge, however, is the risk of getting stuck in local optima.
Reward functions are peppered with clues that steer the system, model, or machine in a certain direction. In this context, the clues are mathematical terms written with efficient convergence in mind.
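For instance, a shaped reward for a navigation-style task is often written as a weighted sum of such terms. The sketch below is purely illustrative: the terms, weights, and function name are hypothetical and not taken from any specific system.

```python
def shaped_reward(distance_to_goal, speed, collided,
                  w_progress=1.0, w_speed=0.1, w_collision=10.0):
    """Illustrative shaped reward for a navigation-style task.

    Each term is a 'clue' that nudges the agent in a desired direction:
    closing the distance to the goal is rewarded, excessive speed is mildly
    penalised, and collisions are heavily penalised. The weights are
    hypothetical and would normally be tuned by hand.
    """
    reward = -w_progress * distance_to_goal       # get closer to the goal
    reward -= w_speed * max(0.0, speed - 1.0)     # discourage going too fast
    if collided:
        reward -= w_collision                     # strongly discourage collisions
    return reward
```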
Automating Reward Design
Machine learning practitioners, especially those who work with reinforcement learning algorithms, face a common challenge: making the agent realise that certain actions are more lucrative than others. To do this, they use reward shaping.
During learning, the reward function is edited based on feedback generated as tasks are completed, and this information is used to retrain the RL policy. The process is repeated until the agent performs the desired actions.
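A minimal sketch of such a reward-shaping loop might look like the following; `train_policy`, `evaluate`, and `adjust_weights` are hypothetical placeholders for the actual training, evaluation, and reward-editing steps, not a real API.

```python
def tune_reward_by_shaping(initial_weights, train_policy, evaluate,
                           adjust_weights, max_rounds=10, target_score=0.95):
    """Illustrative reward-shaping loop (placeholder functions, not a real API).

    train_policy(weights) -> policy   : trains an RL policy under the shaped reward
    evaluate(policy)      -> feedback : measures how well the task objective is met
    adjust_weights(w, fb) -> weights  : edits the reward weights based on feedback
    """
    weights = initial_weights
    for _ in range(max_rounds):
        policy = train_policy(weights)                # retrain with the current reward
        feedback = evaluate(policy)                   # observe behaviour on the task
        if feedback >= target_score:                  # stop once behaviour is acceptable
            break
        weights = adjust_weights(weights, feedback)   # edit the reward and repeat
    return policy, weights
```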
The effort of repeatedly retraining policies and observing them over long durations raises the question of whether reward design can be automated, and whether there is a proxy reward that promotes learning while still meeting the task objective.
In an attempt to automate reward design, the robotics team at Google introduced AutoRL, a method that automates RL reward design using evolutionary optimisation over a given objective.
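Conceptually, an evolutionary reward search maintains a population of reward-parameter vectors, trains a policy under each, scores it on the task objective, and mutates the best performers. The sketch below illustrates that general idea only; it is not Google's AutoRL implementation, and `train_and_score` is a hypothetical placeholder for a full training-and-evaluation run.

```python
import numpy as np

def evolve_reward_parameters(train_and_score, dim, population=8,
                             generations=20, sigma=0.1, seed=0):
    """Generic evolutionary search over reward parameters (illustrative only).

    train_and_score(params) -> float : trains a policy under the parameterised
    reward and returns its task-objective score (e.g. distance travelled).
    """
    rng = np.random.default_rng(seed)
    pop = rng.normal(size=(population, dim))                  # initial reward parameters
    for _ in range(generations):
        scores = np.array([train_and_score(p) for p in pop])
        elite = pop[np.argsort(scores)[-(population // 2):]]  # keep the top half
        children = elite + sigma * rng.normal(size=elite.shape)  # mutate the elites
        pop = np.concatenate([elite, children])               # form the next generation
    best = pop[np.argmax([train_and_score(p) for p in pop])]
    return best
```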
To measure its effectiveness, the team at Google applied AutoRL's evolutionary reward search to four continuous control benchmarks from OpenAI Gym:
- Ant
- Walker2D
- HumanoidStandup
- Humanoid
These were run with two RL algorithms: off-policy Soft Actor-Critic (SAC) and on-policy Proximal Policy Optimisation (PPO).
To assess AutoRL’s ability to reduce reward engineering while maintaining the quality of existing metrics, the team considered two measures: task objectives and standard returns.
Task objectives measure task achievement for continuous control: distance travelled for Ant, Walker, and Humanoid, and height achieved for HumanoidStandup. Standard returns, by contrast, are the metrics by which these tasks are normally evaluated.
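The distinction can be made concrete by scoring the same rollout under both measures. The snippet below uses made-up numbers and hypothetical helper names; it simply contrasts the sum of shaped per-step rewards (standard return) with net forward displacement (task objective).

```python
def standard_return(rewards):
    """Standard return: the sum of the environment's shaped per-step rewards."""
    return sum(rewards)

def task_objective(x_positions):
    """Task objective for locomotion: net distance travelled along the x-axis."""
    return x_positions[-1] - x_positions[0]

# Made-up rollout: the agent moves 3.2 units forward while collecting shaped
# per-step rewards (forward progress, control costs, survival bonus).
rewards = [0.9, 1.1, 0.8, 1.0]
x_positions = [0.0, 1.0, 2.1, 3.2]
print(standard_return(rewards))     # 3.8  (what the benchmark normally reports)
print(task_objective(x_positions))  # 3.2  (what the evolved reward is scored on)
```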
Key Findings
The authors, in their paper, list the following findings:
- Evolving rewards trains better policies than hand-tuned baselines, and on complex problems outperforms hyperparameter-tuned baselines, showing a 489% gain over hyperparameter tuning on a single-task objective for SAC on the Humanoid task.
- Optimisation over simpler single-task objectives produces results comparable to the carefully hand-tuned standard returns, reducing the need for manual tuning of multi-objective tasks.
- Under the same training budget, reward tuning produces higher-quality policies faster than tuning the learning hyperparameters. […]