For 2, consider an environment with two rooms, where the agent has to press a different button in each room to get the optimal reward. If there's a cheap way to determine which room the agent is in, then the optimal policy is to check the room first and press the button that matches it :)
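Here's a rough sketch of that two-room setup, in case it helps to see the numbers. Everything specific is my own assumption for illustration: the rooms are equally likely, the matching button pays 1 and the wrong one pays 0, and `CHECK_COST` is a made-up price for finding out which room you're in.

```python
ROOMS = (0, 1)      # assumption: the agent starts in either room with equal probability
BUTTONS = (0, 1)    # assumption: button i is the right one in room i
REWARD = 1.0        # assumption: payoff for pressing the matching button (0 otherwise)
CHECK_COST = 0.1    # hypothetical cost of determining the current room


def expected_return(policy, check_cost=CHECK_COST):
    """Average return over both rooms for a policy given as (look, choose)."""
    look, choose = policy
    total = 0.0
    for room in ROOMS:
        obs = room if look else None           # the agent only sees the room if it pays to look
        button = choose(obs)
        total += REWARD if button == room else 0.0
        total -= check_cost if look else 0.0
    return total / len(ROOMS)


# Two blind policies (always the same button) and one that checks first.
policies = {f"always press {b}": (False, lambda obs, b=b: b) for b in BUTTONS}
policies["check, then press matching"] = (True, lambda obs: obs)

returns = {name: expected_return(p) for name, p in policies.items()}
best = max(returns, key=returns.get)

for name, value in returns.items():
    print(f"{name}: {value:.2f}")
print("best:", best)
```

With these numbers a blind policy averages 0.5 and checking first averages 1 - `CHECK_COST`, so checking wins as long as it costs less than 0.5. Make the check expensive enough and it stops being worth it.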
Hope that helps!