An agent’s predicted best course of action through time to get from the current state to a reward state.

The policy describes a future trajectory of state-action pairs which lead to a state in which the agent’s perceived reward is high.

A policy is an approximation, since any individual agent has an incomplete description of their state, as well as an incomplete memory, as well as incomplete knowledge of the environment to predict the future.

It’s a best guess at predicting which future state-actions will lead to reward.