What is the effect of picking actions deterministically at inference with policy gradient methods?

In policy gradient methods such as A3C/PPO, the network outputs a probability for each action. At training time, the action to take is sampled from this probability distribution. When evaluating the policy in an environment, what would be the effect of always picking the action with the highest probability instead of sampling?
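
To make the two selection modes concrete, here is a minimal sketch of what I mean, assuming a hypothetical `policy_net` that maps an observation to action logits (names and shapes are placeholders, not any specific library's API):

```python
import torch

def select_action(policy_net, obs, deterministic=False):
    logits = policy_net(obs)                # unnormalized scores, one per action
    probs = torch.softmax(logits, dim=-1)   # probability for each action
    if deterministic:
        # Greedy evaluation: always take the most probable action.
        return torch.argmax(probs, dim=-1)
    # Stochastic selection: sample from the categorical distribution,
    # as is typically done during training.
    dist = torch.distributions.Categorical(probs)
    return dist.sample()
```

The question is about the effect of switching from the sampling branch to the `deterministic=True` branch at evaluation time.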