I don't see any easy way to justify this from Bayesian statistical theory. However, one can view the problem as maximizing log P(W|A) + lambda * log P(W), where log P(W) acts as a smoothness or regularization term and log P(W|A) can be seen as relating to the expected loss of the function (e.g., via Markov's inequality). Regularization has plenty of theoretical justification in machine learning as a way to improve performance on unseen data (data that is reasonably similar to the training data).
This is very much a discriminative tactic in disguise.
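To make the objective concrete, here is a minimal Python sketch of selecting among candidate hypotheses W by log P(W|A) + lambda * log P(W). All names, scores, and the value of lambda are hypothetical, invented purely for illustration; in practice the two scores would come from a trained discriminative model and a prior/language model.

    import math

    # Hypothetical scores for three candidate hypotheses (illustrative only):
    # log_posterior[w] stands in for log P(W|A) from a discriminative model,
    # log_prior[w] stands in for log P(W) from a prior.
    log_posterior = {"w1": math.log(0.6), "w2": math.log(0.3), "w3": math.log(0.1)}
    log_prior     = {"w1": math.log(0.1), "w2": math.log(0.5), "w3": math.log(0.4)}

    lam = 0.8  # regularization weight lambda, set by hand here; normally tuned

    def objective(w):
        """Regularized objective: log P(W|A) + lambda * log P(W)."""
        return log_posterior[w] + lam * log_prior[w]

    # Pick the hypothesis maximizing the combined score.
    best = max(log_posterior, key=objective)
    print(best, objective(best))

Note how lambda trades off the discriminative posterior against the prior: lambda = 0 ignores the prior entirely, while large lambda lets the smoothness term dominate.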