What does any of that have to do with modifying P(sentence) relative to P(sentence is a translation of X), though? Everything you're talking about involves getting that initial P(sentence) estimate, but I don't see anywhere in there why uniformly raising all naive estimates to a power would come close to accounting for power law word frequencies (which presumably would be incorporated in the initial model that gave us a P(sentence) estimate).
According to Norvig, the real problem this is intended to address is that P(sentence is a translation of X) tends to be a crappy estimate, and that in fact P(sentence) itself is much more reliable. Which would seem to conflict with your suggestion that this is really applying a correction for P(sentence).
Ah I see your point. Norvig uses a positive exponent on the prior whereas I have written it in the form where there is a negative exponent on the conditional. According to Bayes, we predict class 1 when
P(C_1) P(words | C_1) > P(C_2) P(words | C_2)
The positive exponent on P(C_i) and the negative exponent on P(words | C_i) are equivalent as far as classification is concerned: raise both sides to a suitable (negative) power so that the exponent on the conditional becomes one (remembering that a negative power reverses the inequality), and you end up with a positive exponent on P(C_i). Since the same monotone transformation is applied to both sides, the predicted class is unchanged.
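For concreteness, here is a small numerical sketch of that monotone-transform argument (the exponent value and the toy probabilities are invented purely for illustration): raising both sides of the decision rule to a power moves the exponent between the prior and the conditional, and a negative power just flips which side of the comparison wins.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup: two classes, made-up prior and conditional probabilities for a
# few candidate word sequences.  All numbers here are invented for illustration.
priors = np.array([0.7, 0.3])                        # P(C_1), P(C_2)
likelihoods = rng.uniform(1e-6, 1e-3, size=(5, 2))   # P(words_j | C_i)

a = 2.5  # assumed exponent on the prior in the positive-exponent form

for lik in likelihoods:
    # Form 1: positive exponent on the prior.
    s1 = priors**a * lik
    # Form 2: raise form 1 to the power 1/a -- the exponent moves onto the conditional.
    s2 = priors * lik**(1.0 / a)
    # Form 3: raise form 2 to the power -1 -- the exponents go negative and the
    # comparison flips, so the winner is now the class with the *smallest* score.
    s3 = priors**-1.0 * lik**(-1.0 / a)

    assert s1.argmax() == s2.argmax() == s3.argmin()

print("All three formulations select the same class.")
```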
Thanks for catching this.
The constant c can and does vary with the class label. As a result, the Bayes classifier will use conditional distributions raised to fixed exponents, and in general those exponents will differ per word and per class. The model Norvig is using is a very simple variant in which he fixes a single exponent. As I said, it's not the exact expression one would obtain by assuming a power law, but it is very similar.
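To make that contrast concrete, here is a toy sketch (all probabilities and exponents are invented for illustration) of the general form, where every word gets its own class-dependent exponent, next to the fixed-exponent simplification:

```python
import numpy as np

rng = np.random.default_rng(1)

n_words, n_classes = 4, 2
prior = np.array([0.6, 0.4])                              # P(C_i)
word_lik = rng.uniform(1e-4, 1e-2, (n_words, n_classes))  # P(word_j | C_i), toy values

# Per-word, per-class exponents -- what a class-dependent power-law correction
# would give you in general.  These values are made up.
alpha = rng.uniform(0.5, 1.5, (n_words, n_classes))

# General form: each word's conditional gets its own exponent for each class.
score_general = prior * np.prod(word_lik**alpha, axis=0)

# Simplified variant: one fixed exponent shared by every word and class.
alpha_fixed = 0.8  # assumed value, for illustration
score_fixed = prior * np.prod(word_lik, axis=0)**alpha_fixed

print("general:", score_general, "->", score_general.argmax())
print("fixed  :", score_fixed,   "->", score_fixed.argmax())
```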