This is more of a tweak of naive Bayes than of Bayes' theorem, and I suspect he is being a bit tongue-in-cheek and not letting on what's behind the tweak.
I am sure you have heard that naive Bayes makes gratuitous assumptions of independence. What is not mentioned as often is that it also assumes the document has been generated by a memory-less process.
So if I were to generate a document according to the model assumed by naive Bayes, and I want to decide whether to concatenate another "the" onto the document, then I don't need to keep track of how many "the"s I have already added. As a result, the probability of n_i occurrences of a word i goes down exponentially, like this
P(n_i) = p_i^n_i.
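To make that concrete with an invented number: if p_the = 0.05 under this model, then P(1 occurrence) = 0.05, P(2 occurrences) = 0.0025, and P(5 occurrences) is about 3.1 x 10^-7, a strict geometric decay regardless of how long the document is or what has already been written.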
Many words do not behave like this. Their probabilities do not go down monotonically from the start; rather, for words like "the", their probabilities (conditioned on their history) climb initially and then go down.
Naive Bayes works surprisingly well in spite of its grossly violated assumptions. There are many explanations for that. However, we usually forget that to make NB work, we have to throw away "stop words". Among them are exactly those "memory-full" words that violate NB's assumptions.
Let's get back to word frequencies: a model that fits word frequencies quite well is the power law (http://en.wikipedia.org/wiki/Power_law). It looks like this
P(n_i) \propto n_i^c
where c is a constant for that word id i. For English, c usually lies in the ballpark of -1.5 to -2.1.
The tweak that Norvig mentions is not an exact match for the power-law assumption, but it comes very close. It's a power-law assumption with sub-optimal parameters. In fact, using the power-law assumption with its parameters estimated from the data, you could get an even better classifier. Though be careful of the industry of seeing power laws everywhere: log-normal distributions can look deceptively similar to a power law and are more appropriate on many such occasions.
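If it helps to see the two count models side by side, here is a rough sketch in Python with invented parameter values (p and c are plausible-looking numbers, not estimates from any data):

    import numpy as np

    p = 0.05          # hypothetical unigram probability of some word
    c = -1.8          # hypothetical power-law exponent for English
    n = np.arange(1, 11)

    geometric = p ** n                  # memoryless / naive Bayes count model (unnormalized)
    power_law = n.astype(float) ** c    # Zipf-like count model (unnormalized)
    power_law /= power_law.sum()        # normalize over this finite range

    for k, g, z in zip(n, geometric, power_law):
        print(f"n={k:2d}  memoryless={g:.2e}  power_law={z:.2e}")

The memoryless column collapses far faster with n than the power-law column, which is the mismatch being discussed for repeat-happy function words.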
Yeah, naive Bayes has a bad independence assumption, but there is no reason it has to be memory-less too, and that is partly what the tweak fixes; the theorem isn't really being violated.
What does any of that have to do with modifying P(sentence) relative to P(sentence is a translation of X), though? Everything you're talking about involves getting that initial P(sentence) estimate, but I don't see anywhere in there why uniformly raising all naive estimates to a power would come close to accounting for power law word frequencies (which presumably would be incorporated in the initial model that gave us a P(sentence) estimate).
According to Norvig, the real problem this is intended to address is that P(sentence is a translation of X) tends to be a crappy estimate, and that in fact P(sentence) itself is much more reliable. Which would seem to conflict with your suggestion that this is really applying a correction for P(sentence).
Ah, I see your point. Norvig uses a positive exponent on the prior, whereas I have written it in the form where there is a negative exponent on the conditional. According to Bayes, we predict class 1 when
P(C_1) P(words | C_1) > P(C_2) P(words | C_2)
The positive exponent on P(C_i) and a negative exponent on P(words | C_i) are equivalent as far as classification is concerned: one takes a suitable (negative) power of both sides so that the exponent on the conditional becomes one, and you end up with a positive exponent on P(C_i).
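As a quick sanity check of that bookkeeping (my algebra, using Norvig's 1.5 as the example): since both sides are positive, raising them to any positive power t preserves the comparison, so

P(C_1)^1.5 P(words | C_1) > P(C_2)^1.5 P(words | C_2)

holds exactly when

P(C_1)^(1.5 t) P(words | C_1)^t > P(C_2)^(1.5 t) P(words | C_2)^t,

and picking t = 1/1.5 moves the whole exponent onto the conditional while the prior's exponent returns to one.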
Thanks for catching this.
The constant c can and does vary with the class label. As a result, the Bayes classifier will use conditional distributions that have fixed exponents; typically, however, these will differ per word and per class. The model Norvig is using is a very simple variant where he fixes the exponent. As I said, it's not the exact expression one would have obtained by assuming a power law, but it is very similar.
As I mention below, this tweak shows up even in models that go well out of their way to avoid independence assumptions. I didn't watch the video, so I'm not sure what Norvig mentions, but I don't think citing independence violations makes the fudge factor vanish.
This has been done for the last 20 years (sort of in secret) in the Automatic Speech Recognition community. In brief, if A is the auditory observation you're trying to transcribe and W a possible word sequence, then you use Bayes law to rewrite
P(W|A) = P(A|W)P(W)
The first is known as the auditory model which is responsible for transcribing sounds into phonemes and potential words. The second is the language model responsible just for ranking likely word sequences.
The first tends to be a lot harder than the second; its models are fuzzier. So to compensate, the real, in-practice model is
P(W|A) = P(A|W)P(W)^a
where a is a positive number. I forget what the typical value of the fudge factor is, but generally you use cross validation to optimize it.
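For concreteness, a minimal sketch of what tuning that weight by cross-validation might look like, with hypothetical per-candidate log-scores standing in for real acoustic and language models:

    import numpy as np

    def best_hypothesis(acoustic, language, a):
        """Pick the candidate maximizing log P(A|W) + a * log P(W)."""
        return int(np.argmax(acoustic + a * language))

    def accuracy(dev_set, a):
        """Fraction of held-out utterances whose top-scoring candidate is the reference."""
        hits = sum(best_hypothesis(ac, lm, a) == ref for ac, lm, ref in dev_set)
        return hits / len(dev_set)

    def tune_fudge_factor(dev_set, grid=np.linspace(0.5, 20.0, 40)):
        """Grid-search the language-model weight on held-out data."""
        return max(grid, key=lambda a: accuracy(dev_set, a))

    # dev_set is assumed to be a list of (acoustic_scores, language_scores, reference_index)
    # tuples, one per utterance, with the scores as numpy arrays over candidate hypotheses.

In real systems the candidate lists come out of a decoder, but the sweep over the language-model weight is essentially this simple.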
Exactly one class I took told me about its existence. Most others pretend like it's not there.
This is great to hear! I used a similar construction in my final project (electrical engineering) but left the appropriate mathematical proof (if any) to the Future Work section. The classification rule I found to be better uses the following notation:
c - best class
C - set of classes
P(c) - prior prob. of class c
P(d_i|c) - conditional of the word d_i on the class c and
N_{d,d_i} - frequency of word d_i in document d
Excuse me for the possibly weird notation.
Effectively, this expression places a frequency power on the word probability, which basically assumes (naively, I think) that "words that are more likely to occur should be empowered".
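Spelled out in that notation, the rule being described is presumably something like (my reconstruction; the formula itself did not survive):

c = argmax_{c in C} P(c) * prod_i P(d_i|c)^(N_{d,d_i}),

i.e. each word's conditional probability is raised to the number of times the word occurs in the document.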
The rigorous mathematical justification that this trick works better than not using it is that speech recognition works better when it is used. It is for theory to explain reality rather than the other way around.
Specifically, what justifies this is that the parameter is optimized via cross-validation, so within the space that we're optimizing over, we know that we're picking a good parameter. If it's not equal to 1, then it turns out that the "rigorous" probabilistic model was missing something that turned out to be useful.
In this case, it's that the probabilistic model only incorporates naively measured probabilities, and does not account for the error in those measured values. This is rather tricky to do right - you need to know a lot about the true prior distributions as well as the error distributions in order to account for it, and generally we don't have enough information to do this effectively.
So we "cheat", and just use the probabilistic model as a starting point to launch an optimization against.
Stated differently, it is rigorous if you accept that A and W are what we think they are, and that what we want to find is W*, which we can take as W* = argmax_W P(A|W)P(W).
The problem is that we can't know P(A|W) or P(W). We must choose a family of distributions (a model) that we believe they live in and then perform statistical estimation to find the best representatives of those families; call them f*(A|W) and g*(W). We can introduce error here in three ways: (1) choosing too small a family, (2) having too little data to find the best representative of our family, (3) having finite resources and thus giving up before we even reach the optimal one we could have found.
(1), (2), and (3) are in contention. For instance, increasing (1) can be done so we can quickly process more data to reduce (2) and (3). It's a battle of tradeoffs.
The fudge factor is introduced to compensate for the fact that P(W) is somewhat easier to estimate than P(A|W) (but don't get me wrong, they're both technically difficult). That means that f*(A|W) tends to be blurrier than g*(W) (has lower Fisher information, perhaps). Considering f* and g* to be on equal footing is then pretty silly, so giving more power to the sharper model improves classification accuracy.
---
The point being, it's not actually surprising that the fudge factor exists. It's more of a conflation of ideas and notation that makes it seem confusing.
There isn't one. It's an engineering solution that just happens to work pretty often in ASR.
Complex P(W)'s are almost certainly fit by MLE. The models used are super complex, so MLE is often the only practical estimator. There are MaxEnt and Bayesian methods, but I think people generally trade them for the computational tractability of enormous models, which wouldn't gain much from prior information anyway because there's just so much data available.
If you're maximizing over all possible values of W, A never changes, which is why the /P(A) will usually be omitted from the function you're maximizing.
You're right, though, it shouldn't be written as P(W|A) in that case, that's misleading.
I just saw Norvig give this talk. The quote cuts off before he offers his explanation: the 1.5 means 'trust this data set more.' For whatever reason (e.g. bigger sample size) the probabilities from that set tend to be more accurate.
It's not some sort of deep mystical counter example. It's a clever tweak that comes from the observation that empirical observations are not uniformly reliable.
Given accurate probabilities as input, Bayes always gives the best result. That's a theorem. So the fact that this tweak improves performance probably means that either the English model probabilities are systematically too low, or the translation probabilities are systematically too high. My guess is the latter.
As it's probability^1.5, it makes the probability lower, not higher, i.e., x^1.5 < x if 0 < x < 1.
I think this might be a tweak due to data scarcity. To compute p(e), there's tons of data to build a model. On the other hand, p(e|f) requires you to get a large parallel corpus (texts written in both English and the foreign language), which is not easily done.
As a result, p(e)^1.5 * p(e|f) lowers p(e), as it's presumably too high compared to p(e|f), imho.
I think it's more helpful to think of it as a gamma correction with γ > 1, rather than saying higher or lower. Or, as I said in another comment, a lower temperature, in that it concentrates probability around its peaks.
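A toy illustration with invented numbers: take the two-outcome distribution (0.8, 0.2). Raising each entry to the 1.5 power gives about (0.716, 0.089), and renormalizing gives roughly (0.889, 0.111): both raw values went down, but after renormalization the peak holds more of the mass.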
There are a few things that really bother me about this post and many of the comments regarding it.
First, let me address a common theme that I see underlying this discussion and many others: theory vs reality. The common claim seems to be that scientists or other members of the academic community are often disconnected with "the real world" and blindly hold to models that don't quite work. This is both a damaging idea and a false one.
Science and math are two different things and science has always been about the real world. The key difference between science and math (or theory as some like to call it) is empirical data. In the physical sciences the fact that collected data contains errors or biases is a fundamental part of the process. Characterizing and understanding the error or other problems with collected data is critical to doing good science.
It seems that people who study computer science often forget this. When does computer science change from mathematics to a proper science? When you start applying it to real data or to real systems that contain imperfections, situations precisely like the one being described. In this case there is nothing wrong with Bayes' theory, and as many have pointed out, the theory is not being tweaked at all. Rather, the change in the mathematics is there to address a shortcoming of the data. This is simply how science should work. Science is about the real world, and if you think otherwise you are not understanding or practicing good science.
My second criticism is of the suggestion that we should not care why this change improves results. Now, John Cook is not actually advocating this, but a casual reading of his blog post could certainly leave that impression. He is saying that it's OK to use an algorithm in practice even though we don't fully understand it. Insofar as it's been properly tested, I doubt most people, or most academics, would argue with this.
But it's still important to try to understand what is going wrong in the application of the mathematics such that a fudge factor is required. Very often, understanding the root cause can lead to better methodology or a better modelling of the data.
The biggest difficulty in Bayesian modeling is usually choosing the prior, i.e. the P(W) term. There are a few maximum entropy techniques, but basically the rule is: "what works best".
In this case P(W) is a 1-gram probability, which is clearly wrong, because it amounts to saying that every word has a probability of appearing that is independent of its surroundings. For example, while 'the' is very likely to appear, you don't want it to appear next to another 'the'.
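To put a rough number on it: with P('the') around 0.05, a unigram prior assigns P('the the') = P('the')^2, about 0.0025, to any adjacent pair of positions, which is wildly generous for a word pair that essentially never occurs in real English.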
Even when using more sophisticated priors that take n-grams into account, one never has a real model of the English language, because that will depend on the context, the period in which it was written, the linguistic domain…
That is why tweaks are actually justified theoretically.
Another interesting thought that popped up in the comments to the original post is that if one uses a statistical mechanics interpretation of probability, where P(W) = exp(-H(W)/T) with H some energy function and T a temperature, one can interpret the 1.5 factor as lowering the temperature and making the probability less fuzzy.
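That reading is just algebra on the stated form: P(W)^1.5 = exp(-1.5 H(W)/T) = exp(-H(W)/(T/1.5)), so the exponentiated model is the same model run at the lower temperature T/1.5, which (after renormalization) concentrates probability on the low-energy, high-probability word sequences.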
This happens all the time in machine learning applications. Or in many other engineering disciplines, I dare say. If theorems and laws never needed tweaks here and there in the real world, what would we need hackers and engineers for?
Often the tweaks are then used to inspire more solid mathematical footing. An interesting example of this is going on with the recent surge of interest in neural networks and deep learning at machine learning conferences. What used to be hacks and heuristics are being given a more rigorous narrative. Of course, as soon as we have a better model for neural networks, someone immediately finds a non-rigorous tweak to improve its performance. And the cycle goes on...
A good example is regularization. You have nice proofs saying that your classifier is optimal, then you tack on a regularization term to it, which breaks your optimality proof but improves your classification accuracy. It seems unexpected, but it's not really all that surprising when you get down to the details of it.
Oops hit the down-arrow without intending to, my bad, hope someone will fix that.
There is nothing tacked-on about a regularizer, though; it is very sound even in theory. There are several ways to look at it. One way is to see it as a natural consequence of Bayes' law: it is just the log of the prior probability. There are certain things we know or assume about the model even before looking at the data, for example we expect the predictions to have a certain smoothness, etc.; all this knowledge can be incorporated into the prior model, and that is what the regularizer is. Another way is to look at it from the stability of the parameter estimates. I find the former more convincing.
Absolutely, there's a pretty clear mathematical justification for regularization. However, it is quite literally tacked on at the end. Take logistic regression: if you minimize the cost function without regularization, you get a maximum-likelihood estimate of the regression parameters. But what we do is add a regularization term to that cost function. Minimizing that cost function will no longer give the MLE solution, but it will (likely) give a better one. It all comes down to understanding that the MLE optimality property is an asymptotic result. The same goes for covariance matrix estimates, where you have regularization procedures that are guaranteed never to be worse than the plain MLE solution.
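As a concrete, simplified sketch of that "tacking on" (assuming an L2 penalty and labels y in {-1, +1}; the optimizer itself is omitted):

    import numpy as np

    def logistic_nll(w, X, y):
        """Negative log-likelihood of logistic regression: the plain MLE objective."""
        margins = y * (X @ w)                 # y in {-1, +1}
        return np.sum(np.log1p(np.exp(-margins)))

    def regularized_cost(w, X, y, lam):
        """The same objective with an L2 penalty literally added at the end."""
        return logistic_nll(w, X, y) + lam * np.sum(w ** 2)

Minimizing logistic_nll alone gives the maximum-likelihood fit; minimizing regularized_cost no longer does, but it usually generalizes better on finite data.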
As an engineer, you should also be aware of when discovering the basis of the tweak is crucial. Discovering that tweaking the beam bending equations gives a much better fit to your test results on the beams you would like to use for your building is one example.
In some cases, these tweaks provide better results for a small range of conditions. That small range may be big enough for you (given your task at hand), but without understanding the tweak, you can't actually know. So care must be taken.
This kind of thing is awesome in a way. I get the sense that machine learning really feeds on people attacking problems from both ends, the elegant probabilistic side and the practical optimisation hacks both inspire each other.
Some potential downsides of a hack that isn't backed by any theory, though, just to demonstrate why it might be worth trying to do some theory after spotting one of these hacks, from a practical and not just an aesthetic perspective:
- It may have an impact on the convergence properties and numerical stability of whatever optimisation algorithm you're using to fit the model: convergence speed, quality of the local optima attained, whether it even converges to a local optimum of your cost function at all, whether there are any guarantees that it doesn't sometimes blow up numerically in a horrible way...
- In general it may be brittle, with the circumstances under which it works well poorly understood. Will it break as your dataset grows? will it work on slightly different kinds of datasets?
- Too many arbitrary parameters to tweak can be expensive unless you have a smart way to optimise them (smarter than grid search + cross-validation)
- Maintainability. It can be frustrating trying to re-use work when people have been less than completely honest in documenting things like "this term/factor/constant was pulled out of my arse and seems to work well on this one dataset, caveat emptor".
One should often consider applying tweaks to a final system. If there's an obvious place to introduce a free parameter, it seems silly not to do so and cross-validate the parameter against application performance.
Things get out of hand if there are many such possible tweaks, or multiple components are combined, each with interacting tweaks. Then some principles behind the tweaks need identifying. Or at least a differentiable cost function to target.
You can't justify this using theoretical Bayesian statistics in any easy way that I see. However, one can view the problem as max log(P(W|A)) + lambda * log(P(W)), where log(P(W)) is a smoothness or regularization term and log(P(W|A)) can be seen as relating to the expected loss of the function (e.g. use Markov's inequality). Regularization has plenty of theoretical justification in machine learning for improving performance on unseen data (that is reasonably similar to the training data).
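For what it's worth, taking logs of the ASR rule discussed above gives the same shape: argmax_W P(A|W) P(W)^a = argmax_W [log P(A|W) + a log P(W)], so the fudge exponent is exactly a lambda-style weight on the log-prior term.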
This is very much a discriminative tactic in disguise.