Well, some weeks ago, I spent some time looking at Bayesian probability. I learned it way back in high school, though I never gave it much thought. Then I stumbled on it again in college in my ECE courses, and nearly failed the class. And then I took it again in grad school, and though I did well enough in the class, I still felt weak in probability.
This time, when implementing a Bayesian classifier, I learned the ins and outs of a seemingly simple naive Bayesian classifier, and I learned how to spell 'Bayesian'. That 'e' after the 'y' gets me every time.
Anyway, I was looking at Paul Graham's famous "A Plan for Spam", and I couldn't figure out where he got that last formula. Kinda mad at myself for not seeing it earlier, cuz it really is pretty simple. It took me a while, but I worked it out. Turns out I learned more when I implemented it...or rather, I would have learned it had I concentrated the first two times.
We know Bayes' Theorem comes from the product rule for joint probability, which is as follows:
(1) P(a,b) = P(a|b) * P(b) = P(b|a) * P(a)
With some algebra on the above, we can derive Bayes' Theorem proper:
(2) P(a|b) = P(b|a) * P(a) / P(b)
But also note that if we take (1) and put a given 'd' behind it, it still holds true, as long as every probability on the other side also has a given 'd' behind it. If you draw out the Venn diagrams, you'll see this is true: the given 'd' just shrinks the universe down to the 'd' circle.
(3) P(a,b|d) = P(a|b,d) * P(b|d) = P(b|a,d) * P(a|d)
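If drawing Venn diagrams isn't your thing, (3) can also be checked by brute force. Here's a minimal sketch, assuming a made-up joint distribution over three binary events; the weights and the helper names `prob` and `cond` are hypothetical, picked only for illustration.

```python
from fractions import Fraction
from itertools import product

# Hypothetical joint distribution over three binary events (a, b, d).
# The weights are arbitrary made-up numbers, normalized to sum to 1.
weights = [1, 2, 3, 4, 5, 6, 7, 8]
total = sum(weights)
joint = {
    outcome: Fraction(w, total)
    for outcome, w in zip(product([False, True], repeat=3), weights)
}

def prob(predicate):
    """Probability that predicate(a, b, d) holds under the joint table."""
    return sum(p for (a, b, d), p in joint.items() if predicate(a, b, d))

def cond(event, given):
    """Conditional probability P(event | given)."""
    both = prob(lambda a, b, d: event(a, b, d) and given(a, b, d))
    return both / prob(given)

# The three expressions in equation (3).
lhs = cond(lambda a, b, d: a and b, lambda a, b, d: d)       # P(a,b|d)
via_b = (cond(lambda a, b, d: a, lambda a, b, d: b and d)    # P(a|b,d)
         * cond(lambda a, b, d: b, lambda a, b, d: d))       # * P(b|d)
via_a = (cond(lambda a, b, d: b, lambda a, b, d: a and d)    # P(b|a,d)
         * cond(lambda a, b, d: a, lambda a, b, d: d))       # * P(a|d)

print(lhs, via_b, via_a)
```

Using `Fraction` keeps the arithmetic exact, so the three values can be compared with `==` instead of worrying about floating-point rounding.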
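We also have the Total Probability Rule, which says that the total probability is made up of its parts. Here the parts of b are the piece that overlaps a and the piece that overlaps not-a (written a'), so P(b) = P(b,a) + P(b,a'). If you apply the rule in (1) to each piece, you'll see that it's true.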
(4) P(b) = P(b|a) * P(a) + P(b|a') * P(a')
So this means that Bayes' rule in (2) can be rewritten with (4) as:
(5) P(a|b) = P(b|a) * P(a) / (P(b|a) * P(a) + P(b|a') * P(a'))
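To make (5) concrete, here's a small sketch with made-up numbers. The function name `posterior` and the three probabilities are assumptions for illustration only, not values from any real spam corpus.

```python
def posterior(p_b_given_a, p_a, p_b_given_not_a):
    """Equation (5): P(a|b) from P(b|a), P(a), and P(b|a')."""
    p_not_a = 1.0 - p_a
    # The denominator is P(b), expanded with the Total Probability Rule (4).
    p_b = p_b_given_a * p_a + p_b_given_not_a * p_not_a
    return p_b_given_a * p_a / p_b

# Made-up numbers: a = "message is spam", b = "message contains some word".
p_spam_given_word = posterior(p_b_given_a=0.9, p_a=0.2, p_b_given_not_a=0.1)
print(round(p_spam_given_word, 4))  # 0.6923
```

So even with only a 20% chance of spam up front, a word that shows up nine times more often in spam than in ham pushes the probability to about 69%.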
We also need the Probability Chain Rule. It says that the joint probability of a, b, and c can be rewritten as the following, by applying equation (1) over and over again (keeping the extra given along for the ride, as in (3)):
(6) P(a,b,c) = P(a,b|c) * P(c) = P(a|b,c) * P(b|c) * P(c)
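And lastly, the Independence Rule, which is the assumption that makes the Bayesian classifier naive:
(7) P(a,b) = P(a|b) * P(b) = P(a) * P(b) iff "a" is independent of "b"
In the classifier it gets used with a given behind it, the same way (3) extends (1): P(a|b,s) = P(a|s). That is, once you know whether the message is spam, one word tells you nothing about another.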
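Now we can solve for the probability of spam given the joint probability of some words, where each word is treated as an orthogonal, independent dimension. Here "s" is spam, "s'" is not spam (ham), and f0, f1 are the words seen in the message. The rule used at each step is noted on the right: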
P(s|f0,f1)
  = P(f0,f1|s) * P(s) /
    P(f0,f1)                                                          by (2)
  = P(f0,f1|s) * P(s) /
    (P(f0,f1|s) * P(s) + P(f0,f1|s') * P(s'))                         by (4)
  = P(f0|f1,s) * P(f1|s) * P(s) /
    (P(f0|f1,s) * P(f1|s) * P(s) + P(f0|f1,s') * P(f1|s') * P(s'))    by (6)
 ~= P(f0|s) * P(f1|s) * P(s) /
    (P(f0|s) * P(f1|s) * P(s) + P(f0|s') * P(f1|s') * P(s'))          by (7)

The same steps with words f0 through fn give:

P(s|f0,..,fn)
 ~= P(f0|s)*..*P(fn|s) * P(s) /
    (P(f0|s)*..*P(fn|s) * P(s) + P(f0|s')*..*P(fn|s') * P(s'))        by (6), (7)
 ~= P(f0|s)*..*P(fn|s) /
    (P(f0|s)*..*P(fn|s) + P(f0|s')*..*P(fn|s'))                       drop P(s), P(s')
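The last step needs a little explaining. All the P(s) and P(s') terms drop out of the equation when we're doing a classifier, since for any piece of evidence, f0...fn, P(s), the probability of spam occurring, is always the same across any classification. Since P(s) is constant, P(s'), which is (1 - P(s)), is also constant. Strictly speaking, they only cancel exactly when P(s) = P(s') = 0.5, which is why that step is a '~=' and not an '='. For any other prior, dropping them scales the spam-to-ham odds by the fixed factor P(s') / P(s), so the result still rises and falls with the true probability. Therefore, when we're comparing the value against a threshold to decide whether a message belongs in the spam or ham category, we can get rid of the constants and let the threshold absorb them.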
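Here's a rough sketch of that final formula in code, with an optional prior so the two versions can be compared side by side. The function name `spam_score` and all the per-word probabilities are made up for illustration; this is not Paul Graham's exact implementation.

```python
from math import prod  # Python 3.8+

def spam_score(p_words_given_spam, p_words_given_ham, p_spam=None):
    """Naive Bayes estimate of P(s|f0,..,fn).

    With p_spam=None the priors are dropped, as in the last line of the
    derivation. Otherwise the full formula with P(s) and P(s') is used.
    """
    like_spam = prod(p_words_given_spam)  # P(f0|s)*..*P(fn|s)
    like_ham = prod(p_words_given_ham)    # P(f0|s')*..*P(fn|s')
    if p_spam is None:
        return like_spam / (like_spam + like_ham)
    p_ham = 1.0 - p_spam
    return like_spam * p_spam / (like_spam * p_spam + like_ham * p_ham)

# Made-up per-word probabilities for a two-word message.
p_spam_words = [0.8, 0.6]
p_ham_words = [0.1, 0.3]

no_prior = spam_score(p_spam_words, p_ham_words)
even_prior = spam_score(p_spam_words, p_ham_words, p_spam=0.5)
low_prior = spam_score(p_spam_words, p_ham_words, p_spam=0.2)
print(no_prior, even_prior, low_prior)
```

The no-prior value matches the one computed with P(s) = 0.5, while a lower prior pulls the number down without changing which messages score higher than others. For long messages, multiplying many small probabilities can underflow, so practical implementations often add up logarithms instead.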
The actual hard part about Bayesian classifiers is how to estimate the underlying probability distribution. If you've never seen a piece of evidence in training, you're going to say the probability of it occurring is zero, which isn't correct if the evidence shows up during classification (and a single zero wipes out the whole product). There are various techniques for dealing with this, mostly under the term 'smoothing'. I won't describe the various techniques here, but that should be enough to get you started.
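Just for a taste of what smoothing looks like, here's a minimal sketch of one common technique, add-one (Laplace) smoothing: pretend every word in the vocabulary was seen `alpha` extra times. The function name `smoothed_likelihoods`, the toy tokens, and the vocabulary are all made up for illustration, and this is only one of many possible approaches.

```python
from collections import Counter

def smoothed_likelihoods(tokens, vocabulary, alpha=1.0):
    """Estimate P(word|class) for every word in the vocabulary.

    tokens: words seen in training messages of one class (e.g. spam).
    alpha:  pseudo-count added to every word, so unseen words get a
            small nonzero probability instead of zero.
    """
    counts = Counter(tokens)
    seen = sum(counts[w] for w in vocabulary)
    denom = seen + alpha * len(vocabulary)
    return {w: (counts[w] + alpha) / denom for w in vocabulary}

# Toy training data: "meeting" never shows up in spam.
spam_tokens = ["free", "free", "money", "winner"]
vocabulary = {"free", "money", "winner", "meeting"}

p_word_given_spam = smoothed_likelihoods(spam_tokens, vocabulary)
print(p_word_given_spam["free"], p_word_given_spam["meeting"])
```

Without the pseudo-count, "meeting" would get a probability of zero for spam, and any message containing it could never be classified as spam, no matter what else it said.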