Question : In Bayesian statistics, a maximum a posteriori probability (MAP) estimate is a mode of the posterior distribution. The MAP can be used to obtain a point estimate of an unobserved quantity on the basis of empirical data. Why are zero probabilities most often a problem in maximum a posteriori estimation? 1. Zero probabilities skew the model significantly towards rare events 2. Zero probabilities cause the model to be more susceptible to overfitting 3. Zero probabilities force the posterior for a hypothesis to zero regardless of the other evidence 4. Zero probabilities cause divide-by-zero errors when calculating the normalization constant
Explanation: Sometimes we have a priori information about the physical process whose parameters we want to estimate. Such information can come either from correct scientific knowledge of the physical process or from previous empirical evidence. We can encode such prior information as a PDF on the parameter to be estimated. Essentially, we treat the parameter $\theta$ as the value of an RV. The associated probabilities $P(\theta)$ are called the prior probabilities, and we refer to inference based on such priors as Bayesian inference. Bayes' rule gives $P(\theta \mid x) = \frac{P(x \mid \theta)\,P(\theta)}{P(x)}$. The term on the left hand side of the equation is called the posterior. On the right hand side, the numerator is the product of the likelihood term and the prior term. The denominator serves as a normalization term so that the posterior PDF integrates to unity. Suppose there are three facts: 1. If a student is lazy, he has a probability of 0.3 of passing his exam. 2. If a student works hard, he has a probability of 0.8 of passing his exam. 3. 10 percent of students work hard, and 90 percent of students are lazy in Tom's school. Now suppose Tom passed his exam; is he more likely to be lazy or hardworking?
By MAP:
P(lazy|pass) = P(pass|lazy) * P(lazy) / P(pass) = 0.3 * 0.9 / P(pass) = 0.27 / P(pass)
P(hard|pass) = P(pass|hard) * P(hard) / P(pass) = 0.8 * 0.1 / P(pass) = 0.08 / P(pass)
Since P(lazy|pass) > P(hard|pass), Tom is most likely a lazy student.
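For concreteness, here is a minimal Python sketch of that MAP comparison (variable names are illustrative; the numbers come from the three facts above):

# Unnormalized MAP comparison: P(pass) cancels, so we only compare numerators.
p_pass_given_lazy = 0.3    # fact 1: P(pass | lazy)
p_pass_given_hard = 0.8    # fact 2: P(pass | hard)
p_lazy, p_hard = 0.9, 0.1  # fact 3: priors P(lazy), P(hard)

score_lazy = p_pass_given_lazy * p_lazy  # 0.27
score_hard = p_pass_given_hard * p_hard  # 0.08

print("lazy" if score_lazy > score_hard else "hard")  # -> lazy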
Because the likelihood is a product of conditional probabilities (via the chain rule), a single zero probability drives the posterior for that hypothesis to zero, no matter how much other evidence is added.
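As a quick illustration (the word likelihoods and prior below are made up, not taken from the question), one zero factor in the product wipes out every other term:

from math import prod

# Hypothetical per-word likelihoods P(word | spam) for the words in one email;
# one word was never observed in spam, so its estimated probability is 0.
likelihoods = [0.9, 0.8, 0.95, 0.0, 0.85]
prior_spam = 0.5

posterior_score = prior_spam * prod(likelihoods)
print(posterior_score)  # 0.0 -- the single zero dominates regardless of the rest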
Question : Let's say p(j) is the probability of seeing a given word (indexed by j) in a spam email. This is just a ratio of counts, p(j) = n(jc) / n(c), where n(jc) denotes the number of times that word appears in a spam email and n(c) denotes the number of times that word appears in any email. There is a possibility of getting p(j) as exactly 0 or 1. Which of the following methods can help you reduce the chances of getting this probability as 0 or 1?
Explanation: If you think about it, this is just a ratio of counts, p(j) = n(jc) / n(c), where n(jc) denotes the number of times that word appears in a spam email and n(c) denotes the number of times that word appears in any email. Laplace smoothing refers to the idea of replacing our straight-up estimate of p(j) with something a bit fancier:
p(j) = (n(jc) + A) / (n(c) + B)
We might fix A = 1 and B = 10, for example, to prevent the possibility of getting 0 or 1 for a probability, even for a word like "viagra" that otherwise appears only in spam.
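A minimal sketch of that smoothed estimate in Python (the function name and the counts are illustrative; the defaults use the A = 1, B = 10 choice above):

def smoothed_p(n_jc, n_c, A=1.0, B=10.0):
    # p(j) = (n(jc) + A) / (n(c) + B)
    return (n_jc + A) / (n_c + B)

# A word never seen in spam no longer gets probability exactly 0 ...
print(smoothed_p(n_jc=0, n_c=25))   # ~0.029 instead of 0.0
# ... and a word seen only in spam no longer gets probability exactly 1.
print(smoothed_p(n_jc=25, n_c=25))  # ~0.743 instead of 1.0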
Question : Let's say p(j) is the probability of seeing a given word (indexed by j) in a spam email. This is just a ratio of counts, p(j) = n(jc) / n(c), where n(jc) denotes the number of times that word appears in a spam email and n(c) denotes the number of times that word appears in any email. There is a possibility of getting p(j) as exactly 0 or 1. Using the Laplace smoothing method we can reduce the chances of getting this probability as 0 or 1, as below: p(j) = (n(jc) + A) / (n(c) + B)
Then which of the following statements is true regarding A and B?
1. As long as both A > 0 and B > 0, you want very few words to be expected to never appear in spam 2. As long as both A > 0 and B > 0, you want very few words to be expected to always appear in spam 3. 1 and 2 are correct 4. All of the above.
Explanation: Recall that p(j) is the chance that a word is in spam given that the word is in some email. On the one hand, as long as both A > 0 and B > 0, this distribution vanishes at both 0 and 1. This is reasonable: you want very few words to be expected to never appear in spam or to always appear in spam. On the other hand, when A and B are large, the shape of the distribution is bunched in the middle, which reflects the prior that most words are equally likely to appear in spam or outside spam. That doesn't seem true either. A compromise would have A and B be positive but small, like 1/5. That would keep your spam filter from being too overzealous without having the wrong idea. Of course, you could relax this prior as you have more and better data; in general, strong priors are only needed when you don't have sufficient data.
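As a quick illustration of that trade-off (the counts and pseudo-count pairs below are made up), the smoothed estimate (n(jc) + A) / (n(c) + B) is pulled toward the prior guess A / B as A and B grow:

n_jc, n_c = 9, 10  # hypothetical: a word seen 9 times in spam out of 10 occurrences overall

for A, B in [(0.0, 0.0), (0.2, 0.4), (1.0, 2.0), (50.0, 100.0)]:
    p = (n_jc + A) / (n_c + B)
    print(f"A={A}, B={B}: p(j) = {p:.3f}")
# Small positive A and B barely move the estimate away from 0.9,
# while large A and B drag it toward A/B = 0.5, the "equally likely" prior.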
1. Requires less memory 2. Fewer passes through the training data 3. Easy to reverse engineer vectors to determine which original feature mapped to a vector location 4. 1 and 2 are correct 5. All of 1, 2 and 3 are correct