Question : In Bayesian statistics, a maximum a posteriori probability (MAP) estimate is a mode of the posterior distribution. The MAP can be used to obtain a point estimate of an unobserved quantity on the basis of empirical data. Why are zero probabilities most often a problem in maximum a posteriori estimation? 1. Zero probabilities skew the model significantly towards rare events 2. Zero probabilities cause the model to be more susceptible to overfitting 3. Zero probabilities force the posterior for a hypothesis to zero regardless of the other evidence 4. Zero probabilities cause divide-by-zero errors when calculating the normalization constant
Explanation: Sometimes we have a priori information about the physical process whose parameters we want to estimate. Such information can come either from correct scientific knowledge of the physical process or from previous empirical evidence. We can encode such prior information as a PDF on the parameter to be estimated. Essentially, we treat the parameter $\theta$ as the value of an RV. The associated probabilities $P(\theta)$ are called the prior probabilities, and we refer to inference based on such priors as Bayesian inference. Bayes' rule gives $P(\theta \mid x) = \frac{P(x \mid \theta)\,P(\theta)}{P(x)}$. The term on the left hand side of the equation is called the posterior. On the right hand side, the numerator is the product of the likelihood term and the prior term. The denominator serves as a normalization term so that the posterior PDF integrates to unity. Suppose there are three facts: 1. If a student is lazy, he has a probability of 0.3 of passing his exam. 2. If a student works hard, he has a probability of 0.8 of passing his exam. 3. 10 percent of students work hard, and 90 percent of students are lazy in Tom's school. Now suppose Tom passed his exam; is he more likely to be lazy or hardworking?
By MAP:
P(lazy|pass) = P(pass|lazy) * P(lazy) / P(pass) = 0.3 * 0.9 / P(pass) = 0.27 / P(pass)
P(hard|pass) = P(pass|hard) * P(hard) / P(pass) = 0.8 * 0.1 / P(pass) = 0.08 / P(pass)
Since P(lazy|pass) > P(hard|pass), Tom is most likely a lazy student.
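For concreteness, here is a minimal Python sketch of that MAP comparison (variable names are illustrative; the numbers come from the three facts above):

# Unnormalized MAP comparison: P(pass) cancels, so we only compare numerators.
p_pass_given_lazy = 0.3    # fact 1: P(pass | lazy)
p_pass_given_hard = 0.8    # fact 2: P(pass | hard)
p_lazy, p_hard = 0.9, 0.1  # fact 3: priors P(lazy), P(hard)

score_lazy = p_pass_given_lazy * p_lazy  # 0.27
score_hard = p_pass_given_hard * p_hard  # 0.08

print("lazy" if score_lazy > score_hard else "hard")  # -> lazy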
Because the likelihood is a product of conditional probabilities (via the chain rule), a single zero probability drives the posterior for that hypothesis to zero, no matter how much other evidence is added.
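As a quick illustration (the word likelihoods and prior below are made up, not taken from the question), one zero factor in the product wipes out every other term:

from math import prod

# Hypothetical per-word likelihoods P(word | spam) for the words in one email;
# one word was never observed in spam, so its estimated probability is 0.
likelihoods = [0.9, 0.8, 0.95, 0.0, 0.85]
prior_spam = 0.5

posterior_score = prior_spam * prod(likelihoods)
print(posterior_score)  # 0.0 -- the single zero dominates regardless of the rest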
Question : Let's say p(j) is the probability of seeing a given word (indexed by j) in a spam email. This is just a ratio of counts, p(j) = n(jc) / n(c), where n(jc) denotes the number of times that word appears in a spam email and n(c) denotes the number of times that word appears in any email. There is a possibility of getting p(j) as exactly 0 or 1. Which of the following methods can help you reduce the chances of getting this probability as 0 or 1?
Explanation: If you think about it, this is just a ratio of counts, p(j) = n(jc) / n(c), where n(jc) denotes the number of times that word appears in a spam email and n(c) denotes the number of times that word appears in any email. Laplace smoothing refers to the idea of replacing our straight-up estimate of p(j) with something a bit fancier:
p(j) = (n(jc) + A) / (n(c) + B)
We might fix A = 1 and B = 10, for example, to prevent the possibility of getting 0 or 1 for a probability, even for a word like "viagra" that otherwise appears only in spam.
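A minimal sketch of that smoothed estimate in Python (the function name and the counts are illustrative; the defaults use the A = 1, B = 10 choice above):

def smoothed_p(n_jc, n_c, A=1.0, B=10.0):
    # p(j) = (n(jc) + A) / (n(c) + B)
    return (n_jc + A) / (n_c + B)

# A word never seen in spam no longer gets probability exactly 0 ...
print(smoothed_p(n_jc=0, n_c=25))   # ~0.029 instead of 0.0
# ... and a word seen only in spam no longer gets probability exactly 1.
print(smoothed_p(n_jc=25, n_c=25))  # ~0.743 instead of 1.0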
Question : Let's say p(j) is the probability of seeing a given word (indexed by j) in a spam email. This is just a ratio of counts, p(j) = n(jc) / n(c), where n(jc) denotes the number of times that word appears in a spam email and n(c) denotes the number of times that word appears in any email. There is a possibility of getting p(j) as exactly 0 or 1. Using the Laplace smoothing method we can reduce the chances of getting this probability as 0 or 1, as below: p(j) = (n(jc) + A) / (n(c) + B)
Then which of the following statements is true regarding A and B?
1. As long as both A > 0 and B > 0, you want very few words to be expected to never appear in spam 2. As long as both A > 0 and B > 0, you want very few words to be expected to always appear in spam 3. 1 and 2 are correct 4. All of the above.
Explanation: Recall that p(j) is the chance that a word is in spam given that the word is in some email. On the one hand, as long as both A > 0 and B > 0, this distribution vanishes at both 0 and 1. This is reasonable: you want very few words to be expected to never appear in spam or to always appear in spam. On the other hand, when A and B are large, the shape of the distribution is bunched in the middle, which reflects the prior that most words are equally likely to appear in spam or outside spam. That doesn't seem true either. A compromise would have A and B be positive but small, like 1/5. That would keep your spam filter from being too overzealous without having the wrong idea. Of course, you could relax this prior as you have more and better data; in general, strong priors are only needed when you don't have sufficient data.
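As a quick illustration of that trade-off (the counts and pseudo-count pairs below are made up), the smoothed estimate (n(jc) + A) / (n(c) + B) is pulled toward the prior guess A / B as A and B grow:

n_jc, n_c = 9, 10  # hypothetical: a word seen 9 times in spam out of 10 occurrences overall

for A, B in [(0.0, 0.0), (0.2, 0.4), (1.0, 2.0), (50.0, 100.0)]:
    p = (n_jc + A) / (n_c + B)
    print(f"A={A}, B={B}: p(j) = {p:.3f}")
# Small positive A and B barely move the estimate away from 0.9,
# while large A and B drag it toward A/B = 0.5, the "equally likely" prior.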
1. Requires less memory 2. Fewer passes through the training data 3. Easy to reverse engineer vectors to determine which original feature mapped to a vector location 4. 1 and 2 are correct 5. All of 1, 2 and 3 are correct