Question : You have been given two training datasets for different courses in the same subject, HEPop1 and HEPop2. You need to determine whether both populations are the same, so you decide to use hypothesis testing with the t-test. You assume that both populations have the same distribution and that their variance is not known. To conduct the t-test you need to calculate the t-statistic. Which of the following would you require to calculate it? A. Sample mean B. Sample size C. Standard deviation of the sample D. Population size E. Mean of each population F. Sample average of mean values
1. A,B,C 2. C,D,E 3. D,E,F 4. A,B,E 5. A,C,F
Correct Answer : 1 Explanation: When you conduct hypothesis testing using the t-test, you work with samples and not the actual population, because the population is generally too large to test directly. Hence, options D and E are out. If each population is normally distributed with the same mean (u1=u2) and the same variance, then the t-statistic follows a t-distribution with n1+n2-2 degrees of freedom, i.e. (sample size from HEPop1 + sample size from HEPop2 - 2). The shape of the t-distribution is similar to the normal distribution, and once the degrees of freedom reach about 30 or more, the t-distribution is nearly identical to the normal distribution. If the absolute value of the t-statistic is large, you can reject the null hypothesis (u1=u2). One of the key assumptions of Student's t-test is that the variances of the two populations are equal.
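The role of options A, B and C can be illustrated with a minimal sketch of the pooled (equal-variance) two-sample t-statistic. The function name and the score lists below are made up for illustration and are not part of the question; only the sample means, sample sizes and sample standard deviations enter the calculation.

```python
import math
import statistics


def pooled_t_statistic(sample1, sample2):
    """Two-sample t-statistic assuming equal population variances."""
    n1, n2 = len(sample1), len(sample2)                    # B. sample sizes
    mean1 = statistics.mean(sample1)                       # A. sample means
    mean2 = statistics.mean(sample2)
    sd1 = statistics.stdev(sample1)                        # C. sample standard deviations
    sd2 = statistics.stdev(sample2)

    # Pooled variance: weighted average of the two sample variances.
    pooled_var = ((n1 - 1) * sd1 ** 2 + (n2 - 1) * sd2 ** 2) / (n1 + n2 - 2)
    std_err = math.sqrt(pooled_var * (1 / n1 + 1 / n2))

    t = (mean1 - mean2) / std_err
    df = n1 + n2 - 2                                       # degrees of freedom
    return t, df


# Hypothetical course scores, invented for this example.
hepop1_scores = [72, 85, 78, 90, 66, 81]
hepop2_scores = [70, 75, 68, 80, 74, 71]

t_stat, dof = pooled_t_statistic(hepop1_scores, hepop2_scores)
print("t-statistic:", t_stat, "degrees of freedom:", dof)
```

Neither the population size nor the population means appear anywhere in the function, which is why options D and E are not needed.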
Question : You are working at a healthcare company in America and want to measure people's weights. You cannot measure the weight of every person in the entire population. Which of the following formulas would be helpful for this requirement? 1. A 2. B 3. C 4. D
Correct Answer : 4 Explanation: To describe the weight of the population, you can use the sample variance, because it is not possible to weigh the entire population. Variance is the average of the squared differences from the mean (for a sample, the sum of squared differences is divided by n - 1). However, variance on its own is not that useful for drawing conclusions about weight, because its values are in squared units. Hence, you take the square root of the variance; that value is called the standard deviation. Suppose your standard deviation is 80; then the variance is 6400. Suppose the mean of the sample is 140. Then you can conclude that the majority of people's weights fall between (mean - sd) and (mean + sd). In this case that is (140 - 80) to (140 + 80) = 60 to 220.
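The arithmetic can be checked with a short sketch. The helper names and the list of weights are assumptions made for illustration; the mean of 140 and standard deviation of 80 come from the explanation.

```python
import math


def sample_variance(values):
    """Sum of squared differences from the mean, divided by n - 1."""
    n = len(values)
    mean = sum(values) / n
    return sum((x - mean) ** 2 for x in values) / (n - 1)


def one_sd_range(mean, sd):
    """Interval from (mean - sd) to (mean + sd)."""
    return mean - sd, mean + sd


# Values from the explanation: sd = 80, mean = 140.
sd = 80
variance = sd ** 2                 # standard deviation squared
low, high = one_sd_range(140, sd)
print("variance:", variance, "range:", low, "to", high)

# Hypothetical sample of weights, invented for this example.
weights = [120, 150, 135, 160, 145]
sample_sd = math.sqrt(sample_variance(weights))
print("sample standard deviation:", sample_sd)
```

With a standard deviation of 80, the variance is 80 * 80 = 6400 and the one-standard-deviation range around 140 runs from 60 to 220.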
Question : Which of the following hypothesis tests would be more reliable when you have unequal variances and unequal sample sizes?
1. Student's t-test
2. Welch's t-test
3. Logistic regression
4. Linear regression
5. Wilcoxon rank-sum test
Correct Answer : 2 Explanation: Logistic regression and linear regression are not hypothesis tests, so you can discard those two options. The Wilcoxon rank-sum test is a nonparametric, rank-based test rather than a t-test. The remaining two options are t-tests. Here is how Wikipedia describes Welch's t-test: In statistics, Welch's t-test, or unequal variances t-test, is a two-sample location test which is used to test the hypothesis that two populations have equal means. Welch's t-test is an adaptation of Student's t-test, and is more reliable when the two samples have unequal variances and unequal sample sizes. These tests are often referred to as "unpaired" or "independent samples" t-tests, as they are typically applied when the statistical units underlying the two samples being compared are non-overlapping. Given that Welch's t-test has been less popular than Student's t-test [2] and may be less familiar to readers, a more informative name is "Welch's unequal variances t-test" or "unequal variances t-test" for brevity. Student's t-test is used when we assume that the two samples have equal variance.
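As a rough sketch of how Welch's test differs from Student's, the statistic below uses each sample's own variance instead of a pooled variance, and the degrees of freedom come from the Welch-Satterthwaite approximation. The function name and the two samples are assumptions made for illustration.

```python
import math
import statistics


def welch_t_test(sample1, sample2):
    """Welch's t-statistic and approximate degrees of freedom."""
    n1, n2 = len(sample1), len(sample2)
    var1 = statistics.variance(sample1)   # sample variance (n - 1 divisor)
    var2 = statistics.variance(sample2)

    # Per-sample squared standard errors; no pooling of variances.
    se1, se2 = var1 / n1, var2 / n2

    t = (statistics.mean(sample1) - statistics.mean(sample2)) / math.sqrt(se1 + se2)

    # Welch-Satterthwaite approximation for the degrees of freedom.
    df = (se1 + se2) ** 2 / (se1 ** 2 / (n1 - 1) + se2 ** 2 / (n2 - 1))
    return t, df


# Hypothetical samples with unequal sizes and visibly different spread.
group_a = [12.1, 11.8, 12.4, 12.0, 11.9, 12.2, 12.3, 12.0]
group_b = [10.5, 14.2, 9.8, 13.9]

t_stat, dof = welch_t_test(group_a, group_b)
print("Welch t-statistic:", t_stat, "approximate df:", dof)
```

If SciPy is installed, `scipy.stats.ttest_ind(group_a, group_b, equal_var=False)` runs the same test and also reports a p-value.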
1. {grape, apple, orange} must be a frequent itemset. 2. {banana, apple, grape, orange} must be a frequent itemset. 4. {banana, apple} => {orange} must be a relevant rule.
similar interests. For example, association rules may suggest that those customers who have bought product A have also bought product B, or those customers who have bought products A, B, and C are more similar to this customer. These findings provide opportunities for retailers to cross-sell their products. Association rule mining is primarily focused on finding frequent co-occurring associations among a collection of items. It is sometimes referred to as "Market Basket Analysis", since that was the original application area of association mining. The goal is to find associations of items that occur together more often than you would expect from a random sampling of all possibilities. The classic example of this is the famous Beer and Diapers association that is often mentioned in data mining books. The story goes like this: men who go to the store to buy diapers will also tend to buy beer at the same time. Let us illustrate this with a simple example. Suppose that a store's retail transactions database includes the following information:
There are 600,000 transactions in total.
7,500 transactions contain diapers (1.25 percent).
60,000 transactions contain beer (10 percent).
6,000 transactions contain both diapers and beer (1.0 percent).
If there were no association between beer and diapers (i.e., they are statistically independent), then we would expect only 10% of diaper purchasers to also buy beer (since 10% of all customers buy beer). However, we discover that 80% (=6000/7500) of diaper purchasers also buy beer. This is a factor of 8 increase over what was expected. That factor is called Lift, which is the ratio of the observed frequency of co-occurrence to the expected frequency. It was determined simply by counting the transactions in the database. So, in this case, the association rule would state that diaper purchasers will also buy beer with a Lift factor of 8. In statistics, Lift is estimated by the ratio of the joint probability of two items x and y, divided by the product of their individual probabilities: Lift = P(x,y)/[P(x)P(y)]. If the two items are statistically independent, then P(x,y)=P(x)P(y), corresponding to Lift = 1. Note that anti-correlation yields Lift values less than 1, which is also an interesting discovery, corresponding to mutually exclusive items that rarely co-occur.
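The Lift calculation for the beer and diapers example can be reproduced from the transaction counts. This is a minimal sketch; the function and variable names are assumptions made for illustration.

```python
def lift(n_total, n_x, n_y, n_xy):
    """Lift = P(x, y) / (P(x) * P(y)), estimated from transaction counts."""
    p_x = n_x / n_total      # share of transactions containing x
    p_y = n_y / n_total      # share of transactions containing y
    p_xy = n_xy / n_total    # share of transactions containing both
    return p_xy / (p_x * p_y)


# Counts from the example: total, diapers, beer, both.
beer_diaper_lift = lift(600_000, 7_500, 60_000, 6_000)

# Share of diaper purchasers who also buy beer (6000/7500).
beer_share_among_diaper_buyers = 6_000 / 7_500

print("lift:", round(beer_diaper_lift, 6))
print("beer share among diaper buyers:", beer_share_among_diaper_buyers)
```

A result close to 8 matches the factor quoted in the text, and a value of 1 would indicate that the two items are statistically independent.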
Question : When would you use a Wilcoxon Rank Sum test? 1. When the data can easily be sorted 2. When the populations represent the sums of other values 4. When you cannot make an assumption about the distribution of the populations
1. It writes the output of the Map function to storage 2. It breaks the input into smaller components and distributes to other nodes in the cluster 4. It distributes the input to multiple nodes for processing