Question : The web analytics team uses Hadoop to process access logs. They now want to correlate this data with structured user data residing in their massively parallel database. Which tool should they use to export the structured data from Hadoop?
Explanation: Sqoop is a tool designed to transfer data between Hadoop and relational databases. You can use Sqoop to import data from a relational database management system (RDBMS) such as MySQL or Oracle into the Hadoop Distributed File System (HDFS), transform the data in Hadoop MapReduce, and then export the data back into an RDBMS.
Sqoop automates most of this process, relying on the database to describe the schema for the data to be imported. Sqoop uses MapReduce to import and export the data, which provides parallel operation as well as fault tolerance.
This document describes how to get started using Sqoop to move data between databases and Hadoop and provides reference information for the operation of the Sqoop command-line tool suite. This document is intended for:
System and application programmers
System administrators
Database administrators
Data analysts
Data engineers
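For the export direction described above (moving processed results from HDFS into a relational table), the following is a minimal sketch that invokes the sqoop export command from Python via subprocess. The JDBC URL, table name, HDFS directory, credentials file, and mapper count are hypothetical placeholders, not values taken from this scenario.

```python
# Minimal sketch: calling "sqoop export" to push processed results from HDFS
# into a relational table. All connection details below are hypothetical.
import subprocess

sqoop_export = [
    "sqoop", "export",
    "--connect", "jdbc:mysql://db.example.com/analytics",  # target RDBMS (hypothetical)
    "--table", "user_summary",                             # destination table
    "--export-dir", "/user/hive/warehouse/user_summary",   # HDFS directory holding the results
    "--username", "analyst",
    "--password-file", "/user/analyst/.db_password",       # keeps the password off the command line
    "--num-mappers", "4",                                   # number of parallel export map tasks
]

subprocess.run(sqoop_export, check=True)  # raises CalledProcessError if Sqoop exits non-zero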
Question : When would you prefer a Naive Bayes model to a logistic regression model for classification?
1. When some of the input variables might be correlated
2. When all the input variables are numerical.
3. Access Mostly Uused Products by 50000+ Subscribers
4. When you are using several categorical input variables with over 1000 possible values each.
Explanation: Naive Bayes is a simple technique for constructing classifiers: models that assign class labels to problem instances, represented as vectors of feature values, where the class labels are drawn from some finite set. It is not a single algorithm for training such classifiers, but a family of algorithms based on a common principle: all naive Bayes classifiers assume that the value of a particular feature is independent of the value of any other feature, given the class variable. For example, a fruit may be considered to be an apple if it is red, round, and about 3" in diameter. A naive Bayes classifier considers each of these features to contribute independently to the probability that this fruit is an apple, regardless of any possible correlations between the color, roundness, and diameter features.
For some types of probability models, naive Bayes classifiers can be trained very efficiently in a supervised learning setting. In many practical applications, parameter estimation for naive Bayes models uses the method of maximum likelihood; in other words, one can work with the naive Bayes model without accepting Bayesian probability or using any Bayesian methods.
Despite their naive design and apparently oversimplified assumptions, naive Bayes classifiers have worked quite well in many complex real-world situations. In 2004, an analysis of the Bayesian classification problem showed that there are sound theoretical reasons for the apparently implausible efficacy of naive Bayes classifiers.[5] Still, a comprehensive comparison with other classification algorithms in 2006 showed that Bayes classification is outperformed by other approaches, such as boosted trees or random forests.[6]
An advantage of naive Bayes is that it requires only a small amount of training data to estimate the parameters necessary for classification. This is why it is often preferred over logistic regression when there are several categorical input variables, each with a very large number of possible values: naive Bayes only needs per-level class-conditional frequencies, whereas logistic regression would need a dummy variable, and correspondingly more data, for every level.
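To make the independence assumption concrete, here is a minimal sketch using scikit-learn's CategoricalNB on a tiny, made-up fruit dataset; the integer encoding and the data themselves are purely illustrative.

```python
# Minimal sketch: a naive Bayes classifier over categorical features.
# The data and encodings are made up for illustration.
import numpy as np
from sklearn.naive_bayes import CategoricalNB

# Columns: color (0=red, 1=yellow), shape (0=round, 1=long), size (0=small, 1=medium, 2=large)
X = np.array([
    [0, 0, 1],   # red, round, medium   -> apple
    [0, 0, 2],   # red, round, large    -> apple
    [1, 1, 0],   # yellow, long, small  -> banana
    [1, 1, 1],   # yellow, long, medium -> banana
])
y = np.array(["apple", "apple", "banana", "banana"])

clf = CategoricalNB()          # each feature contributes independently, given the class
clf.fit(X, y)

print(clf.predict([[0, 0, 2]]))        # expected: ['apple']
print(clf.predict_proba([[0, 0, 2]]))  # per-class probabilities
```

Each class-conditional probability here is estimated from simple smoothed counts, which is why relatively little training data is needed even when a categorical feature has many possible values.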
Question : Before you build an ARMA model, how can you tell if your time series is weakly stationary?
Explanation: A time series is weakly stationary when its mean and variance are constant over time and the autocovariance between X_t and X_{t+h} depends only on the lag h, not on t. In practice you can check this by plotting the series (and its autocorrelation function) to confirm there is no trend or changing variance, or by applying a unit-root test such as the augmented Dickey-Fuller test. In the statistical analysis of time series, autoregressive-moving-average (ARMA) models provide a parsimonious description of a (weakly) stationary stochastic process in terms of two polynomials, one for the autoregression and the second for the moving average. Given a time series of data X_t, the ARMA model is a tool for understanding and, perhaps, predicting future values in this series. The model consists of two parts, an autoregressive (AR) part and a moving average (MA) part, and is usually referred to as the ARMA(p,q) model, where p is the order of the autoregressive part and q is the order of the moving average part. There are a number of modelling options to account for a non-constant variance, for example ARCH (and GARCH, and their many extensions) or stochastic volatility models.
An ARCH model extends an ARMA model with an additional time series equation for the squared error term. These models tend to be fairly easy to estimate (the fGarch R package, for example).
Stochastic volatility (SV) models extend ARMA models with an additional time series equation (usually an AR(1)) for the log of the time-dependent variance. I have found these models are best estimated using Bayesian methods (OpenBUGS has worked well for me in the past).
You can also fit an ARIMA model, but first you need to stabilize the variance by applying a suitable transformation, such as a Box-Cox transformation. This approach is taken in Time Series Analysis: With Applications in R (page 99), where a Box-Cox transformation is applied before Box-Jenkins modelling. Another reference is page 169 of Introduction to Time Series and Forecasting by Brockwell and Davis: "Once the data have been transformed (e.g., by some combination of Box-Cox and differencing transformations or by removal of trend and seasonal components) to the point where the transformed series X_t can potentially be fitted by a zero-mean ARMA model, we are faced with the problem of selecting appropriate values for the orders p and q." Therefore, you need to stabilize the variance prior to fitting the ARIMA model.
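As a worked illustration of the stationarity check described above, here is a minimal sketch (assuming statsmodels is available) that runs an augmented Dickey-Fuller test on a simulated AR(1) series and, if the unit-root null is rejected, fits an ARMA model; the simulated data and the chosen orders are illustrative only.

```python
# Minimal sketch: check weak stationarity with an augmented Dickey-Fuller test,
# then fit an ARMA(p, q) model. The simulated series and orders are illustrative.
import numpy as np
from statsmodels.tsa.stattools import adfuller
from statsmodels.tsa.arima.model import ARIMA

rng = np.random.default_rng(0)
phi = 0.5
x = np.zeros(500)
for t in range(1, 500):            # AR(1) process, weakly stationary for |phi| < 1
    x[t] = phi * x[t - 1] + rng.normal()

stat, pvalue, *_ = adfuller(x)     # null hypothesis: the series has a unit root
print(f"ADF statistic = {stat:.2f}, p-value = {pvalue:.3f}")

if pvalue < 0.05:                  # unit root rejected: treat the series as stationary
    # ARMA(1, 1) is ARIMA(1, 0, 1): no differencing once the series is stationary
    result = ARIMA(x, order=(1, 0, 1)).fit()
    print(result.summary())
```

If the test fails to reject the unit root, or the variance is clearly not constant, difference or transform the series (e.g., with a Box-Cox transformation) before fitting, as the references quoted above recommend.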