Correct Answer : Get Latest Questions and Answer : Exp: Business User: Someone who understands the domain area and usually benefits from the results. This person can consult and advise the project team on the context of the project, the value of the results, and how the outputs will be operationalized. Usually a business analyst, line manager, or deep subject-matter expert in the project domain fulfills this role.
Question : What is LOESS used for? 1. It plots a continuous variable versus a discrete variable, to compare distributions across classes. 2. It is a significance test for the correlation between two variables. 3. Access Mostly Used Products by 50000+ Subscribers 4. It is run after a one-way ANOVA, to determine which population has the highest mean value.
Correct Answer : Exp: The loess() function can be used to fit a nonlinear line to the data. LOESS and LOWESS (locally weighted scatterplot smoothing) are two strongly related non-parametric regression methods that combine multiple regression models in a k-nearest-neighbor-based meta-model. "LOESS" is a later generalization of LOWESS; although it is not a true initialism, it may be understood as standing for "LOcal regrESSion". LOESS and LOWESS thus build on "classical" methods, such as linear and nonlinear least squares regression. They address situations in which the classical procedures do not perform well or cannot be effectively applied without undue labor. LOESS combines much of the simplicity of linear least squares regression with the flexibility of nonlinear regression. It does this by fitting simple models to localized subsets of the data to build up a function that describes the deterministic part of the variation in the data, point by point. In fact, one of the chief attractions of this method is that the data analyst is not required to specify a global function of any form to fit a model to the data, only to fit segments of the data.
The trade-off for these features is increased computation. Because it is so computationally intensive, LOESS would have been practically impossible to use in the era when least squares regression was being developed. Most other modern methods for process modeling are similar to LOESS in this respect. These methods have been consciously designed to use our current computational ability to the fullest possible advantage to achieve goals not easily achieved by traditional approaches. A smooth curve through a set of data points obtained with this statistical technique is called a Loess Curve, particularly when each smoothed value is given by a weighted quadratic least squares regression over the span of values of the y-axis scattergram criterion variable. When each smoothed value is given by a weighted linear least squares regression over the span, this is known as a Lowess curve; however, some authorities treat Lowess and Loess as synonyms. This is a method for fitting a smooth curve between two variables, or fitting a smooth surface between an outcome and up to four predictor variables.
The procedure originated as LOWESS (LOcally WEighted Scatter-plot Smoother). Since then it has been extended as a modelling tool because it has some useful statistical properties (Cleveland, 1998). This is a nonparametric method because the linearity assumptions of conventional regression methods have been relaxed. Instead of estimating parameters like m and c in y = mx + c, a nonparametric regression focuses on the fitted curve. The fitted points and their standard errors are estimated with respect to the whole curve rather than a particular estimate. So, the overall uncertainty is measured as how well the estimated curve fits the population curve. It is called local regression because the fitting at, say, point x is weighted toward the data nearest to x. The distance from x that is considered near to it is controlled by the span setting, a. When a is less than 1 it represents the proportion of the data that is considered to be neighbouring x, and the weighting that is used is proportional to (1-(distance/maximum distance)^3)^3, which is known as tricube. When a is greater than 1 all of the points are used and the maximum distance is taken as a^(1/p) times the observed maximum distance for p predictors. The default span is a = 0.75. If you choose a span that is too small then there will be insufficient data near x for an accurate fit, resulting in a large variance. If the span is too large then the regression will be over-smoothed, resulting in a loss of information, hence a large bias. The trade-off between bias and variance also depends on the degree of the polynomial selected. A high degree will provide a better approximation of the population mean, so less bias, but there are more factors to consider in the model, resulting in greater variance. The default degree is 2 (quadratic). Higher degrees don't improve the fit much. The lower degree (i.e. 1, linear) has more bias but pulls back variance at the boundaries.
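The span and tricube weighting described above can be sketched in a few lines of code. The following is a minimal illustrative LOWESS (degree 1, i.e. a local linear fit) written directly from the formulas in this section; it is not the actual R loess() implementation, and the function names are ours:

```python
# Minimal LOWESS sketch: for each target point x0, fit a weighted linear
# regression over the nearest fraction `span` of the data, weighting each
# neighbour by the tricube kernel w = (1 - (d/d_max)^3)^3.

def tricube(d, d_max):
    """Tricube weight for a point at distance d, given the maximum distance."""
    if d_max == 0:
        return 1.0
    u = min(abs(d) / d_max, 1.0)
    return (1 - u ** 3) ** 3

def lowess_point(x0, xs, ys, span=0.75):
    """Smoothed value at x0 via tricube-weighted linear least squares."""
    n = len(xs)
    k = max(2, int(round(span * n)))  # size of the local neighbourhood
    nearest = sorted(range(n), key=lambda i: abs(xs[i] - x0))[:k]
    d_max = max(abs(xs[i] - x0) for i in nearest)
    w = [tricube(xs[i] - x0, d_max) for i in nearest]
    # Weighted least squares for y = a + b*x over the neighbourhood
    sw = sum(w)
    mx = sum(wi * xs[i] for wi, i in zip(w, nearest)) / sw
    my = sum(wi * ys[i] for wi, i in zip(w, nearest)) / sw
    sxx = sum(wi * (xs[i] - mx) ** 2 for wi, i in zip(w, nearest))
    sxy = sum(wi * (xs[i] - mx) * (ys[i] - my) for wi, i in zip(w, nearest))
    b = sxy / sxx if sxx > 0 else 0.0
    a = my - b * mx
    return a + b * x0
```

On exactly linear data the local weighted fit recovers the line; on noisy data, evaluating lowess_point over a grid of x values traces a smooth curve through the scatter, and shrinking span makes the curve wigglier (lower bias, higher variance), as discussed above.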
There is no substitute for thinking carefully about what you are plotting, testing different settings of span and polynomial degree, and selecting the most plausible fit by eye. The summary statistics also give an indication of how well the model fits.
The concept of degrees of freedom for nonparametric models is complex. They approximate the parametric concept of degrees of freedom empirically and result in numbers that are not necessarily integers. The assumptions are: around point x the mean of y can be approximated by a small class of parametric functions, as in polynomial regression; the errors in estimating y are independent and randomly distributed with a mean of zero; and bias and variance are traded off by the choices for the settings of span and degree of polynomial. Technical Validation : The LOESS function in the {stats} package of R is called. You must have R installed on the computer from which you are running StatsDirect. The R implementation is based on the cloess algorithm, for which the original authors have a NETLIB site.
Question : Which word or phrase completes the statement? Mahout is to Hadoop as MADlib is to ____________ .
Correct Answer : Exp: Key philosophies driving the architecture of MADlib are: operate on the data locally, in-database, and do not move it between multiple runtime environments unnecessarily; utilize best-of-breed database engines, but separate the machine-learning logic from database-specific implementation details; leverage MPP shared-nothing technology, such as the Pivotal Greenplum Database, to provide parallelism and scalability; and maintain an open implementation with active ties into ongoing academic research. Classification :
When the desired output is categorical in nature we use classification methods to build a model that predicts which of the various categories a new result would fall into. The goal of classification is to correctly label incoming records with the correct class for each record. Example: If we had data that described various demographic features of individuals applying for loans, and we had historical data that included which past loans had defaulted, then we could build a model that described the likelihood that a new set of demographic data would result in a loan default. In this case the categories are "will default" or "won't default", which are two discrete classes of output. Regression : When the desired output is continuous in nature we use regression methods to build a model that predicts the output value. Example: If we had data that described properties of real estate listings then we could build a model to predict the sale value for homes based on the known characteristics of the houses. This is a regression because the output response is continuous in nature rather than categorical. Clustering : Here we try to identify groups of data such that the items within one cluster are more similar to each other than they are to the items in any other cluster. Example: In customer segmentation analysis the goal is to identify specific groups of customers that behave in a similar fashion so that various marketing campaigns can be designed to reach these markets. When the customer segments are known in advance this would be a supervised classification task; when we let the data itself identify the segments, this is a clustering task.
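The clustering idea above, grouping items so that each one is closest to its own cluster's centre, can be sketched with a tiny one-dimensional k-means. This is a hypothetical illustration of the technique, not MADlib's implementation, and the helper name is ours:

```python
# Illustrative 1-D k-means: repeatedly assign each value to its nearest
# centroid, then move each centroid to the mean of its assigned values.
def kmeans_1d(values, k, iters=20):
    """Group numeric values into k clusters; returns (centroids, clusters)."""
    # Crude initialisation: spread starting centroids across the sorted data
    centroids = sorted(values)[::max(1, len(values) // k)][:k]
    clusters = [[] for _ in range(k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for v in values:
            idx = min(range(k), key=lambda c: abs(v - centroids[c]))
            clusters[idx].append(v)
        # Keep the old centroid if a cluster ends up empty
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return centroids, clusters
```

With values such as [1, 2, 3, 10, 11, 12] and k=2, the two clusters separate the low and high groups, each member closer to its own centroid than to the other, which is exactly the similarity criterion described above.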
Topic Modeling : Topic modeling is similar to clustering in that it attempts to identify clusters of documents that are similar to each other, but it is more specialized: in a text domain it also tries to identify the main themes of those documents. Association Rule Mining : Also called market basket analysis or frequent itemset mining, this attempts to identify which items tend to occur together more frequently than random chance would indicate, suggesting an underlying relationship between the items. Example: In an online web store, association rule mining can be used to identify what products tend to be purchased together. This can then be used as input into a product recommender engine to suggest items that may be of interest to the customer and provide upsell opportunities. Descriptive Statistics : Descriptive statistics don't produce a model and thus are not considered a learning method, but they can help an analyst understand the underlying data and can provide valuable insights that may influence the choice of data model. Example: Calculating the distribution of data within each variable of a dataset can help an analyst understand which variables should be treated as categorical and which as continuous, as well as what sort of distribution the values fall in. Validation : Using a model without understanding its accuracy can lead to disastrous consequences. For that reason it is important to understand the error of a model and to evaluate the model for accuracy on testing data. Frequently in data analysis a separation is made between training data and testing data solely for the purpose of providing a statistically valid assessment of the model and ensuring that the model is not over-fitting the training data. N-fold cross validation is also frequently utilized.
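The train/test separation and N-fold cross-validation mentioned above can be sketched as an index-splitting helper: each record lands in exactly one test fold and in the training set of every other fold. This is an illustrative helper we wrote for this sketch, not a specific library API:

```python
# N-fold cross-validation splitting: round-robin the record indices into
# n_folds groups, then yield each group once as the held-out test set
# with the remaining groups concatenated as the training set.
def kfold_indices(n_records, n_folds):
    """Yield (train_indices, test_indices) pairs, one per fold."""
    folds = [list(range(i, n_records, n_folds)) for i in range(n_folds)]
    for i, test in enumerate(folds):
        train = [j for f in folds[:i] + folds[i + 1:] for j in f]
        yield train, test
```

A model is then fitted n_folds times, once per split, and the per-fold test errors are averaged; because every record is held out exactly once, the averaged error is a statistically honest estimate that guards against over-fitting the training data.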
Related Questions
Question : You have completed your model and are handing it off to be deployed in production. What should you deliver to the production team, along with your commented code? 1. The production team needs to understand how your model will interact with the processes they already support. Give them documentation on expected model inputs and outputs, and guidance on error handling. 2. The production team is technical, and they need to understand how the processes that they support work, so give them the same presentation that you prepared for the analysts. 3. Access Mostly Used Products by 50000+ Subscribers to understand how your model interacts with the processes they already support. Give them the same presentation that you prepared for the project sponsor. 4. The production team supports the processes that run the organization, and they need context to understand how your model interacts with the processes they already support. Give them the executive summary.
1. A larger sample size should be taken to determine if the plant is operating correctly 2. The manufacturing process is functioning properly and no further action is required 3. Access Mostly Used Products by 50000+ Subscribers 4. There is a flaw in the quality assurance process and the sample should be repeated