Question : You are studying the behavior of a population, and you are provided with multidimensional data at the individual level. You have identified four specific individuals who are valuable to your study, and you would like to find all users who are most similar to each individual. Which algorithm is most appropriate for this study? 1. Association rules 2. Decision trees 3. (option text not available in the source) 4. K-means clustering
Explanation: The correct answer is option 4, k-means clustering. k-means uses an iterative algorithm that minimizes the sum of distances from each object to its cluster centroid, over all clusters. The algorithm moves objects between clusters until this sum cannot be decreased further, yielding a set of clusters that are as compact and well separated as possible. You can control the details of the minimization through several optional input parameters to k-means, including the initial values of the cluster centroids and the maximum number of iterations. Clustering is primarily an exploratory technique for discovering hidden structure in the data, often as a prelude to more focused analysis or decision processes. Specific applications of k-means include image processing, medicine, and customer segmentation. Clustering is also often used as a lead-in to classification: once the clusters are identified, labels can be applied to each cluster to classify each group based on its characteristics. Marketing and sales groups use k-means to identify customers with similar behaviors and spending patterns.
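The assign-then-update loop described above can be sketched in a few lines of plain Python. This is an invented minimal illustration, not the MADlib or any library implementation; it takes explicit initial centroids, mirroring the optional initial-centroid parameter mentioned in the explanation, so you could seed one cluster at each individual of interest and read off the most similar users from the cluster assignments:

```python
def kmeans(points, centroids, iters=20):
    """Minimal k-means sketch: repeatedly assign each point to its
    nearest centroid, then move each centroid to the mean of its
    assigned points, until the assignments stabilize."""
    centroids = list(centroids)
    k = len(centroids)
    for _ in range(iters):
        # Assignment step: index of the nearest centroid (squared distance).
        labels = [min(range(k),
                      key=lambda j: sum((p - c) ** 2
                                        for p, c in zip(pt, centroids[j])))
                  for pt in points]
        # Update step: recompute each centroid as the mean of its cluster.
        for j in range(k):
            members = [pt for pt, lab in zip(points, labels) if lab == j]
            if members:
                centroids[j] = tuple(sum(dim) / len(members)
                                     for dim in zip(*members))
    return labels, centroids

# Toy data: two well-separated groups; seed one centroid in each.
data = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1),
        (5.0, 5.0), (5.1, 4.9), (4.9, 5.1)]
labels, centroids = kmeans(data, centroids=[(0.0, 0.0), (5.0, 5.0)])
```

Here every point in the first group gets label 0 and every point in the second gets label 1; with real study data, the "users most similar to each individual" are simply the points sharing that individual's cluster label.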
Question : You are using MADlib for Linear Regression analysis. Which value does the statement return? SELECT (linregr(depvar, indepvar)).r2 FROM zeta1;
Correct Answer : The R-squared value (coefficient of determination) of the fitted linear model.
Explanation: Ordinary least-squares (OLS) linear regression refers to a stochastic model in which the conditional mean of the dependent variable (usually denoted y) is an affine function of the vector of independent variables (usually denoted x): E[y | x] = c^T x for some unknown vector of coefficients c. The assumption is that the residuals are i.i.d. Gaussian; that is, the conditional distribution of y given x is normal with mean c^T x and constant variance. OLS linear regression finds the vector of coefficients c that maximizes the likelihood of the observations. Ordinary Least Squares Regression, also called Linear Regression, is a statistical model used to fit linear models: it models a linear relationship between a scalar dependent variable and one or more explanatory independent variables to build a model of coefficients.
Training function : linregr_train(source_table, out_table, dependent_varname, independent_varname, input_group_cols := NULL, heteroskedasticity_option := NULL)
source_table : Text value. The name of the table containing the training data.
out_table : Text value. Name of the generated table containing the output model.
dependent_varname : Text value. Expression to evaluate for the dependent variable.
independent_varname : Text value. Expression list to evaluate for the independent variables. An intercept variable is not assumed; it is common to provide an explicit intercept term by including a single constant 1 term in the independent variable list.
input_group_cols : Text value. An expression list used to group the input dataset into discrete groups, running one regression per group (similar to the SQL GROUP BY clause). When this value is NULL, no grouping is used and a single result model is generated. Default value: NULL.
heteroskedasticity_option : Boolean value. When True, the heteroskedasticity of the model is also calculated and returned with the results. Default value: False.
Output table : The output table produced by the linear regression training function contains the following columns (plus any grouping columns provided during training, present only if the grouping option is used).
coef : Float array. Vector of the coefficients of the regression.
r2 : Float. R-squared coefficient of determination of the model.
std_err : Float array. Vector of the standard errors of the coefficients.
t_stats : Float array. Vector of the t-statistics of the coefficients.
p_values : Float array. Vector of the p-values of the coefficients.
condition_no : Float. The condition number of the design matrix. A high condition number is usually an indication of numeric instability in the result, yielding a less reliable model. It often arises when there is a significant amount of collinearity in the underlying design matrix, in which case other regression techniques, such as elastic net regression, may be more appropriate.
bp_stats : Float. The Breusch-Pagan statistic of heteroskedasticity. Present only if the heteroskedasticity argument was set to True when the model was trained.
bp_p_value : Float. The Breusch-Pagan calculated p-value. Present only if the heteroskedasticity parameter was set to True when the model was trained.
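To make concrete what the r2 column reports, here is a minimal pure-Python sketch of one-variable OLS with an intercept (the function name and toy data are invented for illustration; this is not MADlib code). R-squared is one minus the ratio of the residual sum of squares to the total sum of squares, so a perfect linear fit gives exactly 1:

```python
def ols_fit(x, y):
    """Fit y ~ b0 + b1*x by ordinary least squares and report R-squared."""
    n = len(x)
    mx = sum(x) / n
    my = sum(y) / n
    # Slope: covariance of x and y divided by variance of x.
    b1 = (sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
          / sum((xi - mx) ** 2 for xi in x))
    b0 = my - b1 * mx
    # R-squared = 1 - SS_residual / SS_total.
    ss_res = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - my) ** 2 for yi in y)
    return b0, b1, 1 - ss_res / ss_tot

# Perfectly linear toy data (y = 1 + 2x), so R-squared comes out 1.
b0, b1, r2 = ols_fit([1, 2, 3, 4], [3, 5, 7, 9])
```

The query in the question, SELECT (linregr(depvar, indepvar)).r2 FROM zeta1;, returns this same goodness-of-fit statistic for the model fitted over the rows of zeta1.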
Question : Which of the following statements are correct? 1. Bayesian probability and Bayes' rule give us a way to estimate unknown probabilities from known values. 2. You can reduce the need for a lot of data by assuming conditional independence among the features in your data. 3. (option text not available in the source) 4. Only 1 and 2 5. All of 1, 2, and 3 are correct
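Statement 2 above is the heart of Naive Bayes: assuming the features are conditionally independent given the class lets each feature's conditional probability be estimated from its own counts separately, rather than requiring enough data to estimate the full joint distribution. A minimal sketch (the function names and toy weather data are invented for illustration):

```python
from collections import defaultdict

def train_nb(samples):
    """samples: list of (features_tuple, label).
    Collect per-class counts and per-(class, feature, value) counts.
    The 'naive' conditional-independence assumption means these
    per-feature counts are all the model needs."""
    class_counts = defaultdict(int)
    feat_counts = defaultdict(int)  # (class, feature_index, value) -> count
    for feats, label in samples:
        class_counts[label] += 1
        for i, v in enumerate(feats):
            feat_counts[(label, i, v)] += 1
    return class_counts, feat_counts

def predict(class_counts, feat_counts, feats):
    """Pick the class maximizing P(class) * product of P(feature | class)."""
    total = sum(class_counts.values())
    best, best_p = None, -1.0
    for c, cc in class_counts.items():
        p = cc / total
        for i, v in enumerate(feats):
            # Laplace smoothing so unseen values don't zero out the product.
            p *= (feat_counts[(c, i, v)] + 1) / (cc + 2)
        if p > best_p:
            best, best_p = c, p
    return best

data = [(("sunny", "hot"), "no"), (("sunny", "mild"), "no"),
        (("rainy", "mild"), "yes"), (("rainy", "cool"), "yes")]
cc, fc = train_nb(data)
label = predict(cc, fc, ("rainy", "mild"))
```

With two features of v values each, the independence assumption needs roughly 2v conditional estimates per class instead of v squared joint ones, which is why far less training data suffices.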