Question : You are asked to write a report on how specific variables impact your client's sales using a data set provided to you by the client. The data includes 15 variables that the client views as directly related to sales, and you are restricted to these variables only. After a preliminary analysis of the data, the following findings were made: 1. Multicollinearity is not an issue among the variables. 2. Only three variables (A, B, and C) have significant correlation with sales. You build a linear regression model on the dependent variable of sales with the independent variables of A, B, and C. The results of the regression are seen in the exhibit. You cannot request additional data. What is a way that you could try to increase the R2 of the model without artificially inflating it?
1. Create clusters based on the data and use them as model inputs 2. Force all 15 variables into the model as independent variables 3. Create interaction variables based only on variables A, B, and C 4. Break variables A, B, and C into their own univariate models
Correct Answer : 1
Explanation: In statistics, linear regression is an approach for modeling the relationship between a scalar dependent variable y and one or more explanatory variables (or independent variables) denoted X. The case of one explanatory variable is called simple linear regression. For more than one explanatory variable, the process is called multiple linear regression. (This term should be distinguished from multivariate linear regression, where multiple correlated dependent variables are predicted, rather than a single scalar variable.) In linear regression, data are modeled using linear predictor functions, and unknown model parameters are estimated from the data. Such models are called linear models.[3] Most commonly, linear regression refers to a model in which the conditional mean of y given the value of X is an affine function of X. Less commonly, linear regression could refer to a model in which the median, or some other quantile of the conditional distribution of y given X, is expressed as a linear function of X. Like all forms of regression analysis, linear regression focuses on the conditional probability distribution of y given X, rather than on the joint probability distribution of y and X, which is the domain of multivariate analysis.
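To make the simple (one-variable) case described above concrete, here is a minimal sketch in plain Python that fits a least-squares line and computes its R2. The data values and the function names fit_simple_linear_regression and r_squared are made up for illustration; they do not come from the client data set or the exhibit in the question.

```python
# Minimal sketch: ordinary least squares with one explanatory variable,
# plus the R^2 of the fitted line. All data values are illustrative only.

def fit_simple_linear_regression(xs, ys):
    """Return (intercept, slope) minimizing the sum of squared residuals."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of squared deviations of x, and cross-deviations of x and y.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


def r_squared(xs, ys, intercept, slope):
    """Return 1 - SS_res / SS_tot for the fitted line."""
    mean_y = sum(ys) / len(ys)
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1.0 - ss_res / ss_tot


xs = [1.0, 2.0, 3.0, 4.0, 5.0]   # hypothetical explanatory variable
ys = [2.1, 3.9, 6.2, 7.8, 10.1]  # hypothetical dependent variable

intercept, slope = fit_simple_linear_regression(xs, ys)
r2 = r_squared(xs, ys, intercept, slope)
print(f"intercept={intercept:.2f} slope={slope:.2f} R2={r2:.4f}")
```

The fitted line intercept + slope * x is the affine function of x that estimates the conditional mean of y; R2 is the share of the variation in y that this line accounts for.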
Question : You have two tables of customers in your database. Customers in cust_table_ were sent an email promotion last year, and customers in cust_table_2 received a newsletter last year. Each customer can be entered only once per table. You want to create a table that includes all customers, along with any communications they received last year. Which type of join would you use for this table?
1. Full outer join 2. Inner join 3. Left outer join 4. Cross join
Correct Answer : 1
Explanation: The FULL OUTER JOIN keyword returns all rows from the left table (table1) and from the right table (table2), with NULLs filled in where a row has no match in the other table.
The FULL OUTER JOIN keyword combines the result of both LEFT and RIGHT joins.
SQL FULL OUTER JOIN Syntax:
SELECT column_name(s)
FROM table1
FULL OUTER JOIN table2
ON table1.column_name = table2.column_name;
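The behavior of a full outer join can also be sketched in plain Python. This is only an illustration under stated assumptions: the customer ids, the dictionaries promo and newsletter, and the helper full_outer_join are hypothetical, and the sketch relies on each customer appearing at most once per table, as the question states.

```python
# Hypothetical sample data: customer id -> communication received last year.
promo = {101: "email promotion", 102: "email promotion"}
newsletter = {102: "newsletter", 103: "newsletter"}


def full_outer_join(left, right):
    """Return one row per key found in either input.

    None marks the side with no matching row, like NULL in SQL.
    """
    rows = []
    # The union of keys keeps customers that appear in only one table.
    for key in sorted(left.keys() | right.keys()):
        rows.append((key, left.get(key), right.get(key)))
    return rows


rows = full_outer_join(promo, newsletter)
for row in rows:
    print(row)
```

Customer 102 appears in both inputs and gets both values; customers 101 and 103 appear in only one input and still show up in the result. An inner join would drop those two customers, and a left outer join would drop customer 103.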
Question : In which lifecycle stage are initial hypotheses formed?
1. Model planning 2. Discovery 3. Model building 4. Data preparation
Correct Answer : 2
Explanation: Phase 1-Discovery: In Phase 1, the team learns the business domain, including relevant history such as whether the organization or business unit has attempted similar projects in the past from which they can learn. The team assesses the resources available to support the project in terms of people, technology, time, and data. Important activities in this phase include framing the business problem as an analytics challenge that can be addressed in subsequent phases and formulating initial hypotheses (IHs) to test and begin learning the data.
1. You have a completely developed model based on both a sample of the data and the entire set of data available. 2. You have presented the results of the model to both the internal analytics team and the business owner of the project. 4. You have written documentation, and the code has been handed off to the Data Base Administrator and business operations.
1. a subset of the provided data set selected at random and used to initially construct the model 2. a subset of the provided data set that is removed by the data scientist because it contains data errors 4. a subset of the provided data set selected at random and used to validate the model