Explanation:
Discovery: In Phase 1, the team learns the business domain, including relevant history such as whether the organization or business unit has attempted similar projects in the past from which it can learn. The team assesses the resources available to support the project in terms of people, technology, time, and data. Important activities in this phase include framing the business problem as an analytics challenge that can be addressed in subsequent phases and formulating initial hypotheses (IHs) to test and begin learning the data.
Data preparation: Phase 2 requires the presence of an analytic sandbox, in which the team can work with data and perform analytics for the duration of the project. The team needs to execute extract, load, and transform (ELT) or extract, transform, and load (ETL) to get data into the sandbox. ELT and ETL are sometimes abbreviated together as ETLT. Data should be transformed in the ETLT process so the team can work with it and analyze it. In this phase, the team also needs to familiarize itself with the data thoroughly and take steps to condition the data.
Model planning: Phase 3 is model planning, where the team determines the methods, techniques, and workflow it intends to follow for the subsequent model building phase. The team explores the data to learn about the relationships between variables and subsequently selects key variables and the most suitable models.
Model building: In Phase 4, the team develops datasets for testing, training, and production purposes. In addition, in this phase the team builds and executes models based on the work done in the model planning phase. The team also considers whether its existing tools will suffice for running the models, or if it will need a more robust environment for executing models and workflows (for example, fast hardware and parallel processing, if applicable).
Communicate results: In Phase 5, the team, in collaboration with major stakeholders, determines if the results of the project are a success or a failure based on the criteria developed in Phase 1. The team should identify key findings, quantify the business value, and develop a narrative to summarize and convey findings to stakeholders.
Operationalize: In Phase 6, the team delivers final reports, briefings, code, and technical documents. In addition, the team may run a pilot project to implement the models in a production environment.
Question : Under which circumstance do you need to implement N-fold cross-validation after creating a regression model?
Correct Answer : Exp : In statistics, regression analysis is a statistical process for estimating the relationships among variables. It includes many techniques for modeling and analyzing several variables, when the focus is on the relationship between a dependent variable and one or more independent variables. More specifically, regression analysis helps one understand how the typical value of the dependent variable (or 'criterion variable') changes when any one of the independent variables is varied, while the other independent variables are held fixed. Most commonly, regression analysis estimates the conditional expectation of the dependent variable given the independent variables - that is, the average value of the dependent variable when the independent variables are fixed. Less commonly, the focus is on a quantile, or other location parameter of the conditional distribution of the dependent variable given the independent variables. In all cases, the estimation target is a function of the independent variables called the regression function. In regression analysis, it is also of interest to characterize the variation of the dependent variable around the regression function, which can be described by a probability distribution. Regression analysis is widely used for prediction and forecasting, where its use has substantial overlap with the field of machine learning. Regression analysis is also used to understand which among the independent variables are related to the dependent variable, and to explore the forms of these relationships. In restricted circumstances, regression analysis can be used to infer causal relationships between the independent and dependent variables. However, this can lead to illusions or false relationships, so caution is advisable; for example, correlation does not imply causation. Many techniques for carrying out regression analysis have been developed.
Familiar methods such as linear regression and ordinary least squares regression are parametric, in that the regression function is defined in terms of a finite number of unknown parameters that are estimated from the data. Nonparametric regression refers to techniques that allow the regression function to lie in a specified set of functions, which may be infinite-dimensional.
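The N-fold cross-validation named in the question above can be illustrated with a small sketch built on ordinary least squares. Everything here is an assumption made for illustration: the function names `fit_line` and `n_fold_cv_mse`, the interleaved way folds are formed, and the made-up data. It is one possible arrangement, not the method the exam material prescribes.

```python
from statistics import mean

def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x; returns (a, b)."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    b = sxy / sxx
    return y_bar - b * x_bar, b

def n_fold_cv_mse(xs, ys, n_folds):
    """Average held-out mean squared error over n_folds folds."""
    fold_errors = []
    for k in range(n_folds):
        # Fold k holds out every n_folds-th point, starting at index k
        # (an illustrative choice; shuffled folds are also common).
        test_idx = set(range(k, len(xs), n_folds))
        pairs = list(enumerate(zip(xs, ys)))
        train = [p for i, p in pairs if i not in test_idx]
        test = [p for i, p in pairs if i in test_idx]
        a, b = fit_line([x for x, _ in train], [y for _, y in train])
        fold_errors.append(mean((y - (a + b * x)) ** 2 for x, y in test))
    return mean(fold_errors)

# Made-up data lying exactly on y = 1 + 2x, so the held-out error is zero.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [3.0, 5.0, 7.0, 9.0, 11.0, 13.0]
print(n_fold_cv_mse(xs, ys, n_folds=3))
```

Each point is used for testing exactly once and for training in the remaining folds, so no separate test set has to be set aside.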
The performance of regression analysis methods in practice depends on the form of the data generating process, and how it relates to the regression approach being used. Since the true form of the data-generating process is generally not known, regression analysis often depends to some extent on making assumptions about this process. These assumptions are sometimes testable if a sufficient quantity of data is available. Regression models for prediction are often useful even when the assumptions are moderately violated, although they may not perform optimally. However, in many applications, especially with small effects or questions of causality based on observational data, regression methods can give misleading results.
EXAMPLE USES OF REGRESSION MODELS
Selecting Colleges : A high school student discusses plans to attend college with a guidance counselor. The student has a 2.04 grade point average out of a 4.00 maximum and mediocre to poor scores on the ACT. He asks about attending Harvard. The counselor tells him he would probably not do well at that institution, predicting he would have a grade point average of 0.64 at the end of four years at Harvard. The student inquires about the necessary grade point average to graduate and, when told that it is 2.25, decides that maybe another institution might be more appropriate in case he becomes involved in some "heavy duty partying." When asked about the large state university, the counselor predicts that he might succeed, but chances for success are not great, with a predicted grade point average of 1.23. A regional institution is then proposed, with a predicted grade point average of 1.54. Deciding that is still not high enough to graduate, the student decides to attend a local community college, graduates with an associate's degree and makes a fortune selling real estate. If the counselor was using a regression model to make the predictions, he or she would know that this particular student would not make a grade point average of exactly 0.64 at Harvard, 1.23 at the state university, and 1.54 at the regional university. These values are just "best guesses." It may be that this particular student was completely bored in high school, didn't take the standardized tests seriously, would become challenged in college and would succeed at Harvard. The selection committee at Harvard, however, when faced with a choice between a student with a predicted grade point average of 3.24 and one with 0.64, would most likely make the rational decision to choose the more promising student.
Pregnancy : A woman in the first trimester of pregnancy has a great deal of concern about the environmental factors surrounding her pregnancy and asks her doctor what impact they might have on her unborn child. The doctor makes a "point estimate" based on a regression model that the child will have an IQ of 75. It is highly unlikely that her child will have an IQ of exactly 75, as there is always error in the regression procedure. Error may be incorporated into the information given to the woman in the form of an "interval estimate." For example, it would make a great deal of difference if the doctor were to say that the child had a ninety-five percent chance of having an IQ between 70 and 80, in contrast to a ninety-five percent chance of an IQ between 50 and 100. The concept of error in prediction will become an important part of the discussion of regression models. It is also worth pointing out that regression models do not make decisions for people. Regression models are a source of information about the world. In order to use them wisely, it is important to understand how they work.
Question : Your company has different sales teams. Each team's sales manager has developed incentive offers to increase the size of each sales transaction. Any sales manager whose incentive program can be shown to increase the size of the average sales transaction will receive a bonus. Data are available for the number and average sale amount for transactions offering one of the incentives as well as transactions offering no incentive. The VP of Sales has asked you to determine analytically if any of the incentive programs has resulted in a demonstrable increase in the average sale amount. Which analytical technique would be appropriate in this situation?
Explanation: The results of a one-way ANOVA can be considered reliable as long as the following assumptions are met:
Response variable residuals are normally distributed (or approximately normally distributed).
Samples are independent.
Variances of populations are equal.
Responses for a given group are independent and identically distributed normal random variables (not a simple random sample (SRS)).
ANOVA is a relatively robust procedure with respect to violations of the normality assumption. If data are ordinal, a non-parametric alternative to this test should be used, such as the Kruskal-Wallis one-way analysis of variance.
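As a rough illustration of what a one-way ANOVA computes for the incentive-program scenario, the sketch below derives the F statistic from between-group and within-group variation. The function name `one_way_anova_f` and the sale amounts are made up for illustration. Turning F into a p-value requires the F distribution, which this stdlib-only sketch deliberately leaves out.

```python
from statistics import mean

def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a list of samples."""
    all_values = [v for g in groups for v in g]
    grand_mean = mean(all_values)
    # Variation of the group means around the grand mean.
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Variation of the observations around their own group mean.
    ss_within = sum((v - mean(g)) ** 2 for g in groups for v in g)
    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within

# Hypothetical average sale amounts: no incentive, incentive A, incentive B.
no_incentive = [1.0, 2.0, 3.0]
incentive_a = [2.0, 3.0, 4.0]
incentive_b = [5.0, 6.0, 7.0]
print(one_way_anova_f([no_incentive, incentive_a, incentive_b]))
```

A large F means the group means differ by more than the spread inside each group would suggest. Which specific group differs still has to be checked with a follow-up comparison.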
1. If your surgery is to be a routine one, then surgeon B is actually the better surgeon 2. If your surgery is to be a routine one, then surgeon A is actually the better surgeon 3. [...] 4. Data is not sufficient
Question : Select the correct statement about AUC, a commonly used evaluation method for measuring the accuracy and quality of a recommender system.
1. It is a commonly used evaluation method for binary choice problems 2. It involves classifying an instance as either positive or negative 3. [...] 4. 1 and 2 only 5. All of 1, 2 and 3
Ans : 4 Exp : AUC is a commonly used evaluation method for binary choice problems, which involve classifying an instance as either positive or negative. Its main advantages over other evaluation methods, such as the simpler misclassification error, are: 1. It's insensitive to unbalanced datasets (datasets that have more installeds than not-installeds or vice versa). 2. For other evaluation methods, a user has to choose a cut-off point above which the target variable is part of the positive class (e.g. a logistic regression model returns any real number between 0 and 1 - the modeler might decide that predictions greater than 0.5 mean a positive class prediction while predictions of less than 0.5 mean a negative class prediction). AUC evaluates entries at all cut-off points, giving better insight into how well the classifier is able to separate the two classes.
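The claim that AUC needs no cut-off point can be made concrete with a small sketch. It uses the pairwise view of AUC: the fraction of (positive, negative) pairs in which the positive instance receives the higher score, with ties counted as half. The function name `auc` and the toy labels and scores are assumptions for illustration only.

```python
def auc(labels, scores):
    """Area under the ROC curve via pairwise comparison of scores.

    labels: 1 for the positive class, 0 for the negative class.
    scores: model outputs; higher means "more likely positive".
    """
    positives = [s for l, s in zip(labels, scores) if l == 1]
    negatives = [s for l, s in zip(labels, scores) if l == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # ties count as half
    return wins / (len(positives) * len(negatives))

# Toy example: three of the four positive/negative pairs are ordered correctly.
labels = [1, 0, 1, 0]
scores = [0.9, 0.8, 0.7, 0.1]
print(auc(labels, scores))
```

No threshold such as 0.5 appears anywhere in the computation; only the ordering of the scores matters.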
Question : You have created a recommender system for the QuickTechie.com website, which recommends software professionals based on parameters such as technologies, location, and companies. You suspect that the model is not giving proper recommendations: Rahul, who works on Hadoop in Mumbai, and John from France, who works on a UI application created in Flash, are recommended as similar professionals, which is not correct. Select the option that will help you measure the accuracy and quality of the recommender system you created for QuickTechie.com.
Ans : 3 Exp : AUC is a commonly used evaluation method for binary choice problems, which involve classifying an instance as either positive or negative. Its main advantages over other evaluation methods, such as the simpler misclassification error, are: 1. It's insensitive to unbalanced datasets (datasets that have more installeds than not-installeds or vice versa). 2. For other evaluation methods, a user has to choose a cut-off point above which the target variable is part of the positive class (e.g. a logistic regression model returns any real number between 0 and 1 - the modeler might decide that predictions greater than 0.5 mean a positive class prediction while a prediction of less than 0.5 mean a negative class prediction). AUC evaluates entries at all cut-off points, giving better insight into how well the classifier is able to separate the two classes.
The MAE measures the average magnitude of the errors in a set of forecasts, without considering their direction. It measures accuracy for continuous variables. The equation is given in the library references. Expressed in words, the MAE is the average over the verification sample of the absolute values of the differences between forecast and the corresponding observation. The MAE is a linear score which means that all the individual differences are weighted equally in the average.
The sum of absolute errors is a valid metric, but doesn't give any useful sense of how the recommender system is performing. Support vector count and cluster density do not apply to recommender systems. MAE and AUC are both valid and useful metrics for measuring recommender systems.
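The verbal definition of MAE above translates directly into code. This is a minimal sketch; the function name `mean_absolute_error` and the rating values are made up for illustration.

```python
def mean_absolute_error(forecasts, observations):
    """Average of |forecast - observation| over the verification sample."""
    if len(forecasts) != len(observations) or not forecasts:
        raise ValueError("inputs must be non-empty and of equal length")
    total = sum(abs(f - o) for f, o in zip(forecasts, observations))
    return total / len(forecasts)

# Hypothetical predicted vs. actual ratings for four recommendations.
predicted = [3.5, 4.0, 2.0, 5.0]
actual = [4.0, 4.0, 1.0, 4.5]
print(mean_absolute_error(predicted, actual))
```

Dividing by the sample size is what separates MAE from the plain sum of absolute errors: the result stays comparable across samples of different sizes.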
Ans : 5 Exp : Scatter plots show the relationship between two variables by displaying data points on a two-dimensional graph. The variable that might be considered an explanatory variable is plotted on the x axis, and the response variable is plotted on the y axis. Scatter plots are especially useful when there are a large number of data points. They provide the following information about the relationship between two variables:
Strength
Shape - linear, curved, etc.
Direction - positive or negative
Presence of outliers
A correlation between the variables results in the clustering of data points along a line.
[Figure: scatter plot suggestive of a positive linear relationship]
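The strength and direction that a scatter plot shows visually can also be summarized numerically with the Pearson correlation coefficient. The sketch below is illustrative only; the function name `pearson_r` and the data points are assumptions.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation: sign gives direction, magnitude gives strength."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

# Made-up points that lie exactly on a rising line, then on a falling line.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
print(pearson_r(xs, [2.0, 4.0, 6.0, 8.0, 10.0]))   # positive direction
print(pearson_r(xs, [10.0, 8.0, 6.0, 4.0, 2.0]))   # negative direction
```

The coefficient only captures linear association, so a curved relationship or a single outlier is still easier to spot on the plot itself.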
Question : You are given a data set that contains information about TV advertisements placed on the Zee News channel (covering the whole of Asia), with the following details: advertisement duration, cost rate per minute of advertisement, country of the advertiser, city of the advertiser, country in which the advertisement is to be shown, city in which the advertisement is to be shown, month, total days (of the month) the advertisement was shown, total hours for which the advertisement was shown, and total minutes for which the advertisement was shown. From the data set you can determine the frequencies of all the advertisements shown in Asia. For example, between 1990 and 2014, 500 advertisements were placed from China to be shown in India, while 2000 advertisements were placed from Russia to be shown in Japan. You now want to draw a picture that shows the relation between ad duration and cost per minute. Which technique would be better?
Ans : 1 Exp : A scatter plot, scatterplot, or scattergraph is a type of mathematical diagram using Cartesian coordinates to display values for two variables for a set of data. The data is displayed as a collection of points, each having the value of one variable determining the position on the horizontal axis and the value of the other variable determining the position on the vertical axis. This kind of plot is also called a scatter chart, scattergram, scatter diagram, or scatter graph. A heat map is a two-dimensional representation of data in which values are represented by colors. A simple heat map provides an immediate visual summary of information. More elaborate heat maps allow the viewer to understand complex data sets. Another type of heat map, which is often used in business, is sometimes referred to as a tree map. This type of heat map uses rectangles to represent components of a data set. The largest rectangle represents the dominant logical division of data and smaller rectangles illustrate other sub-divisions within the data set. The color and size of the rectangles on this type of heat map can correspond to two different values, allowing the viewer to perceive two variables at once. Tree maps are often used for budget proposals, stock market analysis, risk management, project portfolio analysis, market share analysis, website design and network management. In descriptive statistics, a box plot or boxplot is a convenient way of graphically depicting groups of numerical data through their quartiles. Box plots may also have lines extending vertically from the boxes (whiskers) indicating variability outside the upper and lower quartiles, hence the terms box-and-whisker plot and box-and-whisker diagram. Outliers may be plotted as individual points. To visualize correlations between two variables, a scatter plot is typically the best choice. 
By plotting the data on a scatter plot, you can easily see any trends in the correlation, such as a linear relationship, a log-normal relationship, or a polynomial relationship. A heat map uses three dimensions and so would be a poor choice for this purpose. Box plots, bar charts, and tree maps do not provide the kind of uniform spatial mapping of the data onto the graph that is required to see trends.
Question :
Which of the following provides the kind of uniform spatial mapping of the data onto the graph that is required to see trends?
Ans 5 Exp : Box Plots: In descriptive statistics, a box plot or boxplot is a convenient way of graphically depicting groups of numerical data through their quartiles. Box plots may also have lines extending vertically from the boxes (whiskers) indicating variability outside the upper and lower quartiles, hence the terms box-and-whisker plot and box-and-whisker diagram. Outliers may be plotted as individual points. Box plots display differences between populations without making any assumptions of the underlying statistical distribution: they are non-parametric. The spacings between the different parts of the box help indicate the degree of dispersion (spread) and skewness in the data, and identify outliers. In addition to the points themselves, they allow one to visually estimate various L-estimators, notably the interquartile range, midhinge, range, mid-range, and trimean. Boxplots can be drawn either horizontally or vertically. A heat map is a two-dimensional representation of data in which values are represented by colors. A simple heat map provides an immediate visual summary of information. More elaborate heat maps allow the viewer to understand complex data sets. In the United States, many people are familiar with heat maps from viewing television news programs. During a presidential election, for instance, a geographic heat map with the colors red and blue will quickly inform the viewer which states each candidate has won. Another type of heat map, which is often used in business, is sometimes referred to as a tree map. This type of heat map uses rectangles to represent components of a data set. The largest rectangle represents the dominant logical division of data and smaller rectangles illustrate other sub-divisions within the data set. The color and size of the rectangles on this type of heat map can correspond to two different values, allowing the viewer to perceive two variables at once. 
Tree maps are often used for budget proposals, stock market analysis, risk management, project portfolio analysis, market share analysis, website design and network management.
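The quantities a box plot draws (quartiles, interquartile range, whiskers, outliers) can be computed directly. The sketch below is illustrative: the function name `box_plot_summary` is hypothetical, the 1.5 x IQR whisker rule is a common convention rather than something the explanation above specifies, and `statistics.quantiles` uses its default "exclusive" method, so other tools may report slightly different quartiles.

```python
from statistics import quantiles

def box_plot_summary(values, whisker=1.5):
    """Quartiles, IQR, whisker ends and outliers for one group of numbers."""
    data = sorted(values)
    q1, median, q3 = quantiles(data, n=4)   # default 'exclusive' method
    iqr = q3 - q1
    low_fence = q1 - whisker * iqr
    high_fence = q3 + whisker * iqr
    inside = [v for v in data if low_fence <= v <= high_fence]
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        "iqr": iqr,
        "whisker_low": inside[0],     # smallest value within the fences
        "whisker_high": inside[-1],   # largest value within the fences
        "outliers": [v for v in data if v < low_fence or v > high_fence],
    }

# Made-up sample with one extreme value.
print(box_plot_summary([1, 2, 3, 4, 5, 6, 50]))
```

Because the summary relies only on order statistics, the extreme value shows up as an outlier without distorting the box itself.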
Question : You are given a data set that contains information about TV advertisements placed on the Zee News channel (covering the whole of Asia), with the following details: advertisement duration, cost rate per minute of advertisement, country of the advertiser, city of the advertiser, country in which the advertisement is to be shown, city in which the advertisement is to be shown, month, total days (of the month) the advertisement was shown, total hours for which the advertisement was shown, and total minutes for which the advertisement was shown. From the data set you can determine the frequencies of all the advertisements shown in Asia. For example, between 1990 and 2014, 500 advertisements were placed from China to be shown in India, while 2000 advertisements were placed from Russia to be shown in Japan. You now want to draw a picture that shows which countries placed the most advertisements in which other countries. Select the correct option.
1. Heat map 2. Tree map 3. [...] 4. Bar chart 5. Scatter plot
Ans :1 Exp : A scatter plot, scatterplot, or scattergraph is a type of mathematical diagram using Cartesian coordinates to display values for two variables for a set of data. The data is displayed as a collection of points, each having the value of one variable determining the position on the horizontal axis and the value of the other variable determining the position on the vertical axis. This kind of plot is also called a scatter chart, scattergram, scatter diagram, or scatter graph. A heat map is a two-dimensional representation of data in which values are represented by colors. A simple heat map provides an immediate visual summary of information. More elaborate heat maps allow the viewer to understand complex data sets. Another type of heat map, which is often used in business, is sometimes referred to as a tree map. This type of heat map uses rectangles to represent components of a data set. The largest rectangle represents the dominant logical division of data and smaller rectangles illustrate other sub-divisions within the data set. The color and size of the rectangles on this type of heat map can correspond to two different values, allowing the viewer to perceive two variables at once. Tree maps are often used for budget proposals, stock market analysis, risk management, project portfolio analysis, market share analysis, website design and network management. In descriptive statistics, a box plot or boxplot is a convenient way of graphically depicting groups of numerical data through their quartiles. Box plots may also have lines extending vertically from the boxes (whiskers) indicating variability outside the upper and lower quartiles, hence the terms box-and-whisker plot and box-and-whisker diagram. Outliers may be plotted as individual points. To visualize correlations between two variables, a scatter plot is typically the best choice. 
By plotting the data on a scatter plot, you can easily see any trends in the correlation, such as a linear relationship, a log-normal relationship, or a polynomial relationship. A heat map uses three dimensions and so would be a poor choice for this purpose. Box plots, bar charts, and tree maps do not provide the kind of uniform spatial mapping of the data onto the graph that is required to see trends.
In order to effectively visualize the advertisement source and destination frequencies, you'll need a plot that gives at least three dimensions: the source, destination, and frequency. A heat map provides exactly that. Scatter plots, box plots, tree maps, and bar charts provide at most two dimensions. In theory, you could use a three-dimensional variant of one of the two-dimensional graphs, but three-dimensional graphs are never a good idea. Three-dimensional graphs can only be shown in two dimensions in print and hence cause visual distortions to the data. They can also hide some data points, and they make it very difficult to compare data points from different parts of the graph.
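The three dimensions mentioned above (source, destination, frequency) correspond to a frequency matrix, which is the data a heat map colors cell by cell. The sketch below builds such a matrix; the function name `frequency_matrix` and the scaled-down record counts are assumptions for illustration, not values from the data set.

```python
from collections import Counter

def frequency_matrix(records):
    """Count (source, destination) pairs and lay them out as a grid.

    Returns (sources, destinations, matrix), where matrix[i][j] is the
    number of ads from sources[i] shown in destinations[j].
    """
    counts = Counter(records)
    sources = sorted({src for src, _ in counts})
    destinations = sorted({dst for _, dst in counts})
    matrix = [[counts.get((src, dst), 0) for dst in destinations]
              for src in sources]
    return sources, destinations, matrix

# Hypothetical, scaled-down records: one (source, destination) pair per ad.
records = ([("China", "India")] * 5
           + [("Russia", "Japan")] * 20
           + [("China", "Japan")] * 2)
sources, destinations, matrix = frequency_matrix(records)
print(sources, destinations, matrix)
```

Rows and columns carry the two categorical dimensions, and the cell value, rendered as color, carries the third.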
Question :
Which of the following graphs is best presented in two dimensions?
Ans : 5 Exp : A heat map conveys three dimensions (for example source, destination, and frequency) on a two-dimensional surface. Scatter plots, box plots, tree maps, and bar charts provide at most two dimensions. In theory, you could use a three-dimensional variant of one of the two-dimensional graphs, but three-dimensional graphs are never a good idea. Three-dimensional graphs can only be shown in two dimensions in print and hence cause visual distortions to the data. They can also hide some data points, and they make it very difficult to compare data points from different parts of the graph.
Question : You are given a data set that contains information about TV advertisements placed on the Zee News channel (covering the whole of Asia), with the following details: advertisement duration, cost rate per minute of advertisement, country of the advertiser, city of the advertiser, country in which the advertisement is to be shown, city in which the advertisement is to be shown, month, total days (of the month) the advertisement was shown, total hours for which the advertisement was shown, and total minutes for which the advertisement was shown. From the data set you can determine the frequencies of all the advertisements shown in Asia. For example, between 1990 and 2014, 500 advertisements were placed from China to be shown in India, while 2000 advertisements were placed from Russia to be shown in Japan. You now want to draw a picture that shows the share that every city and country has of the overall ad data. Which technique would be better?
1. Scatter plot 2. Heat map 3. [...] 4. Tree map
Ans : 4 Exp : To show the share of advertisement originations for every city and country, you'll need a way to show hierarchical information. A tree map is a natural choice, since it's designed for exactly that purpose. You could, however, use a stacked bar chart to present the same information. A heat map has an extra, unneeded dimension, which would make the graph confusing. A scatter plot is for numeric data in both dimensions. A box plot is for groupings of multiple values. A scatter plot, scatterplot, or scattergraph is a type of mathematical diagram using Cartesian coordinates to display values for two variables for a set of data. The data is displayed as a collection of points, each having the value of one variable determining the position on the horizontal axis and the value of the other variable determining the position on the vertical axis.
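The hierarchical shares that a tree map displays, as discussed for the city and country question above, can be computed before any drawing happens. The sketch below is illustrative; the function name `hierarchical_shares` and the ad counts are made up.

```python
from collections import defaultdict

def hierarchical_shares(ad_counts):
    """Turn {(country, city): count} into nested shares of the overall total.

    Each country's share sets the size of its outer rectangle; each city's
    share sets the size of a rectangle nested inside it.
    """
    total = sum(ad_counts.values())
    country_totals = defaultdict(int)
    for (country, _city), count in ad_counts.items():
        country_totals[country] += count
    tree = {}
    for (country, city), count in ad_counts.items():
        node = tree.setdefault(
            country,
            {"share": country_totals[country] / total, "cities": {}},
        )
        node["cities"][city] = count / total
    return tree

# Hypothetical counts of advertisement originations.
counts = {("India", "Mumbai"): 30, ("India", "Delhi"): 20, ("Japan", "Tokyo"): 50}
print(hierarchical_shares(counts))
```

The city shares inside a country add up to that country's share, which is the nesting property a tree map makes visible.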
Question :
Which of the following is a correct use case for scatter plots?
1. Male versus female likelihood of having lung cancer at different ages 2. Technology early adopters' and laggards' purchase patterns of smart phones 3. [...] 4. All of the above
Ans : 4 Exp : Looking to dig a little deeper into some data, but not quite sure how - or if - different pieces of information relate? Scatter plots are an effective way to give you a sense of trends, concentrations and outliers that will direct you to where you want to focus your investigation efforts further. When to use scatter plots: investigating the relationship between different variables. Examples: male versus female likelihood of having lung cancer at different ages, technology early adopters' and laggards' purchase patterns of smart phones, shipping costs of different product categories to different regions.
Question :
In which of the following cases can Gantt charts not be used?
1. Displaying a project schedule. Examples: illustrating key deliverables, owners, and deadlines. 2. Showing other things in use over time. Examples: duration of a machine's use. 3. [...] 4. None of the above
Ans : 4 Exp : Gantt charts excel at illustrating the start and finish dates of the elements of a project. Hitting deadlines is paramount to a project's success. Seeing what needs to be accomplished - and by when - is essential to make this happen. This is where a Gantt chart comes in. While most associate Gantt charts with project management, they can be used to understand how other things, such as people or machines, vary over time. You could use a Gantt chart, for example, to do resource planning to see how long it took people to hit specific milestones, such as a certification level, and how that was distributed over time. When to use Gantt charts: displaying a project schedule (examples: illustrating key deliverables, owners, and deadlines) and showing other things in use over time (examples: duration of a machine's use, availability of players on a team).
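The idea of showing start and finish dates as bars on a shared time axis can be sketched in plain text. Everything in this example is hypothetical: the function name `gantt_rows`, the task names, and the convention that days are counted from 0 with the end day exclusive.

```python
def gantt_rows(tasks):
    """Render tasks as text bars on a shared day axis.

    tasks: list of (name, start_day, end_day), with end_day exclusive.
    """
    width = max(len(name) for name, _, _ in tasks)
    rows = []
    for name, start, end in tasks:
        # Leading spaces place the bar at its start day; '#' marks each day.
        bar = " " * start + "#" * (end - start)
        rows.append(f"{name:<{width}} |{bar}")
    return rows

# Made-up schedule: Test overlaps the last two days of Build.
schedule = [("Design", 0, 3), ("Build", 3, 8), ("Test", 6, 10)]
for row in gantt_rows(schedule):
    print(row)
```

Overlaps and gaps between tasks become visible because all bars share one horizontal axis, which is the property that also makes the chart useful for machine usage or team availability.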
Question :
Which of the following is the best example of where heat maps can be used?
1. Segmentation analysis of target market 2. Product adoption across regions 3. [...] 4. All of the above 5. None of 1, 2 and 3
Ans : 4 Exp : Heat maps are a great way to compare data across two categories using color. The effect is to quickly see where the intersection of the categories is strongest and weakest. When to use heat maps: showing the relationship between two factors. Examples: segmentation analysis of target market, product adoption across regions, sales leads by individual rep.
Question :
Which of the following cannot be presented using TreeMap?