Correct Answer : Explanation: In Phase 3, the data science team identifies candidate models to apply to the data for clustering, classifying, or finding relationships in the data, depending on the goal of the project. During this phase the team refers back to the hypotheses developed in Phase 1, when it first became acquainted with the data and built an understanding of the business problem or domain area. These hypotheses help the team frame the analytics to execute in Phase 4 and select the right methods to achieve its objectives. Some of the activities to consider in this phase include the following:
- Assess the structure of the datasets. The structure of the datasets is one factor that dictates the tools and analytical techniques for the next phase; analyzing textual data, for example, requires different tools and approaches than analyzing transactional data.
- Ensure that the analytical techniques enable the team to meet the business objectives and to accept or reject the working hypotheses.
- Determine whether the situation warrants a single model or a series of techniques as part of a larger analytic workflow.
A few example models include association rules and logistic regression. Other tools, such as Alpine Miner, enable users to set up a series of steps and analyses and can serve as a front-end user interface (UI) for manipulating Big Data sources in PostgreSQL.
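As a concrete illustration of fitting one of the candidate models named above (logistic regression), here is a minimal sketch using scikit-learn and a synthetic toy dataset; the library choice and data are assumptions for illustration, not part of the original text.

```python
# Hypothetical sketch: fitting one candidate model (logistic regression)
# during model planning/building. scikit-learn and the synthetic dataset
# are illustrative assumptions.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic binary-classification data standing in for the project's data.
X, y = make_classification(n_samples=500, n_features=4, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = LogisticRegression().fit(X_train, y_train)
print("held-out accuracy:", model.score(X_test, y_test))
```

In practice, the held-out score is one input the team would use in Phase 4 to accept or reject the working hypotheses.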
Question : What is Hadoop?
1. Java classes for HDFS types and MapReduce job management and HDFS
2. Java classes for HDFS types and MapReduce job management and the MapReduce paradigm
3. Access Mostly Uused Products by 50000+ Subscribers
4. MapReduce paradigm and massive unstructured data storage on commodity hardware
Explanation: Apache Hadoop is an open source software project that enables distributed processing of large data sets across clusters of commodity servers. It is designed to scale up from a single server to thousands of machines, with a very high degree of fault tolerance. Rather than relying on high-end hardware, the resiliency of these clusters comes from the software's ability to detect and handle failures at the application layer.
Why use Hadoop?
Hadoop changes the economics and the dynamics of large-scale computing. Its impact can be boiled down to four salient characteristics.
Hadoop enables a computing solution that is:
Scalable: A cluster can be expanded by adding new servers or resources without having to move, reformat, or change the dependent analytic workflows or applications.
Cost effective: Hadoop brings massively parallel computing to commodity servers. The result is a sizeable decrease in the cost per terabyte of storage, which in turn makes it affordable to model all of your data.
Flexible: Hadoop is schema-less and can absorb any type of data, structured or not, from any number of sources. Data from multiple sources can be joined and aggregated in arbitrary ways, enabling deeper analyses than any one system can provide.
Fault tolerant: When you lose a node, the system redirects work to another copy of the data and continues processing without missing a beat.
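The MapReduce paradigm that Hadoop implements can be sketched locally in plain Python, with no Hadoop cluster assumed; the map, shuffle, and reduce stages below mirror what Hadoop distributes across many servers.

```python
# Illustrative local sketch of the MapReduce paradigm (word count).
# No Hadoop cluster is involved; this only shows the programming model.
from collections import defaultdict

def mapper(line):
    # Map stage: emit a (word, 1) pair for every word in the input line.
    for word in line.split():
        yield word.lower(), 1

def reducer(word, counts):
    # Reduce stage: sum all counts emitted for one key (word).
    return word, sum(counts)

def word_count(lines):
    shuffled = defaultdict(list)  # shuffle/sort stage: group values by key
    for line in lines:
        for word, n in mapper(line):
            shuffled[word].append(n)
    return dict(reducer(w, c) for w, c in shuffled.items())

print(word_count(["Hadoop scales out", "Hadoop is fault tolerant"]))
# → {'hadoop': 2, 'scales': 1, 'out': 1, 'is': 1, 'fault': 1, 'tolerant': 1}
```

On a real cluster, the mapper and reducer run on many nodes in parallel, and the framework handles the shuffle, scheduling, and failure recovery described above.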
Question : You are using k-means clustering to classify heart patients for a hospital. You have chosen Patient Sex, Height, Weight, Age and Income as measures and have used 3 clusters. When you create a pair-wise plot of the clusters, you notice that there is significant overlap between the clusters. What should you do?
Explanation: Introduction to k-Means Clustering. k-means clustering is a partitioning method. The function kmeans partitions data into k mutually exclusive clusters and returns the index of the cluster to which it has assigned each observation. Unlike hierarchical clustering, k-means clustering operates on actual observations (rather than on the larger set of dissimilarity measures) and creates a single level of clusters. These distinctions mean that k-means clustering is often more suitable than hierarchical clustering for large amounts of data.
kmeans treats each observation in your data as an object having a location in space. It finds a partition in which objects within each cluster are as close to each other as possible, and as far from objects in other clusters as possible. You can choose from five different distance measures, depending on the kind of data you are clustering.
Each cluster in the partition is defined by its member objects and by its centroid, or center. The centroid for each cluster is the point to which the sum of distances from all objects in that cluster is minimized. kmeans computes cluster centroids differently for each distance measure, to minimize the sum with respect to the measure that you specify.
kmeans uses an iterative algorithm that minimizes the sum of distances from each object to its cluster centroid, over all clusters. This algorithm moves objects between clusters until the sum cannot be decreased further. The result is a set of clusters that are as compact and well-separated as possible. You can control the details of the minimization using several optional input parameters to kmeans, including ones for the initial values of the cluster centroids, and for the maximum number of iterations. By default, kmeans uses the k-means++ algorithm for cluster center initialization and the squared Euclidean metric to determine distances.
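The behavior described above can be sketched with scikit-learn's KMeans, an analogue of the MATLAB kmeans function discussed here (the library choice and the synthetic two-blob data are assumptions for illustration); k-means++ initialization and squared Euclidean distances are its defaults as well.

```python
# Sketch of k-means with scikit-learn's KMeans, an analogue of the
# MATLAB kmeans described in the text. Data are synthetic.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Two well-separated 2-D blobs, centered near (0, 0) and (3, 3).
X = np.vstack([rng.normal(0, 0.3, (50, 2)),
               rng.normal(3, 0.3, (50, 2))])

# init="k-means++" and max_iter mirror the optional controls mentioned
# above (initial centroids, maximum number of iterations).
km = KMeans(n_clusters=2, init="k-means++", n_init=10, max_iter=300,
            random_state=0).fit(X)
print(km.cluster_centers_)  # centroids of the two clusters
print(km.inertia_)          # sum of squared distances being minimized
```

The `inertia_` attribute is exactly the quantity the iterative algorithm minimizes: the sum of squared distances from each object to its cluster centroid, over all clusters.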
Create Clusters and Determine Separation
The following example explores possible clustering in four-dimensional data by analyzing the results of partitioning the points into three, four, and five clusters.
Note: Because each part of this example generates random numbers sequentially, i.e., without setting a new state, you must perform all steps in sequence to duplicate the results shown. If you perform the steps out of sequence, the answers will be essentially the same, but the intermediate results, the number of iterations, or the ordering of the silhouette plots may differ.
Overlapping clustering (also called alternative clustering or multi-view clustering): while usually a hard clustering, objects may belong to more than one cluster.
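The comparison of three, four, and five clusters described above can be sketched with silhouette scores in scikit-learn (the library and the synthetic four-dimensional data are assumptions for illustration; the original example uses MATLAB silhouette plots).

```python
# Hedged sketch: comparing k = 3, 4, 5 via silhouette scores on
# synthetic four-dimensional data with three true groups.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(1)
# Three well-separated groups of 40 points each in 4-D.
X = np.vstack([rng.normal(c, 0.5, (40, 4)) for c in (0, 4, 8)])

scores = {}
for k in (3, 4, 5):
    labels = KMeans(n_clusters=k, n_init=10, random_state=1).fit_predict(X)
    scores[k] = silhouette_score(X, labels)
    print(k, round(scores[k], 3))
```

A higher average silhouette indicates more compact, better-separated clusters, so on data like this the k = 3 partition should score best.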
1. Formula A and Formula B are about equally effective at promoting weight gain.
2. Formula A and Formula B are both effective at promoting weight gain.
3. Access Mostly Uused Products by 50000+ Subscribers
4. Either Formula A or Formula B is effective at promoting weight gain.
1. The change in purchase size is not practically important, and the good p-value of the second study is probably a result of the large study size.
2. The change in purchase size is small, but may aggregate up to a large increase in profits over the entire customer base.
3. Access Mostly Uused Products by 50000+ Subscribers should run another, larger study.
4. The p-value of the second study shows a statistically significant change in purchase size. The new website is an improvement.
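The tension between options 1 and 4 above, that a very large sample can make a practically tiny effect statistically significant, can be illustrated with a simulated two-sample t-test; SciPy and all the numbers here are assumptions invented for illustration, not data from the study in the question.

```python
# Hedged illustration: with a huge sample, a practically tiny change in
# purchase size still produces a small p-value. All figures are invented.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
old = rng.normal(100.0, 20.0, 100_000)  # old site: mean purchase ~$100
new = rng.normal(100.5, 20.0, 100_000)  # new site: only ~$0.50 larger

t, p = stats.ttest_ind(new, old)
print(f"effect = ${new.mean() - old.mean():.2f}, p = {p:.2g}")
```

The effect is about fifty cents, yet the p-value is tiny because n is enormous, which is exactly why statistical significance alone does not establish practical importance.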