Question : In the MapReduce framework, what is the purpose of the Map Function? 1. It processes the input and generates key-value pairs 2. It collects the output of the Reduce function 3. It sorts the results of the Reduce function 4. It breaks the input into smaller components and distributes them to other nodes in the cluster
Correct Answer : 1 Explanation: MapReduce is a framework for processing parallelizable problems across huge datasets using a large number of computers (nodes), collectively referred to as a cluster (if all nodes are on the same local network and use similar hardware) or a grid (if the nodes are shared across geographically and administratively distributed systems, and use more heterogeneous hardware). Processing can occur on data stored either in a filesystem (unstructured) or in a database (structured). MapReduce can take advantage of locality of data, processing it on or near the storage assets in order to reduce the distance over which it must be transmitted.
"Map" step: Each worker node applies the "map()" function to the local data, and writes the output to a temporary storage. A master node orchestrates that for redundant copies of input data, only one is processed. "Shuffle" step: Worker nodes redistribute data based on the output keys (produced by the "map()" function), such that all data belonging to one key is located on the same worker node. "Reduce" step: Worker nodes now process each group of output data, per key, in parallel. MapReduce allows for distributed processing of the map and reduction operations. Provided that each mapping operation is independent of the others, all maps can be performed in parallel - though in practice this is limited by the number of independent data sources and/or the number of CPUs near each source. Similarly, a set of 'reducers' can perform the reduction phase, provided that all outputs of the map operation that share the same key are presented to the same reducer at the same time, or that the reduction function is associative. While this process can often appear inefficient compared to algorithms that are more sequential, MapReduce can be applied to significantly larger datasets than "commodity" servers can handle - a large server farm can use MapReduce to sort a petabyte of data in only a few hours.[10] The parallelism also offers some possibility of recovering from partial failure of servers or storage during the operation: if one mapper or reducer fails, the work can be rescheduled - assuming the input data is still available.
Question : You have completed your model and are handing it off to be deployed in production. What should you deliver to the production team, along with your commented code? 1. The production team needs to understand how your model will interact with the processes they already support. Give them documentation on expected model inputs and outputs, and guidance on error handling. 2. The production team is technical, and they need to understand how the processes that they support work, so give them the same presentation that you prepared for the analysts. 3. The production team supports the processes that run the organization, and they need context to understand how your model interacts with the processes they already support. Give them the same presentation that you prepared for the project sponsor. 4. The production team supports the processes that run the organization, and they need context to understand how your model interacts with the processes they already support. Give them the executive summary.
Correct Answer : 1
Explanation: The Data Analytics Lifecycle has six phases: 1. Discovery; 2. Data preparation; 3. Model planning; 4. Model building; 5. Communicate results; 6. Operationalize. In the Operationalize phase, the team delivers final reports, briefings, code, and technical documents. In addition, the team may run a pilot project to implement the models in a production environment.
Question : A colleague mentions that they want to perform k-means clustering on text file data stored in HDFS. Which tool would you recommend to this colleague?
1. Sqoop 2. Scribe 3. HBase 4. Mahout
Correct Answer : 4 Explanation: Apache Mahout is a suite of machine learning libraries designed to be scalable and robust. k-means is a simple but well-known algorithm for clustering, i.e., grouping objects. All objects must be represented as a set of numerical features, and the user has to specify the number of groups (referred to as k) to identify.
Each object can be thought of as a feature vector in an n-dimensional space, n being the number of features used to describe the objects to cluster. The algorithm randomly chooses k points in that vector space; these points serve as the initial centers of the clusters. All objects are then assigned to the center they are closest to; the distance measure is usually chosen by the user and determined by the learning task. Next, for each cluster a new center is computed by averaging the feature vectors of all objects assigned to it. This process of assigning objects and recomputing centers is repeated until it converges, and the algorithm can be proven to converge after a finite number of iterations. Several tweaks concerning the distance measure, the choice of initial centers, and the computation of new centers have been explored, as well as the estimation of the number of clusters k, yet the main principle always remains the same.
What algorithms are implemented in Mahout? A wide variety of machine learning algorithms, many of which are already implemented; a list is maintained in the project documentation. What algorithms are missing from Mahout? Many machine learning algorithms would be welcome additions; anyone who wants to implement an algorithm or an improvement to one can start a discussion on the project mailing list. Do I need Hadoop to use Mahout? A number of algorithm implementations require no Hadoop dependencies whatsoever (consult the algorithms list), and in the future more implementations may be provided on platforms better suited for machine learning, such as Apache Spark.
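To make the loop concrete, here is a toy in-memory Python sketch of the k-means iteration described above (not Mahout's scalable implementation; the sample points, k=2, and the Euclidean distance measure are illustrative assumptions):

import math
import random

def kmeans(points, k, iterations=100):
    # Randomly choose k objects as the initial cluster centers.
    centers = random.sample(points, k)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest center.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: math.dist(p, centers[i]))
            clusters[nearest].append(p)
        # Update step: recompute each center as the mean of its cluster
        # (keeping the old center if a cluster ends up empty).
        new_centers = [
            tuple(sum(dim) / len(cluster) for dim in zip(*cluster))
            if cluster else centers[i]
            for i, cluster in enumerate(clusters)
        ]
        if new_centers == centers:  # converged: centers stopped moving
            break
        centers = new_centers
    return centers

points = [(1.0, 1.0), (1.5, 2.0), (8.0, 8.0), (9.0, 9.5)]
print(kmeans(points, k=2))

Mahout's value is running this same assign-and-recompute loop at scale, with each pass over the objects distributed across a Hadoop cluster.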
1. A larger sample size should be taken to determine if the plant is operating correctly 2. The manufacturing process is functioning properly and no further action is required 3. A smaller sample size should be taken to determine if the plant is operating correctly 4. There is a flaw in the quality assurance process and the sample should be repeated
1. Communication skill 2. Scientific background 3. Domain expertise 4. Well Organized
Question : What describes the use of the UNION clause in a SQL statement? 1. Operates on queries and potentially decreases the number of rows 2. Operates on queries and potentially increases the number of rows 3. Operates on tables and potentially decreases the number of columns 4. Operates on both tables and queries and potentially increases both the number of rows and columns
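Correct Answer : 2 Explanation: UNION operates on the result sets of two queries (not directly on tables) and appends their rows, removing duplicates, so the combined result can contain more rows than either query alone; the column count must match between the two queries and does not change. As a minimal illustration using Python's built-in sqlite3 module (the table and column names here are invented for the example):

import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE employees (name TEXT);
    CREATE TABLE contractors (name TEXT);
    INSERT INTO employees VALUES ('Alice'), ('Bob');
    INSERT INTO contractors VALUES ('Bob'), ('Carol');
""")
# UNION merges the two query results and removes the duplicate 'Bob',
# yielding 3 rows: more than either 2-row query on its own.
rows = conn.execute(
    "SELECT name FROM employees UNION SELECT name FROM contractors"
).fetchall()
print(rows)  # e.g. [('Alice',), ('Bob',), ('Carol',)]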