Question : You want to understand more about how users browse your public website. For example, you want to know which pages they visit prior to placing an order. You have a server farm of 200 web servers hosting your website. Which is the most efficient process to gather these web servers' logs into your Hadoop cluster for analysis?
A. Sample the web server logs from the web servers and copy them into HDFS using curl
B. Ingest the server web logs into HDFS using Flume
C. Channel these clickstreams into Hadoop using Hadoop Streaming
D. Import all user clicks from your OLTP databases into Hadoop using Sqoop
E. Write a MapReduce job with the web servers for mappers and the Hadoop cluster nodes for reducers
Explanation: Watch the training Module 21 from http://hadoopexam.com/index.html/#hadoop-training
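The Flume approach in option B scales to a 200-node web tier because each web server runs a lightweight agent that streams log events into HDFS continuously, rather than copying files by hand. Below is a minimal sketch of a Flume NG agent configuration for one web server; the agent name, access-log path, and HDFS directory are illustrative assumptions, not values given in the question.

    # Hypothetical agent "agent1": tail the local access log into HDFS
    agent1.sources = weblog
    agent1.channels = mem
    agent1.sinks = tohdfs

    # Source: follow the web server's access log (path is an assumption)
    agent1.sources.weblog.type = exec
    agent1.sources.weblog.command = tail -F /var/log/httpd/access_log
    agent1.sources.weblog.channels = mem

    # Channel: buffer events in memory between source and sink
    agent1.channels.mem.type = memory
    agent1.channels.mem.capacity = 10000

    # Sink: write events into date-partitioned HDFS directories
    agent1.sinks.tohdfs.type = hdfs
    agent1.sinks.tohdfs.channel = mem
    agent1.sinks.tohdfs.hdfs.path = /flume/weblogs/%Y-%m-%d
    agent1.sinks.tohdfs.hdfs.fileType = DataStream
    agent1.sinks.tohdfs.hdfs.useLocalTimeStamp = true

In practice the 200 agents would typically fan in to a smaller tier of collector agents before writing to HDFS, but the source/channel/sink layout above is the same at every hop.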
Question : You need to analyze a large number of small images stored in JPEG format. Because your Hadoop cluster isn't optimized for storing and processing many small files, you decide to do the following:
1. Group the individual images into a set of larger files
2. Use the set of larger files as input for a MapReduce job that processes them directly in Python using Hadoop Streaming
Which data serialization system gives you the flexibility to do this?
A. CSV B. XML C. HTML D. Avro E. SequenceFiles F. JSON
1. AB 2. AC 3. DE 4. CD 5. EF
Explanation: Avro provides: rich data structures; a compact, fast, binary data format; a container file to store persistent data; remote procedure call (RPC); and simple integration with dynamic languages. Code generation is not required to read or write data files, nor to use or implement RPC protocols; it is an optional optimization, only worth implementing for statically typed languages.

Schemas: Avro relies on schemas. When Avro data is read, the schema used when writing it is always present. This permits each datum to be written with no per-value overhead, making serialization both fast and small. It also facilitates use with dynamic scripting languages, since data, together with its schema, is fully self-describing. When Avro data is stored in a file, its schema is stored with it, so that files may be processed later by any program. If the program reading the data expects a different schema, this can be easily resolved, since both schemas are present. When Avro is used in RPC, the client and server exchange schemas in the connection handshake. (This can be optimized so that, for most calls, no schemas are actually transmitted.) Since both client and server have the other's full schema, correspondence between same-named fields, missing fields, extra fields, etc. can all be easily resolved. Avro schemas are defined with JSON, which facilitates implementation in languages that already have JSON libraries.

Comparison with other systems: Avro provides functionality similar to systems such as Thrift and Protocol Buffers, but differs in the following fundamental aspects. Dynamic typing: Avro does not require that code be generated. Data is always accompanied by a schema that permits full processing of that data without code generation or static datatypes, which facilitates construction of generic data-processing systems and languages. Untagged data: since the schema is present when data is read, considerably less type information need be encoded with the data, resulting in a smaller serialization size. No manually assigned field IDs: when a schema changes, both the old and new schema are always present when processing data, so differences may be resolved symbolically, using field names.

SequenceFiles: if your SequenceFiles are large and contain enough data repetition to make compression desirable, you still have the option to use RECORD-level compression. A SequenceFile containing images or other binary payloads is a good example of when RECORD-level compression is a good idea.
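As a concrete illustration of step 1, the sketch below (Python, to match the Hadoop Streaming scenario in the question) packs a directory of JPEGs into a single Avro container file. It assumes the Apache Avro Python library is installed; the schema, the images/ input directory, and the images.avro output path are hypothetical choices for illustration.

    import glob
    import avro.schema
    from avro.datafile import DataFileWriter
    from avro.io import DatumWriter

    # Each record carries the original filename plus the raw JPEG bytes,
    # so the container file stays fully self-describing.
    SCHEMA = avro.schema.parse("""
    {"type": "record", "name": "Image",
     "fields": [{"name": "filename", "type": "string"},
                {"name": "data",     "type": "bytes"}]}
    """)

    with open("images.avro", "wb") as out:           # hypothetical output path
        writer = DataFileWriter(out, DatumWriter(), SCHEMA, codec="deflate")
        for path in glob.glob("images/*.jpg"):       # hypothetical input dir
            with open(path, "rb") as f:
                writer.append({"filename": path, "data": f.read()})
        writer.close()

Because the schema travels in the file header, a later Streaming job can read the container without any generated code, which is exactly the flexibility the question asks about.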
Question : Identify two features/issues that YARN is designed to address:
A. Standardize on a single MapReduce API
B. Single point of failure in the NameNode
C. Reduce complexity of the MapReduce APIs
D. Resource pressure on the JobTracker
E. Ability to run frameworks other than MapReduce, such as MPI
F. HDFS latency
1. AB 2. AC 3. DE 4. CD 5. EF
Explanation: Hadoop MapReduce is not without its flaws. The team at Yahoo! ran into a number of scalability limitations that were difficult to overcome given Hadoop's existing architecture and design. In large-scale deployments such as Yahoo!'s Hammer cluster, a single 4,000-plus-node Hadoop cluster that powers various systems, the team found that the resource requirements on a single jobtracker were just too great. Further, operational issues such as dealing with upgrades and the single point of failure of the jobtracker were painful. YARN (Yet Another Resource Negotiator) was created to address these issues.

Rather than have a single daemon that tracks and assigns resources such as CPU and memory and also handles MapReduce-specific job tracking, these functions are separated into two parts. The resource management aspect of the jobtracker is run as a new daemon called the resource manager: a separate daemon responsible for creating and allocating resources to multiple applications. Each application is an individual MapReduce job, but rather than have a single jobtracker, each job now has its own jobtracker-equivalent called an application master that runs on one of the workers of the cluster. This is very different from having a centralized jobtracker in that the application master of one job is now completely isolated from that of any other. This means that if some catastrophic failure were to occur within a jobtracker, other jobs are unaffected. Further, because each jobtracker is now dedicated to a specific job, multiple jobtrackers can be running on the cluster at once. Taken one step further, each jobtracker can be a different version of the software, which enables simple rolling upgrades and multiversion support. When an application completes, its application master (the jobtracker) and other resources are returned to the cluster. As a result, there's no central jobtracker daemon in YARN.

Worker nodes in YARN also run a new daemon called the node manager in place of the traditional tasktracker. While the tasktracker expressly handled MapReduce-specific functionality such as launching and managing tasks, the node manager is more generic: it launches any type of process, dictated by the application, in an application container. For instance, in the case of a MapReduce application, the node manager manages both the application master (the jobtracker) and the individual map and reduce tasks.

With the ability to run arbitrary applications, each with its own application master, it's even possible to write non-MapReduce applications that run on YARN. Not entirely by accident, YARN provides a compute-model-agnostic resource management framework for any type of distributed computing framework. Members of the Hadoop community have already started to look at alternative processing systems that can be built on top of YARN for specific problem domains, such as graph processing and more traditional HPC systems such as MPI.

The flexibility of YARN is enticing, but it's still a new system. At the time of this writing, YARN is still considered alpha-level software and is not intended for production use. Initially introduced in the Apache Hadoop 2.0 branch, YARN hasn't yet been battle-tested in large clusters. Unfortunately, while the Apache Hadoop 2.0 lineage includes highly desirable HDFS features such as high availability, the old-style jobtracker and tasktracker daemons (now referred to as MapReduce version one, or MRv1) have been removed in favor of YARN. This creates a potential conflict for Apache Hadoop users who want these features with the tried-and-true MRv1 daemons. CDH4, however, includes the HDFS features as well as both MRv1 and YARN.
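For a cluster that ships both MRv1 and YARN (such as CDH4, mentioned above), the framework a client submits jobs to is selected by a single property. The snippet below is a minimal sketch of the relevant mapred-site.xml entry; the property name mapreduce.framework.name is standard in Hadoop 2.x, while everything else about the file is ordinary convention rather than something specified in this question.

    <!-- mapred-site.xml: route submitted MapReduce jobs to YARN -->
    <configuration>
      <property>
        <name>mapreduce.framework.name</name>
        <!-- alternatives: "classic" (MRv1 jobtracker) or "local" -->
        <value>yarn</value>
      </property>
    </configuration>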
Watch the training from http://hadoopexam.com/index.html/#hadoop-training