Question : You are upgrading your Hadoop cluster from MRv1 to MRv2. When you first built the cluster, you did not give serious consideration to the disk drive configuration on the slave DataNodes, but now you want the proper hardware configuration for the slave nodes. Select the correct configuration for a DataNode.
4. A RAID configuration with 24 slots of 500GB disk drives is fine
Correct Answer : Explanation: The first step in choosing a machine configuration is to understand the type of hardware your operations team already manages. Operations teams often have opinions or hard requirements about new machine purchases, and will prefer to work with hardware with which they are already familiar. Hadoop is not the only system that benefits from economies of scale. As a general suggestion, if the cluster is new or you can't accurately predict your ultimate workload, we advise that you use balanced hardware.

There are four types of roles in a basic Hadoop cluster: NameNode (and Standby NameNode), JobTracker, TaskTracker, and DataNode. (A node is a machine performing a particular task.) Most machines in your cluster will perform two of these roles, functioning as both DataNode (for data storage) and TaskTracker (for data processing). The recommended specifications for DataNode/TaskTrackers in a balanced Hadoop cluster are:
- 12-24 1-4TB hard disks in a JBOD (Just a Bunch Of Disks) configuration
- 2 quad-/hex-/octo-core CPUs, running at least 2-2.5GHz
- 64-512GB of RAM
- Bonded Gigabit Ethernet or 10 Gigabit Ethernet (the greater the storage density, the higher the network throughput needed)

The NameNode role is responsible for coordinating data storage on the cluster, and the JobTracker for coordinating data processing. (The Standby NameNode should not be co-located on the NameNode machine, and should run on hardware identical to that of the NameNode.) Cloudera recommends that customers purchase enterprise-class machines for running the NameNode and JobTracker, with redundant power and enterprise-grade disks in RAID 1 or 10 configurations. The NameNode will also require RAM directly proportional to the number of data blocks in the cluster. A good rule of thumb is to assume 1GB of NameNode memory for every 1 million blocks stored in the distributed file system. With 100 DataNodes in a cluster, 64GB of RAM on the NameNode provides plenty of room to grow the cluster. We also recommend having HA configured on both the NameNode and JobTracker, features that have been available in the CDH4 line for some time. The recommended specifications for NameNode/JobTracker/Standby NameNode machines are below; the drive count will fluctuate depending on the amount of redundancy:
- 4-6 1TB hard disks in a JBOD configuration (1 for the OS, 2 for the FS image [RAID 1], 1 for Apache ZooKeeper, and 1 for the JournalNode)
- 2 quad-/hex-/octo-core CPUs, running at least 2-2.5GHz
- 64-128GB of RAM
- Bonded Gigabit Ethernet or 10 Gigabit Ethernet

Remember, the Hadoop ecosystem is designed with a parallel environment in mind. If you expect your Hadoop cluster to grow beyond 20 machines, we recommend that the initial cluster be configured as if it were to span two racks, where each rack has a top-of-rack 10 GigE switch. As the cluster grows to multiple racks, you will want to add redundant core switches connecting the top-of-rack switches with 40 GigE. Having two logical racks gives the operations team a better understanding of the network requirements for intra-rack and cross-rack communication. With a Hadoop cluster in place, the team can start identifying workloads and prepare to benchmark those workloads to identify hardware bottlenecks. After some time benchmarking and monitoring, the team will understand how additional machines should be configured.
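To make the 1GB-per-million-blocks rule of thumb above concrete, here is a minimal back-of-the-envelope sketch; the block size, data volume, and function name are illustrative assumptions, not from the original text:

```python
# A rough sketch of the rule of thumb quoted above:
# ~1 GB of NameNode heap per 1 million HDFS blocks.
# All figures below are illustrative assumptions.

def namenode_heap_gb(data_tb, block_size_mb=128):
    """Rough NameNode heap estimate from the logical data volume."""
    blocks = (data_tb * 1024 * 1024) / block_size_mb  # TB -> MB -> block count
    return blocks / 1_000_000                         # ~1 GB per 1M blocks

# Example: 100 DataNodes x 24 TB raw, divided by 3x replication = ~800 TB of data.
print(f"~{namenode_heap_gb(800):.1f} GB of heap")  # ~6.6 GB, so 64 GB leaves
                                                   # plenty of room to grow
```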
Heterogeneous Hadoop clusters are common, especially as they grow in size and number of use cases, so starting with a set of machines that are not "ideal" for your workload will not be a waste of time. Cloudera Manager offers templates that allow different hardware profiles to be managed in groups, making it simple to manage heterogeneous clusters. DataNodes will typically have multiple disks installed. Disks should be configured in a JBOD (Just a Bunch of Disks) configuration; RAID is not necessary, since the cluster replicates blocks on multiple nodes. Hadoop is designed with the expectation that disks (and even nodes) will fail; it is for that reason that each block is replicated three times (by default) on different machines in the cluster. Using RAID will simply result in less disk space being available to HDFS: for instance, if you configure 12 2TB disks using RAID 1, you will have only 12TB of disk space available to HDFS, whereas if you configure the disks as JBOD you will have the full 24TB available.

Given that each worker node in a cluster is responsible for both storage and computation, we need to ensure not only that there is enough storage capacity, but also that we have the CPU and memory to process that data. Consider the case where a system ingests new data at a rate of 1TB per day. Hadoop will replicate this data three times by default, which means the hardware needs to accommodate 3TB of new data every day. Each machine also needs additional disk capacity to store temporary data during processing with MapReduce; a ballpark estimate is that 20-30% of the machine's raw disk capacity needs to be reserved for temporary data. If we had machines with 12 x 2TB disks, that leaves only 18TB of space to store HDFS data, or six days' worth of data.
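The capacity arithmetic above can be sketched as follows; the constants mirror the worked example (12 x 2TB disks, a 25% temporary-space reserve, 1TB/day ingest, 3x replication) and are assumptions for illustration only:

```python
# A sketch of the per-node capacity arithmetic from the worked example.
disks_per_node = 12
disk_tb = 2.0
temp_reserve = 0.25        # 20-30% of raw capacity kept for MapReduce temp data
ingest_tb_per_day = 1.0
replication = 3

raw_tb = disks_per_node * disk_tb               # 24 TB of raw JBOD capacity
hdfs_tb = raw_tb * (1 - temp_reserve)           # 18 TB left for HDFS blocks
daily_tb = ingest_tb_per_day * replication      # 3 TB/day after replication

print(f"{hdfs_tb:.0f} TB usable, {hdfs_tb / daily_tb:.0f} days of ingest")
# -> 18 TB usable, 6 days of ingest

# RAID 1 comparison from the text: mirroring halves raw capacity, so the
# same 12 x 2 TB disks expose only 12 TB to HDFS instead of 24 TB.
print(f"RAID 1: {raw_tb / 2:.0f} TB raw vs JBOD: {raw_tb:.0f} TB raw")
```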
Question : You already have a cluster on Hadoop MapReduce MRv1, and now you have to upgrade it to MRv2, but your management is not agreeing to install Apache Hive, and you have to convince them to install Apache Hive in the Hadoop cluster. Which statement can you use to show the correct relationship between MapReduce and Apache Hive?
1. Apache Hive comes with additional capabilities that allow certain types of data manipulation not possible with MapReduce.
2. Apache Hive programs rely only on MapReduce but are extensible, allowing developers to do special-purpose processing not provided by MapReduce.
4. Apache Hive comes with no additional capabilities beyond MapReduce. Hive programs are executed as MapReduce jobs via the Hive interpreter.
Correct Answer : 4 Explanation: The Apache Hive data warehouse software facilitates querying and managing large datasets residing in distributed storage. Hive provides a mechanism to project structure onto this data and to query it using a SQL-like language called HiveQL. At the same time, this language also allows traditional map/reduce programmers to plug in their custom mappers and reducers when it is inconvenient or inefficient to express this logic in HiveQL. Hive is a framework that translates queries written in HiveQL into jobs that are executed by the MapReduce framework. Hive does not provide any functionality that isn't provided by MapReduce, but it makes some types of data operations significantly easier to perform.
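As an illustration of that relationship (not from the original explanation), a simple HiveQL aggregation like the following is compiled by Hive into one or more MapReduce jobs; the table and column names here are hypothetical:

```sql
-- Hypothetical table and columns, for illustration only. Hive compiles
-- this query into a MapReduce job: the GROUP BY key becomes the shuffle
-- key, and the COUNT aggregation runs on the reduce side.
SELECT page, COUNT(*) AS hits
FROM web_logs
GROUP BY page
ORDER BY hits DESC
LIMIT 10;
```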
Question : What is Hive?
1. Hive is part of the Apache Hadoop project that enables in-memory analysis of real-time streams of data
2. Hive is a way to add data from the local file system to HDFS
4. Hive is a part of the Apache Hadoop project that provides a SQL-like interface for data processing
Correct Answer : 4 Explanation: Hive is a project initially developed by Facebook, specifically for people with very strong SQL skills and not very strong Java skills, who want to query data in Hadoop.
1. 6 GB reserved for system memory + (if HBase) 8 GB for HBase
2. 4 GB reserved for system memory + (if HBase) 8 GB for HBase
3. 2 GB reserved for system memory + (if HBase) 8 GB for HBase
4. 12 GB reserved for system memory + (if HBase) 8 GB for HBase
1. The physical RAM limit for each Map and Reduce task
2. The JVM heap size limit for each task
3. The amount of virtual memory each task will receive
4. 1 and 3
5. All of 1, 2 and 3
1. Two active NameNodes and two Standby NameNodes
2. One active NameNode and one Standby NameNode
3. Two active NameNodes and one Standby NameNode
4. Unlimited. HDFS High Availability (HA) is designed to overcome limitations on the number of NameNodes you can deploy