Question : Select the correct statement that applies to the DDL below:
CREATE TABLE page_view(
    viewTime INT,
    userid BIGINT,
    page_url STRING,
    referrer_url STRING,
    ip STRING COMMENT 'IP Address of the User')
COMMENT 'This is the page view table'
PARTITIONED BY(dt STRING, country STRING)
CLUSTERED BY(userid) SORTED BY(viewTime) INTO 32 BUCKETS
ROW FORMAT DELIMITED
    FIELDS TERMINATED BY '\001'
    COLLECTION ITEMS TERMINATED BY '\002'
    MAP KEYS TERMINATED BY '\003'
STORED AS SEQUENCEFILE;
1. The page_view table is bucketed (clustered by) userid, and within each bucket the data is sorted in increasing order of viewTime.
2. Such an organization allows the user to do efficient sampling on the clustered column - in this case userid.
3. …
4. The CLUSTERED BY and SORTED BY creation commands do not affect how data is inserted into a table - only how it is read.
Explanation: Bucketed Sorted Tables - In the example above, the page_view table is bucketed (clustered by) userid, and within each bucket the data is sorted in increasing order of viewTime. Such an organization allows the user to do efficient sampling on the clustered column - in this case userid. The sorting property allows internal operators to take advantage of the better-known data structure while evaluating queries, also increasing efficiency. The MAP KEYS and COLLECTION ITEMS keywords can be used if any of the columns are lists or maps.

The CLUSTERED BY and SORTED BY creation commands do not affect how data is inserted into a table - only how it is read. This means that users must be careful to insert data correctly, by specifying the number of reducers to be equal to the number of buckets and by using CLUSTER BY and SORT BY commands in their query.
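A load that honors this layout might look like the following Java sketch, which issues the insert through Hive's JDBC driver. The HiveServer2 URL, the credentials, and the raw_page_view staging table are assumptions made purely for illustration; the point is setting the reducer count equal to the bucket count and using DISTRIBUTE BY/SORT BY so rows land and sort the way the DDL declares.

// Minimal sketch (assumed setup, not part of the original question): loading the
// bucketed, sorted page_view table over Hive JDBC from a staging table.
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class LoadPageView {
    public static void main(String[] args) throws Exception {
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        try (Connection con = DriverManager.getConnection(
                     "jdbc:hive2://localhost:10000/default", "hive", "");
             Statement stmt = con.createStatement()) {
            // One reducer per bucket, so 32 output files match the 32 declared buckets.
            stmt.execute("SET mapred.reduce.tasks=32");
            // DISTRIBUTE BY sends all rows for a userid to the same bucket;
            // SORT BY orders rows within each bucket by viewTime,
            // matching CLUSTERED BY(userid) SORTED BY(viewTime) in the DDL.
            stmt.execute(
                "INSERT OVERWRITE TABLE page_view PARTITION (dt='2008-06-08', country='US') "
                + "SELECT viewTime, userid, page_url, referrer_url, ip FROM raw_page_view "
                + "DISTRIBUTE BY userid SORT BY viewTime");
        }
    }
}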
Explanation: public static int frequency(Collection<?> c, Object o)
Returns the number of elements in the specified collection equal to the specified object. More formally, returns the number of elements e in the collection such that (o == null ? e == null : o.equals(e)).
Parameters:
c - the collection in which to determine the frequency of o
o - the object whose frequency is to be determined
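As a quick illustration of that contract (the list contents here are made up for the example):

import java.util.Arrays;
import java.util.Collections;
import java.util.List;

public class FrequencyDemo {
    public static void main(String[] args) {
        // Collections.frequency counts elements equal to the given object,
        // handling null per the (o == null ? e == null : o.equals(e)) rule.
        List<String> pages = Arrays.asList("home", "search", "home", null, "home");
        System.out.println(Collections.frequency(pages, "home")); // prints 3
        System.out.println(Collections.frequency(pages, null));   // prints 1
    }
}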
Question : Select the correct statement about reading/writing data in an RDBMS using MapReduce.
1. In order to use DBInputFormat you need to write a class that deserializes the columns from the database record into individual data fields to work with.
2. The DBOutputFormat writes to the database by generating a set of INSERT statements in each reducer.
3. …
4. If you want to export a very large volume of data, you may be better off generating the INSERT statements into a text file, and then using a bulk data import tool provided by your database to do the database import.
5. All of the above
Correct Answer : …

Explanation: First we'll cover how DBInputFormat interacts with databases. DBInputFormat uses JDBC to connect to data sources. Because JDBC is widely implemented, DBInputFormat can work with MySQL, PostgreSQL, and several other database systems. Individual database vendors provide JDBC drivers to allow third-party applications (like Hadoop) to connect to their databases. To start using DBInputFormat to connect to your database, you'll need to download the appropriate JDBC driver and drop it into the $HADOOP_HOME/lib/ directory on your Hadoop TaskTracker machines, and on the machine from which you launch your jobs.

Reading Tables with DBInputFormat
The DBInputFormat is an InputFormat class that allows you to read data from a database. An InputFormat is Hadoop's formalization of a data source; it can mean files formatted in a particular way, data read from a database, etc. DBInputFormat provides a simple method of scanning entire tables from a database, as well as the means to read from arbitrary SQL queries performed against the database. Most queries are supported, subject to a few limitations discussed below.

Writing results back to the database
A companion class, DBOutputFormat, will allow you to write results back to a database. When setting up the job, call conf.setOutputFormat(DBOutputFormat.class); and then call DBConfiguration.configureDB() as before. The DBOutputFormat.setOutput() method then defines how the results will be written back to the database. Its three arguments are the JobConf object for the job, a string defining the name of the table to write to, and an array of strings defining the fields of the table to populate, e.g., DBOutputFormat.setOutput(job, "employees", "employee_id", "name");. The same DBWritable implementation that you created earlier will suffice to inject records back into the database. The write(PreparedStatement stmt) method will be invoked on each instance of the DBWritable that you pass to the OutputCollector from the reducer. At the end of reducing, those PreparedStatement objects will be turned into INSERT statements to run against the SQL database.

Limitations of the InputFormat
JDBC allows applications to generate SQL queries which are executed against the database; the results are then returned to the calling application. Keep in mind that you will be interacting with your database via repeated SQL queries. Therefore:
- Hadoop may need to execute the same query multiple times, and it will need to get the same results each time. So any concurrent updates to your database should not affect the query being run by your MapReduce job. This can be accomplished by disallowing writes to the table while your MapReduce job runs, by restricting your MapReduce job's query via a clause such as "insert_date < yesterday", or by dumping the data to a temporary table in the database before launching your MapReduce process.
- In order to parallelize the processing of records from the database, Hadoop will execute SQL queries that use ORDER BY, LIMIT, and OFFSET clauses to select ranges out of tables. Your results, therefore, need to be orderable by one or more keys (either PRIMARY, like the one in the example, or UNIQUE).
- In order to set the number of map tasks, the DBInputFormat needs to know how many records it will read.
So if you're writing an arbitrary SQL query against the database, you will need to provide a second query that returns the number of rows that the first query will return (e.g., by using COUNT and GROUP BY).

With these restrictions in mind, there's still a great deal of flexibility available to you. You can bulk load entire tables into HDFS, or select large ranges of data. For example, if you want to read records from a table that is also being populated by another source concurrently, you might set up that table to attach a timestamp field to each record. Before doing the bulk read, pick the current timestamp, then select all records with timestamps earlier than that one. New records being fed in by the other writer will have later timestamps and will not affect the MapReduce job.

Finally, be careful to understand the bottlenecks in your data processing pipeline. Launching a MapReduce job with 100 mappers performing queries against a database server may overload the server or its network connection. In this case, you'll achieve less parallelism than theoretically possible, due to starvation, disk seeks, and other performance penalties.

Limitations of the OutputFormat
The DBOutputFormat writes to the database by generating a set of INSERT statements in each reducer. The reducer's close() method then executes them in a bulk transaction. Performing a large number of these from several reduce tasks concurrently can swamp a database. If you want to export a very large volume of data, you may be better off generating the INSERT statements into a text file, and then using a bulk data import tool provided by your database to do the database import.
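The pieces described above can be summarized in a short sketch built on the classic org.apache.hadoop.mapred API. The employees table, its two columns, and the MySQL connection settings are illustrative assumptions, not part of the original question; the intent is only to show where DBWritable, DBConfiguration.configureDB(), DBInputFormat.setInput(), and DBOutputFormat.setOutput() fit together.

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.lib.db.DBConfiguration;
import org.apache.hadoop.mapred.lib.db.DBInputFormat;
import org.apache.hadoop.mapred.lib.db.DBOutputFormat;
import org.apache.hadoop.mapred.lib.db.DBWritable;

/** One database row, deserialized into plain Java fields (illustrative sketch). */
public class EmployeeRecord implements Writable, DBWritable {
    private long employeeId;
    private String name;

    // DBWritable: called by DBInputFormat for every row of the ResultSet.
    @Override
    public void readFields(ResultSet rs) throws SQLException {
        employeeId = rs.getLong("employee_id");
        name = rs.getString("name");
    }

    // DBWritable: called by DBOutputFormat to fill the generated INSERT statement.
    @Override
    public void write(PreparedStatement stmt) throws SQLException {
        stmt.setLong(1, employeeId);
        stmt.setString(2, name);
    }

    // Writable: needed so the record can travel between map and reduce tasks.
    @Override
    public void readFields(DataInput in) throws IOException {
        employeeId = in.readLong();
        name = Text.readString(in);
    }

    @Override
    public void write(DataOutput out) throws IOException {
        out.writeLong(employeeId);
        Text.writeString(out, name);
    }

    /** Job wiring: read the employees table, then write results back to it. */
    public static void configure(JobConf conf) {
        DBConfiguration.configureDB(conf, "com.mysql.jdbc.Driver",
                "jdbc:mysql://localhost/mydatabase", "user", "password");

        conf.setInputFormat(DBInputFormat.class);
        // Full table scan, ordered by the primary key so the range-based splits
        // (ORDER BY / LIMIT / OFFSET) are deterministic.
        DBInputFormat.setInput(conf, EmployeeRecord.class,
                "employees", null /* conditions */, "employee_id",
                "employee_id", "name");

        conf.setOutputFormat(DBOutputFormat.class);
        DBOutputFormat.setOutput(conf, "employees", "employee_id", "name");
    }
}

As the explanation notes, the write(PreparedStatement) side feeds the INSERT statements that each reducer executes in its close() call, so for very large exports a text-file dump followed by the database's bulk loader is usually the safer route.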
1. Pig comes with additional capabilities to MapReduce. Pig programs are executed as MapReduce jobs via the Pig interpreter.
2. Pig comes with no additional capabilities to MapReduce. Pig programs are executed as MapReduce jobs via the Pig interpreter.
3. …
4. Pig comes with additional capabilities to MapReduce. Pig programs are executed as MapReduce jobs via the Pig interpreter.
1. The node containing the first TaskTracker to heartbeat into the JobTracker, regardless of the location of the input split
2. The node containing the first JobTracker to heartbeat into the NameNode, regardless of the location of the input split
3. …
4. The node containing the nearest location of the input split