Explanation: public static int frequency(Collection<?> c, Object o)
Returns the number of elements in the specified collection equal to the specified object. More formally, returns the number of elements e in the collection such that (o == null ? e == null : o.equals(e)).
Parameters:
c - the collection in which to determine the frequency of o
o - the object whose frequency is to be determined
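As a quick illustration (a minimal sketch; the list contents and class name below are made up for the example), Collections.frequency can be used like this:

import java.util.Arrays;
import java.util.Collections;
import java.util.List;

public class FrequencyExample {
    public static void main(String[] args) {
        // Example data chosen only for illustration
        List<String> names = Arrays.asList("alice", "bob", "alice", "carol", "alice");
        // Counts the elements e for which "alice".equals(e) is true
        int count = Collections.frequency(names, "alice");
        System.out.println(count); // prints 3
    }
}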
Question : Select the correct statement about reading/writing data in an RDBMS using MapReduce
1. In order to use DBInputFormat you need to write a class that deserializes the columns from the database record into individual data fields to work with
2. The DBOutputFormat writes to the database by generating a set of INSERT statements in each reducer
3. Access Mostly Uused Products by 50000+ Subscribers
4. If you want to export a very large volume of data, you may be better off generating the INSERT statements into a text file, and then using a bulk data import tool provided by your database to do the database import
5. All of the above
Correct Answer :

Explanation: First we'll cover how DBInputFormat interacts with databases. DBInputFormat uses JDBC to connect to data sources. Because JDBC is widely implemented, DBInputFormat can work with MySQL, PostgreSQL, and several other database systems. Individual database vendors provide JDBC drivers to allow third-party applications (like Hadoop) to connect to their databases. Links to popular drivers are listed in the resources section at the end of this post. To start using DBInputFormat to connect to your database, you'll need to download the appropriate database driver from the list in the resources section (see the end of this post), and drop it into the $HADOOP_HOME/lib/ directory on your Hadoop TaskTracker machines and on the machine from which you launch your jobs.

Reading Tables with DBInputFormat
DBInputFormat is an InputFormat class that allows you to read data from a database. An InputFormat is Hadoop's formalization of a data source; it can mean files formatted in a particular way, data read from a database, and so on. DBInputFormat provides a simple method of scanning entire tables from a database, as well as the means to read from arbitrary SQL queries performed against the database. Most queries are supported, subject to a few limitations discussed at the end of this article.

Writing results back to the database
A companion class, DBOutputFormat, allows you to write results back to a database. When setting up the job, call conf.setOutputFormat(DBOutputFormat.class); and then call DBConfiguration.configureDB() as before. The DBOutputFormat.setOutput() method then defines how the results will be written back to the database. Its three arguments are the JobConf object for the job, a string defining the name of the table to write to, and an array of strings defining the fields of the table to populate, e.g., DBOutputFormat.setOutput(job, "employees", "employee_id", "name");. The same DBWritable implementation that you created earlier will suffice to inject records back into the database. The write(PreparedStatement stmt) method will be invoked on each instance of the DBWritable that you pass to the OutputCollector from the reducer. At the end of reducing, those PreparedStatement objects will be turned into INSERT statements to run against the SQL database.

Limitations of the InputFormat
JDBC allows applications to generate SQL queries which are executed against the database; the results are then returned to the calling application. Keep in mind that you will be interacting with your database via repeated SQL queries. Therefore:
- Hadoop may need to execute the same query multiple times, and it will need to return the same results each time. So any concurrent updates to your database should not affect the query being run by your MapReduce job. This can be accomplished by disallowing writes to the table while your MapReduce job runs, restricting your MapReduce job's query via a clause such as "insert_date < yesterday", or dumping the data to a temporary table in the database before launching your MapReduce process.
- In order to parallelize the processing of records from the database, Hadoop will execute SQL queries that use ORDER BY, LIMIT, and OFFSET clauses to select ranges out of tables. Your results, therefore, need to be orderable by one or more keys (either PRIMARY, like the one in the example, or UNIQUE).
- In order to set the number of map tasks, the DBInputFormat needs to know how many records it will read.
So if you're writing an arbitrary SQL query against the database, you will need to provide a second query that returns the number of rows that the first query will return (e.g., by using COUNT and GROUP BY).

With these restrictions in mind, there's still a great deal of flexibility available to you. You can bulk load entire tables into HDFS, or select large ranges of data. For example, if you want to read records from a table that is also being populated by another source concurrently, you might set up that table to attach a timestamp field to each record. Before doing the bulk read, pick the current timestamp, then select all records with timestamps earlier than that one. New records being fed in by the other writer will have later timestamps and will not affect the MapReduce job.

Finally, be careful to understand the bottlenecks in your data processing pipeline. Launching a MapReduce job with 100 mappers performing queries against a database server may overload the server or its network connection. In this case, you'll achieve less parallelism than theoretically possible, due to starvation, disk seeks, and other performance penalties.

Limitations of the OutputFormat
The DBOutputFormat writes to the database by generating a set of INSERT statements in each reducer. The reducer's close() method then executes them in a bulk transaction. Performing a large number of these from several reduce tasks concurrently can swamp a database. If you want to export a very large volume of data, you may be better off generating the INSERT statements into a text file, and then using a bulk data import tool provided by your database to do the database import.
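To tie the pieces above together, here is a minimal sketch of a DBWritable record class and the corresponding job setup, using the old org.apache.hadoop.mapred API (JobConf) that the explanation refers to. The employees table, its employee_id and name columns, and the JDBC driver class, URL, and credentials are assumptions carried over from the example or invented for illustration; mapper and reducer wiring is omitted.

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.lib.db.DBConfiguration;
import org.apache.hadoop.mapred.lib.db.DBInputFormat;
import org.apache.hadoop.mapred.lib.db.DBOutputFormat;
import org.apache.hadoop.mapred.lib.db.DBWritable;

public class EmployeeDbJob {

    // Deserializes one row of the (assumed) employees table into plain fields,
    // and serializes it back into the parameters of an INSERT statement.
    public static class EmployeeRecord implements Writable, DBWritable {
        long employeeId;
        String name;

        public void readFields(ResultSet rs) throws SQLException {
            // Column order matches the field names passed to setInput()/setOutput()
            employeeId = rs.getLong(1);
            name = rs.getString(2);
        }

        public void write(PreparedStatement stmt) throws SQLException {
            stmt.setLong(1, employeeId);
            stmt.setString(2, name);
        }

        // Writable methods let the record travel between map and reduce tasks
        public void readFields(DataInput in) throws IOException {
            employeeId = in.readLong();
            name = Text.readString(in);
        }

        public void write(DataOutput out) throws IOException {
            out.writeLong(employeeId);
            Text.writeString(out, name);
        }
    }

    public static JobConf configureJob() {
        JobConf conf = new JobConf(EmployeeDbJob.class);

        // Driver class, JDBC URL and credentials are placeholders
        DBConfiguration.configureDB(conf, "com.mysql.jdbc.Driver",
                "jdbc:mysql://localhost/mydb", "user", "password");

        // Read the whole employees table, ordered on its primary key
        conf.setInputFormat(DBInputFormat.class);
        DBInputFormat.setInput(conf, EmployeeRecord.class,
                "employees", null /* conditions */, "employee_id" /* order by */,
                "employee_id", "name");

        // Write results back as INSERTs into the same two columns
        conf.setOutputFormat(DBOutputFormat.class);
        DBOutputFormat.setOutput(conf, "employees", "employee_id", "name");

        // Mapper/reducer classes and output key/value types are omitted in this sketch
        return conf;
    }
}

The same EmployeeRecord class serves both directions: readFields(ResultSet) is used when reading with DBInputFormat, and write(PreparedStatement) is used when DBOutputFormat turns reducer output into INSERT statements.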
1. Run all the nodes in your production cluster as virtual machines on your development workstation.
2. Run the hadoop command with the -jt local and the -fs file:/// options.
3. Run the DataNode, TaskTracker, NameNode and JobTracker daemons on a single machine.
4. Run simldooop, the Apache open-source software for simulating Hadoop clusters.
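For context, the -jt local and -fs file:/// options mentioned in choice 2 are handled by Hadoop's GenericOptionsParser and amount to setting two configuration properties. A minimal sketch using the old property names (the class name is made up for the example):

import org.apache.hadoop.mapred.JobConf;

public class LocalRunnerConfig {
    public static JobConf localConf() {
        JobConf conf = new JobConf();
        // Equivalent to -jt local: run the job in a single local JVM (LocalJobRunner)
        conf.set("mapred.job.tracker", "local");
        // Equivalent to -fs file:///: use the local filesystem instead of HDFS
        conf.set("fs.default.name", "file:///");
        return conf;
    }
}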
1. The keys given to a reducer aren't in a predictable order, but the values associated with those keys always are.
2. Both the keys and values passed to a reducer always appear in sorted order.
3. Neither keys nor values are in any predictable order.
4. The keys given to a reducer are in sorted order but the values associated with each key are in no predictable order.
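These choices describe Hadoop's shuffle-and-sort guarantee: the keys reaching a reducer arrive in sorted order, but the values grouped under each key arrive in no predictable order (unless you implement a secondary sort). A minimal old-API reducer sketch, with word-count-style types assumed purely for illustration:

import java.io.IOException;
import java.util.Iterator;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

public class SumReducer extends MapReduceBase
        implements Reducer<Text, IntWritable, Text, IntWritable> {

    public void reduce(Text key, Iterator<IntWritable> values,
                       OutputCollector<Text, IntWritable> output, Reporter reporter)
            throws IOException {
        // reduce() is called with keys in sorted order within this reducer's partition,
        // but the IntWritable values under each key are in no guaranteed order.
        int sum = 0;
        while (values.hasNext()) {
            sum += values.next().get();
        }
        output.collect(key, new IntWritable(sum));
    }
}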