Question : You are given , , user profile pages of an online dating site in XML files, and they are stored in HDFS. You are assigned to divide the users into groups based on the content of their profiles. You have been instructed to try K-means clustering on this data. How should you proceed?
1. Divide the data into sets of 1, 000 user profiles, and run K-means clustering in RHadoop iteratively. 2. Run MapReduce to transform the data, and find relevant key value pairs. 3. Access Mostly Uused Products by 50000+ Subscribers 4. Partition the data by XML file size, and run K-means clustering in each partition.
Explanation: MapReduce is a framework for processing parallelizable problems across huge datasets using a large number of computers (nodes), collectively referred to as a cluster (if all nodes are on the same local network and use similar hardware) or a grid (if the nodes are shared across geographically and administratively distributed systems, and use more heterogenous hardware). Processing can occur on data stored either in a filesystem (unstructured) or in a database (structured). MapReduce can take advantage of locality of data, processing it on or near the storage assets in order to reduce the distance over which it must be transmitted.
"Map" step: Each worker node applies the "map()" function to the local data, and writes the output to a temporary storage. A master node orchestrates that for redundant copies of input data, only one is processed. "Shuffle" step: Worker nodes redistribute data based on the output keys (produced by the "map()" function), such that all data belonging to one key is located on the same worker node. "Reduce" step: Worker nodes now process each group of output data, per key, in parallel. MapReduce allows for distributed processing of the map and reduction operations. Provided that each mapping operation is independent of the others, all maps can be performed in parallel - though in practice this is limited by the number of independent data sources and/or the number of CPUs near each source. Similarly, a set of 'reducers' can perform the reduction phase, provided that all outputs of the map operation that share the same key are presented to the same reducer at the same time, or that the reduction function is associative. While this process can often appear inefficient compared to algorithms that are more sequential, MapReduce can be applied to significantly larger datasets than "commodity" servers can handle - a large server farm can use MapReduce to sort a petabyte of data in only a few hours.[10] The parallelism also offers some possibility of recovering from partial failure of servers or storage during the operation: if one mapper or reducer fails, the work can be rescheduled - assuming the input data is still available.
Question : The Marketing department of your company wishes to track opinion on a new product that was recently introduced. Marketing would like to know how many positive and negative reviews are appearing over a given period and potentially retrieve each review for more in-depth insight. They have identified several popular product review blogs that historically have published thousands of user reviews of your company's products.
You have been asked to provide the desired analysis. You examine the RSS feeds for each blog and determine which fields are relevant. You then craft a regular expression to match your new product's name and extract the relevant text from each matching review. What is the next step you should take?
1. Use the extracted text and your regular expression to perform a sentiment analysis based on mentions of the new product 2. Convert the extracted text into a suitable document representation and index into a review corpus 3. Access Mostly Uused Products by 50000+ Subscribers 4. Group the reviews using Naive Bayesian classification
Explanation: A data frame is used for storing data tables. It is a list of vectors of equal length. For example, the following variable df is a data frame containing three vectors n, s, b.
> n = c(2, 3, 5) > s = c("aa", "bb", "cc") > b = c(TRUE, FALSE, TRUE) > df = data.frame(n, s, b) # df is a data frame Build-in Data Frame We use built-in data frames in R for our tutorials. For example, here is a built-in data frame in R, called mtcars.
> mtcars mpg cyl disp hp drat wt ... Mazda RX4 21.0 6 160 110 3.90 2.62 ... Mazda RX4 Wag 21.0 6 160 110 3.90 2.88 ... Datsun 710 22.8 4 108 93 3.85 2.32 ... ............ The top line of the table, called the header, contains the column names. Each horizontal line afterward denotes a data row, which begins with the name of the row, and then followed by the actual data. Each data member of a row is called a cell.
To retrieve data in a cell, we would enter its row and column coordinates in the single square bracket "[]" operator. The two coordinates are separated by a comma. In other words, the coordinates begins with row position, then followed by a comma, and ends with the column position.