Correct Answer : Exp: MapReduce is a framework for processing parallelizable problems across huge datasets using a large number of computers (nodes), collectively referred to as a cluster (if all nodes are on the same local network and use similar hardware) or a grid (if the nodes are shared across geographically and administratively distributed systems and use more heterogeneous hardware). Processing can occur on data stored either in a filesystem (unstructured) or in a database (structured). MapReduce can take advantage of data locality, processing data on or near the storage assets in order to reduce the distance over which it must be transmitted.
"Map" step: Each worker node applies the "map()" function to the local data, and writes the output to a temporary storage. A master node orchestrates that for redundant copies of input data, only one is processed. "Shuffle" step: Worker nodes redistribute data based on the output keys (produced by the "map()" function), such that all data belonging to one key is located on the same worker node. "Reduce" step: Worker nodes now process each group of output data, per key, in parallel. MapReduce allows for distributed processing of the map and reduction operations. Provided that each mapping operation is independent of the others, all maps can be performed in parallel - though in practice this is limited by the number of independent data sources and/or the number of CPUs near each source. Similarly, a set of 'reducers' can perform the reduction phase, provided that all outputs of the map operation that share the same key are presented to the same reducer at the same time, or that the reduction function is associative. While this process can often appear inefficient compared to algorithms that are more sequential, MapReduce can be applied to significantly larger datasets than "commodity" servers can handle - a large server farm can use MapReduce to sort a petabyte of data in only a few hours.[10] The parallelism also offers some possibility of recovering from partial failure of servers or storage during the operation: if one mapper or reducer fails, the work can be rescheduled - assuming the input data is still available.
Another way to look at MapReduce is as a 5-step parallel and distributed computation:
1. Prepare the Map() input: the "MapReduce system" designates Map processors, assigns the input key value K1 that each processor will work on, and provides that processor with all the input data associated with that key value.
2. Run the user-provided Map() code: Map() is run exactly once for each K1 key value, generating output organized by key values K2.
3. "Shuffle" the Map output to the Reduce processors: the MapReduce system designates Reduce processors, assigns the K2 key value each processor will work on, and provides that processor with all the Map-generated data associated with that key value.
4. Run the user-provided Reduce() code: Reduce() is run exactly once for each K2 key value produced by the Map step.
5. Produce the final output: the MapReduce system collects all the Reduce output and sorts it by K2 to produce the final outcome.
These five steps can logically be thought of as running in sequence (each step starts only after the previous step is completed), although in practice they can be interleaved as long as the final result is not affected.
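As an illustration of the five steps, here is a minimal sketch in R that averages salaries per occupation; the records, field names, and choice of K1 and K2 are hypothetical (K1 is simply each record's position in the input, K2 is the occupation emitted by Map()).

# Step 1: prepare the Map() input; each record is keyed by K1 (its index).
records <- list(
  list(occupation = "Engineer", salary = 65000),
  list(occupation = "Analyst",  salary = 50000),
  list(occupation = "Engineer", salary = 80000)
)

# Step 2: run Map() once per K1 value, emitting output keyed by K2.
map_out <- lapply(records, function(r) list(key = r$occupation, value = r$salary))

# Step 3: shuffle, i.e. group the Map output by K2.
by_k2 <- split(sapply(map_out, function(p) p$value),
               sapply(map_out, function(p) p$key))

# Step 4: run Reduce() once per K2 value, here averaging the salaries.
reduce_out <- lapply(by_k2, mean)

# Step 5: collect the final output, sorted by K2.
reduce_out[order(names(reduce_out))]   # Analyst = 50000, Engineer = 72500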
In many situations, the input data might already be distributed ("sharded") among many different servers, in which case step 1 could sometimes be greatly simplified by assigning Map servers that would process the locally present input data. Similarly, step 3 could sometimes be sped up by assigning Reduce processors that are as close as possible to the Map-generated data they need to process.
Question : Which data type value is used for the observed response variable in a logistic regression model?
Correct Answer : Exp: The observed response variable in a logistic regression model is binary (categorical), taking one of two values such as 0/1, yes/no, or success/failure. Logistic regression is used widely in many fields, including the medical and social sciences. For example, the Trauma and Injury Severity Score (TRISS), which is widely used to predict mortality in injured patients, was originally developed by Boyd et al. using logistic regression. Many other medical scales used to assess the severity of a patient have been developed using logistic regression. Logistic regression may be used to predict whether a patient has a given disease (e.g. diabetes or coronary heart disease), based on observed characteristics of the patient such as age, sex, body mass index, and the results of various blood tests, or risk factors such as blood cholesterol level, systolic blood pressure, relative weight, blood hemoglobin level, smoking (at three levels), and an abnormal electrocardiogram. Another example might be to predict whether an American voter will vote Democratic or Republican, based on age, income, sex, race, state of residence, votes in previous elections, etc. The technique can also be used in engineering, especially for predicting the probability of failure of a given process, system, or product. It is also used in marketing applications such as predicting a customer's propensity to purchase a product or halt a subscription. In economics it can be used to predict the likelihood of a person's choosing to be in the labor force, and a business application would be to predict the likelihood of a homeowner defaulting on a mortgage. Conditional random fields, an extension of logistic regression to sequential data, are used in natural language processing.
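The point about the response variable can be seen directly in R: glm() with family = binomial fits a logistic regression and expects a binary outcome. Here is a minimal sketch using simulated, hypothetical patient data.

set.seed(1)
age <- rnorm(100, mean = 50, sd = 10)
# The observed response is binary: disease present (1) or absent (0).
disease <- rbinom(100, size = 1, prob = plogis(-10 + 0.2 * age))

# family = binomial tells glm() to fit a logistic regression.
model <- glm(disease ~ age, family = binomial)
summary(model)

# The fitted model outputs a probability, but the observed responses
# themselves are always one of the two categories, 0 or 1.
predict(model, newdata = data.frame(age = 60), type = "response")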
Question : A data scientist is given an R data frame, "empdata", with the columns Age, Salary, Occupation, Education, and Gender. The data scientist would like to examine only the Salary and Occupation columns for ages greater than 40. Which command extracts the appropriate rows and columns from the data frame?
Correct Answer : Exp: A data frame is used for storing data tables. It is a list of vectors of equal length. For example, the following variable df is a data frame containing three vectors n, s, b.

> n = c(2, 3, 5)
> s = c("aa", "bb", "cc")
> b = c(TRUE, FALSE, TRUE)
> df = data.frame(n, s, b)   # df is a data frame

Built-in Data Frames. We use built-in data frames in R for our tutorials. For example, here is a built-in data frame in R, called mtcars.

> mtcars
               mpg cyl disp  hp drat   wt ...
Mazda RX4     21.0   6  160 110 3.90 2.62 ...
Mazda RX4 Wag 21.0   6  160 110 3.90 2.88 ...
Datsun 710    22.8   4  108  93 3.85 2.32 ...
............

The top line of the table, called the header, contains the column names. Each horizontal line afterward denotes a data row, which begins with the name of the row, followed by the actual data. Each data member of a row is called a cell. To retrieve data in a cell, we enter its row and column coordinates in the single square bracket "[]" operator, separated by a comma: the row position first, then the column position. The order is important. Here is the cell value from the first row, second column of mtcars.

> mtcars[1, 2]
[1] 6

Moreover, we can use the row and column names instead of the numeric coordinates.

> mtcars["Mazda RX4", "cyl"]
[1] 6

Lastly, the number of data rows in the data frame is given by the nrow function.

> nrow(mtcars)   # number of data rows
[1] 32

And the number of columns of a data frame is given by the ncol function.

> ncol(mtcars)   # number of columns
[1] 11

Further details of the mtcars data set are available in the R documentation.

> help(mtcars)

Preview. Instead of printing out the entire data frame, it is often desirable to preview it with the head function beforehand.

> head(mtcars)
               mpg cyl disp  hp drat   wt ...
Mazda RX4     21.0   6  160 110 3.90 2.62 ...
............
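Applying the same single-bracket indexing to the question: the sketch below builds a small, hypothetical empdata with the columns named in the question, and its last line is the command the question asks for, using a logical condition to select the rows and a character vector to select the columns.

# Hypothetical sample data with the columns described in the question.
empdata <- data.frame(
  Age        = c(35, 42, 51, 28),
  Salary     = c(50000, 65000, 80000, 45000),
  Occupation = c("Analyst", "Engineer", "Manager", "Clerk"),
  Education  = c("BS", "MS", "PhD", "BS"),
  Gender     = c("F", "M", "F", "M")
)

# Rows where Age is greater than 40, keeping only Salary and Occupation.
empdata[empdata$Age > 40, c("Salary", "Occupation")]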