Question : Assume that you have a data frame in R. Which function would you use to display descriptive statistics for the variables in this data frame?
1. levels 2. attributes 3. str 4. summary
Correct Answer : 4 Explanation: summary is a generic function used to produce result summaries of the results of various model fitting functions. The function invokes particular methods which depend on the class of the first argument.
Usage:
summary(object, ...)
## Default S3 method:
summary(object, ..., digits = max(3, getOption("digits") - 3))
## S3 method for class 'data.frame'
summary(object, maxsum = 7, digits = max(3, getOption("digits") - 3), ...)
## S3 method for class 'factor'
summary(object, maxsum = 100, ...)
## S3 method for class 'matrix'
summary(object, ...)
Arguments:
object : an object for which a summary is desired.
maxsum : integer, indicating how many levels should be shown for factors.
digits : integer, used for number formatting with signif() (for summary.default) or format() (for summary.data.frame).
... : additional arguments affecting the summary produced.
Details : For factors, the frequency of the first maxsum - 1 most frequent levels is shown, and the less frequent levels are summarized in "(Others)" (resulting in at most maxsum frequencies). The functions summary.lm and summary.glm are examples of particular methods which summarize the results produced by lm and glm.
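As a quick illustration (a minimal sketch, not part of the original question; it uses the iris data frame that ships with base R), summary() applied to a data frame reports descriptive statistics for every column:

df <- iris                    # built-in data frame with numeric and factor columns
summary(df)                   # numeric columns: Min., 1st Qu., Median, Mean, 3rd Qu., Max.
                              # factor columns (Species): counts per level
summary(df$Sepal.Length)      # summary() is generic, so it also works on a single variable
summary(df$Species)           # frequency table for a factor

For numeric columns the output lists the minimum, quartiles, mean, and maximum; for factors it lists level counts, which is why summary is the appropriate choice for descriptive statistics.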
Question : What is the mandatory clause that must be included when using window functions? 1. OVER 2. RANK 3. PARTITION BY 4. RANK BY
Correct Answer : 1 Explanation: A window function call always contains an OVER clause following the window function's name and argument(s). This is what syntactically distinguishes it from a regular function or aggregate function. The OVER clause determines exactly how the rows of the query are split up for processing by the window function. The PARTITION BY list within OVER specifies dividing the rows into groups, or partitions, that share the same values of the PARTITION BY expression(s). For each row, the window function is computed across the rows that fall into the same partition as the current row.
Although avg will produce the same result no matter what order it processes the partition's rows in, this is not true of all window functions. When needed, you can control that order using ORDER BY within OVER. Here is an example:
SELECT depname, empno, salary,
       rank() OVER (PARTITION BY depname ORDER BY salary DESC)
FROM empsalary;
Question : What is the purpose of the process step "parsing" in text analysis? 1. computes the TF-IDF values for all keywords and indices 2. executes the clustering and classification to organize the contents 3. performs the search and/or retrieval in finding a specific topic or an entity in a document 4. imposes a structure on the unstructured/semi-structured text for downstream analysis
Correct Answer : 4 Explanation: Parsing is the process that takes unstructured text and imposes a structure for further analysis. The unstructured text could be a plain text file, a weblog, an Extensible Markup Language (XML) file, a HyperText Markup Language (HTML) file, or a Word document. Parsing deconstructs the provided text and renders it in a more structured way for the subsequent steps.
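As a minimal sketch of what parsing means in practice (the log format, field names, and regular expression below are assumptions for illustration only), base R can impose a tabular structure on raw text lines:

# Unstructured weblog lines (format assumed for this example)
log_lines <- c(
  "2024-05-01 12:03:17 INFO user=alice action=login",
  "2024-05-01 12:04:02 WARN user=bob action=upload"
)
# Capture timestamp, level, user, and action with a regular expression
pattern <- "^(\\S+ \\S+) (\\w+) user=(\\w+) action=(\\w+)$"
fields  <- regmatches(log_lines, regexec(pattern, log_lines))
parsed  <- do.call(rbind, lapply(fields, function(f)
  data.frame(timestamp = f[2], level = f[3], user = f[4], action = f[5])))
parsed  # a structured data frame ready for the downstream analysis steps

Once the text has this structure, downstream steps such as TF-IDF computation, clustering and classification, or search and retrieval can operate on well-defined fields instead of raw strings.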
1. Divide the data into sets of 1,000 user profiles, and run K-means clustering in RHadoop iteratively. 2. Run MapReduce to transform the data, and find relevant key-value pairs. 3. Run a Naive Bayes classification as a pre-processing step in HDFS. 4. Partition the data by XML file size, and run K-means clustering in each partition.
1. Use the extracted text and your regular expression to perform a sentiment analysis based on mentions of the new product 2. Convert the extracted text into a suitable document representation and index into a review corpus 3. Read the extracted text for each review and manually tabulate the results 4. Group the reviews using Naive Bayesian classification