Explanation: DOUBLE covar_pop(col1, col2) returns the population covariance of a pair of numeric columns in the group. DOUBLE covar_samp(col1, col2) returns the sample covariance of a pair of numeric columns in the group. DOUBLE corr(col1, col2) returns the Pearson coefficient of correlation of a pair of numeric columns in the group.
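As an illustration outside Hive, the three aggregates can be reproduced by hand in plain Python (the sample columns below are made up):

```python
import statistics

# Toy columns standing in for col1 and col2 (values are invented)
x = [2.0, 4.0, 6.0, 8.0]
y = [1.0, 3.0, 2.0, 5.0]
n = len(x)
mx, my = statistics.fmean(x), statistics.fmean(y)

# covar_pop: population covariance, divide by n
covar_pop = sum((a - mx) * (b - my) for a, b in zip(x, y)) / n

# covar_samp: sample covariance, divide by n - 1
covar_samp = sum((a - mx) * (b - my) for a, b in zip(x, y)) / (n - 1)

# corr: Pearson correlation, population covariance scaled by population std devs
corr = covar_pop / (statistics.pstdev(x) * statistics.pstdev(y))

print(covar_pop, covar_samp, corr)
```

The only difference between the two covariances is the divisor: n for the population form, n - 1 for the unbiased sample form.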
Question : Which of the following Hive functions returns the standard deviation of a numeric column in the group?
Explanation: DOUBLE stddev_pop(col) returns the standard deviation of a numeric column in the group. DOUBLE stddev_samp(col) returns the unbiased sample standard deviation of a numeric column in the group.
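A quick sanity check in Python (the data is illustrative): stddev_pop divides by n, stddev_samp by n - 1, which matches the standard library's two functions.

```python
import statistics

col = [4.0, 7.0, 13.0, 16.0]  # made-up numeric column

pop_sd = statistics.pstdev(col)   # what stddev_pop(col) computes
samp_sd = statistics.stdev(col)   # what stddev_samp(col) computes

print(pop_sd, samp_sd)
```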
Question : You have two tables in Hive that are populated with data:
Employee: emp_id int, salary string
Employee_Detail: emp_id int, name string
You now create a new, de-normalized table and populate it with the results of joining the two tables as follows: CREATE TABLE EMPLOYEE_FULL AS SELECT Employee_Detail.*, Employee.salary AS s FROM Employee JOIN Employee_Detail ON (Employee.emp_id == Employee_Detail.emp_id);
You then export the table and download the file: EXPORT TABLE EMPLOYEE_FULL TO '/hadoopexam/employee/Employee_Detail.data';
You have downloaded the file and read the file as a CSV in R. How many columns will the resulting variable in R have?
Ans : 1 Exp : The Stinger Initiative successfully delivered a fundamentally new Apache Hive, which evolved Hive's traditional architecture and made it faster, with richer SQL semantics and petabyte scalability. We continue to work within the community to advance three key facets of Hive: Speed: deliver sub-second query response times. Scale: the only SQL interface to Hadoop designed for queries that scale from gigabytes to terabytes and petabytes. SQL: enable transactions and SQL:2011 analytics for Hive. Stinger.next is focused on the vision of delivering enterprise SQL at Hadoop scale, accelerating the production deployment of Hive for interactive analytics, reporting and ETL. When exporting a table from Hive, the data file will use the delimiters from the table. Because EMPLOYEE_FULL wasn't created with specific delimiters, it will use the default Hive delimiter, which is \001 or Control-A. When the file is imported into R as a CSV, there will be only 1 column because the file isn't actually comma delimited. Hadoop was built to organize and store massive amounts of data of all shapes, sizes and formats. Because of Hadoop's "schema on read" architecture, a Hadoop cluster is a perfect reservoir of heterogeneous data, structured and unstructured, from a multitude of sources. Data analysts use Hive to explore, structure and analyze that data, then turn it into business insight. Here are some advantageous characteristics of Hive for enterprise SQL in Hadoop: Familiar: query data with a SQL-based language. Fast: interactive response times, even over huge datasets. Scalable and Extensible: as data variety and volume grow, more commodity machines can be added, without a corresponding reduction in performance. The tables in Hive are similar to tables in a relational database, and data units are organized in a taxonomy from larger to more granular units.
Databases are comprised of tables, which are made up of partitions. Data can be accessed via a simple query language and Hive supports overwriting or appending data. Within a particular database, data in the tables is serialized and each table has a corresponding Hadoop Distributed File System (HDFS) directory. Each table can be sub-divided into partitions that determine how data is distributed within sub-directories of the table directory. Data within partitions can be further broken down into buckets. Hive supports all the common primitive data formats such as BIGINT, BINARY, BOOLEAN, CHAR, DECIMAL, DOUBLE, FLOAT, INT, SMALLINT, STRING, TIMESTAMP, and TINYINT. In addition, analysts can combine primitive data types to form complex data types, such as structs, maps and arrays.
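The one-column result can be demonstrated without Hive or R. The sketch below (sample rows are made up) parses a Control-A delimited export first as if it were comma-separated, then with the real delimiter:

```python
import csv
import io

# A fake Hive export using the default field delimiter \001 (Control-A)
raw = "1\x01Alice\x0150000\n2\x01Bob\x0160000\n"

# Parsed as CSV (comma delimiter), as R's read.csv would do:
# there are no commas, so each row collapses into a single column
rows_as_csv = list(csv.reader(io.StringIO(raw)))
print(len(rows_as_csv[0]))   # 1

# Parsed with the actual Hive delimiter: the three real columns appear
rows_as_hive = list(csv.reader(io.StringIO(raw), delimiter="\x01"))
print(len(rows_as_hive[0]))  # 3
```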
Question : You have a HadoopExam log files directory containing a number of comma-separated files. Each of the files has three columns and each of the filenames has a .log extension, like 12012014.log, 13012014.log, 14012014.log etc. You want a single comma-separated file (with extension .csv) that contains all the rows from all the files that do not contain the word "HadoopExam", case-sensitive. 1. cat *.log | grep -v HadoopExam > final.csv 2. find . -name '*.log' -print0 | xargs -0 cat | awk -F',' 'BEGIN {OFS=","} {print $1, $2, $3}' | grep -v HadoopExam > final.csv 3. Access Mostly Uused Products by 50000+ Subscribers 4. grep HadoopExam *.log > final.csv
Ans: 2 Exp : There are three variations of AWK: AWK, the (very old) original from AT&T; NAWK, a newer, improved version from AT&T; and GAWK, the Free Software Foundation's version. The essential organization of an AWK program follows the form: pattern { action } The pattern specifies when the action is performed. Like most UNIX utilities, AWK is line oriented. That is, the pattern specifies a test that is performed with each line read as input. If the condition is true, then the action is taken. The default pattern is something that matches every line. This is the blank or null pattern. Two other important patterns are specified by the keywords "BEGIN" and "END". As you might expect, these two words specify actions to be taken before any lines are read, and after the last line is read. The AWK program below: BEGIN { print "START" } { print } END { print "STOP" } adds one line before and one line after the input file. This isn't very useful, but with a simple change, we can make this into a typical AWK program: BEGIN { print "File\tOwner"} { print $8, "\t", $3} END { print " - DONE -" } I'll improve the script in the next sections, but we'll call it "FileOwner". But let's not put it into a script or file yet. I will cover that part in a bit. Hang on and follow with me so you get the flavor of AWK. The characters "\t" indicate a tab character so the output lines up on even boundaries. The "$8" and "$3" have a meaning similar to a shell script. Instead of the eighth and third argument, they mean the eighth and third field of the input line. You can think of a field as a column, and the action you specify operates on each line or row read in. There are two differences between AWK and a shell processing the characters within double quotes. AWK understands special characters that follow the "\" character, like "t". The Bourne and C UNIX shells do not. Also, unlike the shell (and PERL), AWK does not evaluate variables within strings.
To explain, the second line could not be written like this: {print "$8\t$3" } That example would print "$8 $3". Inside the quotes, the dollar sign is not a special character. Outside, it corresponds to a field. What do I mean by the third and eighth field? Consider the Solaris "/usr/bin/ls -l" command, which has eight columns of information. The System V version (similar to the Linux version), "/usr/5bin/ls -l", has 9 columns. The third column is the owner, and the eighth (or ninth) column is the name of the file. This AWK program can be used to process the output of the "ls -l" command, printing out the filename, then the owner, for each file. I'll show you how. Update: On a Linux system, change "$8" to "$9". One more point about the use of a dollar sign. In scripting languages like Perl and the various shells, a dollar sign means the word following is the name of the variable. Awk is different. The dollar sign means that we are referring to a field or column in the current line. When switching between Perl and AWK you must remember that "$" has a different meaning. So the following piece of code prints two "fields" to standard out. The first field printed is the number "5", the second is the fifth field (or column) on the input line. BEGIN { x=5 } { print x, $x} Originally, I didn't plan to discuss NAWK, but several UNIX vendors have replaced AWK with NAWK, and there are several incompatibilities between the two. It would be cruel of me to not warn you about the differences, so I will highlight those when I come to them. It is important to know that all of AWK's features are in NAWK and GAWK. Most, if not all, of NAWK's features are in GAWK. NAWK ships as part of Solaris. GAWK does not. However, many sites on the Internet have the sources freely available. If you use Linux, you have GAWK. But in general, assume that I am talking about the classic AWK unless otherwise noted.
Commands like cat accept only a limited number of file arguments, so globbing 100,000 files will cause an error. In this instance, you must use find and xargs. The grep flag should be -v to invert the sense of matching; -i would ignore case, which is not wanted here because the match must be case-sensitive. You can switch the delimiter with awk, though tr works also.
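The intent of the pipeline can be sketched in Python as well (the function name and paths are illustrative, not part of the question): concatenate every matching file and keep only the lines that do not contain the word, like `grep -v`.

```python
import glob

def merge_logs(pattern, needle="HadoopExam"):
    """Collect every line from the files matching `pattern` that does
    NOT contain `needle` (case-sensitive), mirroring `cat | grep -v`."""
    kept = []
    for path in sorted(glob.glob(pattern)):
        with open(path) as fh:
            kept.extend(line for line in fh if needle not in line)
    return kept

# Typical use (paths are illustrative):
# with open("final.csv", "w") as out:
#     out.writelines(merge_logs("*.log"))
```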
Question : In recommender systems, the ___________ problem is often reduced by adopting a hybrid approach between content-based matching and collaborative filtering. New items (which have not yet received any ratings from the community) would be assigned a rating automatically, based on the ratings assigned by the community to other similar items. Item similarity would be determined according to the items' content-based characteristics.
Question : The "cold start" problem happens in recommendation systems due to
1. Huge volume of data to analyze 2. Very small amount of data to analyze 3. Access Mostly Uused Products by 50000+ Subscribers 4. Information is enough but less memory Ans : 3 Exp : The "cold start" problem happens in recommendation systems due to the lack of information on users or items.
Question : Select the correct statement which applies to Collaborative filtering
Ans : 5 Exp : Collaborative filtering (CF, aka memory-based) makes recommendations based on past user-item interactions. It performs better for old users and old items, and does not naturally handle new users and new items.
Question : You are creating a model for recommending books at Amazon.com. Which of the following recommender systems would you use so that you don't have a cold start problem?
1. Naive Bayes classifier 2. K-Means Clustering 3. Access Mostly Uused Products by 50000+ Subscribers 4. Content-based filtering Ans : 4 Exp : The cold start problem is most prevalent in recommender systems. Recommender systems form a specific type of information filtering (IF) technique that attempts to present information items (movies, music, books, news, images, web pages) that are likely of interest to the user. Typically, a recommender system compares the user's profile to some reference characteristics. These characteristics may be from the information item (the content-based approach) or the user's social environment (the collaborative filtering approach). In the content-based approach, the system must be capable of matching the characteristics of an item against relevant features in the user's profile. In order to do this, it must first construct a sufficiently detailed model of the user's tastes and preferences through preference elicitation. This may be done either explicitly (by querying the user) or implicitly (by observing the user's behaviour). In both cases, the cold start problem would imply that the user has to dedicate an amount of effort using the system in its 'dumb' state, contributing to the construction of their user profile, before the system can start providing any intelligent recommendations. Content-based filtering recommender systems use information about items or users to make recommendations, rather than user preferences, so they will perform well with little user preference data. Item-based and user-based collaborative filtering make predictions based on users' preferences for items, so they will typically perform poorly with little user preference data. Logistic regression is not a recommender system technique. Content-based filtering, also referred to as cognitive filtering, recommends items based on a comparison between the content of the items and a user profile.
The content of each item is represented as a set of descriptors or terms, typically the words that occur in a document. The user profile is represented with the same terms and built up by analyzing the content of items which have been seen by the user. Several issues have to be considered when implementing a content-based filtering system. First, terms can either be assigned automatically or manually. When terms are assigned automatically a method has to be chosen that can extract these terms from items. Second, the terms have to be represented such that both the user profile and the items can be compared in a meaningful way. Third, a learning algorithm has to be chosen that is able to learn the user profile based on seen items and can make recommendations based on this user profile. The information source that content-based filtering systems are mostly used with are text documents. A standard approach for term parsing selects single words from documents. The vector space model and latent semantic indexing are two methods that use these terms to represent documents as vectors in a multi dimensional space. Relevance feedback, genetic algorithms, neural networks, and the Bayesian classifier are among the learning techniques for learning a user profile. The vector space model and latent semantic indexing can both be used by these learning methods to represent documents. Some of the learning methods also represent the user profile as one or more vectors in the same multi dimensional space which makes it easy to compare documents and profiles. Other learning methods such as the Bayesian classifier and neural networks do not use this space but represent the user profile in their own way. The efficiency of a learning method does play an important role in the decision of which method to choose. 
The most important aspect of efficiency is the computational complexity of the algorithm, although storage requirements can also become an issue as many user profiles have to be maintained. Neural networks and genetic algorithms are usually much slower compared to other learning methods as several iterations are needed to determine whether or not a document is relevant. Instance-based methods slow down as more training examples become available because every example has to be compared to all the unseen documents. Among the best performers in terms of speed are the Bayesian classifier and relevance feedback. The ability of a learning method to adapt to changes in the user's preferences also plays an important role. The learning method has to be able to evaluate the training data as instances do not last forever but become obsolete as the user's interests change. Another criterion is the number of training instances needed. A learning method that requires many training instances before it is able to make accurate predictions is only useful when the user's interests remain constant for a long period of time. The Bayesian classifier does not do well here. There are many training instances needed before the probabilities will become accurate enough to base a prediction on. Conversely, a relevance feedback method and a nearest neighbor method that uses a notion of distance can start making suggestions with only one training instance. Learning methods also differ in their ability to modulate the training data as instances age. In the nearest neighbor method and in a genetic algorithm, old training instances will have to be removed entirely. The user models employed by relevance feedback methods and neural networks can be adjusted more smoothly by reducing the weights of corresponding terms or nodes. The learning methods applied to content-based filtering try to find the most relevant documents based on the user's behavior in the past.
Such an approach, however, restricts the user to documents similar to those already seen. This is known as the over-specialization problem. As stated before, the interests of a user are rarely static but change over time. Instead of adapting to the user's interests after the system has received feedback, one could try to predict a user's interests in the future and recommend documents that contain information that is entirely new to the user. A recommender system has to decide between two types of information delivery when providing the user with recommendations: Exploitation: the system chooses documents similar to those for which the user has already expressed a preference. Exploration: the system chooses documents where the user profile does not provide evidence to predict the user's reaction.
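The vector space model described above can be sketched in a few lines of Python. Everything here is invented for illustration: the profile, the documents, and the helper name; real systems would use weighted terms (e.g. TF-IDF), but the comparison step is the same cosine similarity between term vectors.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two term-count vectors."""
    dot = sum(count * b.get(term, 0) for term, count in a.items())
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# User profile and items represented as bags of terms (all data invented)
profile = Counter("hadoop hive sql hadoop".split())
items = {
    "doc1": Counter("hadoop hive tutorial".split()),
    "doc2": Counter("cooking recipes".split()),
}

# Recommend the item whose content is closest to the profile
ranked = sorted(items, key=lambda d: cosine(profile, items[d]), reverse=True)
print(ranked)
```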
Question : Stories appear in the front page of Digg as they are "voted up" (rated positively) by the community. As the community becomes larger and more diverse, the promoted stories can better reflect the average interest of the community members. Which of the following technique is used to make such recommendation engine?
1. Naive Bayes classifier 2. Collaborative filtering 3. Access Mostly Uused Products by 50000+ Subscribers 4. Content-based filtering Ans : 2 Exp : Collaborative filtering, also referred to as social filtering, filters information by using the recommendations of other people. It is based on the idea that people who agreed in their evaluation of certain items in the past are likely to agree again in the future. A person who wants to see a movie for example, might ask for recommendations from friends. The recommendations of some friends who have similar interests are trusted more than recommendations from others. This information is used in the decision on which movie to see. Most collaborative filtering systems apply the so called neighborhood-based technique. In the neighborhood-based approach a number of users is selected based on their similarity to the active user. A prediction for the active user is made by calculating a weighted average of the ratings of the selected users. To illustrate how a collaborative filtering system makes recommendations consider the example in movie ratings table below. This shows the ratings of five movies by five people. A "+" indicates that the person liked the movie and a "-" indicates that the person did not like the movie. To predict if Ken would like the movie "Fargo", Ken's ratings are compared to the ratings of the others. In this case the ratings of Ken and Mike are identical and because Mike liked Fargo, one might predict that Ken likes the movie as well.One scenario of collaborative filtering application is to recommend interesting or popular information as judged by the community. As a typical example, stories appear in the front page of Digg as they are "voted up" (rated positively) by the community. As the community becomes larger and more diverse, the promoted stories can better reflect the average interest of the community members. 
Selecting Neighbourhood: Many collaborative filtering systems have to be able to handle a large number of users. Making a prediction based on the ratings of thousands of people has serious implications for run-time performance. Therefore, when the number of users reaches a certain amount, a selection of the best neighbors has to be made. Two techniques, correlation-thresholding and best-n-neighbor, can be used to determine which neighbors to select. The first technique selects only those neighbors whose correlation is greater than a given threshold. The second technique selects the best n neighbors with the highest correlation.
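A minimal sketch of the neighborhood-based prediction with correlation-thresholding (function name and numbers are invented for illustration): keep neighbors above a similarity threshold, then take a similarity-weighted average of their ratings.

```python
def predict(similarities, ratings, threshold=0.0):
    """Similarity-weighted average rating over the neighbors whose
    correlation with the active user exceeds `threshold`."""
    pairs = [(s, r) for s, r in zip(similarities, ratings) if s > threshold]
    if not pairs:
        return None  # no usable neighbors
    return sum(s * r for s, r in pairs) / sum(s for s, _ in pairs)

# Three neighbors with correlations 0.9, 0.5, -0.2 rated the item 5, 3, 1;
# the negatively correlated neighbor is dropped by the threshold
print(predict([0.9, 0.5, -0.2], [5, 3, 1]))
```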
Question :
You are designing a recommendation engine for a website that generates more personalized recommendations by analyzing information from the past activity of a specific user, or the history of other users deemed to be of similar taste to a given user. These resources are used for user profiling and help the site recommend content on a user-by-user basis. The more a given user makes use of the system, the better the recommendations become, as the system gains data to improve its model of that user. What kind of recommendation engine is this?
1. Naive Bayes classifier 2. Collaborative filtering 3. Access Mostly Uused Products by 50000+ Subscribers 4. Content-based filtering Ans : 2 Exp : Collaborative filtering, also referred to as social filtering, filters information by using the recommendations of other people. It is based on the idea that people who agreed in their evaluation of certain items in the past are likely to agree again in the future. A person who wants to see a movie for example, might ask for recommendations from friends. The recommendations of some friends who have similar interests are trusted more than recommendations from others. This information is used in the decision on which movie to see. Most collaborative filtering systems apply the so called neighborhood-based technique. In the neighborhood-based approach a number of users is selected based on their similarity to the active user. A prediction for the active user is made by calculating a weighted average of the ratings of the selected users. To illustrate how a collaborative filtering system makes recommendations consider the example in movie ratings table below. This shows the ratings of five movies by five people. A "+" indicates that the person liked the movie and a "-" indicates that the person did not like the movie. To predict if Ken would like the movie "Fargo", Ken's ratings are compared to the ratings of the others. In this case the ratings of Ken and Mike are identical and because Mike liked Fargo, one might predict that Ken likes the movie as well.One scenario of collaborative filtering application is to recommend interesting or popular information as judged by the community. As a typical example, stories appear in the front page of Digg as they are "voted up" (rated positively) by the community. As the community becomes larger and more diverse, the promoted stories can better reflect the average interest of the community members. 
Selecting Neighbourhood: Many collaborative filtering systems have to be able to handle a large number of users. Making a prediction based on the ratings of thousands of people has serious implications for run-time performance. Therefore, when the number of users reaches a certain amount, a selection of the best neighbors has to be made. Two techniques, correlation-thresholding and best-n-neighbor, can be used to determine which neighbors to select. The first technique selects only those neighbors whose correlation is greater than a given threshold. The second technique selects the best n neighbors with the highest correlation. Another aspect of collaborative filtering systems is the ability to generate more personalized recommendations by analyzing information from the past activity of a specific user, or the history of other users deemed to be of similar taste to a given user. These resources are used for user profiling and help the site recommend content on a user-by-user basis. The more a given user makes use of the system, the better the recommendations become, as the system gains data to improve its model of that user.
Question :
As a data scientist consultant at HadoopExam.com, you are working on a recommendation engine for learning resources for end users. Which recommender system technique benefits most from additional user preference data?
1. Naive Bayes classifier 2. Item-based collaborative filtering 3. Access Mostly Uused Products by 50000+ Subscribers 4. Logistic Regression Ans : 2 Exp : Item-based scales with the number of items, and user-based scales with the number of users you have. If you have something like a store, you'll have a few thousand items at the most. The biggest stores at the time of writing have around 100,000 items. In the Netflix competition, there were 480,000 users and 17,700 movies. If you have a lot of users, then you'll probably want to go with item-based similarity. For most product-driven recommendation engines, the number of users outnumbers the number of items. There are more people buying items than unique items for sale. Item-based collaborative filtering makes predictions based on users' preferences for items. More preference data should be beneficial to this type of algorithm. Content-based filtering recommender systems use information about items or users, and not user preferences, to make recommendations. Logistic regression, power iteration and a Naive Bayes classifier are not recommender system techniques. Collaborative filtering, also referred to as social filtering, filters information by using the recommendations of other people. It is based on the idea that people who agreed in their evaluation of certain items in the past are likely to agree again in the future. A person who wants to see a movie, for example, might ask for recommendations from friends. The recommendations of some friends who have similar interests are trusted more than recommendations from others. This information is used in the decision on which movie to see. Most collaborative filtering systems apply the so-called neighborhood-based technique. In the neighborhood-based approach, a number of users is selected based on their similarity to the active user. A prediction for the active user is made by calculating a weighted average of the ratings of the selected users.
To illustrate how a collaborative filtering system makes recommendations, consider the example in the movie ratings table below. This shows the ratings of five movies by five people. A "+" indicates that the person liked the movie and a "-" indicates that the person did not like the movie. To predict if Ken would like the movie "Fargo", Ken's ratings are compared to the ratings of the others. In this case the ratings of Ken and Mike are identical, and because Mike liked Fargo, one might predict that Ken likes the movie as well. One scenario of collaborative filtering application is to recommend interesting or popular information as judged by the community. As a typical example, stories appear in the front page of Digg as they are "voted up" (rated positively) by the community. As the community becomes larger and more diverse, the promoted stories can better reflect the average interest of the community members. Selecting Neighbourhood: Many collaborative filtering systems have to be able to handle a large number of users. Making a prediction based on the ratings of thousands of people has serious implications for run-time performance. Therefore, when the number of users reaches a certain amount, a selection of the best neighbors has to be made. Two techniques, correlation-thresholding and best-n-neighbor, can be used to determine which neighbors to select. The first technique selects only those neighbors whose correlation is greater than a given threshold. The second technique selects the best n neighbors with the highest correlation.
Question :
While working with Netflix, the movie rating website, you have developed a recommender system that has produced rating predictions for your data set that are consistently exactly 1 higher for the user-item pairs in your dataset than the ratings given in the dataset. There are n items in the dataset. What will be the calculated RMSE of your recommender system on the dataset? 1. 1 2. 2 3. Access Mostly Uused Products by 50000+ Subscribers 4. n 5. n/2
Ans : 1 Exp : The root-mean-square deviation (RMSD) or root-mean-square error (RMSE) is a frequently used measure of the differences between values predicted by a model or an estimator and the values actually observed. Basically, the RMSD represents the sample standard deviation of the differences between predicted values and observed values. These individual differences are called residuals when the calculations are performed over the data sample that was used for estimation, and are called prediction errors when computed out-of-sample. The RMSD serves to aggregate the magnitudes of the errors in predictions for various times into a single measure of predictive power. RMSD is a good measure of accuracy, but only to compare forecasting errors of different models for a particular variable and not between variables, as it is scale-dependent. RMSE is calculated as the square root of the mean of the squares of the errors. The error in every case in this example is 1. The square of 1 is 1, the average of n values of 1 is 1, and the square root of 1 is 1. The RMSE is therefore 1.
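The arithmetic can be checked directly with a small sketch (the ratings below are toy values; any n gives the same result):

```python
import math

def rmse(predicted, actual):
    """Square root of the mean of the squared errors."""
    errors = [(p - a) ** 2 for p, a in zip(predicted, actual)]
    return math.sqrt(sum(errors) / len(errors))

actual = [3, 4, 2, 5, 1]
predicted = [a + 1 for a in actual]  # every prediction exactly 1 too high

print(rmse(predicted, actual))  # 1.0, independent of n
```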
Question :
Select the correct statement which applies to MAE
1. The MAE measures the average magnitude of the errors in a set of forecasts, without considering their direction 2. The MAE measures the average magnitude of the errors in a set of forecasts, with considering their direction 3. Access Mostly Uused Products by 50000+ Subscribers 4. It measures accuracy for discrete variables 5. The MAE is a linear score which means that all the individual differences are weighted equally in the average. 6. The MAE is a non-linear score which means that all the individual differences are weighted equally in the average.
The MAE measures the average magnitude of the errors in a set of forecasts, without considering their direction. It measures accuracy for continuous variables. The equation is given in the library references. Expressed in words, the MAE is the average over the verification sample of the absolute values of the differences between forecast and the corresponding observation. The MAE is a linear score which means that all the individual differences are weighted equally in the average.
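The definition can be sketched with toy numbers: errors of opposite sign contribute their absolute values, and each individual difference carries the same (linear) weight.

```python
def mae(predicted, actual):
    """Mean absolute error: average magnitude of the errors, ignoring sign."""
    return sum(abs(p - a) for p, a in zip(predicted, actual)) / len(actual)

# One forecast 2 too high, one 2 too low: direction is ignored
print(mae([5, 1], [3, 3]))  # 2.0
```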
Question :
You read that a set of temperature forecasts shows an MAE of 1.5 degrees and an RMSE of 2.5 degrees. What does this mean? Choose the best answer:
Explanation: 1. This is true, but not the best answer. If RMSE>MAE, then there is variation in the errors. 2. This is true too; the RMSE-MAE difference isn't large enough to indicate the presence of very large errors. 3. Access Mostly Uused Products by 50000+ Subscribers
1. Speed. Since multiplication is more expensive than addition, taking the product of a high number of probabilities is faster if they are represented in log form. (The conversion to log form is expensive, but is only incurred once.)
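The point can be illustrated in Python with made-up probabilities: the product of many probabilities equals the exponential of the sum of their logs, so the expensive repeated multiplications become cheap additions (and the log form also avoids underflow for long chains).

```python
import math

probs = [0.9, 0.8, 0.7, 0.6]  # made-up probabilities

direct = math.prod(probs)                             # repeated multiplication
via_logs = math.exp(sum(math.log(p) for p in probs))  # additions in log space

print(direct, via_logs)
```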