Question : You are working as a data scientist in a retail chain company. You and your team have been given a project to implement recommendation engines for the products the company sells online, and you have decided to create an analytics sandbox. Which of the following are you trying to achieve?
1. You are creating a Hive table in Hadoop Framework.
2. You are defining the SQL queries for extracting the data.
3. You are estimating the size of the datasets and planning for a total of 5 to 10 times the storage size of the data.
4. You would be transforming your semi-structured data into well-formatted data and saving it to a CSV file.
5. You are selecting the Advanced Analytics model.
Correct Answer : 3 Explanation: The terms sandbox and workspace can be used interchangeably. You explore the datasets while designing the sandbox. The sandbox holds a copy of the data that is separate from your production data, so that you can run in-memory analytics on it without affecting your production load. A sandbox generally contains raw data, aggregated data, and data in various other formats. You should plan for 5 to 10 times the size of the original data for an analytics sandbox, because you will be creating several copies of the data.
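The 5 to 10 times sizing rule can be illustrated with a quick calculation. This is only a rough sketch: the function name `estimate_sandbox_storage` and the 200 GB figure are hypothetical, chosen for illustration and not taken from any real project.

```python
def estimate_sandbox_storage(raw_size_gb, low_factor=5, high_factor=10):
    """Return the (minimum, maximum) sandbox storage to plan for, in GB.

    The default factors follow the 5x to 10x rule of thumb stated in the
    explanation; the extra space covers the raw, aggregated, and
    reformatted copies of the data that accumulate during analysis.
    """
    if raw_size_gb < 0:
        raise ValueError("raw_size_gb must be non-negative")
    return raw_size_gb * low_factor, raw_size_gb * high_factor


# Hypothetical example: 200 GB of original data.
low_gb, high_gb = estimate_sandbox_storage(200)
print(f"Plan for between {low_gb} GB and {high_gb} GB of sandbox storage.")
```

The factors are parameters so that the range can be adjusted if a team expects to keep fewer or more copies of the data.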
Question : You are working with a training company which provides online training in various professions. You have received data for further analysis which is already transformed and structured. You find that there is a high correlation between course category, course watched, and number of training hours watched. You need a technique to handle these highly correlated variables. Which of the following will you use?
1. You will take the square root of each variable, so that the correlation is removed.
2. You will discard all three of these variables.
3. You would use a normalization technique so that the three variables become equal in size.
4. You will create a new variable which is a function of these three correlated variables.
Correct Answer : 4 Explanation: There are two common ways to handle correlated variables. The first is to keep only one of them: choose the one that is more logically connected to what you are trying to predict, or else the one that correlates most strongly with the outcome variable. The second is to create a new variable by combining them. If you go this route, convert the variables onto a similar scale (for example, z-scores) and then sum them. This may give you a more valid measure of the underlying construct. For example, age and experience may both be related to judgment, so a combined score could stand in for both.
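The "standardize, then sum" approach described above can be sketched as follows. This is a minimal illustration using only the Python standard library; the function names and the age and experience values are hypothetical sample data, not results from a real dataset.

```python
from statistics import mean, pstdev


def zscores(values):
    """Standardize a column to mean 0 and (population) standard deviation 1."""
    m = mean(values)
    s = pstdev(values)
    if s == 0:
        raise ValueError("cannot standardize a constant column")
    return [(v - m) / s for v in values]


def combine_correlated(*columns):
    """Build one composite variable by summing the z-scores of each column."""
    standardized = [zscores(col) for col in columns]
    return [sum(row) for row in zip(*standardized)]


# Hypothetical sample data: two correlated predictors on different scales.
age = [25, 35, 45, 55]
experience = [2, 10, 21, 30]

# One composite column replaces the two correlated inputs.
composite = combine_correlated(age, experience)
print(composite)
```

Standardizing first keeps the variable with the larger numeric range (here, age) from dominating the sum.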
Question : You are doing advanced analytics for a medical application using regression. You have two variables, weight and height, which are very important input variables that cannot be ignored, and they are also highly correlated. What is the best solution?
1. You will take the cube root of height.
2. You will take the square root of weight.
3. You will take the square of the height.
4. You would consider using BMI (Body Mass Index).
Correct Answer : 4 Explanation: If multiple variables are highly correlated, it is better either to use only the one that correlates more strongly with the outcome (which is not among the given options) or to create a new variable which is a function of both. In this case that new variable could be BMI (Body Mass Index), because it is a function of both weight and height, as given by the formula BMI = Weight / (Height * Height), with weight in kilograms and height in meters.
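As a small sketch of how the two correlated inputs could be collapsed into a single regression feature, the snippet below computes BMI from weight and height. The function name `bmi` and the patient values are hypothetical, and the units are assumed to be kilograms and meters.

```python
def bmi(weight_kg, height_m):
    """Body Mass Index: weight (kg) divided by height (m) squared."""
    if height_m <= 0:
        raise ValueError("height_m must be positive")
    return weight_kg / (height_m * height_m)


# Hypothetical patients as (weight in kg, height in m) pairs.
patients = [(70.0, 1.75), (82.0, 1.80), (55.0, 1.60)]

# A single BMI column replaces the separate weight and height columns.
bmi_feature = [bmi(weight, height) for weight, height in patients]
print(bmi_feature)
```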