Explanation: Types of quasi-structured data and examples of each
- Totally unstructured data - Google search results cover all websites, but are hard to further categorize without access to the Google database itself.
- Intuitive structure - my wordtree algorithm accepts any pasted text and yields a network map based on similarity of language within the text, as well as proximity of words to each other within the text. But it is not "tagged" the way YouTube and Flickr track content in images.
- Emergent structure - algorithms extract the main idea of groups of stories.
- Pseudo-structuring - looking at content and assigning structure to all possible variations of a single document type, such as I did with the auditing tool.
- Guess, apply a rule, and refine - in this mode the algorithm tries an approach and refines it iteratively based on user feedback. If the feedback is automated in the form of a score on the result, this approach becomes evolutionary programming.
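The last mode above can be sketched as a tiny loop: guess a variation of the current rule, apply it, and keep it only if an automated score improves. Everything here is hypothetical - the "rule" is a single made-up similarity threshold and score() stands in for whatever automated feedback the real system would produce.

```python
import random

def score(threshold):
    # Hypothetical automated feedback: pretend the structuring result
    # is best when the threshold is 0.6.
    return 1.0 - abs(threshold - 0.6)

def refine(initial=0.5, steps=200, seed=42):
    rng = random.Random(seed)
    best = initial
    best_score = score(best)
    for _ in range(steps):
        candidate = best + rng.uniform(-0.05, 0.05)  # guess a variation
        s = score(candidate)                         # apply and score it
        if s > best_score:                           # refine: keep improvements
            best, best_score = candidate, s
    return best

print(round(refine(), 2))
```

With a population of rules instead of a single one, plus crossover between them, this same loop becomes the evolutionary programming mentioned above.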
These strategies for structuring Big Data have come about as a consequence of two trends. First - 100 times more content is added online each year than the sum of all books ever written in history. Second - most of this content is structured by institutions that for various reasons don't want to release the fully annotated version of the information. So pragmatic programmers like me build "wrappers" to restructure the parts that are available. Eventually there will be a universal wrapper for all content about financial records, and another one for all organization reports. These data sets will organize content into clusters that are similar enough for us to study patterns on a global scale. That's when "big data" begins to get interesting. Today, we're in the early stages of deconstructing the structure so that we can reconstruct larger data sets from the individual parts that each have unique yet "incompatible" structures. It is like taking apart all the cars in a junkyard so we can categorize all the parts and deliver them to customers who want to build fresh cars. You see cars go in and cars go out, but a lot happens in between.
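The "wrapper" idea above can be made concrete with a minimal sketch: two sources describe the same kind of record under incompatible field names, and a thin adapter maps both into one shared structure. All field names and values here are invented for illustration.

```python
# Hypothetical adapters ("wrappers") that normalize two incompatible
# record layouts into one shared schema: {"name": ..., "revenue": ...}.

def wrap_source_a(record):
    return {"name": record["org_name"], "revenue": record["rev_usd"]}

def wrap_source_b(record):
    return {"name": record["company"], "revenue": record["annual_revenue"]}

a = {"org_name": "Acme", "rev_usd": 1200}
b = {"company": "Globex", "annual_revenue": 3400}

# Once wrapped, records from both sources can live in one data set.
combined = [wrap_source_a(a), wrap_source_b(b)]
print(combined)
```

This is the "third system" pattern described below: neither source was built to work with the other, but the adapter lets their data share one structure.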
Last year, if someone had asked you to track all the work you do on your computer, you would have probably filled out a survey (like the "time tracking" reports I fill out monthly at work). In the future your computer will fill them out for you and in greater detail, and these data will be "mashable" with other reporting systems. This will not happen because two systems are built to work together, but instead because someone builds a third system that allows the two systems to share information. Eventually we will build "genetic algorithms" that will write programs that can re-organize data into usable structures regardless of how the original data was structured. This is going to happen in the next ten years and we will ask ourselves why we didn't do it sooner.
Question : What would be considered "Big Data"?
1. An OLAP Cube containing customer demographic information about 100,000,000 customers
2. Aggregated statistical data stored in a relational database table
Explanation: Information sets that approach the size of all information known about "X". For example, instead of a sample of e-books, it means a comprehensive set of all e-books ever written (~70% to N=ALL). Big Data sets are noisier, yet they do not require us to know beforehand what questions we will pose of them; we can drill down into a Big Data set and ask arbitrary questions. This complements classical statistics, which relies on random sampling to eliminate bias. Big Data instead assumes bias is present and quantifies the biases in the data set, so that they can be detected, inspected, and corrected.
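The contrast drawn above can be shown in a few lines: when we hold the complete (N=ALL) data set, the bias of any skewed subset can be measured directly against the full population rather than avoided up front by random sampling. The numbers below are made up purely for illustration.

```python
# A complete "population" of values 1..100 (mean 50.5) and a skewed
# subset that systematically excludes the low end.
population = list(range(1, 101))
biased_sample = [x for x in population if x > 40]

pop_mean = sum(population) / len(population)
sample_mean = sum(biased_sample) / len(biased_sample)

# Because we hold the full data set, the bias is directly quantifiable:
bias = sample_mean - pop_mean
print(pop_mean, sample_mean, bias)
```

Knowing the bias is exactly +20 lets us detect, inspect, and correct for it, which is the workflow the explanation describes.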
Question : A data scientist plans to classify the sentiment polarity of product reviews collected from the Internet. What is the most appropriate model to use? Suppose labeled training data is available.
Explanation: Naive Bayes is a simple technique for constructing classifiers: models that assign class labels to problem instances, represented as vectors of feature values, where the class labels are drawn from some finite set. It is not a single algorithm for training such classifiers, but a family of algorithms based on a common principle: all naive Bayes classifiers assume that the value of a particular feature is independent of the value of any other feature, given the class variable. For example, a fruit may be considered to be an apple if it is red, round, and about 3" in diameter. A naive Bayes classifier considers each of these features to contribute independently to the probability that this fruit is an apple, regardless of any possible correlations between the color, roundness and diameter features.
For some types of probability models, naive Bayes classifiers can be trained very efficiently in a supervised learning setting. In many practical applications, parameter estimation for naive Bayes models uses the method of maximum likelihood; in other words, one can work with the naive Bayes model without accepting Bayesian probability or using any Bayesian methods.
Despite their naive design and apparently oversimplified assumptions, naive Bayes classifiers have worked quite well in many complex real-world situations. In 2004, an analysis of the Bayesian classification problem showed that there are sound theoretical reasons for the apparently implausible efficacy of naive Bayes classifiers.[5] Still, a comprehensive comparison with other classification algorithms in 2006 showed that Bayes classification is outperformed by other approaches, such as boosted trees or random forests.[6]
An advantage of naive Bayes is that it requires only a small amount of training data to estimate the parameters necessary for classification.
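The explanation above can be grounded in a hand-rolled multinomial naive Bayes sentiment classifier. It shows the two ideas directly: maximum-likelihood parameter estimation from word counts, and per-word log-likelihoods summed as if the words were independent given the class. The four training reviews are invented; a real application would use a labeled review corpus.

```python
import math
from collections import Counter

# Tiny invented training set of labeled reviews.
train = [
    ("great product works well", "pos"),
    ("love it great value", "pos"),
    ("terrible waste of money", "neg"),
    ("broke fast terrible quality", "neg"),
]

# Maximum-likelihood parameter estimation: count words per class.
word_counts = {"pos": Counter(), "neg": Counter()}
class_counts = Counter()
vocab = set()
for text, label in train:
    words = text.split()
    word_counts[label].update(words)
    class_counts[label] += 1
    vocab.update(words)

def classify(text):
    scores = {}
    for label in ("pos", "neg"):
        # Log prior plus a sum of log likelihoods: the naive independence
        # assumption lets us treat each word's contribution separately.
        log_prob = math.log(class_counts[label] / sum(class_counts.values()))
        total = sum(word_counts[label].values())
        for word in text.split():
            # Laplace smoothing so an unseen word doesn't zero the product.
            count = word_counts[label].get(word, 0) + 1
            log_prob += math.log(count / (total + len(vocab)))
        scores[label] = log_prob
    return max(scores, key=scores.get)

print(classify("great quality"))
```

Note how little data the model needs: four labeled sentences already yield usable parameters, which is exactly the advantage stated above.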
1. Select one of the four datasets and begin planning and building a model
2. Combine the data from all four of the datasets and begin planning and building a model
3. Access Mostly Uused Products by 50000+ Subscribers
4. Visualize the data to further explore the characteristics of each data set
1. Run all the models again against a larger sample, leveraging more historical data.
2. Report that the results are insignificant, and reevaluate the original business question.
3. Access Mostly Uused Products by 50000+ Subscribers
4. Modify samples used by the models and iterate until a significant result occurs.