# Statistical And Machine-Learning Data Mining, T...

I suppose I needn't say much to explain what statistics is on this site, but perhaps I can say a few things. Classical statistics (here I mean both frequentist and Bayesian) is a sub-topic within mathematics. I think of it as largely the intersection of what we know about probability and what we know about optimization. Although mathematical statistics can be studied as simply a Platonic object of inquiry, it is mostly understood as more practical and applied in character than other, more rarefied areas of mathematics. As such (and notably in contrast to data mining above), it is mostly employed towards better understanding some particular data generating process. Thus, it usually starts with a formally specified model, and from this are derived procedures to accurately extract that model from noisy instances (i.e., estimation--by optimizing some loss function) and to be able to distinguish it from other possibilities (i.e., inferences based on known properties of sampling distributions). The prototypical statistical technique is regression.
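To make the "estimation by optimizing some loss function" point concrete, here is a minimal sketch (with invented data) of ordinary least squares for a line, where the fitted slope and intercept are exactly the minimizers of the squared-error loss:

```python
# Minimal sketch: regression as loss-minimizing estimation.
# Ordinary least squares for y = slope*x + intercept, in closed form.
def ols_fit(xs, ys):
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)           # spread of x
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))  # co-variation
    slope = sxy / sxx
    intercept = my - slope * mx
    return slope, intercept

xs = [0, 1, 2, 3, 4]
ys = [1, 3, 5, 7, 9]            # exactly y = 2x + 1
print(ols_fit(xs, ys))          # → (2.0, 1.0)
```

With noisy data the recovered parameters would only approximate the true ones, which is where the inferential machinery (sampling distributions of the estimates) comes in.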


Machine learning is usually much easier to evaluate: there is a goal, such as score or class prediction, and you can compute precision and recall. In data mining, most evaluation is done by leaving out some information (such as class labels) and then testing whether your method discovered the same structure. This is naive in the sense that you assume the class labels encode the structure of the data completely; you actually punish data mining algorithms that discover something new in your data. Another, indirect, way of evaluating it is to ask how the discovered structure improves the performance of an actual ML algorithm (e.g. when partitioning data or removing outliers). Still, this evaluation is based on reproducing existing results, which is not really the data mining objective...
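As an illustration of this label-based evaluation (a generic sketch, not any particular package's method): the Rand index counts, over all pairs of points, how often a discovered partition agrees with held-out class labels, and, as noted above, it penalizes a clustering that splits a class into genuinely new sub-structure:

```python
from itertools import combinations

def rand_index(labels_a, labels_b):
    # Fraction of point pairs on which two partitions agree
    # (same-cluster together vs. apart in both labelings).
    pairs = list(combinations(range(len(labels_a)), 2))
    agree = sum((labels_a[i] == labels_a[j]) == (labels_b[i] == labels_b[j])
                for i, j in pairs)
    return agree / len(pairs)

true_classes = [0, 0, 1, 1]
discovered   = [0, 0, 1, 2]   # splits one class: scored as an error
print(rand_index(true_classes, discovered))  # → 0.8333...
```

The split into cluster 2 might be a real discovery, but this evaluation can only count it against the algorithm.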

Firstly, AI (although it could mean any intelligent system) has traditionally meant logic-based approaches (e.g. expert systems) rather than statistical estimation.

Statistics, based in maths departments, has had a very good theoretical understanding, together with strong applied experience in experimental sciences, where there is a clear scientific model and statistics is needed to deal with the limited experimental data available. The focus has often been on squeezing the maximum information from very small data sets. Furthermore, there is a bias towards mathematical proofs: you will not get published unless you can prove things about your approach. This has tended to mean that statistics has lagged in the use of computers to automate analysis. Again, a lack of programming knowledge has prevented statisticians from working on large-scale problems where computational issues become important (consider GPUs and distributed systems such as Hadoop). I believe that areas such as bioinformatics have now moved statistics more in this direction. Finally, I would say that statisticians are a more sceptical bunch: they do not claim that you discover knowledge with statistics; rather, a scientist comes up with a hypothesis, and the statistician's job is to check that the hypothesis is supported by the data.

Machine learning is taught in CS departments, which unfortunately do not teach the appropriate mathematics: multivariable calculus, probability, statistics and optimisation are not commonplace. One gets vague 'glamorous' concepts such as learning from examples rather than boring statistical estimation (cf. e.g. The Elements of Statistical Learning, page 30). This tends to mean that there is very little theoretical understanding and an explosion of algorithms, as researchers can always find some dataset on which their algorithm proves better. So there are huge phases of hype as ML researchers chase the next big thing: neural networks, deep learning, etc.

Unfortunately there is a lot more money in CS departments (think Google, Microsoft, together with the more marketable 'learning'), so the more sceptical statisticians are ignored. Finally, there is an empiricist bent: basically, there is an underlying belief that if you throw enough data at the algorithm it will 'learn' the correct predictions. Whilst I am biased against ML, there is a fundamental insight in ML which statisticians have ignored: that computers can revolutionise the application of statistics.

There are two ways:

a) Automating the application of standard tests and models, e.g. running a battery of models (linear regression, random forests, etc.), trying different combinations of inputs, parameter settings, etc. This hasn't really happened, though I suspect that competitors on Kaggle develop their own automation techniques.

b) Applying standard statistical models to huge data: think of, e.g., Google Translate or recommender systems (no one is claiming that people translate or recommend like that, but it's a useful tool). The underlying statistical models are straightforward, but there are enormous computational issues in applying these methods to billions of data points.
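Idea (a) can be sketched in a few lines: a hypothetical "battery" that fits several simple models and compares their held-out squared error. The model names and data here are invented for illustration; a real system would loop over far richer model and parameter grids.

```python
# Hypothetical sketch of (a): automatically run a battery of models
# and compare them on held-out data.
def fit_mean(train_x, train_y):
    m = sum(train_y) / len(train_y)
    return lambda x: m                     # constant baseline model

def fit_line(train_x, train_y):
    n = len(train_x)
    mx, my = sum(train_x) / n, sum(train_y) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(train_x, train_y))
             / sum((x - mx) ** 2 for x in train_x))
    b = my - slope * mx
    return lambda x: slope * x + b         # simple linear regression

def run_battery(models, train, test):
    tx, ty = test
    scores = {}
    for name, fit in models.items():
        predict = fit(*train)
        scores[name] = sum((predict(x) - y) ** 2
                           for x, y in zip(tx, ty)) / len(tx)
    return scores                          # mean squared error per model

train = ([0, 1, 2, 3], [1, 3, 5, 7])       # exactly y = 2x + 1
test  = ([4, 5], [9, 11])
scores = run_battery({"mean": fit_mean, "line": fit_line}, train, test)
print(scores)                              # the linear model should win here
```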

So I would summarise: traditional AI is logic-based rather than statistical; machine learning is statistics without theory; statistics is 'statistics without computers'; and data mining is the development of automated tools for statistical analysis with minimal user intervention.

Data mining packages (for instance the open-source Weka) have built-in techniques for input selection, support vector machine classification, etc., while these are for the most part just absent in statistical packages like JMP. I recently went to a course on "data mining in JMP" from the JMP people, and although it is a visually strong package, some essential data mining pre-, mid- and post-processing techniques are just missing. Input selection was done manually, to get insight into the data; in data mining, by contrast, the intention is to release algorithms, smartly, on large data and automatically see what comes out. The course was obviously taught by statistics people, which emphasised the different mindset between the two.

In data mining, just as the name sounds, you mine data. Mining means extracting knowledge from it, which in practice usually means calculating some measures or statistics over the data, the Jaccard index being one example.
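For instance, the Jaccard index mentioned above is simply the size of the intersection of two sets divided by the size of their union:

```python
def jaccard(a, b):
    # Jaccard index: |A ∩ B| / |A ∪ B|, a similarity score in [0, 1].
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)

print(jaccard({1, 2, 3}, {2, 3, 4}))  # → 0.5
```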

A subset of machine learning is closely related to computational statistics, which focuses on making predictions using computers, but not all machine learning is statistical learning. The study of mathematical optimization delivers methods, theory and application domains to the field of machine learning. Data mining is a related field of study, focusing on exploratory data analysis through unsupervised learning.[6][7]

Semi-supervised learning falls between unsupervised learning (without any labeled training data) and supervised learning (with completely labeled training data). Some of the training examples are missing training labels, yet many machine-learning researchers have found that unlabeled data, when used in conjunction with a small amount of labeled data, can produce a considerable improvement in learning accuracy.
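One common semi-supervised scheme is self-training: fit on the labeled points, pseudo-label the unlabeled ones with the fitted model, then refit on everything. The sketch below (invented 1-D data, a nearest-centroid classifier chosen only for brevity) shows how two labeled points plus four unlabeled ones yield better class centroids than the labeled points alone:

```python
# Self-training sketch: labeled data fixes the classes, unlabeled
# data refines the estimate of where each class sits.
def centroids(points, labels):
    out = {}
    for c in set(labels):
        members = [x for x, l in zip(points, labels) if l == c]
        out[c] = sum(members) / len(members)
    return out

def classify(cents, x):
    return min(cents, key=lambda c: abs(cents[c] - x))

labeled_x, labeled_y = [0.0, 10.0], [0, 1]
unlabeled = [1.0, 2.0, 8.0, 9.0]

cents = centroids(labeled_x, labeled_y)                  # {0: 0.0, 1: 10.0}
pseudo = [classify(cents, x) for x in unlabeled]         # pseudo-labels
cents = centroids(labeled_x + unlabeled, labeled_y + pseudo)
print(cents)  # → {0: 1.0, 1: 9.0}
```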

In data mining, anomaly detection, also known as outlier detection, is the identification of rare items, events or observations which raise suspicions by differing significantly from the majority of the data.[57] Typically, the anomalous items represent an issue such as bank fraud, a structural defect, medical problems or errors in a text. Anomalies are referred to as outliers, novelties, noise, deviations and exceptions.[58]

In particular, in the context of abuse and network intrusion detection, the interesting objects are often not rare objects, but unexpected bursts of activity. This pattern does not adhere to the common statistical definition of an outlier as a rare object. Many outlier detection methods (in particular, unsupervised algorithms) will fail on such data unless aggregated appropriately. Instead, a cluster analysis algorithm may be able to detect the micro-clusters formed by these patterns.[59]

Three broad categories of anomaly detection techniques exist.[60] Unsupervised anomaly detection techniques detect anomalies in an unlabeled test data set under the assumption that the majority of the instances in the data set are normal, by looking for instances that seem to fit the least to the remainder of the data set. Supervised anomaly detection techniques require a data set that has been labeled as "normal" and "abnormal" and involve training a classifier (the key difference from many other statistical classification problems is the inherently unbalanced nature of outlier detection). Semi-supervised anomaly detection techniques construct a model representing normal behavior from a given normal training data set and then test the likelihood of a test instance being generated by the model.
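The unsupervised case can be illustrated with the simplest possible sketch (invented data, not a production method): assume most points are normal and flag the ones lying more than a couple of standard deviations from the bulk:

```python
import statistics

def zscore_outliers(xs, threshold=3.0):
    # Unsupervised anomaly detection: no labels; flag points that
    # fit the bulk of the data least well (large |z-score|).
    mu = statistics.mean(xs)
    sd = statistics.pstdev(xs)          # population standard deviation
    return [x for x in xs if abs(x - mu) / sd > threshold]

data = [10, 11, 9, 10, 12, 10, 11, 9, 100]
print(zscore_outliers(data, threshold=2.0))  # → [100]
```

Note that the single extreme point inflates both the mean and the standard deviation, one reason real methods use more robust statistics.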

Decision tree learning uses a decision tree as a predictive model to go from observations about an item (represented in the branches) to conclusions about the item's target value (represented in the leaves). It is one of the predictive modeling approaches used in statistics, data mining, and machine learning. Tree models where the target variable can take a discrete set of values are called classification trees; in these tree structures, leaves represent class labels, and branches represent conjunctions of features that lead to those class labels. Decision trees where the target variable can take continuous values (typically real numbers) are called regression trees. In decision analysis, a decision tree can be used to visually and explicitly represent decisions and decision making. In data mining, a decision tree describes data, but the resulting classification tree can be an input for decision-making.
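A minimal sketch of tree learning (invented 1-D data) is a one-split "decision stump": the branch is a threshold test on the feature and the leaves are majority class labels, exactly the branch/leaf structure described above, just one level deep:

```python
def majority(labels):
    # Majority class of a leaf (None for an empty leaf).
    return max(set(labels), key=labels.count) if labels else None

def fit_stump(xs, ys):
    # Try every threshold; keep the split with fewest training errors.
    best = None
    for t in sorted(set(xs)):
        left  = [y for x, y in zip(xs, ys) if x <= t]
        right = [y for x, y in zip(xs, ys) if x > t]
        lm, rm = majority(left), majority(right)
        errors = sum(y != lm for y in left) + sum(y != rm for y in right)
        if best is None or errors < best[0]:
            best = (errors, t, lm, rm)
    _, t, lm, rm = best
    return lambda x: lm if x <= t else rm   # the fitted "tree"

xs = [1, 2, 3, 10, 11, 12]
ys = [0, 0, 0, 1, 1, 1]
stump = fit_stump(xs, ys)
print(stump(2), stump(11))  # → 0 1
```

A full classification tree applies the same split search recursively within each branch.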

Regression analysis encompasses a large variety of statistical methods to estimate the relationship between input variables and their associated features. Its most common form is linear regression, where a single line is drawn to best fit the given data according to a mathematical criterion such as ordinary least squares. The latter is often extended by regularization methods to mitigate overfitting and bias, as in ridge regression. When dealing with non-linear problems, go-to models include polynomial regression (for example, used for trendline fitting in Microsoft Excel[73]), logistic regression (often used in statistical classification) or even kernel regression, which introduces non-linearity by taking advantage of the kernel trick to implicitly map input variables to higher-dimensional space.
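As a small illustration of the ridge idea (a sketch in one dimension, without an intercept): adding a penalty α·w² to the squared-error loss changes the closed-form least-squares slope from s_xy / s_xx to s_xy / (s_xx + α), shrinking the estimate toward zero as α grows:

```python
# Ridge regression sketch, 1-D and intercept-free for clarity:
# minimize  sum_i (y_i - w*x_i)^2 + alpha * w^2
# whose minimizer is  w = s_xy / (s_xx + alpha).
def ridge_slope(xs, ys, alpha):
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + alpha)

xs, ys = [1, 2, 3], [2, 4, 6]        # exact slope 2 when alpha = 0
print(ridge_slope(xs, ys, 0.0))      # → 2.0 (plain least squares)
print(ridge_slope(xs, ys, 14.0))     # → 1.0 (shrunk toward zero)
```

Setting α = 0 recovers ordinary least squares; the regularized estimate trades a little bias for lower variance.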