In data mining, there are two types of learning processes: supervised learning and unsupervised learning. Regression and classification are the two broad categories of supervised learning. Both of these categories require training data and labels.
Regression is used for numeric prediction and therefore requires a numeric label. Classification is used for categorical prediction and therefore requires a categorical label.
Supervised Learning
1. Regression Modelling
If a task requires numeric prediction, apply a regression technique. A regression model is mathematically represented as:
Regression Model : Response = f(variables)
Regression presents the relationship as a function of variables that predicts the response. The relationship between the response variable and the predictors can be defined using the Data-Learn-Predict approach. Data is the set of variables, Learn is the step in which a model is fitted to the training data, and Predict uses that model to provide the value of the response variable.
Using the Data-Learn-Predict approach, regression is trained on the given set of variables to learn the model. The learned model can then be applied to new values of the variables to predict a response that was not observed.
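The Data-Learn-Predict approach can be sketched in a few lines of Python. This is a minimal illustration, assuming a single predictor and a straight-line form for f fitted by ordinary least squares. The house sizes and prices are made-up values, and the names fit_line and predict are hypothetical helpers, not part of any library.

```python
def fit_line(xs, ys):
    """Learn: estimate slope and intercept by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict(model, x):
    """Predict: apply the learned function to a new value of the variable."""
    slope, intercept = model
    return slope * x + intercept


# Data: one predictor (size) and a numeric label (price); values are invented.
sizes = [50.0, 70.0, 90.0, 110.0]
prices = [150.0, 190.0, 230.0, 270.0]

model = fit_line(sizes, prices)          # Learn
predicted_price = predict(model, 100.0)  # Predict
print(model, predicted_price)
```

Because the label is numeric, the output of predict is a number on the same scale as the training labels.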
2. Classification Modelling
The goal of classification modelling is to predict which category or class a subject belongs to. This kind of prediction is based on the notion of probability. The classification model is mathematically represented as:
Classification Model : Probability (Class) = f(Variables)
Classification in data mining is solved by predicting probabilities. The classification model expresses the probability of belonging to a class as a function of the other variables. This means that the chance of a case belonging to a class can be predicted from patterns identified in the rest of the cases in the dataset.
Some cases are needed for training. These are used to learn the model for cases belonging to the classes. The function is then used to predict whether a new case belongs to a class or not. If the predicted probability is high, the case is assigned to the class; if it is low, the case is not assigned to the class.
In a supervised problem such as classification, a training set is needed. This training set has variables that help you train the model. The classes are also provided, which means that the cases in the training set are labeled.
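As an illustration of Probability (Class) = f(Variables), the sketch below assumes that f is a logistic function of a single variable and fits it with plain gradient descent. The text does not prescribe this choice of f. The data (hours studied, pass or fail) is invented, and train_logistic and predict_probability are hypothetical helper names.

```python
import math


def sigmoid(z):
    """Map any real number to a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))


def train_logistic(xs, labels, lr=0.1, epochs=5000):
    """Learn weight w and bias b by gradient descent on the log loss."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        grad_w = 0.0
        grad_b = 0.0
        for x, y in zip(xs, labels):
            error = sigmoid(w * x + b) - y  # predicted probability minus label
            grad_w += error * x
            grad_b += error
        w -= lr * grad_w / n
        b -= lr * grad_b / n
    return w, b


def predict_probability(model, x):
    """Return the estimated probability that a case belongs to class 1."""
    w, b = model
    return sigmoid(w * x + b)


# Labeled training set (invented): hours studied and whether the student passed.
hours = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
passed = [0, 0, 0, 1, 1, 1]

model = train_logistic(hours, passed)
p_low = predict_probability(model, 1.5)
p_high = predict_probability(model, 5.5)

# Assign the class when the probability exceeds a chosen threshold (0.5 here).
print(p_low > 0.5, p_high > 0.5)
```

The 0.5 threshold is a common default rather than a rule; a different cut-off can be chosen depending on how costly each kind of misclassification is.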
Unsupervised Learning
Unsupervised learning is a type of data mining process that is used to draw inferences from datasets consisting of input data without labeled responses. Specifically, unsupervised learning is used to group cases based on similar attributes, or on naturally occurring trends, patterns, or relationships in the data. Clustering, anomaly (outlier) detection, and association analysis are broad categories of unsupervised learning. Datasets are provided for these categories, but the datasets do not have labels. The goal is to find structure, unusual behavior, or irregularities in the data without supervision.
1. Clustering
Detecting structure in data means learning how the data is organized. In this type of unsupervised learning, you are simply given a collection of data. The grouping of data points into smaller subsets is called clustering. The task is to give the set of data points to a learning algorithm and ask it to group them into smaller clusters. This is performed by applying the notion of distance: the learning algorithm needs to determine which data points are close to one another and group them together.
The process of clustering data points can be viewed using the Data-Group-Prove approach. Data points are grouped based on their similarities. Using the notion of distance, the clustering algorithm finds the optimal groups in the data by comparing the distances between data points. Once the groups are identified, they are interpreted to show what each group means, based on the similarities within it and how it differs from the other groups.
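Grouping by distance can be sketched with a small k-means routine. This is only one of many clustering algorithms and is chosen here as an assumption, since the text does not name a specific method. The points are invented, the kmeans function is a hypothetical helper, and the initial centroids are naively taken as the first k points, which works for this toy data but is not a robust initialization.

```python
import math


def kmeans(points, k, iterations=10):
    """Group points into k clusters by repeatedly assigning each point
    to its nearest centroid and recomputing the centroids."""
    centroids = [points[i] for i in range(k)]  # naive initialization (assumption)
    clusters = [[] for _ in range(k)]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            # Euclidean distance decides which centroid is closest.
            nearest = min(range(k), key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        for i, cluster in enumerate(clusters):
            if cluster:  # keep the old centroid if a cluster ends up empty
                centroids[i] = tuple(sum(c) / len(cluster) for c in zip(*cluster))
    return centroids, clusters


# Data: unlabeled two-dimensional points (invented).
points = [(1.0, 1.0), (9.0, 9.0), (1.0, 2.0), (2.0, 1.0), (8.0, 9.0), (9.0, 8.0)]

# Group: find two clusters.
centroids, clusters = kmeans(points, k=2)

# Prove: inspect each group's center and members to interpret what it represents.
for center, members in zip(centroids, clusters):
    print(center, members)
```

The final loop corresponds to the Prove step: the centroid summarizes what the members of a group have in common, and the gap between centroids shows how the groups differ.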
2. Anomaly Detection
Anomaly detection is the process of identifying unexpected items or events in datasets. It is often applied to unlabeled data and is therefore classified as an unsupervised type of data mining. Anomalies occur only rarely in the data, and their features differ from those of normal instances.
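One simple way to flag rare values whose features differ from normal instances is a z-score rule, sketched below. This technique is an assumption made for illustration, not a method named in the text. The readings are invented, find_anomalies is a hypothetical helper, and the threshold of 2 standard deviations is an arbitrary choice.

```python
import statistics


def find_anomalies(values, threshold=2.0):
    """Return the values that lie more than `threshold` standard
    deviations away from the mean of the data."""
    mean = statistics.mean(values)
    spread = statistics.pstdev(values)  # population standard deviation
    if spread == 0:
        return []  # all values identical, nothing stands out
    return [v for v in values if abs(v - mean) / spread > threshold]


# Unlabeled sensor readings (invented); one reading is far from the rest.
readings = [10.1, 9.8, 10.0, 10.2, 9.9, 10.0, 25.0]

anomalies = find_anomalies(readings)
print(anomalies)
```

No labels are used at any point: the rule relies only on how far each value is from the bulk of the data.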
3. Association
Association analysis uses a set of transactions to discover rules that indicate the likely occurrence of an item based on the occurrences of other items in the transaction. An association rule is represented as an expression of the form X --> Y, which means that X implies Y, where X and Y are item sets.
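The strength of a rule X --> Y is commonly measured by its support (the fraction of transactions containing both X and Y) and its confidence (the fraction of transactions containing X that also contain Y). The text does not introduce these measures, so they are an assumption here. The sketch below computes both for a small invented set of transactions; rule_metrics is a hypothetical helper name.

```python
def rule_metrics(transactions, x, y):
    """Compute support and confidence of the rule X --> Y,
    where x and y are sets of items."""
    total = len(transactions)
    count_x = sum(1 for t in transactions if x <= t)         # contains all of X
    count_xy = sum(1 for t in transactions if (x | y) <= t)  # contains X and Y
    support = count_xy / total
    confidence = count_xy / count_x if count_x else 0.0
    return support, confidence


# Each transaction is the set of items bought together (invented data).
transactions = [
    {"bread", "milk"},
    {"bread", "diapers", "beer"},
    {"milk", "diapers", "beer"},
    {"bread", "milk", "diapers"},
    {"bread", "milk", "beer"},
]

# Rule: {diapers} --> {beer}
support, confidence = rule_metrics(transactions, {"diapers"}, {"beer"})
print(support, confidence)
```

In this toy data, diapers appear in three of the five transactions and beer accompanies them in two of those, so the rule holds in two out of three cases where its left-hand side occurs.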