Here are the 10 most commonly used machine learning algorithms, which can be applied to almost any data problem:
1. Linear regression
Linear regression is used to estimate real values (price, number of calls, total sales, etc.) based on continuous variables. We establish the relationship between the independent and dependent variables by fitting a best-fit straight line. This line is called the regression line and is represented by the linear equation Y = a*X + b.
The best way to understand linear regression is to think back to childhood. Suppose you ask a fifth-grade child to line up the class from lightest to heaviest without asking anyone’s weight. What do you think the child will do? He or she will probably look at each person’s height and build and rank them using a combination of these visible parameters. This is linear regression in real life: the child has worked out that height and build are related to weight, and the relationship looks much like the equation above.
In this equation:
- Y: dependent variable
- a: slope
- X: independent variable
- b: intercept
The coefficients a and b are obtained by least squares, that is, by minimizing the sum of squared vertical distances between the data points and the regression line.
For example, suppose the best-fit line for a set of height and weight data is y = 0.2811x + 13.9. Knowing a person’s height (x), we can use this equation to estimate their weight (y).
The two main types of linear regression are simple linear regression and multiple linear regression. Simple linear regression has only one independent variable; multiple linear regression, as the name suggests, has several. When looking for the best fit, you can also fit a polynomial or a curve instead of a straight line. These are known as polynomial or curvilinear regression.
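The least-squares fit for a single independent variable can be sketched in a few lines of plain Python. This is a minimal illustration, not a production implementation: the function name `fit_line` and the height and weight values are made up for the example and do not come from the data behind the equation above.

```python
def fit_line(xs, ys):
    """Return slope a and intercept b of the least-squares line y = a*x + b."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope: covariance of x and y divided by the variance of x.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    a = sxy / sxx
    # The least-squares line always passes through (mean_x, mean_y).
    b = mean_y - a * mean_x
    return a, b


# Hypothetical heights (cm) and weights (kg), invented for illustration.
heights = [150, 160, 170, 180, 190]
weights = [50, 56, 62, 68, 74]

a, b = fit_line(heights, weights)
predicted_weight = a * 175 + b  # estimate the weight of a 175 cm person
```

Because the made-up points lie exactly on a line, the fit recovers that line; with real data the points scatter around it and the line is only the best compromise.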
2. Logistic regression
Don’t be confused by its name! This is a classification algorithm, not a regression algorithm. It estimates discrete values (such as binary values 0 or 1, yes or no, true or false) based on a given set of independent variables. In simple terms, it predicts the probability of an event occurring by fitting the data to a logit function, which is why it is also called logit regression. Because it predicts a probability, its output always lies between 0 and 1.
Let us understand this algorithm again with a simple example.
Suppose your friend gives you a puzzle to solve. There are only two possible outcomes: you solve it or you don’t. Now imagine you are given a wide range of puzzles and quizzes to find out which subjects you are good at. The result of such a study might look like this: if the question is a tenth-grade trigonometry problem, you are 70% likely to solve it, but if it is a fifth-grade history question, your chance of answering correctly is only 30%. This is the kind of information logistic regression provides.
Mathematically, the log odds of the outcome are modeled as a linear combination of the predictor variables: ln(p / (1 - p)) = b0 + b1*X1 + b2*X2 + ... + bk*Xk
In the above equation, p is the probability that the characteristic we are interested in is present. The parameters are chosen to maximize the likelihood of observing the sample values, rather than to minimize the sum of squared errors (as in ordinary regression).
Now you may ask, why take the logarithm? In short, it is one of the best mathematical ways to approximate a step function. I could go into more detail, but that would defeat the purpose of this guide.
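To make the link between log odds and probability concrete, here is a small sketch using only the standard library. The function names and the example weights are assumptions chosen for illustration; a real model would learn the weights by maximum likelihood.

```python
import math


def sigmoid(z):
    """Map a log-odds value z to a probability strictly between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))


def log_odds(p):
    """Inverse of sigmoid: the logit ln(p / (1 - p))."""
    return math.log(p / (1.0 - p))


def predict_probability(features, weights, intercept):
    """Probability of the positive class for one example.

    The linear combination intercept + sum(w * x) is the log odds;
    the sigmoid converts it into a probability.
    """
    z = intercept + sum(w * x for w, x in zip(weights, features))
    return sigmoid(z)


# Hypothetical model with a single feature (weights not learned from data).
p = predict_probability([2.0], weights=[0.5], intercept=-1.0)
```

Multiplying the weights by a large constant makes the sigmoid steeper, which is the sense in which it approximates a step function.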
3. KNN (K-nearest neighbors algorithm)
This algorithm can be used for both classification and regression problems, although in industry it is more commonly used for classification. K-nearest neighbors is a simple algorithm: it stores all available cases and classifies a new case by a majority vote of its k nearest neighbors. The new case is assigned to the class that is most common among those k neighbors, as measured by a distance function.
The distance function can be Euclidean, Manhattan, Minkowski, or Hamming distance. The first three are used for continuous variables, and the fourth (Hamming distance) is used for categorical variables. If K = 1, the new case is simply assigned to the class of its nearest neighbor. Choosing the value of K can be a challenge when modeling with KNN.
KNN maps easily onto real life. If you want to learn about a complete stranger, you might find out about his close friends and the circles he moves in, and draw conclusions from them.
Things to consider before choosing KNN:
- KNN is computationally expensive.
- Variables should be normalized first; otherwise variables with larger ranges will bias the result.
- KNN needs careful pre-processing, such as outlier and noise removal, before it is applied.
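The majority vote and three of the distance functions mentioned above can be sketched as follows. This is a brute-force illustration that sorts every training point for each query, which also shows why KNN is computationally expensive. The data points and labels are invented for the example.

```python
import math
from collections import Counter


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def manhattan(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))


def hamming(a, b):
    """Number of positions where two categorical vectors differ."""
    return sum(x != y for x, y in zip(a, b))


def knn_classify(train_points, train_labels, query, k, distance=euclidean):
    """Assign query to the most common label among its k nearest neighbors."""
    ranked = sorted(
        zip(train_points, train_labels),
        key=lambda pair: distance(pair[0], query),
    )
    nearest_labels = [label for _, label in ranked[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]


# Hypothetical two-feature training set with two classes.
points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (5.0, 5.0), (5.2, 4.9), (4.8, 5.1)]
labels = ["A", "A", "A", "B", "B", "B"]

label_near_a = knn_classify(points, labels, (1.1, 1.0), k=3)
label_near_b = knn_classify(points, labels, (5.1, 5.0), k=3)
```

In practice the features would be normalized before computing distances, as noted in the list above.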
4. Support vector machine
This is a classification method. In this algorithm, we plot each data item as a point in N-dimensional space (where N is the number of features), with the value of each feature being the value of one coordinate.
For example, if we only have two features, height and hair length, we plot these two variables in two-dimensional space, where each point has two coordinates. The points that lie closest to the separating boundary are called support vectors.
Now we find a straight line that separates the two groups of data. The line is chosen so that its distance to the closest point in each of the two groups is as large as possible.
In the original illustration, a black line splits the data into two groups, and its distance to the nearest points (points A and B in the figure) is the largest possible. This line is our classifier. A new data point is then assigned to a class according to which side of the line it falls on.
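The two ideas in this description, classifying by the side of the line and measuring the distance from the line to the nearest points, can be sketched without any library. Note that this sketch does not train an SVM: the line (given by the weight vector `w` and offset `b`) and the data points are hypothetical values picked by hand, and a real SVM would search for the `w` and `b` that maximize the margin.

```python
import math


def decision_value(w, b, point):
    """Signed value of w . x + b; its sign tells which side of the line x is on."""
    return sum(wi * xi for wi, xi in zip(w, point)) + b


def classify(w, b, point):
    return 1 if decision_value(w, b, point) >= 0 else -1


def distance_to_boundary(w, b, point):
    """Perpendicular distance from point to the line w . x + b = 0."""
    norm = math.sqrt(sum(wi * wi for wi in w))
    return abs(decision_value(w, b, point)) / norm


def margin(w, b, points):
    """Distance from the boundary to the closest point (a support vector)."""
    return min(distance_to_boundary(w, b, p) for p in points)


# Hypothetical separating line x1 + x2 - 6 = 0 and two hand-made groups.
w, b = (1.0, 1.0), -6.0
group_pos = [(4.0, 4.0), (5.0, 3.0), (5.0, 5.0)]
group_neg = [(1.0, 1.0), (2.0, 2.0), (1.0, 3.0)]

margin_value = margin(w, b, group_pos + group_neg)
```

Here the nearest points on both sides are equally far from the line, which is the situation the margin-maximizing line ends up in.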
Think of this algorithm as playing JezzBall in N-dimensional space, with a few small changes to the game:
- Instead of drawing lines only horizontally or vertically, you can now draw lines or planes at any angle.
- The goal of the game is to separate balls of different colors into different rooms.
- The balls do not move.
5. Naive Bayes
Naive Bayes is a classification technique based on Bayes’ theorem, with the assumption that the predictors are independent of each other. In simpler terms, a naive Bayes classifier assumes that the presence of a particular feature in a class is unrelated to the presence of any other feature. For example, a fruit may be considered an apple if it is red, round, and about 3 inches in diameter. Even if these features depend on each other or on the existence of other features, the naive Bayes classifier treats each of them as contributing independently to the probability that the fruit is an apple.
The naive Bayes model is easy to build and particularly useful for very large data sets. Despite its simplicity, naive Bayes can outperform far more sophisticated classification methods.
Bayes’ theorem provides a way to calculate the posterior probability P(c|x) from P(c), P(x), and P(x|c): P(c|x) = P(x|c) * P(c) / P(x)
- P(c|x) is the posterior probability of the class (target) given the predictor (attribute)
- P(c) is the prior probability of the class
- P(x|c) is the likelihood, that is, the probability of the predictor given the class
- P(x) is the prior probability of the predictor
Example: Let us use an example to understand this concept. Consider a training set of weather conditions with the corresponding target variable “Play”. We need to classify whether players will play or not, depending on the weather. Let’s perform the following steps.
Step 1: Convert the data set to a frequency table.
Step 2: Create a likelihood table by computing probabilities such as P(overcast) = 0.29 and P(play) = 0.64.
Step 3: Use the naive Bayes equation to calculate the posterior probability of each class. The class with the highest posterior probability is the prediction.
Problem: Players will play if the weather is sunny. Is this statement correct?
We can solve this with the method discussed above: P(play | sunny) = P(sunny | play) * P(play) / P(sunny)
Here we have P(sunny | play) = 3/9 = 0.33, P(sunny) = 5/14 = 0.36, and P(play) = 9/14 = 0.64.
So P(play | sunny) = 0.33 * 0.64 / 0.36 = 0.60 (using the exact fractions). Since this is greater than 0.5, playing is the more probable outcome on a sunny day.
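The arithmetic above can be checked with exact fractions, which avoids the rounding in the intermediate values. The counts (3 sunny days among 9 play days, 9 play days and 5 sunny days among 14 days) are the ones quoted in the text; the function and variable names are just for illustration.

```python
from fractions import Fraction


def posterior(likelihood, prior, evidence):
    """Bayes' theorem: P(c|x) = P(x|c) * P(c) / P(x)."""
    return likelihood * prior / evidence


p_sunny_given_play = Fraction(3, 9)   # P(sunny | play)
p_play = Fraction(9, 14)              # P(play)
p_sunny = Fraction(5, 14)             # P(sunny)

p_play_given_sunny = posterior(p_sunny_given_play, p_play, p_sunny)
print(float(p_play_given_sunny))  # prints 0.6
```

With the rounded values 0.33, 0.64, and 0.36 the product comes out slightly below 0.60, which is why exact fractions are used here.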
Naive Bayes uses a similar approach to predict the probability of each class from the various attributes. The algorithm is often used for text classification and for problems with more than two classes.
6. Decision tree
This is one of my favorite and most frequently used algorithms. It is a supervised learning algorithm mostly used for classification problems. Surprisingly, it works for both categorical and continuous dependent variables. In this algorithm, we split the population into two or more homogeneous groups. The split is made on the most significant attributes (independent variables) so that the resulting groups are as distinct from each other as possible.
In the original illustration, the population is divided into four different groups according to several attributes in order to decide “Will they play?” Various techniques are used to split the population into distinct groups, such as Gini, information gain, chi-square, and entropy.
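Two of the splitting criteria named above, Gini and entropy (with the information gain derived from it), can be sketched as follows. The sketch uses the common "Gini impurity" form, 1 minus the sum of squared class proportions, and the play/no-play labels are made up for illustration.

```python
import math
from collections import Counter


def gini(labels):
    """Gini impurity: 0 for a pure group, larger for a more mixed group."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def entropy(labels):
    """Shannon entropy in bits: 0 for a pure group."""
    n = len(labels)
    return -sum(
        (count / n) * math.log2(count / n) for count in Counter(labels).values()
    )


def information_gain(parent, children):
    """Entropy of the parent minus the size-weighted entropy of its children.

    Assumes every child group is non-empty.
    """
    n = len(parent)
    weighted = sum(len(child) / n * entropy(child) for child in children)
    return entropy(parent) - weighted


# Hypothetical node with two "yes" and two "no" cases, split perfectly in two.
parent = ["yes", "yes", "no", "no"]
perfect_split = [["yes", "yes"], ["no", "no"]]
useless_split = [["yes", "no"], ["yes", "no"]]
```

A tree-growing procedure would evaluate every candidate split this way and keep the one with the highest gain (or the lowest weighted impurity).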
The best way to understand how a decision tree works is to play JezzBall, a classic Microsoft game. You have a room with moving balls, and you need to build walls so that as much of the room as possible is cleared of balls.
So every time you split the room with a wall, you are trying to create two different populations within the same room. A decision tree works in a very similar way, dividing the population into groups that are as different from each other as possible.
7. K-means algorithm
K-means is an unsupervised learning algorithm that solves clustering problems. Its procedure for grouping a data set into a given number of clusters (say, k clusters) is simple. Data points within a cluster are homogeneous, and they are different from the data points in other clusters.
Remember figuring out shapes from ink blots? K-means is somewhat similar to that activity. You look at the shape and spread of the data to work out how many clusters or populations there are.
How K-means forms clusters:
- K-means picks k points, one for each cluster. These points are called centroids.
- Each data point joins the cluster of its nearest centroid, which gives k clusters.
- The centroid of each cluster is recomputed from the current cluster members. Now we have new centroids.
- With the new centroids, repeat steps 2 and 3: find the nearest centroid for each data point and assign it to that cluster. Repeat until convergence, that is, until the centroids no longer change.
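The four steps above can be sketched in plain Python. One assumption to note: instead of picking the initial centroids at random, this sketch takes them as an argument so that the result is reproducible. The sample points are invented for illustration.

```python
def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean_point(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def k_means(points, initial_centroids, max_iterations=100):
    centroids = list(initial_centroids)  # step 1: one starting centroid per cluster
    for _ in range(max_iterations):
        # Step 2: assign every point to its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(
                range(len(centroids)),
                key=lambda i: squared_distance(p, centroids[i]),
            )
            clusters[nearest].append(p)
        # Step 3: recompute each centroid (an empty cluster keeps its old one).
        new_centroids = [
            mean_point(cluster) if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
        # Step 4: stop once the centroids no longer change.
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, clusters


# Hypothetical 2-D data with two obvious groups.
data = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (8.0, 8.0), (8.0, 9.0), (9.0, 8.0)]
centroids, clusters = k_means(data, initial_centroids=[data[0], data[3]])
```

The result can depend on the starting centroids, so in practice the algorithm is usually run several times from different starting points.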
How to determine the value of K:
In K-means we have clusters, each with its own centroid. The sum of the squared distances between a cluster’s centroid and its data points is the within-cluster sum of squares for that cluster. Adding up these values over all clusters gives the total within-cluster sum of squares for the clustering solution.
We know that this total keeps decreasing as the number of clusters increases. However, if you plot the result, you will see that the sum of squared distances drops sharply up to some value of k and much more slowly after that. This point gives the optimal number of clusters.
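Here is a small, self-contained sketch of this procedure on one-dimensional data. It is only an illustration under some assumptions: the data values are invented (three well-separated groups), the starting centroids are picked at evenly spaced positions in the sorted data rather than at random, and the loop runs a fixed number of iterations instead of testing for convergence.

```python
def k_means_1d(values, k, iterations=50):
    ordered = sorted(values)
    # Assumed initialization: k starting centroids spread through the sorted data.
    centroids = [ordered[(2 * i + 1) * len(ordered) // (2 * k)] for i in range(k)]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for v in values:
            nearest = min(range(k), key=lambda i: (v - centroids[i]) ** 2)
            clusters[nearest].append(v)
        centroids = [
            sum(cluster) / len(cluster) if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
    return centroids, clusters


def total_within_cluster_ss(centroids, clusters):
    """Sum over all clusters of squared distances from points to their centroid."""
    return sum(
        (v - c) ** 2 for c, cluster in zip(centroids, clusters) for v in cluster
    )


# Hypothetical data: three groups around 2, 12, and 22.
values = [1.0, 2.0, 3.0, 11.0, 12.0, 13.0, 21.0, 22.0, 23.0]

wss_by_k = {
    k: total_within_cluster_ss(*k_means_1d(values, k)) for k in range(1, 6)
}
for k, wss in wss_by_k.items():
    print(k, round(wss, 2))
```

For this data the total falls steeply up to k = 3 and only slightly afterwards, so the "elbow" of the curve points to three clusters.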
8. Random forest
Random forest is the name given to an ensemble of decision trees. In a random forest we have a collection of decision trees (hence the name “forest”). To classify a new object based on its attributes, each tree gives a classification, and we say the tree “votes” for that class. The forest chooses the classification that gets the most votes over all the trees in the forest.
Each tree is grown as follows:
- If the number of cases in the training set is N, a sample of N cases is drawn at random with replacement. This sample is the training set for growing the tree.
- If there are M input variables, a number m << M is specified. At each node, m variables are selected at random out of the M, and the best split on these m variables is used to split the node. The value of m is held constant while the forest is grown.
- Each tree is grown as large as possible, without pruning.
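The three ingredients that make the forest "random" and combine its trees can be sketched with the standard library: sampling N cases with replacement, choosing m of the M variables at random, and majority voting. The sketch deliberately leaves out the tree-growing itself; the function names, the seed, and the example rows are assumptions made for illustration.

```python
import random
from collections import Counter


def bootstrap_sample(rows, rng):
    """Draw len(rows) cases at random with replacement."""
    return [rng.choice(rows) for _ in rows]


def random_feature_subset(num_features, m, rng):
    """Pick m distinct variable indices out of num_features (m << M in practice)."""
    return rng.sample(range(num_features), m)


def majority_vote(tree_predictions):
    """The forest's answer is the class that received the most votes."""
    return Counter(tree_predictions).most_common(1)[0][0]


rng = random.Random(0)  # fixed seed so the sketch is reproducible

# Hypothetical training rows: (outlook, temperature, humidity, windy, play).
rows = [
    ("sunny", "hot", "high", False, "no"),
    ("overcast", "hot", "high", False, "yes"),
    ("rainy", "mild", "high", False, "yes"),
    ("rainy", "cool", "normal", True, "no"),
]

sample = bootstrap_sample(rows, rng)                           # training set for one tree
candidates = random_feature_subset(num_features=4, m=2, rng=rng)  # variables tried at one node
forest_answer = majority_vote(["yes", "no", "yes"])            # votes from three trees
```

Because sampling is done with replacement, some rows typically appear more than once in a tree's sample while others are left out.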
9. Gradient Boosting and AdaBoost algorithms
When we have to handle a lot of data and need predictions with high predictive power, we use boosting algorithms such as GBM and AdaBoost. Boosting is an ensemble learning technique. It combines the predictions of several base estimators, typically weak or average predictors, to build a strong predictor that is more robust than any single estimator. Boosting algorithms often do well in data science competitions such as Kaggle, AV Hackathon, and CrowdAnalytix.
GradientBoostingClassifier and Random Forest are two different tree-ensemble classifiers, and people often ask about the difference between them. In short, gradient boosting builds its trees one after another, each correcting the errors of the previous ones, while a random forest builds its trees independently on random samples and lets them vote.
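To show how boosting makes later estimators focus on earlier mistakes, here is a sketch of a single round of the classic (discrete) AdaBoost weight update. It assumes labels and predictions coded as -1 and +1, sample weights that sum to 1, and a weak learner whose weighted error is strictly between 0 and 1. The weak learner itself is not implemented; its predictions are supplied by hand.

```python
import math


def adaboost_round(weights, predictions, labels):
    """Return the weak learner's vote weight (alpha) and the updated sample weights."""
    # Weighted error: total weight of the misclassified samples.
    error = sum(w for w, p, y in zip(weights, predictions, labels) if p != y)
    # Accurate learners get a larger say in the final vote.
    alpha = 0.5 * math.log((1.0 - error) / error)
    # Misclassified samples (y * p == -1) are scaled up, correct ones scaled down.
    updated = [
        w * math.exp(-alpha * y * p)
        for w, p, y in zip(weights, predictions, labels)
    ]
    total = sum(updated)
    return alpha, [w / total for w in updated]


# Hypothetical round: four equally weighted samples, the last one misclassified.
labels = [1, 1, -1, -1]
predictions = [1, 1, -1, 1]
alpha, new_weights = adaboost_round([0.25] * 4, predictions, labels)
```

After the update the misclassified sample carries half of the total weight, so the next weak learner is pushed to get it right.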
10. Dimensionality reduction algorithms
In the past four to five years, data capture has grown exponentially at every possible stage. Companies, government agencies, and research organizations are not only tapping new data sources but also capturing data in much greater detail.
For example, e-commerce companies capture more and more details about their customers, such as demographics, web browsing history, likes and dislikes, purchase history, and feedback, so that they can give customers more personalized attention than the nearest grocery shopkeeper would.
As data scientists, we are given data that contains many features. This sounds like good material for building a robust model, but there is a challenge: how do you identify the most significant variables out of 1000 or 2000? In such cases, dimensionality reduction algorithms, together with other methods such as decision trees, random forests, PCA, and factor analysis, help us find the important variables, based on criteria such as the correlation matrix and the proportion of missing values.
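Two of the criteria mentioned above, the proportion of missing values and the correlation between variables, can be sketched as a simple filter. This is only one possible way to apply them: the thresholds, function names, and columns are assumptions for illustration, missing values are represented as `None`, and the correlation function assumes each column is not constant.

```python
import math


def missing_ratio(column):
    """Fraction of entries in the column that are missing (None)."""
    return sum(v is None for v in column) / len(column)


def pearson(xs, ys):
    """Pearson correlation over the rows where both values are present."""
    pairs = [(x, y) for x, y in zip(xs, ys) if x is not None and y is not None]
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    var_x = sum((x - mean_x) ** 2 for x, _ in pairs)
    var_y = sum((y - mean_y) ** 2 for _, y in pairs)
    return cov / math.sqrt(var_x * var_y)


def select_features(columns, max_missing=0.4, max_correlation=0.95):
    """Keep columns that are mostly complete and not near-duplicates of kept ones."""
    kept = []
    for name, values in columns.items():
        if missing_ratio(values) > max_missing:
            continue  # too many missing values
        if any(abs(pearson(values, columns[k])) > max_correlation for k in kept):
            continue  # almost the same information as a column already kept
        kept.append(name)
    return kept


# Hypothetical columns: two encode the same height, one is mostly missing.
columns = {
    "height_cm": [150.0, 160.0, 170.0, 180.0, 190.0],
    "height_in": [59.1, 63.0, 66.9, 70.9, 74.8],
    "visits": [3.0, 1.0, 4.0, 1.0, 5.0],
    "fax_number": [None, None, None, 1.0, None],
}
selected = select_features(columns)
```

Techniques such as PCA go further by constructing new combined variables rather than only dropping existing ones.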