Machine Learning Theory Learning

Posted Jun 16, 2020 · 4 min read

K Nearest Neighbor Algorithm--Near Vermilion One Turns Red, Near Ink One Turns Black


For the K nearest neighbor algorithm, judging whether a new data point belongs to red or black is very simple: it takes the same class as whichever training point is closest to it.
In this example, it obviously belongs to red. However, relying on a single neighbor makes it easy to fall into the trap of "a leaf before the eyes blocks the view of Mount Tai", that is, letting one nearby point hide the bigger picture. To avoid this, we can increase the number of nearest neighbors, say to 3.
Then it is clear that the new data point will be classified as black, since black is the majority class among its 3 nearest neighbors.
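
The red/black example can be sketched with scikit-learn's KNeighborsClassifier. The coordinates below are hypothetical toy data made up for illustration (the original figure is not reproduced here); they are chosen so that the single nearest neighbor is red while two of the three nearest neighbors are black.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical toy data: two red points and three black points in 2-D.
X_train = np.array([[1.0, 1.0], [0.0, 3.0], [2.0, 2.2], [2.5, 1.0], [3.0, 2.0]])
y_train = np.array(["red", "red", "black", "black", "black"])

# The new point whose class we want to judge.
new_point = np.array([[1.4, 1.2]])

predictions = {}
for k in (1, 3):
    model = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    predictions[k] = model.predict(new_point)[0]
    print(f"k={k}: {predictions[k]}")
```

With k=1 the new point follows its closest neighbor; with k=3 the majority vote among the three closest neighbors decides.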

Of course, the K nearest neighbor algorithm can also be used for regression. The principle is the same as for classification: the model selects the k points in the training dataset closest to the new data point and averages their y values. This average is used as the predicted value for the new data point.
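
A minimal sketch of the regression case, assuming scikit-learn's KNeighborsRegressor with its default uniform weighting. The training values are hypothetical and chosen so the average is easy to check by hand.

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

# Hypothetical 1-D training data.
X_train = np.array([[1.0], [2.0], [3.0], [10.0]])
y_train = np.array([10.0, 20.0, 30.0, 100.0])

reg = KNeighborsRegressor(n_neighbors=3).fit(X_train, y_train)

# The 3 nearest neighbors of x=2.2 are x=2, x=3 and x=1,
# so the prediction is the mean of their y values: (20 + 30 + 10) / 3.
prediction = reg.predict(np.array([[2.2]]))[0]
print(prediction)
```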


The K nearest neighbor algorithm is a classic algorithm whose principle is very easy to understand.

However, it has many problems in practical applications. For example, it requires careful preprocessing of the data, it is slow on large datasets, it fits high-dimensional datasets poorly, and it is of almost no help on sparse datasets.

Therefore, the K nearest neighbor algorithm is rarely used in today's common application scenarios.

Generalized Linear Model--A Straightforward Algorithm Model




The black dots in the figure can be understood as training data, and the straight line is our linear model. With this straight line, we can predict values for new data.
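
As a rough illustration of fitting such a line and using it for prediction, here is a sketch with scikit-learn's LinearRegression. The training points are hypothetical and lie exactly on y = 2x + 1, so the fitted slope and intercept can be checked by eye.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical training data lying exactly on the line y = 2x + 1.
X_train = np.array([[0.0], [1.0], [2.0], [3.0]])
y_train = 2.0 * X_train.ravel() + 1.0

line = LinearRegression().fit(X_train, y_train)
slope, intercept = line.coef_[0], line.intercept_

# Use the fitted line to predict y for a new x.
prediction = line.predict(np.array([[5.0]]))[0]
print(slope, intercept, prediction)
```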


The linear models covered here are:

  1. The most basic linear model: linear regression
  2. Linear regression with L2 regularization: ridge regression
    A linear regression that guards against overfitting. The model keeps every feature but shrinks the feature coefficients, so that each feature has less influence on the prediction. The amount of shrinkage is controlled by the alpha parameter.
  3. Linear regression with L1 regularization: lasso regression
    Unlike L2 regularization, L1 regularization makes the coefficients of some features exactly zero, which means those features are completely ignored by the model. Setting part of the coefficients to 0 can make the model easier to understand and highlights its most important features. This is also controlled by the alpha parameter.

The smaller the alpha parameter, the weaker the regularization. As alpha approaches 0, the model degenerates to almost the same thing as plain linear regression.
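
The effect of the two penalties and of alpha can be sketched as follows. The data is synthetic and the alpha values are arbitrary choices for illustration: only the first two of ten features actually influence y.

```python
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression, Ridge

# Synthetic data (an assumption for illustration): 10 features,
# of which only the first two actually matter.
rng = np.random.RandomState(0)
X = rng.normal(size=(100, 10))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=100)

ols = LinearRegression().fit(X, y)
ridge = Ridge(alpha=10.0).fit(X, y)       # L2: shrinks all coefficients
ridge_weak = Ridge(alpha=1e-6).fit(X, y)  # tiny alpha: almost no regularization
lasso = Lasso(alpha=0.1).fit(X, y)        # L1: sets some coefficients to zero

n_zero_ridge = int(np.sum(ridge.coef_ == 0))
n_zero_lasso = int(np.sum(lasso.coef_ == 0))

print("coefficient norm, plain vs ridge:",
      np.linalg.norm(ols.coef_), np.linalg.norm(ridge.coef_))
print("exactly-zero coefficients, ridge vs lasso:", n_zero_ridge, n_zero_lasso)
```

Ridge keeps every feature with smaller coefficients, lasso drops some features entirely, and the ridge model with a tiny alpha is practically indistinguishable from plain linear regression.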

In practice, ridge regression is usually the first choice. But if your data has a great many features and only a small fraction of them really matter, lasso regression is likely the better choice.

When the dataset has relatively few features, linear models tend to perform relatively weakly.

Naive Bayes--Thunder! It's Raining, Bring in the Laundry!

scikit-learn offers several Naive Bayes variants. Three are covered here:

  1. Bernoulli Naive Bayes (BernoulliNB)
  2. Gaussian Naive Bayes (GaussianNB)
  3. Multinomial Naive Bayes (MultinomialNB)

Bernoulli Naive Bayes

This method suits datasets whose features follow the Bernoulli distribution, also known as the two-point or 0-1 distribution (a binomial distribution with a single trial). For example, a coin toss has only two possible outcomes.
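
A minimal sketch of this variant using scikit-learn's BernoulliNB, which expects 0/1 features. The tiny weather dataset is hypothetical: each row is [thunder, dark clouds] and the label says whether it rained.

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB

# Hypothetical binary features: [thunder, dark_clouds]; label 1 = rain.
X_train = np.array([[1, 1], [1, 1], [1, 0], [0, 1], [0, 0], [0, 0]])
y_train = np.array([1, 1, 1, 0, 0, 0])

clf = BernoulliNB().fit(X_train, y_train)

pred_stormy = clf.predict(np.array([[1, 1]]))[0]  # thunder and dark clouds
pred_clear = clf.predict(np.array([[0, 0]]))[0]   # neither
print(pred_stormy, pred_clear)
```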

Gaussian Naive Bayes

As the name implies, it is the algorithm used when the sample features are assumed to be Gaussian, or normally distributed.

In fact, Gaussian Naive Bayes can handle most classification tasks, because a large number of phenomena in the natural and social sciences are approximately normally distributed.
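
As a rough sketch, scikit-learn's GaussianNB can be tried on synthetic data where each class really is normally distributed. The two class centers (0 and 5) are arbitrary values picked for illustration.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

# Synthetic data (an assumption for illustration): two classes whose
# features are normally distributed around different centers.
rng = np.random.RandomState(0)
X0 = rng.normal(loc=0.0, scale=1.0, size=(50, 2))
X1 = rng.normal(loc=5.0, scale=1.0, size=(50, 2))
X = np.vstack([X0, X1])
y = np.array([0] * 50 + [1] * 50)

clf = GaussianNB().fit(X, y)

# theta_ holds the per-class mean of each feature estimated from the data.
print(clf.theta_)

preds = clf.predict(np.array([[0.0, 0.0], [5.0, 5.0]]))
print(preds)
```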

Multinomial Naive Bayes

A coin has two sides, heads and tails, while a die has 6 faces. When we roll the die N times, the counts of how often each face comes up follow a multinomial distribution.

But multinomial Naive Bayes is only suitable for classifying non-negative discrete numerical features. A typical example is classifying text that has been vectorized into word counts.
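
The text-classification case can be sketched with scikit-learn's CountVectorizer, which turns text into non-negative word counts, followed by MultinomialNB. The four short documents and their spam/ham labels are made up for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical miniature corpus.
docs = [
    "cheap pills buy now",
    "buy cheap watches now",
    "meeting agenda for monday",
    "project meeting notes",
]
labels = ["spam", "spam", "ham", "ham"]

# Convert each document into a vector of word counts (non-negative integers).
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)

clf = MultinomialNB().fit(X, labels)

pred_spam = clf.predict(vectorizer.transform(["buy cheap pills"]))[0]
pred_ham = clf.predict(vectorizer.transform(["monday meeting"]))[0]
print(pred_spam, pred_ham)
```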


Compared with linear models, Naive Bayes is more efficient, because it treats every feature in the dataset as completely independent and does not consider the relationships between features. The price is a slightly weaker generalization ability, though in general this does not affect practical use.

Decision Trees and Random Forests--Algorithms That Can Read Minds

Decision trees are very widely used for both classification and regression. They work by reaching a decision through a series of if/else questions.
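
To make the if/else structure visible, one option is to print a fitted tree as text with scikit-learn's export_text. The sketch below assumes the bundled iris dataset and an arbitrary depth limit of 2 to keep the output short.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()

# A shallow tree keeps the printed rules short.
tree = DecisionTreeClassifier(max_depth=2, random_state=0)
tree.fit(iris.data, iris.target)

# Each line of the output is one "feature <= threshold" question
# or a leaf with the predicted class.
rules = export_text(tree, feature_names=list(iris.feature_names))
print(rules)
```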

Strengths and weaknesses of decision trees

Decision trees are easy to visualize. In addition, because a decision tree handles each feature separately, there is little need to transform or otherwise preprocess the data.

Although a decision tree can be pre-pruned with parameters such as max_depth or max_leaf_nodes, it still almost inevitably overfits. To mitigate overfitting, we generally turn to random forests.
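
A small sketch of pre-pruning, comparing an unrestricted tree with one limited by max_depth. The make_moons dataset, its noise level and the depth of 3 are assumptions chosen for illustration.

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic noisy two-class data (an assumption for illustration).
X, y = make_moons(n_samples=300, noise=0.3, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

full = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
pruned = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_train, y_train)

# Store (training accuracy, test accuracy) for each tree.
scores = {
    "full": (full.score(X_train, y_train), full.score(X_test, y_test)),
    "pruned": (pruned.score(X_train, y_train), pruned.score(X_test, y_test)),
}
for name, (train_acc, test_acc) in scores.items():
    print(f"{name}: train={train_acc:.3f} test={test_acc:.3f}")
print("depths:", full.get_depth(), pruned.get_depth())
```

The unrestricted tree memorizes the training set perfectly, which is the overfitting described above; a gap between its training and test accuracy is the usual symptom.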

Random Forest

Decision trees are prone to overfitting, so a random forest bundles several different decision trees together. Each tree is built differently, and the predictions of all the trees are then averaged.

There are three important parameters:

  • bootstrap
    Whether each tree is trained on a sample drawn with replacement from the training data.
  • max_features
    Controls the maximum number of features considered when building a tree. If not set, scikit-learn falls back to its default, which in recent versions is the square root of the feature count for classifiers and all features for regressors. The larger the value, the more the trees resemble each other, because each tree has more features to choose from and fits the data more easily. With a lower value, the trees differ greatly from one another, and more trees are needed to fit the data.
  • n_estimators
    Controls the number of decision trees.
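
The three parameters can be seen together in a sketch using scikit-learn's RandomForestClassifier. The dataset and the parameter values are arbitrary choices for illustration. The last lines check that the forest's class probabilities are the average of the individual trees' probabilities.

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic noisy two-class data (an assumption for illustration).
X, y = make_moons(n_samples=300, noise=0.3, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

forest = RandomForestClassifier(
    n_estimators=25,      # number of decision trees
    max_features="sqrt",  # features considered at each split
    bootstrap=True,       # train each tree on a sample drawn with replacement
    random_state=0,
).fit(X_train, y_train)

test_accuracy = forest.score(X_test, y_test)
print("trees:", len(forest.estimators_), "test accuracy:", test_accuracy)

# The forest's predicted probabilities are the mean over its trees.
tree_mean = np.mean([t.predict_proba(X_test) for t in forest.estimators_], axis=0)
print(np.allclose(forest.predict_proba(X_test), tree_mean))
```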


In machine learning, random forests are among the most widely used algorithms for both classification and regression.

But on ultra-high-dimensional or sparse datasets, random forests struggle, and linear models perform better. Random forests also consume more memory and run slower than linear models, so if memory and time are tight, a linear model is the recommended choice.