This project tackles three parts using the Python programming language and JupyterLab. The first part focuses on programming KNN (k-nearest neighbors) and NB (Naive Bayes) from scratch. We then compare the results obtained by the two algorithms and evaluate which one performs best.

In the second part, we use the sklearn library to compare the two algorithms on a larger dataset, specifically in four different settings: influence of a reduced training set, influence of a large training set, absence of a teacher, and unknown distribution.

In the third part, we compare the same algorithms for image classification on 10 different classes, using feature descriptors.

## Table of Contents

Introduction

1. Programming of a discrimination method: KNNs (K-nearest neighbors) and Naive Bayes (NB)

1.1 Data

1.2 Implementation of KNNs Algorithm

1.3 Implementation of NBs Algorithm

1.4 Experiment and Results

2. Comparison of the two methods (parametric and nonparametric)

2.1 Influence of the size of the training set: the case of reduced size

2.2 Influence of the size of the training set: the case of large size

2.3 In the case of the absence of a teacher

2.4 Distribution unknown

3. Approach based on Descriptors

3.1 Calculation of descriptors

3.2 Implementation of a classification system

Conclusion

References

## Introduction

In this project, we tackle three parts using the Python programming language and *JupyterLab*. The first part focuses on programming KNN (k-nearest neighbors) and NB (Naive Bayes) from scratch. We then compare the results obtained by the two algorithms and evaluate which one performs best.

In the second part, we use the *sklearn* library to compare the two algorithms on a larger dataset, specifically in four different settings:

1. Influence of a reduced training set

2. Influence of a large training set

3. Absence of a teacher

4. Unknown distribution

In the third part, we compare the same algorithms for image classification on 10 different classes, using feature descriptors.

## 1. Programming of a discrimination method: KNNs (K-nearest neighbors) and Naive Bayes (NB)

In this section, we program the KNN algorithm as well as Naive Bayes (NB) from scratch.

### 1.1 Data

Before diving into the algorithms, let us get to know our data. We have 300 random data points, half of which are to be classified based on their distance to the points whose true classes we know. Figure (1) shows the distribution of the data in the vector space before applying the KNN and NB algorithms.

**Figure 1** - Data distribution in vector space

[Figure not included in this preview]

### 1.2 Implementation of KNNs Algorithm

To implement the KNN algorithm, we start with X data points whose classes we know (the truth), and N points whose classes we would like to predict. The first important step is to iterate over the N points and compute the distance between each of them and all of the truth points. In our case, we use the Euclidean distance: the square root of the sum of the squared differences between the coordinates.

After computing the distances, we recover the indices of the smallest distances by sorting the distance table while keeping track of the original indices. We then retain the first k entries of this table and assign the most frequent class among these k neighbors. The process is summarized in figure (2).

[Figure not included in this preview]
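The steps above can be sketched in Python as follows; this is a minimal illustration (the function name and structure are my own, not taken from the project code):

```python
import numpy as np

def knn_predict(X_train, y_train, X_test, k=3):
    """Classify each test point by majority vote among its k nearest training points."""
    predictions = []
    for x in X_test:
        # Euclidean distance: square root of the sum of squared coordinate differences
        distances = np.sqrt(np.sum((X_train - x) ** 2, axis=1))
        # indices of the k smallest distances (argsort keeps the original indices)
        nearest = np.argsort(distances)[:k]
        # most frequent class among the k retained neighbours
        classes, counts = np.unique(y_train[nearest], return_counts=True)
        predictions.append(classes[np.argmax(counts)])
    return np.array(predictions)
```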

After applying KNN with different values of k at initialization, I obtained various results according to the error rate. The error rate is mainly used to assess the overall efficiency of the classifier, but it is not a fully trustworthy measure because it does not reflect performance in terms of information relevance and prediction. It is computed by dividing the number of misclassified points by the total number of points (examples). It is therefore the complement of accuracy: it gives the percentage of errors the algorithm commits during classification.

[Figure not included in this preview]
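The error rate described above can be sketched as follows (the helper name is illustrative):

```python
import numpy as np

def error_rate(y_true, y_pred):
    """Fraction of misclassified points: the complement of accuracy."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    # mean of the boolean mismatch mask = misclassified / total
    return np.mean(y_true != y_pred)
```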

### 1.3 Implementation of NBs Algorithm

The NB algorithm is based on Bayes' theorem, which is a way of finding a probability when we know certain other probabilities. The formula is:

P(A|B) = P(B|A) × P(A) / P(B)

Where:

- P(A|B): how often A happens given that B happens

- P(B|A): how often B happens given that A happens

- P(A): how likely A is on its own

- P(B): how likely B is on its own

The decision rule for this method is therefore to assign each point to the class with the highest posterior probability, computed from the parameters estimated during training.
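Since our data consists of continuous 2-D points, a common instantiation of this rule is Gaussian Naive Bayes, where each class-conditional likelihood P(B|A) is modeled as an independent Gaussian per feature. A minimal sketch (the class name and the small variance floor are illustrative choices, not from the project):

```python
import numpy as np

class GaussianNaiveBayes:
    """Minimal Gaussian Naive Bayes: fit per-class mean/variance, predict by maximum posterior."""

    def fit(self, X, y):
        self.classes_ = np.unique(y)
        self.priors_ = np.array([np.mean(y == c) for c in self.classes_])       # P(A)
        self.means_ = np.array([X[y == c].mean(axis=0) for c in self.classes_])
        # small floor added to the variance to avoid division by zero
        self.vars_ = np.array([X[y == c].var(axis=0) + 1e-9 for c in self.classes_])
        return self

    def predict(self, X):
        log_post = []
        for prior, mu, var in zip(self.priors_, self.means_, self.vars_):
            # log-likelihood of an independent Gaussian per feature: log P(B|A)
            log_lik = -0.5 * np.sum(np.log(2 * np.pi * var) + (X - mu) ** 2 / var, axis=1)
            log_post.append(np.log(prior) + log_lik)
        # decision rule: the class with the highest posterior probability
        return self.classes_[np.argmax(np.array(log_post), axis=0)]
```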

### 1.4 Experiment and Results

Table (1) summarizes the error rates obtained for k equal to 1, 3, 5, 7, 13 and 15. I observed a slight improvement in the results as the number of neighbors k increases: the more nearest neighbors are considered to classify the unknown data, the better the performance of the algorithm. The classifier is more confident when it has more data to rely on for the final decision. However, the algorithm does not improve once k exceeds 13, so we can consider k = 13 the optimal choice for our data.

[Figure not included in this preview]
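A sweep like the one in table (1) could be reproduced along these lines; here scikit-learn's `KNeighborsClassifier` is used for brevity, and `make_blobs` stands in for the project's dataset, which is not available here:

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# synthetic stand-in for the project's 300 points in 3 classes
X, y = make_blobs(n_samples=300, centers=3, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=0)

errors = {}
for k in (1, 3, 5, 7, 13, 15):
    model = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    errors[k] = np.mean(model.predict(X_test) != y_test)  # error rate for this k

best_k = min(errors, key=errors.get)  # k with the lowest error rate
```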

To best visualize the algorithm in the vector space after applying KNN, we consult figure (3), which shows the performance of KNN after training and testing with different values of k. We can clearly see the effect of increasing the number of neighbors k from 1 to 13 on the resulting classification. Taking, for example, k = 1 and k = 3, we can see the difference in the class assignments of points that lie close to each other. The error rate curve on the dataset p1_data1 shows that the error rate decreases as the number of neighbors increases, and that increasing k beyond 13 is not useful.

*Figure 3 - Error rate as a function of k (KNN, k = 1..15)*

[Figure not included in this preview: error rate vs. number of neighbours k]

In comparison, I can conclude that KNN is competitive in computing time, useful for classification problems, and, most importantly, makes few assumptions about the data, with only a handful of parameters to tune when building the model.

It is worth considering the diagram for the best choice of k, namely k = 13. In figure (4), we can notice a clear separation between the points in three different colors: green, blue, and red.

*Figure 4 - Best performance of KNN, k = 13*

[Figure not included in this preview]

On the other hand, applying NB gives an error rate of 0.7, which is quite high compared to KNN. This goes back to the uncertainty of classification on a small set of points. In figure (5), we notice the distribution of the points over the 3 classes, but the assignments do not appear to be a good choice for classification.

We can thus see that the Naive Bayes algorithm is less robust, since individuals are misclassified; this test yielded an error rate of 0.7. Figure (5) details the distribution of the points over the 3 classes.

[Figure not included in this preview]

## 2. Comparison of the two methods (parametric and nonparametric)

In this section, we apply the same algorithms used in the first part of the project, but this time with the help of the *sklearn* library. The section is divided into four main parts:

- Influence of a reduced training set

- Influence of a large training set

- Absence of a teacher

- Unknown distribution

To evaluate the algorithms, we use the accuracy measure. It is computed by dividing the number of correctly classified points *(TP+TN)* by the total number of points:

Accuracy = (TP + TN) / (TP + TN + FP + FN)

### 2.1 Influence of the size of the training set: the case of reduced size

*Figure 6- Influence of reduced size on algorithms' performances*

[Figure not included in this preview]

With the reduced-size dataset, we notice that the accuracy of KNN is lower than usual, contrary to NB. This is also reflected in the distribution of individuals, which is very confused. In summary, the KNN algorithm needs a large training base to obtain a good estimation of the classes, while the NB algorithm appears more robust on a small dataset, where we observe its highest accuracy of 0.92.
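An experiment of this kind might be sketched as follows; the synthetic data, the 5% training fraction, and k = 5 are illustrative assumptions, not the project's actual settings:

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier

# synthetic 3-class data as a stand-in for the project's dataset
X, y = make_blobs(n_samples=500, centers=3, random_state=0)

# keep only 5% of the points for training to simulate the reduced set
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.05, random_state=0, stratify=y)

scores = {}
for name, model in [("KNN", KNeighborsClassifier(n_neighbors=5)),
                    ("NB", GaussianNB())]:
    model.fit(X_train, y_train)
    scores[name] = accuracy_score(y_test, model.predict(X_test))
```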

### 2.2 Influence of the size of the training set: the case of large size

*Figure 7- The influence of large size on algorithms' performances*

[Figure not included in this preview]

**[...]**