Recognizing Handwritten Digits with scikit-learn

Prathamesh Vikas Sankpal
5 min read · Apr 14, 2021

In this article, I am going to analyze and predict handwritten digits using a Support Vector Machine (SVM). SVM is a supervised machine learning algorithm that can be used for both classification and regression. It is mostly used for classification problems, where it is called a Support Vector Classifier (SVC).

I have used the scikit-learn and matplotlib libraries for Python in this project, together with scikit-learn's predefined digits dataset, which is loaded with load_digits.

The hypothesis to be tested: The scikit-learn library provides numerous datasets that are useful for testing many problems of data analysis and prediction, and the Digits dataset is one of them. Some scientists claim that an SVM predicts the digit correctly 95% of the time. Perform a data analysis to accept or reject this hypothesis.

To test it, we run 3 test cases, each with a different range of training and testing data.

I have divided the project into 3 parts.

  1. Getting the dataset and analyzing it.
  2. Cleaning and preprocessing the data and setting up the machine learning algorithm (SVM).
  3. Training the algorithm and getting predictions.

1. Getting the dataset and analyzing it.

Before collecting the data, we have to import the required libraries in Python. Here I have used the matplotlib and scikit-learn libraries, importing svm, train_test_split, and load_digits.
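The imports named above could look like the following. This is a sketch based on the current scikit-learn module layout, since the original import cell is not shown; the `plt` alias is the usual convention rather than something the article states.

```python
# Plotting library used to display the digit images.
import matplotlib.pyplot as plt

# SVM estimators (svm.SVC is the classifier used later).
from sklearn import svm
# Loader for the bundled handwritten-digits dataset.
from sklearn.datasets import load_digits
# Helper that splits arrays into training and testing parts.
from sklearn.model_selection import train_test_split
```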

Save the digit data in a variable:
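A minimal sketch of this step. The variable name `data` is taken from the `data.images[0]` reference later in the article; printing the keys is an extra added here for illustration.

```python
from sklearn.datasets import load_digits

# Load the bundled handwritten-digit dataset into a variable.
data = load_digits()

# The returned object holds, among other things, the images and their labels.
print(data.keys())
```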

Print the length and shape of the image array:
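One way to do this, assuming the dataset was loaded into a variable named `data`:

```python
from sklearn.datasets import load_digits

data = load_digits()

# Number of images in the dataset.
print(len(data.images))    # 1797
# Each image is an 8x8 grid of pixel intensities, so the array is 3D.
print(data.images.shape)   # (1797, 8, 8)
```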

Plot the data.images[0] array to visualize the data:
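A possible way to draw the first image with matplotlib. The colormap, figure size, and title are illustrative choices, not taken from the article.

```python
import matplotlib.pyplot as plt
from sklearn.datasets import load_digits

data = load_digits()

# Draw the first 8x8 image in grayscale, with its label as the title.
fig, ax = plt.subplots(figsize=(3, 3))
ax.imshow(data.images[0], cmap="gray_r", interpolation="nearest")
ax.set_title(f"Label: {data.target[0]}")
ax.axis("off")
# plt.show()  # uncomment when running as a script to open the window
```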

Print the target array:

We can conclude that the target array consists of 1797 entries.
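A short sketch, assuming the array printed here is `data.target`, the label for each image:

```python
from sklearn.datasets import load_digits

data = load_digits()

# The digit (0-9) that each image represents.
print(data.target)
# Total number of labels, one per image.
print(data.target.size)   # 1797
```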

2. Cleaning and preprocessing the data and setting up the machine learning algorithm (SVM).

We have gathered, analyzed, and visualized our data. Now it's time to preprocess the data and prepare a machine learning algorithm. Knowing the shape of our data, we can perform whatever action is needed to preprocess it.

Change the shape of our image data from 3D to 2D, so that each 8×8 image becomes a single row of 64 features:
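The reshape can be sketched as follows. The names `n_samples` and `X` are chosen here for illustration.

```python
from sklearn.datasets import load_digits

data = load_digits()

n_samples = len(data.images)
# Flatten each 8x8 image into one row; -1 lets NumPy infer the 64 columns.
X = data.images.reshape((n_samples, -1))

print(data.images.shape)  # (1797, 8, 8)
print(X.shape)            # (1797, 64)
```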

Split the data into 2 parts, training and testing:

The training part is fed to the algorithm, and the algorithm's predictions for the testing inputs are then compared with the known testing labels.
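A sketch of the split using `train_test_split`. The article does not state the proportions, so `test_size=0.25` and `random_state=0` are assumptions made only so the example is reproducible.

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split

data = load_digits()
X = data.images.reshape((len(data.images), -1))  # 2D feature matrix
y = data.target                                   # labels

# Assumed proportions: 75% for training, 25% for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

print(X_train.shape, X_test.shape)
print(y_train.shape, y_test.shape)
```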

Set up the machine learning algorithm (SVM) as needed. I have used gamma = 0.001 and C = 100. You can change these values and check how the output changes.
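With the values quoted above, the classifier could be created like this. The variable name `svc` is an assumption.

```python
from sklearn import svm

# Support Vector Classifier with the gamma and C values from the text.
svc = svm.SVC(gamma=0.001, C=100)
print(svc)
```

Other settings, including the kernel, are left at scikit-learn's defaults here, since the article only mentions gamma and C.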

3. Training the algorithm and getting predictions.

I) Train the algorithm by feeding it our training data.

Now we are going to check how our algorithm is working. We feed our testing data (X part) to the algorithm for prediction and check the output.
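Training comes down to a single `fit` call. The sketch below rebuilds the earlier steps so it runs on its own; the split proportions and variable names are assumptions.

```python
from sklearn import svm
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split

data = load_digits()
X = data.images.reshape((len(data.images), -1))
y = data.target

# Assumed split; the article does not state the proportions.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

svc = svm.SVC(gamma=0.001, C=100)
# Learn from the training part only.
svc.fit(X_train, y_train)

# The classes the fitted model can predict.
print(svc.classes_)
```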

Now, look at the testing data (y part) before comparing it with the prediction.

Now, we compare the two arrays and check how accurate our algorithm is.
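The predict-and-compare steps can be sketched as follows. `accuracy_score` is one convenient way to do the comparison and may not be what the original code used; the split proportions are again an assumption.

```python
from sklearn import svm
from sklearn.datasets import load_digits
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

data = load_digits()
X = data.images.reshape((len(data.images), -1))
y = data.target

# Assumed split; the article does not state the proportions.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

svc = svm.SVC(gamma=0.001, C=100)
svc.fit(X_train, y_train)

# Predict labels for the testing inputs (X part).
prediction = svc.predict(X_test)
print(prediction[:10])
# The true labels (y part) for the same samples.
print(y_test[:10])

# Fraction of test samples predicted correctly.
accuracy = accuracy_score(y_test, prediction)
print(f"Accuracy: {accuracy:.2%}")
```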

II) Before feeding the second split of the dataset to the algorithm, first visualize the data by plotting it.
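A sketch of such a plot, assuming the data meant here is the small group of digits held back for prediction in this test case, taken as the Python slice `[1791:1797]`. The grid layout and colormap are illustrative.

```python
import matplotlib.pyplot as plt
from sklearn.datasets import load_digits

data = load_digits()

# Assumed test samples for this case: the last six images.
images = data.images[1791:1797]
labels = data.target[1791:1797]

fig, axes = plt.subplots(3, 2, figsize=(4, 6))
for ax, image, label in zip(axes.ravel(), images, labels):
    ax.imshow(image, cmap="gray_r", interpolation="nearest")
    ax.set_title(f"Label: {label}")
    ax.axis("off")
fig.tight_layout()
# plt.show()  # uncomment when running as a script to open the window
```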

Feed the second split of the dataset to the algorithm. Here we use the 1st to the 1790th samples to train the algorithm.
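A sketch of this training step. It reads "1 to 1790th" as the Python slice `[1:1790]`, which is an interpretation, and uses variable names chosen for illustration.

```python
from sklearn import svm
from sklearn.datasets import load_digits

data = load_digits()
X = data.images.reshape((len(data.images), -1))
y = data.target

svc = svm.SVC(gamma=0.001, C=100)
# Train on samples 1..1789 (slice [1:1790], an assumed reading of the text).
svc.fit(X[1:1790], y[1:1790])
print(len(X[1:1790]), "training samples")
```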

Now, we are going to check how our algorithm is working. We feed our testing data (X part) to the algorithm for prediction and check the output. Here we use the 1791st to the 1796th samples for prediction.

We compare the two arrays and check how accurate our algorithm is.

We can see that the algorithm predicts every digit in this small test set correctly (100% accuracy), so for this test case the hypothesis is accepted.
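The prediction and comparison for this test case can be sketched as below. The slices `[1:1790]` and `[1791:1797]` are an assumed reading of the sample ranges given in the text.

```python
from sklearn import svm
from sklearn.datasets import load_digits

data = load_digits()
X = data.images.reshape((len(data.images), -1))
y = data.target

svc = svm.SVC(gamma=0.001, C=100)
svc.fit(X[1:1790], y[1:1790])

# Assumed test range: the last six samples of the dataset.
X_test, y_test = X[1791:1797], y[1791:1797]
prediction = svc.predict(X_test)

print("predicted:", prediction)
print("actual:   ", y_test)
# Fraction of the six test digits predicted correctly.
accuracy = (prediction == y_test).mean()
print(f"Accuracy: {accuracy:.2%}")
```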

III) We now set up a third split of training and testing data. Here I use the 1st to the 1600th samples for training and the 1601st to the 1796th samples for testing.

Now we feed the training data to our algorithm.
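A sketch of this training step, reading "1 to 1600th" as the Python slice `[1:1600]` (an assumption) and using illustrative variable names.

```python
from sklearn import svm
from sklearn.datasets import load_digits

data = load_digits()
X = data.images.reshape((len(data.images), -1))
y = data.target

# Assumed training range for the third test case.
X_train, y_train = X[1:1600], y[1:1600]

svc = svm.SVC(gamma=0.001, C=100)
svc.fit(X_train, y_train)
print(X_train.shape)
```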

Now, we are going to check how our algorithm is working. We feed our testing data (X part) to the algorithm for prediction and check the output.

Now, let’s look at the testing data (y part) before comparing it with the prediction.

Now, we compare the two arrays and check how accurate our algorithm is.
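For the third test case, the prediction and comparison might look like this. The slices are an assumed reading of the ranges in the text, and `accuracy_score` is used here as one convenient way to compare the arrays.

```python
from sklearn import svm
from sklearn.datasets import load_digits
from sklearn.metrics import accuracy_score

data = load_digits()
X = data.images.reshape((len(data.images), -1))
y = data.target

svc = svm.SVC(gamma=0.001, C=100)
svc.fit(X[1:1600], y[1:1600])

# Assumed test range: everything after the training samples.
X_test, y_test = X[1601:1797], y[1601:1797]
prediction = svc.predict(X_test)

# Count of correct predictions and the resulting accuracy.
n_correct = int((prediction == y_test).sum())
accuracy = accuracy_score(y_test, prediction)
print(f"{n_correct} of {len(y_test)} correct, accuracy {accuracy:.2%}")
```

The printed accuracy is the figure to compare against the 95% threshold in the hypothesis.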

Conclusion:

We can conclude that the SVM algorithm predicts the digit correctly 95% of the time or more. So, the hypothesis we tested is accepted.
