A Tutorial on Traffic Sign Classification using PyTorch

Suraj Krishnamurthy
9 min read · Oct 18, 2020

Traffic Sign Recognition (TSR) is undoubtedly one of the most important problems in the field of driverless cars and advanced driver assistance systems (ADAS). TSR enables the front-facing smart cameras mounted on a car to recognize signboards so that the car can act accordingly. Examples include recognizing stop signs, speed limits, turn signs, etc.

More formally, TSR is expected to perform two tasks:
1. Traffic Sign Detection: Detect all the signs from a given video frame
2. Traffic Sign Recognition: Recognize all the detected signs

The focus of this blogpost is the second step alone, i.e., the recognition part. I assume that the signboards have all been detected and that the sign region alone has been cropped and stored as a separate image. However, detection and classification can also be solved as a single problem, which will be addressed in a separate blogpost.

In this blogpost, I provide a step-by-step guide to train a deep neural network model for traffic sign classification.

If you’re someone who likes to directly dive into the code, here’s the link for the Github repo: https://github.com/surajmurthy/TSR_PyTorch

I use the German Traffic Sign Recognition Benchmark (GTSRB) dataset for demonstrating TSR.

The GTSRB dataset consists of 39,209 training images corresponding to 43 classes. An example image for each class of the dataset is shown below.

Sample images and the class labels of GTSRB dataset

It is important to note that these example images are not part of the training data. The training data is in fact quite challenging, with varying sizes and contrast, and many noisy or blurred samples. A montage of training samples from all classes of the GTSRB dataset is provided in the image below.

Examples of training data of GTSRB dataset

Preparing data for training

I defined a simple transformation that performs only two operations: resize the images to 112 x 112 and convert them to PyTorch tensors.
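A minimal sketch of such a transform using torchvision; details like the interpolation mode are assumptions, not taken from the original gist:

import torchvision.transforms as transforms

# Resize every image to 112 x 112 and convert it to a PyTorch tensor
transform = transforms.Compose([
    transforms.Resize((112, 112)),
    transforms.ToTensor(),
])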

As mentioned earlier, the GTSRB dataset consists of 39,209 training images corresponding to 43 classes.

I first split the training data into two sets, training and validation, in an 80:20 ratio, giving 31,367 images for training and the remaining 7,842 for validation. I then created data loaders for both sets, as sketched below.
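A sketch of the split and the loaders, assuming the training images are arranged one folder per class so that ImageFolder applies; the path 'GTSRB/Train' and the batch size of 64 are assumptions:

from torch.utils.data import DataLoader, random_split
from torchvision import datasets

# Load the full training set with the transform defined earlier
dataset = datasets.ImageFolder(root='GTSRB/Train', transform=transform)

# 80:20 split: 31,367 training images and 7,842 validation images
n_train = int(0.8 * len(dataset))
train_data, val_data = random_split(dataset, [n_train, len(dataset) - n_train])

train_loader = DataLoader(train_data, batch_size=64, shuffle=True)
val_loader = DataLoader(val_data, batch_size=64, shuffle=False)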

The distribution of training and validation examples is plotted in the figure below. As can be observed, the distribution of the validation examples closely follows that of the training set.

Distribution plot of training and validation examples
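One way to produce such a plot, assuming the ImageFolder/random_split setup sketched above (random_split returns Subset objects whose indices point back into the full dataset):

import matplotlib.pyplot as plt
import torch

# Per-class label counts for each split
train_counts = torch.bincount(
    torch.tensor([dataset.targets[i] for i in train_data.indices]), minlength=43)
val_counts = torch.bincount(
    torch.tensor([dataset.targets[i] for i in val_data.indices]), minlength=43)

plt.bar(range(43), train_counts.tolist(), label='training')
plt.bar(range(43), val_counts.tolist(), label='validation')
plt.xlabel('Class ID')
plt.ylabel('Number of examples')
plt.legend()
plt.show()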

Implementing a Convolutional Neural Network

To perform TSR, I built a CNN whose architecture is similar to that of the original AlexNet network used for image classification on ImageNet by Alex Krizhevsky et al. [3]. I call this model AlexNetTS.

The code that describes AlexNetTS is provided in the gist below.
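Since the original gist is not reproduced here, the following is a reconstruction of the architecture from the layer printout further down; the layer types, shapes, and parameter counts match that printout, while details such as the constructor signature are assumptions.

import torch.nn as nn

class AlexnetTS(nn.Module):
    def __init__(self, output_dim):
        super().__init__()
        # Convolutional feature extractor: 112 x 112 input shrinks to 7 x 7
        self.features = nn.Sequential(
            nn.Conv2d(3, 64, kernel_size=3, stride=2, padding=1),
            nn.MaxPool2d(kernel_size=2),
            nn.ReLU(inplace=True),
            nn.Conv2d(64, 192, kernel_size=3, padding=1),
            nn.MaxPool2d(kernel_size=2),
            nn.ReLU(inplace=True),
            nn.Conv2d(192, 384, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(384, 256, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(256, 256, kernel_size=3, padding=1),
            nn.MaxPool2d(kernel_size=2),
            nn.ReLU(inplace=True),
        )
        # Fully connected classifier head
        self.classifier = nn.Sequential(
            nn.Dropout(0.5),
            nn.Linear(256 * 7 * 7, 1000),   # 12544 input features
            nn.ReLU(inplace=True),
            nn.Dropout(0.5),
            nn.Linear(1000, 256),
            nn.ReLU(inplace=True),
            nn.Linear(256, output_dim),
        )

    def forward(self, x):
        x = self.features(x)          # (N, 256, 7, 7) for 112 x 112 inputs
        x = x.view(x.size(0), -1)     # flatten to (N, 12544)
        return self.classifier(x)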

Once the architecture was defined, I created the network and printed its details. In addition, to generate a summary of the model, I used the wonderful torchsummary package [2]; please check the GitHub link in the references for more details. The torchsummary package requires the input size so that it can make a forward pass through the network and print the output shape and number of parameters at each layer.

The code to create the model and print the summary is provided below.
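A sketch of the model creation and summary call; the device handling shown here is an assumption:

import torch
from torchsummary import summary

device = 'cuda' if torch.cuda.is_available() else 'cpu'
model = AlexnetTS(output_dim=43).to(device)   # 43 GTSRB classes

print(model)                                  # layer-by-layer details, shown below
summary(model, (3, 112, 112), device=device)  # needs the input size for a forward pass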

Our model has 15,063,891 trainable parameters. The details of the model, as printed by the print(model) call above, are provided below.

AlexnetTS(
(features): Sequential(
(0): Conv2d(3, 64, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1))
(1): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
(2): ReLU(inplace=True)
(3): Conv2d(64, 192, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
(4): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
(5): ReLU(inplace=True)
(6): Conv2d(192, 384, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
(7): ReLU(inplace=True)
(8): Conv2d(384, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
(9): ReLU(inplace=True)
(10): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
(11): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
(12): ReLU(inplace=True)
)
(classifier): Sequential(
(0): Dropout(p=0.5, inplace=False)
(1): Linear(in_features=12544, out_features=1000, bias=True)
(2): ReLU(inplace=True)
(3): Dropout(p=0.5, inplace=False)
(4): Linear(in_features=1000, out_features=256, bias=True)
(5): ReLU(inplace=True)
(6): Linear(in_features=256, out_features=43, bias=True)
)
)

Since I’m providing RGB images of size 112 x 112 to the network for training, I use the input size (3, 112, 112) to print the summary, which is provided below. The summary contains the following information:

  1. Output shape and parameter count at each layer
  2. Estimated total size of the network
  3. Total number of parameters in the network, both trainable and non-trainable
----------------------------------------------------------------
Layer (type) Output Shape Param #
================================================================
Conv2d-1 [-1, 64, 56, 56] 1,792
MaxPool2d-2 [-1, 64, 28, 28] 0
ReLU-3 [-1, 64, 28, 28] 0
Conv2d-4 [-1, 192, 28, 28] 110,784
MaxPool2d-5 [-1, 192, 14, 14] 0
ReLU-6 [-1, 192, 14, 14] 0
Conv2d-7 [-1, 384, 14, 14] 663,936
ReLU-8 [-1, 384, 14, 14] 0
Conv2d-9 [-1, 256, 14, 14] 884,992
ReLU-10 [-1, 256, 14, 14] 0
Conv2d-11 [-1, 256, 14, 14] 590,080
MaxPool2d-12 [-1, 256, 7, 7] 0
ReLU-13 [-1, 256, 7, 7] 0
Dropout-14 [-1, 12544] 0
Linear-15 [-1, 1000] 12,545,000
ReLU-16 [-1, 1000] 0
Dropout-17 [-1, 1000] 0
Linear-18 [-1, 256] 256,256
ReLU-19 [-1, 256] 0
Linear-20 [-1, 43] 11,051
================================================================
Total params: 15,063,891
Trainable params: 15,063,891
Non-trainable params: 0
----------------------------------------------------------------
Input size (MB): 0.14
Forward/backward pass size (MB): 6.63
Params size (MB): 57.46
Estimated Total Size (MB): 64.24
----------------------------------------------------------------

Training the network

The network was trained for 15 epochs using the Adam optimizer with a learning rate of 0.001, with cross-entropy loss as the loss function.

The code to perform training and validation is provided below.
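A minimal sketch of such a loop under the settings described above (Adam, learning rate 0.001, cross-entropy loss, 15 epochs); the per-epoch bookkeeping is simplified compared to the original gist:

import torch
import torch.nn as nn
import torch.optim as optim

criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=0.001)

for epoch in range(15):
    # Training pass
    model.train()
    for images, labels in train_loader:
        images, labels = images.to(device), labels.to(device)
        optimizer.zero_grad()
        loss = criterion(model(images), labels)
        loss.backward()
        optimizer.step()

    # Validation pass: no gradients needed
    model.eval()
    correct, total = 0, 0
    with torch.no_grad():
        for images, labels in val_loader:
            images, labels = images.to(device), labels.to(device)
            preds = model(images).argmax(dim=1)
            correct += (preds == labels).sum().item()
            total += labels.size(0)
    print(f'Epoch {epoch}: validation accuracy = {correct / total:.4f}')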

The plots of loss and accuracy for both sets are shown in the figure below. It can be noticed that although the network was trained for 15 epochs, the loss curve flattens after around 6 epochs. By the end of 15 epochs, both training and validation accuracy were above 99%.

Each training epoch took under a minute on a Windows 10 machine with an RTX 2060 GPU. Evaluation on the validation set took only about 8 seconds, i.e., roughly 1 millisecond per sample!

Epoch-14: 
Training: Loss = 0.0233, Accuracy = 0.9925, Time = 41.47 seconds
Validation: Loss = 0.0160, Accuracy = 0.9966, Time = 7.88 seconds
Plots of loss and accuracy of both training and validation sets

Inferencing on the Test Set

The GTSRB dataset also includes 12,630 test images, on which inference was performed. These images were NOT part of either the training or validation sets.

Using the model trained earlier, I was able to achieve an accuracy of around 95%. Impressive, but there’s still plenty of room for improvement!
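A sketch of the test-set inference loop, assuming a test_loader built the same way as the validation loader; running it prints counts like those below:

import torch

model.eval()
y_true, y_pred = [], []
with torch.no_grad():
    for images, labels in test_loader:
        preds = model(images.to(device)).argmax(dim=1)
        y_true.extend(labels.tolist())
        y_pred.extend(preds.cpu().tolist())

correct = sum(int(t == p) for t, p in zip(y_true, y_pred))
print(f'Number of correctly classified images = {correct}')
print(f'Number of incorrectly classified images = {len(y_true) - correct}')
print(f'Final accuracy = {correct / len(y_true):.6f}')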

Number of correctly classified images = 12028
Number of incorrectly classified images = 602
Final accuracy = 0.952336

It is also important to note that the TSR problem addressed here is a multi-class classification problem with an imbalanced training set, so relying on accuracy alone to measure the model's performance isn't sufficient.

Hence I generated a classification report using the scikit-learn package and also plotted the confusion matrix for better analysis. The classification report consists of precision, recall, and their harmonic mean, i.e., the F1 score. More details about these measures can be found in the following Wikipedia article: https://en.wikipedia.org/wiki/Precision_and_recall
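A sketch of how both can be produced with scikit-learn, reusing the y_true and y_pred lists collected during inference (the heatmap styling is an assumption):

import matplotlib.pyplot as plt
from sklearn.metrics import classification_report, confusion_matrix

print(classification_report(y_true, y_pred))

# Confusion matrix, visualized as a heatmap
cm = confusion_matrix(y_true, y_pred)
plt.imshow(cm, cmap='Blues')
plt.xlabel('Predicted label')
plt.ylabel('True label')
plt.show()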

                precision    recall  f1-score   support

0 0.97 0.95 0.96 60
1 0.97 0.95 0.96 720
2 0.94 0.98 0.96 750
3 0.98 0.96 0.97 450
4 0.95 0.95 0.95 660
5 0.95 0.92 0.93 630
6 0.99 0.89 0.94 150
7 0.97 0.94 0.95 450
8 0.87 0.98 0.92 450
9 1.00 1.00 1.00 480
10 1.00 0.98 0.99 660
11 0.97 0.90 0.93 420
12 0.96 0.92 0.94 690
13 0.99 0.99 0.99 720
14 1.00 1.00 1.00 270
15 0.96 0.97 0.97 210
16 1.00 0.99 0.99 150
17 0.99 0.98 0.99 360
18 0.96 0.87 0.91 390
19 0.69 0.95 0.80 60
20 0.89 0.97 0.93 90
21 0.98 0.99 0.98 90
22 0.94 0.90 0.92 120
23 1.00 0.80 0.89 150
24 0.85 0.94 0.89 90
25 0.95 0.93 0.94 480
26 0.76 0.90 0.82 180
27 0.54 0.63 0.58 60
28 0.99 0.90 0.94 150
29 0.90 0.83 0.87 90
30 0.88 0.79 0.83 150
31 0.87 0.99 0.93 270
32 0.91 1.00 0.95 60
33 0.99 0.99 0.99 210
34 0.94 0.99 0.96 120
35 1.00 0.99 0.99 390
36 0.98 1.00 0.99 120
37 0.95 1.00 0.98 60
38 1.00 0.99 0.99 690
39 0.98 0.98 0.98 90
40 0.70 1.00 0.83 90
41 0.97 1.00 0.98 60
42 0.92 0.93 0.93 90

accuracy 0.95 12630
macro avg 0.93 0.94 0.93 12630
weighted avg 0.96 0.95 0.95 12630
Confusion matrix for the test data for TSR

From the confusion matrix, it is quite evident that the model performs reasonably well. However, there are a few non-zero off-diagonal entries, which need to be reduced further.

For a better understanding, the predicted vs. actual labels for the first 30 images in the test set are provided below. Even though a few test images are of poor quality, the model is still able to classify them correctly.

Predicted v/s Actual labels for 30 images of test set

Conclusion and tips for improvement

In this tutorial, we looked at building a Convolutional Neural Network model to perform traffic sign recognition. A modification of the popular AlexNet architecture was used for performing the task, and the network was trained on the GTSRB dataset.

While we achieved around 99% accuracy on both the training and validation sets, the accuracy on the test set was around 95%. Using additional metrics such as precision, recall, F1 score, and the confusion matrix, we could confirm that the model indeed performs well on the test set.

Listed below are a few points that provide directions for further improvement:

  1. Since we are dealing with imbalanced training data, we can oversample the minority classes during training; the Synthetic Minority Oversampling Technique (SMOTE) can be used for this [4]
  2. Data augmentation techniques can be used to improve performance; I haven't used any augmentation in this tutorial
  3. A weighted cross-entropy loss can be applied during training, where each class weight is inversely related to the proportion of examples of that class (see the sketch after this list)
  4. Focal loss can be employed to penalize "hard" examples
  5. Adrian Rosebrock suggests improving image contrast by applying an algorithm called Contrast Limited Adaptive Histogram Equalization (CLAHE) [1]
  6. A combination of some or all of the above methods can be applied
  7. A technique called Curriculum Learning, in which training examples are fed to the model in a meaningful order, is an active area of research; works such as dynamic curriculum learning [5] are powerful tools for combating imbalanced data classification.
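As an illustration of point 3, here is a hedged sketch of a weighted cross-entropy loss, assuming per-class counts taken from the ImageFolder targets used earlier:

import torch
import torch.nn as nn

# Number of training examples per class (assumes the ImageFolder setup above)
class_counts = torch.bincount(torch.tensor(dataset.targets), minlength=43).float()

# Weight each class inversely to its frequency; a class of average size gets weight 1
weights = class_counts.sum() / (43 * class_counts)
criterion = nn.CrossEntropyLoss(weight=weights.to(device))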

References

[1] Adrian Rosebrock's tutorial on Traffic Sign Recognition. URL: https://www.pyimagesearch.com/2019/11/04/traffic-sign-classification-with-keras-and-deep-learning/

[2] Keras style model.summary() in PyTorch. URL: https://github.com/sksq96/pytorch-summary

[3] Krizhevsky, Alex, Ilya Sutskever, and Geoffrey E. Hinton. "ImageNet classification with deep convolutional neural networks." Advances in Neural Information Processing Systems. 2012

[4] Machine Learning Mastery article on SMOTE for imbalanced classification. URL: https://machinelearningmastery.com/smote-oversampling-for-imbalanced-classification/

[5] Wang, Yiru, et al. “Dynamic curriculum learning for imbalanced data classification.” Proceedings of the IEEE international conference on computer vision (ICCV). 2019. URL: https://openaccess.thecvf.com/content_ICCV_2019/html/Wang_Dynamic_Curriculum_Learning_for_Imbalanced_Data_Classification_ICCV_2019_paper.html
