Quick (and free) experiment for CPU vs GPU for Deep Learning

Image courtesy: NVIDIA


Recently I presented at some NVIDIA conferences and got many questions on GPU vs CPU: Is the GPU really worth it? Does it really provide the benefits advertised? To be honest, we were skeptical before getting a desktop GPU for training our Video Analytics models. However, once we started using it and saw our models zip through the training process in seconds compared to minutes (and sometimes hours), we were definitely convinced. You need a solid case where you will be continuously training models to justify a local GPU, but for general experimentation you can always start with a Cloud instance.

To quantify the benefits, I decided to do a quick experiment on Google’s new Colab platform: training an image recognition model first on a CPU and then running the exact same code (in TensorFlow) on a GPU, and comparing training times. The whole experiment was free because Google is kind enough to give K80 GPU access with their Colab notebooks. Also, my code is available at the link below so you can repeat the experiment independently. But before getting into that, a basics refresher.

Traditional Supervised Machine Learning (ML) is about using models like Linear Regression, SVM, Random Forests, etc. to process input data (packaged as features) and extract patterns. These patterns are used to create models that make predictions based on the relationships between features that they have “learned” iteratively.

Deep Learning (DL) involves many learning units arranged as a network and working together to create one large model. This model typically has many layers of learning, and each layer learns new patterns from the output of the previous layer. These DL networks are very powerful for capturing complex relationships in data, especially for analysing unstructured data like images, videos, audio, etc. DL involves many complex and parallel computations and hence tends to be limited when trained on CPUs. CPUs are great for serial computations with limited parallelisation, but when you have thousands of parallel computations it greatly helps to have a hardware accelerator like a GPU.
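To make the "parallel computations" point a little more concrete, here is a minimal pure-Python sketch (not from the original notebook). It multiplies two small matrices, the kind of operation that sits underneath most DL layers, and fills in the output cells in a shuffled order. The function names `dot` and `matmul_any_order` are made up for illustration. Because no output cell depends on any other, the order does not matter, which is what lets a GPU compute thousands of them at the same time.

```python
import random

def dot(row, col):
    # One output cell = one independent dot product
    return sum(a * b for a, b in zip(row, col))

def matmul_any_order(a, b, seed=0):
    """Multiply matrices a and b, visiting output cells in a random order."""
    rows, cols = len(a), len(b[0])
    b_cols = list(zip(*b))  # columns of b
    cells = [(i, j) for i in range(rows) for j in range(cols)]
    # Shuffle the work list: no cell needs the result of another cell
    random.Random(seed).shuffle(cells)
    out = [[0] * cols for _ in range(rows)]
    for i, j in cells:
        out[i][j] = dot(a[i], b_cols[j])
    return out

a = [[1, 2], [3, 4]]
b = [[5, 6], [7, 8]]
result = matmul_any_order(a, b)
print(result)  # [[19, 22], [43, 50]] regardless of the seed
```

A CPU walks through a work list like this mostly one cell at a time, while a GPU hands the cells out to many small cores at once.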

GPUs (Graphics Processing Units) started as graphics cards for rendering complex graphics and have since extended their reach to Deep Learning applications. The process of training DL models in particular is highly iterative and parallel, and here the GPU can provide a huge gain in performance and reduction in time. I decided to do a quick experiment on Google’s newly released Colab platform to compare CPU and GPU times for a simple DL model. We will use TensorFlow to create a Convolutional Neural Network model to classify input images from the openly available CIFAR dataset into 10 categories. Using Google Colab, we will do the training on a CPU runtime and then on a GPU runtime (Tesla K80).

To run our test, go to colab.research.google.com and create a Notebook, first with a CPU runtime. We will use a popular DL architecture called Convolutional Neural Network (CNN). Build the code cells following the steps below.

First, let's check whether we have a CPU or GPU and load the data. We will use the popular and open CIFAR-10 dataset. The CIFAR-10 data consists of 60,000 32×32 color images in 10 classes, with 6,000 images per class. There are 50,000 training images and 10,000 test images in the official data.

import tensorflow as tf
import numpy as np

# Check whether this runtime has a GPU attached
device_name = tf.test.gpu_device_name()
if device_name != '/device:GPU:0':
  print('GPU device not found')
else:
  print('Found GPU at: {}'.format(device_name))

# Load CIFAR-10 and scale pixel values from the 0-255 range to 0-1
cifar = tf.keras.datasets.cifar10

(x_train, y_train),(x_test, y_test) = cifar.load_data()
x_train, x_test = x_train / 255.0, x_test / 255.0
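The dataset figures quoted above can be checked with a few lines of arithmetic. The sketch below is not part of the original notebook. It also estimates how much memory the training images take, on the assumption that the loader returns one-byte (uint8) pixel values and that dividing by 255.0 makes NumPy promote them to 8-byte float64 values.

```python
# CIFAR-10 figures quoted in the text
classes = 10
images_per_class = 6_000
total_images = classes * images_per_class      # 60,000
train_images, test_images = 50_000, 10_000

# Each image is 32 x 32 pixels with 3 colour channels
values_per_image = 32 * 32 * 3                 # 3,072 values

# Training set as loaded: assumed uint8, 1 byte per value
train_bytes_uint8 = train_images * values_per_image
# After `x_train / 255.0`: assumed float64, 8 bytes per value
train_bytes_float64 = train_bytes_uint8 * 8

print(total_images == train_images + test_images)
print(f"uint8:   {train_bytes_uint8 / 2**20:.1f} MiB")
print(f"float64: {train_bytes_float64 / 2**30:.2f} GiB")
```

If memory gets tight, casting to float32 before scaling (for example `x_train.astype('float32') / 255.0`) should roughly halve the float64 figure.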

Now let's build a CNN model that reads the images as input arrays (or Tensors), runs them through many layers, and produces an output of 10 neurons for classifying each image into 10 categories.

model = tf.keras.models.Sequential([
    
  tf.keras.layers.Conv2D(32, (3, 3), padding='same', input_shape=x_train.shape[1:]),
  tf.keras.layers.Activation('relu'),
  tf.keras.layers.MaxPooling2D(pool_size=(2, 2)),
  tf.keras.layers.Dropout(0.25),

  tf.keras.layers.Conv2D(64, (3, 3), padding='same'),
  tf.keras.layers.Activation('relu'),
  tf.keras.layers.Conv2D(64, (3, 3)),
  tf.keras.layers.Activation('relu'),
  tf.keras.layers.MaxPooling2D(pool_size=(2, 2)),
  tf.keras.layers.Dropout(0.25),
    
  tf.keras.layers.Flatten(),
  tf.keras.layers.Dense(512, activation=tf.nn.relu),
  tf.keras.layers.Dropout(0.2),
  tf.keras.layers.Dense(10, activation=tf.nn.softmax)
])
model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])

model.summary()
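The parameter total that model.summary() prints for this network can be cross-checked by hand. The sketch below is not from the original notebook, and `conv2d_params` and `dense_params` are hypothetical helper names. It walks through the layers above, tracking the image size as it changes, and adds up weights plus biases for every Conv2D and Dense layer. Activation, pooling, dropout and flatten layers have no trainable parameters.

```python
def conv2d_params(kernel, in_channels, filters):
    # kernel x kernel weights per input channel per filter, plus one bias per filter
    return kernel * kernel * in_channels * filters + filters

def dense_params(inputs, units):
    # one weight per input per unit, plus one bias per unit
    return inputs * units + units

side, channels = 32, 3          # CIFAR-10 input: 32 x 32 x 3
total = 0

total += conv2d_params(3, channels, 32)   # padding='same' keeps 32 x 32
channels = 32
side //= 2                                # 2x2 max pooling -> 16 x 16

total += conv2d_params(3, channels, 64)   # padding='same' keeps 16 x 16
channels = 64
total += conv2d_params(3, channels, 64)   # default 'valid' padding -> 14 x 14
side -= 2
side //= 2                                # 2x2 max pooling -> 7 x 7

flat = side * side * channels             # Flatten -> 3,136 values
total += dense_params(flat, 512)
total += dense_params(512, 10)

print(flat)    # 3136
print(total)   # 1667594
```

Almost all of the parameters (1,606,144 of them) sit in the first Dense layer, so the flattened 7×7×64 feature map is what drives the model size here.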

So we have a CNN model with 1,667,594 weights that need to be adjusted during the training process. Now let's run the model training process. This involves passing all the images (typically in smaller batches) through the model (initialised with random weights). As each batch is processed, the error against the true values is calculated and, using Gradient Descent, the weights are updated to improve the model accuracy. We will run the training process for 10 epochs. An epoch is one full pass in which all the training data has been “seen” by the model, so it takes many batch runs to complete an epoch. Training is a very iterative process and this is where the GPU contribution will be maximum. We will time the training process on CPU and GPU.

import datetime

# start training
st_time = datetime.datetime.now()

model.fit(x_train, y_train, epochs=10)

# record time after training
end_time = datetime.datetime.now()

print('Training time = %s'%(end_time-st_time))
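As a side note on the batch and epoch arithmetic described above, here is a small sketch that is not from the original notebook. It assumes the Keras default batch size of 32, since the model.fit call does not pass batch_size, and counts how many weight updates the 10 epochs involve.

```python
import math

train_images = 50_000
batch_size = 32      # assumed: Keras default when batch_size is not given
epochs = 10

# Number of batches needed for the model to "see" every training image once
steps_per_epoch = math.ceil(train_images / batch_size)

# The last batch of each epoch is smaller than the rest
last_batch_size = train_images - (steps_per_epoch - 1) * batch_size

# One weight update per batch, repeated for every epoch
total_updates = steps_per_epoch * epochs

print(steps_per_epoch)   # 1563
print(last_batch_size)   # 16
print(total_updates)     # 15630
```

Each of those updates means a forward and a backward pass through all 1,667,594 weights, which is why the per-batch speed of the hardware matters so much.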

On a CPU runtime, training for 10 epochs takes about 36 minutes and 31 seconds. In the DL world this is a relatively simple model, and it still takes over half an hour to train. For more complex models and architectures the time can easily grow to many hours or even a few days. And if you run into memory issues or failures along the way, you have to start the training from scratch.

Epoch 1/10
50000/50000 [==============================] - 226s 5ms/step - loss: 1.4528 - acc: 0.4710
Epoch 2/10
50000/50000 [==============================] - 220s 4ms/step - loss: 1.0827 - acc: 0.6149
Epoch 3/10
50000/50000 [==============================] - 218s 4ms/step - loss: 0.9317 - acc: 0.6715
Epoch 4/10
50000/50000 [==============================] - 219s 4ms/step - loss: 0.8279 - acc: 0.7064
Epoch 5/10
50000/50000 [==============================] - 222s 4ms/step - loss: 0.7465 - acc: 0.7376
Epoch 6/10
50000/50000 [==============================] - 215s 4ms/step - loss: 0.6835 - acc: 0.7582
Epoch 7/10
50000/50000 [==============================] - 218s 4ms/step - loss: 0.6267 - acc: 0.7802
Epoch 8/10
50000/50000 [==============================] - 218s 4ms/step - loss: 0.5723 - acc: 0.7980
Epoch 9/10
50000/50000 [==============================] - 217s 4ms/step - loss: 0.5333 - acc: 0.8113
Epoch 10/10
50000/50000 [==============================] - 217s 4ms/step - loss: 0.4906 - acc: 0.8257

Training time = 0:36:31.572622
[0.7466319052696228, 0.7561]

Next we run the same training process in a new session on a GPU runtime. This training takes 4 minutes and 6 seconds, an improvement by a factor of roughly 9 (36:31 versus 4:06). So we can get the model trained in about one-ninth of the time on a K80 GPU compared with a CPU. Both runs reach a training accuracy of about 82% after 10 epochs (0.8257 on CPU and 0.8240 on GPU).

Epoch 1/10
50000/50000 [==============================] - 27s 536us/step - loss: 1.4746 - acc: 0.4647
Epoch 2/10
50000/50000 [==============================] - 24s 485us/step - loss: 1.0855 - acc: 0.6170
Epoch 3/10
50000/50000 [==============================] - 24s 485us/step - loss: 0.9314 - acc: 0.6710
Epoch 4/10
50000/50000 [==============================] - 24s 488us/step - loss: 0.8274 - acc: 0.7072
Epoch 5/10
50000/50000 [==============================] - 24s 485us/step - loss: 0.7542 - acc: 0.7351
Epoch 6/10
50000/50000 [==============================] - 24s 483us/step - loss: 0.6840 - acc: 0.7591
Epoch 7/10
50000/50000 [==============================] - 24s 483us/step - loss: 0.6297 - acc: 0.7792
Epoch 8/10
50000/50000 [==============================] - 25s 494us/step - loss: 0.5825 - acc: 0.7953
Epoch 9/10
50000/50000 [==============================] - 24s 487us/step - loss: 0.5377 - acc: 0.8109
Epoch 10/10
50000/50000 [==============================] - 24s 484us/step - loss: 0.4959 - acc: 0.8240

Training time = 0:04:06.330451
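For anyone who wants to check the speedup figure, the two "Training time" lines can be turned into a ratio with the standard library alone. This is a small sketch, not part of the original notebook, and `parse_elapsed` is a hypothetical helper name.

```python
from datetime import timedelta

def parse_elapsed(text):
    """Parse an 'H:MM:SS.ffffff' string as printed by the timing cell."""
    hours, minutes, seconds = text.split(":")
    return timedelta(hours=int(hours), minutes=int(minutes), seconds=float(seconds))

cpu_time = parse_elapsed("0:36:31.572622")   # from the CPU run
gpu_time = parse_elapsed("0:04:06.330451")   # from the GPU run

# Dividing one timedelta by another gives a plain float
speedup = cpu_time / gpu_time
print(f"{speedup:.1f}x")   # 8.9x
```

This is a single run on shared Colab hardware, so treat the ratio as a ballpark figure rather than a precise benchmark.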

So in conclusion, yes: a hardware accelerator like a GPU will definitely help your Deep Learning models train much faster, and you can experiment with different hyper-parameters faster to find the optimal cases. Training is the most time-consuming step and you will find the biggest bang for the buck here. For inference, you could use a CPU or, if there is too much data, a Cloud-based CPU or even newer technologies like TPU (from Google) and FPGA (backed by Microsoft). Having some hardware accelerator will greatly improve your DL model training and inference times.


The link for the above code worksheet is available at: https://drive.google.com/file/d/1NVFjR72yWyMHm3xK6FLA6l3ttoIS8UjI/view?usp=sharing
