
The Reproducibility Crisis In Deep Learning – Six practical observations

 

This post discusses some practical points for achieving a measure of reproducibility in your deep learning experiments. It is written for those developing in Python/NumPy with TensorFlow/Keras under Linux (Ubuntu).

1. Set your seeds.

2. On multi-GPU machines, set your batch size carefully.

3. Maintain a consistent environment.

4. Be cautious with GPU memory allocation.

5. Comment and note your experiments.

6. Remember your code may not be equivalent to someone else's.

Reproducibility in deep learning is made challenging by the large number of moving parts involved and by the memory allocation between CPU, GPU, DRAM, and storage. Memory allocation is dynamic, meaning it happens "as needed", and that can cause problems.

So,

1. Set your seeds.

I set four: one in the OS environment, one in NumPy, one in Python's random module, and one in TensorFlow:

import numpy as np
import tensorflow as tf
import random as rn
import os

os.environ['PYTHONHASHSEED'] = '0'   # Python hash seed, set via the environment
np.random.seed(1234)                 # NumPy RNG
rn.seed(1234)                        # Python's built-in random module
tf.set_random_seed(1234)             # TensorFlow graph-level seed (tf.random.set_seed in TF 2.x)

Consistently setting the same seeds at the start of your experiments will aid reproducibility. Choose your own seed values if you prefer! I suspect this kind of seed setting matters most with smaller, curated datasets where training order affects accuracy.

2.  Pay attention to minibatch size on multi-GPU machines

If you investigate how multi-GPU training works in TensorFlow/Keras, you will find that it uses single-machine, multi-GPU data parallelism. This means (from the Keras docs):

  • Divide the model’s input(s) into multiple sub-batches.
  • Apply a model copy on each sub-batch. Every model copy is executed on a dedicated GPU.
  • Concatenate the results (on CPU) into one big batch.

So there is one corollary that you need to think about. If you have a batch size of 32 and two GPUs, each GPU receives a sub-batch of 16 and all is fine. But use a batch size of 32 and three GPUs, and the sub-batch is 10.66! One of the GPUs will get one sample fewer per sub-batch, and in my experience this can cause variability. Of course, this is not an issue if you only have one GPU. The point is that your batch size should always be divisible by the number of GPUs you are running; a quick sanity check is sketched below.
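
As a minimal sketch of that check, assuming the older keras.utils.multi_gpu_model API that the quoted docs describe (the tiny model and the values here are just placeholders, not part of the original post):

from keras.models import Sequential
from keras.layers import Dense
from keras.utils import multi_gpu_model

n_gpus = 2
batch_size = 32

# keep the batch size evenly divisible across the GPUs
assert batch_size % n_gpus == 0, "pick a batch size divisible by the GPU count"

base_model = Sequential([Dense(10, activation='softmax', input_shape=(100,))])
parallel_model = multi_gpu_model(base_model, gpus=n_gpus)   # each GPU gets batch_size // n_gpus samples
parallel_model.compile(optimizer='sgd', loss='categorical_crossentropy')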

 

3.  One of the biggest problems is maintaining a constant environment.

This typically means the versions of the programs and packages you use: Python, NumPy, pandas, TensorFlow, Keras. But it can also extend to CUDA, cuDNN, and even the Linux kernel version.

It's best to note these in a comment block at the beginning of the experiment, so that you can reproduce it in the future. Run

$ pip freeze

to list your packages and versions, and record the rest in a comment block along the lines of:

# Linux kernel:
# Python version:
# TensorFlow version:
# etc.

Going down to this level of detail pays off for really vexing reproducibility problems.
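
If you prefer to capture those versions programmatically, a small helper along these lines can sit at the top of a notebook (purely illustrative; extend the list to whatever packages you actually import):

import sys
import platform
import numpy as np
import tensorflow as tf
import keras

print("Linux kernel:      ", platform.release())
print("Python version:    ", sys.version.split()[0])
print("NumPy version:     ", np.__version__)
print("TensorFlow version:", tf.__version__)
print("Keras version:     ", keras.__version__)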

4.  Be aware that open windows and programs will also use GPU memory!

GPU memory allocation is dynamic. Run nvidia-smi to check it. Here's my screen with Firefox running a Jupyter notebook, several tabs, and two terminals.

[Screenshot: nvidia-smi output]

Here's a standard run, with memory allocated pretty much only to the display and the GPU processes. Then I ran the same experiment, but added a few GPU-utilizing processes. (I don't play the games; they just have a moving display on the title screen.)

[Screenshot: memory effect of multiple GPU processes, shown in nvidia-smi]

 

Look at the nvidia-smi readout in the upper right of the screen: those extra processes took about 1 GB of one of the 11 GB cards. If the extra GPU memory allocation is enough to force you to change your minibatch size, training may differ from other runs. It's also slower! So don't run unnecessary processes that use GPU memory while training, and more importantly, don't run them inconsistently. Surf the web on your phone while training.

This point also applies when moving between a multi-GPU environment and a single-GPU environment, or when juggling threads between GPU and CPU.
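
Related to this, TensorFlow's default behavior is to reserve most of the GPU memory up front. If you would rather have it allocate memory only as needed, so that nvidia-smi reflects what the experiment actually uses, something along these lines works in the TF 1.x / standalone Keras setup this post assumes (in TF 2.x the equivalent is tf.config.experimental.set_memory_growth):

import tensorflow as tf
from keras import backend as K

# allocate GPU memory incrementally instead of reserving nearly all of it at startup
config = tf.ConfigProto()
config.gpu_options.allow_growth = True
K.set_session(tf.Session(config=config))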

5.  Write it down in your comments!

Do at least some manual form of versioning for your training runs if you are not using more sophisticated version and data control. When you are doing manual hyperparameter searches, it is easy to forget which learning rate was used or which optimizer was selected. Try to keep the relevant details written down, either in a comment section or in a Markdown cell in Jupyter. With so many moving parts, it is really easy to make a mistake and forget that something was changed that is not explicitly noted. You can also paste training graphs into Jupyter and save them for future comparison. For example:

DB500 test accuracy (SGD, LR = 0.0001, momentum = 0.9)
30 epochs: Run 1: 78.63%   Run 2: 78.63%   Run 3: 78.63%
20 epochs: Run 1: 79.36%

Additionally, a non-reproducible result may be a not-so-subtle indication that you have screwed up somewhere in your code: not enough to crash, but a misassigned variable, a circular reference, something along those lines. For example, I reuse some of my codebase with different datasets. When copying and pasting between the dataset variants, some of my observations matched my initial ones, but others didn't. It turned out that I had forgotten to adjust the number of batches per epoch for the different dataset sizes. As a result, my epochs were incomplete (about half the size they should have been) and I was running far fewer training samples through my algorithm than I thought. Simple fix (sketched below).
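
The fix itself is just arithmetic. A minimal sketch, assuming a fit_generator-style training loop (num_train_samples, batch_size, and the commented-out call are placeholders for your own setup, not the original code):

import math

num_train_samples = 50000   # placeholder: set this from the dataset actually loaded
batch_size = 32

# derive the number of batches per epoch from the data instead of hard-coding it
steps_per_epoch = int(math.ceil(num_train_samples / float(batch_size)))

# model.fit_generator(train_generator, steps_per_epoch=steps_per_epoch, epochs=30)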

 

6. Everyone codes differently

When coding from scratch, be cognizant that your implementation of a particular neural network model may not be the same as someone else's. In other words, your version of Faster R-CNN may not exactly replicate the author's code, and subtle variations in codebase and style may introduce variability into training. It is unreasonable to expect everyone to code the same way across the board, so unless you are working from a shared GitHub repository, some variation in reproduced code is to be expected.

One of the older resources I read suggested that you might have to run your model 30-50 times to make sure it was producing the same results. I typically run three, but if I were using a production model for something mission-critical, I would test more.

I generally prefer running GPU tests on dedicated hardware because I feel it's more controllable, but this might be one area where cloud beats physical, since I assume it's easier to pin versions and control the running processes in a cloud instance that isn't also handling more mundane tasks. Ultimately, these details could be important for successful deployment.

Thanks for reading!
