2020 Update II – Building a High Performance GPU Computing Workstation Part VI – docker configuration

Instructions on how to configure NGC docker deep learning containers for use with non-cloud GPU systems

I’ll go through the steps to install docker and work with the NGC framework. I presume you already have an account at NVIDIA’s NGC and you are running Ubuntu 18.04 LTS, installed as described in my previous post.

A lot of this I am borrowing from Puget Systems. They sell deep learning / HPC systems and seem knowledgeable. You might want to consider them if you need a customizable, pre-tested & pre-built solution. (Full disclosure: I have no financial relationship with Puget Systems.)

First, uninstall docker if it is already installed, then add the NVIDIA drivers. Previously this was tricky: matching driver, CUDA, and cuDNN versions to Tensorflow was maddening. Docker solves this for us. If the driver is insufficient, the container bombs out and tells you what’s required, so just replace 440 in the last line with the version number it asks for.

sudo apt-get remove docker docker-engine docker.io containerd runc
sudo apt-get purge nvidia-docker2
sudo add-apt-repository ppa:graphics-drivers/ppa
sudo apt-get update
sudo apt-get install build-essential dkms
sudo apt-get install nvidia-driver-440  (current at time of writing)

To install Docker CE:

sudo apt-get install apt-transport-https
sudo apt-get install ca-certificates
sudo apt-get install curl
sudo apt-get install gnupg-agent
sudo apt-get install software-properties-common
curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo apt-key add -
sudo add-apt-repository "deb [arch=amd64] https://download.docker.com/linux/ubuntu $(lsb_release -cs) stable"
sudo apt-get update
sudo apt-get install docker-ce docker-ce-cli containerd.io

To check:

sudo docker run --rm hello-world

Install NVIDIA Container Toolkit:

distribution=$(. /etc/os-release;echo $ID$VERSION_ID)
curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | sudo apt-key add -
curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.list | sudo tee /etc/apt/sources.list.d/nvidia-docker.list
sudo apt-get update 
sudo apt-get install nvidia-container-toolkit 
sudo systemctl restart docker
sudo docker run --gpus all --rm nvidia/cuda nvidia-smi

and you should see your GPUs, driver version, and CUDA version at the top. If nvidia-smi runs, you know the container has GPU access.

Next is some extremely useful customization from the Puget Systems page – I’m not a linux expert so I’m just following directions. You need your user name and numeric user id. You can find your user id with id; mine was 1000.
Your-user-name is the account name you log into Ubuntu with. Mine is “Dad”, so sudo usermod -aG docker your-user-name becomes:
sudo usermod -aG docker Dad
Then assign the subordinate uid/gid ranges:
sudo usermod -v 1000-1000 Dad
sudo usermod -w 1000-1000 Dad
sudo -s
gedit /etc/subuid
Move the line that says Dad:1000:1 above any of the other lines that start with Dad. Save and exit.
gedit /etc/subgid and do the same thing with the Dad line there.
cd /etc/docker
gedit daemon.json (it should be empty – if not, back it up first) and insert:

{
  "userns-remap": "Dad",
  "default-shm-size": "1G",
  "default-ulimits": {
    "memlock": { "name": "memlock", "soft": -1, "hard": -1 },
    "stack": { "name": "stack", "soft": 67108864, "hard": 67108864 }
  }
}

Save and exit, and then:
chmod 600 daemon.json
sudo systemctl restart docker.service
Evidently docker’s default shared-memory size and ulimits are far too small for deep learning work. FYI this isn’t optimized in any way – it’s just what I’ve gotten from both Puget’s & NVIDIA’s websites.
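One common failure mode here is that curly “smart” quotes (like the ones a blog or word processor inserts) break the JSON and stop docker from restarting. As a quick sanity check, the daemon.json contents above parse cleanly with Python’s stdlib json module – this sketch just embeds the same text to show it’s valid:

```python
import json

# The daemon.json contents from above, written with straight ASCII quotes.
daemon_json = '''
{
  "userns-remap": "Dad",
  "default-shm-size": "1G",
  "default-ulimits": {
    "memlock": { "name": "memlock", "soft": -1, "hard": -1 },
    "stack": { "name": "stack", "soft": 67108864, "hard": 67108864 }
  }
}
'''

# json.loads raises a ValueError if the quoting or syntax is wrong
config = json.loads(daemon_json)
print("parsed OK:", sorted(config))  # parsed OK: ['default-shm-size', 'default-ulimits', 'userns-remap']
```

Pointing the same check at the real file (python3 -m json.tool /etc/docker/daemon.json) works too.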

Installing Tensorflow, Pytorch and RAPIDS containers

Presumably, you have made yourself a NVIDA NGC account. After logging in, navigate to ‘setup’ & then ‘generate API key’. Copy the API key and:

docker login nvcr.io
username: $oauthtoken
PW: API KEY (cut/paste)
sudo systemctl restart docker

Get the NGC containers for Pytorch and Tensorflow 2.X. I am including RAPIDS as well. The presumption is that we are all using python 3 here.

docker pull nvcr.io/nvidia/tensorflow:20.01-tf2-py3
docker pull nvcr.io/nvidia/pytorch:20.01-py3
docker pull nvcr.io/nvidia/rapidsai/rapidsai:cuda10.0-runtime-ubuntu18.04

Alright. We’re getting there. To assure ourselves that we downloaded everything, execute:
sudo docker images

Let’s run the Tensorflow container. This is tricky, largely because the documentation is all over the place. I have my CXR-14 dataset parsed in TF/Keras format in directory /data/SIIM/A1000, so I am going to run the TF2 20.01 container above as follows (all one command):

docker run --gpus all --rm -it -p "8888:8888" -v "/data/SIIM/A1000/:/data/" nvcr.io/nvidia/tensorflow:20.01-tf2-py3

(Note: my /data/SIIM/A1000 directory will not work on your install – substitute your own data directory. Here --gpus all exposes the GPUs, --rm removes the container on exit, -it makes it interactive, -p publishes jupyter’s port 8888, and -v mounts the host directory into the container at /data.)

If successful, the typical ubuntu command line prompt is replaced by a new prompt: /workspace#. Type ls to see the directory contents, which will differ from yours. Then type python3 to get into the python shell – your prompt will change again. Type import tensorflow as tf, and if it loads, your install is probably successful.

Open up a 2nd terminal window in Ubuntu and type sudo docker ps to see the alphanumeric string container_ID as well as a random name docker assigned – a silly combination of two unrelated words (like ‘zen_potatoes’ – yours will be different). You can use either one to control the docker container. To stop the container, you would type docker stop zen_potatoes. Type docker ps to confirm you are no longer running anything.

Try running the docker run command above again in the first window. Change to the 2nd window and type sudo docker ps again. Notice that the name has changed because you restarted the process. We started the docker in interactive mode (-it) so let’s leave it running but escape out. From the /workspace# prompt, type <ctrl-p> <ctrl-q> and your ubuntu prompt ($) reappears. You have escaped the container, but it is still running in the background – type docker ps to confirm. To re-enter the docker container, type docker attach <name> where name is the new name or container_ID you pulled from docker ps.

Here’s how to kill the container. We have already used docker stop <name>. You can stop all dockers by docker stop $(docker ps -a -q). To REMOVE the container, use docker rm <name>. Or if you want to get rid of everything, docker container prune.

So by now you have probably typed jupyter notebook and spent a few minutes in frustration trying to get in. You might have even realized the docker container doesn’t contain a browser, so you tried to install one there and it still wouldn’t work. That’s because the container doesn’t have any graphics interface (X11) – you’d have to install that too, and other stuff, and then… maybe someone smarter than me can make it work, but I can’t. Feel hopeless or helpless? docker container prune & read on.

To access jupyter notebook through your browser (firefox), create an ssh tunnel to it. This is why we included the -p "8888:8888" flag in the run command. Apparently, 18.04 LTS doesn’t include everything necessary to run ssh, so:

sudo apt update
sudo apt install openssh-server
sudo systemctl status ssh
press q to exit the status display
sudo ufw allow ssh (in case the firewall blocked it the first time)
hostname -I  

Pay attention to the IP addresses hostname spits back at you. These are your PC’s IP addresses. You’ll need to use them from the browser to access the container. Either go back into the container or run a new one. Inside the container at the /workspace # prompt, type:
jupyter notebook --no-browser --allow-root
Jupyter spits back http://hostname:8888/?token=673aeXXXXXXXXXXXXXXX

Now, launch a NEW browser instance and copy/paste that address from the jupyter notebook terminal output. Replace hostname in the URL with the IP address of your computer from hostname -I (172.17.0.1 is ‘supposed’ to work too, but is apparently bad networking form). Hit enter and enjoy happy jupyter deep learning joy.
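The substitution is purely mechanical – swap the literal word hostname for your machine’s IP. A tiny sketch for the avoidance of doubt (the token and IP below are made-up examples, not real values):

```python
# Rewrite jupyter's announced URL to use the host's real IP address.
# The token and IP here are made-up placeholders -- use your own values.
jupyter_url = "http://hostname:8888/?token=673ae0000000000000000"
host_ip = "192.168.1.50"  # one of the addresses from `hostname -I`

# Replace only the first occurrence, leaving the token untouched
browser_url = jupyter_url.replace("hostname", host_ip, 1)
print(browser_url)  # http://192.168.1.50:8888/?token=673ae0000000000000000
```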

As an experiment, I decided to instantiate a new docker with the fast.ai extensions from my friend Jeremy Howard. I think this can be made even more sophisticated, but it’s the first time I’ve done this, so be kind!

Creating a pytorch+fast.ai docker container using NGC Nvidia Cloud docker

So when I tried to install packages in docker for the first time, I hit a brick wall – I couldn’t apt-get or pip install things as expected. The simple and stupid solution is to first:
apt update
and you’ll be able to start installing packages. If you think about it, the container doesn’t know where to look for the packages until its apt lists are refreshed – but I certainly wouldn’t have figured that out unless some kind soul on an instructional video somewhere had showed me first (sorry! lost the URL!). The Pytorch container already has conda, which makes things a bit simpler. We’re going to install the missing items with conda and then the fast.ai material. Here’s the full sequence.

docker run --gpus all --rm -it -p "8888:8888" -v "/data/SIIM/A1000/:/data/" nvcr.io/nvidia/pytorch:20.01-py3

[switch to another terminal window and type docker ps to get the container ID discussed above & switch back here]

apt update
conda install -c conda-forge ipywidgets=7.4
conda install -c pytorch -c fastai fastai
sudo docker commit <container_ID> pytorch-fastai
sudo docker images

And you’ll see the new image, which you can run directly with fast.ai. I am sure there is some creative programming that could tweak the docker, run jupyter notebook automatically, and have the browser open up, but I haven’t figured it out yet. For now I can live with this. If you decide to do something similar with TF2.1, I would probably add pillow and matplotlib. Enjoy!
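The manual commit above could also be captured in a Dockerfile, so the image is reproducible instead of hand-built. This is just a sketch under my assumptions – the base tag matches the container pulled above, and I haven’t tested it against other NGC releases:

```dockerfile
# Sketch: build a pytorch + fast.ai image instead of committing by hand.
# Base tag matches the NGC container used above; adjust for your release.
FROM nvcr.io/nvidia/pytorch:20.01-py3

# Refresh apt metadata first (the container ships with empty package lists)
RUN apt update

# Same installs as the interactive session above
RUN conda install -y -c conda-forge ipywidgets=7.4 && \
    conda install -y -c pytorch -c fastai fastai
```

Build it with docker build -t pytorch-fastai . and the image shows up in docker images just like the committed version.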