Update – Part V of Building a High Performance GPU Computing Workstation – lessons learned and partitioning

So, having lived with this particular system for about two years since Summer of ’17, we’ve learned a bunch about its strengths, weaknesses, and reliability.

A second 1080 Ti and an 8TB HDD in a RAID 1 configuration were added (on the way to RAID 5). Even with a plethora of data, backups, and on-disk image augmentation experiments, it’s been hard to crack over a TB of use on the data hard disk.

The main system bottleneck is not GPU throughput, but memory. The two 1080 Ti’s have only 22GB of on-board GDDR5X memory between them (11GB each). That has practical impacts on image-based deep learning, particularly on network depth and batch size. Batch size is important because many architectures (particularly residual networks like ResNet50) use BatchNorm, and batch sizes <16 perform worse.(ref) The largest determinant of batch size is image size, with larger images forcing smaller batches. The only relevant upgrade from the 11GB 1080 Ti’s is a pair of 24GB NVIDIA TITAN RTXs with an NVLink connector, for a cool $5K, or two 48GB Quadro RTX 8000s for about $11K. At those prices, data-center charges start seeming reasonable.
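
As a quick sanity check on how much headroom each card actually has during training, nvidia-smi can report per-GPU memory directly (this only queries the driver; nothing here is specific to this build):

nvidia-smi --query-gpu=name,memory.total,memory.used --format=csv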

It was anticipated that deep learning environments would run both in Windows 10 via Anaconda and in Ubuntu 16.04 LTS. That rarely happened; most experimentation was done in Ubuntu, and Anaconda was frustrating.

Our Ubuntu installation worked well enough, but as we installed packages at the root level over time, broken packages and dependencies accumulated, much like the old Windows 95 Registry clutter problem, causing persistent “system errors” that were impossible to track down.

The system update and upgrade comprised the following:

1. Repartitioning Windows 10 to free up SSD space.

2. Upgrading Ubuntu 16.04 LTS to 18.04 LTS and migrating the $HOME folder from the HDD to the freed-up SSD.

3. Adopting a new workflow with Dockerized containers (sketched below). This may address the reproducibility issues I’ve written about before and decrease the degree of variance in SGD retraining, which I think is a bigger issue than people care to admit. It may also minimize the risk of corrupting the base Ubuntu OS with package add-ons and removals, which can now be done inside a container.
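
As a rough sketch of what that workflow looks like (assuming Docker 19.03+ and NVIDIA’s container toolkit are installed; the image tags here are just examples, not necessarily what this system runs):

docker run --gpus all --rm nvidia/cuda:10.0-base nvidia-smi    # verify the container can see both 1080 Ti's

docker run --gpus all -it --rm -v $HOME/projects:/workspace tensorflow/tensorflow:latest-gpu bash    # experiment inside a disposable environment

Anything installed or removed inside those containers disappears with them, which is the point: the base OS stays clean.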

Re-partitioning

First, we booted Windows and opened Disk Management to resize the main Windows partition from this:

Base configuration (pre-partitioning)

Post-partitioning configuration

Note that we reclaimed about 100 GB from the Windows NTFS C: partition. We then ran shutdown /s /f /t 0 from an administrator command prompt (a full shutdown, unlike Windows 10’s hybrid Fast Startup shutdown, which can leave the NTFS volume in a hibernated state that Linux won’t mount read-write) and rebooted, selecting Ubuntu from GRUB. Launching GParted, we see this:

GParted, immediately after the Windows repartitioning

So, we click on the unallocated space of 344.19 GiB, tell GParted to create a new partition of XXXXX there, format it as ext4, and label it Home Partition. Apply the operations, and we see this:

New ext4 partition of 300GB – ready for $HOME
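
For reference, the formatting step can also be done from the command line. This is only a sketch, assuming the new partition comes up as /dev/nvme0n1p7 (as it did here); check the device name with lsblk first, because running mkfs on the wrong partition is unrecoverable:

lsblk -f

sudo mkfs.ext4 -L "Home Partition" /dev/nvme0n1p7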

I then followed the steps given for moving /home to a new partition, backing up the old home to a separate directory on the HDD. Honestly, in retrospect, I don’t advise it. Things were very, very wonky afterwards, and I wasn’t sure whether I was using the new partition, the old partition, or both.
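
For anyone who does want to try it, the usual recipe is roughly the following (a sketch only, assuming the new partition is /dev/nvme0n1p7 and gets temporarily mounted at /mnt/newhome); the /etc/fstab edit is the step where it is easiest to end up half on the old home and half on the new one:

sudo mkdir -p /mnt/newhome

sudo mount /dev/nvme0n1p7 /mnt/newhome

sudo rsync -aXS /home/. /mnt/newhome/.    # copy everything, preserving permissions and extended attributes

sudo blkid /dev/nvme0n1p7    # note the partition's UUID

Then add a line like "UUID=<that uuid>  /home  ext4  defaults  0  2" to /etc/fstab and reboot, keeping the old /home around until you are sure the new one is actually the one being mounted.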

I ended up re-installing Ubuntu 18 on the SSD, specifying nvme0n1p7 as the new /home.

While I had read that the re-install would destructively erase everything in $HOME, it didn’t completely, so I had to go through and hand-delete the things I thought shouldn’t be there. It also broke GRUB, so I had to repair GRUB with boot-repair:

sudo add-apt-repository ppa:yannubuntu/boot-repair

sudo apt-get update

sudo apt-get install -y boot-repair
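
Once installed, the tool itself is launched by name; the “Recommended repair” option is usually all that is needed to restore the dual-boot menu:

boot-repair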

Then, I had to re-customize my GRUB with Grub Customizer.
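
Grub Customizer isn’t in the stock 18.04 repositories either; it normally comes from its own PPA (the PPA name below is the commonly used one, so treat it as an assumption and verify it before adding):

sudo add-apt-repository ppa:danielrichter2007/grub-customizer

sudo apt-get update

sudo apt-get install -y grub-customizer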

And start the firewall:

sudo ufw allow ssh

sudo ufw enable
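
A quick check that the rules actually took:

sudo ufw status verbose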