before cabling & final HD install

Building a high-performance GPU computing workstation for deep learning – Part I

With Tensorflow released to the public, the NVidia Pascal Titan X GPU, along with (relatively) cheap storage and memory, the time was right to take the leap from CPU-based computing to GPU accelerated machine learning.

My venerable Xeon W3550 8GB T3500 running a 2GB Quadro 600 was outdated. Since a DGX-1 was out of the question ($129,000), I decided to follow other pioneers building their own deep learning workstations. I could have ended up with a multi-thousand dollar doorstop – fortunately, I did not.

Criteria:

1.  Reasonably fast CPU
2.  Current ‘Best’ NVidia GPU with large DDR5 memory
3.  Multi-GPU potential
4.  32GB or more stable RAM
5.  SSD for OS
6.  Minimize internal bottlenecks
7.  Stable & Reliable – minimize hardware bugs
8.  Dual Boot Windows 10 Pro & Ubuntu 16.04LTS
9.  Can run: R, Rstudio, Pycharm, Python 3.5, Tensorflow

 

Total:                                                   $3725

 

Asus X99 E 10G WS Motherboard. Retail $699

A Motherboard sets the capabilities and configuration of your system. While newer Intel Skylake and Kaby Lake CPU architectures & chipsets beckon, reliability is important in a computationally intensive build, and their documented complex computation freeze bug makes me uneasy. Also, both architectures remain PCIe 3.0 at this time.

Therefore, I chose the ASUS X99 motherboard. The board implements 40 PCIe 3.0 lanes which will support three 16X PCIe 3.0 cards (i.e. GPU’s) and one 8x card. The PCIe 3.0-CPU lanes are the largest bottleneck in the system, so making these 16X helps the most.  It also has a 10G Ethernet jack somewhat future-proofing it as I anticipate using large datasets in the Terabyte size. It supports up to 128GB of DDR4. The previous versions of ASUS X99 WS have been well reviewed.

Intel Core i7 6850K Broadwell-E CPU Socket Retail $649

Socket LGA2011-v3 on the motherboard guides the CPU choice – the sweet spot in the Broadwell-E lineup is the overclockable 3.6Ghz 6850K with 6 cores and 15MB of L3 cache, permitting 40 PCIe lanes. $359 discounted is attractive compared to the 6900K, reviewed to offer minimal to no improvement at a $600 price premium. The 6950X is $1200 more for 4 extra cores, unnecessary for our purposes. Avoid the $650 6800K – pricier and slower with less (28) lanes. A stable overclock to 4.0Ghz is easily achievable on the 6850K.

NVidia GeForce 1080Ti 11GB – EVGA FTW3 edition Retail: $800

Last year, choosing a GPU was easy – the Titan X Pascal, a 12GB 3584 CUDA-core monster. However, by spring 2017 there were two choices: The Titan Xp, with slightly faster memory speed & internal bus, and 256 more CUDA cores; and the 1080Ti, the prosumer enthusiast version of the Titan X Pascal, with 3584 cores. The 1080Ti differs in its memory architecture – 11GB DDR5 and a slightly slower, slightly narrower bandwidth vs. the Xp.
The 1080Ti currently wins on price/performance. You can buy two 1080Ti’s for the price of one Titan Xp. Also, at time of purchase, Volta architecture was announced. As the PCIe bus is the bottleneck, and will remain so for a few years, batch size into DDR5 memory & CUDA cores will be where performance is gained. A 16GB DDR5 Volta processor would be a significant performance gain from a 12GB Pascal for deep learning. Conversely, 12GB Pascal to 11GB Pascal is a relative lesser performance hit. As I am later in the upgrade cycle, I’ll upgrade to the 16GB Volta and resell my 1080Ti in the future – I anticipate only taking a loss of $250 per 1080Ti on resell.
The FTW3 edition was chosen because it is a true 2-slot card (not 2.5) with better cooling than the Founder’s Edition 1080Ti. This will allow 3 to physically fit onto this motherboard.

64 GB DDR4-2666 DRAM – Corsair Vengeance low profile Retail : $600

DDR4 runs at 2133mhz unless overclocked. Attention must be paid to the size of the DRAM units to ensure they fit under the CPU cooler, which these do. From my research, DRAM speeds over 3000 lose stability. For Broadwell there’s not much evidence that speeds above 2666mhz improves performance. I chose 64GB because 1) I use R which is memory resident so the more GB the better and 2) There is a controversial rule of thumb that your RAM should equal 2x the size of your GPU memory to prevent bottlenecks. Implementing 3 1080Ti’s, 3x 11GB = 33 GB. Implementing 2 16GB Voltas would be 32GB.

Samsung 1TB 960 EVO M2 NVMe SSD Retail $500

The ASUS motherboard has a fast M2 interface, which, while using PCIe lanes, does not compete for slots or lanes. The 1TB size is sufficient for probably anything I will throw at it (all apps/programs, OS’s, and frequently used data and packages. Everything else can go on other storage. I was unnecessarily concerned about SSD heat throttling – on this motherboard, the slot’s location is in a good place which allows for great airflow over it. The speed in booting up Windows 10 or Ubuntu 16.04 LTS is noticeable.

EVGA Titanium 1200 power supply Retail $350

One of the more boring parts of the computer, but for a multi GPU build you need a strong 1200 or 1600W power supply. The high Titanium rating will both save on electricity and promote stability over long compute sessions.

Barracuda 8TB Hard Drive Retail $299

I like to control my data, so I’m still not wild about the cloud, although it is a necessity for very large data sets. So here is a large, cheap drive for on-site data storage. For an extra $260, I can Raid 1 the drive and sleep well at night.

Strike FUMA CPU Cooler. Retail $60

This was actually one of the hardest decisions in building the system – would the memory will fit under the fans? The answer is a firm yes. This dual fan tower cooler was well-rated, quiet, attractive, fit properly, half the price of other options, and my overclocked CPU runs extremely cool – 35C with full fan RPM’s, average operating temperature 42C and even under a high stress test, I have difficulty getting the temperature over 58C. Notably, the fans never even get to full speed on system control.

Corsair 750 D Airflow Edition Case. Retail $250

After hearing the horror stories of water leaks, I decided at this level of build not to go with water cooling. The 750D has plenty of space (enough for a server) for air circulation, and comes installed with 3 fans – two air intake on the front and one exhaust at upper rear. It is a really nice, sturdy, large case. My front panel was defective – the grating kept falling off – so Corsair shipped me a replacement quickly and without fuss.

Cougar Vortex 14” fans – Retail $20 ea.

Two extra cougar Vortex 14” fans were purchased, one as an intake fan at the bottom of the case, and one as a 2nd exhaust fan at the top of the case. These together create excellent airflow at noise levels I can barely hear. Two fans on the CPU Heat Sink plus Three Fans on the GPU plus five fans on the case plus one in the power supply = 11 fans total! More airflow at lower RPM = silence.

Windows 10 Pro USB edition Retail $199

This is a dual boot system so, there you go

Specific limitations with this system are as follows. While it will accept four GPU’s physically, the slots are limited to 16X/16X/16X/8X with the M2 drive installed which may affect performance on the 4th GPU (& therefore deep learning model training and performance). Additionally, the CPU upgrade path is limited – without going to a Xeon, the only reasonable upgrade from the 6850K’s 14,378 passmark is the 6950X, with a passmark of 20,021. In the future if more than 128GB DDR4 is required, that will be a problem with this build.

Finally, inherent bandwidth limitations exist in the PCIe 3.0 protocol and aren’t easily circumvented. PCIe 3.0 throughput is 8GB/s. Compare this to NVidia’s proprietary NVlink that allows throughput of 20-25GB/s (Pascal vs. Volta). Note that current NVlink speeds will not be surpassed until PCIe5.0 is implemented at 32GB/s in 2019. NVidia’s CUDA doesn’t implement SLI, either, so at present that is not a solution. PCIe 4.0 has just been released with only IBM adopting, doubling transfer vs. 3.0, and 5.0 has been proposed, doubling yet again. However, these faster protocols may be difficult and/or expensive to implement. A 4 slot PCIe 5.0 bus will probably not be seen until into the 2020’s. This means that for now, dedicated NVlink 2.0 systems will outperform similar PCIe systems.

With that said, this system approaches a best possible build considering price and reliability, and should be able to give a few years of good service, especially if the GPU’s are upgraded periodically. Precursor systems based upon the Z97 chipset are still viable for deep learning, albeit with slower speeds, and have been matched to older NVidia 8GB 1070 GPU’s which are again half the price of the 1080Ti.

In part II, I will describe how I set up the system configuration for dual boot and configured deep learning with Ubuntu 16.04LTS. Surprisingly, this was far more difficult than the actual build itself, for multiple reasons I will explain & detail with the solutions.  And yes, it booted up.  On the first try.

see, it works

This is cross posted to www.n2value.com/blog .  For discussions on healthcare, data analytics, systems and processes, and administrative leadership, please visit www.n2value.com!