Multiple GPUs Not Showing Up in CUDA

We recently upgraded our server, adding six Tesla GPUs alongside our existing GeForce GTX 580. Once they were installed, however, CUDA did not recognize the new cards: deviceQuery listed only the GTX 580.

$ deviceQuery --noprompt | grep "^Device"
Device 0: "GeForce GTX 580"

However, the driver had detected the cards, and the device nodes had been set up:

$ nvidia-smi -a | grep "^GPU"
GPU 0000:08:00.0
GPU 0000:0A:00.0
GPU 0000:0D:00.0
GPU 0000:8B:00.0
GPU 0000:8D:00.0
GPU 0000:96:00.0
GPU 0000:98:00.0

$ sudo lspci | egrep "3D controller|VGA compatible"
08:00.0 3D controller: nVidia Corporation GT200 [Tesla C1060] (rev a1)
0a:00.0 3D controller: nVidia Corporation GT200 [Tesla C1060] (rev a1)
0d:00.0 VGA compatible controller: nVidia Corporation GF110 [GeForce GTX 580] (rev a1)
10:04.0 VGA compatible controller: Matrox Graphics, Inc. MGA G200eW WPCM450 (rev 0a)
8b:00.0 3D controller: nVidia Corporation GT200 [Tesla C1060] (rev a1)
8d:00.0 3D controller: nVidia Corporation GT200 [Tesla C1060] (rev a1)
96:00.0 3D controller: nVidia Corporation GT200 [Tesla C1060] (rev a1)
98:00.0 3D controller: nVidia Corporation GT200 [Tesla C1060] (rev a1)

$ ls -lha /dev/nv*
crw-rw-rw- 1 root root 195, 0 2012-05-17 16:29 /dev/nvidia0
crw-rw-rw- 1 root root 195, 1 2012-05-17 16:29 /dev/nvidia1
crw-rw-rw- 1 root root 195, 2 2012-05-17 16:29 /dev/nvidia2
crw-rw-rw- 1 root root 195, 3 2012-05-17 16:29 /dev/nvidia3
crw-rw-rw- 1 root root 195, 4 2012-05-17 16:29 /dev/nvidia4
crw-rw-rw- 1 root root 195, 5 2012-05-17 16:29 /dev/nvidia5
crw-rw-rw- 1 root root 195, 6 2012-05-17 16:29 /dev/nvidia6
crw-rw-rw- 1 root root 195, 255 2012-05-17 16:29 /dev/nvidiactl

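Many forum users reported a similar problem and solved it by setting the permissions on the device nodes to 666, but ours were already correct. Luckily, the page at http://ambermd.org/gpus/ pointed us to the resolution: the CUDA_VISIBLE_DEVICES environment variable was set to 0, which hides every other device from CUDA applications.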

$ echo $CUDA_VISIBLE_DEVICES
0

$ export CUDA_VISIBLE_DEVICES="0,1,2,3,4,5,6"
$ deviceQuery --noprompt | grep "^Device"
Device 0: "GeForce GTX 580"
Device 1: "Tesla T10 Processor"
Device 2: "Tesla T10 Processor"
Device 3: "Tesla T10 Processor"
Device 4: "Tesla T10 Processor"
Device 5: "Tesla T10 Processor"
Device 6: "Tesla T10 Processor"
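
For scripts, the export above can be derived from the device nodes rather than hard-coded. The following is a minimal sketch, assuming that each /dev/nvidiaN node corresponds to one CUDA device (true on the system shown here, with seven nodes and seven devices). build_device_list and count_device_nodes are hypothetical helper names, not part of CUDA or the NVIDIA driver.

```shell
# Hypothetical helper: print "0,1,...,N-1" for N devices.
# Intended to be called inside $(...), so its variables stay in a subshell.
build_device_list() {
    count=$1
    list=""
    i=0
    while [ "$i" -lt "$count" ]; do
        list="${list:+$list,}$i"   # prepend a comma only when list is non-empty
        i=$((i + 1))
    done
    printf '%s\n' "$list"
}

# Hypothetical helper: count nodes named nvidia<digit>... in a directory
# (default /dev). /dev/nvidiactl is skipped because the glob requires a digit.
count_device_nodes() {
    dir=${1:-/dev}
    n=0
    for node in "$dir"/nvidia[0-9]*; do
        if [ -e "$node" ]; then
            n=$((n + 1))
        fi
    done
    printf '%s\n' "$n"
}

# Assumption: one /dev/nvidiaN node per CUDA device.
node_count=$(count_device_nodes)
if [ "$node_count" -gt 0 ]; then
    CUDA_VISIBLE_DEVICES=$(build_device_list "$node_count")
    export CUDA_VISIBLE_DEVICES
fi
```

If no restriction is wanted at all, running unset CUDA_VISIBLE_DEVICES in the script should have the same effect, since CUDA exposes every device when the variable is not set.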

I never did find where this variable was being set, but we now know to export it in our scripts.
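
One way to hunt for the assignment is to search the usual shell startup files for the variable name. This is only a sketch: find_assignments is a hypothetical helper, and the file list is an assumption about where a login shell might pick up the variable. A job scheduler, an environment-modules setup, or a wrapper script could also set it, and this search would miss those.

```shell
# Hypothetical helper: print file:line:text for every line that mentions
# the given variable name. Missing or unreadable files are skipped (-s);
# /dev/null is appended so grep always prefixes matches with the file name.
find_assignments() {
    var_name=$1
    shift
    grep -ns -- "$var_name" "$@" /dev/null || true
}

# Assumed candidate files; adjust for the shell and distribution in use.
matches=$(find_assignments CUDA_VISIBLE_DEVICES \
    /etc/environment /etc/profile /etc/profile.d/*.sh /etc/bash.bashrc \
    "$HOME/.profile" "$HOME/.bash_profile" "$HOME/.bashrc")

if [ -n "$matches" ]; then
    printf '%s\n' "$matches"
else
    echo "CUDA_VISIBLE_DEVICES not mentioned in the files checked"
fi
```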
