We recently upgraded our server, adding 6 Tesla GPUs alongside our existing GeForce GTX 580. Once they were installed, however, we noticed a problem: the new cards weren't being fully recognized by CUDA.
    $ deviceQuery --noprompt | grep "^Device"
    Device 0: "GeForce GTX 580"
However, the cards were being detected, and the device nodes were being set up:
    $ nvidia-smi -a | grep "^GPU"
    GPU 0000:08:00.0
    GPU 0000:0A:00.0
    GPU 0000:0D:00.0
    GPU 0000:8B:00.0
    GPU 0000:8D:00.0
    GPU 0000:96:00.0
    GPU 0000:98:00.0

    $ sudo lspci | egrep "3D controller|VGA compatible"
    08:00.0 3D controller: nVidia Corporation GT200 [Tesla C1060] (rev a1)
    0a:00.0 3D controller: nVidia Corporation GT200 [Tesla C1060] (rev a1)
    0d:00.0 VGA compatible controller: nVidia Corporation GF110 [GeForce GTX 580] (rev a1)
    10:04.0 VGA compatible controller: Matrox Graphics, Inc. MGA G200eW WPCM450 (rev 0a)
    8b:00.0 3D controller: nVidia Corporation GT200 [Tesla C1060] (rev a1)
    8d:00.0 3D controller: nVidia Corporation GT200 [Tesla C1060] (rev a1)
    96:00.0 3D controller: nVidia Corporation GT200 [Tesla C1060] (rev a1)
    98:00.0 3D controller: nVidia Corporation GT200 [Tesla C1060] (rev a1)

    $ ls -lha /dev/nv*
    crw-rw-rw- 1 root root 195,   0 2012-05-17 16:29 /dev/nvidia0
    crw-rw-rw- 1 root root 195,   1 2012-05-17 16:29 /dev/nvidia1
    crw-rw-rw- 1 root root 195,   2 2012-05-17 16:29 /dev/nvidia2
    crw-rw-rw- 1 root root 195,   3 2012-05-17 16:29 /dev/nvidia3
    crw-rw-rw- 1 root root 195,   4 2012-05-17 16:29 /dev/nvidia4
    crw-rw-rw- 1 root root 195,   5 2012-05-17 16:29 /dev/nvidia5
    crw-rw-rw- 1 root root 195,   6 2012-05-17 16:29 /dev/nvidia6
    crw-rw-rw- 1 root root 195, 255 2012-05-17 16:29 /dev/nvidiactl
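Many users in various forums reported a similar problem, which they solved by setting 666 permissions on the /dev/nvidia* device nodes. Ours were already set that way, so that wasn't the cause. Luckily, we found the resolution at http://ambermd.org/gpus/: the CUDA_VISIBLE_DEVICES environment variable was set to 0, which restricted CUDA to the first device.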
    $ echo $CUDA_VISIBLE_DEVICES
    0

    $ export CUDA_VISIBLE_DEVICES="0,1,2,3,4,5,6"
    $ deviceQuery --noprompt | grep "^Device"
    Device 0: "GeForce GTX 580"
    Device 1: "Tesla T10 Processor"
    Device 2: "Tesla T10 Processor"
    Device 3: "Tesla T10 Processor"
    Device 4: "Tesla T10 Processor"
    Device 5: "Tesla T10 Processor"
    Device 6: "Tesla T10 Processor"
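Rather than hard-coding the device list, a script could build it from the number of /dev/nvidiaN nodes. The following is only a rough sketch: the helper name cuda_device_list is made up for illustration, and it assumes that every numbered device node corresponds to one CUDA device, which happened to be true on our machine but may not hold everywhere.

```shell
#!/bin/sh
# Hypothetical helper: print "0,1,...,N-1" for N devices.
cuda_device_list() {
    count=$1
    list=""
    i=0
    while [ "$i" -lt "$count" ]; do
        # Add a comma separator before every index except the first.
        list="${list:+$list,}$i"
        i=$((i + 1))
    done
    printf '%s\n' "$list"
}

# Count the numbered nodes (/dev/nvidia0, /dev/nvidia1, ...).
# /dev/nvidiactl does not match the pattern, so it is not counted.
# Assumption: one numbered node per CUDA device.
gpu_count=$(ls /dev/nvidia[0-9]* 2>/dev/null | wc -l | tr -d ' ')

if [ "$gpu_count" -gt 0 ]; then
    CUDA_VISIBLE_DEVICES=$(cuda_device_list "$gpu_count")
    export CUDA_VISIBLE_DEVICES
fi
```

Unsetting the variable entirely (unset CUDA_VISIBLE_DEVICES) should have the same effect, since CUDA exposes every device when the variable is absent.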
I never did find where this variable was being set, but now we know to include the export in our scripts.
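For anyone who wants to track down where such a variable is set, one possible starting point is to grep the usual shell startup files. This is a sketch only: the helper name find_var_setters is hypothetical, the file list is a guess at common locations and will differ between distributions, and the variable could just as well come from a job scheduler or a wrapper script that this search would miss.

```shell
#!/bin/sh
# Hypothetical helper: list which of the given files mention a variable name.
find_var_setters() {
    var=$1
    shift
    # -l prints only the names of matching files;
    # -s suppresses errors about files that do not exist.
    grep -ls -- "$var" "$@"
}

# Assumed locations; adjust for your distribution and shell.
find_var_setters CUDA_VISIBLE_DEVICES \
    /etc/environment /etc/profile /etc/profile.d/*.sh /etc/bash.bashrc \
    "$HOME/.profile" "$HOME/.bash_profile" "$HOME/.bashrc" ||
    echo "CUDA_VISIBLE_DEVICES not found in the usual startup files"
```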